Zero-to-jupyterhub-k8s: Kops AWS cluster: Hub pod status remains "ContainerCreating" with "FailedAttachVolume" warning

Created on 23 Aug 2018 · 14 comments · Source: jupyterhub/zero-to-jupyterhub-k8s

I've created a new cluster (which validates) with kops on AWS, but when I install the z2jh helm chart:

helm upgrade --install jhub jupyterhub/jupyterhub --namespace jhub --version 0.7.0-beta.2 --values config.yaml
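
Here config.yaml contains only the secret token, along these lines (the value below is a placeholder, not a real token; the z2jh guide generates one with openssl rand -hex 32):

proxy:
  secretToken: "<64-hex-chars-from-openssl-rand>"  # placeholder only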

With only that secretToken set in config.yaml, the hub pod never finishes creating:

kubectl -n jhub get pods
NAME                     READY     STATUS              RESTARTS   AGE
hub-5ff7fcb7bf-sgz27     0/1       ContainerCreating   0          8m
proxy-7b4fd468c9-27h92   1/1       Running             0          8m

When I describe the pod, I see some FailedAttachVolume warnings:

kubectl -n jhub describe pod hub-5ff7fcb7bf-sgz27
Name:           hub-5ff7fcb7bf-sgz27
Namespace:      jhub
Node:           ip-172-20-152-79.ec2.internal/172.20.152.79
Start Time:     Thu, 23 Aug 2018 16:26:35 +0000
Labels:         app=jupyterhub
                component=hub
                hub.jupyter.org/network-access-proxy-api=true
                hub.jupyter.org/network-access-proxy-http=true
                hub.jupyter.org/network-access-singleuser=true
                pod-template-hash=1993976369
                release=jhub
Annotations:    checksum/config-map=55a5924b375f6b733949f0d8f7290957e3097fe9b364c6425b7022ad3c79722e
                checksum/secret=3430b5b3781de0f84b057a70745a15c4f8d6b53e2032ab73ee1970693c9a436d
                prometheus.io/path=/hub/metrics
                prometheus.io/scrape=true
Status:         Pending
IP:
Controlled By:  ReplicaSet/hub-5ff7fcb7bf
Containers:
  hub:
    Container ID:
    Image:         jupyterhub/k8s-hub:0.7.0-beta.2
    Image ID:
    Port:          8081/TCP
    Host Port:     0/TCP
    Command:
      jupyterhub
      --config
      /srv/jupyterhub_config.py
      --upgrade-db
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     200m
      memory:  512Mi
    Environment:
      SINGLEUSER_IMAGE:        jupyterhub/k8s-singleuser-sample:0.7.0-beta.2
      POD_NAMESPACE:           jhub (v1:metadata.namespace)
      CONFIGPROXY_AUTH_TOKEN:  <set to the key 'proxy.token' in secret 'hub-secret'>  Optional: false
    Mounts:
      /etc/jupyterhub/config/ from config (rw)
      /etc/jupyterhub/secret/ from secret (rw)
      /srv/jupyterhub from hub-db-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hub-token-fpxqd (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      hub-config
    Optional:  false
  secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-secret
    Optional:    false
  hub-db-dir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  hub-db-dir
    ReadOnly:   false
  hub-token-fpxqd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-token-fpxqd
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age               From                                    Message
  ----     ------                  ----              ----                                    -------
  Warning  FailedScheduling        9m (x2 over 9m)   default-scheduler                       pod has unbound PersistentVolumeClaims (repeated 2 times)
  Normal   Scheduled               9m                default-scheduler                       Successfully assigned hub-5ff7fcb7bf-sgz27 to ip-172-20-152-79.ec2.internal
  Normal   SuccessfulMountVolume   9m                kubelet, ip-172-20-152-79.ec2.internal  MountVolume.SetUp succeeded for volume "config"
  Normal   SuccessfulMountVolume   9m                kubelet, ip-172-20-152-79.ec2.internal  MountVolume.SetUp succeeded for volume "secret"
  Normal   SuccessfulMountVolume   9m                kubelet, ip-172-20-152-79.ec2.internal  MountVolume.SetUp succeeded for volume "hub-token-fpxqd"
  Warning  FailedAttachVolume      9m (x2 over 9m)   attachdetach-controller                 AttachVolume.Attach failed for volume "pvc-4d068ae7-a6f1-11e8-8a54-023b5bd89ff6" : "Error attaching EBS volume \"vol-02eb1ad199ffd5a95\"" to instance "i-0ed6f7e287bfe728c" since volume is in "creating" state
  Normal   SuccessfulAttachVolume  9m                attachdetach-controller                 AttachVolume.Attach succeeded for volume "pvc-4d068ae7-a6f1-11e8-8a54-023b5bd89ff6"
  Warning  FailedMount             43s (x4 over 7m)  kubelet, ip-172-20-152-79.ec2.internal  Unable to mount volumes for pod "hub-5ff7fcb7bf-sgz27_jhub(4d3621d3-a6f1-11e8-9524-0e4bc75ccd0c)": timeout expired waiting for volumes to attach or mount for pod "jhub"/"hub-5ff7fcb7bf-sgz27". list of unmounted volumes=[hub-db-dir]. list of unattached volumes=[config secret hub-db-dir hub-token-fpxqd]
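
As a cross-check on the AWS side, the state of the EBS volume named in that event can be queried with the AWS CLI (assuming it is installed and credentialed):

# "creating" means provisioning hasn't finished; expect "in-use" once attached
aws ec2 describe-volumes --volume-ids vol-02eb1ad199ffd5a95 \
  --query 'Volumes[0].State' --output text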

I'm guessing I missed or screwed up some step in the z2jh guide, but how do I debug this?


All 14 comments

Hmmm a quick note from me on mobile:

  1. does the PVC called hub-db-dir exist?
  2. does the associated PV exist?
    If yes to the first, and no to the second, or perhaps no matter what:
  3. what does the default StorageClass say?

You can use kubectl to describe, and to get --output yaml, all three of these: pvc, pv, storageclass. Perhaps that gives us more insight; see the sketch below.
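
A minimal sketch of those inspection commands, assuming the jhub namespace used above:

# Inspect the claim, the bound volume, and the default StorageClass
kubectl -n jhub get pvc hub-db-dir
kubectl -n jhub get pvc hub-db-dir --output yaml
kubectl get pv          # PVs are cluster-scoped, not namespaced
kubectl get storageclass
kubectl get storageclass gp2 --output yaml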

@consideRatio, I do have the PVC named hub-db-dir:

$ kubectl -n jhub get pvc
NAME         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
hub-db-dir   Bound     pvc-4d068ae7-a6f1-11e8-8a54-023b5bd89ff6   1Gi        RWO            gp2            1d

$ kubectl -n jhub describe pvc hub-db-dir
Name:          hub-db-dir
Namespace:     jhub
StorageClass:  gp2
Status:        Bound
Volume:        pvc-4d068ae7-a6f1-11e8-8a54-023b5bd89ff6
Labels:        app=jupyterhub
               chart=jupyterhub-0.7.0-beta.2
               component=hub
               heritage=Tiller
               release=jhub
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/aws-ebs
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
Events:        <none>

and here's the info on the Volume:

$ kubectl -n jhub describe pv  pvc-4d068ae7-a6f1-11e8-8a54-023b5bd89ff6
Name:            pvc-4d068ae7-a6f1-11e8-8a54-023b5bd89ff6
Labels:          failure-domain.beta.kubernetes.io/region=us-east-1
                 failure-domain.beta.kubernetes.io/zone=us-east-1d
Annotations:     kubernetes.io/createdby=aws-ebs-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller=yes
                 pv.kubernetes.io/provisioned-by=kubernetes.io/aws-ebs
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    gp2
Status:          Bound
Claim:           jhub/hub-db-dir
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Node Affinity:   <none>
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1d/vol-02eb1ad199ffd5a95
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

and the StorageClass:

[ec2-user@ip-172-31-29-161 ~]$ kubectl -n jhub describe StorageClass gp2
Name:            gp2
IsDefaultClass:  Yes
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"true"},"name":"gp2","namespace":""},"parameters":{"type":"gp2"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.beta.kubernetes.io/is-default-class=true
Provisioner:           kubernetes.io/aws-ebs
Parameters:            type=gp2
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

Does that give any clues?

@jacobtomlinson, any ideas here?

@rsignell-usgs I'm not confident about this, but I think it is related not to this repository but to the cloud provider and/or the Kubernetes setup running on it.

Googling the errors in the event log from your describe output, I found some comments about Kubernetes and kops versions, for example:

I too am facing the same issue after doing a kops upgrade, which moved the kubelet version to 1.9.6

I've upgraded all my nodes to 1.8.2, and the redis pod started and the volume seems to be mounted normally.

My suggested action:

  • Make sure your instances' kubelet (initialized by kops, I figure), your kops, and your kubectl are all the same version. I think kubectl can be a higher version and that's fine, but don't let it be lower.

Thanks for a good summary of your logs etc., @rsignell-usgs! I hope you get past this troublesome issue.

Oh hmm, btw, you may also want to try deleting the PVC. That should in turn make the cloud provider clean up the provisioned PV within a minute or so; verify that this happened. I don't know if this could help, but it's an additional way to reset the state. A sketch follows below.
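
A minimal sketch of that reset, assuming the jhub namespace (note: the pvc-protection finalizer will block deletion while the hub pod still references the claim, so the pod may need to be deleted first):

# Delete the hub pod so it releases the claim, then delete the claim itself
kubectl -n jhub delete pod -l component=hub
kubectl -n jhub delete pvc hub-db-dir
# Watch the dynamically provisioned PV get cleaned up (ReclaimPolicy: Delete)
kubectl get pv --watch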

To get the same kubelet version as kops (if kops has an associated Kubernetes version at all, which I assumed), you may want to recreate instances, or upgrade your instance group, or something like that; see the sketch below. Beware: I'm just guessing at terminology and available tech, without ever having used the Amazon cloud.
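
In kops terms that might look something like the following (a sketch only, assuming the cluster name used later in this thread):

# Align the cluster's Kubernetes version with what kops expects, then roll nodes
kops upgrade cluster kopscluster.k8s.local --yes
kops update cluster kopscluster.k8s.local --yes
kops rolling-update cluster kopscluster.k8s.local --yes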

I downgraded my kubectl client to be the same as the server, so I get:

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

and kops is also the same major/minor version:

$ kops version

Version 1.10.0 (git-8b52ea6d1)

I tried deleting and reinstalling the JupyterHub chart:

$ kubectl -n jhub get pods
$ helm upgrade --install jhub jupyterhub/jupyterhub --namespace jhub --version 0.7.0-beta.2 --values config.yaml

and got back:

Release "jhub" does not exist. Installing it now.
NAME:   jhub
LAST DEPLOYED: Tue Aug 28 17:42:17 2018
NAMESPACE: jhub
STATUS: DEPLOYED

RESOURCES:
==> v1/ConfigMap
NAME        DATA  AGE
hub-config  36    1s

==> v1/PersistentVolumeClaim
NAME        STATUS   VOLUME  CAPACITY  ACCESS MODES  STORAGECLASS  AGE
hub-db-dir  Pending  gp2     1s

==> v1/ServiceAccount
NAME  SECRETS  AGE
hub   1        1s

==> v1beta1/Role
NAME  AGE
hub   1s

==> v1beta1/RoleBinding
NAME  AGE
hub   1s

==> v1/Secret
NAME        TYPE    DATA  AGE
hub-secret  Opaque  1     1s

==> v1/Service
NAME          TYPE          CLUSTER-IP      EXTERNAL-IP  PORT(S)       AGE
hub           ClusterIP     100.69.215.85   <none>       8081/TCP      1s
proxy-public  LoadBalancer  100.69.3.32     <pending>    80:31455/TCP  1s
proxy-api     ClusterIP     100.66.150.224  <none>       8001/TCP      0s

==> v1beta2/Deployment
NAME   DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
hub    1        1        1           0          0s
proxy  1        0        0           0          0s

==> v1beta1/PodDisruptionBudget
NAME   MIN AVAILABLE  MAX UNAVAILABLE  ALLOWED DISRUPTIONS  AGE
hub    1              N/A              0                    0s
proxy  1              N/A              0                    0s

==> v1/Pod(related)
NAME                    READY  STATUS             RESTARTS  AGE
hub-5ff7fcb7bf-hw64r    0/1    Pending            0         0s
proxy-7b4fd468c9-6vhrt  0/1    ContainerCreating  0         0s


NOTES:
Thank you for installing JupyterHub!

but unfortunately I have the same situation again where the hub never leaves the ContainerCreating state:

$ kubectl -n jhub get pods
NAME                     READY     STATUS              RESTARTS   AGE
hub-5ff7fcb7bf-hw64r     0/1       ContainerCreating   0          6m
proxy-7b4fd468c9-6vhrt   1/1       Running             0          6m

What do you find if you run:

kubectl describe node | grep "Kubelet Version"
$ kubectl describe node | grep "Kubelet Version"
 Kubelet Version:            v1.10.3
 Kubelet Version:            v1.10.3
 Kubelet Version:            v1.10.3
 Kubelet Version:            v1.10.3
 Kubelet Version:            v1.10.3

Oh man. It's working now.

I was using m5.2xlarge instances in my cluster, but from this answer on GitHub issues I discovered that m5 instances have their EBS volumes exposed as NVMe block devices.
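
(One way to see this for yourself, assuming SSH access to a node: on m5 instances the EBS volumes appear under NVMe names such as /dev/nvme1n1 instead of the /dev/xvd* names older Kubernetes code expects.)

# Run on the node itself: list block devices; m5/c5 EBS volumes show as nvme*
lsblk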

So when I switched to m4.2xlarge instances, the problem went away:

kops create cluster kopscluster.k8s.local \
  --zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f  \
  --authorization RBAC \
  --master-size t2.small \
  --master-volume-size 10 \
  --node-size m4.2xlarge \
  --master-count 3 \
  --node-count 2 \
  --node-volume-size 120 \
  --yes

$  kubectl --namespace=jhub get pod
NAME                     READY     STATUS    RESTARTS   AGE
hub-5ff7fcb7bf-lqfwl     1/1       Running   0          58s
proxy-7b4fd468c9-qk98v   1/1       Running   0          58s

Should something be added to the z2jh guide to prevent this from happening to others, or is this just something that users should know?

ah awesome!!! Hmm, nah, I think this issue may be enough; there is so much that would need documenting and keeping up to date. Or hmm, was it part of the guide to recommend the setting you had?

Or perhaps not; I don't know yet what it means to be an m4 vs an m5.large, I need to learn more as usual :D

The issue with NVMe is a kops issue rather than a Jupyter issue.

The workaround for it is to switch the kops OS image to Debian Stretch, where NVMe support has been added:

image: kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08
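
A sketch of applying that image change, assuming the default kops instance group name of nodes (kops edit opens the spec in your editor; set spec.image to the value above, then roll the nodes):

# In the editor, set:  image: kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08
kops edit ig nodes
kops update cluster --yes
kops rolling-update cluster --yes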

Thanks for documenting this, @jacobtomlinson!
@rsignell-usgs the title of the issue is great, I bet others will find this if they run into the same issue!
