Today, when I rolled out a new version to one of our deployments, the pod got stuck in ContainerCreating with these error events:
1h 1m 37 some-api-2275263275-01pq7 Pod Warning FailedMount {kubelet gke-cluster-1-default-pool-4399eaa3-os4v} Unable to mount volumes for pod "some-api-2275263275-01pq7_default(afc5ae68-5b5e-11e6-afbb-42010a800105)": timeout expired waiting for volumes to attach/mount for pod "some-api-2275263275-01pq7"/"default". list of unattached/unmounted volumes=[default-token-880jy]
1h 1m 37 some-api-2275263275-01pq7 Pod Warning FailedSync {kubelet gke-cluster-1-default-pool-4399eaa3-os4v} Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "some-api-2275263275-01pq7"/"default". list of unattached/unmounted volumes=[default-token-880jy]
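For reference, this is how I pulled those events (pod name and namespace are taken from the output above):

```sh
# Show the pod's status and the FailedMount/FailedSync events:
kubectl describe pod some-api-2275263275-01pq7 --namespace default

# List recent events for the namespace, including the mount timeouts:
kubectl get events --namespace default
```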
I then attempted to scale the cluster, and more than 75% of the previously running pods switched to ContainerCreating and also got stuck there. This caused a widespread outage in our system, and I had to quickly create a new cluster.
We're using Google Cloud Platform's Container Engine (GKE), and the cluster version is 1.3.2.
@montanaflynn There were a number of storage-related issues with v1.3.2 that were fixed in v1.3.4. You probably hit one of those.
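If upgrading is an option, something like the following should move the cluster onto a patched release. The cluster name is inferred from the node name in your events, `ZONE` is a placeholder for your zone, and exact version availability may vary:

```sh
# Upgrade the nodes to a release with the storage fixes:
gcloud container clusters upgrade cluster-1 --zone ZONE --cluster-version 1.3.4

# Upgrade the master as well:
gcloud container clusters upgrade cluster-1 --zone ZONE --master
```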
If you share the complete /var/log/kubelet log from a node with a stuck deployment, I can take a look and confirm whether it's a known issue. I'd also need your GKE project name, cluster name, and zone to grab your master logs. Feel free to email me if you don't want to share publicly.
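For reference, one way to grab that log, assuming you have gcloud access to the project. The node name is taken from the events above, `ZONE` is a placeholder, and the exact log location depends on the node image:

```sh
# SSH to the node that reported the FailedMount events:
gcloud compute ssh gke-cluster-1-default-pool-4399eaa3-os4v --zone ZONE

# On images where the kubelet runs under systemd:
sudo journalctl -u kubelet

# On older images it logs straight to a file:
sudo cat /var/log/kubelet.log
```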
I saw a similar issue with v1.3.3, but in my case the root cause was a lot more pedestrian: my deployment requires a secrets volume, and I had forgotten to create the associated secret in the cluster I was deploying to. I saw no errors from `kubectl describe` or `kubectl logs`, but eventually realized that a deployment stays stuck in the ContainerCreating state (without logs, as far as I can tell) if a volume it depends on is missing.
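For anyone hitting the same thing: a quick sanity check is to confirm that every secret the deployment mounts actually exists in the target namespace. A minimal sketch, where the secret name `api-secrets` and its key are hypothetical:

```sh
# Verify the secret referenced by the deployment's volume exists:
kubectl get secret api-secrets --namespace default

# If it's missing, create it before the rollout (keys are illustrative):
kubectl create secret generic api-secrets --from-literal=api-key=changeme
```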
This issue is stale. Closing.