Kops: BUG REPORT: kube-dns stuck at "Error syncing pod" / "Pod sandbox changed, it will be killed and re-created"

Created on 9 Oct 2017 · 15 comments · Source: kubernetes/kops

-- BUG REPORT --
A freshly created kops cluster has a problem with kube-dns: the pod is stuck with "Error syncing pod" / "Pod sandbox changed, it will be killed and re-created."

kops command

kops create cluster --cloud=aws --zones=$AWS_ZONE \
--name=$CLUSTER_NAME \
--network-cidr=${NETWORK_CIDR} --vpc=${VPC_ID} \
--bastion=true --topology=private --networking=calico \
--dns-zone=${DNS_ZONE}

kops version
Version 1.7.0 (git-e04c29d)

kubectl version

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.2", GitCommit:"922a86cfcd65915a9b2f69f3f193b8907d741d9c", GitTreeState:"clean", BuildDate:"2017-07-21T08:08:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

cloud provider: AWS

admin@ip-172-17-3-61:~$ kubectl get events --all-namespaces
NAMESPACE     LASTSEEN   FIRSTSEEN   COUNT     NAME                                     KIND      SUBOBJECT                     TYPE      REASON           SOURCE                                 MESSAGE
kube-system   18s        1h          204       kube-dns-479524115-h5sxc                 Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   17s        1h          203       kube-dns-479524115-h5sxc                 Pod                                     Normal    SandboxChanged   kubelet, ip-172-17-3-61.ec2.internal   Pod sandbox changed, it will be killed and re-created.
kube-system   9s         1h          209       kube-dns-autoscaler-1818915203-7j0cx     Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   9s         1h          205       kube-dns-autoscaler-1818915203-7j0cx     Pod                                     Normal    SandboxChanged   kubelet, ip-172-17-3-61.ec2.internal   Pod sandbox changed, it will be killed and re-created.
kube-system   3m         4d          1405      kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Normal    Created          kubelet, ip-172-17-3-61.ec2.internal   Created container
kube-system   3m         4d          1405      kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Normal    Started          kubelet, ip-172-17-3-61.ec2.internal   Started container
kube-system   3m         4d          1404      kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Normal    Pulled           kubelet, ip-172-17-3-61.ec2.internal   Container image "gcr.io/google_containers/kube-proxy:v1.7.2" already present on machine
kube-system   9s         4d          32243     kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Warning   BackOff          kubelet, ip-172-17-3-61.ec2.internal   Back-off restarting failed container
kube-system   9s         4d          32243     kube-proxy-ip-172-17-3-61.ec2.internal   Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   18s        4d          13683     kubernetes-dashboard-4056215011-05kjw    Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   17s        4d          13628     kubernetes-dashboard-4056215011-05kjw    Pod                                     Normal    SandboxChanged   kubelet, ip-172-17-3-61.ec2.internal   Pod sandbox changed, it will be killed and re-created.

P.S. I had to change the taint on the master node to get past the initial error message "No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (1)". This seems like a bad default.
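For reference, a hedged sketch of how the master taint can be inspected and relaxed. The taint key `node-role.kubernetes.io/master` is an assumption here; depending on the kops/Kubernetes version the master may instead carry `dedicated=master:NoSchedule`.

# Show the taints currently set on each node
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Remove the NoSchedule taint from the master (the trailing "-" deletes the taint)
$ kubectl taint nodes <master-node-name> node-role.kubernetes.io/master:NoSchedule-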

lifecycle/rotten

Most helpful comment

I had the same problem while running 1.7.11 with Weave. It began all of a sudden, which is scary because, even though it happened on staging, my production environment has exactly the same setup. Pods were stuck in ContainerCreating.

I tried going from 1.7.11 -> 1.8.4 in a desperate attempt to get things working again, but nothing changed.

The fix was suggested to me by @hubt on the #kops Slack channel. It boils down to upgrading to Weave 2.1.3.

Still, it is very frustrating not knowing the reason. I suspect it might be related to https://github.com/weaveworks/weave/issues/2822, as I saw the "Unexpected command output Device "eth0" does not exist." message several times, and checking the IPAM service as suggested in https://github.com/weaveworks/weave/issues/2822#issuecomment-283113983 gives similar output.

All 15 comments

I'm also seeing this issue with a K8s cluster created with:

$ kops version
Version 1.7.1
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.9", GitCommit:"19fe91923d584c30bd6db5c5a21e9f0d5f742de8", GitTreeState:"clean", BuildDate:"2017-10-19T16:55:06Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

OS version: Container Linux by CoreOS 1576.4.0 (Ladybug)

Private topology with public DNS, using Calico.

calico-node: Image: calico/node:v1.2.1
install-cni: Image: calico/cni:v1.8.3
$ kubectl -n kube-system get pods
NAME                                                   READY     STATUS              RESTARTS   AGE
calico-node-22fnj                                      2/2       Running             6          46d
calico-node-c02n6                                      1/2       CrashLoopBackOff    101        46d
calico-node-cv03v                                      1/2       CrashLoopBackOff    526        46d
calico-policy-controller-3446986630-3043s              1/1       Running             3          46d
dns-controller-1140005764-ldx45                        1/1       Running             5          46d
etcd-server-events-ip-10-151-25-36.ec2.internal        1/1       Running             3          46d
etcd-server-ip-10-151-25-36.ec2.internal               1/1       Running             3          46d
heapster-2663181808-4l7j0                              0/2       ContainerCreating   0          1d
k8s-ec2-srcdst-501385644-m6290                         1/1       Running             4          46d
kube-apiserver-ip-10-151-25-36.ec2.internal            1/1       Running             3          46d
kube-controller-manager-ip-10-151-25-36.ec2.internal   1/1       Running             3          46d
kube-dns-1311260920-0m7rz                              0/3       ContainerCreating   0          1d
kube-dns-1311260920-bf45f                              0/3       ContainerCreating   0          1d
kube-dns-autoscaler-1818915203-6kgh9                   0/1       ContainerCreating   0          1d
kube-proxy-ip-10-151-24-194.ec2.internal               1/1       Running             3          2d
kube-proxy-ip-10-151-24-6.ec2.internal                 1/1       Running             3          1d
kube-proxy-ip-10-151-25-36.ec2.internal                1/1       Running             3          46d
kube-scheduler-ip-10-151-25-36.ec2.internal            1/1       Running             3          46d

Also, the cluster lives in AWS and uses m4.large instances for the master and nodes.

$ kops validate cluster
INSTANCE GROUPS
NAME                    ROLE    MACHINETYPE     MIN     MAX     SUBNETS
bastions                Bastion t2.micro        1       1       us-east-1d-pro
master-us-east-1d       Master  m4.large        1       1       us-east-1d-pro
nodes                   Node    m4.large        2       2       us-east-1d-pro

NODE STATUS
NAME                            ROLE    READY
ip-10-151-24-194.ec2.internal   node    True
ip-10-151-24-6.ec2.internal     node    True
ip-10-151-25-36.ec2.internal    master  True

Pod Failures in kube-system
NAME
calico-kube-controllers-3097067221-bxkjn
calico-node-c02n6
calico-node-cv03v
heapster-2663181808-4l7j0
heapster-2663181808-4l7j0
kube-dns-1311260920-0m7rz
kube-dns-1311260920-0m7rz
kube-dns-1311260920-0m7rz
kube-dns-1311260920-bf45f
kube-dns-1311260920-bf45f
kube-dns-1311260920-bf45f
kube-dns-autoscaler-1818915203-6kgh9

Validation Failed
Ready Master(s) 1 out of 1.
Ready Node(s) 2 out of 2.

your kube-system pods are NOT healthy
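For the CrashLoopBackOff calico-node pods above, a minimal diagnostic sketch; pod names are taken from the output above, and the container name `calico-node` is assumed to match the stock kops Calico manifest.

# Last state / exit reason of the crashing container
$ kubectl -n kube-system describe pod calico-node-cv03v

# Logs from the previous (crashed) calico-node container instance
$ kubectl -n kube-system logs calico-node-cv03v -c calico-node --previous

# Kubelet log on the node hosting the stuck kube-dns pods, filtered to network errors
$ journalctl -u kubelet --since "1 hour ago" | grep -iE "sandbox|cni|calico"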

I had the same problem while running 1.7.11 with Weave. It began all of a sudden, which is scary because, even though it happened on staging, my production environment has exactly the same setup. Pods were stuck in ContainerCreating.

I tried going from 1.7.11 -> 1.8.4 in a desperate attempt to get things working again, but nothing changed.

The fix was suggested to me by @hubt on the #kops Slack channel. It boils down to upgrading to Weave 2.1.3.

Still, it is very frustrating not knowing the reason. I suspect it might be related to https://github.com/weaveworks/weave/issues/2822, as I saw the "Unexpected command output Device "eth0" does not exist." message several times, and checking the IPAM service as suggested in https://github.com/weaveworks/weave/issues/2822#issuecomment-283113983 gives similar output.
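A hedged sketch of the upgrade @hubt suggested, bumping the weave-net images in place. This assumes the kops-installed DaemonSet is named `weave-net` with containers named `weave` and `weave-npc`, which is what the stock manifest uses; adjust if your manifest differs.

# Point both containers of the weave-net DaemonSet at the 2.1.3 images
$ kubectl -n kube-system set image daemonset/weave-net \
    weave=weaveworks/weave-kube:2.1.3 \
    weave-npc=weaveworks/weave-npc:2.1.3

# With updateStrategy RollingUpdate this restarts pods node by node;
# with OnDelete the weave-net pods have to be deleted manually to pick up the new image.
$ kubectl -n kube-system rollout status daemonset/weave-net

# IPAM check from weaveworks/weave#2822, run inside one weave pod
$ kubectl -n kube-system exec <weave-net-pod-name> -c weave -- /home/weave/weave --local status ipam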

Weave is getting upgraded to 2.1.3 in #3944, which hopefully will fix this problem for new clusters.

@bboreham will a kops upgrade automatically upgrade weave on a running cluster, or will admins need to manually upgrade weave in their clusters?

> will a kops upgrade automatically upgrade weave on a running cluster, or will admins need to manually upgrade weave in their clusters?

If yes, maybe it's worth releasing a v1.8.1 version?
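If the new Weave version does land in a kops release, the usual path for an addon change to reach an existing cluster is the normal kops upgrade flow, roughly as sketched below. Whether this actually replaces the running weave-net DaemonSet depends on the addon channel bump in that release, which is exactly the open question here.

# Pull the new kops defaults (including bundled addon manifests) into the cluster spec
$ kops upgrade cluster $CLUSTER_NAME --yes

# Apply the changed resources and addon manifests to the cluster
$ kops update cluster $CLUSTER_NAME --yes

# Roll the instances so every node picks up the new configuration
$ kops rolling-update cluster $CLUSTER_NAME --yes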

Same issue here, also with Calico.

I seem to be running into the same issue with Calico 2.6.7 on CoreOS, in a kubeadm setup (not kops) on DigitalOcean.

Has anyone figured out a root cause? Two different CNI providers are reporting this, so I don't think it is the provider. Different OSes too, so it isn't CoreOS or Debian specifically. I am thinking Docker or Kubernetes itself, maybe. Has anyone found anything in the logs, or has a repeatable set of commands to reproduce this? Since kubeadm is also mentioned, I am guessing this is not kops-specific.
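In the absence of a known root cause, a hedged checklist of the places this failure tends to surface, independent of the CNI provider; the pod name is a placeholder.

# Events and per-container state for one stuck pod
$ kubectl -n kube-system describe pod <stuck-kube-dns-pod>

# Kubelet log on the node that owns the pod: look for CNI / sandbox errors
$ journalctl -u kubelet --since "2 hours ago" | grep -iE "sandbox|cni|network plugin"

# Docker side: dying pause (sandbox) containers show up here
$ journalctl -u docker --since "2 hours ago" | tail -n 200
$ docker ps -a | grep -i pause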

I hit this bug when trying to scale my 'nodes' instance group from 1 instance to 2 (min == max).

I don't use any particular CNI.

I'm seeing a similar issue with kops v1.9.0, Kubernetes v1.9.6, and Calico networking. Every time, I have to restart the kubelet or delete kube-dns and let it be recreated (a sketch of that workaround follows the output below). I'm not able to find a root cause and there is nothing in the logs either. Anyone have any idea?

veeru@ultron:~$ kubectl get pods
NAME                                                                  READY     STATUS              RESTARTS   AGE
cadvisor-4cxdn                                                        1/1       Running             6          16d
cadvisor-652hf                                                        1/1       NodeLost            3          16d
cadvisor-6xdzc                                                        1/1       Running             1          16d
cadvisor-b62rm                                                        0/1       Error               8          16d
cadvisor-gxpc5                                                        1/1       Running             1          16d
cadvisor-zcgwv                                                        1/1       Running             2          16d
calico-kube-controllers-d97b7c4c8-n9msk                               1/1       Running             0          2d
calico-node-464md                                                     0/2       Error               16         16d
calico-node-5m79z                                                     2/2       Running             2          16d
calico-node-kvkbb                                                     2/2       NodeLost            6          16d
calico-node-sgjww                                                     2/2       Running             2          16d
calico-node-wff9p                                                     2/2       Running             4          16d
calico-node-x6fvs                                                     2/2       Running             10         5d
dns-controller-6b689bc66f-vkqsf                                       1/1       Running             1          16d
etcd-server-events-ip-172-20-107-53.us-east-2.compute.internal        1/1       Running             1          16d
etcd-server-events-ip-172-20-58-23.us-east-2.compute.internal         1/1       Running             1          16d
etcd-server-events-ip-172-20-75-191.us-east-2.compute.internal        1/1       Running             2          16d
etcd-server-ip-172-20-107-53.us-east-2.compute.internal               1/1       Running             1          16d
etcd-server-ip-172-20-58-23.us-east-2.compute.internal                1/1       Running             1          16d
etcd-server-ip-172-20-75-191.us-east-2.compute.internal               1/1       Running             2          16d
kube-apiserver-ip-172-20-107-53.us-east-2.compute.internal            1/1       Running             1          16d
kube-apiserver-ip-172-20-58-23.us-east-2.compute.internal             1/1       Running             1          16d
kube-apiserver-ip-172-20-75-191.us-east-2.compute.internal            1/1       Running             2          16d
kube-controller-manager-ip-172-20-107-53.us-east-2.compute.internal   1/1       Running             1          16d
kube-controller-manager-ip-172-20-58-23.us-east-2.compute.internal    1/1       Running             1          16d
kube-controller-manager-ip-172-20-75-191.us-east-2.compute.internal   1/1       Running             2          16d
kube-dns-6c4cb66dfb-kj5j2                                             0/3       ContainerCreating   0          2d
kube-dns-6c4cb66dfb-l9kxx                                             0/3       ContainerCreating   0          2d
kube-dns-autoscaler-f4c47db64-nnnn5                                   0/1       ContainerCreating   0          2d
kube-proxy-ip-172-20-107-53.us-east-2.compute.internal                1/1       Running             1          16d
kube-proxy-ip-172-20-126-109.us-east-2.compute.internal               1/1       Unknown             3          5d
kube-proxy-ip-172-20-45-160.us-east-2.compute.internal                1/1       Running             6          2d
kube-proxy-ip-172-20-58-23.us-east-2.compute.internal                 1/1       Running             1          16d
kube-proxy-ip-172-20-75-191.us-east-2.compute.internal                1/1       Running             2          16d
kube-proxy-ip-172-20-76-82.us-east-2.compute.internal                 0/1       Error               8          2d
kube-scheduler-ip-172-20-107-53.us-east-2.compute.internal            1/1       Running             1          16d
kube-scheduler-ip-172-20-58-23.us-east-2.compute.internal             1/1       Running             1          16d
kube-scheduler-ip-172-20-75-191.us-east-2.compute.internal            1/1       Running             2          16d
kubernetes-dashboard-head-6c65fd464-68sgq                             1/1       Running             1          2d
veeru@ultron:~$ kubectl logs kube-dns-autoscaler-f4c47db64-nnnn5
Error from server (BadRequest): container "autoscaler" in pod "kube-dns-autoscaler-f4c47db64-nnnn5" is waiting to start: ContainerCreating

veeru@ultron:~$ kubectl logs kube-dns-6c4cb66dfb-l9kxx kubedns
Error from server (BadRequest): container "kubedns" in pod "kube-dns-6c4cb66dfb-l9kxx" is waiting to start: ContainerCreating

veeru@ultron:~$ kubectl logs kube-dns-6c4cb66dfb-l9kxx dnsmasq
Error from server (BadRequest): container "dnsmasq" in pod "kube-dns-6c4cb66dfb-l9kxx" is waiting to start: ContainerCreating

veeru@ultron:~$ kubectl logs kube-dns-6c4cb66dfb-l9kxx sidecar
Error from server (BadRequest): container "sidecar" in pod "kube-dns-6c4cb66dfb-l9kxx" is waiting to start: ContainerCreating

veeru@ultron:~$ kubectl get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                                                     KIND      SUBOBJECT   TYPE      REASON                   SOURCE                                                MESSAGE
56s         2d           36325     cadvisor-b62rm.152981ac9bd9f78b                                          Pod                   Normal    SandboxChanged           kubelet, ip-172-20-76-82.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
15m         2d           36256     cadvisor-b62rm.152981aeda096280                                          Pod                   Warning   FailedCreatePodSandBox   kubelet, ip-172-20-76-82.us-east-2.compute.internal   Failed create pod sandbox.
53s         2d           36309     calico-node-464md.152981aceaa14299                                       Pod                   Normal    SandboxChanged           kubelet, ip-172-20-76-82.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
50m         50m          1         calico-node-464md.152a20c05f28a525                                       Pod                   Warning   FailedSync               kubelet, ip-172-20-76-82.us-east-2.compute.internal   error determining status: rpc error: code = Unknown desc = Error: No such container: a43e31a0abd6642060ad910b11ade722dc47a14c232cd14396908e4bdeb1782d
13s         2d           36285     kube-dns-6c4cb66dfb-kj5j2.152981b3e36d6c69                               Pod                   Normal    SandboxChanged           kubelet, ip-172-20-76-82.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
21s         2d           36353     kube-dns-6c4cb66dfb-l9kxx.152981b39f81ab34                               Pod                   Normal    SandboxChanged           kubelet, ip-172-20-76-82.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
15s         2d           36303     kube-dns-autoscaler-f4c47db64-nnnn5.152981b3e0cd2264                     Pod                   Normal    SandboxChanged           kubelet, ip-172-20-76-82.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
58s         2d           36271     kube-proxy-ip-172-20-76-82.us-east-2.compute.internal.152981ac914e7b1e   Pod                   Normal    SandboxChanged           kubelet, ip-172-20-76-82.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
20m         21m          2         kube-proxy-ip-172-20-76-82.us-east-2.compute.internal.152a225f44201641   Pod                   Warning   FailedSync               kubelet, ip-172-20-76-82.us-east-2.compute.internal   error determining status: rpc error: code = Unknown desc = Error: No such container: 608921f6d10e4c3af963950d451dda6acdd9c16fd231cb2da4f0d365c800eb72
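For completeness, the workaround described above (restart the kubelet, or delete the stuck kube-dns pods so the Deployment recreates them) looks roughly like this; the `k8s-app=kube-dns` label is assumed to match the stock kops kube-dns Deployment.

# On the node that owns the stuck pods: restart the kubelet
$ sudo systemctl restart kubelet

# Or delete the stuck pods and let the Deployment reschedule them
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -w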

I am seeing the same issue: Debian, Calico, and K8s 1.9.6. I have the node autoscaler enabled, and when this happened there was an ASG event deleting a node and spinning another one back up, so it seems to be related to autoscaling. I fixed it by restarting Docker on the node, which is kind of silly. This was in staging; I could not imagine this happening in production. I will not enable autoscaling in production.
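The "restart Docker" workaround mentioned here, as a sketch to run on the affected node; the order matters because the kubelet talks to the Docker daemon.

# On the broken node, restart Docker first, then the kubelet
$ sudo systemctl restart docker
$ sudo systemctl restart kubelet

# Sandbox (pause) containers should reappear once pods are scheduled again
$ docker ps | grep -c pause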

I've seen the same issue, and it is not only related to kube-dns. I'm using Weave, not Calico.

More info from the kubelet logs from one broken node:

Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891765    7383 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891815    7383 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891832    7383 kuberuntime_manager.go:647] createPodSandbox for pod "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891888    7383 pod_workers.go:186] Error syncing pod f6fe3f93-a6a6-11e8-80a5-0205d2a81076 ("nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)"), skipping: failed to "CreatePodSandbox" for "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Aug 23 07:46:30 ip-172-20-53-254 kubelet[7383]: I0823 07:46:30.730025    7383 kuberuntime_manager.go:416] Sandbox for pod "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" has no IP address.  Need to start a new one
Aug 23 07:46:31 ip-172-20-53-254 kubelet[7383]: I0823 07:46:31.436352    7383 kubelet.go:1896] SyncLoop (PLEG): "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)", event: &pleg.PodLifecycleEvent{ID:"f6fe3f93-a6a6-11e8-80a5-0205d2a81076", Type:"ContainerDied", Data:"da883b31b03187408bbee1b4642ba836932776977c200905fcb8e5f8cb9f4024"}
Aug 23 07:46:31 ip-172-20-53-254 kubelet[7383]: W0823 07:46:31.436438    7383 pod_container_deletor.go:77] Container "da883b31b03187408bbee1b4642ba836932776977c200905fcb8e5f8cb9f4024" not found in pod's containers
Aug 23 07:46:31 ip-172-20-53-254 kubelet[7383]: I0823 07:46:31.436465    7383 kubelet.go:1896] SyncLoop (PLEG): "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)", event: &pleg.PodLifecycleEvent{ID:"f6fe3f93-a6a6-11e8-80a5-0205d2a81076", Type:"ContainerStarted", Data:"4deab2663ce209335c30401f003c0465401ef20604d32e2cfbd5ec6ab9b6b938"}
Aug 23 07:47:05 ip-172-20-53-254 kubelet[7383]: I0823 07:47:05.109777    7383 server.go:796] GET /stats/summary/: (3.458746ms) 200 [[Go-http-client/1.1] 172.20.125.51:38646]
Aug 23 07:48:05 ip-172-20-53-254 kubelet[7383]: I0823 07:48:05.027382    7383 server.go:796] GET /stats/summary/: (3.582405ms) 200 [[Go-http-client/1.1] 172.20.125.51:38646]
Aug 23 07:48:26 ip-172-20-53-254 kubelet[7383]: I0823 07:48:26.863628    7383 container_manager_linux.go:425] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
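The DeadlineExceeded on RunPodSandbox in this log usually means the CNI plugin never answered the kubelet's network setup call. A hedged next step is to check what CNI configuration and plugin pods are present on that node; the paths are the conventional kubelet defaults and may differ in your setup.

# CNI config and plugin binaries the kubelet sees on this node
$ ls -l /etc/cni/net.d/ /opt/cni/bin/

# The weave (or calico) pod responsible for this node
$ kubectl -n kube-system get pods -o wide | grep ip-172-20-53-254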

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
