-- BUG REPORT --
A freshly created kops cluster has an issue with kube-dns: the kube-dns pods are stuck cycling between "Error syncing pod" and "Pod sandbox changed, it will be killed and re-created."
kops command:
kops create cluster --cloud=aws --zones=$AWS_ZONE \
--name=$CLUSTER_NAME \
--network-cidr=${NETWORK_CIDR} --vpc=${VPC_ID} \
--bastion=true --topology=private --networking=calico \
--dns-zone=${DNS_ZONE}
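For completeness, the command above relies on a few shell variables; the values below are placeholders of the right shape (not the real ones), just to show what needs to be set:
$ export AWS_ZONE=us-east-1a                 # placeholder availability zone
$ export CLUSTER_NAME=k8s.example.com        # placeholder cluster name
$ export NETWORK_CIDR=172.17.0.0/16          # placeholder CIDR of the existing VPC
$ export VPC_ID=vpc-0123456789abcdef0        # placeholder ID of that VPC
$ export DNS_ZONE=example.com                # placeholder Route 53 hosted zone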
kops version
Version 1.7.0 (git-e04c29d)
kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.2", GitCommit:"922a86cfcd65915a9b2f69f3f193b8907d741d9c", GitTreeState:"clean", BuildDate:"2017-07-21T08:08:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
cloud provider: AWS
admin@ip-172-17-3-61:~$ kubectl get events --all-namespaces
NAMESPACE LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
kube-system 18s 1h 204 kube-dns-479524115-h5sxc Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 17s 1h 203 kube-dns-479524115-h5sxc Pod Normal SandboxChanged kubelet, ip-172-17-3-61.ec2.internal Pod sandbox changed, it will be killed and re-created.
kube-system 9s 1h 209 kube-dns-autoscaler-1818915203-7j0cx Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 9s 1h 205 kube-dns-autoscaler-1818915203-7j0cx Pod Normal SandboxChanged kubelet, ip-172-17-3-61.ec2.internal Pod sandbox changed, it will be killed and re-created.
kube-system 3m 4d 1405 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Normal Created kubelet, ip-172-17-3-61.ec2.internal Created container
kube-system 3m 4d 1405 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Normal Started kubelet, ip-172-17-3-61.ec2.internal Started container
kube-system 3m 4d 1404 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Normal Pulled kubelet, ip-172-17-3-61.ec2.internal Container image "gcr.io/google_containers/kube-proxy:v1.7.2" already present on machine
kube-system 9s 4d 32243 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Warning BackOff kubelet, ip-172-17-3-61.ec2.internal Back-off restarting failed container
kube-system 9s 4d 32243 kube-proxy-ip-172-17-3-61.ec2.internal Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 18s 4d 13683 kubernetes-dashboard-4056215011-05kjw Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 17s 4d 13628 kubernetes-dashboard-4056215011-05kjw Pod Normal SandboxChanged kubelet, ip-172-17-3-61.ec2.internal Pod sandbox changed, it will be killed and re-created.
P.S. I had to change the taint on the master node to get past the initial error message "No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (1)". This seems like a bad choice for a default? What I did is sketched below.
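Roughly, the taint change looked like this; the taint key is an assumption (it depends on the kops version), so check the node's taints first:
$ # Inspect the taints kops put on the master (node name taken from the events above)
$ kubectl describe node ip-172-17-3-61.ec2.internal | grep -i taint
$ # Remove the NoSchedule taint (key assumed; the trailing '-' deletes it)
$ kubectl taint nodes ip-172-17-3-61.ec2.internal node-role.kubernetes.io/master:NoSchedule-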
I'm also seeing this issue with a K8s cluster created with:
$ kops version
Version 1.7.1
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.9", GitCommit:"19fe91923d584c30bd6db5c5a21e9f0d5f742de8", GitTreeState:"clean", BuildDate:"2017-10-19T16:55:06Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
OS version: Container Linux by CoreOS 1576.4.0 (Ladybug)
Private topology with public DNS, using Calico
calico-node: Image: calico/node:v1.2.1
install-cni: Image: calico/cni:v1.8.3
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
calico-node-22fnj 2/2 Running 6 46d
calico-node-c02n6 1/2 CrashLoopBackOff 101 46d
calico-node-cv03v 1/2 CrashLoopBackOff 526 46d
calico-policy-controller-3446986630-3043s 1/1 Running 3 46d
dns-controller-1140005764-ldx45 1/1 Running 5 46d
etcd-server-events-ip-10-151-25-36.ec2.internal 1/1 Running 3 46d
etcd-server-ip-10-151-25-36.ec2.internal 1/1 Running 3 46d
heapster-2663181808-4l7j0 0/2 ContainerCreating 0 1d
k8s-ec2-srcdst-501385644-m6290 1/1 Running 4 46d
kube-apiserver-ip-10-151-25-36.ec2.internal 1/1 Running 3 46d
kube-controller-manager-ip-10-151-25-36.ec2.internal 1/1 Running 3 46d
kube-dns-1311260920-0m7rz 0/3 ContainerCreating 0 1d
kube-dns-1311260920-bf45f 0/3 ContainerCreating 0 1d
kube-dns-autoscaler-1818915203-6kgh9 0/1 ContainerCreating 0 1d
kube-proxy-ip-10-151-24-194.ec2.internal 1/1 Running 3 2d
kube-proxy-ip-10-151-24-6.ec2.internal 1/1 Running 3 1d
kube-proxy-ip-10-151-25-36.ec2.internal 1/1 Running 3 46d
kube-scheduler-ip-10-151-25-36.ec2.internal 1/1 Running 3 46d
Also, the cluster lives in AWS and uses m4.large instances for both the master and the nodes.
$ kops validate cluster
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
bastions Bastion t2.micro 1 1 us-east-1d-pro
master-us-east-1d Master m4.large 1 1 us-east-1d-pro
nodes Node m4.large 2 2 us-east-1d-pro
NODE STATUS
NAME ROLE READY
ip-10-151-24-194.ec2.internal node True
ip-10-151-24-6.ec2.internal node True
ip-10-151-25-36.ec2.internal master True
Pod Failures in kube-system
NAME
calico-kube-controllers-3097067221-bxkjn
calico-node-c02n6
calico-node-cv03v
heapster-2663181808-4l7j0
heapster-2663181808-4l7j0
kube-dns-1311260920-0m7rz
kube-dns-1311260920-0m7rz
kube-dns-1311260920-0m7rz
kube-dns-1311260920-bf45f
kube-dns-1311260920-bf45f
kube-dns-1311260920-bf45f
kube-dns-autoscaler-1818915203-6kgh9
Validation Failed
Ready Master(s) 1 out of 1.
Ready Node(s) 2 out of 2.
your kube-system pods are NOT healthy
I had the same problem while running 1.7.11 using Weave. It began all of a sudden, which is scary because, even though this happened on staging, my production environment has exactly the same setup. Pods were stuck in ContainerCreating.
I tried going from 1.7.11 -> 1.8.4 in a desperate attempt to get things working again, but things remained the same.
This fix was suggested to me by @hubt on the #kops Slack channel. It boils down to upgrading to Weave 2.1.3:
1. Delete the old Weave DaemonSet in the kube-system namespace: kubectl delete -f weave-daemonset-k8s-1.7.yaml (the rolebindings are not exactly the same, but I am not 100% sure about this step).
2. Recreate it with the 2.1.3 manifest: kubectl create -f weave-daemonset-k8s-1.7.yaml
Still, it is very frustrating not knowing what the reason is. I suspect it might be related to https://github.com/weaveworks/weave/issues/2822, as I saw the "Unexpected command output Device "eth0" does not exist." message several times, and checking the IPAM service as suggested in https://github.com/weaveworks/weave/issues/2822#issuecomment-283113983 gives similar output.
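For anyone who wants to run the same IPAM check, the command from that Weave issue looks roughly like this; the pod name is a placeholder and the label/container names are assumptions based on the standard weave-net DaemonSet:
$ # Find a weave-net pod, then ask Weave for its IPAM view
$ kubectl -n kube-system get pods -l name=weave-net
$ kubectl -n kube-system exec weave-net-xxxxx -c weave -- /home/weave/weave --local status ipam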
Weave is getting upgraded to 2.1.3 in #3944, which hopefully will fix this problem for new clusters.
@bboreham will a kops upgrade automatically upgrade weave on a running cluster, or will admins need to manually upgrade weave in their clusters?
> will a kops upgrade automatically upgrade weave on a running cluster, or will admins need to manually upgrade weave in their clusters?
If yes, maybe it's worth releasing a v1.8.1 version?
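For anyone upgrading manually in the meantime, the usual kops flow is roughly the following (a sketch only, assuming kops manages the Weave addon; the rolling update replaces instances, so plan accordingly):
$ kops upgrade cluster $CLUSTER_NAME --yes         # bump the cluster spec to the latest supported versions
$ kops update cluster $CLUSTER_NAME --yes          # apply the updated manifests / launch configurations
$ kops rolling-update cluster $CLUSTER_NAME --yes  # cycle the instances so they pick up the changes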
Same issue here, with Calico.
Seems like I'm running into the same issue with Calico 2.6.7 on CoreOS, with a kubeadm setup (not kops) on DigitalOcean.
Has anyone figured out a root cause? I am seeing two different CNI providers mentioned, so I don't think it is the providers. Different OSes, so it's not CoreOS or Debian either. I am thinking Docker or Kubernetes itself, maybe. Has anyone found anything in the logs? Does anyone have a repeatable set of commands to reproduce this? kubeadm is mentioned as well, so I am guessing this is not kops-specific.
I had this bug when trying to update my 'nodes' instance group from 1 instance to 2 (min == max).
I don't use any particular CNI.
I'm seeing a similar issue with kops v1.9.0, Kubernetes v1.9.6, and Calico networking. Every time it happens I have to restart the kubelet, or delete the kube-dns pods and let them be replaced (the workaround is sketched after the events output below). I'm not able to find the root cause and there is nothing in the logs either. Anyone have any idea?
veeru@ultron:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cadvisor-4cxdn 1/1 Running 6 16d
cadvisor-652hf 1/1 NodeLost 3 16d
cadvisor-6xdzc 1/1 Running 1 16d
cadvisor-b62rm 0/1 Error 8 16d
cadvisor-gxpc5 1/1 Running 1 16d
cadvisor-zcgwv 1/1 Running 2 16d
calico-kube-controllers-d97b7c4c8-n9msk 1/1 Running 0 2d
calico-node-464md 0/2 Error 16 16d
calico-node-5m79z 2/2 Running 2 16d
calico-node-kvkbb 2/2 NodeLost 6 16d
calico-node-sgjww 2/2 Running 2 16d
calico-node-wff9p 2/2 Running 4 16d
calico-node-x6fvs 2/2 Running 10 5d
dns-controller-6b689bc66f-vkqsf 1/1 Running 1 16d
etcd-server-events-ip-172-20-107-53.us-east-2.compute.internal 1/1 Running 1 16d
etcd-server-events-ip-172-20-58-23.us-east-2.compute.internal 1/1 Running 1 16d
etcd-server-events-ip-172-20-75-191.us-east-2.compute.internal 1/1 Running 2 16d
etcd-server-ip-172-20-107-53.us-east-2.compute.internal 1/1 Running 1 16d
etcd-server-ip-172-20-58-23.us-east-2.compute.internal 1/1 Running 1 16d
etcd-server-ip-172-20-75-191.us-east-2.compute.internal 1/1 Running 2 16d
kube-apiserver-ip-172-20-107-53.us-east-2.compute.internal 1/1 Running 1 16d
kube-apiserver-ip-172-20-58-23.us-east-2.compute.internal 1/1 Running 1 16d
kube-apiserver-ip-172-20-75-191.us-east-2.compute.internal 1/1 Running 2 16d
kube-controller-manager-ip-172-20-107-53.us-east-2.compute.internal 1/1 Running 1 16d
kube-controller-manager-ip-172-20-58-23.us-east-2.compute.internal 1/1 Running 1 16d
kube-controller-manager-ip-172-20-75-191.us-east-2.compute.internal 1/1 Running 2 16d
kube-dns-6c4cb66dfb-kj5j2 0/3 ContainerCreating 0 2d
kube-dns-6c4cb66dfb-l9kxx 0/3 ContainerCreating 0 2d
kube-dns-autoscaler-f4c47db64-nnnn5 0/1 ContainerCreating 0 2d
kube-proxy-ip-172-20-107-53.us-east-2.compute.internal 1/1 Running 1 16d
kube-proxy-ip-172-20-126-109.us-east-2.compute.internal 1/1 Unknown 3 5d
kube-proxy-ip-172-20-45-160.us-east-2.compute.internal 1/1 Running 6 2d
kube-proxy-ip-172-20-58-23.us-east-2.compute.internal 1/1 Running 1 16d
kube-proxy-ip-172-20-75-191.us-east-2.compute.internal 1/1 Running 2 16d
kube-proxy-ip-172-20-76-82.us-east-2.compute.internal 0/1 Error 8 2d
kube-scheduler-ip-172-20-107-53.us-east-2.compute.internal 1/1 Running 1 16d
kube-scheduler-ip-172-20-58-23.us-east-2.compute.internal 1/1 Running 1 16d
kube-scheduler-ip-172-20-75-191.us-east-2.compute.internal 1/1 Running 2 16d
kubernetes-dashboard-head-6c65fd464-68sgq 1/1 Running 1 2d
veeru@ultron:~$ kubectl logs kube-dns-autoscaler-f4c47db64-nnnn5
Error from server (BadRequest): container "autoscaler" in pod "kube-dns-autoscaler-f4c47db64-nnnn5" is waiting to start: ContainerCreating
veeru@ultron:~$ kubectl logs kube-dns-6c4cb66dfb-l9kxx kubedns
Error from server (BadRequest): container "kubedns" in pod "kube-dns-6c4cb66dfb-l9kxx" is waiting to start: ContainerCreating
veeru@ultron:~$ kubectl logs kube-dns-6c4cb66dfb-l9kxx dnsmasq
Error from server (BadRequest): container "dnsmasq" in pod "kube-dns-6c4cb66dfb-l9kxx" is waiting to start: ContainerCreating
veeru@ultron:~$ kubectl logs kube-dns-6c4cb66dfb-l9kxx sidecar
Error from server (BadRequest): container "sidecar" in pod "kube-dns-6c4cb66dfb-l9kxx" is waiting to start: ContainerCreating
veeru@ultron:~$ kubectl get events
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
56s 2d 36325 cadvisor-b62rm.152981ac9bd9f78b Pod Normal SandboxChanged kubelet, ip-172-20-76-82.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
15m 2d 36256 cadvisor-b62rm.152981aeda096280 Pod Warning FailedCreatePodSandBox kubelet, ip-172-20-76-82.us-east-2.compute.internal Failed create pod sandbox.
53s 2d 36309 calico-node-464md.152981aceaa14299 Pod Normal SandboxChanged kubelet, ip-172-20-76-82.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
50m 50m 1 calico-node-464md.152a20c05f28a525 Pod Warning FailedSync kubelet, ip-172-20-76-82.us-east-2.compute.internal error determining status: rpc error: code = Unknown desc = Error: No such container: a43e31a0abd6642060ad910b11ade722dc47a14c232cd14396908e4bdeb1782d
13s 2d 36285 kube-dns-6c4cb66dfb-kj5j2.152981b3e36d6c69 Pod Normal SandboxChanged kubelet, ip-172-20-76-82.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
21s 2d 36353 kube-dns-6c4cb66dfb-l9kxx.152981b39f81ab34 Pod Normal SandboxChanged kubelet, ip-172-20-76-82.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
15s 2d 36303 kube-dns-autoscaler-f4c47db64-nnnn5.152981b3e0cd2264 Pod Normal SandboxChanged kubelet, ip-172-20-76-82.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
58s 2d 36271 kube-proxy-ip-172-20-76-82.us-east-2.compute.internal.152981ac914e7b1e Pod Normal SandboxChanged kubelet, ip-172-20-76-82.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
20m 21m 2 kube-proxy-ip-172-20-76-82.us-east-2.compute.internal.152a225f44201641 Pod Warning FailedSync kubelet, ip-172-20-76-82.us-east-2.compute.internal error determining status: rpc error: code = Unknown desc = Error: No such container: 608921f6d10e4c3af963950d451dda6acdd9c16fd231cb2da4f0d365c800eb72
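For completeness, the workaround I keep falling back to looks roughly like this; the node is a placeholder and the kube-dns label selector is an assumption based on the standard kops manifests:
$ # On the affected node: restart the kubelet (assumes a systemd-managed kubelet)
$ sudo systemctl restart kubelet
$ # Or, from a workstation: delete the stuck kube-dns pods so the Deployment recreates them
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns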
I am seeing the same issue: Debian, Calico, and Kubernetes 1.9.6. I have autoscaling enabled for the nodes. When this happened, there was an ASG event that deleted a node and spun another one up, so it seems to be related to autoscaling. I fixed it by restarting Docker on the node, which is kind of silly. This was in staging; I could not imagine this happening in production, so I will not enable autoscaling in production.
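Roughly what the fix looked like on my side, for what it's worth (a sketch; assumes Docker is managed by systemd on the node):
$ # On the node whose pods are stuck in ContainerCreating:
$ sudo systemctl restart docker
$ # Then watch the kubelet rebuild the pod sandboxes:
$ kubectl -n kube-system get pods -o wide -w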
I've seen the same issue, and it is not only related to kube-dns. I'm using Weave, not Calico.
More info from the kubelet logs on one broken node:
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891765 7383 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891815 7383 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891832 7383 kuberuntime_manager.go:647] createPodSandbox for pod "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Aug 23 07:46:29 ip-172-20-53-254 kubelet[7383]: E0823 07:46:29.891888 7383 pod_workers.go:186] Error syncing pod f6fe3f93-a6a6-11e8-80a5-0205d2a81076 ("nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)"), skipping: failed to "CreatePodSandbox" for "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Aug 23 07:46:30 ip-172-20-53-254 kubelet[7383]: I0823 07:46:30.730025 7383 kuberuntime_manager.go:416] Sandbox for pod "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)" has no IP address. Need to start a new one
Aug 23 07:46:31 ip-172-20-53-254 kubelet[7383]: I0823 07:46:31.436352 7383 kubelet.go:1896] SyncLoop (PLEG): "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)", event: &pleg.PodLifecycleEvent{ID:"f6fe3f93-a6a6-11e8-80a5-0205d2a81076", Type:"ContainerDied", Data:"da883b31b03187408bbee1b4642ba836932776977c200905fcb8e5f8cb9f4024"}
Aug 23 07:46:31 ip-172-20-53-254 kubelet[7383]: W0823 07:46:31.436438 7383 pod_container_deletor.go:77] Container "da883b31b03187408bbee1b4642ba836932776977c200905fcb8e5f8cb9f4024" not found in pod's containers
Aug 23 07:46:31 ip-172-20-53-254 kubelet[7383]: I0823 07:46:31.436465 7383 kubelet.go:1896] SyncLoop (PLEG): "nginx-7dc755b6f7-kc5g8_custom(f6fe3f93-a6a6-11e8-80a5-0205d2a81076)", event: &pleg.PodLifecycleEvent{ID:"f6fe3f93-a6a6-11e8-80a5-0205d2a81076", Type:"ContainerStarted", Data:"4deab2663ce209335c30401f003c0465401ef20604d32e2cfbd5ec6ab9b6b938"}
Aug 23 07:47:05 ip-172-20-53-254 kubelet[7383]: I0823 07:47:05.109777 7383 server.go:796] GET /stats/summary/: (3.458746ms) 200 [[Go-http-client/1.1] 172.20.125.51:38646]
Aug 23 07:48:05 ip-172-20-53-254 kubelet[7383]: I0823 07:48:05.027382 7383 server.go:796] GET /stats/summary/: (3.582405ms) 200 [[Go-http-client/1.1] 172.20.125.51:38646]
Aug 23 07:48:26 ip-172-20-53-254 kubelet[7383]: I0823 07:48:26.863628 7383 container_manager_linux.go:425] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
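In case it helps others collect the same data, these entries can be pulled on the node with something like the following (assuming the kubelet runs as a systemd unit, as on the kops Debian images):
$ journalctl -u kubelet --since "1 hour ago" | grep -E "CreatePodSandbox|SandboxChanged|DeadlineExceeded"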
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.