What kops version are you running? The command kops version will display this information.
1.8.0
What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.8.4
AWS
Create a number of M5 nodes using the Stretch image; many of them never connect to the network.
Some nodes start OK, others fail. Manually logging in to a failed node and restarting kubelet fixes the issue.
All nodes connect to the network.
This seems to be caused by a timing issue somewhere; comparing logs from a failed boot and a successful one, this jumps out:
Failed: Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 09:48:31 +0000. Up 9.35 seconds.
Successful: Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 07:50:00 +0000. Up 19.02 seconds.
Why did it take 10 seconds longer for cloud-init to start in the successful case?
Logs attached.
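For reference, the cloud-init timing lines above can be pulled straight from a node's own log; something like this works, assuming the default log location:
grep "running 'init-local'" /var/log/cloud-init.log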
It seems that M5 nodes aren't actually required; the same issue can be reproduced on M4 as well (it's just much less likely), so the real problem is probably the Stretch image.
I have the same thing happening with C4 instances with a Stretch image.
kops 1.8.0
kubernetes 1.8.4
Docker version 1.13.1
The journal looks similar, with kubelet errors.
We encountered a similar issue with m5.large nodes failing status checks; a simple reboot in the EC2 console brought them up. We are currently using m5.large nodes with the image kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02. Hope that helps.
Using:
kops 1.8.0
kubernetes 1.8.4
Docker version 1.13.1
This actually seems to happen more on Jessie than it did on Stretch, and it also happens on M4 (though much more often on M5). It looks like a general bug in kops, probably a missing dependency between services causing a timing issue.
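If it really is a missing ordering dependency, a systemd drop-in that forces kubelet to wait for the network to be fully online would be one way to test that hypothesis (just a sketch, not a confirmed fix; /etc/systemd/system/kubelet.service.d/ is the standard override location):
# /etc/systemd/system/kubelet.service.d/10-wait-for-network.conf (hypothetical override)
[Unit]
Wants=network-online.target
After=network-online.target
followed by systemctl daemon-reload and a restart of kubelet on the node.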
M5's are not supported with the kops Jessie image.
Please do not use Jessie with M5's!
I use M4s all the time and have not seen this problem, but that is with the older Jessie image.
This may well be related to https://github.com/kubernetes/kubernetes/issues/43768 and thus a general problem with kubelet itself. systemd-analyze on a failed M4 node running Stretch shows this:
systemd-analyze critical-chain kubelet.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.
kubelet.service @36.884s
└─docker.service @37.967s +1.113s
└─docker.socket @37.957s +4ms
└─sysinit.target @6.538s
└─cloud-init.service @4.559s +1.970s
└─networking.service @3.355s +1.199s
└─network-pre.target @3.352s
└─ebtables.service @30.199s +14ms
└─local-fs.target @1.133s
└─var-lib-docker-overlay-f95bef1de4740d7b5ba626e4f97ecec8cfc3ca39a5ab5dc82adff4c2c1497cf0-merged.mount @1min 13.174s
└─var-lib-docker-overlay.mount @39.007s
└─dev-xvda2.device @730ms +3.722s
Looking at the journal logs for networking.service and kubelet.service turns up nothing unusual:
journalctl -u networking.service:
Dec 16 07:28:22 localhost systemd[1]: Starting Raise network interfaces...
Dec 16 07:28:23 localhost dhclient[393]: Internet Systems Consortium DHCP Client 4.3.5
Dec 16 07:28:23 localhost ifup[381]: Internet Systems Consortium DHCP Client 4.3.5
Dec 16 07:28:23 localhost ifup[381]: Copyright 2004-2016 Internet Systems Consortium.
Dec 16 07:28:23 localhost ifup[381]: All rights reserved.
Dec 16 07:28:23 localhost ifup[381]: For info, please visit https://www.isc.org/software/dhcp/
Dec 16 07:28:23 localhost dhclient[393]: Copyright 2004-2016 Internet Systems Consortium.
Dec 16 07:28:23 localhost dhclient[393]: All rights reserved.
Dec 16 07:28:23 localhost dhclient[393]: For info, please visit https://www.isc.org/software/dhcp/
Dec 16 07:28:23 localhost dhclient[393]:
Dec 16 07:28:23 localhost dhclient[393]: Listening on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: Listening on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: Sending on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: Sending on Socket/fallback
Dec 16 07:28:23 localhost ifup[381]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
Dec 16 07:28:23 localhost dhclient[393]: Sending on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: DHCPREQUEST of 172.20.104.149 on eth0 to 255.255.255.255 port 67
Dec 16 07:28:23 localhost ifup[381]: DHCPOFFER of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 localhost ifup[381]: DHCPACK of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 localhost dhclient[393]: Sending on Socket/fallback
Dec 16 07:28:23 localhost dhclient[393]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
Dec 16 07:28:23 localhost dhclient[393]: DHCPREQUEST of 172.20.104.149 on eth0 to 255.255.255.255 port 67
Dec 16 07:28:23 localhost dhclient[393]: DHCPOFFER of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 localhost dhclient[393]: DHCPACK of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 ip-172-20-104-149 ifup[381]: bound to 172.20.104.149 -- renewal in 1772 seconds.
Dec 16 07:28:23 ip-172-20-104-149 systemd[1]: Started Raise network interfaces.
journalctl -u kubelet.service:
Dec 16 07:28:56 ip-172-20-104-149 systemd[1]: Started Kubernetes Kubelet Server.
The network had been up for over 30 seconds before kubelet started, yet even days later it still cannot connect.
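For completeness, the ordering kubelet.service actually declares can be inspected directly; if network-online.target does not show up there, the 30-second head start above is just luck:
systemctl show kubelet.service --property=After,Wants,Requires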
Hmm. /etc/resolv.conf was written to much later:
ls -l /etc/resolv.conf
-rw-r--r-- 1 root root 90 Dec 18 07:59 /etc/resolv.conf
Does anything other than the DHCP client touch that? Could that be the smoking gun?
No, probably not. The DHCP client rewrites that constantly.
However this could very well be it:
Failed:
I1216 07:28:57.369519 1900 aws.go:847] Building AWS cloudprovider
I1216 07:28:57.369555 1900 aws.go:810] Zone not specified in configuration file; querying AWS metadata service
E1216 07:28:57.586712 1900 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
W1216 07:28:57.586744 1900 tags.go:78] AWS cloud - no clusterID filtering applied for shared resources; do not run multiple clusters in this AZ.
I1216 07:28:57.586807 1900 server.go:301] Successfully initialized cloud provider: "aws" from the config file: ""
Successful:
I1218 08:32:38.492573 28476 aws.go:847] Building AWS cloudprovider
I1218 08:32:38.492596 28476 aws.go:810] Zone not specified in configuration file; querying AWS metadata service
I1218 08:32:38.669713 28476 tags.go:76] AWS cloud filtering on ClusterID: kube.test.int
I1218 08:32:38.669788 28476 server.go:301] Successfully initialized cloud provider: "aws" from the config file: ""
Why is the AWS metadata unavailable?
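One way to test whether this is just the tag not being visible yet would be to hold kubelet back until the instance's KubernetesCluster tag can actually be read. A rough sketch of such a pre-start check follows; it assumes the AWS CLI is installed on the node and that the node's IAM role allows ec2:DescribeTags, neither of which is guaranteed on a stock kops node, and it is not anything kops ships:
#!/bin/bash
# Hypothetical ExecStartPre= helper: block kubelet until the KubernetesCluster
# tag is visible for this instance.
set -u
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ%?}   # drop the trailing zone letter to get the region
for attempt in $(seq 1 30); do
  TAG=$(aws ec2 describe-tags --region "$REGION" \
    --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=KubernetesCluster" \
    --query 'Tags[0].Value' --output text 2>/dev/null || true)
  if [ -n "$TAG" ] && [ "$TAG" != "None" ]; then
    exit 0   # tag is there, let kubelet start
  fi
  sleep 5
done
echo "KubernetesCluster tag still not visible, giving up" >&2
exit 1
Wired in as an ExecStartPre= line in a kubelet drop-in, this would simply delay startup until DescribeTags returns the tag instead of letting kubelet come up without a ClusterID.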
I have forwarded the issue here: https://github.com/kubernetes/kubernetes/issues/57382
I'd leave this open in kops as well, in case the issue does end up being here.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
This looks to have been resolved as part of https://github.com/kubernetes/kubernetes/pull/60125, as mentioned above. Can this now be closed, or is it preferred that the cherry-picks into 1.8 and 1.9 go through first?
Either way, thanks for sticking with this, @vainu-arto! We appreciate it over here. 🎉
Thanks @jlaswell. This is a fairly big thing for us as well; getting the fix into a stable release would help a great deal (I have been manually replacing the kubelet URL in the launch configurations that care deeply about this).
I consciously left this issue open since, in issue #57382, @chrislovecnm said he thought this is a bug on the installer side (kubelet should not be started before the tags are set), and as far as I understand that side has not been worked on.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close