What kops version are you running? The command kops version will display this information.
1.8.0
What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.8.4
AWS
Create a number of M5 nodes using the Stretch image; many of them never connect to the network.
Some nodes start OK, others fail. Manually logging in to a failed node and restarting kubelet fixes the issue.
All nodes connect to the network.
This seems to be caused by a timing issue somewhere; comparing logs from a failed boot and a successful one, this jumps out:
Failed: Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 09:48:31 +0000. Up 9.35 seconds.
Successful: Cloud-init v. 0.7.9 running 'init-local' at Thu, 07 Dec 2017 07:50:00 +0000. Up 19.02 seconds.
Why did it take 10 seconds longer for cloud-init to start in the successful case?
Logs attached.
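For reference, the cloud-init timing lines above can be pulled straight from a node's own log; something like this works, assuming the default log location:
grep "running 'init-local'" /var/log/cloud-init.log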
It seems that M5 nodes aren't actually required; the same issue can be reproduced on M4 as well (it's just much less likely), so the real problem is probably the Stretch image.
I have the same thing happening with C4 instances with a Stretch image.
kops 1.8.0
kubernetes 1.8.4
Docker version 1.13.1
The journal looks similar, with kubelet errors.
We encountered a similar issue with m5.large nodes failing status checks; a simple reboot in the EC2 console brought them up. We are currently using m5.large nodes with the image kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02. Hope that helps.
Using:
kops 1.8.0
kubernetes 1.8.4
Docker version 1.13.1
This actually seems to happen more on Jessie than it did on Stretch, and it also happens on M4 (though much more often on M5). It looks like a general bug in kops, probably a missing dependency between services causing a timing issue.
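If it really is a missing ordering dependency, a systemd drop-in that forces kubelet to wait for the network to be fully online would be one way to test that hypothesis (just a sketch, not a confirmed fix; /etc/systemd/system/kubelet.service.d/ is the standard override location):
# /etc/systemd/system/kubelet.service.d/10-wait-for-network.conf (hypothetical override)
[Unit]
Wants=network-online.target
After=network-online.target
followed by systemctl daemon-reload and a restart of kubelet on the node.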
M5's are not supported with the kops Jessie image.
Please do not use Jessie with M5's!
I use M4s all the time and have not seen this problem, but that is with the older Jessie image.
This may well be related to https://github.com/kubernetes/kubernetes/issues/43768 and thus a general problem with kubelet itself. systemd-analyze on a failed M4 node running Stretch shows this:
systemd-analyze critical-chain kubelet.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.
kubelet.service @36.884s
└─docker.service @37.967s +1.113s
└─docker.socket @37.957s +4ms
└─sysinit.target @6.538s
└─cloud-init.service @4.559s +1.970s
└─networking.service @3.355s +1.199s
└─network-pre.target @3.352s
└─ebtables.service @30.199s +14ms
└─local-fs.target @1.133s
└─var-lib-docker-overlay-f95bef1de4740d7b5ba626e4f97ecec8cfc3ca39a5ab5dc82adff4c2c1497cf0-merged.mount @1min 13.174s
└─var-lib-docker-overlay.mount @39.007s
└─dev-xvda2.device @730ms +3.722s
Looking at the journal logs for networking.service and kubelet.service turns up nothing unusual:
journalctl -u networking.service:
Dec 16 07:28:22 localhost systemd[1]: Starting Raise network interfaces...
Dec 16 07:28:23 localhost dhclient[393]: Internet Systems Consortium DHCP Client 4.3.5
Dec 16 07:28:23 localhost ifup[381]: Internet Systems Consortium DHCP Client 4.3.5
Dec 16 07:28:23 localhost ifup[381]: Copyright 2004-2016 Internet Systems Consortium.
Dec 16 07:28:23 localhost ifup[381]: All rights reserved.
Dec 16 07:28:23 localhost ifup[381]: For info, please visit https://www.isc.org/software/dhcp/
Dec 16 07:28:23 localhost dhclient[393]: Copyright 2004-2016 Internet Systems Consortium.
Dec 16 07:28:23 localhost dhclient[393]: All rights reserved.
Dec 16 07:28:23 localhost dhclient[393]: For info, please visit https://www.isc.org/software/dhcp/
Dec 16 07:28:23 localhost dhclient[393]:
Dec 16 07:28:23 localhost dhclient[393]: Listening on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: Listening on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: Sending on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: Sending on Socket/fallback
Dec 16 07:28:23 localhost ifup[381]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
Dec 16 07:28:23 localhost dhclient[393]: Sending on LPF/eth0/0a:b6:31:85:fc:da
Dec 16 07:28:23 localhost ifup[381]: DHCPREQUEST of 172.20.104.149 on eth0 to 255.255.255.255 port 67
Dec 16 07:28:23 localhost ifup[381]: DHCPOFFER of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 localhost ifup[381]: DHCPACK of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 localhost dhclient[393]: Sending on Socket/fallback
Dec 16 07:28:23 localhost dhclient[393]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
Dec 16 07:28:23 localhost dhclient[393]: DHCPREQUEST of 172.20.104.149 on eth0 to 255.255.255.255 port 67
Dec 16 07:28:23 localhost dhclient[393]: DHCPOFFER of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 localhost dhclient[393]: DHCPACK of 172.20.104.149 from 172.20.96.1
Dec 16 07:28:23 ip-172-20-104-149 ifup[381]: bound to 172.20.104.149 -- renewal in 1772 seconds.
Dec 16 07:28:23 ip-172-20-104-149 systemd[1]: Started Raise network interfaces.
journalctl -u kubelet.service:
Dec 16 07:28:56 ip-172-20-104-149 systemd[1]: Started Kubernetes Kubelet Server.
The network had been up for over 30 seconds before kubelet started, yet even days later it still cannot connect.
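For completeness, the ordering kubelet.service actually declares can be inspected directly; if network-online.target does not show up there, the 30-second head start above is just luck:
systemctl show kubelet.service --property=After,Wants,Requires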
Hmm. /etc/resolv.conf was written to much later:
ls -l /etc/resolv.conf
-rw-r--r-- 1 root root 90 Dec 18 07:59 /etc/resolv.conf
Does anything other than the DHCP client touch that? Could that be the smoking gun?
No, probably not. The DHCP client rewrites that constantly.
However this could very well be it:
Failed:
I1216 07:28:57.369519 1900 aws.go:847] Building AWS cloudprovider
I1216 07:28:57.369555 1900 aws.go:810] Zone not specified in configuration file; querying AWS metadata service
E1216 07:28:57.586712 1900 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
W1216 07:28:57.586744 1900 tags.go:78] AWS cloud - no clusterID filtering applied for shared resources; do not run multiple clusters in this AZ.
I1216 07:28:57.586807 1900 server.go:301] Successfully initialized cloud provider: "aws" from the config file: ""
Successful:
I1218 08:32:38.492573 28476 aws.go:847] Building AWS cloudprovider
I1218 08:32:38.492596 28476 aws.go:810] Zone not specified in configuration file; querying AWS metadata service
I1218 08:32:38.669713 28476 tags.go:76] AWS cloud filtering on ClusterID: kube.test.int
I1218 08:32:38.669788 28476 server.go:301] Successfully initialized cloud provider: "aws" from the config file: ""
Why is the AWS metadata unavailable?
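One way to test whether this is just the tag not being visible yet would be to hold kubelet back until the instance's KubernetesCluster tag can actually be read. A rough sketch of such a pre-start check follows; it assumes the AWS CLI is installed on the node and that the node's IAM role allows ec2:DescribeTags, neither of which is guaranteed on a stock kops node, and it is not anything kops ships:
#!/bin/bash
# Hypothetical ExecStartPre= helper: block kubelet until the KubernetesCluster
# tag is visible for this instance.
set -u
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ%?}   # drop the trailing zone letter to get the region
for attempt in $(seq 1 30); do
  TAG=$(aws ec2 describe-tags --region "$REGION" \
    --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=KubernetesCluster" \
    --query 'Tags[0].Value' --output text 2>/dev/null || true)
  if [ -n "$TAG" ] && [ "$TAG" != "None" ]; then
    exit 0   # tag is there, let kubelet start
  fi
  sleep 5
done
echo "KubernetesCluster tag still not visible, giving up" >&2
exit 1
Wired in as an ExecStartPre= line in a kubelet drop-in, this would simply delay startup until DescribeTags returns the tag instead of letting kubelet come up without a ClusterID.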
I have forwarded the issue here: https://github.com/kubernetes/kubernetes/issues/57382
I'd leave this open in kops as well, in case the issue does end up being here.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
This looks to have been resolved as part of https://github.com/kubernetes/kubernetes/pull/60125, as mentioned above. Can this now be closed, or is it preferred that the cherry-picks into 1.8 and 1.9 go through first?
Either way, thanks for sticking with this, @vainu-arto! We appreciate it over here. 🎉
Thanks @jlaswell. This is a fairly big thing for us as well; getting the fix into a stable release would help a great deal (I have been manually replacing the kubelet URL in the launch configurations that care deeply about this).
I consciously left this issue open since, in issue #57382, @chrislovecnm said he thought this is a bug on the installer side (kubelet should not be started before the tags are set), and as far as I understand that side has not been worked on.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close