kubeadm init fails with no error and hangs forever

Created on 22 Oct 2019 · 5 comments · Source: kubernetes/kubeadm

/kind bug

kubeadm version (use kubeadm version): 1.15.3

What happened?

@vivgoyal recently spun up a large number of clusters (200) and 5% of them failed during kubeadm init. The logs showed that kubeadm init had started, but the process then hung forever and never finished: no further output, no error, and no timeout whatsoever. The environment was AWS.

Is there some code here that doesn't use context correctly?

What you expected to happen?

I expected a timeout or some error information.

How to reproduce it (as minimally and precisely as possible)?

It's an intermittent error. Only 10/200 kubeadm inits saw this problem.

kind/bug priority/backlog

All 5 comments

@vivgoyal recently spun up a lot of clusters (200) and 5% failed during kubeadm init.

we are creating more clusters in our CI and we haven't seen such a problem.
error logs would be ideal.

Is there some code here that doesn't use context correctly?

if kubeadm fails to find a healthy kubelet and/or api-server it should time out after 4 minutes:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants.go#L188
https://github.com/kubernetes/kubernetes/blob/1dc5235d0a93f4594402dfc0d7bcd5db88b3b4be/cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go#L95

i suspect that there is something odd going on in the AWS setup that trips these HTTP client calls:
https://github.com/kubernetes/kubernetes/blob/5268f69405251a4a74130fa903e055a59071179a/cmd/kubeadm/app/util/apiclient/wait.go#L140
https://github.com/kubernetes/kubernetes/blob/5268f69405251a4a74130fa903e055a59071179a/cmd/kubeadm/app/util/apiclient/wait.go#L81
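For context, here is a minimal sketch of what such a bounded wait looks like with the apimachinery wait helpers. The endpoint, interval, and health check are illustrative and not kubeadm's actual wiring; only the 4-minute bound mirrors the constant linked above:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Illustrative client; kubeadm's real check uses its own client and CA.
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	// Poll the API server health endpoint every 5s, but give up after 4 minutes
	// instead of waiting forever.
	err := wait.PollImmediate(5*time.Second, 4*time.Minute, func() (bool, error) {
		resp, err := client.Get("https://127.0.0.1:6443/healthz")
		if err != nil {
			return false, nil // not reachable yet, keep polling
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK, nil
	})
	if err != nil {
		// A bounded poll surfaces wait.ErrWaitTimeout rather than hanging.
		fmt.Println("control plane did not become healthy:", err)
	}
}
```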

but we can't proceed without error logs.
please ask @vivgoyal to call kubeadm init with --v=10 in their tests.

yeah, i suspected we would need more verbose logging. Alrighty, thank you @neolit123 !

I believe I found the issue.

Let's walk through the relevant code.

This is a fairly straightforward pattern: the work runs in a goroutine tracked by a wait group, with a timeout around it.

But here's the gotcha: the wait group.

After the timeout fires, we still wait for the wait group to finish.

That is fine if the wait group ever finishes, but it only finishes once the goroutine returns.

Does the goroutine return? wait.Until returns when the stopChan is closed, which at this point has already happened.

But a closed channel doesn't help if wait.Until is still stuck inside the function it is running, and in this case that function ends up in an infinite loop.

Why is that an infinite loop? The function ultimately calls PollImmediateInfinite, and that unbounded poll is what holds the whole thing up.

If the API server never comes up, that poll never returns, which means wait.Until is blocked forever, never looping and never observing the closed channel, because no context is used to cancel the function call. Therefore the wait group will be waiting forever.
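To make the hang concrete, here is a small self-contained sketch of that pattern (a simplified reproduction, not the actual kubeadm code). Closing the stop channel has no effect because wait.Until only checks it between invocations, and the invocation never returns:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stopCh := make(chan struct{})
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		// wait.Until re-runs the function every period and only checks stopCh
		// between invocations.
		wait.Until(func() {
			// Stand-in for the real work: PollImmediateInfinite never returns
			// if the condition never becomes true (e.g. the API server never
			// comes up).
			_ = wait.PollImmediateInfinite(time.Second, func() (bool, error) {
				return false, nil // "still not healthy"
			})
		}, 5*time.Second, stopCh)
	}()

	// Simulate the outer timeout firing and the stop channel being closed.
	time.Sleep(3 * time.Second)
	close(stopCh)

	fmt.Println("timeout fired, waiting for the goroutine to finish...")
	wg.Wait() // blocks forever: the inner poll never returns, so wait.Until never returns
	fmt.Println("never reached")
}
```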

@neolit123 I'd like to take a stab at fixing this. I think the appropriate thing to do is use contexts, but I'm not sure that fits in with the pattern of kubeadm right now and I'm not sure we'd want to plumb a context all the way through. I'd be happy to try to tackle this work if you would accept a context all the way through, but if that is not the appropriate path forward, I believe changing PollImmediateInfinite to actually have a timeout is an equally good approach.
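For illustration only, the cancellation-style option could look roughly like the sketch below, using the existing wait helpers rather than a full context plumb-through; the condition is a placeholder, not kubeadm's actual discovery code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// PollImmediateUntil polls until the condition succeeds or the stop
	// channel closes. Feeding it ctx.Done() means cancelling the context
	// (or hitting its deadline) unblocks the poll instead of it spinning
	// forever.
	err := wait.PollImmediateUntil(time.Second, func() (bool, error) {
		return false, nil // placeholder: "cluster-info not available yet"
	}, ctx.Done())

	fmt.Println("poll returned:", err) // wait.ErrWaitTimeout after ~3s
}
```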

great investigation @chuckha
please do send a fix if you want. code freeze is Thursday EOD pacific time

I'd like to take a stab at fixing this. I think the appropriate thing to do is use contexts, but I'm not sure that fits in with the pattern of kubeadm right now and I'm not sure we'd want to plumb a context all the way through.

ideally we shouldn't use context here for now, as it is a much larger refactor.

I believe changing PollImmediateInfinite to actually have a timeout is an equally good approach.

IMO we should stop doing PollImmediateInfinite in discovery and do PollImmediate with a sensible timeout of something like 2 minutes.
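A rough sketch of that change, assuming the discovery condition looks something like the placeholder below (fetchClusterInfo is hypothetical, not the actual kubeadm function):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// fetchClusterInfo stands in for the real discovery condition (e.g. fetching
// the cluster-info ConfigMap); here it never succeeds, to show the timeout.
func fetchClusterInfo() (bool, error) {
	return false, nil
}

func main() {
	// Before: wait.PollImmediateInfinite(5*time.Second, fetchClusterInfo)
	// can block forever if the API server never becomes reachable.
	// After: bound the same poll with a sensible timeout so the caller gets
	// an error back instead of hanging.
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, fetchClusterInfo)
	fmt.Println("discovery finished:", err) // wait.ErrWaitTimeout if it never succeeds
}
```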
