BUG REPORT
kubeadm version: kubeadm v1.17.*
While executing Cluster API tests, kubeadm join was sometimes observed to fail immediately after printing:
...
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
xref https://github.com/kubernetes-sigs/cluster-api/issues/2769
The goal is to make the kubelet-start phase more resilient.
This error happens only sometimes, most probably when there is a temporary blackout of the load balancer that sits in front of the API servers (HAProxy reloading its configuration).
Important: if possible the change should be kept as small as possible and backported.
Ok, there is a bug here. In theory the call to WaitForKubeletAndFunc should wait for both the kubelet healthz and waitForTLSBootstrappedClient to state that things are OK. However:
- WaitForKubeletAndFunc would wait for the first one of those to complete in order to return.
- ~~waitForTLSBootstrappedClient basically only constructs a Clientset from local files that were already written down on disk. Hence, it does not do any network operations and completes with a success in a negligible amount of time.~~ (EDIT: My test env was dirty and I was looking at the wrong file...)
- ~~Thus, WaitForKubeletAndFunc returns when it has only checked that a Clientset is creatable. It does not verify at all whether the kubelet is running.~~
On the question of the healthz timeout: there is an initial wait of 40s, after which the first connection attempt is made. There are a total of 5 attempts with exponential backoff (initial delay of 5s and a factor of 2). Thus the total timeout here is 40s + 155s = 195s, or 3m 15s.
But we don't wait for that.
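For reference, a rough sketch of that arithmetic (the 40s initial wait plus 5 exponentially backed-off attempts; illustrative only, not the actual kubeadm helper):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Initial sleep before the first healthz attempt.
	total := 40 * time.Second

	// 5 attempts with exponential backoff: 5s + 10s + 20s + 40s + 80s = 155s.
	delay := 5 * time.Second
	for i := 0; i < 5; i++ {
		total += delay
		delay *= 2
	}

	fmt.Println(total) // 3m15s
}
```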
/assign
/lifecycle active
In light of my message above, and with a properly clean environment (using kind), I figured out that the healthz check is not appropriate for us. In the kubelet that check is just a ping check (it returns 200 OK as long as the kubelet is running). Hence, it would return OK even if the kubelet is not bootstrapped yet.
However, because WaitForKubeletAndFunc returns on any (not all) of its conditions, the healthz check is executed after just 40 seconds and WaitForKubeletAndFunc returns without waiting for the bootstrap to complete.
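For reference, a ping-style healthz probe like the one described looks roughly like this (assuming the kubelet's default healthz endpoint on 127.0.0.1:10248; a sketch, not the kubeadm code):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// kubeletHealthy returns true as soon as the kubelet process is serving its
// healthz endpoint, regardless of whether the TLS bootstrap has finished.
func kubeletHealthy() bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:10248/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println(kubeletHealthy())
}
```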
In my opinion we should stop using healthz while waiting for the bootstrap to finish and use only the existence of the bootstrapped kubelet.conf file (what waitForTLSBootstrappedClient does).
> In my opinion we should stop using healthz while waiting for the bootstrap to finish and use only the existence of the bootstrapped kubelet.conf file (what waitForTLSBootstrappedClient does).
+1 waiting for the file to appear seems like a good indicator.
i don't think we have to validate the kubeconfig again, because we are doing that in the discovery part.
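For illustration, a minimal sketch of a purely file-based wait in the spirit of waitForTLSBootstrappedClient (the path, interval and timeout here are assumptions, not the actual kubeadm values):

```go
package main

import (
	"fmt"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Assumed location of the bootstrapped kubeconfig written by the kubelet.
	kubeletKubeConfig := "/etc/kubernetes/kubelet.conf"

	err := wait.PollImmediate(5*time.Second, 4*time.Minute, func() (bool, error) {
		// Done once the kubelet has written its bootstrapped kubeconfig.
		_, err := os.Stat(kubeletKubeConfig)
		return err == nil, nil
	})
	if err != nil {
		fmt.Println("timed out waiting for the kubelet TLS bootstrap:", err)
		os.Exit(1)
	}
	fmt.Println("kubelet TLS bootstrap completed")
}
```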
I'm not sure that getting rid of WaitForKubeletAndFunc is the right thing to do.
As far as I understand, the current logic waits for the TLS bootstrap, but if after some time the kubelet is not healthy, we exit fast.
So a healthz success does not make the wait loop finish, but a healthz failure after a grace period of 40s does.
see also https://github.com/kubernetes/kubernetes/pull/89735#discussion_r401704557
You can see the current implementation here. Basically it waits for the first func (between waitForTLSBootstrappedClient and WaitForHealthyKubelet) to finish and then it returns.
With the current behavior of kubelet's healthz probe and in the case of slow API Server it's entirely possible that WaitForHealthyKubelet would exit first after a 40s wait, but before the bootstrap process is completed.
On the other hand, we can modify WaitForKubeletAndFunc to actually wait for both funcs to finish (instead of just one).
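For illustration, one way such a wait-for-all variant could look (a sketch assuming two func() error checks; not the actual kubeadm signature):

```go
package main

import (
	"fmt"
	"sync"
)

// waitForAll runs every check and returns only after all of them have
// finished, reporting the first error observed (if any). This sketches the
// "wait for both funcs" idea, not the real WaitForKubeletAndFunc.
func waitForAll(fns ...func() error) error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(fns))

	for _, fn := range fns {
		wg.Add(1)
		go func(f func() error) {
			defer wg.Done()
			if err := f(); err != nil {
				errCh <- err
			}
		}(fn)
	}

	wg.Wait()
	close(errCh)
	for err := range errCh {
		return err
	}
	return nil
}

func main() {
	err := waitForAll(
		func() error { return nil }, // e.g. the kubelet healthz wait
		func() error { return nil }, // e.g. the TLS bootstrap wait
	)
	fmt.Println(err)
}
```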
@rosti, the WaitForHealthyKubelet triggers exit (returns to the channel) only in case of errors
https://github.com/kubernetes/kubernetes/blob/8ae26096f7714b8ba95f62cf5018073dde2d1dfb/cmd/kubeadm/app/util/apiclient/wait.go#L163-L165
Sorry about that, I looked into the code again and it indeed was writing to the channel only upon an error. So you are right about it.
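For reference, a minimal sketch of the pattern being described, where the healthz waiter writes to the channel only on failure (the functions here are illustrative stand-ins, not the kubeadm code):

```go
package main

import "fmt"

// Illustrative stand-ins for the real checks.
func waitForHealthyKubelet() error        { return nil }
func waitForTLSBootstrappedClient() error { return nil }

// waitForKubelet mirrors the pattern discussed above: the healthz waiter
// writes to the channel only when it fails, so a healthy kubelet cannot
// short-circuit the wait; the TLS bootstrap waiter is what completes it.
func waitForKubelet() error {
	errC := make(chan error, 2)

	go func() {
		if err := waitForHealthyKubelet(); err != nil {
			errC <- err // failures only
		}
	}()

	go func() {
		errC <- waitForTLSBootstrappedClient() // success or failure ends the wait
	}()

	return <-errC
}

func main() {
	fmt.Println(waitForKubelet())
}
```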
@rosti re-opening because we are still missing:
- changing the code so that annotating the node with the CRI socket is retried as well.
/reopen
This basically means moving the following two calls into a separate func and wrapping them in their own retry statement:
IMO, a good initial setting for this operation is something like:
```go
// An exponential backoff configuration that returns durations for a total time of ~40s.
// Example: 0, .5s, 1.2s, 2.3s, 4s, 6s, 10s, 16s, 24s, 37s
// Jitter is added as a random fraction of the duration multiplied by the jitter factor.
return wait.Backoff{
	Duration: 500 * time.Millisecond,
	Factor:   1.5,
	Steps:    10,
	Jitter:   0.4,
}
```
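For illustration, a hedged sketch of wrapping a retried call in that backoff via wait.ExponentialBackoff (patchNode here is a hypothetical placeholder for the calls being wrapped):

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// patchNode is a placeholder for the calls that would be wrapped in the retry.
func patchNode() error { return errors.New("transient API error") }

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond,
		Factor:   1.5,
		Steps:    10,
		Jitter:   0.4,
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := patchNode(); err != nil {
			return false, nil // treat the error as transient and retry
		}
		return true, nil
	})
	fmt.Println(err) // wait.ErrWaitTimeout once the steps are exhausted
}
```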
@fabriziopandini: Reopened this issue.
In response to this:
> @rosti re-opening because we are still missing:
> - changing the code so that annotating the node with the CRI socket is retried as well.
>
> /reopen
>
> This basically means moving the following two calls into a separate func and wrapping them in their own retry statement:
> IMO, a good initial setting for this operation is something like:
>
> ```go
> // An exponential backoff configuration that returns durations for a total time of ~40s.
> // Example: 0, .5s, 1.2s, 2.3s, 4s, 6s, 10s, 16s, 24s, 37s
> // Jitter is added as a random fraction of the duration multiplied by the jitter factor.
> return wait.Backoff{
> 	Duration: 500 * time.Millisecond,
> 	Factor:   1.5,
> 	Steps:    10,
> 	Jitter:   0.4,
> }
> ```
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
PS. Sorry if I missed this when reviewing kubernetes/kubernetes#89735
@fabriziopandini it's already retried using a polling timeout (see here). It's currently at 500ms retry interval for a 2 minute timeout. It can receive the same treatment as above - bump the retry interval to 5s and the total timeout to 5 mins.
@rosti I think we are good for patch node (even if 500ms is somewhat aggressive).
However, ClientSetFromFile is not retried.
If this call is not affected by temporary connection problems, we are fine. Otherwise, it should be retried as well.
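For illustration, a sketch of what retrying the clientset construction could look like (the helper name, interval and timeout are assumptions, and clientcmd.BuildConfigFromFlags stands in for kubeadm's own loader):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// clientSetFromFileWithRetry retries building a clientset from a kubeconfig
// file, so a transient problem while the file is being (re)written does not
// fail the join immediately.
func clientSetFromFileWithRetry(path string) (*kubernetes.Clientset, error) {
	var cs *kubernetes.Clientset
	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		config, err := clientcmd.BuildConfigFromFlags("", path)
		if err != nil {
			return false, nil // retry: the file may not be readable yet
		}
		cs, err = kubernetes.NewForConfig(config)
		if err != nil {
			return false, nil // retry on transient construction errors
		}
		return true, nil
	})
	return cs, err
}

func main() {
	cs, err := clientSetFromFileWithRetry("/etc/kubernetes/kubelet.conf")
	fmt.Println(cs != nil, err)
}
```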
@fabriziopandini @rosti
did we:
let me know how i can help.
closing as cherry picks are LGTM/approved
/close
@neolit123: Closing this issue.
In response to this:
> closing as cherry picks are LGTM/approved
>
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.