Cluster-api: When deploying a cluster with control-plane-count = 3, the API server is not responding during scaling

Created on 4 Mar 2020 · 20 comments · Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:

When creating a cluster on AWS/vSphere with CONTROL_PLANE_MACHINE_COUNT: 3, I noticed that the API server is intermittently unavailable, mainly while the control plane VMs are being scaled from count=1 to count=3, and for some time after the control plane nodes come up.

I noticed this when running clusterctl init after the API server came up but before all the control plane nodes had been provisioned.

During the init workflow, the API server is unresponsive for some time and the command fails with:
Attempt 1: (AWS)

Error: failed to get cert-manager web-hook: rpc error: code = Unavailable desc = etcdserver: leader changed

Attempt 2: (AWS)

Error: failed to get cert-manager web-hook: etcdserver: request timed out

Attempt 3: (VSphere)

Error: failed to get cert-manager web-hook: Get https://192.168.111.79:6443/apis/apiregistration.k8s.io/v1beta1/apiservices/v1beta1.webhook.cert-manager.io: EOF

Attempt 4: (VSphere)

Error: failed to create cert-manager component: /v1, Kind=ServiceAccount, cert-manager/cert-manager-cainjector: rpc error: code = Unavailable desc = etcdserver: leader changed

All the failures occurred because the API server was not responding, or because of some issue with the etcd server.
Note: the cluster init process started and installed all the CRDs before the failure, so the API server was responsive when the init process began.

What did you expect to happen:

API server should always be responsive during the control plane scaling operation.

Environment:

  • Cluster-api version: commit v0.3.0-rc.2-82-gaa289e08a
  • On vsphere and aws

/kind bug

area/control-plane kind/bug

All 20 comments

/assign
@Anuj2512 https://github.com/kubernetes-sigs/cluster-api/pull/2524 just landed with some improvements to make the creation of resources more resilient.
Could you test whether this change addresses your problem?

/area clusterctl

@fabriziopandini I tried your fix from #2524; it still does not resolve the issue.

As I noticed, when the new (second/third) control plane VM comes up, the API server stops responding for a few seconds and no kubectl command works, meaning the API server is unavailable for some time.
At that point I still got the same error:

failed to get cert-manager web-hook: failed to connect to the management cluster: Get https://192.168.111.24:6443/api?timeout=32s: unexpected EOF

It happens exactly when READY REPLICAS goes from 1 to 2 for the kubeadmcontrolplane. So it does not look like a clusterctl issue (though it would be great if clusterctl could do something here), but I still feel that the API server should always be available during control plane scaling.

NAMESPACE    NAME                                                                                  READY   INITIALIZED   REPLICAS   READY REPLICAS   UPDATED REPLICAS   UNAVAILABLE REPLICAS
default      kubeadmcontrolplane.controlplane.cluster.x-k8s.io/management-vsphere-20200304162309   true    true          2          1                2                  1

NAMESPACE    NAME                                                                              PROVIDERID                                       PHASE
default      machine.cluster.x-k8s.io/management-vsphere-20200304162309-6tzc8                  vsphere://42213a48-0169-28ff-b68d-8cffaa22d157   Provisioning
default      machine.cluster.x-k8s.io/management-vsphere-20200304162309-lnn96                  vsphere://422110fa-26b0-c7c6-843d-1dbfdf5b6eab   Running

cc @yastij and @detiber

This could very well be caused by the brief interruption that is expected when scaling the etcd cluster from 1 to 2 members, where the cluster becomes unavailable for a brief period while the second instance is still coming up and quorum is not yet established.

I'm not necessarily sure there is much we can do about this, since this would exist for any kubeadm managed control plane during the scaling from 1 to 2 members.
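The quorum arithmetic behind this interruption can be sketched in a few lines of Go (an illustration, not cluster-api code): etcd requires floor(n/2)+1 healthy members to accept writes, so going from 1 to 2 members raises the quorum to 2 while tolerating zero failures, which is why the cluster can stall until the new member is fully synced. Going from 2 to 3 keeps quorum at 2, which matches the comment below that 2->3 scaling should not impact availability.

```go
package main

import "fmt"

// quorum returns the number of etcd members that must be healthy
// for the cluster to accept writes: floor(n/2) + 1.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	for _, n := range []int{1, 2, 3} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), n-quorum(n))
	}
	// members=1 quorum=1 tolerated failures=0
	// members=2 quorum=2 tolerated failures=0
	// members=3 quorum=2 tolerated failures=1
}
```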

failed to get cert-manager web-hook: failed to connect to the management cluster: Get https://192.168.111.24:6443/api?timeout=32s: unexpected EOF

At what point in the flow did you get this error? Do you have the full output available?

@neolit123 @timothysc when you kubeadm join a control plane node, is it reasonable to expect the control plane has 100% availability during the join?

Is this the test case here?

  1. Create bootstrap cluster (kind create cluster)
  2. clusterctl init into bootstrap cluster
  3. clusterctl config cluster ... | k apply -f - to provision real management cluster, with 3 control plane machines
  4. While the real management cluster is being set up, but before it's fully scaled to 3, you run clusterctl init targeting the real management cluster

@ncdc Yes, that is the exact test case I am running. Since the API server is available after the first CP node, the user can run kubectl commands as well as clusterctl init; it is during the CP scale-up that the init operation fails.

For the test case, I highly suspect you would need to be actively using the management cluster during the scaling from 1->2 replicas. I don't believe the same type of interruption would be seen when scaling from 2->3 replicas, since cluster quorum should not be impacted.

@detiber I have seen the error below even after all 3 CP nodes were provisioned and I ran clusterctl init immediately afterwards.

Error: failed to create cert-manager component: /v1, Kind=ServiceAccount, cert-manager/cert-manager-cainjector: rpc error: code = Unavailable desc = etcdserver: leader changed

Maybe the latest #2524 PR fixed this etcdserver: leader changed error, because I haven't encountered it since that fix.

@Anuj2512 interesting, I wouldn't expect the leader to change as part of scaling up the etcd cluster.

What type of environment are you using for the management cluster? Does it have any type of resource contention that could be causing the etcd cluster to shift the leadership?

I have seen the below error even after all 3 CP nodes are provisioned

Define "provisioned", please? What are you waiting for, specifically?

@ncdc By "provisioned" I mean all 3 replicas are in the READY state, as below:

NAMESPACE    NAME                                                                                  READY   INITIALIZED   REPLICAS   READY REPLICAS   UPDATED REPLICAS   UNAVAILABLE REPLICAS
default      kubeadmcontrolplane.controlplane.cluster.x-k8s.io/management-vsphere-20200304162309   true    true          3          3                3                  

@detiber I don't think so. I tried the above on AWS as well as vSphere.
Out of 6-7 attempts, I got the etcdserver: leader changed error twice (once on AWS and once on vSphere). All other attempts failed with the failed to connect to the management cluster: Get https://192.168.111.24:6443/api?timeout=32s: unexpected EOF error.

@detiber

This could very well be caused by the brief interruption that is expected when scaling the etcd cluster from 1 to 2 members, where the cluster becomes unavailable for a brief period while the second instance is still coming up and quorum is not yet established.

this can indeed happen.

@ncdc

@neolit123 @timothysc when you kubeadm join a control plane node, is it reasonable to expect the control plane has 100% availability during the join?

normally users wait for kubeadm init, since it blocks until the kubelet and API server are available; at that point, joining more CP nodes using kubeadm join should account for the potential etcd blackout using retries.

@Anuj2512

API server should always be responsive during the control plane scaling operation.

just pointing out that this might be hard to guarantee with etcd as the backend.

if, during scaling to 2+ members, the etcd cluster is not changing leader, I'd be curious to know whether you are seeing anything suspicious in the API server and etcd logs, aside from the "leader changed" part.

side question: what is the k8s version of this cluster?

One immediate problem:

https://github.com/kubernetes-sigs/cluster-api/blob/ac1dee8cf441dda7f1a36fd4e149e195ef16db20/cmd/clusterctl/client/cluster/cert_manager.go#L144-L146

which can encounter this error if e.g. the apiserver is having issues: https://github.com/kubernetes-sigs/cluster-api/blob/ac1dee8cf441dda7f1a36fd4e149e195ef16db20/cmd/clusterctl/client/cluster/cert_manager.go#L207

If we return the error, then cm.pollImmediateWaiter is going to return immediately instead of retrying. This is one specific thing we can adjust. cc @fabriziopandini @vincepri

Sent a fix for ^^

/milestone v0.3.x

Going to close this one now that #2563 has been merged and we haven't heard back

/close

@vincepri: Closing this issue.

In response to this:

Going to close this one now that #2563 has been merged and we haven't heard back

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
