What steps did you take and what happened:
[A clear and concise description of how to REPRODUCE the bug.]
The target cluster is ready:
kubectl --kubeconfig=./capi-quickstart.kubeconfig get nodes
NAME                                    STATUS   ROLES    AGE     VERSION
capi-quickstart-2-control-plane-mnldn   Ready    master   12m     v1.17.3
capi-quickstart-2-control-plane-vjnnr   Ready    master   8m50s   v1.17.3
capi-quickstart-2-control-plane-vqc22   Ready    master   10m     v1.17.3
capi-quickstart-2-md-0-99ms5            Ready    <none>   10m     v1.17.3
capi-quickstart-2-md-0-99nzj            Ready    <none>   9m55s   v1.17.3
capi-quickstart-2-md-0-rt4wr            Ready    <none>   9m54s   v1.17.3
Follow the instructions at https://cluster-api.sigs.k8s.io/clusterctl/commands/move.html#pivot
clusterctl --kubeconfig=./capi-quickstart.kubeconfig init
Fetching providers
Installing Provider="cluster-api" Version="v0.3.1" TargetNamespace="capi-system"
Error: action failed after 3 attempts: failed to create provider object cert-manager.io/v1alpha2, Kind=Certificate, capi-webhook-system/capi-serving-cert: Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
What did you expect to happen:
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):

/kind bug
What CAPI version are you using?
I see it in the logs now, could you try with master?
clusterctl master should have a fix that retries cert-manager
/cc @fabriziopandini
Same issue with master / v0.3.2:
$ clusterctl --kubeconfig=./capi-quickstart.kubeconfig init --v 5
Fetching File="control-plane-components.yaml" Provider="control-plane-kubeadm" Version="v0.3.2"
Fetching File="metadata.yaml" Provider="cluster-api" Version="v0.3.2"
Fetching File="metadata.yaml" Provider="bootstrap-kubeadm" Version="v0.3.2"
Fetching File="metadata.yaml" Provider="control-plane-kubeadm" Version="v0.3.2"
Installing Provider="cluster-api" Version="v0.3.2" TargetNamespace="capi-system"
Creating shared objects Provider="cluster-api" Version="v0.3.2"
Creating Namespace="capi-webhook-system"
Creating CustomResourceDefinition="clusters.cluster.x-k8s.io"
Creating CustomResourceDefinition="machinedeployments.cluster.x-k8s.io"
Creating CustomResourceDefinition="machinehealthchecks.cluster.x-k8s.io"
Creating CustomResourceDefinition="machinepools.exp.cluster.x-k8s.io"
Creating CustomResourceDefinition="machines.cluster.x-k8s.io"
Creating CustomResourceDefinition="machinesets.cluster.x-k8s.io"
Creating MutatingWebhookConfiguration="capi-mutating-webhook-configuration"
Creating Service="capi-webhook-service" Namespace="capi-webhook-system"
Creating Deployment="capi-controller-manager" Namespace="capi-webhook-system"
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Operation failed, retry Error={}
Creating Certificate="capi-serving-cert" Namespace="capi-webhook-system"
Error: action failed after 10 attempts: failed to create provider object cert-manager.io/v1alpha2, Kind=Certificate, capi-webhook-system/capi-serving-cert: Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request
@CecileRobertMichon
webhook "webhook.cert-manager.io": the server is currently unable to handle the request
I saw this error in two cases:
Is it possible we are in one of those conditions?
Should we add a check to make sure a CNI has been installed? Maybe we could check whether the nodes are in a Ready state.
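A pre-flight check along those lines could be sketched as below. This is hypothetical (clusterctl does not currently ship such a check); it simply waits for every node to report Ready, which in practice implies a CNI is installed and working:

```shell
# Hypothetical pre-flight check before `clusterctl init` on a target cluster:
# block until all nodes report the Ready condition, or fail after 5 minutes.
kubectl --kubeconfig=./capi-quickstart.kubeconfig \
  wait --for=condition=Ready nodes --all --timeout=300s
```

`kubectl wait` exits non-zero on timeout, so this could gate the rest of an init script.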
I installed the CNI (Calico) before attempting the move and all nodes were ready (see output of get nodes in the issue description).
I doubt that the machines being under high pressure would be an issue, I'm using Azure VMs size Standard_D2s_v3 for both control planes and machines (with 3 replicas for both).
Here's the output of get pods that shows Calico running:
kubectl --kubeconfig=./capi-quickstart.kubeconfig get pods --all-namespaces
NAMESPACE             NAME                                                          READY   STATUS              RESTARTS   AGE
capi-webhook-system   capi-controller-manager-68b5666f57-lsjdg                      0/2     ContainerCreating   0          3m41s
cert-manager          cert-manager-69b4f77ffc-b62st                                 1/1     Running             0          5m53s
cert-manager          cert-manager-cainjector-576978ffc8-hjm5m                      1/1     Running             0          5m53s
cert-manager          cert-manager-webhook-c67fbc858-nrlq5                          1/1     Running             0          5m53s
kube-system           calico-kube-controllers-77c4b7448-77sww                       1/1     Running             0          8m24s
kube-system           calico-node-2rsgg                                             1/1     Running             0          8m24s
kube-system           calico-node-6l6bj                                             1/1     Running             0          8m24s
kube-system           calico-node-llf65                                             1/1     Running             0          8m24s
kube-system           calico-node-mxc8q                                             1/1     Running             0          8m24s
kube-system           calico-node-w8vsq                                             1/1     Running             0          8m24s
kube-system           calico-node-wvtl8                                             1/1     Running             0          8m24s
kube-system           coredns-6955765f44-5w9ck                                      1/1     Running             0          12m
kube-system           coredns-6955765f44-bhmxw                                      1/1     Running             0          12m
kube-system           etcd-capi-quickstart-control-plane-7kjgb                      1/1     Running             0          10m
kube-system           etcd-capi-quickstart-control-plane-d5gtw                      1/1     Running             0          12m
kube-system           etcd-capi-quickstart-control-plane-jthdv                      1/1     Running             0          11m
kube-system           kube-apiserver-capi-quickstart-control-plane-7kjgb            1/1     Running             0          10m
kube-system           kube-apiserver-capi-quickstart-control-plane-d5gtw            1/1     Running             0          12m
kube-system           kube-apiserver-capi-quickstart-control-plane-jthdv            1/1     Running             0          11m
kube-system           kube-controller-manager-capi-quickstart-control-plane-7kjgb   1/1     Running             0          10m
kube-system           kube-controller-manager-capi-quickstart-control-plane-d5gtw   1/1     Running             1          12m
kube-system           kube-controller-manager-capi-quickstart-control-plane-jthdv   1/1     Running             1          11m
kube-system           kube-proxy-4qn8d                                              1/1     Running             0          11m
kube-system           kube-proxy-88vcv                                              1/1     Running             0          11m
kube-system           kube-proxy-cfsg7                                              1/1     Running             0          12m
kube-system           kube-proxy-gkhnl                                              1/1     Running             0          10m
kube-system           kube-proxy-nfnjn                                              1/1     Running             0          10m
kube-system           kube-proxy-x69sf                                              1/1     Running             0          10m
kube-system           kube-scheduler-capi-quickstart-control-plane-7kjgb            1/1     Running             1          10m
kube-system           kube-scheduler-capi-quickstart-control-plane-d5gtw            1/1     Running             1          12m
kube-system           kube-scheduler-capi-quickstart-control-plane-jthdv            1/1     Running             1          11m
So I tried to repro with capz v0.4.0 and capi v0.3.3, and I'm not getting the same error anymore; this time I get past it. However, init now consistently gets stuck at "Waiting for cert-manager to be available...", even though cert-manager seems to be ready:
kubectl --kubeconfig=./capi-quickstart.kubeconfig get nodes
NAME                         STATUS   ROLES    AGE   VERSION
capi-2-control-plane-5nws7   Ready    master   16m   v1.17.3
capi-2-control-plane-g7k5k   Ready    master   15m   v1.17.3
capi-2-control-plane-ps8nw   Ready    master   18m   v1.17.3
capi-2-md-0-ngw2x            Ready    <none>   17m   v1.17.3
capi-2-md-0-nwzh4            Ready    <none>   16m   v1.17.3
capi-2-md-0-qsb42            Ready    <none>   17m   v1.17.3
clusterctl --kubeconfig=./capi-quickstart.kubeconfig init --v 5
Installing the clusterctl inventory CRD
Creating CustomResourceDefinition="providers.clusterctl.cluster.x-k8s.io"
Fetching providers
Fetching File="core-components.yaml" Provider="cluster-api" Version="v0.3.3"
Fetching File="bootstrap-components.yaml" Provider="bootstrap-kubeadm" Version="v0.3.3"
Fetching File="control-plane-components.yaml" Provider="control-plane-kubeadm" Version="v0.3.3"
Fetching File="metadata.yaml" Provider="cluster-api" Version="v0.3.3"
Fetching File="metadata.yaml" Provider="bootstrap-kubeadm" Version="v0.3.3"
Fetching File="metadata.yaml" Provider="control-plane-kubeadm" Version="v0.3.3"
Installing cert-manager
Creating Namespace="cert-manager"
Creating CustomResourceDefinition="challenges.acme.cert-manager.io"
Creating CustomResourceDefinition="orders.acme.cert-manager.io"
Creating CustomResourceDefinition="certificaterequests.cert-manager.io"
Creating CustomResourceDefinition="certificates.cert-manager.io"
Creating CustomResourceDefinition="clusterissuers.cert-manager.io"
Creating CustomResourceDefinition="issuers.cert-manager.io"
Creating ServiceAccount="cert-manager-cainjector" Namespace="cert-manager"
Creating ServiceAccount="cert-manager" Namespace="cert-manager"
Creating ServiceAccount="cert-manager-webhook" Namespace="cert-manager"
Creating ClusterRole="cert-manager-cainjector"
Creating ClusterRoleBinding="cert-manager-cainjector"
Creating Role="cert-manager-cainjector:leaderelection" Namespace="kube-system"
Creating RoleBinding="cert-manager-cainjector:leaderelection" Namespace="kube-system"
Creating ClusterRoleBinding="cert-manager-webhook:auth-delegator"
Creating RoleBinding="cert-manager-webhook:webhook-authentication-reader" Namespace="kube-system"
Creating ClusterRole="cert-manager-webhook:webhook-requester"
Creating Role="cert-manager:leaderelection" Namespace="kube-system"
Creating RoleBinding="cert-manager:leaderelection" Namespace="kube-system"
Creating ClusterRole="cert-manager-controller-issuers"
Creating ClusterRole="cert-manager-controller-clusterissuers"
Creating ClusterRole="cert-manager-controller-certificates"
Creating ClusterRole="cert-manager-controller-orders"
Creating ClusterRole="cert-manager-controller-challenges"
Creating ClusterRole="cert-manager-controller-ingress-shim"
Creating ClusterRoleBinding="cert-manager-leaderelection"
Creating ClusterRoleBinding="cert-manager-controller-issuers"
Creating ClusterRoleBinding="cert-manager-controller-clusterissuers"
Creating ClusterRoleBinding="cert-manager-controller-certificates"
Creating ClusterRoleBinding="cert-manager-controller-orders"
Creating ClusterRoleBinding="cert-manager-controller-challenges"
Creating ClusterRoleBinding="cert-manager-controller-ingress-shim"
Creating ClusterRole="cert-manager-view"
Creating ClusterRole="cert-manager-edit"
Creating Service="cert-manager" Namespace="cert-manager"
Creating Service="cert-manager-webhook" Namespace="cert-manager"
Creating Deployment="cert-manager-cainjector" Namespace="cert-manager"
Creating Deployment="cert-manager" Namespace="cert-manager"
Creating Deployment="cert-manager-webhook" Namespace="cert-manager"
Creating APIService="v1beta1.webhook.cert-manager.io"
Creating MutatingWebhookConfiguration="cert-manager-webhook"
Creating ValidatingWebhookConfiguration="cert-manager-webhook"
Waiting for cert-manager to be available...
kubectl --kubeconfig=./capi-quickstart.kubeconfig get pods --all-namespaces
NAMESPACE      NAME                                                 READY   STATUS    RESTARTS   AGE
cert-manager   cert-manager-69b4f77ffc-psxzj                        1/1     Running   0          3m27s
cert-manager   cert-manager-cainjector-576978ffc8-w2nr5             1/1     Running   0          3m27s
cert-manager   cert-manager-webhook-c67fbc858-c4v5b                 1/1     Running   1          3m26s
kube-system    calico-kube-controllers-576dfc659c-gftt4             1/1     Running   1          5m20s
kube-system    calico-node-4zbv6                                    1/1     Running   1          5m22s
kube-system    calico-node-bhpnf                                    1/1     Running   0          5m22s
kube-system    calico-node-c49b5                                    1/1     Running   1          5m22s
kube-system    calico-node-d8wwd                                    1/1     Running   0          5m22s
kube-system    calico-node-knvmh                                    1/1     Running   0          5m22s
kube-system    calico-node-mcz8b                                    1/1     Running   1          5m22s
kube-system    coredns-6955765f44-w2zhr                             1/1     Running   0          17m
kube-system    coredns-6955765f44-xgznx                             1/1     Running   0          17m
kube-system    etcd-capi-2-control-plane-5nws7                      1/1     Running   0          16m
kube-system    etcd-capi-2-control-plane-g7k5k                      1/1     Running   0          14m
kube-system    etcd-capi-2-control-plane-ps8nw                      1/1     Running   0          17m
kube-system    kube-apiserver-capi-2-control-plane-5nws7            1/1     Running   0          16m
kube-system    kube-apiserver-capi-2-control-plane-g7k5k            1/1     Running   0          14m
kube-system    kube-apiserver-capi-2-control-plane-ps8nw            1/1     Running   0          17m
kube-system    kube-controller-manager-capi-2-control-plane-5nws7   1/1     Running   1          16m
kube-system    kube-controller-manager-capi-2-control-plane-g7k5k   1/1     Running   0          14m
kube-system    kube-controller-manager-capi-2-control-plane-ps8nw   1/1     Running   1          17m
kube-system    kube-proxy-4ptdx                                     1/1     Running   0          16m
kube-system    kube-proxy-ctbh8                                     1/1     Running   0          17m
kube-system    kube-proxy-jd7f5                                     1/1     Running   0          16m
kube-system    kube-proxy-jv8mg                                     1/1     Running   0          16m
kube-system    kube-proxy-p72tw                                     1/1     Running   0          14m
kube-system    kube-proxy-v45gf                                     1/1     Running   0          16m
kube-system    kube-scheduler-capi-2-control-plane-5nws7            1/1     Running   1          16m
kube-system    kube-scheduler-capi-2-control-plane-g7k5k            1/1     Running   1          14m
kube-system    kube-scheduler-capi-2-control-plane-ps8nw            1/1     Running   1          17m
kubectl --kubeconfig=./capi-quickstart.kubeconfig get deploy -n cert-manager
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager              1/1     1            1           7m24s
cert-manager-cainjector   1/1     1            1           7m24s
cert-manager-webhook      1/1     1            1           7m23s
After 10 minutes, init fails with Error: timed out waiting for the condition.
EDIT: I see this when I describe the apiservice:
Message: failing or missing response from https://10.100.0.172:443/apis/webhook.cert-manager.io/v1beta1: Get https://10.100.0.172:443/apis/webhook.cert-manager.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
@CecileRobertMichon "Waiting for cert-manager to be available..." waits for the APIService v1beta1.webhook.cert-manager.io to report a status condition of type Available with status True.
this can be checked with
kubectl get apiservice v1beta1.webhook.cert-manager.io -o json | jq '.status.conditions[] | select(.type == "Available") | .status'
or with
kubectl wait --for=condition=Available apiservice/v1beta1.webhook.cert-manager.io
Looking at the apiservice/v1beta1.webhook.cert-manager.io spec, this API service depends on the following service:
service:
name: cert-manager-webhook
namespace: cert-manager
port: 443
From my observations, the APIService condition is set as soon as this service is backed by one pod (the cert-manager-webhook pod).
Could you kindly check the v1beta1.webhook.cert-manager.io API service and the cert-manager-webhook service on your cluster?
Also, is it possible that this sequence does not complete within 10 minutes (the current timeout) on your cluster?
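Concretely, both checks can be done with standard kubectl commands (the resource names come from the spec above); the first surfaces the reason the APIService is not Available, the second confirms the webhook Service actually has a ready pod behind it:

```shell
# Why is the APIService not Available? Print the Available condition's message.
kubectl get apiservice v1beta1.webhook.cert-manager.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'

# Does the webhook Service have at least one ready endpoint (pod) behind it?
kubectl get endpoints cert-manager-webhook -n cert-manager
```

An empty ENDPOINTS column in the second command would mean the webhook pod is not passing its readiness checks.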
@fabriziopandini see the very last line in my previous comment. I did check the API service and it was indeed not available. I suspect it has something to do with the default capz security group not allowing port 443 traffic. I'll give changing the NSG a try today.
Changing the NSG rules did not help, unfortunately; it's still failing with the same error. I took a look with @vincepri this morning and we couldn't figure out what was going on. Cluster networking seems generally healthy, and I didn't have any issues creating a simple nginx service.
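One way to narrow a failure like this down is to probe the webhook service directly from a throwaway pod inside the cluster, bypassing the API server's aggregation layer. This is a debugging sketch, not a step from the thread; the URL path mirrors the one in the APIService error message above:

```shell
# Probe the cert-manager webhook Service from inside the cluster.
# -k skips TLS verification (the webhook uses a self-signed cert);
# -m 5 bounds the wait so a network black hole fails fast instead of hanging.
kubectl run curl-probe --rm -it --restart=Never --image=curlimages/curl -- \
  curl -k -m 5 https://cert-manager-webhook.cert-manager.svc:443/apis/webhook.cert-manager.io/v1beta1
```

If this probe succeeds while the APIService stays unavailable, the problem is specifically on the API server to pod network path rather than in the webhook itself.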
/area clusterctl
@CecileRobertMichon ok to close this issue now that we've identified that the problem is with the Calico configuration?
Yes
/close
@CecileRobertMichon: Closing this issue.
In response to this:
Yes
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.