Cluster-api: clusterctl delete everything returns an error intermittently

Created on 28 Feb 2020 · 15 Comments · Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:

  1. Install components via clusterctl init --infrastructure=aws:v0.5.0
  2. Try to delete all of the providers, namespaces, and CRDs, and repeat the process a few times.

The command reported deleting all providers, but returned an error and left resources behind:

  $ clusterctl delete --all --include-namespace --include-crd
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: controlplane.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource

Some providers were deleted, but the command returned an error and left resources behind:

 $ clusterctl delete --all --include-crd --include-namespace
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: bootstrap.cluster.x-k8s.io/v1alpha2: the server could not find the requested resource, bootstrap.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource

Everything deleted successfully!

$ clusterctl delete --all --include-crd --include-namespace
Deleting Provider="infrastructure-aws" Version="v0.5.0" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"

What did you expect to happen:
Everything to delete successfully

Anything else you would like to add:
Running the same command a second time cleans everything up.
~Also capi-webhook-system namespace is left around.~
UPDATE: As per the test, capi-webhook-system is intentionally left around.
https://github.com/kubernetes-sigs/cluster-api/blob/2d2c9c86d49edfaeaec70001d66d3feb1211e4e9/cmd/clusterctl/pkg/client/cluster/components_test.go#L236

Environment:

  • Cluster-api version: a39618d45eda45400759223a8a73c99e591e2101
  • Minikube/KIND version: kind v0.7.0 go1.13.6 darwin/amd64

/kind bug
area/clusterctl help wanted kind/bug lifecycle/active

All 15 comments

/area clusterctl

@fabriziopandini Can you take a look at this issue to see if I'm missing something? I just noticed this behavior recently and wanted to better understand the expected behavior. It's not a serious one, so no rush at all 🙂

This seems to be a race that happens when deleting multiple providers in a row:

...
Deleting Provider="control-plane-kubeadm" Version="v0.3.0-rc.2" TargetNamespace="capi-kubeadm-control-plane-system"

This deletes the controlplane.cluster.x-k8s.io/v1alpha3 CRD, but when the next delete operation is executed the type still seems to be around (or still in the client discovery cache), which leads to the error:

Deleting Provider="cluster-api" Version="v0.3.0-rc.2" TargetNamespace="capi-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: controlplane.cluster.x-k8s.io/v1alpha3: the server could not find the requested resource

I'm wondering whether we need to explicitly wait for CRD deletion to complete before moving on to the next delete.
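A minimal sketch of that idea, assuming a controller-runtime client (the function name and timings are illustrative, not the actual clusterctl code): after issuing the CRD delete, poll until the API server reports NotFound before starting the next provider delete.

package main

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForCRDDeletion blocks until the named CRD returns NotFound or the
// timeout expires, so the next delete starts from a consistent view.
func waitForCRDDeletion(ctx context.Context, c client.Client, name string) error {
	return wait.PollImmediate(250*time.Millisecond, 30*time.Second, func() (bool, error) {
		crd := &apiextensionsv1.CustomResourceDefinition{}
		err := c.Get(ctx, client.ObjectKey{Name: name}, crd)
		if apierrors.IsNotFound(err) {
			return true, nil // fully removed; safe to proceed
		}
		if err != nil {
			return false, err // unexpected error; stop polling
		}
		return false, nil // still present (likely terminating); keep waiting
	})
}

Note that even after the CRD object is gone there can be a short lag before the aggregated discovery endpoint catches up, so this narrows the window rather than provably closing it.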

/milestone v0.3.x

/assign

I can take a look into this one since I'm dabbling in the clusterctl code anyways 🙂

/assign @fabriziopandini @wfernandes

Can we re-triage and evaluate if we should keep this open or close it?

/milestone v0.3.6

@vincepri I'll re-triage this today.

This is still reproducible.

# These are the providers installed
$ kubectl get providers -A
NAMESPACE                           NAME                     TYPE   PROVIDER                 VERSION   WATCH NAMESPACE
capa-system                         infrastructure-aws              InfrastructureProvider   v0.5.3
capi-kubeadm-bootstrap-system       bootstrap-kubeadm               BootstrapProvider        v0.3.5
capi-kubeadm-control-plane-system   control-plane-kubeadm           ControlPlaneProvider     v0.3.5
capi-system                         cluster-api                     CoreProvider             v0.3.5
capv-system                         infrastructure-vsphere          InfrastructureProvider   v0.6.4

# Occasionally fails to delete some providers.
$ clusterctl delete --all --include-namespace --include-crd
Deleting Provider="infrastructure-aws" Version="v0.5.3" TargetNamespace="capa-system"
Deleting Provider="bootstrap-kubeadm" Version="v0.3.5" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="v0.3.5" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="cluster-api" Version="v0.3.5" TargetNamespace="capi-system"
Deleting Provider="infrastructure-vsphere" Version="v0.6.4" TargetNamespace="capv-system"
Error: failed to list api resources: unable to retrieve the complete list of server APIs: cluster.x-k8s.io/v1alpha2: the server could not find the requested resource

# CAPV Provider, its CRDs and controllers are still around.
$ kubectl get providers -A
NAMESPACE     NAME                     TYPE   PROVIDER                 VERSION   WATCH NAMESPACE
capv-system   infrastructure-vsphere          InfrastructureProvider   v0.6.4

$ kubectl get pods -A
NAMESPACE             NAME                                         READY   STATUS    RESTARTS   AGE
capi-webhook-system   capv-controller-manager-545dc54966-w2jv8     2/2     Running   0          48s
capv-system           capv-controller-manager-8df9785b7-lg6zv      2/2     Running   0          47s
...

$ kubectl get crds
NAME                                                      CREATED AT
...
providers.clusterctl.cluster.x-k8s.io                     2020-05-05T14:50:53Z
vsphereclusters.infrastructure.cluster.x-k8s.io           2020-05-05T17:22:00Z
vspheremachines.infrastructure.cluster.x-k8s.io           2020-05-05T17:22:00Z
vspheremachinetemplates.infrastructure.cluster.x-k8s.io   2020-05-05T17:22:01Z
vspherevms.infrastructure.cluster.x-k8s.io                2020-05-05T17:22:01Z

/help

@wfernandes:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/milestone v0.3.x

/milestone v0.3.9
/assign @ncdc
to triage and investigate API discovery

This is happening because of a timing issue. We are actively deleting providers, which includes deleting their CRDs. Deleting a CRD removes it from API discovery. It can take some time between when a CRD is deleted and when it is removed from /apis.

In the example above, we delete KCP, and then we try to remove another provider (cluster-api). As part of deleting, we use the discovery API client to get the server's list of preferred resources. That code first gets a list of all the API groups, and then iterates through them, making a separate discovery API call for each GroupVersion. It's possible for a CRD's group to be present during step one (listing the groups) and gone by the time the second call happens.
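For illustration, a hedged Go sketch of that two-step discovery flow using a client-go discovery client (not the clusterctl source; the gap between the two calls is where the race lives):

package main

import (
	"k8s.io/client-go/discovery"
)

// preferredResourcesRace mirrors the two-step flow described above: a CRD's
// group can be present in step 1 and gone by its per-GroupVersion call.
func preferredResourcesRace(dc discovery.DiscoveryInterface) error {
	groups, err := dc.ServerGroups() // step 1: list all API groups
	if err != nil {
		return err
	}
	for _, group := range groups.Groups {
		gv := group.PreferredVersion.GroupVersion
		// Step 2: a separate call per GroupVersion. If the group vanished
		// after step 1, this fails with "the server could not find the
		// requested resource".
		if _, err := dc.ServerResourcesForGroupVersion(gv); err != nil {
			return err
		}
	}
	return nil
}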

The fix here is probably either:

  1. Tolerate discovery.ErrGroupDiscoveryFailed errors (sketched below)
  2. Retry https://github.com/kubernetes-sigs/cluster-api/blob/e955160f6f30f61eabce65231899a1fe9d513046/cmd/clusterctl/client/cluster/proxy.go#L164 a few times before giving up (also sketched below)
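A sketch of option 1, assuming a client-go discovery client (illustrative, not a proposed patch): client-go wraps partial failures in *discovery.ErrGroupDiscoveryFailed and still returns the groups that did resolve, so the error can be tolerated.

package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
)

// listPreferredResources tolerates partial discovery failures: client-go
// returns a partial result together with *ErrGroupDiscoveryFailed when some
// GroupVersions (e.g. CRDs mid-deletion) cannot be discovered.
func listPreferredResources(dc discovery.DiscoveryInterface) ([]*metav1.APIResourceList, error) {
	resources, err := dc.ServerPreferredResources()
	if err != nil && !discovery.IsGroupDiscoveryFailedError(err) {
		return nil, err // a real failure, not just a vanishing group
	}
	// On ErrGroupDiscoveryFailed, resources still holds the groups that resolved.
	return resources, nil
}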

I'm +1 to retry
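A sketch of the retry flavor, again with illustrative names and under the assumption that the inconsistency clears within a few seconds:

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

// listWithRetry retries full discovery a few times before giving up, so a
// momentarily inconsistent /apis view does not fail the whole delete.
func listWithRetry(dc discovery.DiscoveryInterface) ([]*metav1.APIResourceList, error) {
	var resources []*metav1.APIResourceList
	err := wait.PollImmediate(time.Second, 10*time.Second, func() (bool, error) {
		var derr error
		resources, derr = dc.ServerPreferredResources()
		if derr != nil {
			return false, nil // discovery still settling; try again
		}
		return true, nil
	})
	return resources, err
}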

/assign

/lifecycle active
