Cluster-api: Machine deletion: try up to n times to delete the Node, then move on

Created on 26 Sep 2019 · 13 Comments · Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:

1. Do something catastrophic, such as manually deleting a CAPA cluster's ELB
2. kubectl delete machine/foo
3. Error in the logs (note, this is from v1alpha1, but the issue is still in v1alpha2):

I0814 17:10:39.243788       1 instances.go:67] [machine-actuator]/cluster.k8s.io/v1alpha1/4d95faba9cb7ee388671ac3cef6ee79b39c25f15/bf038fa5/worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh "level"=2 "msg"="Looking for existing machine instance by tags"  
I0814 17:10:39.288157       1 machine_controller.go:181] Deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh"
E0814 17:10:39.301721       1 machine_controller.go:183] Error deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh": Delete https://bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com:6443/api/v1/nodes/ip-10-0-0-20.us-west-2.compute.internal: dial tcp: lookup bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com on 10.96.0.10:53: no such host

The Machine is never deleted

What did you expect to happen:
This error shouldn't block Machine deletion

Anything else you would like to add:
I think it would be reasonable to attempt to delete the Node multiple times over the span of 30-60 seconds. If the deletion fails, we can record an Event, then allow the Machine deletion to continue.
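The proposal above amounts to a best-effort Node deletion that never blocks the Machine. A minimal Go sketch of that control flow follows, using only the standard library. The names `deleteMachine` and `errUnreachable`, and the event strings `FailedDeleteNode` and `MachineDeleted`, are hypothetical and chosen for illustration; this is not the Cluster API controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnreachable is a hypothetical error standing in for the DNS failure
// seen in the logs above.
var errUnreachable = errors.New("dial tcp: no such host")

// deleteMachine is a simplified, hypothetical deletion flow. It tries to
// delete the Node up to attempts times, pausing between tries, and then
// continues with Machine deletion whether or not the Node was deleted.
// It returns the events it would have recorded, in order.
func deleteMachine(deleteNode func() error, attempts int, pause time.Duration) (events []string) {
	var err error
	for i := 0; i < attempts; i++ {
		if err = deleteNode(); err == nil {
			break
		}
		// Only sleep if another attempt is coming.
		if i < attempts-1 {
			time.Sleep(pause)
		}
	}
	if err != nil {
		// Surface the failure, but do not block Machine deletion on it.
		events = append(events, fmt.Sprintf("FailedDeleteNode: %v", err))
	}
	events = append(events, "MachineDeleted")
	return events
}

func main() {
	calls := 0
	unreachable := func() error {
		calls++
		return errUnreachable
	}
	for _, e := range deleteMachine(unreachable, 3, 10*time.Millisecond) {
		fmt.Println(e)
	}
	fmt.Println("attempts:", calls)
}
```

The key design choice in this sketch is that the Node deletion error is downgraded to a recorded event, so an unreachable API server can no longer keep the Machine around forever.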

Environment:

Cluster-api version: v0.1.x and v0.2.x

/kind bug

xref https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1084#issuecomment-530840269 and my next comment as well

help wanted kind/bug priority/important-soon

ncdc

All 13 comments

/assign

tahsinrahman on 26 Sep 2019

@ncdc I'm thinking of wrapping this function inside wait.PollImmediate(), what do you think?

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/423721d074144de956f70ed95996101e3585758f/pkg/cloud/services/ec2/instances.go#L246-L262

tahsinrahman on 26 Sep 2019

That seems to be an AWS-specific function; for this we should probably just retry a specific number of times. We should be able to use https://github.com/kubernetes/apimachinery/blob/master/pkg/util/wait/wait.go#L333 to achieve this, wdyt?

vincepri on 26 Sep 2019

yes, i'm actually talking about this!

tahsinrahman on 26 Sep 2019

👍1

any suggestion for interval and timeout duration?

tahsinrahman on 26 Sep 2019

Sounds good, I saw the linked AWS code and was confused :D

I'd retry maybe every 2 seconds, for a max of 10 seconds? @ncdc

vincepri on 26 Sep 2019

Do you think it makes sense to try for up to either 30 or 60 seconds? Or is it more likely that if it fails once, it will probably fail every time, in which case trying for that long is an unnecessary delay?

ncdc on 26 Sep 2019

Or should we just try once and not even bother with any more attempts?

ncdc on 26 Sep 2019

A single failure might be temporary. I'd limit the retries to 10-15 seconds; if it fails for that long, there is a good chance that we won't be able to reach it at all.

vincepri on 26 Sep 2019

Ok, I'm good with interval=2s, timeout=10s

ncdc on 26 Sep 2019

👍2
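To make the agreed values concrete, here is a rough sketch of a poll helper with the same call shape (interval, timeout, condition) as the `wait.PollImmediate()` function mentioned earlier in the thread. It is a standard-library stand-in, not the apimachinery implementation; `pollImmediate`, `errTimedOut`, and the two constants are illustrative names, and the demo in `main` scales the durations down 100x so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Values agreed in the thread; the constant names are assumptions.
const (
	nodeDeleteInterval = 2 * time.Second
	nodeDeleteTimeout  = 10 * time.Second
)

// errTimedOut is returned when the condition never reports done in time.
var errTimedOut = errors.New("timed out waiting for the condition")

// pollImmediate calls condition right away and then once per interval until
// it reports done, returns an error, or timeout elapses.
func pollImmediate(interval, timeout time.Duration, condition func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := condition()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return errTimedOut
		case <-ticker.C:
			// Interval elapsed; loop around and try again.
		}
	}
}

func main() {
	calls := 0
	// Scaled-down stand-ins for nodeDeleteInterval and nodeDeleteTimeout.
	err := pollImmediate(nodeDeleteInterval/100, nodeDeleteTimeout/100, func() (bool, error) {
		calls++
		// Pretend the Node delete failed: report "not done" with a nil
		// error so that polling continues instead of aborting.
		return false, nil
	})
	fmt.Println("attempts:", calls, "result:", err)
}
```

With interval=2s and timeout=10s, this sketch attempts the deletion at roughly 0, 2, 4, 6, and 8 seconds (possibly once more right at the 10-second boundary) before giving up, so a permanently unreachable API server delays Machine deletion by about 10 seconds at most.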

Should we do the same for bastion?

tahsinrahman on 26 Sep 2019

The bastion is CAPA specific and is unrelated to this issue (the bastion doesn't have a corresponding Kubernetes Node).

ncdc on 26 Sep 2019

Oops, I mistook it for a CAPA issue; that's why I was linking the CAPA function :(

tahsinrahman on 26 Sep 2019
