What steps did you take and what happened:
kubectl delete machine/foo
I0814 17:10:39.243788 1 instances.go:67] [machine-actuator]/cluster.k8s.io/v1alpha1/4d95faba9cb7ee388671ac3cef6ee79b39c25f15/bf038fa5/worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh "level"=2 "msg"="Looking for existing machine instance by tags"
I0814 17:10:39.288157 1 machine_controller.go:181] Deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh"
E0814 17:10:39.301721 1 machine_controller.go:183] Error deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh": Delete https://bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com:6443/api/v1/nodes/ip-10-0-0-20.us-west-2.compute.internal: dial tcp: lookup bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com on 10.96.0.10:53: no such host
What did you expect to happen:
This error shouldn't block Machine deletion
Anything else you would like to add:
I think it would be reasonable to attempt to delete the Node multiple times over the span of 30-60 seconds. If the deletion fails, we can record an Event, then allow the Machine deletion to continue.
Environment:
/kind bug
xref https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1084#issuecomment-530840269 and my next comment as well
/assign
@ncdc I'm thinking of wrapping this function inside wait.PollImmediate(), what do you think?
That seems like an AWS-specific function; for this we should probably just retry a specific number of times. We should be able to use https://github.com/kubernetes/apimachinery/blob/master/pkg/util/wait/wait.go#L333 to achieve this, wdyt?
yes, i'm actually talking about this!
any suggestion for interval and timeout duration?
Sounds good, I saw the linked aws code and was confused :D
I'd retry maybe every 2 seconds and for max 10? @ncdc
Do you think it makes sense to try for up to either 30 or 60 seconds? Or is it more likely that if it fails once, it will probably fail every time, in which case trying for that long is an unnecessary delay?
Or should we just try once and not even bother with any more attempts?
A single failure might be temporary, so I'd limit the retries to 10-15 seconds. If it fails for that long, there is a good chance that we won't be able to reach it at all.
Ok, I'm good with interval=2s, timeout=10s
Should we do the same for bastion?
The bastion is CAPA specific and is unrelated to this issue (the bastion doesn't have a corresponding Kubernetes Node).
oops, I mistook it for a CAPA issue, that's why I was linking the CAPA function :(