Autoscaler: Failed to drain node - pods remaining after timeout

Created on 3 May 2017 · 15 Comments · Source: kubernetes/autoscaler

Is it possible to increase the node drain timeout?

I'm seeing this in the logs:
Failed to scale down: Failed to delete ip-10-100-6-220.ec2.internal: Failed to drain node /ip-10-100-6-220.ec2.internal: pods remaining after timeout

yawboateng

Most helpful comment

The problem with accepting a higher graceful termination period is that we stop CA operations until the node is deleted. If the graceful termination period were 10 min, CA would stop for 10 min, and during this time no scale-up operations would be executed. This is probably not what users would want.

So if we want to have graceful termination significantly bigger than 1 min then we need to go asynchronous with the deletes, which will make the whole thing even more complicated. So I guess we won't do anything around it for 1.7, maybe for 1.8 but we have other, probably more important pain points to fix.

Speaking about the issue - it seems that your app is probably ignoring SIGTERMs. And after investigating the code around this, there seems to be a subtle timing/race-condition bug in the pod checking loop. The fix is on the way.
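
For readers unsure what handling SIGTERM looks like inside an application, here is a minimal Python sketch. It is not taken from the thread or from any particular app; `run_worker`, `do_work`, and `install_handler` are hypothetical names. It shows a worker loop that stops promptly once SIGTERM arrives, rather than running until it is SIGKILLed at the end of the grace period.

```python
import signal
import threading

# Set once SIGTERM has been received; checked by the worker loop.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that shutdown was requested; the main loop does the cleanup."""
    stop_requested.set()


def install_handler():
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(do_work, poll_sec=0.1):
    """Call do_work() repeatedly until shutdown has been requested.

    Returns the number of completed iterations.
    """
    iterations = 0
    while not stop_requested.is_set():
        do_work()
        iterations += 1
        # wait() returns early if the event is set during the pause.
        stop_requested.wait(poll_sec)
    return iterations
```

A process would call `install_handler()` once at startup and then `run_worker(...)`; after the loop returns it can flush state and exit well within the grace period.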

mwielgus on 6 May 2017

👍4

All 15 comments

When draining a node, cluster-autoscaler sets the graceful termination period on all evictions to 1 minute and then waits that same 1 minute before giving up and logging the message you observed. You can change the 1 minute to a different value with the --max-grateful-termination-sec flag (this is copy-pasted from the code; the typo is really there :( ).

However, the fact that CA failed to drain the node may not be a big problem. It should retry on the next loop (in ~10s). On the other hand, draining blocks the main loop, so by increasing the timeout you will stop the autoscaler from doing anything else for the duration of that timeout.
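
The timing described in this comment can be illustrated with a small simulation. This is a minimal sketch, not cluster-autoscaler's actual code: the function name `simulate_drain`, the 5-second polling interval, and the 1-second kill latency are all assumptions made for illustration.

```python
import math


def simulate_drain(pod_shutdown_secs, grace_sec=60, wait_sec=60,
                   poll_sec=5, kill_latency_sec=1):
    """Simulate one drain attempt; return (drained, seconds_blocked).

    pod_shutdown_secs: seconds each pod needs to exit after SIGTERM
    (math.inf for a pod that never exits on its own).
    All numbers are illustrative assumptions.
    """
    # A pod exits on its own, or is force-killed shortly after the grace period.
    exit_times = [
        s if s <= grace_sec else grace_sec + kill_latency_sec
        for s in pod_shutdown_secs
    ]
    elapsed = 0
    while True:
        if all(t <= elapsed for t in exit_times):
            return True, elapsed
        if elapsed >= wait_sec:
            # Corresponds to "pods remaining after timeout".
            return False, elapsed
        elapsed += poll_sec


print(simulate_drain([3, 12]))                     # (True, 15)
print(simulate_drain([3, math.inf]))               # (False, 60)
print(simulate_drain([3, math.inf], wait_sec=65))  # (True, 65)
```

In this toy model, well-behaved pods let the drain finish quickly, while a single pod that only dies when force-killed makes the attempt time out if the wait equals the grace period exactly. In every case the second value is how long the main loop would have been blocked.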

MaciekPytel on 4 May 2017

Yeah, it's retrying in a loop, and scale-down is not working. I noticed my pods were being deleted in a constant loop but CA doesn't delete the node, so I set --max-grateful-termination-sec to 5800, and it's still reporting the same error above.

A few things:
1. As soon as CA starts draining, it should mark the node as unschedulable (but I don't see it doing that).
2. Pods go into a terminating state.
3. Kubernetes instantly creates new pods elsewhere, so if the node were marked as unschedulable, then after a few retries to drain I'd expect CA to force delete the node instead of throwing the timeout error.

But in my case, I think the cause of the loop is that the node is left in a Ready state, so some recreated pods end up on that node.

yawboateng on 4 May 2017

What CA and kubernetes version are you using? 0.5.x CA puts a taint on the node to make it unschedulable during drain.

MaciekPytel on 4 May 2017

CA gcr.io/google_containers/cluster-autoscaler:v0.5.2

Kubernetes:

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.3+coreos.0", GitCommit:"8fc95b64d0fe1608d0f6c788eaad2c004f31e7b7", GitTreeState:"clean", BuildDate:"2017-02-15T19:52:15Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

yawboateng on 4 May 2017

This is a known issue with running CA 0.5 on k8s 1.5 (https://github.com/kubernetes/contrib/issues/2491#issuecomment-288896371). Generally speaking k8s 1.5 / CA 0.5 is not fully supported and it's recommended to use 0.4 CA with 1.5 k8s.

MaciekPytel on 4 May 2017

OK, thanks for letting me know. I will downgrade to 0.4. Are there any plans for a fix?

yawboateng on 4 May 2017

Also, after downgrading, I'm seeing multiple log lines of:

I0504 17:37:43.542093       1 scale_down.go:163] No candidates for scale down
I0504 17:37:53.713349       1 scale_down.go:163] No candidates for scale down
I0504 17:38:03.998987       1 scale_down.go:163] No candidates for scale down
I0504 17:38:15.278546       1 scale_down.go:163] No candidates for scale down
I0504 17:38:25.490368       1 scale_down.go:163] No candidates for scale down

Is that expected? Or does 0.4.0 require some other parameters?

yawboateng on 4 May 2017

The "No candidates for scale down" line means that CA didn't find any nodes to remove in its main loop. It is expected to show up in the logs roughly every 10s.
BTW the same is true in 0.5 CA.

MaciekPytel on 5 May 2017

@MaciekPytel @yawboateng I had a similar problem where node draining would time out. I found that it was because CA sets the termination grace period for every pod to whatever value --max-grateful-termination-sec is. If you have pods running on that node that accept SIGTERM and then rely on the kubelet to send a SIGKILL after terminationGracePeriod, then draining nodes will always fail. I ended up fixing it by adding this patch: https://github.com/kubernetes/contrib/pull/2494 (it also fixes the typo in the --max-grateful-termination-sec flag). I can re-open it against this PR if you think the fix is valid.

andrewsykim on 5 May 2017

I found for the most part that CA 0.5 does work with k8s 1.5, the only caveat being the NoSchedule taint doesn't work, but this generally isn't a problem as long as a new pod isn't scheduled while the node is being drained (and if it were it would retry eventually). It would be nice if we can use annotations alongside taints to support both k8s 1.5 and 1.6 going forward.

andrewsykim on 5 May 2017

@andrewsykim Regarding kubernetes/contrib#2494 - what if a pod has a grace period of >1m defined in its spec? Currently we only respect it up to 1m, but with this change CA may need to wait forever (leading to an issue similar to this one). Perhaps a better solution would be to just wait on drain for the grace period + some epsilon?

cc: @mwielgus
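
The "grace period + some epsilon" idea can be sketched as follows. This only illustrates the proposal under discussion and is not CA's implementation: `drain_wait_seconds` and `epsilon_sec` are hypothetical names, and capping each pod's grace period at the flag value is an assumption based on the "we only respect it up to 1m" behaviour mentioned above.

```python
def drain_wait_seconds(pod_grace_periods, max_graceful_termination_sec=60,
                       epsilon_sec=5):
    """Return how long a drain would wait under the 'grace + epsilon' proposal.

    pod_grace_periods: terminationGracePeriodSeconds of each pod on the node.
    Each value is capped at max_graceful_termination_sec (assumed behaviour),
    and the drain waits for the longest capped value plus a small margin.
    """
    capped = [min(g, max_graceful_termination_sec) for g in pod_grace_periods]
    return max(capped, default=0) + epsilon_sec


print(drain_wait_seconds([30, 600]))  # 65: the 600s pod is capped at 60s
print(drain_wait_seconds([10, 20]))   # 25: no need to wait the full minute
```

The margin gives a force-killed pod time to actually disappear before the drain is declared failed, while the cap keeps the wait bounded even when a pod declares a very long grace period.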

MaciekPytel on 5 May 2017

I think forcing the terminationGracePeriod on every pod on a node is bad practice to begin with, even if the node is to be terminated. There are many cases where users set their grace periods longer than 1 min, and no one expects CA to override that. Grace period + epsilon is better than the current implementation, but I think the better approach is to set --max-grat(c)eful-termination-sec higher by default (5 mins?), so that most clusters won't time out, and let cluster admins decide for themselves what a reasonable grace period is based on their workload.

andrewsykim on 5 May 2017

Should be fixed in 0.5.4.

mwielgus on 6 Jun 2017

@bowei @nicksardo FYI