Is it possible to increase the node drain timeout?
I'm seeing this in the logs:
Failed to scale down: Failed to delete ip-10-100-6-220.ec2.internal: Failed to drain node /ip-10-100-6-220.ec2.internal: pods remaining after timeout
When draining a node, cluster-autoscaler sets the termination grace period on all evictions to 1 minute and then waits that same 1 minute before giving up and logging the message you observed. You can change the 1 minute to a different value with the --max-grateful-termination-sec flag (this is copy-pasted from the code, the typo is really there :( ).
However, the fact that CA failed to drain the node may not be a big problem. It should retry on the next loop (in ~10s). On the other hand, draining blocks the main loop, so by increasing the timeout you will stop the autoscaler from doing anything else for the duration of that timeout.
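To make the behaviour described above concrete, here is a minimal Python sketch of a drain that evicts every pod with a fixed grace period and then polls for that same duration before giving up. It is a model for illustration only, not the actual CA code: `drain_node`, `evict`, `pod_exists`, and the polling interval are hypothetical names and values.

```python
import time
from typing import Callable, List


def drain_node(
    pods: List[str],
    evict: Callable[[str, int], None],
    pod_exists: Callable[[str], bool],
    max_graceful_termination_sec: int = 60,
    poll_interval_sec: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Evict all pods, then block until they are gone or the timeout expires.

    Hypothetical model: the same value is used both as the grace period
    passed to each eviction and as the time spent waiting afterwards.
    """
    for pod in pods:
        evict(pod, max_graceful_termination_sec)

    deadline = clock() + max_graceful_termination_sec
    while True:
        remaining = [p for p in pods if pod_exists(p)]
        if not remaining:
            return
        if clock() >= deadline:
            # The caller is blocked for the whole wait before seeing this.
            raise TimeoutError("pods remaining after timeout")
        sleep(poll_interval_sec)
```

Note that the caller is blocked inside `drain_node` for up to `max_graceful_termination_sec`, which is why raising the value also delays everything else the loop would do.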
Yeah, it's retrying in a loop, and scale down is not working. I noticed my pods were being deleted in a constant loop but CA doesn't delete the node, so I set --max-grateful-termination-sec to 5800, and it's still reporting the same error above.
Few things:
1, As soon as CA starts draining, it should mark the node as unschedulable (but I don't see it doing that)
2, pods go into a terminating state and
3, kubernetes instantly creates new pods elsewhere, so if the node were marked as unschedulable, then after a few retries to drain I'd expect CA to force delete the node instead of throwing the timeout error
But in my case, I think the cause of the loop is that the node is left in a ready state, so some recreated pods end up back on that node
What CA and kubernetes version are you using? 0.5.x CA puts a taint on the node to make it unschedulable during drain.
CA gcr.io/google_containers/cluster-autoscaler:v0.5.2
Kubernetes:
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.3+coreos.0", GitCommit:"8fc95b64d0fe1608d0f6c788eaad2c004f31e7b7", GitTreeState:"clean", BuildDate:"2017-02-15T19:52:15Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
This is a known issue with running CA 0.5 on k8s 1.5 (https://github.com/kubernetes/contrib/issues/2491#issuecomment-288896371). Generally speaking k8s 1.5 / CA 0.5 is not fully supported and it's recommended to use 0.4 CA with 1.5 k8s.
OK, thanks for letting me know. I will downgrade to 0.4. Are there any plans for a fix?
Also, after downgrading, I'm seeing multiple log lines of:
I0504 17:37:43.542093 1 scale_down.go:163] No candidates for scale down
I0504 17:37:53.713349 1 scale_down.go:163] No candidates for scale down
I0504 17:38:03.998987 1 scale_down.go:163] No candidates for scale down
I0504 17:38:15.278546 1 scale_down.go:163] No candidates for scale down
I0504 17:38:25.490368 1 scale_down.go:163] No candidates for scale down
Is that expected, or does 0.4.0 require some other parameters?
The "No candidates for scale down" line means that CA didn't find any nodes to remove in its main loop. It is expected to show up in the logs roughly every 10s.
BTW the same is true in 0.5 CA.
@MaciekPytel @yawboateng I had a similar problem where node draining would time out. I found that it was because CA sets the termination grace period for every pod to whatever value --max-grateful-termination-sec is. If you have pods running on that node that accept SIGTERM and then rely on the kubelet to send a SIGKILL after terminationGracePeriod, then draining nodes will always fail. I ended up fixing it by adding this patch https://github.com/kubernetes/contrib/pull/2494 (it also fixes the typo in the --max-grateful-termination-sec flag). I can re-open it against this PR if you think the fix is valid.
I found for the most part that CA 0.5 does work with k8s 1.5, the only caveat being the NoSchedule taint doesn't work, but this generally isn't a problem as long as a new pod isn't scheduled while the node is being drained (and if it were it would retry eventually). It would be nice if we can use annotations alongside taints to support both k8s 1.5 and 1.6 going forward.
@andrewsykim Regarding kubernetes/contrib#2494 - what if a pod has a grace period of >1m defined in its spec? Currently we only respect it up to 1m, but with this change CA may need to wait forever (leading to a similar issue to this one). Perhaps a better solution would be to just wait on drain for grace period + some epsilon?
cc: @mwielgus
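One possible reading of the "grace period + some epsilon" idea, sketched in Python. This is an assumption about what such a calculation could look like, not code from CA or from the PR: the function name, the 5 second epsilon, and the choice to keep capping each pod's grace period at the CA maximum are all hypothetical.

```python
from typing import Iterable


def drain_wait_seconds(
    pod_grace_periods_sec: Iterable[int],
    max_graceful_termination_sec: int = 60,
    epsilon_sec: int = 5,
) -> int:
    """Return how long a drain would wait for pods to disappear.

    Each pod's grace period is capped at the CA-wide maximum, and the
    wait is the longest capped value plus a small epsilon, so a SIGKILL
    sent at the end of the grace period has time to take effect.
    """
    capped = [min(g, max_graceful_termination_sec) for g in pod_grace_periods_sec]
    longest = max(capped) if capped else 0
    return longest + epsilon_sec
```

With these assumed defaults, a node holding pods with 30s and 120s grace periods would be waited on for 65s: the 120s pod is capped at 60s, and the epsilon is added on top.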
In my opinion, forcing the terminationGracePeriod on every pod on a node is bad practice to begin with, even if the node is to be terminated. There are many cases where users set their grace periods longer than 1 min, and no one expects CA to override that. Grace period + epsilon is better than the current implementation, but I think the better approach is to set --max-grat(c)eful-termination-sec higher by default (5 mins?), so that most clusters won't time out, and let cluster admins decide for themselves what a reasonable grace period is based on their workload.
The problem with accepting a higher graceful termination period is that we stop CA operations until the node is deleted. If graceful termination was 10 min then CA would stop for 10 min, and during this time no scale up operations would be executed. This is probably not what users would want.
So if we want graceful termination to be significantly longer than 1 min, then we need to go asynchronous with the deletes, which will make the whole thing even more complicated. So I guess we won't do anything around it for 1.7, maybe for 1.8, but we have other, probably more important pain points to fix.
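As a rough illustration of what "asynchronous deletes" means here, the Python sketch below runs a drain on a background thread while the loop keeps handling scale-up work. It is not how CA is implemented and glosses over the real complications (tracking in-flight deletions, retries, failure handling); `run_loop_with_async_drain`, `drain`, and `scale_up` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_loop_with_async_drain(
    drain: Callable[[], object],
    scale_up: Callable[[int], None],
    iterations: int,
) -> List[Tuple[int, bool]]:
    """Run `iterations` loop passes while a drain proceeds in the background.

    Returns, for each pass, the pass index and whether the drain had
    already finished when that pass completed.
    """
    log: List[Tuple[int, bool]] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(drain)  # does not block the loop
        for i in range(iterations):
            scale_up(i)  # scale-up still runs during the drain
            log.append((i, pending.done()))
        pending.result()  # surface any drain error at the end
    return log
```

In the synchronous model the loop would not reach `scale_up` at all until the drain returned, which is the trade-off described above.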
Speaking about the issue - it seems that your app is probably ignoring SIGTERMs. Also, after investigating the code around this, there seems to be a subtle timing/race-condition bug in the pod checking loop. The fix is on the way.
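For workloads that currently ignore SIGTERM and only stop on SIGKILL, a minimal Python sketch of handling SIGTERM and exiting promptly is shown below. It is a generic pattern under stated assumptions, not anything specific to the app in this thread; `serve`, `work_step`, and `cleanup` are hypothetical names.

```python
import signal
import threading

# Set once a shutdown has been requested.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the main loop performs the actual shutdown.
    shutdown_requested.set()


def install_handler():
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(work_step, cleanup):
    """Run work_step repeatedly until shutdown is requested, then clean up.

    Returns the number of work steps that were executed.
    """
    steps = 0
    while not shutdown_requested.is_set():
        work_step()
        steps += 1
    cleanup()
    return steps
```

A process would call `install_handler()` once at startup and then `serve(...)`. As long as each `work_step` is short, the process exits well within the grace period instead of waiting for the kubelet's SIGKILL.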
Should be fixed in 0.5.4.
@bowei @nicksardo FYI