Autoscaler: Why not to drain daemonset pods?

Created on 14 Jan 2019 · 15 Comments · Source: kubernetes/autoscaler

I read the source code for the scale-down path and found that the cluster-autoscaler never drains DaemonSet pods when scaling down. Wouldn't this cause problems?

cluster-autoscaler lifecycle/rotten

calixwu

Most helpful comment

I just noticed this too on Google's GKE. We run a Consul agent DaemonSet on every node, with "leave_on_terminate: true" set in the Consul config so that the agent gracefully leaves the Consul mesh when it is stopped. I wasn't seeing any graceful leave happening, however, and it seems the DaemonSet container receives no SIGTERM before the cluster autoscaler deletes the node.

I agree that this does seem like something the kubelet should be handling though? Gracefully shutting down all the remaining workloads on a node when it itself is shut down.

KJTsanaktsidis on 5 Apr 2019

👍5

All 15 comments

There's no point evicting DaemonSet pods, as they can't be scheduled on another node anyway. You're correct this means in particular we won't attempt any sort of graceful termination other than what kubelet may provide when shutting down. I'm not really sure if there's any mechanism we could utilize for it.

aleksandra-malinowska on 14 Jan 2019

@aleksandra-malinowska

Yep, my point is graceful termination. Can't we evict DaemonSet pods and give them a graceful termination, the same as when evicting other pods (e.g. replicated pods)? The only problem I can think of with this approach is that we could no longer delete the empty nodes simultaneously quite so simply.

calixwu on 15 Jan 2019

👍2

Note CA only respects graceful termination of up to 10 minutes anyway. Evicting daemonset pods if we're draining a node probably wouldn't be difficult, but the empty nodes case is a large issue. A lot of people rely on the fact that they're deleted in parallel and changing that would be a huge regression.
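
To make the 10-minute ceiling concrete, here is a minimal Go sketch of how a pod's requested grace period might be clamped during a drain. The function and constant names are hypothetical, and the 600-second value is simply the figure from the comment above; this is an illustration, not the autoscaler's actual code.

```go
package main

import "fmt"

// maxGracefulTerminationSec is the 10-minute ceiling from the comment above,
// expressed in seconds. The name is hypothetical.
const maxGracefulTerminationSec int64 = 600

// effectiveGracePeriodSec returns how long a drain would wait for a pod that
// asks for `requested` seconds of graceful termination.
func effectiveGracePeriodSec(requested int64) int64 {
	if requested < 0 {
		return 0 // treat nonsensical negative values as "no grace period"
	}
	if requested > maxGracefulTerminationSec {
		return maxGracefulTerminationSec // longer requests are cut off at the cap
	}
	return requested
}

func main() {
	for _, requested := range []int64{30, 600, 900} {
		fmt.Printf("pod asks for %ds, drain waits at most %ds\n",
			requested, effectiveGracePeriodSec(requested))
	}
}
```

Under this assumption, a pod that needs 15 minutes to shut down cleanly would still be cut off after 10, whether or not it belongs to a DaemonSet.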

MaciekPytel on 15 Jan 2019

@MaciekPytel
Could we do the following to solve the empty-nodes issue?

1. Ignore the DaemonSet pods when finding the empty nodes.
2. Do not ignore the DaemonSet pods when deleting the empty nodes. Instead, add the "delete taint" to the empty nodes and evict their DaemonSet pods in parallel; once the nodes are actually empty (no DaemonSet pods left on them), call the cloud provider's deletion in parallel.
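
The two steps above could be sketched roughly as follows in Go. Everything here is hypothetical: the `Node` type, the `Cluster` interface and its `Taint`, `Evict` and `DeleteNode` methods are stand-ins invented for illustration, and `Evict` is assumed to block until the pod has actually terminated. Nodes are handled in parallel; for brevity, the DaemonSet pods on a single node are evicted one after another.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
	Name          string
	DaemonSetPods []string
	OtherPods     []string
}

// Cluster is a hypothetical abstraction over the API server and cloud provider.
type Cluster interface {
	Taint(node string) error      // add the "delete taint" to the node
	Evict(node, pod string) error // assumed to block until the pod is gone
	DeleteNode(node string) error // cloud provider deletion
}

// isEmpty implements step 1: DaemonSet pods do not count against emptiness.
func isEmpty(n Node) bool { return len(n.OtherPods) == 0 }

// scaleDownEmptyNodes implements step 2. Each empty node is tainted, its
// DaemonSet pods are evicted, and only then is the node deleted. Nodes are
// processed concurrently, so empty-node removal stays parallel.
func scaleDownEmptyNodes(c Cluster, nodes []Node) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		deleted []string
	)
	for _, n := range nodes {
		if !isEmpty(n) {
			continue
		}
		wg.Add(1)
		go func(n Node) {
			defer wg.Done()
			if err := c.Taint(n.Name); err != nil {
				return
			}
			for _, pod := range n.DaemonSetPods {
				if err := c.Evict(n.Name, pod); err != nil {
					return // leave the node alone if a pod cannot be evicted
				}
			}
			if err := c.DeleteNode(n.Name); err != nil {
				return
			}
			mu.Lock()
			deleted = append(deleted, n.Name)
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	sort.Strings(deleted) // goroutines finish in any order; sort for stable output
	return deleted
}

// fakeCluster records the operations performed on each node, in order.
type fakeCluster struct {
	mu  sync.Mutex
	log map[string][]string
}

func (f *fakeCluster) record(node, op string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.log[node] = append(f.log[node], op)
	return nil
}

func (f *fakeCluster) Taint(node string) error      { return f.record(node, "taint") }
func (f *fakeCluster) Evict(node, pod string) error { return f.record(node, "evict "+pod) }
func (f *fakeCluster) DeleteNode(node string) error { return f.record(node, "delete") }

func main() {
	fc := &fakeCluster{log: map[string][]string{}}
	nodes := []Node{
		{Name: "a", DaemonSetPods: []string{"consul-a"}},
		{Name: "b", DaemonSetPods: []string{"consul-b"}, OtherPods: []string{"web-1"}},
		{Name: "c", DaemonSetPods: []string{"consul-c"}},
	}
	fmt.Println("deleted:", scaleDownEmptyNodes(fc, nodes))
	fmt.Println("operations on node a:", fc.log["a"])
}
```

The per-node ordering (taint, evict, delete) is what gives DaemonSet pods a chance to shut down gracefully, while the goroutine per node preserves the parallel deletion that people rely on today.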

calixwu on 15 Jan 2019

> Evicting daemonset pods if we're draining a node probably wouldn't be difficult

It may be harder than we think. We wait for pods to be scheduled back onto some other nodes, but each DaemonSet pod has a node selector for a particular node, and we don't want it to run anymore. These pods also seem to ignore the fact that the node is not ready or unschedulable (which of course makes sense when starting a node that requires a DaemonSet to be ready), so they may just be scheduled back anyway. Finally, I recall some issues where an extra DaemonSet pod was left pending, so we'd have to make sure whatever we do is handled correctly by the controller.

Also, by default, kubectl drain doesn't try to remove DaemonSets, either. However, there's a flag --ignore-daemonsets=false, so it may be a starting point to investigate how it works.

aleksandra-malinowska on 15 Jan 2019

> These pods also seem to ignore the fact that node is not ready or unschedulable (which of course makes sense when starting a node that requires a DaemonSet to be ready).

@aleksandra-malinowska That's because the "node.kubernetes.io/not-ready" and "node.kubernetes.io/unschedulable" tolerations are added automatically; see https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations. If we use a custom taint, I think the pod won't be scheduled back.

calixwu on 15 Jan 2019

I also agree that we should investigate how kubectl drain --ignore-daemonsets=false works.

calixwu on 15 Jan 2019

Fair point. Although we'll still have to think about handling wildcard toleration, which kind of makes sense if you want to have a DaemonSet that really runs on any node in the cluster with custom resources or dedicated node groups.
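
The toleration behaviour discussed in the last few comments can be illustrated with a small Go sketch of taint/toleration matching. The types below are simplified stand-ins rather than the real Kubernetes API types, only the two automatically added tolerations named in the thread are listed, and the custom taint key `example.com/to-be-deleted` is a made-up placeholder.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a taint.
// An empty Effect matches every effect, and an empty Key combined with
// the Exists operator matches every key (the "wildcard" toleration).
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// toleratesAny reports whether any toleration in the list matches the taint.
func toleratesAny(tols []Toleration, t Taint) bool {
	for _, tol := range tols {
		if tolerates(tol, t) {
			return true
		}
	}
	return false
}

// daemonSetDefaults lists the two automatically added tolerations that the
// thread mentions; the real list is longer.
func daemonSetDefaults() []Toleration {
	return []Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/unschedulable", Operator: "Exists", Effect: "NoSchedule"},
	}
}

func main() {
	unschedulable := Taint{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}
	custom := Taint{Key: "example.com/to-be-deleted", Effect: "NoSchedule"} // hypothetical key
	wildcard := []Toleration{{Operator: "Exists"}}

	fmt.Println("defaults tolerate unschedulable:", toleratesAny(daemonSetDefaults(), unschedulable))
	fmt.Println("defaults tolerate custom taint: ", toleratesAny(daemonSetDefaults(), custom))
	fmt.Println("wildcard tolerates custom taint:", toleratesAny(wildcard, custom))
}
```

Under these rules, a custom taint keeps an ordinary DaemonSet pod from being scheduled back onto the node, while a DaemonSet with a wildcard toleration would still land there, which is the case that would need separate handling.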

aleksandra-malinowska on 15 Jan 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 4 Jul 2019

/remove-lifecycle stale

metral on 11 Jul 2019

/lifecycle stale

fejta-bot on 9 Oct 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

/lifecycle rotten

fejta-bot on 8 Nov 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

fejta-bot on 8 Dec 2019

@fejta-bot: Closing this issue.
