Autoscaler: Why not drain DaemonSet pods?

Created on 14 Jan 2019  ·  15 Comments  ·  Source: kubernetes/autoscaler

I read the source code for the scale-down part and found that the cluster-autoscaler never drains DaemonSet pods when scaling down. Wouldn't this cause problems?

cluster-autoscaler lifecycle/rotten

All 15 comments

There's no point evicting DaemonSet pods, as they can't be scheduled on another node anyway. You're correct this means in particular we won't attempt any sort of graceful termination other than what kubelet may provide when shutting down. I'm not really sure if there's any mechanism we could utilize for it.

@aleksandra-malinowska

Yep, my point is the graceful termination. Can't we evict DaemonSet pods and give them a graceful termination, the same way we do when evicting other pods (e.g. replicated pods)? The only problem I can think of with this approach is that we could no longer delete empty nodes simultaneously as simply as we do now.

Note CA only respects graceful termination of up to 10 minutes anyway. Evicting daemonset pods if we're draining a node probably wouldn't be difficult, but the empty nodes case is a large issue. A lot of people rely on the fact that they're deleted in parallel and changing that would be a huge regression.
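To make the 10-minute limit concrete, here is a minimal sketch of how such a cap interacts with a pod's own termination grace period. The function and constant names are hypothetical and are not taken from the cluster-autoscaler source; only the 10-minute figure comes from the comment above.

```python
# Assumed cap, taken from the comment above: roughly 10 minutes.
MAX_GRACEFUL_TERMINATION_SEC = 10 * 60


def effective_grace_period(pod_grace_sec: int,
                           cap_sec: int = MAX_GRACEFUL_TERMINATION_SEC) -> int:
    """Return how long a drain would wait for a pod once the cap is applied."""
    return min(pod_grace_sec, cap_sec)


if __name__ == "__main__":
    # A pod asking for one hour would only get the capped 600 seconds.
    for requested in (30, 600, 3600):
        print(requested, "->", effective_grace_period(requested))
```

Under this assumption, a pod that needs longer than the cap to shut down cleanly would be cut off even if it were evicted.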

@MaciekPytel
Could we solve the empty-nodes issue like this?

  1. Ignore DaemonSet pods when determining which nodes are empty.
  2. Do not ignore DaemonSet pods when deleting the empty nodes. Instead, add the "delete taint" to the empty nodes and evict their DaemonSet pods in parallel. Once the nodes are actually empty (no DaemonSet pods left on them), call the cloud provider's deletion in parallel.
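The two steps proposed above could be sketched roughly as follows. This is an in-memory simulation only, not cluster-autoscaler code: the `Node` and `Pod` types, the taint key `example.com/to-be-deleted`, and the `delete_node` callback are all hypothetical, and the eviction is a stand-in for a real eviction call plus waiting for termination.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

DELETE_TAINT = "example.com/to-be-deleted"  # hypothetical taint key


@dataclass
class Pod:
    name: str
    owned_by_daemonset: bool = False


@dataclass
class Node:
    name: str
    pods: list = field(default_factory=list)
    taints: set = field(default_factory=set)


def is_empty(node: Node) -> bool:
    """Step 1: a node counts as empty if only DaemonSet pods remain on it."""
    return all(p.owned_by_daemonset for p in node.pods)


def drain_daemonset_pods(node: Node) -> Node:
    """Step 2a: taint the node, then evict its DaemonSet pods."""
    node.taints.add(DELETE_TAINT)
    # Stand-in for calling the eviction API and waiting for termination.
    node.pods = [p for p in node.pods if not p.owned_by_daemonset]
    return node


def scale_down_empty_nodes(nodes, delete_node):
    """Drain all empty nodes in parallel, then delete them in parallel."""
    empty = [n for n in nodes if is_empty(n)]
    if not empty:
        return []
    with ThreadPoolExecutor(max_workers=len(empty)) as pool:
        drained = list(pool.map(drain_daemonset_pods, empty))
        # Step 2b: only nodes with no pods left go to the cloud provider.
        ready = [n for n in drained if not n.pods]
        list(pool.map(delete_node, ready))
    return [n.name for n in ready]


if __name__ == "__main__":
    nodes = [
        Node("node-a", [Pod("consul-agent", owned_by_daemonset=True)]),
        Node("node-b", [Pod("web-1"), Pod("consul-agent", owned_by_daemonset=True)]),
    ]
    print(scale_down_empty_nodes(nodes, lambda n: None))
```

The point of the sketch is that both the eviction phase and the deletion phase stay parallel across nodes, which is the property the previous comment was worried about losing.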

> Evicting daemonset pods if we're draining a node probably wouldn't be difficult

It may be harder than we think. We wait for pods to be scheduled back onto some other nodes, but each DaemonSet pod has a node selector for a particular node, and we don't want it to run anymore. These pods also seem to ignore the fact that a node is not ready or unschedulable (which of course makes sense when starting a node that requires a DaemonSet to be ready), so they may just be scheduled back anyway. Finally, I recall some issues where an extra DaemonSet pod was left pending, so we'd have to make sure whatever we do is handled correctly by the controller.

Also, by default, kubectl drain doesn't try to remove DaemonSets, either. However, there's a flag --ignore-daemonsets=false, so it may be a starting point to investigate how it works.

> These pods also seem to ignore the fact that a node is not ready or unschedulable (which of course makes sense when starting a node that requires a DaemonSet to be ready).

@aleksandra-malinowska That's because tolerations for "node.kubernetes.io/not-ready" and "node.kubernetes.io/unschedulable" are added to DaemonSet pods automatically (see https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations). If we use a custom taint, the pod won't be scheduled back, I think.

I also agree that we should investigate how kubectl drain --ignore-daemonsets=false works.

Fair point. We'll still have to think about handling a wildcard toleration, though, which kind of makes sense if you want a DaemonSet that really runs on any node in the cluster, including nodes with custom resources or dedicated node groups.
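The taint and toleration exchange above can be illustrated with a simplified matcher. This is a sketch of the matching rules as described in the Kubernetes taints-and-tolerations documentation, not the scheduler's actual code; it treats every taint as blocking, and the custom taint key `example.com/to-be-deleted` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key == "":
        # An empty key with operator Exists is the wildcard toleration.
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def can_schedule(tolerations, taints) -> bool:
    """Simplified: the pod fits only if every taint is tolerated."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


# Two of the tolerations added to DaemonSet pods automatically.
AUTO_TOLERATIONS = [
    Toleration("node.kubernetes.io/not-ready", "Exists", effect="NoExecute"),
    Toleration("node.kubernetes.io/unschedulable", "Exists", effect="NoSchedule"),
]

CORDONED = Taint("node.kubernetes.io/unschedulable", "NoSchedule")
CUSTOM = Taint("example.com/to-be-deleted", "NoSchedule")  # hypothetical key
WILDCARD = Toleration(operator="Exists")

if __name__ == "__main__":
    print(can_schedule(AUTO_TOLERATIONS, [CORDONED]))                      # cordon alone
    print(can_schedule(AUTO_TOLERATIONS, [CORDONED, CUSTOM]))              # custom taint added
    print(can_schedule(AUTO_TOLERATIONS + [WILDCARD], [CORDONED, CUSTOM])) # wildcard toleration
```

In this model a cordon alone does not keep the DaemonSet pod away, a custom taint does, and a wildcard toleration defeats the custom taint again, which is the remaining case the last comment raises.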

I just noticed this too on Google's GKE. We have a Consul agent DaemonSet on every node, with "leave_on_terminate: true" set in the Consul config so that the agent gracefully leaves the Consul mesh when it is stopped. I wasn't seeing any graceful leave happening, however, and it turns out that the DaemonSet container apparently never receives a SIGTERM before the cluster autoscaler deletes the node.

I agree that this seems like something the kubelet should handle, though: gracefully shutting down all the remaining workloads on a node when the node itself is shut down.
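For context, a graceful leave of this kind usually depends on the process receiving SIGTERM and reacting to it. The sketch below shows that general pattern only; it is not how Consul is implemented, and `leave_mesh` and `run_agent` are hypothetical names.

```python
import signal
import threading

shutdown_requested = threading.Event()


def leave_mesh() -> str:
    """Hypothetical stand-in for the agent's graceful leave."""
    return "left mesh gracefully"


def handle_sigterm(signum, frame) -> None:
    # Only set a flag here; the real cleanup happens in the main loop.
    shutdown_requested.set()


def install_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_agent(poll_interval: float = 0.1) -> str:
    """Do agent work until shutdown is requested, then leave gracefully."""
    while not shutdown_requested.wait(poll_interval):
        pass  # normal agent work would happen here
    return leave_mesh()
```

A real agent would call `install_handler()` once at startup and then `run_agent()`. If the node is deleted without the container ever being sent SIGTERM, `handle_sigterm` never runs and the leave step is skipped, which matches the behavior described above.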

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

