I read the scale-down part of the source code and found that the cluster-autoscaler never drains DaemonSet pods when scaling down. Wouldn't this cause problems?
There's no point evicting DaemonSet pods, as they can't be scheduled on another node anyway. You're correct that this means, in particular, that we won't attempt any graceful termination beyond what kubelet may provide when shutting down. I'm not really sure if there's any mechanism we could use for it.
@aleksandra-malinowska
Yep. My point is about graceful termination. Can't we evict DaemonSet pods and give them the same graceful termination as other evicted pods (e.g. replicated pods)? The only problem I can think of with this approach is that we could no longer simply delete empty nodes simultaneously.
Note CA only respects graceful termination of up to 10 minutes anyway. Evicting daemonset pods if we're draining a node probably wouldn't be difficult, but the empty nodes case is a large issue. A lot of people rely on the fact that they're deleted in parallel and changing that would be a huge regression.
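To make the 10-minute cap concrete, here is a minimal sketch of how a pod's requested grace period could be clamped to a maximum. The function name and constant are hypothetical and are not taken from CA's source; the 600-second value simply mirrors the 10-minute limit mentioned above, and 30 seconds is the Kubernetes default for terminationGracePeriodSeconds.

```python
from typing import Optional

# Assumption: 10 minutes, mirroring the limit mentioned in the comment above.
MAX_GRACEFUL_TERMINATION_SEC = 600

# Kubernetes default for terminationGracePeriodSeconds when the pod sets none.
DEFAULT_POD_GRACE_SEC = 30


def effective_grace_period(pod_grace_sec: Optional[int],
                           max_sec: int = MAX_GRACEFUL_TERMINATION_SEC) -> int:
    """Return the grace period that would actually be honoured for a pod.

    pod_grace_sec is the pod's terminationGracePeriodSeconds, or None if unset.
    The result is never larger than max_sec.
    """
    if pod_grace_sec is None:
        pod_grace_sec = DEFAULT_POD_GRACE_SEC
    return min(pod_grace_sec, max_sec)
```

Under these assumptions, a pod asking for an hour of grace would get at most 10 minutes, while a pod asking for 45 seconds would be unaffected.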
@MaciekPytel
Could we do these to solve the empty-nodes issue? :
Evicting daemonset pods if we're draining a node probably wouldn't be difficult
It may be harder than we think. We wait for pods to be scheduled back onto some other nodes, but each DaemonSet pod has a node selector for a particular node, and we no longer want it to run there. These pods also seem to ignore the fact that node is not ready or unschedulable (which of course makes sense when starting a node that requires a DaemonSet to be ready), so they may just be scheduled back anyway. Finally, I recall some issues where an extra DaemonSet pod was left pending, so we'd have to make sure whatever we do is correctly handled by the controller.
Also, by default, kubectl drain doesn't try to remove DaemonSets, either. However, there's a flag --ignore-daemonsets=false, so it may be a starting point to investigate how it works.
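For anyone investigating this, a drain routine first has to tell DaemonSet pods apart from the rest. A rough sketch of that split, assuming pods are plain dicts shaped like their API manifests and that a DaemonSet pod is recognised by a controller ownerReference of kind "DaemonSet" (the helper names are made up for illustration):

```python
from typing import Any, Dict, List, Tuple

Pod = Dict[str, Any]


def is_daemonset_pod(pod: Pod) -> bool:
    """True if the pod's controlling owner is a DaemonSet."""
    owners = pod.get("metadata", {}).get("ownerReferences") or []
    return any(
        ref.get("kind") == "DaemonSet" and ref.get("controller", False)
        for ref in owners
    )


def split_pods(pods: List[Pod]) -> Tuple[List[Pod], List[Pod]]:
    """Partition pods into (daemonset_pods, other_pods), preserving order."""
    daemonset_pods = [p for p in pods if is_daemonset_pod(p)]
    other_pods = [p for p in pods if not is_daemonset_pod(p)]
    return daemonset_pods, other_pods
```

A drain that wanted to terminate DaemonSet pods gracefully could evict `other_pods` first and handle `daemonset_pods` separately, since they will not be rescheduled elsewhere.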
These pods also seem to ignore the fact that node is not ready or unschedulable (which of course makes sense when starting a node that requires a DaemonSet to be ready).
@aleksandra-malinowska That's because the "node.kubernetes.io/not-ready" and "node.kubernetes.io/unschedulable" tolerations are added automatically; see https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations. If we use a custom taint, the pod won't be scheduled back, I think.
I also agree that we should investigate how kubectl drain --ignore-daemonsets=false works.
Fair point. We'll still have to think about handling wildcard tolerations, though, which make sense if you want a DaemonSet that really runs on any node in the cluster, including nodes with custom resources or dedicated node groups.
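To illustrate the two cases, here is a simplified sketch of toleration matching: a custom taint is not covered by the automatically added tolerations, but it is covered by a wildcard toleration (operator Exists with no key). This is a Python approximation of the matching rules, not the scheduler's code, and the taint key "example.com/scale-down" is made up for the example.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # With an empty key this matches every taint (wildcard).
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def pod_tolerates(tolerations: List[Toleration], taint: Taint) -> bool:
    """True if any of the pod's tolerations matches the taint."""
    return any(tolerates(t, taint) for t in tolerations)


# The two tolerations mentioned above, as added automatically to DaemonSet pods.
AUTO_TOLERATIONS = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute"},
    {"key": "node.kubernetes.io/unschedulable", "operator": "Exists", "effect": "NoSchedule"},
]

# Hypothetical custom taint that a scale-down could place on the node.
CUSTOM_TAINT = {"key": "example.com/scale-down", "value": "true", "effect": "NoSchedule"}

# A wildcard toleration: no key, operator Exists.
WILDCARD = [{"operator": "Exists"}]
```

With these definitions, `pod_tolerates(AUTO_TOLERATIONS, CUSTOM_TAINT)` is false, so the custom taint would keep such a pod off the node, while `pod_tolerates(WILDCARD, CUSTOM_TAINT)` is true, which is the case that still needs handling.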
I just noticed this too on Google's GKE. We run a Consul agent DaemonSet on every node, with "leave_on_terminate: true" set in the Consul config so that the agent gracefully leaves the Consul mesh when it is stopped. I wasn't seeing any graceful leave happening, however, and it seems that the DaemonSet container gets no SIGTERM before the cluster autoscaler deletes the node.
I agree that this does seem like something the kubelet should be handling, though? Gracefully shutting down all the remaining workloads on a node when the node itself is shut down.
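For context on why the missing SIGTERM matters, here is a generic sketch of how a node-level agent typically reacts to it. This is not Consul's implementation; `leave_cluster` is a hypothetical callback standing in for whatever "leave the mesh" means for the agent. If the node is deleted without the signal ever being delivered, the callback never runs.

```python
import signal
import threading
from typing import Callable

# Set once a termination signal has been received.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame) -> None:
    """Signal handler: only record the request; do the real work elsewhere."""
    shutdown_requested.set()


def install_handler() -> None:
    """Register the SIGTERM handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def wait_then_leave(leave_cluster: Callable[[], None],
                    poll_interval: float = 0.1) -> None:
    """Block until shutdown is requested, then run the graceful-leave callback."""
    while not shutdown_requested.wait(poll_interval):
        pass  # a real agent would do its normal work here
    leave_cluster()
```

An agent would call `install_handler()` at startup and then `wait_then_leave(...)` with its own leave routine; the leave also has to finish within the pod's termination grace period.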
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.