As I understand it, the current behaviour for rolling-update is:
    for node in stale_nodes:
        drain(node)
        validate_stable()
        delete(node)
With the ASG taking care of spawning the new nodes.
This is very slow, particularly when pod disruption budgets are nowhere near being violated. (For example, imagine we are configured so that even the smallest scale-down of any workload still has capacity for 2 disruptions, and scales up under load. Then kops could safely introduce 1 disruption at any time, and potentially many more whenever the disruption budgets haven't been exhausted.)
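As a way to measure that headroom, here is a minimal sketch assuming the official `kubernetes` Python client and policy/v1 PDBs (the `allowed_disruptions` helper is made up for illustration): it reads each PodDisruptionBudget's `disruptionsAllowed` status, which is the "how many disruptions can we safely introduce right now" number per budget.

```python
from kubernetes import client, config

def allowed_disruptions():
    """Map of namespace/name -> disruptions each PDB currently allows (0 means 'do not touch')."""
    config.load_kube_config()
    policy = client.PolicyV1Api()
    budgets = policy.list_pod_disruption_budget_for_all_namespaces()
    return {
        f"{pdb.metadata.namespace}/{pdb.metadata.name}": pdb.status.disruptions_allowed
        for pdb in budgets.items
    }
```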
Now, you might say 'slow doesn't matter', but ops folk have to pay attention during this process, and when it exceeds attention windows - say an hour - that becomes a human factor problem.
I'm not sure what the right behaviour should be, but something like the following would be pretty much ideal for many of our use cases:
    asg.count = 2 * len(stale_nodes)
    # wait for *a* node to be up (any single replacement is enough)
    map(cordon, stale_nodes)
    while True:  # poll until everything evictable has been evicted
        if drain_nonblocking(stale_nodes):
            break
    map(delete, stale_nodes)
where drain_nonblocking is something like this:
    def drain_nonblocking(stale_nodes):
        done = True
        for node in stale_nodes:
            for pod in pods_on(node):
                if ((standalone(pod) or
                     statefulset_highest_unmoved(pod) or
                     daemonset_or_job(pod) or
                     deployment_above_disruption_budget(pod)) and
                        pod_can_be_rescheduled(pod)):
                    delete(pod)
                else:
                    done = False
        return done
The idea is to induce as much disruption as the cluster is defined to tolerate, as rapidly as possible.
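For the `delete(pod)` step, a sketch of a PDB-respecting eviction, assuming a recent version of the official `kubernetes` Python client (the `try_evict` helper is made up): using the Eviction API instead of a plain delete lets the API server enforce disruption budgets itself, returning HTTP 429 when an eviction would currently violate one.

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def try_evict(core_v1: client.CoreV1Api, pod: client.V1Pod) -> bool:
    """Attempt a PDB-respecting eviction; return True if the API server accepted it."""
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
        )
    )
    try:
        core_v1.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=eviction,
        )
        return True
    except ApiException as exc:
        if exc.status == 429:  # eviction would violate a PodDisruptionBudget right now
            return False
        raise
```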
An obvious extension to this approach is a canary-style, incrementally more aggressive rollout: first one node, then two, then four, eight, sixteen, and so on until the entire cluster is being done at once.
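The batch schedule is just exponential doubling capped at the number of remaining stale nodes; a tiny illustrative sketch (the helper name is made up):

```python
def canary_batches(stale_nodes):
    """Yield batches of stale nodes of size 1, 2, 4, 8, ... until all are covered."""
    remaining = list(stale_nodes)
    size = 1
    while remaining:
        batch, remaining = remaining[:size], remaining[size:]
        yield batch
        size *= 2

# e.g. for 11 stale nodes this yields batches of 1, 2, 4 and then the final 4.
```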
kops 1.9.0 :).
Let me know if this would be considered for merging; we might see about putting something together.
For me, an ideal solution would be for kops to understand what my critical services are and how many instances/pods I can tolerate being taken out of service. As long as those critical services remain available with a minimum number of instances, kops can recycle as many nodes as possible. PDBs can be used, but I think this goes deeper than plain pod disruption budgets. Kops could check the fleet of nodes beforehand, work out which ones can be recycled immediately, and then efficiently recycle the remaining ones without violating a PDB. As we all know, kops can hang recycling a node that hosts pods whose eviction would violate a PDB; this intelligent decision-making up front would avoid that.
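As a rough illustration of that up-front check, a sketch (again assuming the official `kubernetes` Python client; `node_is_safe_to_recycle` is a hypothetical helper) that decides per node whether every PDB covering its pods currently has disruption headroom:

```python
from kubernetes import client, config

def node_is_safe_to_recycle(node_name: str) -> bool:
    """True if every PDB covering a pod on this node currently allows at least one disruption."""
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    policy_v1 = client.PolicyV1Api()
    pods = core_v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    ).items
    pdbs = policy_v1.list_pod_disruption_budget_for_all_namespaces().items
    for pod in pods:
        labels = (pod.metadata.labels or {}).items()
        for pdb in pdbs:
            if pdb.metadata.namespace != pod.metadata.namespace:
                continue
            # Only matchLabels handled here; matchExpressions omitted for brevity.
            selector = (pdb.spec.selector.match_labels or {}) if pdb.spec.selector else {}
            if selector and selector.items() <= labels:
                if pdb.status.disruptions_allowed < 1:
                    return False
    return True
```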
In the meantime, we usually run more than one upgrade command against a cluster, picking a different instance group for each rolling deployment. That helps somewhat.
Sounds like we're saying much the same thing; the key question for me is how much room for experimentation kops will permit - no point putting experimental code forward if it's not broadly interesting to the maintainers :)
@rbtcollins I totally agree with the point that rolling-update is super slow.
I am not sure you have the same thing in mind, but another possibility is to temporarily double the number of nodes in the auto-scaling group and then... put it back to normal.
The auto-scaling group removes the oldest nodes first by default, as far as I know.
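A hedged sketch of that doubling step with boto3 (the group name and surrounding orchestration are made up; kops itself manages the ASG, so this is only to show the mechanics):

```python
import boto3

def double_asg(asg_name: str) -> int:
    """Temporarily double an ASG's desired capacity; return the original so it can be restored later."""
    autoscaling = boto3.client("autoscaling")
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    original = group["DesiredCapacity"]
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=min(original * 2, group["MaxSize"]),
        HonorCooldown=False,
    )
    return original
```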
See also these:
- Parallelize and improve rolling updates even more
- Kops Rollout Strategies - WIP
- Post from Reactive Ops (nice GIF explanation)
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.