As I understand it, the current behaviour for rolling-update is:
    for node in stale_nodes:
        drain(node)
        validate_stable()
        delete(node)
With the ASG taking care of spawning the new nodes.
This is very slow, particularly when pod disruption budgets are nowhere near being violated. (For example, imagine we are configured so that even the smallest scale-down of any workload still has capacity for 2 disruptions, and scales up under load. Then kops could safely introduce 1 disruption at any time, and potentially many more whenever the disruption budgets haven't been exhausted.)
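As a way to measure that headroom, here is a minimal sketch assuming the official `kubernetes` Python client and policy/v1 PDBs (the `allowed_disruptions` helper is made up for illustration): it reads each PodDisruptionBudget's `disruptionsAllowed` status, which is the "how many disruptions can we safely introduce right now" number per budget.

```python
from kubernetes import client, config

def allowed_disruptions():
    """Map of namespace/name -> disruptions each PDB currently allows (0 means 'do not touch')."""
    config.load_kube_config()
    policy = client.PolicyV1Api()
    budgets = policy.list_pod_disruption_budget_for_all_namespaces()
    return {
        f"{pdb.metadata.namespace}/{pdb.metadata.name}": pdb.status.disruptions_allowed
        for pdb in budgets.items
    }
```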
Now, you might say 'slow doesn't matter', but ops folk have to pay attention during this process, and when it exceeds attention windows - say an hour - that becomes a human factor problem.
I'm not sure what the right behaviour should be, but something like the following would be pretty much ideal for many of our use cases:
    asg.count = 2 * len(stale_nodes)
    # wait for *a* node to be up (any single replacement is enough)
    map(cordon, stale_nodes)
    while True:  # poll until everything evictable has been evicted
        if drain_nonblocking(stale_nodes):
            break
    map(delete, stale_nodes)
where drain_nonblocking is something like this:
    def drain_nonblocking(stale_nodes):
        done = True
        for node in stale_nodes:
            for pod in pods_on(node):
                if ((standalone(pod) or
                     statefulset_highest_unmoved(pod) or
                     daemonset_or_job(pod) or
                     deployment_above_disruption_budget(pod)) and
                        pod_can_be_rescheduled(pod)):
                    delete(pod)
                else:
                    done = False
        return done
The idea is to induce as much disruption as the cluster is defined to tolerate, as rapidly as possible.
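For the `delete(pod)` step, a sketch of a PDB-respecting eviction, assuming a recent version of the official `kubernetes` Python client (the `try_evict` helper is made up): using the Eviction API instead of a plain delete lets the API server enforce disruption budgets itself, returning HTTP 429 when an eviction would currently violate one.

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def try_evict(core_v1: client.CoreV1Api, pod: client.V1Pod) -> bool:
    """Attempt a PDB-respecting eviction; return True if the API server accepted it."""
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
        )
    )
    try:
        core_v1.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=eviction,
        )
        return True
    except ApiException as exc:
        if exc.status == 429:  # eviction would violate a PodDisruptionBudget right now
            return False
        raise
```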
An obvious extension to this approach is a canary-style, incrementally more aggressive rollout: first one node, then two, then four, eight, sixteen, and so on until the entire cluster is being done at once.
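The batch schedule is just exponential doubling capped at the number of remaining stale nodes; a tiny illustrative sketch (the helper name is made up):

```python
def canary_batches(stale_nodes):
    """Yield batches of stale nodes of size 1, 2, 4, 8, ... until all are covered."""
    remaining = list(stale_nodes)
    size = 1
    while remaining:
        batch, remaining = remaining[:size], remaining[size:]
        yield batch
        size *= 2

# e.g. for 11 stale nodes this yields batches of 1, 2, 4 and then the final 4.
```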
kops 1.9.0 :).
Let me know if this would be considered for merging; we might see about putting something together.
For me, an ideal solution would be for kops to understand what my critical services are and how many instances/pods I can tolerate being taken out of service. As long as those critical services remain available with a minimum number of instances, kops can recycle as many nodes as possible. PDBs can be used, but I think this goes deeper than plain pod disruption budgets. Kops could check the fleet of nodes beforehand, work out which ones can be recycled immediately, and then efficiently recycle the remaining ones without violating a PDB. As we all know, kops can hang recycling a node that hosts pods whose eviction would violate a PDB; this intelligent decision-making up front would avoid that.
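As a rough illustration of that up-front check, a sketch (again assuming the official `kubernetes` Python client; `node_is_safe_to_recycle` is a hypothetical helper) that decides per node whether every PDB covering its pods currently has disruption headroom:

```python
from kubernetes import client, config

def node_is_safe_to_recycle(node_name: str) -> bool:
    """True if every PDB covering a pod on this node currently allows at least one disruption."""
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    policy_v1 = client.PolicyV1Api()
    pods = core_v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    ).items
    pdbs = policy_v1.list_pod_disruption_budget_for_all_namespaces().items
    for pod in pods:
        labels = (pod.metadata.labels or {}).items()
        for pdb in pdbs:
            if pdb.metadata.namespace != pod.metadata.namespace:
                continue
            # Only matchLabels handled here; matchExpressions omitted for brevity.
            selector = (pdb.spec.selector.match_labels or {}) if pdb.spec.selector else {}
            if selector and selector.items() <= labels:
                if pdb.status.disruptions_allowed < 1:
                    return False
    return True
```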
In the meantime, we usually run more than one upgrade command against a cluster, picking a different instance group for each rolling deployment. That helps somewhat.
Sounds like we're saying much the same thing; the key question for me is how much room for experimentation kops will permit - no point putting experimental code forward if it's not broadly interesting to the maintainers :)
@rbtcollins I totally agree with the point that rolling-update is super slow.
I am not sure you have the same thing in mind, but another possibility is to temporarily double the number of nodes in the auto-scaling group and then... put it back to normal.
The auto-scaling group removes the oldest nodes first by default, as far as I know.
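A hedged sketch of that doubling step with boto3 (the group name and surrounding orchestration are made up; kops itself manages the ASG, so this is only to show the mechanics):

```python
import boto3

def double_asg(asg_name: str) -> int:
    """Temporarily double an ASG's desired capacity; return the original so it can be restored later."""
    autoscaling = boto3.client("autoscaling")
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    original = group["DesiredCapacity"]
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=min(original * 2, group["MaxSize"]),
        HonorCooldown=False,
    )
    return original
```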
See also these:
- Parallelize and improve rolling updates even more
- Kops Rollout Strategies - WIP
- Post from Reactive Ops (nice GIF explanation)
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.