Cloud-on-k8s: Use OnDelete instead of RollingUpdate to roll more than 1 pod at a time

Created on 31 Jul 2019 · 20Comments · Source: elastic/cloud-on-k8s

While playing around and reviewing the statefulset branch I noticed that it is not possible to do a rolling update on several Pods at the same time.

As stated here:

It [StatefulSet controller] will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time. It will wait until an updated Pod is Running and Ready prior to updating its predecessor.

An other problem is that if a Pod can't be terminated _(e.g. in case of a network partition)_ all the update process is stuck, only Pods with a higher cardinal order can be updated.

A workaround could be for the Controller to delete the Pod when it's ready for an update with the OnDelete update policy.

v1.0.0-beta1

Source

barkbay

Most helpful comment

@pebrc I think we don't have very good arguments not to use OnDelete. Here are some arguments for using RollingUpdate instead of OnDelete that can be discussed and refuted:

RollingUpdate looks more like the "new/recommended" way of handling StatefulSets, whereas OnDelete is marked as legacy. Counter-argument: I don't think that means OnDelete is going to be deprecated any time soon though. It is still the default value.
Using RollingUpdate allows us to only interact (in the sense: create/update/delete) with StatefulSets resources. We only interact with the partition ordinal, and don't need to interact with pods at all. Which looks cleaner on paper IMHO. Counter-argument: OnDelete would make us request pods deletion, but that's really not a big deal (we interact with so many k8s resources already).
If we had to deal with pods deletion, we'd have to deal with stale resources cache (what if we re-delete - for updates - a pod that actually just correctly restarted?). We need to reintroduce pods expectations here. Counter-argument: if we only have to deal with deletions, we can use Preconditions as Michael proposed.
Even if we were to use OnDelete, we'd probably delete pods in a given order for consistency. Picking highest to lowest ordinal seems to be a good ordering convention. If we end up doing that, we might as well rely on RollingUpdate since this is what it does. Counter-argument: we still need to iterate over pods ordinal with our RollingUpdate implementation.
In the same spirit, we'd have to check status of pods before upgrading the next one. This is something RollingUpdate does for us already. Counter-argument: we still need to iterate over pod status/shards migration/etc. to allow a pod to be restarted.
Using RollingUpdate does not forbid us to delete a pod if we need to (which is going to be recreated). Counter-argument: if we start doing that/need to do that, we're erring towards the side of OnDelete.
I originally thought we'd not have to care about the status of pods so much, but only about the status of the StatefulSet. Counter-argument: turns out that's not true, for example when dealing with downscales and updating ES minimum_master_nodes. A 3-replicas StatefulSet of master nodes does not mean there are 3 pods running. Pods 4 and 5 may actually be in the process of being terminated, but still be part of the cluster. To update minimum_master_nodes correctly, we need to account for these 2 nodes being terminated, we cannot just rely on the StatefulSet spec. So we do need to inspect the state of Pods running in the cluster.

Now here are some arguments in favour of using OnDelete instead of RollingUpdate:

It gives us the flexibility to restart/recreate any pod we want, ignoring any ordering consideration. This may unlock features such as users annotating a pod to be rescheduled (we'd take care of preparing a clean deletion), and deleting more than 1 pod at once in a given StatefulSet, which is the original purpose of this issue.
I think it would simplify our code a little bit if we don't need to manage StatefulSet ordinals. There's no setting we'd need to persist in the apiserver anymore, we'd just request pods deletion when we need to.
It's easier to manage a cross-StatefulSets maxUnavailable setting.

To summarise, I think I'm +1 with @barkbay suggestion of relying on OnDelete instead of RollingUpdate.

sebgl on 8 Aug 2019

❤2 👍1

All 20 comments

Would be glad to help on this one in order to familiarize myself with the statefulset branch.

barkbay on 31 Jul 2019

Can you clarify what you're proposing? Would it just be to use maxUnavailable (or implement our own using OnDelete)?

anyasabo on 31 Jul 2019

I'm not sure we have a lot of options here, if we can't rely on the RollingUpdate strategy I guess we have to use OnDelete

The following algorithm could be applied:

Create Statefulset with the OnDelete strategy
If an update is needed:
1. Update the Statefulsets
2. For each Statefulset: _(should we start with Statefulset that contain some masters?)_
  1. List all the pods which are not up-to-date
  2. Remove Pod in Unknown state from the list and remove them from the budget. Maybe the Kubelet crashed or is unresponsive or the K8S node is suffering from network issues... I'm not sure we can do something about them...
  3. If the list contains some masters put the current one at the bottom of the list to give it less priority
  4. Sort the remaining Pods according to the status with the following priority:
    - Failed
    - Pending
    - Running Pods with non nil ContainerStateWaiting or ContainerStateTerminated, it could mean that a container inside the Pod is crashing, might be a good idea to update it first to fix a configuration issue.
    - Remaining Running Pods

3. While maxConcurrentUpgrades is not exceeded:
    1. Pick the Pod at the top of the list
    2. Delete it

barkbay on 1 Aug 2019

A few thoughts:

If a pod is in an unknown/terminating transient state (eg. k8s node down), maybe it makes sense to "pause" any rolling update we're doing until the situation is sorted out by the user? (Either bring the node back online, either manually delete the pod we should not care about anymore). There are chances the pod down may hold the last copy of some data, or is the single master node of the cluster. Might be better to err on the safe side of things here?
This problem occurs per-statefulset (not per cluster): we're stuck with upgrading one statefulset (not the others).
If we keep a healthcheck condition in between each node of a rolling upgrade (eg. check cluster green, or check other copies of the shards exist on other nodes), it may be hard to remove more than one node per StatefulSet at once, since we need to redo this check after the node is removed.
If we delete pods ourselves, we need to reintroduce expectations on pods deletion (don't delete a pod that was just recreated if our cache of the pod is out of date).
We could wait until https://github.com/kubernetes/enhancements/pull/1010 is supported officially to map our maxUnavailable concept to StatefulSet's one.
OnDelete is specified as legacy. I don't think that means it won't be supported anymore at some point, but does not seem to be "the intended way" of running StatefulSets.

So I guess we have a trade-off between:

Use RollingUpdate (current code), which relies a bit more on the StatefulSet controller (we don't manage pods at all), but does not give us much flexibility around removing any pod we want in case one pod is completely stuck/unavailable.
Use OnDelete, which relies a bit less on the StatefulSet controller (we do need to inspect pod status and delete them), but gives us more flexibility around removing any pod we want. Including, potentially, doing upgrades in any order that suits us (not ordinal-based).

sebgl on 1 Aug 2019

I'd love to hear @nkvoll thoughts about this. I initially intended to go the OnDelete way but discussing it with other folks made me change my mind (RollingUpdate partitions sounded easily feasible and more k8s-native, OnDelete has a legacy "warning").

sebgl on 1 Aug 2019

If a pod is in an unknown/terminating transient state (eg. k8s node down), maybe it makes sense to "pause" any rolling update we're doing until the situation is sorted out by the user?

I agree that maybe there's some situation where it is probably fine to wait, but it sounds a little bit odd to not be able to apply a new configuration just because one Pod can't be terminated.

This problem occurs per-statefulset (not per cluster): we're stuck with upgrading one statefulset (not the others).

It's perfectly right, just want to mention I have been working on servers with some very high Pod density, mostly because in my previous shop we had a per socket licensing. In this case a network issue could affect a lot of statefulset.

If we keep a healthcheck condition in between each node of a rolling upgrade (eg. check cluster green, or check other copies of the shards exist on other nodes), it may be hard to remove more than one node per StatefulSet at once, since we need to redo this check after the node is removed.

If we delete pods ourselves, we need to reintroduce expectations on pods deletion (don't delete a pod that was just recreated if our cache of the pod is out of date).

What about using a Precondition when deleting ?

    var uid types.UID = "xxxxxxxxxxxxx"
    err = d.Client.Delete(&pod, func(options *client.DeleteOptions) {
        if options.Preconditions == nil {
            options.Preconditions = &v1.Preconditions{}
        }
        options.Preconditions.UID = &uid
    })
    if errors.IsConflict(err) {
                // Cache is not up to date
        return results.WithResult(defaultRequeue)
    }
    if err != nil {
               // Looks like a real error
        return results.WithError(err)
    }

We could wait until kubernetes/enhancements#1010 is supported officially to map our maxUnavailable concept to StatefulSet's one.

IIUC the KEP has not been validated, not sure when it will be available and if it will fit our needs.

barkbay on 1 Aug 2019

What about using a Precondition when deleting ?

Wow. TIL. Great feature!

sebgl on 1 Aug 2019

What about using a Precondition when deleting ?

That was new to me too, looks like this also added support for specifying an RV (though shouldn't be necessary for us). But cool that that's seeing some love:
https://github.com/kubernetes/kubernetes/issues/73648

It's perfectly right, just want to mention I have been working on servers with some very high Pod density, mostly because in my previous shop we had a per socket licensing. In this case a network issue could affect a lot of statefulset.

I'm not sure we are going to see many individual ES clusters that have many ssets though, so I think we should be okay. I'd imagine most people will have separate ssets per role (if they have multiple at all)?

If we keep a healthcheck condition in between each node of a rolling upgrade (eg. check cluster green, or check other copies of the shards exist on other nodes), it may be hard to remove more than one node per StatefulSet at once, since we need to redo this check after the node is removed.

I'm curious how we handle this today in cloud. The simplest path definitely seems to be only deleting one node at a time. Otherwise safety checks seem like they could get quite complicated. Essentially, how could we determine a safe maxUnavailable?

Would we have to know that say, all indices with shards on a set of candidate pods have >=3 replicas (and say, set maxUnavailable = 2)? Even if they do, are we still okay with introducing a single point of failure by removing 2 out of 3 replicas? If I'm missing something obvious here please let me know. Maybe this is something we already solved.

anyasabo on 2 Aug 2019

good discussion, I am assuming this is for Elastic Cloud on K8s based. It would be good to know why you need this feature ? Is it mainly because there are many nodes and you want speed in updating or something else ?
Do the options described here meet your needs ?

@barkbay what do you mean by high pod density ? Also which application are you referring to ?
Also if you have other suggestion on making statefulset more useful, you can also bring up your case by opening an issue in k/k or joining the weekly sig-apps meetings.
HTH

krmayankk on 8 Aug 2019

I would be curious to know what the arguments were in favour of using partitions over OnDelete @sebgl as it affects the decision on this proposal and I cannot recall a good argument for it. It seems to be the more complex way of doing things.

pebrc on 8 Aug 2019

@anyasabo

Would we have to know that say, all indices with shards on a set of candidate pods have >=3 replicas (and say, set maxUnavailable = 2)? Even if they do, are we still okay with introducing a single point of failure by removing 2 out of 3 replicas?

I think that's the only thing we can do, yes. It's about:

how many replicas we accept to be out of sync during a rolling upgrade (may depend on the total number of replicas?), to make sure we don't affect data availability too much
how many nodes we accept to be down during a rolling upgrade, to make sure we don't affect cluster performance too much

Could be 2 different settings the user may want to tweak actually.

sebgl on 8 Aug 2019

@krmayankk thanks for chiming in, your proposal around maxUnavailable is very interesting.

I'm wondering if that would be enough to solve our situation though. So far we have a global maxUnavailable setting per Elasticsearch cluster, to express how many Elasticsearch nodes can be shut-down/restarted simultaneously.
A single cluster is made of several StatefulSets (node subsets run with different options and storage requirements).
Having a per-statefulset maxUnavailable would help with that, but I think we'd still have to reconcile it with a global cross-StatefulSets maxUnavailable. That maybe be quite complex to implement on our side, and requires updating StatefulSets maxUnavailable setting over and over again depending on the state of the cluster. For example, the first StatefulSet may be allowed maxUnavailable=3 at one time and the second maxUnavailable=0, but that could change over time.

sebgl on 8 Aug 2019

@pebrc I think we don't have very good arguments not to use OnDelete. Here are some arguments for using RollingUpdate instead of OnDelete that can be discussed and refuted:

RollingUpdate looks more like the "new/recommended" way of handling StatefulSets, whereas OnDelete is marked as legacy. Counter-argument: I don't think that means OnDelete is going to be deprecated any time soon though. It is still the default value.
Using RollingUpdate allows us to only interact (in the sense: create/update/delete) with StatefulSets resources. We only interact with the partition ordinal, and don't need to interact with pods at all. Which looks cleaner on paper IMHO. Counter-argument: OnDelete would make us request pods deletion, but that's really not a big deal (we interact with so many k8s resources already).
If we had to deal with pods deletion, we'd have to deal with stale resources cache (what if we re-delete - for updates - a pod that actually just correctly restarted?). We need to reintroduce pods expectations here. Counter-argument: if we only have to deal with deletions, we can use Preconditions as Michael proposed.
Even if we were to use OnDelete, we'd probably delete pods in a given order for consistency. Picking highest to lowest ordinal seems to be a good ordering convention. If we end up doing that, we might as well rely on RollingUpdate since this is what it does. Counter-argument: we still need to iterate over pods ordinal with our RollingUpdate implementation.
In the same spirit, we'd have to check status of pods before upgrading the next one. This is something RollingUpdate does for us already. Counter-argument: we still need to iterate over pod status/shards migration/etc. to allow a pod to be restarted.
Using RollingUpdate does not forbid us to delete a pod if we need to (which is going to be recreated). Counter-argument: if we start doing that/need to do that, we're erring towards the side of OnDelete.
I originally thought we'd not have to care about the status of pods so much, but only about the status of the StatefulSet. Counter-argument: turns out that's not true, for example when dealing with downscales and updating ES minimum_master_nodes. A 3-replicas StatefulSet of master nodes does not mean there are 3 pods running. Pods 4 and 5 may actually be in the process of being terminated, but still be part of the cluster. To update minimum_master_nodes correctly, we need to account for these 2 nodes being terminated, we cannot just rely on the StatefulSet spec. So we do need to inspect the state of Pods running in the cluster.

Now here are some arguments in favour of using OnDelete instead of RollingUpdate:

It gives us the flexibility to restart/recreate any pod we want, ignoring any ordering consideration. This may unlock features such as users annotating a pod to be rescheduled (we'd take care of preparing a clean deletion), and deleting more than 1 pod at once in a given StatefulSet, which is the original purpose of this issue.
I think it would simplify our code a little bit if we don't need to manage StatefulSet ordinals. There's no setting we'd need to persist in the apiserver anymore, we'd just request pods deletion when we need to.
It's easier to manage a cross-StatefulSets maxUnavailable setting.

To summarise, I think I'm +1 with @barkbay suggestion of relying on OnDelete instead of RollingUpdate.

sebgl on 8 Aug 2019

❤2 👍1

Counter-argument: we still need to iterate over pod status/shards migration/etc. to allow a pod to be restarted.

We could build this into the readiness check, correct? I haven't thought that through too much yet so might be missing something obvious.

That said I think I'm on board with using on delete, but that mostly stems from thinking that deleting one pod at a time is the way to go now. It seems like deleting multiple pods at a time is _significantly_ more complicated and is an optimization that can wait. So we do not really get much benefit from using partitions instead of just deleting pods ourselves.

anyasabo on 8 Aug 2019

Could be 2 different settings the user may want to tweak actually.

I would be 👍 on this idea. My main concern is to let the user operate the cluster even if some nodes are down. AFAIK a cluster can be green even if some Pods are in an Unknown state or stuck in a Terminating / Pending state.

Could be:

minAvailableNodes / maxUnavailableNodes
minAvailableReplicas / maxUnavailableReplicas

May be we can reuse a part of the syntax of the Pdb, the problem is that some values could be in conflict with a rolling restart strategy...

As a side note, IIRC Pdb will not help in this case.

barkbay on 9 Aug 2019

Handling Pod deletion correctly could be a lot of logic as you have already noted above which is already built into the StatefulSets Rolling Update logic. Also when the image changes, you get Rolling Update of a StatefulSet. So my initial thinking is to use StatefulSet's RollingUpdate with partition and the upcoming maxUnavailable feature.

Its essentially an architectural decision which requires knowledge of how a big Elastic Search Cluster works and what knobs does it need to be easily upgradeable. If you envision that the current StatefulSet RollingUpdate feature with its partition logic is going to limit you in some ways, then going for OnDelete makes more sense, since its the most highly customizable and powerful way of implementing any kind of Rolling Update across many StatefulSets.

That said, if this is your first attempt to create a Elastic Search Cluster CRD, I would start simple and use StatefulSets Rolling Update until it doesnt work. If a User specifies a maxUnavailable per cluster greater than 1, you could essentially issue a update on maxUnavailable number of StatefulSets. You would also need to determine what your failure domains across statefulsets are.

Nonetheless, we want to make sure that StatefulSet in Kubernetes remains as the atomic building block for implementing complex stateful applications, whether through OnDelete or through RollingUpdate.

This discussion is interesting and I would suggest knocking on sig-apps Monday meeting or bounce this on #sig-apps on kubernetes slack.

krmayankk on 10 Aug 2019

I have a similar need, we have a huge number of statefulset with a very large replica count. Each replica when restarted takes 30-60 minutes to come up ( it has to load and merge many files ). Doing it one at a time for our usecase takes a lot of time and impractical. Right now, we developed an external piece of software that based on a configurable strategy deletes the underlying pods ( i.e we are using OnDelete on the STS). The strategy could be 25%, 50% of the replica etc or divide the replicas in groups of X and then for for each group delete that many replicas at a time.

mohnishkodnani on 10 Aug 2019

Discussed: tie rollout to change budget and cluster health. I.e. currentUnvailable < maxUnavailable and green health allows to proceed with upgrades beyond just one pod at a time

pebrc on 22 Aug 2019

Closed by #1683

barkbay on 11 Sep 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Revisit config struct

anyasabo · 3Comments

Enforce Elastic stack upgrade order

pebrc · 3Comments

Status subresource updates fail when the crd version changes

sebgl · 3Comments

Reconciliation is blocked when a Pod can't be created

barkbay · 5Comments

Support multi-namespace watches

sebgl · 3Comments