While playing around and reviewing the statefulset branch I noticed that it is not possible to do a rolling update on several Pods at the same time.
As stated here:
It [StatefulSet controller] will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time. It will wait until an updated Pod is Running and Ready prior to updating its predecessor.
An other problem is that if a Pod can't be terminated _(e.g. in case of a network partition)_ all the update process is stuck, only Pods with a higher cardinal order can be updated.
A workaround could be for the Controller to delete the Pod when it's ready for an update with the OnDelete update policy.
Would be glad to help on this one in order to familiarize myself with the statefulset branch.
See also this KEP: https://github.com/kubernetes/enhancements/pull/678
Can you clarify what you're proposing? Would it just be to use maxUnavailable (or implement our own using OnDelete)?
I'm not sure we have a lot of options here, if we can't rely on the RollingUpdate strategy I guess we have to use OnDelete
The following algorithm could be applied:
Statefulset with the OnDelete strategyStatefulsetsPod in Unknown state from the list and remove them from the budget. Maybe the Kubelet crashed or is unresponsive or the K8S node is suffering from network issues... I'm not sure we can do something about them...FailedPendingRunning Pods with non nil ContainerStateWaiting or ContainerStateTerminated, it could mean that a container inside the Pod is crashing, might be a good idea to update it first to fix a configuration issue.Running Pods3. While maxConcurrentUpgrades is not exceeded:
1. Pick the Pod at the top of the list
2. Delete it
A few thoughts:
maxUnavailable concept to StatefulSet's one.OnDelete is specified as legacy. I don't think that means it won't be supported anymore at some point, but does not seem to be "the intended way" of running StatefulSets.So I guess we have a trade-off between:
RollingUpdate (current code), which relies a bit more on the StatefulSet controller (we don't manage pods at all), but does not give us much flexibility around removing any pod we want in case one pod is completely stuck/unavailable.OnDelete, which relies a bit less on the StatefulSet controller (we do need to inspect pod status and delete them), but gives us more flexibility around removing any pod we want. Including, potentially, doing upgrades in any order that suits us (not ordinal-based).I'd love to hear @nkvoll thoughts about this. I initially intended to go the OnDelete way but discussing it with other folks made me change my mind (RollingUpdate partitions sounded easily feasible and more k8s-native, OnDelete has a legacy "warning").
- If a pod is in an unknown/terminating transient state (eg. k8s node down), maybe it makes sense to "pause" any rolling update we're doing until the situation is sorted out by the user?
I agree that maybe there's some situation where it is probably fine to wait, but it sounds a little bit odd to not be able to apply a new configuration just because one Pod can't be terminated.
- This problem occurs per-statefulset (not per cluster): we're stuck with upgrading one statefulset (not the others).
It's perfectly right, just want to mention I have been working on servers with some very high Pod density, mostly because in my previous shop we had a per socket licensing. In this case a network issue could affect a lot of statefulset.
- If we keep a healthcheck condition in between each node of a rolling upgrade (eg. check cluster green, or check other copies of the shards exist on other nodes), it may be hard to remove more than one node per StatefulSet at once, since we need to redo this check after the node is removed.
- If we delete pods ourselves, we need to reintroduce expectations on pods deletion (don't delete a pod that was just recreated if our cache of the pod is out of date).
What about using a Precondition when deleting ?
var uid types.UID = "xxxxxxxxxxxxx"
err = d.Client.Delete(&pod, func(options *client.DeleteOptions) {
if options.Preconditions == nil {
options.Preconditions = &v1.Preconditions{}
}
options.Preconditions.UID = &uid
})
if errors.IsConflict(err) {
// Cache is not up to date
return results.WithResult(defaultRequeue)
}
if err != nil {
// Looks like a real error
return results.WithError(err)
}
- We could wait until kubernetes/enhancements#1010 is supported officially to map our
maxUnavailableconcept to StatefulSet's one.
IIUC the KEP has not been validated, not sure when it will be available and if it will fit our needs.
What about using a Precondition when deleting ?
Wow. TIL. Great feature!
What about using a Precondition when deleting ?
That was new to me too, looks like this also added support for specifying an RV (though shouldn't be necessary for us). But cool that that's seeing some love:
https://github.com/kubernetes/kubernetes/issues/73648
It's perfectly right, just want to mention I have been working on servers with some very high Pod density, mostly because in my previous shop we had a per socket licensing. In this case a network issue could affect a lot of statefulset.
I'm not sure we are going to see many individual ES clusters that have many ssets though, so I think we should be okay. I'd imagine most people will have separate ssets per role (if they have multiple at all)?
If we keep a healthcheck condition in between each node of a rolling upgrade (eg. check cluster green, or check other copies of the shards exist on other nodes), it may be hard to remove more than one node per StatefulSet at once, since we need to redo this check after the node is removed.
I'm curious how we handle this today in cloud. The simplest path definitely seems to be only deleting one node at a time. Otherwise safety checks seem like they could get quite complicated. Essentially, how could we determine a safe maxUnavailable?
Would we have to know that say, all indices with shards on a set of candidate pods have >=3 replicas (and say, set maxUnavailable = 2)? Even if they do, are we still okay with introducing a single point of failure by removing 2 out of 3 replicas? If I'm missing something obvious here please let me know. Maybe this is something we already solved.
good discussion, I am assuming this is for Elastic Cloud on K8s based. It would be good to know why you need this feature ? Is it mainly because there are many nodes and you want speed in updating or something else ?
Do the options described here meet your needs ?
@barkbay what do you mean by high pod density ? Also which application are you referring to ?
Also if you have other suggestion on making statefulset more useful, you can also bring up your case by opening an issue in k/k or joining the weekly sig-apps meetings.
HTH
I would be curious to know what the arguments were in favour of using partitions over OnDelete @sebgl as it affects the decision on this proposal and I cannot recall a good argument for it. It seems to be the more complex way of doing things.
@anyasabo
Would we have to know that say, all indices with shards on a set of candidate pods have >=3 replicas (and say, set maxUnavailable = 2)? Even if they do, are we still okay with introducing a single point of failure by removing 2 out of 3 replicas?
I think that's the only thing we can do, yes. It's about:
Could be 2 different settings the user may want to tweak actually.
@krmayankk thanks for chiming in, your proposal around maxUnavailable is very interesting.
I'm wondering if that would be enough to solve our situation though. So far we have a global maxUnavailable setting per Elasticsearch cluster, to express how many Elasticsearch nodes can be shut-down/restarted simultaneously.
A single cluster is made of several StatefulSets (node subsets run with different options and storage requirements).
Having a per-statefulset maxUnavailable would help with that, but I think we'd still have to reconcile it with a global cross-StatefulSets maxUnavailable. That maybe be quite complex to implement on our side, and requires updating StatefulSets maxUnavailable setting over and over again depending on the state of the cluster. For example, the first StatefulSet may be allowed maxUnavailable=3 at one time and the second maxUnavailable=0, but that could change over time.
@pebrc I think we don't have very good arguments not to use OnDelete. Here are some arguments for using RollingUpdate instead of OnDelete that can be discussed and refuted:
RollingUpdate looks more like the "new/recommended" way of handling StatefulSets, whereas OnDelete is marked as legacy. Counter-argument: I don't think that means OnDelete is going to be deprecated any time soon though. It is still the default value.RollingUpdate allows us to only interact (in the sense: create/update/delete) with StatefulSets resources. We only interact with the partition ordinal, and don't need to interact with pods at all. Which looks cleaner on paper IMHO. Counter-argument: OnDelete would make us request pods deletion, but that's really not a big deal (we interact with so many k8s resources already).OnDelete, we'd probably delete pods in a given order for consistency. Picking highest to lowest ordinal seems to be a good ordering convention. If we end up doing that, we might as well rely on RollingUpdate since this is what it does. Counter-argument: we still need to iterate over pods ordinal with our RollingUpdate implementation.RollingUpdate does for us already. Counter-argument: we still need to iterate over pod status/shards migration/etc. to allow a pod to be restarted.RollingUpdate does not forbid us to delete a pod if we need to (which is going to be recreated). Counter-argument: if we start doing that/need to do that, we're erring towards the side of OnDelete.minimum_master_nodes. A 3-replicas StatefulSet of master nodes does not mean there are 3 pods running. Pods 4 and 5 may actually be in the process of being terminated, but still be part of the cluster. To update minimum_master_nodes correctly, we need to account for these 2 nodes being terminated, we cannot just rely on the StatefulSet spec. So we do need to inspect the state of Pods running in the cluster. Now here are some arguments in favour of using OnDelete instead of RollingUpdate:
maxUnavailable setting.To summarise, I think I'm +1 with @barkbay suggestion of relying on OnDelete instead of RollingUpdate.
Counter-argument: we still need to iterate over pod status/shards migration/etc. to allow a pod to be restarted.
We could build this into the readiness check, correct? I haven't thought that through too much yet so might be missing something obvious.
That said I think I'm on board with using on delete, but that mostly stems from thinking that deleting one pod at a time is the way to go now. It seems like deleting multiple pods at a time is _significantly_ more complicated and is an optimization that can wait. So we do not really get much benefit from using partitions instead of just deleting pods ourselves.
Could be 2 different settings the user may want to tweak actually.
I would be 馃憤 on this idea. My main concern is to let the user operate the cluster even if some nodes are down. AFAIK a cluster can be green even if some Pods are in an Unknown state or stuck in a Terminating / Pending state.
Could be:
minAvailableNodes / maxUnavailableNodes
minAvailableReplicas / maxUnavailableReplicas
May be we can reuse a part of the syntax of the Pdb, the problem is that some values could be in conflict with a rolling restart strategy...
As a side note, IIRC Pdb will not help in this case.
Handling Pod deletion correctly could be a lot of logic as you have already noted above which is already built into the StatefulSets Rolling Update logic. Also when the image changes, you get Rolling Update of a StatefulSet. So my initial thinking is to use StatefulSet's RollingUpdate with partition and the upcoming maxUnavailable feature.
Its essentially an architectural decision which requires knowledge of how a big Elastic Search Cluster works and what knobs does it need to be easily upgradeable. If you envision that the current StatefulSet RollingUpdate feature with its partition logic is going to limit you in some ways, then going for OnDelete makes more sense, since its the most highly customizable and powerful way of implementing any kind of Rolling Update across many StatefulSets.
That said, if this is your first attempt to create a Elastic Search Cluster CRD, I would start simple and use StatefulSets Rolling Update until it doesnt work. If a User specifies a maxUnavailable per cluster greater than 1, you could essentially issue a update on maxUnavailable number of StatefulSets. You would also need to determine what your failure domains across statefulsets are.
Nonetheless, we want to make sure that StatefulSet in Kubernetes remains as the atomic building block for implementing complex stateful applications, whether through OnDelete or through RollingUpdate.
This discussion is interesting and I would suggest knocking on sig-apps Monday meeting or bounce this on #sig-apps on kubernetes slack.
I have a similar need, we have a huge number of statefulset with a very large replica count. Each replica when restarted takes 30-60 minutes to come up ( it has to load and merge many files ). Doing it one at a time for our usecase takes a lot of time and impractical. Right now, we developed an external piece of software that based on a configurable strategy deletes the underlying pods ( i.e we are using OnDelete on the STS). The strategy could be 25%, 50% of the replica etc or divide the replicas in groups of X and then for for each group delete that many replicas at a time.
Discussed: tie rollout to change budget and cluster health. I.e. currentUnvailable < maxUnavailable and green health allows to proceed with upgrades beyond just one pod at a time
Closed by #1683
Most helpful comment
@pebrc I think we don't have very good arguments not to use
OnDelete. Here are some arguments for usingRollingUpdateinstead ofOnDeletethat can be discussed and refuted:RollingUpdatelooks more like the "new/recommended" way of handling StatefulSets, whereasOnDeleteis marked aslegacy. Counter-argument: I don't think that meansOnDeleteis going to be deprecated any time soon though. It is still the default value.RollingUpdateallows us to only interact (in the sense: create/update/delete) with StatefulSets resources. We only interact with thepartitionordinal, and don't need to interact with pods at all. Which looks cleaner on paper IMHO. Counter-argument:OnDeletewould make us request pods deletion, but that's really not a big deal (we interact with so many k8s resources already).OnDelete, we'd probably delete pods in a given order for consistency. Picking highest to lowest ordinal seems to be a good ordering convention. If we end up doing that, we might as well rely onRollingUpdatesince this is what it does. Counter-argument: we still need to iterate over pods ordinal with ourRollingUpdateimplementation.RollingUpdatedoes for us already. Counter-argument: we still need to iterate over pod status/shards migration/etc. to allow a pod to be restarted.RollingUpdatedoes not forbid us to delete a pod if we need to (which is going to be recreated). Counter-argument: if we start doing that/need to do that, we're erring towards the side ofOnDelete.minimum_master_nodes. A 3-replicas StatefulSet of master nodes does not mean there are 3 pods running. Pods 4 and 5 may actually be in the process of being terminated, but still be part of the cluster. To updateminimum_master_nodescorrectly, we need to account for these 2 nodes being terminated, we cannot just rely on the StatefulSet spec. So we do need to inspect the state of Pods running in the cluster.Now here are some arguments in favour of using
OnDeleteinstead ofRollingUpdate:maxUnavailablesetting.To summarise, I think I'm +1 with @barkbay suggestion of relying on
OnDeleteinstead ofRollingUpdate.