Ready checks are currently used to decide whether a Service should include the pods, so pods are considered ready as soon as they can respond to HTTP requests.
Once we start to use PodDisruptionBudgets, readiness means that the pod is not considered disrupted. In this scenario, the safest bet is cluster state: green. Yellow might be tolerated in some scenarios (e.g 1 node cluster), but it seems this should be configurable by the user.
These two are not immediately compatible, but since readiness is the only way to indicate to the k8s admins that the cluster is disrupted, this takes precedence.
To facilitate this, we can tolerate unready pods in the service. This causes a problem: requests may be forwarded to pods that are not ready to receive requests at all. To fix this, we can use some kind of new ready label on the pods in the Service selector. This label may be controlled by the operator (e.g pinging individual nodes every X seconds), effectively moving the service-related readiness probe to the operator and separating it from the PDB-related readiness check.
Relates to: https://github.com/elastic/cloud-on-k8s/issues/914
This label may be controlled by the operator (e.g pinging individual nodes every X seconds)
One disadvantage of this approach is that unavailable pods might be listed as endpoints of the service and available pods might be missing from the endpoints if the operator is not running. The current implementation allows clusters to run (and fail) independently from the failure domain of the operator (unless structural changes are being applied to the cluster via the operator, in which case they are somewhat tied together)
Just to be sure I understand the concerns correctly:
Current ready check:
Both the pod disruption budget and service use the pod's ready status to decide if pod is disrupted and if the pod is ready to receive requests.
So if the pod gets disrupted, it will get marked as ready as soon as that pod is up and responding to requests. This might cause a problem because the cluster may still be yellow and be recovering (say it is replicating from a primary to get back in sync since it was last up), but from the pod disruption budget it thinks everything is fine and may disrupt the pod hosting the primary shard, which will make the cluster even sadder.
If we change the readiness check to both:
Then essentially one pod going down (which would turn the cluster yellow), marks the whole cluster as not ready and removes them all from the service, which is Not Good.
Would it maybe make sense to mark a node as not ready if it is a target node for any active recoveries? I think that might address this in a more idiomatic way unless there's other reasons for not readiness I might be missing.
My initial thought was maybe make use of the ready++ readiness gates enhancement:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md
where we could define multiple readiness gates, but we kind of want the opposite (ready--?) if I understand correctly. We want a subset of readiness checks for service readiness, but all of them for disruption budget purposes.
Then essentially one pod going down (which would turn the cluster yellow), marks the whole cluster as not ready and removes them all from the service, which is Not Good.
I think one detail that was implied in the OP and not clear to me is that if we were to follow this approach and bind the default readiness check to cluster health we would of course need to make service endpoint membership independent of that check by using publishNotReadyAddresses: true. If that were the case then service membership would only rely on the labelling of the pods which we would then use in the selector for the service.
(the failure domain argument still applies though)
Ah got it. I'm curious what states specifically will the node be replying to our readiness check as is (replying to the / endpoint with a 200), _but_ the cluster is actually not ready to have another pod disrupted.
The one I could think of was if there were active recoveries going to that node. If we were to disrupt the node with the shard it is recovering from, then if there is only one replica (the one that is getting recovered to) we may turn the cluster red.
But on the other hand, recoveries can happen in normal operations too, so I'm not sure taking a node out of the service just because it is receiving a recovery is necessarily the right choice either.
Based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors, it looks like we cannot use maxUnavailable when selecting pods with a non-builtin controller label (eg. not using a Deployment or StatefulSet label selector).
IIUC that is because the PDB controller would not be able to figure out what's the "expected" number of pods allowed to run. Which makes sense.
So we end up with either:
minAvailable instead of maxUnavailable.Working with option 2, the operator needs to reconcile the correct value for minAvailable in the PDB, based on the number of nodes in the cluster.
I think we can do it this way, in the reconciliation loop:
We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our healthchecks and service usage.
Now there's of course a race condition here where the PDB would be set for a green cluster whereas the cluster is (realtime) yellow. If the disruption happens before we reconcile the PDB, then there's a risk we loose an important node. But I think that's also true when using readiness checks anyway, since they are only executed every X seconds.
Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.
Thoughts?
I like your option 2 to just use minAvailable @sebgl , though as mentioned it's not a direct function of the replicas in the spec, but rather a function of the current pods we should have (say we are migrating to a new sset), so it is more complicated than it appears at first.
Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.
I think this can be left as an option to add later. This seems like a relatively advanced use case and I'm not sure it's something we need to expose just yet.
I think we have a decision at this point to
minAvailable and make it dynamic and bound to cluster health in the way @sebgl has described and thereby sidestepping the problem in the OP PodDisruptionBudgets are immutable in k8s < 1.15 (PR that adds mutability in 1.15).
Which means we'll have to delete then recreate (at least for k8s < 1.15), which leaves a small time window where things can go wrong.
Most helpful comment
Based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors, it looks like we cannot use
maxUnavailablewhen selecting pods with a non-builtin controller label (eg. not using a Deployment or StatefulSet label selector).IIUC that is because the PDB controller would not be able to figure out what's the "expected" number of pods allowed to run. Which makes sense.
So we end up with either:
minAvailableinstead ofmaxUnavailable.Working with option 2, the operator needs to reconcile the correct value for
minAvailablein the PDB, based on the number of nodes in the cluster.I think we can do it this way, in the reconciliation loop:
We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our healthchecks and service usage.
Now there's of course a race condition here where the PDB would be set for a green cluster whereas the cluster is (realtime) yellow. If the disruption happens before we reconcile the PDB, then there's a risk we loose an important node. But I think that's also true when using readiness checks anyway, since they are only executed every X seconds.
Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (
managePDB: false), so users can create their own outside the scope of the operator.Thoughts?