Cloud-on-k8s: Ready checks are incompatible with Service usage and PDB usage

Created on 23 May 2019 · 8Comments · Source: elastic/cloud-on-k8s

Ready checks are currently used to decide whether a Service should include the pods, so pods are considered ready as soon as they can respond to HTTP requests.

Once we start to use PodDisruptionBudgets, readiness means that the pod is not considered disrupted. In this scenario, the safest bet is cluster state: green. Yellow might be tolerated in some scenarios (e.g 1 node cluster), but it seems this should be configurable by the user.

These two are not immediately compatible, but since readiness is the only way to indicate to the k8s admins that the cluster is disrupted, this takes precedence.

To facilitate this, we can tolerate unready pods in the service. This causes a problem: requests may be forwarded to pods that are not ready to receive requests at all. To fix this, we can use some kind of new ready label on the pods in the Service selector. This label may be controlled by the operator (e.g pinging individual nodes every X seconds), effectively moving the service-related readiness probe to the operator and separating it from the PDB-related readiness check.

Relates to: https://github.com/elastic/cloud-on-k8s/issues/914

discuss

Source

nkvoll

Most helpful comment

Based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors, it looks like we cannot use maxUnavailable when selecting pods with a non-builtin controller label (eg. not using a Deployment or StatefulSet label selector).
IIUC that is because the PDB controller would not be able to figure out what's the "expected" number of pods allowed to run. Which makes sense.

So we end up with either:

one PDB per StatefulSet. Not an option for us since we can have N StatefulSets as part of the cluster.
one PDB for all pods, but then we must use minAvailable instead of maxUnavailable.

Working with option 2, the operator needs to reconcile the correct value for minAvailable in the PDB, based on the number of nodes in the cluster.
I think we can do it this way, in the reconciliation loop:

If the cluster is green, set minAvailable to the number of nodes minus 1 (authorize one pod to be disrupted)
if the cluster is not green, set minAvailable to the number of nodes

We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our healthchecks and service usage.

Now there's of course a race condition here where the PDB would be set for a green cluster whereas the cluster is (realtime) yellow. If the disruption happens before we reconcile the PDB, then there's a risk we loose an important node. But I think that's also true when using readiness checks anyway, since they are only executed every X seconds.

Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.

Thoughts?

sebgl on 19 Sep 2019

👍2

All 8 comments

This label may be controlled by the operator (e.g pinging individual nodes every X seconds)

One disadvantage of this approach is that unavailable pods might be listed as endpoints of the service and available pods might be missing from the endpoints if the operator is not running. The current implementation allows clusters to run (and fail) independently from the failure domain of the operator (unless structural changes are being applied to the cluster via the operator, in which case they are somewhat tied together)

pebrc on 13 Aug 2019

Just to be sure I understand the concerns correctly:

Current ready check:

Does the pod respond to http requests

Both the pod disruption budget and service use the pod's ready status to decide if pod is disrupted and if the pod is ready to receive requests.

So if the pod gets disrupted, it will get marked as ready as soon as that pod is up and responding to requests. This might cause a problem because the cluster may still be yellow and be recovering (say it is replicating from a primary to get back in sync since it was last up), but from the pod disruption budget it thinks everything is fine and may disrupt the pod hosting the primary shard, which will make the cluster even sadder.

If we change the readiness check to both:

the pod responds to http requests
the cluster is green

Then essentially one pod going down (which would turn the cluster yellow), marks the whole cluster as not ready and removes them all from the service, which is Not Good.

Would it maybe make sense to mark a node as not ready if it is a target node for any active recoveries? I think that might address this in a more idiomatic way unless there's other reasons for not readiness I might be missing.

My initial thought was maybe make use of the ready++ readiness gates enhancement:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md

where we could define multiple readiness gates, but we kind of want the opposite (ready--?) if I understand correctly. We want a subset of readiness checks for service readiness, but all of them for disruption budget purposes.

anyasabo on 13 Aug 2019

👍2

Then essentially one pod going down (which would turn the cluster yellow), marks the whole cluster as not ready and removes them all from the service, which is Not Good.

I think one detail that was implied in the OP and not clear to me is that if we were to follow this approach and bind the default readiness check to cluster health we would of course need to make service endpoint membership independent of that check by using publishNotReadyAddresses: true. If that were the case then service membership would only rely on the labelling of the pods which we would then use in the selector for the service.

(the failure domain argument still applies though)

pebrc on 16 Aug 2019

Ah got it. I'm curious what states specifically will the node be replying to our readiness check as is (replying to the / endpoint with a 200), _but_ the cluster is actually not ready to have another pod disrupted.

The one I could think of was if there were active recoveries going to that node. If we were to disrupt the node with the shard it is recovering from, then if there is only one replica (the one that is getting recovered to) we may turn the cluster red.

But on the other hand, recoveries can happen in normal operations too, so I'm not sure taking a node out of the service just because it is receiving a recovery is necessarily the right choice either.

anyasabo on 16 Aug 2019

So we end up with either:

one PDB per StatefulSet. Not an option for us since we can have N StatefulSets as part of the cluster.
one PDB for all pods, but then we must use minAvailable instead of maxUnavailable.

If the cluster is green, set minAvailable to the number of nodes minus 1 (authorize one pod to be disrupted)
if the cluster is not green, set minAvailable to the number of nodes

We already poll for cluster health information regularly, and reconcile if it changes. This way we don't have to change anything in our healthchecks and service usage.

Thoughts?

sebgl on 19 Sep 2019

👍2

I like your option 2 to just use minAvailable @sebgl , though as mentioned it's not a direct function of the replicas in the spec, but rather a function of the current pods we should have (say we are migrating to a new sset), so it is more complicated than it appears at first.

Since the PDB would be somewhat "dynamic", we cannot offer to have it specified by the user anymore. Instead we could make it optional (managePDB: false), so users can create their own outside the scope of the operator.

I think this can be left as an option to add later. This seems like a relatively advanced use case and I'm not sure it's something we need to expose just yet.

anyasabo on 19 Sep 2019

I think we have a decision at this point to

change the readiness checks to use a node local API. So pod readiness is not conflated with cluster availability anymore
make the PDB use minAvailable and make it dynamic and bound to cluster health in the way @sebgl has described and thereby sidestepping the problem in the OP

pebrc on 19 Sep 2019

PodDisruptionBudgets are immutable in k8s < 1.15 (PR that adds mutability in 1.15).
Which means we'll have to delete then recreate (at least for k8s < 1.15), which leaves a small time window where things can go wrong.

sebgl on 20 Sep 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Reconciliation is blocked when a Pod can't be created

barkbay · 5Comments

Custom elasticsearch configuration option should be available

Sakib37 · 3Comments

ARM e2e run sometimes ignores build constraints

pebrc · 5Comments

Using self-signed certificates for filebeat

SebastianCaceresUltra · 3Comments

Status subresource updates fail when the crd version changes

sebgl · 3Comments