We have a k8s readinessProbe defined, which means we declare the hub "Ready" to receive traffic only once it has passed the test of responding to an HTTP request. Currently this is used only as an indicator of successful startup. For example, the CI tests should not try to send traffic to the hub pod before the hub pod has started.
NOTE: there is also a livenessProbe; if that one fails, it triggers a restart of the container in the pod.
To make a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Due to this, the setting of the readinessProbe should be such that the hub pod becomes ready when it should be, but that it doesn't ever fail because it was slow for a while.
Instead of disabling the readinessProbe entirely, I think we should try to make it never fail. This can be done using failureThreshold, for example: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#probe-v1-core
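For illustration, a readinessProbe tuned that way might look roughly like this (a sketch; the `/hub/health` path and the numbers are assumptions, not the chart's actual values):

```yaml
readinessProbe:
  httpGet:
    path: /hub/health      # assumed hub health endpoint
    port: http
  periodSeconds: 2         # check often so the pod becomes Ready quickly on startup
  failureThreshold: 1000   # so large that transient slowness never flips it back to not-ready
```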
These happen when traffic is routed to a k8s Service that directs traffic to pods while no pod is in a Ready status, which is influenced by the readinessProbe.
lololol, I just made #1800 and came here to file this exact issue :D
> To make a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Due to this, the setting of the readinessProbe should be such that the hub pod becomes ready when it should be, but that it doesn't ever fail because it was slow for a while.
I agree! I think we should get rid of the readiness probe and instead use a startupProbe.
:D @yuvipanda I agree, but that will require k8s 1.16+, so we need to introduce a requirement on that or make it conditional until it's resolved.
+1 for adding a startupProbe Helm chart configuration with it enabled, but letting it be disabled for k8s clusters not supporting a startupProbe.
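One way to make it conditional could be a Helm template guard on the cluster version (a hypothetical sketch, not the chart's actual template; the probe path and numbers are assumptions):

```yaml
{{- if semverCompare ">=1.16-0" .Capabilities.KubeVersion.Version }}
startupProbe:
  httpGet:
    path: /hub/health    # assumed hub health endpoint
    port: http
  periodSeconds: 10
  failureThreshold: 30   # allows up to 5 minutes (30 * 10s) for startup
{{- end }}
```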
I'm not sure what I think about deleting the livenessProbe; it can serve a purpose by restarting the hub pod if it gets stuck after startup, though I don't ever see that happening.
I think we should keep the readinessProbe running though, as I believe it serves a function separate from both the livenessProbe and the startupProbe. We want the hub pod to become k8s Pod Ready quickly when it is, but we also don't want it to become not-ready after it has once been ready, because that doesn't make sense for a single-replica deployment that doesn't even do rolling upgrades. But if we remove it, I think the pod will be Ready from second 0, which makes us unable to use its status as an indicator of the hub pod's successful startup. Perhaps the startupProbe also influences the readiness state, but I don't think so - without a readinessProbe, pods are considered Ready as soon as they are running, I believe.
I just learned about startupProbes a minute ago and came here to check if we already have an Issue for it and happy to see you folks are on it! :heart:
I agree 100% that we should use a startupProbe to solve the hard problem, whether that's version-checking or waiting until we can bump minimum supported k8s to 1.16 (probably soon?).
> I'm not sure what i think about deleting the livenessProbe, it can serve a purpose to reset the hub pod if it would get stuck following startup, but I don't ever see that happening though.
It used to happen quite a bit, and this is part of what drove the Hub's internal consecutive-failure limit, which prompts a restart (essentially the Hub's own liveness check, which still causes daily restarts on mybinder.org). So I think it's reasonable to keep a simple livenessProbe around that restarts a stuck hub once in a while. It is important that the liveness check doesn't fail just because the Hub is busy, though, since that's the worst time to be restarting the Hub! For that reason, I'd be okay with disabling it by default, too.
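A livenessProbe tuned to tolerate a busy hub might look like this (a sketch; the `/hub/health` path and numbers are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /hub/health   # assumed hub health endpoint
    port: http
  periodSeconds: 30     # check infrequently
  timeoutSeconds: 10    # tolerate slow responses from a busy hub
  failureThreshold: 10  # only restart after ~5 minutes of sustained failure
```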
> I think we should have the readinessProbe running still though, as I believe that is a separate function to both livenessProbe and startupProbe
From the docs I was just reading:
> Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
I think readinessProbes are not very useful when we have a startupProbe for the Hub, in the absence of replicas and rolling upgrades. Readiness is only useful then if we want a Hub pod to go back into an unready state at some point after the startupProbe succeeds. I think we want either a readinessProbe or a startupProbe, but not both. The only use I can see for readiness then is if a flood of traffic results in DOS of the Hub. If that causes a readiness failure, traffic might stop in kubernetes rather than continuing to pile up on the Hub. It's still a failure mode, but might behave better sometimes? I'm not sure.
It is my understanding that the startupProbe won't influence readiness at all, only delay the livenessProbe. In other words, the startupProbe won't do anything for us other than help us avoid potential issues during startup from a livenessProbe.
`kubectl rollout status deploy/hub` will behave like "sleep until hub is ready". To conclude, I think using all three makes sense, and that we should increase the readinessProbe's failureThreshold to a big number.
This is documentation from the kubernetes docs, and I see no indication that a startupProbe would influence readiness.
> Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.
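Putting the "use all three" conclusion together, the hub container's probe section could look roughly like this (a sketch with assumed paths and numbers, not the chart's actual configuration):

```yaml
startupProbe:              # gives the hub time to start before liveness kicks in
  httpGet:
    path: /hub/health      # assumed hub health endpoint
    port: http
  periodSeconds: 10
  failureThreshold: 60     # up to 10 minutes to start
livenessProbe:             # restarts a hub that gets stuck after startup
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 10
readinessProbe:            # gates traffic during startup
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 2
  failureThreshold: 1000   # huge, so it effectively never fails once ready
```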