We have a k8s readinessProbe defined, which means we declare the hub "Ready" to receive traffic only once it has passed the test of responding to an HTTP request. Currently this is used only as an indicator of successful startup. For example, the CI tests should not try to send traffic to the hub pod before the hub pod has started.
NOTE: there is also a livenessProbe; if that one fails, it triggers a restart of the container in the pod.
To make a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Due to this, the setting of the readinessProbe should be such that the hub pod becomes ready when it should be, but that it doesn't ever fail because it was slow for a while.
Instead of disabling the readinessProbe entirely, I think we should try to make it never fail. This can be done using failureThreshold, for example: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#probe-v1-core
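For illustration, a readinessProbe tuned that way might look roughly like this (a sketch; the `/hub/health` path and the numbers are assumptions, not the chart's actual values):

```yaml
readinessProbe:
  httpGet:
    path: /hub/health      # assumed hub health endpoint
    port: http
  periodSeconds: 2         # check often so the pod becomes Ready quickly on startup
  failureThreshold: 1000   # so large that transient slowness never flips it back to not-ready
```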
These happen when traffic is routed to a k8s Service that directs traffic to pods while no pod is in a Ready status, which is influenced by the readinessProbe.
lololol, I just made #1800 and came here to file this exact issue :D
> To make a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Due to this, the setting of the readinessProbe should be such that the hub pod becomes ready when it should be, but that it doesn't ever fail because it was slow for a while.
I agree! I think we should get rid of the readiness probe and instead use a startupProbe.
:D @yuvipanda I agree, but that will require k8s 1.16+, so we need to introduce a requirement on that or make it conditional until it's resolved.
+1 for adding a startupProbe Helm chart configuration with it enabled, but letting it be disabled for k8s clusters not supporting a startupProbe.
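One way to make it conditional could be a Helm template guard on the cluster version (a hypothetical sketch, not the chart's actual template; the probe path and numbers are assumptions):

```yaml
{{- if semverCompare ">=1.16-0" .Capabilities.KubeVersion.Version }}
startupProbe:
  httpGet:
    path: /hub/health    # assumed hub health endpoint
    port: http
  periodSeconds: 10
  failureThreshold: 30   # allows up to 5 minutes (30 * 10s) for startup
{{- end }}
```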
I'm not sure what I think about deleting the livenessProbe; it can serve a purpose by restarting the hub pod if it gets stuck after startup, though I don't ever see that happening.
I think we should keep the readinessProbe running though, as I believe it serves a function separate from both the livenessProbe and the startupProbe. We want the hub pod to become k8s Pod Ready quickly when it is, but we also don't want it to become not-ready after it has once been ready, because that doesn't make sense for a single-replica deployment that doesn't even do rolling upgrades. But if we remove it, I think the pod will be Ready from second 0, which makes us unable to use its status as an indicator of the hub pod's successful startup. Perhaps the startupProbe also influences the readiness state, but I don't think so - without a readinessProbe, pods are considered Ready as soon as they are running, I believe.
I just learned about startupProbes a minute ago and came here to check if we already have an Issue for it and happy to see you folks are on it! :heart:
I agree 100% that we should use a startupProbe to solve the hard problem, whether that's version-checking or waiting until we can bump minimum supported k8s to 1.16 (probably soon?).
> I'm not sure what i think about deleting the livenessProbe, it can serve a purpose to reset the hub pod if it would get stuck following startup, but I don't ever see that happening though.
It used to happen quite a bit, and this is part of what drove the Hub's internal consecutive-failure limit, which prompts a restart (essentially the Hub's own liveness check, which still causes daily restarts on mybinder.org). So I think it's reasonable to keep a simple livenessProbe around that restarts a stuck hub once in a while. It is important that the liveness check doesn't fail just because the Hub is busy, though, since that's the worst time to be restarting the Hub! For that reason, I'd be okay with disabling it by default, too.
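A livenessProbe tuned to tolerate a busy hub might look like this (a sketch; the `/hub/health` path and numbers are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /hub/health   # assumed hub health endpoint
    port: http
  periodSeconds: 30     # check infrequently
  timeoutSeconds: 10    # tolerate slow responses from a busy hub
  failureThreshold: 10  # only restart after ~5 minutes of sustained failure
```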
> I think we should have the readinessProbe running still though, as I believe that is a separate function to both livenessProbe and startupProbe
From the docs I was just reading:
> Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
I think readinessProbes are not very useful when we have a startupProbe for the Hub, in the absence of replicas and rolling upgrades. Readiness is only useful then if we want a Hub pod to go back into an unready state at some point after the startupProbe succeeds. I think we want either a readinessProbe or a startupProbe, but not both. The only use I can see for readiness then is if a flood of traffic results in DOS of the Hub. If that causes a readiness failure, traffic might stop in kubernetes rather than continuing to pile up on the Hub. It's still a failure mode, but might behave better sometimes? I'm not sure.
It is my understanding that the startupProbe won't influence readiness at all, only delay the livenessProbe. In other words, the startupProbe won't do anything for us other than help us avoid potential issues during startup from a livenessProbe.
`kubectl rollout status deploy/hub` will behave like "sleep until hub is ready". To conclude, I think using all three makes sense, and that we should increase the readinessProbe's failureThreshold to a big number.
This is documentation from the kubernetes docs, and I see no indication that a startupProbe would influence readiness.
> Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.
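Putting the "use all three" conclusion together, the hub container's probe section could look roughly like this (a sketch with assumed paths and numbers, not the chart's actual configuration):

```yaml
startupProbe:              # gives the hub time to start before liveness kicks in
  httpGet:
    path: /hub/health      # assumed hub health endpoint
    port: http
  periodSeconds: 10
  failureThreshold: 60     # up to 10 minutes to start
livenessProbe:             # restarts a hub that gets stuck after startup
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 10
readinessProbe:            # gates traffic during startup
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 2
  failureThreshold: 1000   # huge, so it effectively never fails once ready
```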