This is a Feature Request
Change request for: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
What would you like to be added
I would like to have a clear warning on the Configure Liveness, Readiness and Startup Probes page that Liveness Probes can worsen the situation for applications: they can lead to cascading failures, e.g. in high-load situations where the health endpoint no longer responds and app restarts take a long time.
Some proposed text (I'm open to anything better):
Please note that liveness probes can lead to cascading failures,
e.g. causing excessive downtime due to container restarts in high-load situations.
Understand the difference between readiness and liveness probes
and when to apply them for your app.
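To make the difference concrete, here is a minimal sketch of the two probe types (the Pod name, image, endpoint paths and timings are made-up placeholders for illustration, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                        # hypothetical Pod, for illustration only
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0    # placeholder image
    ports:
    - containerPort: 8080
    # Readiness: while this probe fails, the Pod is only taken out of Service
    # endpoints; the container keeps running and can recover on its own.
    readinessProbe:
      httpGet:
        path: /ready        # assumed endpoint: "can I serve traffic right now?"
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
    # Liveness: once this probe fails often enough, the kubelet kills and
    # restarts the container, which is a much more drastic action.
    livenessProbe:
      httpGet:
        path: /live         # assumed endpoint: "am I irrecoverably broken (e.g. deadlocked)?"
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```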
Why is this needed
Kubernetes documentation pages are often read by inexperienced app developers who are not familiar with Kubernetes. The current page only mentions Liveness Probes as a way to increase availability for containers that get stuck, but says nothing about the dangers of using Liveness Probes. I often observe app developers using Liveness Probes in the same way as Readiness Probes (sometimes even with the exact same probe settings), which causes more harm than good.
For more context and Zalando's recommendations for app developers, see my blog post: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
This was triggered by this Tweet which suggests reading the docs is enough (which is not enough as the pitfalls are not mentioned): https://twitter.com/Guillaume_Swiss/status/1178258781152563200
I don't want to add "never use liveness probes", but I think the documentation page is currently not balanced: it mentions "availability" only as a benefit of Liveness Probes and says nothing about the downsides.
Note that the answer to "When should you use a liveness probe?" also does not provide any hints on potential dangers:
If the process in your Container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod’s restartPolicy.
If you’d like your Container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.
For the naive application developer this might sound good ("restart my container on failure? yes, of course I want that!"). The documentation does not mention that there is no coordination across Pods and that PodDisruptionBudgets (PDBs) are not respected, i.e. that often all containers are restarted at the same time because of some external event or dependency (e.g. high load, or a health check against a database that has a hiccup).
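A minimal sketch of the kind of liveness probe configuration I mean (the /health path and the fact that it checks the database are assumptions for illustration):

```yaml
# Anti-pattern sketch: a liveness probe pointed at an endpoint that also checks
# an external dependency. If the database has a hiccup, this probe fails on
# every replica at the same time, and the kubelet restarts all of them at once,
# with no coordination across Pods and without respecting PDBs.
livenessProbe:
  httpGet:
    path: /health        # hypothetical endpoint that also pings the database
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```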
This is a valid issue
/priority backlog
I'm OK to add something, but I'd rather phrase it in a more positive light. E.g.
Liveness probes can be a powerful way to recover from application failures, but they should be used with caution. Liveness probes must be configured carefully to ensure that they truly indicate unrecoverable application failure, for example a deadlock. A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold. This ensures that the pod is observed as not-ready for some period of time before it is hard killed.
Incorrect configuration of liveness probes can lead to cascading failures. For example, killing the pod when it has high load, as opposed to being crashed, can lead to failed client requests or traffic being shifted onto other pods in the same deployment or service, thereby overloading them.
Understand the difference between readiness and liveness probes and when to apply them for your app.
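A rough sketch of that pattern, assuming a cheap /healthz endpoint with no external dependencies (the concrete numbers are only illustrative):

```yaml
# Both probes hit the same low-cost endpoint; the liveness probe tolerates more
# consecutive failures, so the Pod is observed as not-ready for a while before
# it is hard killed.
readinessProbe:
  httpGet:
    path: /healthz       # assumed low-cost endpoint, no external dependencies
    port: 8080
  periodSeconds: 5
  failureThreshold: 3    # ~15s of failures: removed from Service endpoints
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 12   # ~60s of failures: container is killed and restarted
```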
My gut feeling is that there's enough material here for a new Task page, aimed at developers (think CKAD), that describes how to design readiness, liveness and startup probes for your workload.
The website has a _lot_ of task pages, but (IMO) that's because there are a lot of different tasks that readers might want to do.
@thockin your text proposal LGTM :smile:
I think a good task doc would be great! How do we go about doing that?
I'm happy to consult, but I am not a writer..
Thanks @thockin, very good proposed text; it doesn't hide the problem and gives guidance on how and when to use liveness probes.
I'm happy to do a PR, but I'm not sure if it should be after the second paragraph or somewhere else (?).
I would create a short paragraph about LivenessProbe and ReadinessProbe just before the more verbose sections that show the how-to.
Maybe https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes would be a better place, and we could link to it from the configuration page.
Did this get finished?
@thockin no, not yet. I was on vacation and will follow up.
Is this done now?
any update?
any progress Mr @hjacobs ?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.