This is a Feature Request
Change request for: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
What would you like to be added
I would like to have a clear warning on the Configure Liveness, Readiness and Startup Probes page that Liveness Probes can worsen the situation for applications: they can lead to cascading failures, e.g. in high-load situations where the health endpoint no longer responds and app restarts take a long time.
Some proposed text (I'm open to anything better):
Please note that liveness probes can lead to cascading failures,
e.g. causing excessive downtime due to container restarts in high-load situations.
Understand the difference between readiness and liveness probes
and when to apply them for your app.
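To make the difference concrete, here is a minimal sketch of the two probe types (the Pod name, image, endpoint paths and timings are made-up placeholders for illustration, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                        # hypothetical Pod, for illustration only
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0    # placeholder image
    ports:
    - containerPort: 8080
    # Readiness: while this probe fails, the Pod is only taken out of Service
    # endpoints; the container keeps running and can recover on its own.
    readinessProbe:
      httpGet:
        path: /ready        # assumed endpoint: "can I serve traffic right now?"
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
    # Liveness: once this probe fails often enough, the kubelet kills and
    # restarts the container, which is a much more drastic action.
    livenessProbe:
      httpGet:
        path: /live         # assumed endpoint: "am I irrecoverably broken (e.g. deadlocked)?"
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```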
Why is this needed
Kubernetes documentation pages are often read by inexperienced app developers who are not familiar with Kubernetes. The current page only mentions Liveness Probes as a way to increase availability for containers that get stuck, but says nothing about the dangers of using Liveness Probes. I often observe app developers using Liveness Probes in the same way as Readiness Probes (sometimes even with the exact same probe settings), which causes more harm than good.
For more context and Zalando's recommendations for app developers, see my blog post: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
This was triggered by this Tweet which suggests reading the docs is enough (which is not enough as the pitfalls are not mentioned): https://twitter.com/Guillaume_Swiss/status/1178258781152563200
I don't want to add "never use liveness probes", but I think the documentation page is currently not balanced: it mentions "availability" only as a benefit of Liveness Probes and says nothing about the downsides.
Note that the answer to "When should you use a liveness probe?" also does not provide any hints on potential dangers:
If the process in your Container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod’s restartPolicy.
If you’d like your Container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.
For the naive application developer this might sound good ("restart my container on failure? yes, of course I want that!"). The documentation does not mention that there is no coordination across Pods and that PodDisruptionBudgets (PDBs) are not respected, i.e. that often all containers are restarted at the same time because of some external event or dependency (e.g. high load, or a health check against a database that has a hiccup).
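A minimal sketch of the kind of liveness probe configuration I mean (the /health path and the fact that it checks the database are assumptions for illustration):

```yaml
# Anti-pattern sketch: a liveness probe pointed at an endpoint that also checks
# an external dependency. If the database has a hiccup, this probe fails on
# every replica at the same time, and the kubelet restarts all of them at once,
# with no coordination across Pods and without respecting PDBs.
livenessProbe:
  httpGet:
    path: /health        # hypothetical endpoint that also pings the database
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```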
This is a valid issue
/priority backlog
I'm OK to add something, but I'd rather phrase it in a more positive light. E.g.
Liveness probes can be a powerful way to recover from application failures, but they should be used with caution. Liveness probes must be configured carefully to ensure that they truly indicate unrecoverable application failure, for example a deadlock. A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold. This ensures that the pod is observed as not-ready for some period of time before it is hard killed.
Incorrect configuration of liveness probes can lead to cascading failures. For example, killing the pod when it has high load, as opposed to being crashed, can lead to failed client requests or traffic being shifted onto other pods in the same deployment or service, thereby overloading them.
Understand the difference between readiness and liveness probes and when to apply them for your app.
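A rough sketch of that pattern, assuming a cheap /healthz endpoint with no external dependencies (the concrete numbers are only illustrative):

```yaml
# Both probes hit the same low-cost endpoint; the liveness probe tolerates more
# consecutive failures, so the Pod is observed as not-ready for a while before
# it is hard killed.
readinessProbe:
  httpGet:
    path: /healthz       # assumed low-cost endpoint, no external dependencies
    port: 8080
  periodSeconds: 5
  failureThreshold: 3    # ~15s of failures: removed from Service endpoints
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 12   # ~60s of failures: container is killed and restarted
```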
My gut feeling is that there's enough material here for a new Task page, aimed at developers (think CKAD), that describes how to design readiness, liveness and startup probes for your workload.
The website has a _lot_ of task pages, but (IMO) that's because there are a lot of different tasks that readers might want to do.
@thockin your text proposal LGTM :smile:
I think a good task doc would be great! How do we go about doing that?
I'm happy to consult, but I am not a writer..
Thanks @thockin, very good proposed text; it doesn't hide the problem and gives guidance on how and when to use liveness probes.
I'm happy to do a PR, but I'm not sure if it should be after the second paragraph or somewhere else (?).
I would create a short paragraph about LivenessProbe and ReadinessProbe just before the more verbose sections that show the how-to.
Maybe https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes would be a better place, and we could link to it from the configuration page.
Did this get finished?
@thockin no, not yet. I was on vacation and will follow up.
Is this done now?
any update?
any progress Mr @hjacobs ?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.