Envoy: [watchdog] Config options for random delay before kill and percent of threads stuck before multi-kill

Created on 1 Jun 2020 · 10 Comments · Source: envoyproxy/envoy

Here are two config options that could be added to improve the behavior of the watchdog:

Add a config parameter for random(0, N) seconds before kill that can be used to smear out single-thread watchdog kill events and avoid coordinated, simultaneous failures across multiple proxy instances. Do not kill the proxy if it recovers while waiting for the random delay timer to fire.
Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.

area/watchdog · help wanted

antoniovicente

All 10 comments

I'll tackle this issue. I don't see a button to assign it to myself, but I'll just leave a comment to claim it.

KBaichoo on 13 Jul 2020

/assign @KBaichoo

antoniovicente on 13 Jul 2020

@KBaichoo cannot be assigned to this issue.

Caused by: a comment (https://github.com/envoyproxy/envoy/issues/11389#issuecomment-657761798) was created by @antoniovicente.

repokitteh[bot] on 13 Jul 2020

> I'll tackle this issue. I don't see a button to assign it to myself, but I'll just leave a comment to claim it.

Thanks for the help!
I sent you some info on how to get your github id added to the envoy org so we can assign issues to you.

antoniovicente on 13 Jul 2020

I'm now a part of the envoy org

KBaichoo on 13 Jul 2020

/assign @KBaichoo

antoniovicente on 13 Jul 2020

To implement:

Add a config parameter for random(0, N) seconds before kill that can be used to smear out single-thread watchdog kill events and avoid coordinated, simultaneous failures across multiple proxy instances. Do not kill the proxy if it recovers while waiting for the random delay timer to fire.

I think this is the way to go as it leverages existing mechanisms:

Change the watchdog config proto to include a duration field, max_random_kill_duration (default 0, which disables the random delay).
Modify GuardDogImpl::kill_timeout_ so that, when killing is enabled, it is the kill timeout from the config plus a random value of up to max_random_kill_duration. This reuses the existing single-thread kill mechanism.
- We'll calculate this adjustment in the GuardDogImpl constructor.
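
A minimal sketch of the adjustment described above, assuming the jitter is drawn once from a uniform distribution and that a kill timeout of 0 means killing is disabled. The function name effectiveKillTimeout and its signature are hypothetical illustrations, not Envoy's actual code:

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical helper (not part of Envoy): computes the value that would be
// stored in a member like kill_timeout_ when the guard dog is constructed.
std::chrono::milliseconds effectiveKillTimeout(std::chrono::milliseconds kill_timeout,
                                               std::chrono::milliseconds max_random_kill_duration,
                                               std::mt19937_64& rng) {
  // Assumption: a non-positive kill timeout means killing is disabled, and a
  // non-positive max_random_kill_duration means the random delay is disabled.
  if (kill_timeout.count() <= 0 || max_random_kill_duration.count() <= 0) {
    return kill_timeout;
  }
  // Draw the extra delay uniformly from [0, max_random_kill_duration].
  std::uniform_int_distribution<std::int64_t> jitter(0, max_random_kill_duration.count());
  return kill_timeout + std::chrono::milliseconds(jitter(rng));
}
```

Because the random delay is folded into the timeout itself, a thread that becomes responsive again before the extended timeout elapses would never reach the kill path, which is the recovery behavior the issue asks for.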

Thoughts?

KBaichoo on 13 Jul 2020

To implement:

Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.

Proposed Changes:

Change the watchdog config message to include a uint for threshold before a multi-kill (default 2 to preserve current behavior)
Modify the step() function so that, when multi-kill is enabled, it counts the number of unresponsive threads and triggers a multi-kill once that count is >= the threshold.
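
The counting logic proposed here could look roughly like the sketch below. WatchedThread and shouldMultiKill are hypothetical names used only for illustration, and the sketch assumes a non-positive multi-kill timeout means multi-kill is disabled:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-thread state the guard dog tracks.
struct WatchedThread {
  std::chrono::milliseconds since_last_touch;
};

// Returns true when enough threads have been unresponsive for longer than the
// multi-kill timeout. A default threshold of 2 matches the current behavior
// described in the issue.
bool shouldMultiKill(const std::vector<WatchedThread>& threads,
                     std::chrono::milliseconds multi_kill_timeout,
                     std::size_t multi_kill_threshold = 2) {
  // Assumption: a non-positive timeout means multi-kill is disabled.
  if (multi_kill_timeout.count() <= 0) {
    return false;
  }
  std::size_t unresponsive = 0;
  for (const auto& thread : threads) {
    if (thread.since_last_touch > multi_kill_timeout) {
      ++unresponsive;
    }
  }
  return unresponsive >= multi_kill_threshold;
}
```

A real step() loop could stop counting as soon as the threshold is reached; the full count is kept here only for readability.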

KBaichoo on 13 Jul 2020

> To implement:
>
> Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.
>
> Proposed Changes:
>
> Change the watchdog config message to include a uint for threshold before a multi-kill (default 2 to preserve current behavior)

I was thinking about a percentage of registered threads and preserving current behavior by doing multi-kill on "max(2, registered_threads * multi_kill_percent)" with a default multi_kill_percent of 0. Having an absolute number of threads required for multi-kill with a default of 2 when multi-kill is enabled also seems fine.

> Modify the step() function for multi-kill enabled to count the number of unresponsive threads and when we’re >= the threshold, then we can trigger a multi-kill.
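
For comparison, the percentage-based variant could be sketched as follows. The function name multiKillThreshold is hypothetical, and the sketch assumes multi_kill_percent is expressed on a 0 to 100 scale and that fractional thread counts round up; neither detail is settled in the thread. It uses std::clamp, so it needs C++17:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper: number of unresponsive threads required for a
// multi-kill, following max(2, registered_threads * multi_kill_percent).
std::size_t multiKillThreshold(std::size_t registered_threads, double multi_kill_percent) {
  // Assumption: the percentage is given as 0..100; out-of-range values are clamped.
  const double fraction = std::clamp(multi_kill_percent, 0.0, 100.0) / 100.0;
  // Assumption: round up, so a partial thread still counts as one more thread.
  const auto from_percent =
      static_cast<std::size_t>(std::ceil(static_cast<double>(registered_threads) * fraction));
  // The floor of 2 keeps the current behavior when multi_kill_percent is 0.
  return std::max<std::size_t>(2, from_percent);
}
```

With the default of 0 percent this always yields 2, so existing deployments would see no change, while a value such as 50 scales the requirement with the number of registered threads.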

antoniovicente on 14 Jul 2020

This isn't entirely closed (PR #12082 should finish it out). I don't have access on my end to re-open it.

KBaichoo on 21 Jul 2020
