Envoy: [watchdog] Config options for random delay before kill and percent of threads stuck before multi-kill

Created on 1 Jun 2020  ·  10 Comments  ·  Source: envoyproxy/envoy

Here are two config options that could be added to improve the behavior of the watchdog:

  • Add a config parameter for random(0, N) seconds before kill that can be used to smear out single-thread watchdog kill events and avoid coordinated, simultaneous failures across multiple proxy instances. Do not kill the proxy if it recovers while waiting for the random delay timer to fire.
  • Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.
area/watchdog  help wanted

All 10 comments

I'll tackle this issue. I don't see a button to assign it to myself, but I'll just leave a comment to claim it.

/assign @KBaichoo

@KBaichoo cannot be assigned to this issue.


Thanks for the help!
I sent you some info on how to get your GitHub ID added to the Envoy org so we can assign issues to you.

I'm now a part of the Envoy org.

/assign @KBaichoo

To implement:

  • Add a config parameter for random(0, N) seconds before kill that can be used to smear out single-thread watchdog kill events and avoid coordinated, simultaneous failures across multiple proxy instances. Do not kill the proxy if it recovers while waiting for the random delay timer to fire.

I think this is the way to go as it leverages existing mechanisms:

  • Change the watchdog config proto to include a duration for max_random_kill_duration (default 0 to disable)
  • Adjust GuardDogImpl::kill_timeout_ based on both the kill timeout specified in the config and max_random_kill_duration (if killing is enabled), so that the existing single-thread kill mechanism is reused.

    • We'll calculate this adjustment in the GuardDogImpl constructor.
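
A minimal sketch of how that adjustment might look, assuming the random delay is drawn once, uniformly from [0, max_random_kill_duration], and added to the configured kill timeout. The helper name effectiveKillTimeout is hypothetical and is not part of Envoy; treating a kill timeout of 0 as "killing disabled" is also an assumption here.

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical helper (not Envoy's actual code): computes the effective
// single-thread kill timeout once, e.g. from the GuardDogImpl constructor.
std::chrono::milliseconds effectiveKillTimeout(
    std::chrono::milliseconds kill_timeout,
    std::chrono::milliseconds max_random_kill_duration,
    std::mt19937_64& rng) {
  assert(max_random_kill_duration.count() >= 0);
  // Assumption: a kill timeout of 0 means killing is disabled, and a max
  // random duration of 0 disables the random delay. Both leave the timeout as is.
  if (kill_timeout.count() == 0 || max_random_kill_duration.count() == 0) {
    return kill_timeout;
  }
  // Draw a uniform delay in [0, max_random_kill_duration] and add it on top.
  std::uniform_int_distribution<std::chrono::milliseconds::rep> delay(
      0, max_random_kill_duration.count());
  return kill_timeout + std::chrono::milliseconds(delay(rng));
}
```

Because the delay is folded into the timeout itself, a thread that recovers before the extended timeout expires is never killed, which would satisfy the "do not kill if it recovers" requirement without a separate timer.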

Thoughts?

To implement:

  • Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.

Proposed Changes:

  • Change the watchdog config message to include a uint for threshold before a multi-kill (default 2 to preserve current behavior)
  • Modify the step() function so that, when multi-kill is enabled, it counts the number of unresponsive threads and triggers a multi-kill once that count is >= the threshold.
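
The counting logic could be sketched roughly as follows. WatchedThread and shouldMultiKill are hypothetical, simplified stand-ins rather than Envoy's real types, and treating a multi-kill timeout of 0 as "disabled" is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for the per-thread state a guard dog would track.
struct WatchedThread {
  Millis since_last_touch;  // time since the thread last reported in
};

// Returns true when at least `multikill_threshold` threads have been
// unresponsive for longer than `multikill_timeout`.
bool shouldMultiKill(const std::vector<WatchedThread>& threads,
                     Millis multikill_timeout,
                     std::size_t multikill_threshold) {
  assert(multikill_threshold > 0);
  // Assumption: a timeout of 0 means multi-kill is disabled.
  if (multikill_timeout.count() == 0) {
    return false;
  }
  std::size_t unresponsive = 0;
  for (const auto& thread : threads) {
    if (thread.since_last_touch > multikill_timeout) {
      // Stop early as soon as enough stuck threads have been seen.
      if (++unresponsive >= multikill_threshold) {
        return true;
      }
    }
  }
  return false;
}
```

With a threshold of 2 this matches the behavior the issue describes as current: two stuck threads are enough to trigger a multi-kill.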

I was thinking about a percentage of registered threads and preserving current behavior by doing multi-kill on "max(2, registered_threads * multi_kill_percent)" with a default multi_kill_percent of 0. Having an absolute number of threads required for multi-kill with a default of 2 when multi-kill is enabled also seems fine.
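
As an illustration of the percentage variant, here is a small sketch of the "max(2, registered_threads * multi_kill_percent)" rule. It assumes multi_kill_percent is expressed on a 0 to 100 scale and that fractional results round up; neither detail is stated in the thread, and the function name multiKillThreshold is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: number of stuck threads required for a multi-kill.
// Assumptions: percent is in [0, 100], and fractional thread counts round up.
std::size_t multiKillThreshold(std::size_t registered_threads,
                               double multi_kill_percent) {
  assert(multi_kill_percent >= 0.0 && multi_kill_percent <= 100.0);
  const auto from_percent = static_cast<std::size_t>(std::ceil(
      static_cast<double>(registered_threads) * multi_kill_percent / 100.0));
  // Never go below 2, which preserves the existing behavior when the
  // percentage is left at its default of 0.
  return std::max<std::size_t>(2, from_percent);
}
```

For example, with 10 registered threads a percentage of 0 still yields a threshold of 2, while 50 yields a threshold of 5.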


This isn't entirely closed (PR #12082 should finish it out). I don't have access on my end to re-open it.
