Here are two config options that could be added to improve the behavior of the watchdog:
I'll tackle this issue. I don't see a button to assign it to myself, but I'll just leave a comment to claim it.
/assign @KBaichoo
@KBaichoo cannot be assigned to this issue.
:cat:
Caused by: a https://github.com/envoyproxy/envoy/issues/11389#issuecomment-657761798 was created by @antoniovicente.
> I'll tackle this issue. I don't see a button to assign it to myself, but I'll just leave a comment to claim it.
Thanks for the help!
I sent you some info on how to get your github id added to the envoy org so we can assign issues to you.
I'm now a part of the envoy org
/assign @KBaichoo
To implement:
- Add a config parameter for random(0, N) seconds before kill that can be used to smear out single-thread watchdog kill events and avoid coordinated, simultaneous failures across multiple proxy instances. Do not kill the proxy if it recovers while waiting for the random delay timer to fire.
I think this is the way to go as it leverages existing mechanisms:
- max_random_kill_duration (default 0 to disable)
- GuardDogImpl::kill_timeout_ to be adjusted based on the kill timeout specified in the config, as well as the max_random_kill_duration (if killing is enabled), to leverage the existing mechanism for killing for a single thread.
- GuardDogImpl

Thoughts?
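The adjustment proposed above could look roughly like the following. This is a minimal sketch, not Envoy's actual code: `jitteredKillTimeout` and `shouldKill` are hypothetical helpers, and the only names taken from the thread are the configured kill timeout and max_random_kill_duration. The idea is that the jitter is folded into the kill timeout once, so a thread that recovers before the extended deadline is never killed.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical helper (not an Envoy API): extends the configured kill timeout
// by a random jitter drawn uniformly from [0, max_random_kill_duration].
// A max_random_kill_duration of 0 disables the jitter (the proposed default).
template <typename Rng>
std::chrono::milliseconds jitteredKillTimeout(std::chrono::milliseconds kill_timeout,
                                              std::chrono::milliseconds max_random_kill_duration,
                                              Rng& rng) {
  if (max_random_kill_duration.count() <= 0) {
    return kill_timeout;
  }
  std::uniform_int_distribution<std::int64_t> dist(0, max_random_kill_duration.count());
  return kill_timeout + std::chrono::milliseconds(dist(rng));
}

// Hypothetical check: kill only if the thread has stayed unresponsive for
// longer than the whole extended timeout. A thread that touches its watchdog
// while the random delay is pending resets its elapsed time and is spared.
inline bool shouldKill(std::chrono::milliseconds since_last_touch,
                       std::chrono::milliseconds effective_kill_timeout) {
  return since_last_touch > effective_kill_timeout;
}
```

Because the jitter only lengthens a single timeout, no second timer is needed, which is the sense in which this approach reuses the existing single-thread kill mechanism.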
To implement:
- Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.
Proposed Changes:
- Change the watchdog config message to include a uint for threshold before a multi-kill (default 2 to preserve current behavior)
- Modify the step() function for multi-kill enabled to count the number of unresponsive threads, and when we're >= the threshold, then we can trigger a multi-kill.

> To implement:
> - Add a percent or number of stuck threads required to trigger a multi-thread kill. Default to the current behavior of needing 2 stuck threads before triggering multi-kill.
>
> Proposed Changes:
> - Change the watchdog config message to include a uint for threshold before a multi-kill (default 2 to preserve current behavior)

I was thinking about a percentage of registered threads and preserving current behavior by doing multi-kill on "max(2, registered_threads * multi_kill_percent)" with a default multi_kill_percent of 0. Having an absolute number of threads required for multi-kill with a default of 2 when multi-kill is enabled also seems fine.

> - Modify the step() function for multi-kill enabled to count the number of unresponsive threads and when we're >= the threshold, then we can trigger a multi-kill.
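The threshold logic discussed here might be sketched as follows. The helper names `multiKillThreshold` and `shouldMultiKill` are hypothetical and not part of Envoy. The sketch also assumes multi_kill_percent is expressed on a 0 to 100 scale and that fractional results round up; the thread does not settle either point.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical: number of stuck threads required for a multi-kill, computed as
// max(2, registered_threads * multi_kill_percent). Assumes the percent is on a
// 0-100 scale and rounds up. A percent of 0 yields 2, the current behavior.
inline std::size_t multiKillThreshold(std::size_t registered_threads, double multi_kill_percent) {
  const double scaled = static_cast<double>(registered_threads) * multi_kill_percent / 100.0;
  return std::max<std::size_t>(2, static_cast<std::size_t>(std::ceil(scaled)));
}

// Hypothetical stand-in for the counting inside step(): count threads whose
// time since last touch exceeds the multi-kill timeout, then compare the
// count against the threshold.
inline bool shouldMultiKill(const std::vector<std::chrono::milliseconds>& since_last_touch,
                            std::chrono::milliseconds multi_kill_timeout,
                            std::size_t threshold) {
  const auto stuck = std::count_if(
      since_last_touch.begin(), since_last_touch.end(),
      [&](std::chrono::milliseconds delta) { return delta > multi_kill_timeout; });
  return static_cast<std::size_t>(stuck) >= threshold;
}
```

With an absolute threshold instead of a percentage, only `shouldMultiKill` is needed and the configured uint is passed in directly as `threshold`.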
This isn't entirely closed (PR #12082 should finish it out). I don't have access on my end to re-open it.