Envoy: Different guard dogs for threads with different behavior

Created on 1 Sep 2020 · 8Comments · Source: envoyproxy/envoy

Different Guarddogs : a guard dog for worker threads and a guard dog for the main thread.

The main thread has different workloads than worker threads, this causes friction in systems that set thresholds for both of them as it’s “one-size fits all”.

For example, the main thread often spends a significant amount of time on parsing new config. That amount of time would be considered excessive for a worker thread to not have finished it’s event loop iteration and check in with its watchdog.

This “one size fits all” guard dog doesn’t work. Instead we need two guard dogs, one that’ll monitor the main thread and (other auxiliary threads)?, and another for the worker threads. This would allow us to better fine tune the thresholds depending on the guarddog (and what threads it watches) -- better allowing us to deploy watchdog capabilities such as watchdog actions, kill timeouts, etc.

The main trade off would be that it’d be harder to enact cross-guard dog policy that would trigger if we’d aggregate across guard dogs (such as multikill events), but won’t trigger because they aren’t sufficient within their individual guard dog to trigger that event.

arewatchdog help wanted

Source

KBaichoo

All 8 comments

This seems reasonable to me, assuming we put this all behind configuration

snowp on 2 Sep 2020

/assign @KBaichoo

KBaichoo on 2 Sep 2020

👍1

To implement this I propose the following:

Steps:

Encapsulate Watchdog Configuration in Server::Configuration::Watchdog instead of Server::Configuration::Main, though we’ll still allow the Main configuration to have method(s) to return the watchdog config(s). This will look similar to the Admin interface, and they'll be a new class encapping watchdog info (timeouts, etc)
1. The method to return guarddog configs will likely look like const Server::Configuration::Watchdog& getWatchdogConfig() and an std::optional<const Server::Configuration::Watchdog&> getWorkerWatchdogConfig() for the additional guarddog that’ll watch only worker threads if both watchdog configs are specified.
2. Add a method on Server::Configuration::Main regarding whether we are using multiple guarddogs.
Modify Guarddog ctor to take Server::Configuration::Watchdog instead of Server::Configuration::Main; this would simplify the interface since currently the guarddog just uses its specific config information from the main config and it avoids adding an additional param / mechanism for clarifying which Server::Configuration::Watchdog a guarddog need to boot up as its own.
Add an additional watchdog in bootstrap.proto, if it is specified then we’ll have a separate guarddog for worker threads.
Augment the server.cc to have a unique_ptr to another guarddog that’ll be focused only on worker threads if both are specified
Modify InstanceImpl::startWorkers() use the appropriate guarddog for watching the worker threads based off of information is Server::Configuration::Main.

If the additional watchdog isn’t specified in the Bootstrap then we’ll fall back to the default behavior of a single guarddog.

Thought @antoniovicente @envoyproxy/api-shepherds ? Thanks

KBaichoo on 2 Sep 2020

My thoughts about API:

Like you mention in item (3) above, we need a Watchdog config message in bootstrap.proto for each type. One possible option is the one you suggest of having a second field for the worker watchdog parameters. Some alternate options include having the new parameter specify the auxiliary thread watchdog parameters instead. Or add 2 new parameters(one for workers, one for auxiliary) and slowly deprecate the current parameter.

Another option would be to change the type of the watchdog config field in bootstrap.proto to be a repeated field and add some type enum to the Watchdog message to specify if it applies to workers, main thread, or is the default set of parameters for threads that don't have a more specific watchdog type. A possible problem with this last suggestion is that Envoy coding convention does not allow changing type of proto fields from optional to repeated because it breaks compat with certain config sources. Also, the naming convention for repeated fields is plural.

It seems fine for multi-kill to only pay attention to a specific type of thread like worker threads. Multi-kill may not make as much sense for auxiliary threads.

antoniovicente on 3 Sep 2020

Thanks for the input @antoniovicente.

I believe we only create WDs right now for workers, and for the main thread. Perhaps we'd want to in the future create it for other threads with similar workloads (say async io threads)?

One possible option is the one you suggest of having a second field for the worker watchdog parameters.

We'd be able to keep the existing field (breaking downstream a little less), but it would have additional cognitive load as we'd have a single knob that changes in complex ways depending on the value of another knob.

Some alternate options include having the new parameter specify the auxiliary thread watchdog parameters instead.

The main drawback with this approach (IIUC the suggestion), is that we'd have an explosion of parameters for every new field we'd need an additional one for the auxiliary.

Or add 2 new parameters(one for workers, one for auxiliary) and slowly deprecate the current parameter.

This would likely be cleaner from a code hygiene perspective, but it won't be future proof to say adding a guarddog for threads with another set of behaviors (say logger threads). That is to say, often times when we want 2 of some X, it sometimes becomes a multiple > 2 in the future.

Another option would be to change the type of the watchdog config field in bootstrap.proto to be a repeated field and add some type enum to the Watchdog message to specify if it applies to workers, main thread, or is the default set of parameters for threads that don't have a more specific watchdog type. A possible problem with this last suggestion is that Envoy coding convention does not allow changing type of proto fields from optional to repeated because it breaks compat with certain config sources. Also, the naming convention for repeated fields is plural.

This generalizes from the 2 new parameter solution, but we run into issues from coding conventions (I believe changing the name doesn't matter to the protos as the tag is what matters).

Since we can't retrofit the old field, we could just add a new field watchdogs and have an enum for configs for different watchdog types; this would also simplify the Server::Configuration::Main interface to have a single function std::optional<WatchdogConfig> GetWatchdogConfig(WatchdogTypeEnum).

A drawback to this is configuration validation could get messy (for example users could have multiple configs for the same watchdog type provided), but we could just take the first one and warn them about their misconfiguration.

KBaichoo on 3 Sep 2020

I'd avoid enum, I think the cleanest here is to add a MultiWatchdog multi_watchdog = N; to bootstrap and a new message with:

message MultiWatchdog {
  Watchdog aux_watchdog = 1;
  Watchdog worker_watchdog = 2;
  ...
}

This way we can deprecate the existing watchdog field. This is similar to what we did with Runtime when introducing explicit layer control.

htuch on 3 Sep 2020

👍1

Adding a new message to group together the multiple watchdog configs seems like a fine plan. Is that the preferred option? If yes, let's move on to implementation.

antoniovicente on 3 Sep 2020

Yea upon implementation that seemed to be cleaner than repeated field. I've started implementation with the new message.

KBaichoo on 3 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings