The thread watchdog is already an important mechanism for detecting and recovering from coding errors that result in infinite loops, blocking API calls, and very long computations in worker threads. There are a few simple improvements that would make the watchdog even more awesome:
/assign @KBaichoo
For a general mechanism I'm proposing the following changes:
Bootstrap.proto
// Field numbers and enum values are illustrative.
message WatchDog {
  enum WatchDogEvent {
    Abort = 0;
    Miss = 1;
    MegaMiss = 2;
  }
  message WatchDogActions {
    string name = 1;
    WatchDogEvent evt_type = 2;
  }
  repeated WatchDogActions actions = 1;
}
In GuardDogImpl:
The callbacks will have the following signature:
void Function(WatchDogEvent event, std::vector<std::pair<Thread::ThreadID, MonotonicTime>> tid_ltt_pairs, MonotonicTime now)
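To make the proposed signature concrete, here is a minimal sketch of how callbacks of that shape could be stored and invoked per event. `ThreadId`, `MonotonicTime`, `TidLttPairs` and `CallbackRegistry` are stand-ins defined only for this example, not Envoy's actual types, and the vector is passed by const reference rather than by value purely to avoid a copy per callback.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Stand-ins for Envoy's types; assumptions for this sketch only.
using ThreadId = std::int64_t;
using MonotonicTime = std::chrono::steady_clock::time_point;

enum class WatchDogEvent { Abort, Miss, MegaMiss };

// (thread id, last touch time) for each thread involved in the event.
using TidLttPairs = std::vector<std::pair<ThreadId, MonotonicTime>>;

// Mirrors the proposed callback signature.
using WatchDogCallback =
    std::function<void(WatchDogEvent, const TidLttPairs&, MonotonicTime)>;

// Hypothetical registry mapping each event type to its configured callbacks.
class CallbackRegistry {
public:
  void add(WatchDogEvent event, WatchDogCallback cb) {
    assert(cb != nullptr);
    callbacks_[event].push_back(std::move(cb));
  }

  // Invokes every callback registered for `event`; returns how many ran.
  std::size_t fire(WatchDogEvent event, const TidLttPairs& pairs,
                   MonotonicTime now) const {
    const auto it = callbacks_.find(event);
    if (it == callbacks_.end()) {
      return 0;
    }
    for (const auto& cb : it->second) {
      cb(event, pairs, now);
    }
    return it->second.size();
  }

private:
  std::map<WatchDogEvent, std::vector<WatchDogCallback>> callbacks_;
};
```

In this sketch the watchdog loop would collect the offending threads once per tick and call `fire()` once per event type, so each action sees the full list of threads in that state.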
You may want to have the events be MultiKill, Kill, Miss and MegaMiss, to match the current actions.
Getting the full list of threads that are in a miss/megamiss state seems very useful. Similarly, a list of the threads involved in a Kill or MultiKill event seems useful and would be consistent with the API used for miss/megamiss, even if the kind of operations I expect we would run on Kill/MultiKill might not require the thread id information.
Good idea, that would add more granularity and be more consistent than Abort.
IIUC, the best way to provide the 'extensions' (for particular watchdog events) is via Factories and using utilities such as GetAndCheckFactory to then get the callbacks.
I don't see an existing class deriving from TypedFactory that suits WatchDog, so I'll create a new subclass of TypedFactory that all specific watchdog factories will derive from, similar to NamedNetworkFilterConfigFactory.
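As a rough illustration of the factory approach described above, the sketch below uses a simplified name-keyed registry. Every class here (`WatchDogAction`, `WatchDogActionFactory`, `FactoryRegistry`, `LogAction`, `LogActionFactory`) is hypothetical, and the `getAndCheckFactory` method only imitates the idea of looking up a factory and failing loudly when it is missing; it is not Envoy's real registry or utility API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical action interface; Envoy's real interfaces differ.
class WatchDogAction {
public:
  virtual ~WatchDogAction() = default;
  virtual std::string describe() const = 0;
};

// Base factory that every concrete watchdog action factory would derive from.
class WatchDogActionFactory {
public:
  virtual ~WatchDogActionFactory() = default;
  virtual std::string name() const = 0;
  virtual std::unique_ptr<WatchDogAction> createAction() const = 0;
};

// Simplified name-keyed registry standing in for a real factory registry.
class FactoryRegistry {
public:
  void registerFactory(std::unique_ptr<WatchDogActionFactory> factory) {
    assert(factory != nullptr);
    const std::string name = factory->name();
    factories_[name] = std::move(factory);
  }

  // Looks up a factory by name and throws if it is not registered.
  const WatchDogActionFactory& getAndCheckFactory(const std::string& name) const {
    const auto it = factories_.find(name);
    if (it == factories_.end()) {
      throw std::invalid_argument("no watchdog action factory named '" + name + "'");
    }
    return *it->second;
  }

private:
  std::map<std::string, std::unique_ptr<WatchDogActionFactory>> factories_;
};

// Example concrete action and factory (names are illustrative).
class LogAction : public WatchDogAction {
public:
  std::string describe() const override { return "log"; }
};

class LogActionFactory : public WatchDogActionFactory {
public:
  std::string name() const override { return "example.log_action"; }
  std::unique_ptr<WatchDogAction> createAction() const override {
    return std::make_unique<LogAction>();
  }
};
```

With this shape, the `name` field of each configured action selects the factory, and an unknown name is rejected at configuration time rather than when the watchdog fires.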
@envoyproxy/api-shepherds for input
+1 to using an extension/typed_config interface if we think that we will want this to be extensible in the future with different actions/events.
I've turned the Extension PR from a draft into an actual PR. Since it was getting quite big I've decided to implement one of the extensions in another PR.
For implementing CPU profiling based on WatchDogEvents:
Add a reference to the WD’s dispatcher to the context, so that WDActions can use it to schedule callbacks.
step(), which will StartProfiling and can then schedule the stop() functionality on the GuardDog’s dispatcher. We need to schedule the stop() there because it’s possible we don’t invoke the action while profiling, and thus might not stop profiling.
A proto for the configuration of the WD action:
Profile durations -- how long to run the profile for
Profile_path -- path to the file to output the profiles to
Max_profiles_per_thread -- limits the max number of profiles we’ll generate for a given thread, to avoid filling disks
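The profiling plan above can be sketched as follows. This is only an illustration under stated assumptions: `FakeDispatcher` is a toy stand-in for the GuardDog's dispatcher, `ProfileAction::run()` plays the role of the step that starts profiling, and no real profiler is started; the struct fields simply mirror the three proposed configuration options.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ThreadId = std::int64_t;

// Toy dispatcher: stores delayed callbacks until the caller runs them.
// Stands in for the GuardDog's dispatcher; not Envoy's dispatcher API.
class FakeDispatcher {
public:
  void schedule(std::chrono::milliseconds delay, std::function<void()> cb) {
    pending_.emplace_back(delay, std::move(cb));
  }

  // Runs and clears everything scheduled so far, ignoring the delays.
  void runPending() {
    auto ready = std::move(pending_);
    pending_.clear();
    for (auto& entry : ready) {
      entry.second();
    }
  }

private:
  std::vector<std::pair<std::chrono::milliseconds, std::function<void()>>> pending_;
};

// Mirrors the three proposed configuration fields.
struct ProfileActionConfig {
  std::chrono::milliseconds profile_duration;
  std::string profile_path;
  std::uint32_t max_profiles_per_thread;
};

class ProfileAction {
public:
  ProfileAction(ProfileActionConfig config, FakeDispatcher& dispatcher)
      : config_(std::move(config)), dispatcher_(dispatcher) {
    assert(config_.profile_duration.count() > 0);
  }

  // Starts a profile for `tid` unless one is already running or the
  // per-thread cap has been reached. Returns true if a profile was started.
  bool run(ThreadId tid) {
    if (running_ || profiles_started_[tid] >= config_.max_profiles_per_thread) {
      return false;
    }
    running_ = true; // A real action would start the CPU profiler here.
    ++profiles_started_[tid];
    // Schedule the stop on the dispatcher so that profiling ends even if
    // the action is never invoked again.
    dispatcher_.schedule(config_.profile_duration, [this] { running_ = false; });
    return true;
  }

  bool running() const { return running_; }

private:
  const ProfileActionConfig config_;
  FakeDispatcher& dispatcher_;
  bool running_ = false;
  std::map<ThreadId, std::uint32_t> profiles_started_;
};
```

Scheduling the stop at start time, instead of checking elapsed time on the next invocation, is what guarantees the profile ends even when the watchdog event never recurs. The per-thread counter bounds how many profile files a single stuck thread can produce.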