Envoy: [watchdog] Provide additional watchdog actions and/or extension points

Created on 1 Jun 2020 · 8 Comments · Source: envoyproxy/envoy

The thread watchdog is already an important mechanism for detecting and recovering from coding errors that result in infinite loops, blocking API calls, and very long computations in worker threads. There are a few simple improvements that would make the watchdog even more awesome:

- Option to capture a 5 to 10 second CPU profile after a series of watchdog misses or megamisses, and either write it to disk or make it available via the admin interface. If writing to disk, provide a parameter for the max number of profiles to generate, to avoid filling up the disk.
- Option to capture and log the current stack of the watched thread, or all thread stacks, on megamiss.
- Option to terminate the process by sending SIGABRT to the stuck thread instead of calling PANIC on the guarddog thread.
- Registration mechanism for additional callbacks to invoke on watchdog miss or megamiss, which could be used to implement some of the prior ideas and/or integrate with third-party systems. Callback arguments may include the list of threads that have experienced recent megamiss events and info about when they were last reported alive.

area/watchdog, help wanted

antoniovicente

All 8 comments

/assign @KBaichoo

KBaichoo on 15 Jul 2020

For a general mechanism I'm proposing the following changes:

Bootstrap.proto

// Field and enum numbers below are illustrative.
message WatchDog {
        enum WatchDogEvent {
                Abort = 0;
                Miss = 1;
                MegaMiss = 2;
        }

        message WatchDogActions {
                string name = 1;
                WatchDogEvent evt_type = 2;
        }

        repeated WatchDogActions actions = 1;
}

In GuardDogImpl:

The Step() function needs vectors that hold (tid, last touch time) pairs for the different miss categories
- 3 vectors:
  - Miss
  - Mega Miss
  - MultiKill
- Have the miss / megamiss callbacks run outside the critical section, and the multikill / kill callbacks inside it
GuardDogImpl also needs access to the registry mechanism so it can create a mapping of ‘Event Type’: [Array of Callbacks]

The callbacks will have the following signature:
void Function(WatchDogEvent event, std::vector<std::pair<Thread::ThreadID, MonotonicTime>> tid_ltt_pairs, MonotonicTime now)
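
For illustration, here is a minimal sketch of how the event-to-callbacks mapping and the "callbacks outside the critical section" idea could fit together. It is not Envoy's GuardDogImpl: `GuardDogSketch`, `registerCallback`, `touch`, and the `ThreadId` / `MonotonicTime` aliases are stand-ins invented for this example, the enum uses the Kill / MultiKill / Miss / MegaMiss set discussed in this thread, and the kill paths are left out. It also assumes a thread past the megamiss timeout is reported in both the miss and megamiss lists.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Stand-ins for Envoy types; every name in this block is hypothetical.
using ThreadId = std::int64_t;
using MonotonicTime = std::chrono::steady_clock::time_point;
enum class WatchDogEvent { Kill, MultiKill, Miss, MegaMiss };

using TidLttPairs = std::vector<std::pair<ThreadId, MonotonicTime>>;
using WatchDogCallback =
    std::function<void(WatchDogEvent, const TidLttPairs&, MonotonicTime)>;

class GuardDogSketch {
public:
  GuardDogSketch(std::chrono::milliseconds miss, std::chrono::milliseconds megamiss)
      : miss_(miss), megamiss_(megamiss) {
    assert(miss <= megamiss);
  }

  // Builds the 'Event Type': [Array of Callbacks] mapping.
  // Assumed to be called during setup, before step() runs.
  void registerCallback(WatchDogEvent event, WatchDogCallback cb) {
    callbacks_[event].push_back(std::move(cb));
  }

  // Records that a watched thread was alive at `now`.
  void touch(ThreadId tid, MonotonicTime now) {
    std::lock_guard<std::mutex> lock(mutex_);
    last_touch_[tid] = now;
  }

  void step(MonotonicTime now) {
    TidLttPairs misses;
    TidLttPairs megamisses;
    {
      // Critical section: only collect (tid, last touch time) pairs here.
      std::lock_guard<std::mutex> lock(mutex_);
      for (const auto& entry : last_touch_) {
        const auto delta = now - entry.second;
        if (delta > miss_) {
          misses.emplace_back(entry.first, entry.second);
        }
        if (delta > megamiss_) {
          megamisses.emplace_back(entry.first, entry.second);
        }
      }
    }
    // The lock is released, so slow callbacks cannot block touch().
    invoke(WatchDogEvent::Miss, misses, now);
    invoke(WatchDogEvent::MegaMiss, megamisses, now);
  }

private:
  void invoke(WatchDogEvent event, const TidLttPairs& pairs, MonotonicTime now) const {
    if (pairs.empty()) {
      return;
    }
    const auto it = callbacks_.find(event);
    if (it == callbacks_.end()) {
      return;
    }
    for (const auto& cb : it->second) {
      cb(event, pairs, now);
    }
  }

  const std::chrono::milliseconds miss_;
  const std::chrono::milliseconds megamiss_;
  std::mutex mutex_;
  std::map<ThreadId, MonotonicTime> last_touch_;
  std::map<WatchDogEvent, std::vector<WatchDogCallback>> callbacks_;
};
```

Passing the pairs by const reference rather than by value, as in the signature above, is a choice made for the sketch to avoid copying the vector for each callback.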

KBaichoo on 17 Jul 2020

You may want to have the events be: MultiKill, Kill, Miss and Megamiss to match the current actions.

Getting the full list of threads that are in a miss/megamiss state seems very useful. Similarly, the list of threads involved in the Kill or MultiKill event seems useful and would be consistent with the API used for miss/megamiss, even if the kind of operations I expect we would run on Kill/MultiKill may not require the thread id information.

antoniovicente on 17 Jul 2020

Good idea, that would add more granularity and be more consistent than Abort.

KBaichoo on 17 Jul 2020

IIUC, the best way to provide the 'extensions' (for particular watchdog events) is via Factories and using utilities such as GetAndCheckFactory to then get the callbacks.

I don't see a class deriving from TypedFactory that suits WatchDog, so I'll create a subclass of TypedFactory that all the specific watchdog factories will derive from, similar to NamedNetworkFilterConfigFactory.
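
To make the factory idea concrete, here is a rough, self-contained sketch of a name-keyed factory registry. It does not use Envoy's real TypedFactory, registry, or GetAndCheckFactory utilities: `WatchDogAction`, `WatchDogActionFactory`, `WatchDogActionRegistry`, the `CountingAction` example, and the `example.counting_action` name are all made up here, and the lookup method only borrows the "get and check" naming to show the fail-loudly-on-unknown-name behaviour.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// All types below are hypothetical stand-ins, not Envoy's interfaces.
enum class WatchDogEvent { Kill, MultiKill, Miss, MegaMiss };

// An action that the guard dog runs when a configured event fires.
class WatchDogAction {
public:
  virtual ~WatchDogAction() = default;
  virtual void run(WatchDogEvent event) = 0;
};
using WatchDogActionPtr = std::unique_ptr<WatchDogAction>;

// Base class that every specific action factory derives from.
class WatchDogActionFactory {
public:
  virtual ~WatchDogActionFactory() = default;
  virtual std::string name() const = 0;
  virtual WatchDogActionPtr createAction() = 0;
};

// Maps the `name` from the config to the factory that builds the action.
class WatchDogActionRegistry {
public:
  void registerFactory(std::unique_ptr<WatchDogActionFactory> factory) {
    assert(factory != nullptr);
    std::string name = factory->name();
    factories_[std::move(name)] = std::move(factory);
  }

  // Throws instead of returning null so a config typo fails at startup.
  WatchDogActionFactory& getAndCheckFactory(const std::string& name) const {
    const auto it = factories_.find(name);
    if (it == factories_.end()) {
      throw std::invalid_argument("unknown watchdog action: " + name);
    }
    return *it->second;
  }

private:
  std::map<std::string, std::unique_ptr<WatchDogActionFactory>> factories_;
};

// Example extension: an action that counts how often it was run.
class CountingAction : public WatchDogAction {
public:
  explicit CountingAction(std::shared_ptr<int> count) : count_(std::move(count)) {}
  void run(WatchDogEvent) override { ++*count_; }

private:
  std::shared_ptr<int> count_;
};

class CountingActionFactory : public WatchDogActionFactory {
public:
  explicit CountingActionFactory(std::shared_ptr<int> count) : count_(std::move(count)) {}
  std::string name() const override { return "example.counting_action"; }
  WatchDogActionPtr createAction() override {
    return std::make_unique<CountingAction>(count_);
  }

private:
  std::shared_ptr<int> count_;
};
```

In this sketch the guard dog would look up each configured action by name at startup, call `createAction()`, and store the result under the event type it was configured for.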

KBaichoo on 20 Jul 2020

@envoyproxy/api-shepherds for input

antoniovicente on 20 Jul 2020

+1 to using an extension/typed_config interface if we think that we will want this to be extensible in the future with different actions/events.

mattklein123 on 20 Jul 2020

I've turned the Extension PR from a draft into an actual PR. Since it was getting quite big I've decided to implement one of the extensions in another PR.

For implementing CPU profiling based on WatchDogEvents:

Add a reference to the WD’s dispatcher to the context, so that WDActions can use it to schedule callbacks
- We’ll be able to trigger the action from the GuardDog’s step(), which will StartProfiling, and can then schedule the stop() functionality on the GuardDog’s dispatcher. We need to schedule the stop() there because the action might not be invoked again while the profile is running, in which case nothing else would stop the profiling.
A proto for the configuration of the WD action:
- Profile_duration -- how long to run the profile for
- Profile_path -- path of the file to write the profiles to
- Max_profiles_per_thread -- limits the max number of profiles we’ll generate for a given thread, to avoid filling disks
Implement a WatchDogProfileAction and Factory
- The action tracks whether it’s currently running a profile, as well as the tids associated with profiles and their counts
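
As a rough sketch of the bookkeeping described above (not the actual WatchDogProfileAction), the following shows an action that starts a profile, schedules its own stop, refuses to start a second profile while one is running, and enforces the per-thread limit. The profiler hooks and the scheduling function are injected as `std::function`s standing in for the real profiler and the GuardDog's dispatcher; those, the class and struct names, the output file naming scheme, and the choice to profile for the first thread still under its limit are all assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ThreadId = std::int64_t; // stand-in for Envoy's thread id type

// Mirrors the three proposed config fields; defaults are arbitrary.
struct ProfileActionConfig {
  std::chrono::seconds profile_duration{5};
  std::string profile_path;
  std::uint32_t max_profiles_per_thread{1};
};

// Hypothetical hooks standing in for the real CPU profiler.
struct ProfilerHooks {
  std::function<bool(const std::string& file)> start; // true if started
  std::function<void()> stop;
};

// Hypothetical stand-in for "run fn after delay on the GuardDog's dispatcher".
using ScheduleFn = std::function<void(std::chrono::seconds, std::function<void()>)>;

class WatchDogProfileActionSketch {
public:
  WatchDogProfileActionSketch(ProfileActionConfig config, ProfilerHooks profiler,
                              ScheduleFn schedule)
      : config_(std::move(config)), profiler_(std::move(profiler)),
        schedule_(std::move(schedule)) {
    assert(profiler_.start && profiler_.stop && schedule_);
  }

  // Called from the guard dog with the threads that triggered the event.
  // Returns true if a new profile was started.
  bool run(const std::vector<ThreadId>& tids) {
    if (running_) {
      return false; // only one profile at a time
    }
    for (const ThreadId tid : tids) {
      std::uint32_t& count = profiles_per_tid_[tid];
      if (count >= config_.max_profiles_per_thread) {
        continue; // this thread has used up its quota
      }
      // File naming is invented for this sketch.
      const std::string file = config_.profile_path + "/watchdog_profile." +
                               std::to_string(tid) + "." + std::to_string(count);
      if (!profiler_.start(file)) {
        return false;
      }
      ++count;
      running_ = true;
      // Schedule the stop independently of future run() calls, since the
      // action may never be invoked again while the profile is running.
      // The action must outlive the scheduled callback.
      schedule_(config_.profile_duration, [this] {
        profiler_.stop();
        running_ = false;
      });
      return true;
    }
    return false;
  }

  bool running() const { return running_; }

private:
  const ProfileActionConfig config_;
  ProfilerHooks profiler_;
  ScheduleFn schedule_;
  bool running_{false};
  std::map<ThreadId, std::uint32_t> profiles_per_tid_;
};
```

Because both `run()` and the scheduled stop would execute on the guard dog's thread in this design, the sketch keeps `running_` as a plain bool with no locking.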