Esp-idf: [TW#13265] Implicitly enabled task WDT leads to unexpected crashes

Created on 13 Jun 2017 · 13Comments · Source: espressif/esp-idf

Task WDT is enabled implicitly on the first call to task_wdt_feed (here).
this may in some cases trigger WDT timeouts if code path that feeds WDT is invoked from a task that does not use it.
case in point: eventTask delivers system events, which does something that feeds WDT - now eventTask is registered for WDT, and duly triggers it, because it does not feed it, as a rule.

instead, task WDT registration should be explicit.

Source

rojer

😕1

All 13 comments

if code path that feeds WDT is invoked

Isnt it better to avoid that in general?

negativekelvin on 13 Jun 2017

👍1

um... no? because it's not always direct. for example, in this case this was in logging, which happened to be invoked from different tasks.

so, no, i think the way WDT works currently is non-obvious and should be enabled explicitly on every task that should be monitored.

rojer on 14 Jun 2017

😕1

I suppose that's reasonable multiple tasks sharing code that ends up in long running loop like log dump.

Perhaps task_wdt_feed could take a bool argument "add this task if not already in list".

negativekelvin on 14 Jun 2017

right. but then you'd set it to false everywhere except, perhaps, where you know it's the first time you're calling it.
so then task_wdt_feed(true) is more or less task_wdt_enable() anyway :)
i think it should just be a separate function that you call, ideally with a task handle argument to make it non-ambiguous. it's easy enough to say task_wdt_enable(xTaskGetCurrentTaskHandle()) if you need it, but really, the decision whether you want WDT on a task or not is most likely made at task creation, where you have the handle of the just-created created task, like so:

void app_main(void) {
  TaskHandle_t h;
  ...
  xTaskCreate(my_task, "myTask", 4096, NULL, 1, &h);
  // This is a critical task.
  task_wdt_enable(h);
}

rojer on 14 Jun 2017

👎1

On second thought, I believe thie proposed change is a bad idea and strongly vote against it.

Essentially, the proposed change supports non-cleanly designed code, where code shared by several execution contexts assumes that it is being run in a certain execution context when the opposite is true, while it breaks the assertion of clean multi-thread-enabled code that all calls to trigger the WD can be relied on, without a compile time error to signal the changed semantics of the existing API call.

Options I see:

1) Have a task_wdt_disable() instead of the task_wdt_enable(). Also not a good idea because user code would disable a system task's WD assertions without knowing them (in the above example), which is a very bad approach.

2) Have an additional API call task_wdt_feed_if_watched(). This would support the few cases where splitting up a longer task into smaller ones, with the WD triggered between the chunks in the regular way by the task specific code, would require significant overhead, and maintenance / clarity / design of the code plays a minor role. It would also not break the principle of least surprise and the API call name would be intuitive.

3) Have a completely new API in parallel, again in addition to the old one which should not changes its semantics. Then, one would have one "automatic" and "complete" API and one "explicitly task-based" API.

4) Change to the newly proposed API semantics but change all API function names, such that existing code won't compile and nobody is fooled into thinking that their existing code still has the same semantics -- who would enable the WD if they wouldn't want to rely on it?

HubbyGitter on 18 Jun 2017

there is nothing wrong with having a piece of code executed from different tasks, provided access to shared state is properly protected.
for example, filesystem code must be able to be executed from any task, but may still want to feed wdt during potentially long running operation, such as flash erase (another actual use case i ran into). if the task it is invoked from uses wdt, it should be fed, but having wdt enabled on the invoking task hen it wasn't before is a wrong thing to do and leads to crashes.

as to how to make the API change, i'm ok with either. in fact, i'd be ok if all the existing system tasks and all the newly-created tasks just had WDT enabled by default - after all, protecting system event task with a WDT is not such a bad idea. just need to make sure they actually feed wdt properly when idle.

rojer on 18 Jun 2017

there is nothing wrong with having a piece of code executed from different tasks, provided access to shared state is properly protected.

I never said it was wrong. Running certain code artifacts in different execution contexts is a fundamental basis of muiti-threaded systems. But your condition is incomplete.

As we can clearly see in this thread, it is _not_ enough to "protect access" (e.g. to avoid race conditions), but also to consider other aspects for such code artifacts:

different resource limitations (e.g. stack and heap of task)
different external impact (e.g. core specific interrupt and / or scheduler settings)
different performance (e.g. different cores on one chip)
...

Therefore, code which can potentially run in any of the given contexts must obey these aspects.

Having long running function calls is always a challenge. Fully flavoured OSs run them in different contexts and on IDF, I detach them from my application code myself, although it would be better if the system would do that already.

Either way, why not implement option 2). It's simple, intuitive, does not break existing code, and follows the principle of least surprise as far as I see.

HubbyGitter on 18 Jun 2017

theoretical arguments aside, option (2) would make IDF's API different from, basically, every other system everywhere. task_wdt_feed() should always behave as feed_if_watched, that's the only sensible way.

rojer on 18 Jun 2017

Let's say there are two approaches to using a WD.

Relaxed

"I want to know that my system is still running."
"I want extensive library calls to kick the WD regardless of whether or not it monitors the task."

Strict

"My code has tight runtime restrictions."
"I have chosen the WDT interval carefully; my system is not guaranteed to be stable if I exten it."
"When I kick the WD in my code, it is not primarily because I want to satisfy the WD, but rather because this code must be monitored because its functionality is critical to the system's health."
"I break down longer activities into chunks. My tasks are mostly event-driven and I cannot afford to not handle any event on time. Longer tasks are cancelled gracefully after having processed a chunk if a contradicting event occurs."
"I place longer running third party code in a different task and cover it by callback / event and timeout mechanisms."

Now we have the existing implementation which effectively implements (and requires) the _strict_ approach. On FreeRTOS, this makes sense. At the same time, this matches neither Arduino thinking nor some of the existing libraries (e.g. flash handling) very well, and you alsy say that the API should not deviate from "basically every other system".

So what do you see as a good solution?

I believe I say a proprietary WD implementation in the context of functional safety where kicking the WD from an unregistered task was considered a fatal configuration error detected at runtime.

HubbyGitter on 18 Jun 2017

FYI, I think kicking the WDT in library tasks is a very bad thing to do: in order to make sure your code isn't spinning somewhere, the only reasonable place to kick the WDT is in the top-level thread loop. (I maybe adhere more to the 'strict' approach as defined above.) For 'less strict' approaches, I suggest setting the WDT timeout to something bigger, so the WDT doesn't kick in when your thread is out to lunch for a while because it needs to do something specific. (Tbh, allowing the threads to set a per-thread watchdog timeout value may be helpful to fail early when we can.)

I do agree that the wisdom in implicitly enabling of the WDT on feeding it the first time is debatable, but I'm not sure if an API change to fix that issue is worth it.

Spritetm on 20 Jun 2017

👍1

I second @Spritetm and will say that in professional embedded systems, the "relaxed" approach, erm, _should_ not be found. Therefore, I'm not in favour of it at all. However, ESP32 lives a hybrid life, having a lot of potential for professional applications, while at the same time trying to serve the demand of the Arduino-ish world, where many people just want to get something up and running. So if there were two APIs for the two worlds, I'd also be happy. But if IDF makes a strict implementation impossible, then I won't see that as good news.

HubbyGitter on 20 Jun 2017

But if IDF makes a strict implementation impossible, then I won't see that as good news.

I second that.

PerMalmberg on 6 Aug 2017

Task watchdog has been overhauled, please refer to commit 9d63e1da4a233f8b3043157f58894fb9b2c2668d .
Documentation can be found via https://esp-idf.readthedocs.io/en/latest/api-reference/system/wdts.html