ESPEasy 🚀 - Hardware watchdog... how to find the cause?

If some piece of code is running for over 6 seconds without calling any delay or yield, it will trigger a hardware watchdog, which performs a reset.

So there is some code in your setup either waiting for that long, or running an "infinite loop".
Could you please give more info on your setup?
Also do not set the "MessageDelay" too high, nor use "delay" in the rules.
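
To illustrate the point about long-running code: below is a minimal host-side C++ sketch (not ESPEasy code) that splits a long job into slices and hands control back between slices. The names `processInSlices`, `doStep` and `yieldHook` are hypothetical; on an ESP8266 the hook would typically wrap `yield()` or `delay(0)`.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper: run `totalSteps` units of work and call `yieldHook`
// after every `sliceSize` steps, so no single stretch of code runs for long
// without giving the rest of the system a turn.
// Returns how many times the hook was called.
std::size_t processInSlices(std::size_t totalSteps,
                            std::size_t sliceSize,
                            const std::function<void(std::size_t)>& doStep,
                            const std::function<void()>& yieldHook) {
    if (sliceSize == 0) {
        sliceSize = 1;  // assumption: treat 0 as "yield after every step"
    }
    std::size_t yields = 0;
    for (std::size_t step = 0; step < totalSteps; ++step) {
        doStep(step);
        if ((step + 1) % sliceSize == 0) {
            yieldHook();
            ++yields;
        }
    }
    return yields;
}
```

The slice size is a trade-off: smaller slices yield more often at the cost of some overhead per call.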

TD-er on 16 Aug 2018

@TD-er: FYI, the latest crash/reboot I mentioned in #1643 was reported in the GUI as _Reset Reason | Hardware Watchdog_. This was the test system that had run for a day, then the WiFi went offline for a couple hours, then the board rebooted on its own. Might be related to this issue, or maybe not.

Thomas

thomastech on 16 Aug 2018

Here is my config:
[Screenshot, 2018-08-18 11:42:14]

giig1967g on 18 Aug 2018

See my comment here: https://github.com/letscontrolit/ESPEasy/issues/1659#issuecomment-414047835
Looks like my nodes are also "affected", which is good :)

TD-er on 18 Aug 2018

@TD-er: Yesterday I loaded the ESP_Easy_mega-20180815_test_ESP8266_4096 build on a NodeMCU. It ran great for 18 hours then rebooted. System Info says: Boot : Manual reboot (1), Reset Reason : Hardware Watchdog.

A second duplicate NodeMCU is still running fine. But it has only been running for 17 hrs, so it may face the reboot dance soon.

Thomas

thomastech on 18 Aug 2018

Would be great to see if it occurs at the same interval, or time of day.
Maybe it is some NTP refresh, or something else, who knows.

TD-er on 18 Aug 2018

@TD-er: That would be great. But so far I have not seen a pattern that indicates it is triggered by run duration or time of day.

My hunch is that it is something related to WiFi, such as a reconnect. But I have tried to torture the WiFi connection (force router offline, create weak RF signal levels) and nothing bad happened. So my hunch seems to be nonsense. Hopefully you find the cause and save us.

Thomas

thomastech on 18 Aug 2018

It rebooted again after running for about 2 hours. Now reports _Manual reboot (2), Reset Reason : Hardware Watchdog_.

The second duplicate NodeMCU is still running fine. About 19+ hours so far.

Thomas

thomastech on 18 Aug 2018

I found an issue with the handling of UDP traffic (when C013 is used) that could cause exception crashes (though not likely a watchdog reset).
I also added some checks when creating a UDP client for NTP, to see whether that may cause infinite waiting.
Infinite waits like that can cause a watchdog reset.

When tested, I will make a commit for it.

TD-er on 18 Aug 2018

@TD-er
I've been testing the new builds as they are released. So far none have solved the Watchdog reset. However, the latest ESP_Easy_mega-20180922_dev_ESP8266_4096.bin build seems better than the last release.

The latest firmware has been successfully running on one of my NodeMCU's (current duration 1 day 22 hrs, no reboots). But the second NodeMCU has rebooted several times.

I have a feeling the rebooting is related to the WiFi access since the latest reboot occurred when I accessed the device from my browser. I also noticed a previous reboot occurred at a time when the good "working" device reported a beacon timeout. But I can't replicate the reboots on demand.

SysInfo reports this:
Boot: Manual reboot (8)
Reset Reason: Hardware Watchdog

Not sure if you are interested in this feedback. But it's been a month+ since my last update and I thought I'd keep you posted on my findings.

Thomas

thomastech on 24 Sep 2018

I'm always interested in feedback, especially when there seems to be a bit of improvement :)

TD-er on 24 Sep 2018

The reboot issue certainly has been a tough nut to crack.

My winning streak with the "working" device just ended. After 48 hours runtime it rebooted. Sysinfo reports _Manual reboot (1) Hardware Watchdog_

Thomas

thomastech on 24 Sep 2018

See PR #1834
I just merged a change which moves the address space of the Arduino stack to be on top of the System stack.
The latest core library appears to have shifted the Arduino stack to overlap a bit with the System stack to save about 4k of memory.
But since we're allocating quite a lot on the System stack, this may have led to an increase in reports of HW Watchdog resets.

So please test with the October 1st build, as soon as that one is ready.

TD-er on 1 Oct 2018

I've installed ESP_Easy_mega-20181001_dev_ESP8266_4096.bin on two NodeMCUs. I will report back tomorrow (or sooner) if they experience a reboot.

Some initial comments:

The 4K decrease in system free ram is disappointing. If this change to stack memory does not help then please consider reverting back.
Not sure if it is related to the new firmware, but web page access is slow. Sometimes navigating the tabs takes several seconds for each new page to populate.

Thanks again for your efforts. Fingers are crossed that this merge helps cure the W-dog reboot issue.

Thomas

thomastech on 1 Oct 2018

Hmm I had the impression it was loading a bit faster with the changed Arduino stack location.
But maybe getting the free stack for statistics is using quite some resources and is called a bit more in some functions

TD-er on 1 Oct 2018

@TD-er: I don't know why, but the slow page loading has gone away and refresh is OK now. If there are no other reports of it then I'd say it was an isolated issue related to my WiFi router.

Thomas

thomastech on 1 Oct 2018

Phew, I was afraid you had to report a crash/reboot already.

TD-er on 1 Oct 2018

I was thinking about this....
The webserver has some mechanism to free memory when it is too low.
This freeing memory may take some time which makes the webinterface slow down.

TD-er on 1 Oct 2018

Interesting, I wasn't aware the webserver could do that kind of magic. It may have been involved because during the slow browser response the system load was about 40%. But after the problem went away it settled down to about 30% load.

Now the bad news. One of the test devices rebooted after 10 hours.
System info report:
Boot: Manual reboot (1)
Reset Reason: Software Watchdog

I'll let the other device continue to run until it reboots. Maybe the force is stronger with this one.

Thomas

thomastech on 2 Oct 2018

It is not that bad, since a software watchdog is different from a hardware watchdog.
The software version means it is still doing stuff

TD-er on 2 Oct 2018

You shouldn't have said that it is not that bad because Murphy is watching us. The second device rebooted at 19 hours due to hardware reset.
System info report:
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog

BTW, the other device rebooted again a few minutes ago. Another Software reset.
System info report:
Boot: Manual reboot (3)
Reset Reason: Software Watchdog

Thomas

thomastech on 2 Oct 2018

OK, so that's not the fix :(
Can you also/already test using this PR: https://github.com/letscontrolit/ESPEasy/pull/1838 ?
I will later this evening add extra settimeout calls to other WiFi client instances, so it is not complete yet.

TD-er on 2 Oct 2018

No problem, I'll test PR #1838 after it is incorporated in the nightly build.

Thomas

thomastech on 2 Oct 2018

I'm running both devices on ESP_Easy_mega-20181004_dev_ESP8266_4096.bin. Twelve hours so far, no reboots. Fingers crossed.

Thomas

thomastech on 4 Oct 2018

I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

giig1967g on 4 Oct 2018

@TD-er do you think this could help?
Somebody just needs to figure out how to conveniently link the .elf file from the build date to the exception decoder...

Another thought on this: If it is possible to catch the exception and save stuff, can't we just catch that exception too and do something with it ? Save some text, send an email, ignore it and carry on ?

s0170071 on 4 Oct 2018

@thomastech Just curious, what are the memory and stack stats of that node running the dev build?

@s0170071 That's a very nice library.
I think we could try it, to see what's happening.
Also I could read the last crash log at boot and write it to SPIFFS.
Or just add a 'crash log report' option to send the crash to a server
I think it deserves its own issue. (to make it easier to find)

TD-er on 4 Oct 2018

> Just curious, what are the memory and stack stats of that node running the dev build?

Both NodeMCU devices have the same ESPEasy configuration. Here's a System Info snapshot.

_Device 1 ("production" NodeMCU):_
Load 30.60% (LC=10052)
Free Mem 10328 (8968 - sendContentBlocking)
Free Stack 3584 (144 - LoadTaskSettings)

_Device 2 ("Test" NodeMCU):_
Load: 27.40% (LC=10648)
Free Mem: 9920 (8272 - sendContentBlocking)
Free Stack: 3520 (144 - LoadTaskSettings)

Thomas

thomastech on 4 Oct 2018

OK, so the Arduino stack is dangerously low at some point.
The lowest in both configurations is 144 bytes.
I will increase the Arduino stack to 5k (default = 4k) just to be sure and then start hunting for stack usage.

Also the heap is used quite intensively, so that one should also be looked at.
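
For readers wondering how a "lowest free stack" figure like the 144 bytes above can be obtained at all: one common technique is stack painting. The sketch below simulates it on a plain byte buffer in host-side C++. It is an illustration only, not how ESPEasy or the core necessarily measures it; `kPaint`, `paintStack` and `untouchedBytes` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pattern written into the whole stack region before it is used.
constexpr std::uint8_t kPaint = 0xA5;

// Fill a (simulated) stack region with the paint pattern.
void paintStack(std::vector<std::uint8_t>& region) {
    std::fill(region.begin(), region.end(), kPaint);
}

// Assumption: the stack grows downward, so used bytes sit at the high end
// of the region. Counting still-painted bytes from the low end gives the
// smallest amount of free stack seen since painting.
// Caveat: real data that happens to equal kPaint is counted as untouched.
std::size_t untouchedBytes(const std::vector<std::uint8_t>& region) {
    std::size_t n = 0;
    while (n < region.size() && region[n] == kPaint) {
        ++n;
    }
    return n;
}
```

With a 4096-byte region where all but the lowest 144 bytes have been overwritten, `untouchedBytes` reports 144.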

TD-er on 4 Oct 2018

> I will increase the Arduino stack to 5k (default = 4k) just to be sure and then start hunting for stack usage.

That sounds like a good thing to try.

One device rebooted after 19 hours.
Sys Info on rebooted unit:
Load: 27.80% (LC=10652)
Free Mem: 11064 (9528 - sendContentBlocking)
Free Stack 3584 (144 - LoadTaskSettings)
Boot: Manual reboot (1)
Reset Reason: Hardware Watchdog

The second test unit is still OK, 22 hrs run time so far.

Thomas

thomastech on 5 Oct 2018

About the stack:
Normally, if you return from somewhere back to the main loop, the stack should only hold local variables defined by the main loop.
Anything that comes on top, i.e. drains the stack, is then either:

- interrupt routine stuff (these can sometimes also be nested), but once you're back in loop that should be OK;
- core stuff (WiFi), but that goes to the sys stack, afaik.

Did I forget something ?

How can it then be possible to have a leak on the stack? It's just not how it works.

It just can be a local variable that "grows", i.e. a string or a list.

s0170071 on 5 Oct 2018

Lists and Strings grow only on the heap, even when declared on the stack.
And I don't think we have a leak, but we are just using too much in some calls.
Either the depth of calls or number of stack allocated variables (or their size) is the problem.

TD-er on 5 Oct 2018

> I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

Unfortunately one of units just rebooted with Hardware Watchdog when connected to Mikrotik router.
The unit connected to ASUS router is still working after several days.

giig1967g on 5 Oct 2018

@TD-er is this fixed already ? Certainly WiFi related ;-)

s0170071 on 5 Oct 2018

> I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.
>
> Unfortunately one of units just rebooted with Hardware Watchdog when connected to Mikrotik router.
> The unit connected to ASUS router is still working after several days.

The same unit that had HW now does not respond to web server (but works "internally"). This unit is connected to Mikrotik.
So definitely recent changes do not solve the wifi connection issues.

giig1967g on 5 Oct 2018

please see #1640

giig1967g on 5 Oct 2018

Stop

cs3395202 on 5 Oct 2018

@s0170071

> @TD-er is this fixed already? Certainly WiFi related ;-)

Forgot it, will add it now to merge with the stack5k fix.

TD-er on 6 Oct 2018

Sorry to say: The dog is still there.
2 units running the daily build:

Wemos D1 mini, running rcwl-0516.

Wemos D1 mini, running a 1602 LCD, rcwl-0516 as Display button and 2x DS18B20.

In the Log:

Seems not correlated to the Watchdog.

Domosapiens on 6 Oct 2018

I found something very strange this evening.

I have installed the Chrome plugin "JSONView", which will check and format JSON.
This makes it very easy to check if JSON is valid.

I set the log level of the web log to "Debug More" (to do a lot of memory allocations) and then loaded the JSON via /logjson#

On almost every reload I have corrupted JSON.
The output is filled with strings from other webserver requests.

This can also explain some of the issues related to some controllers and perhaps also lead to watchdog resets.

{"Log": {"Entries": [{"timestamp":373907,
"text":"UDP  : 2C:3A:E8:32:90:8B,192.168.1.92,7",
"level":4},
{"timestamp":377450,
"text":"SYS  : 6.00",
"level":2},
{"timestamp":377453,
"text":"EVENT: uptime#uptime=6.00",
"level":2},
{"timestamp":377455,
"text":"EVENT: uptime#uptime=6.00 Processing time:2 milliSeconds",
"level":3},
{"timestamp":377456,
"text":" Domoticz: Sensortype: 1 idx: 209 values: 6.00",
"level":2},
{"timestamp":377487,
"text":"HTTP : C001 connecting to domoticz:8080",
"level":3},
{"timestamp":377494,
"text":"GET /json.htm?type=command&param=udevice&idx=209&nvalue=0&svalue=6.00&rssi=7 HTTP/1.1
^Host: domoticz:8080
^User-Agent: ESP Eas",
"level":3},
{"timestamp":377494,
"text":"/json.htm?type=command&param=udevice&idx=209&nvalue=0&svalue=6.00&rssi=7",
"level":3},
{"timestamp":377504,
"text":"HTTP/1.1 200 OK
",
"level":4},
{"timestamp":377504,
"text":"HTTP : C001 Success! HTTP/1.1 200 OK
",
"level":3},
{"timestamp":377504,
"text":"HTTP : C001 closing connection",
"level":3},
{"timestamp":378549,
"text":"WD   : Uptime 6 ConnectFailures 0 FreeMem 10208",
"level":2},
{"timestamp":378550,
"text":"UDP  : Send Sysinfo message",
"level":4}],
"TTL":1000,
"timeHalfBuffer":2708,
"nrEntries":13,
"SettingsWebLogLevel":4,
"logTimeSpan":4643
}}

As can be seen, it outputs a lot of other return values, like HTTP/1.1 200 OK

TD-er on 8 Oct 2018

nothing wrong with this 200 OK. it is just log from C001 with level 4.

uzi18 on 8 Oct 2018

@uzi18
You're right.
I found (and fixed) the bug where the left over \r wasn't replaced when converting it to JSON.

I will have a look at the recent changes that do a lot with memory allocations and also try to find allocations on the stack.
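
As an illustration of the kind of bug described above (a stray `\r` ending up raw inside a JSON string), here is a small host-side C++ sketch of JSON string escaping. It is not the actual ESPEasy fix, which per the comment above replaced the character; `escapeJsonString` is a hypothetical helper that escapes control characters instead.

```cpp
#include <cstdio>
#include <string>

// Escape a string so it can be placed between double quotes in JSON.
// Control characters such as \r and \n are not allowed raw inside JSON
// strings, so they are written as escape sequences.
std::string escapeJsonString(const std::string& in) {
    std::string out;
    out.reserve(in.size() + 8);
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[7];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}
```

Applied to a log line such as `HTTP/1.1 200 OK` followed by CR LF, the result contains the two-character sequences `\r` and `\n` rather than raw line breaks, so a viewer like JSONView can parse it.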

TD-er on 8 Oct 2018

@TD-er I am currently updating the checkram function to monitor stack too. First output did not yield useful results.
If you go hunting for stack usage, how about wrapping function calls in a "SPIFFS_CHECK" like define that compares the stack pointer before and after the function call ? That should reveal stack eaters...
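
A rough host-side C++ sketch of what such a wrapper define could look like. Everything here is hypothetical (`StackReport`, `checkStack`, `STACK_CHECK`), and instead of reading the stack pointer directly it assumes some function is available that returns the lowest free-stack value seen so far (for example from a painted stack).

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct StackReport {
    std::string name;           // the wrapped expression, as text
    std::size_t bytesConsumed;  // how far the low-water mark dropped
};

// Read the low-water mark before and after `call`. If it dropped, the call
// pushed the stack deeper than anything seen before, and the drop is logged.
std::size_t checkStack(const std::string& name,
                       const std::function<std::size_t()>& lowWaterMark,
                       const std::function<void()>& call,
                       std::vector<StackReport>& reports) {
    const std::size_t before = lowWaterMark();
    call();
    const std::size_t after = lowWaterMark();
    const std::size_t consumed = (after < before) ? (before - after) : 0;
    if (consumed > 0) {
        reports.push_back({name, consumed});
    }
    return consumed;
}

// Wrap a single expression (without top-level commas) in a stack check.
#define STACK_CHECK(reports, lowWater, expr) \
    checkStack(#expr, (lowWater), [&]() { expr; }, (reports))
```

Note the limitation: this only flags calls that set a new record depth, so a call that uses a lot of stack but less than an earlier one goes unnoticed.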

s0170071 on 9 Oct 2018

That's a good one.
I also want to know more about the heap, for example if it gets really fragmented or not. (e.g. what's the max continuous block)
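
To make "how fragmented" a bit more concrete, here is a host-side C++ sketch that summarizes a list of free block sizes. The percentage is a deliberately simple metric chosen for illustration (share of free memory that is not in the largest block); it is an assumption and not necessarily what the core library reports. `HeapSummary` and `summarizeFreeBlocks` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct HeapSummary {
    std::size_t totalFree;          // sum of all free blocks
    std::size_t largestBlock;       // max contiguous free block
    unsigned fragmentationPercent;  // 0 = all free memory is one block
};

// Summarize a list of free block sizes (in bytes).
HeapSummary summarizeFreeBlocks(const std::vector<std::size_t>& freeBlocks) {
    HeapSummary s{0, 0, 0};
    for (std::size_t block : freeBlocks) {
        s.totalFree += block;
        s.largestBlock = std::max(s.largestBlock, block);
    }
    if (s.totalFree > 0) {
        s.fragmentationPercent =
            static_cast<unsigned>(100 - (100 * s.largestBlock) / s.totalFree);
    }
    return s;
}
```

For example, 10240 bytes free split into blocks of 4096, 2048, 2048 and 2048 gives a largest block of 4096 and 60% by this metric, even though "free mem" alone looks healthy.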

TD-er on 9 Oct 2018

Later, once it is known what requires substantial amounts of stack (e.g. spiffs), the available amount can be checked before making the call. And then? Reboot instead of just crashing?
Fragmentation may be an issue too, but I have seen crashes with 17k heap free and no activity. So it's not too likely the bug we're after.

s0170071 on 9 Oct 2018

That last remark sounds reasonable and may save a lot of searching :)

TD-er on 9 Oct 2018

If we had heap fragmentation, what would happen if you cannot allocate new memory ? I would assume the pointer returned by new() to be null. Is this correct ?
If so, a viable test could be to now and then try to allocate some useful heap (leave 3k free for wifi) and see if that worked. And then just free it again.
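
A minimal host-side C++ sketch of such a probe. `canAllocateContiguous` is a hypothetical name. On a standard hosted C++ toolchain a plain `new` throws on failure, so the sketch uses the non-throwing form to get the null-pointer behaviour discussed here.

```cpp
#include <cstddef>
#include <new>

// Try to grab `bytes` as one contiguous block and release it right away.
// Returns true if the allocation succeeded.
bool canAllocateContiguous(std::size_t bytes) {
    if (bytes == 0) {
        return true;  // nothing to probe
    }
    // Calling the allocation function directly (rather than a new-expression)
    // so the compiler is not permitted to optimize the pair away.
    void* probe = ::operator new(bytes, std::nothrow);
    if (probe == nullptr) {
        return false;
    }
    ::operator delete(probe);
    return true;
}
```

A successful probe only says the block was available at that moment; another allocation may take it before the real request is made.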

s0170071 on 9 Oct 2018

In the staged version of the core lib there is some development on that: https://github.com/esp8266/Arduino/pull/5090/commits

So I can have a look at that and make a test build which will also show the heap statistics when available.
Just to get some idea on what's happening.

Allocations with new should indeed return a NULL pointer, but String will fail silently.

TD-er on 9 Oct 2018

very good. Seems to be an issue then.
About the strings failing silently: if you allocate / new() some memory block, free() it again you should be able to string.reserve() it afterwards without trouble, right ?

Sounds like a wrapper function #define for String.reserve()

s0170071 on 9 Oct 2018

std::string is traditionally (in the C++ STL) a standard container, which does the allocation/deallocation for you. The Arduino String class is loosely based on the same principle, only with some extras and also some other functions missing.
Maybe we can check for the actual capacity of the String after calling a reserve. Not sure yet if those are publicly accessible. But you shouldn't do new and free/delete on the members of String or else members will get out of sync.

TD-er on 9 Oct 2018

No free on the string. I meant to check if there is heap available, try to allocate it, free it and then reserve the string buffer.

s0170071 on 9 Oct 2018

@TD-er maybe it is possible to just allocate/reserve some big buffer 200+ chars and use it as static place to manipulate with strings?

uzi18 on 9 Oct 2018

Then you have to implement a lot of operations yourself.

TD-er on 9 Oct 2018

There is no mention of MQTT in this thread as far as I can find.

Yesterday I added a delay(1) to the readByte part of MQTT client PubSubClient.
Can you please test if this is now still an issue?

And if another plugin is active, please mention that one too.
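
For context, a host-side C++ sketch of the general pattern behind a change like the `delay(1)` in `readByte`: a polling wait that is bounded by a timeout and gives up the CPU between polls. This is not the PubSubClient code; `waitForByte` and its callback parameters are hypothetical, and the clock is injected so the sketch can run anywhere.

```cpp
#include <cstdint>
#include <functional>

// Poll `available` until it reports data or `timeoutMs` has elapsed.
// `idle` is called between polls; on an ESP8266 it would typically wrap
// delay(1), which lets background tasks run.
// Returns true if data arrived, false on timeout.
bool waitForByte(const std::function<bool()>& available,
                 const std::function<std::uint32_t()>& nowMs,
                 const std::function<void()>& idle,
                 std::uint32_t timeoutMs) {
    const std::uint32_t start = nowMs();
    while (!available()) {
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (static_cast<std::uint32_t>(nowMs() - start) >= timeoutMs) {
            return false;
        }
        idle();
    }
    return true;
}
```

Without the timeout the loop could spin forever on a dead connection, and without the idle call it could spin for seconds without yielding; either case matches the watchdog trigger described at the top of this thread.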

TD-er on 23 Oct 2018

I've been running ESP_Easy_mega-20181023_dev_ESP8266_4096.bin on two NodeMCU devices. One rebooted today, hardware Wdog reset.

Load: 23.20% (LC=9683)
Free Mem: 10520 (7232 - ruleMatch2)
Free Stack: 3536 (640 - LoadTaskSettings)
Boot: Manual reboot (3)
Reset Reason: Hardware Watchdog

[Screenshots: controllers, devices]

thomastech on 25 Oct 2018

This could be a major game changer:
Release mega-20181025:
[WDT] Change yield() to delay(0)

Domosapiens on 25 Oct 2018

@Domosapiens: Thanks for the heads-up. I will flash ESP_Easy_mega-20181025_dev_ESP8266_4096.bin into my two devices.

thomastech on 25 Oct 2018

Yep, we hope to close this one. :+1:

Grovkillen on 25 Oct 2018

@thomastech What uptime did you get on your node?
And please have a look at the controller settings.
Especially those that may increase memory usage, like Max Queue depth and minimum send interval.

TD-er on 25 Oct 2018

@TD-er: The device that rebooted had been manually reset (RST button press). Then about 18 hours later it rebooted due to hardware wdog.

MQTT controller settings:
[Screenshot: controller_1]

thomastech on 25 Oct 2018

Hmm, those are "interesting" settings.
No retries, no queue and "ignore new".
So in other words, a new sample will be tried once and kept in the queue when there is no wifi connection.
Also at first attempt it will be removed from the queue.

I would expect "delete oldest" when using no queue, or else you may prefer an older value when the broker has been unreachable for a while.
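
To show the difference between the two policies mentioned above, here is a host-side C++ sketch of a bounded queue with an "ignore new" and a "delete oldest" mode. It is an illustration, not the ESPEasy controller queue; `BoundedQueue` and `FullPolicy` are hypothetical names, and treating a depth of 0 as 1 is an assumption made to keep the sketch simple.

```cpp
#include <cstddef>
#include <deque>

enum class FullPolicy { IgnoreNew, DeleteOldest };

template <typename T>
class BoundedQueue {
public:
    BoundedQueue(std::size_t maxDepth, FullPolicy policy)
        : maxDepth_(maxDepth == 0 ? 1 : maxDepth), policy_(policy) {}

    // Returns true if `item` ended up in the queue.
    bool push(const T& item) {
        if (items_.size() >= maxDepth_) {
            if (policy_ == FullPolicy::IgnoreNew) {
                return false;     // keep what is queued, drop the new sample
            }
            items_.pop_front();   // make room by dropping the oldest sample
        }
        items_.push_back(item);
        return true;
    }

    // Take the oldest item; returns false if the queue is empty.
    bool pop(T& out) {
        if (items_.empty()) {
            return false;
        }
        out = items_.front();
        items_.pop_front();
        return true;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::size_t maxDepth_;
    FullPolicy policy_;
    std::deque<T> items_;
};
```

With a depth of 1 and the broker unreachable, "ignore new" ends up sending the first value that was queued, while "delete oldest" sends the most recent one.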

TD-er on 25 Oct 2018

> Hmm, those are "interesting" settings.

They were the defaults when I originally installed the Controller. What should all the settings be for a typical OpenHab MQTT controller?

Thomas

thomastech on 25 Oct 2018

You can delete the controller and re-add it.
Then you have the new defaults. (make sure to press save after adding it)

Proper defaults are:

You may lower the minimum send interval if your broker is fast enough.
I run 10 msec here on a raspberry pi 3

TD-er on 25 Oct 2018

Installed Release mega-20181025 yesterday (because it was not available earlier ;)
No conclusions yet, but with the last daily releases, I have seen no memory nor stack problems.

(so great that you just can paste a snapshot!)

Uptime still seems to be a problem.
But ... I'm also hunting for the cause of excessive RCWL-0516 detections (multiple units in the lab interfering?).
As with #1857, I need to use a rule for LCD On/Off, resulting in excessive rule calls. So no conclusions yet.

Domosapiens on 26 Oct 2018

Nice to see the free stack is also increasing a few bytes at a time on new builds :)

TD-er on 26 Oct 2018

> You can delete the controller and re-add it. Then you have the new defaults.

@TD-er: Thanks, MQTT controller has been updated with new defaults.

thomastech on 26 Oct 2018

@Domosapiens I have a set of nodes running for several days now. The build from yesterday evening was running all night.
Your uptime problems must be due to something else. Try fresh hardware and another power supply and no devices/plugins. Please report back if that worked better.

s0170071 on 26 Oct 2018

@s0170071 Thanks for your advice.

I have 4 boxes under test as described here:
https://www.letscontrolit.com/forum/viewtopic.php?f=2&t=5955&sid=db230a574377fbb18394ecdcb9e9b75a
So fresh HW is not an option, power supply is sufficient and clean, and with no devices/plugins they are useless.

Yes, I can understand your positive experience ....
Without hardware there is no reason for the Hardware Watchdog to reboot ;)
But I will flash a few bare Wemos units.

One unit is running mega-2080322 for over 141 hr !!! No reboot. No DS18B20 NAN.

With the other 3, I follow the latest developments.
One unit did 40 hr, the others less.

Still hunting for the dog!

Domosapiens on 26 Oct 2018

Feedback on ESP_Easy_mega-20181025_dev_ESP8266_4096.bin

One NodeMCU still running without reboot. ~28 hrs.
Second NodeMCU rebooted at 27 hrs. Details below.

Load: 25.50% (LC=9670)
Free Mem: 10848 (8144 - sendContentBlocking)
Free Stack: 3584 (720 - LoadTaskSettings)
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog

Thomas

thomastech on 27 Oct 2018

I think this is no longer an issue. If it still is, please open a new issue.

I will close this one now, since its last post was a year ago.

TD-er on 27 Oct 2019
