Hi,
In recent versions I am experiencing a recurring reboot (every other day) with "Hardware watchdog" as the reboot cause.
I have also changed from static IP to DHCP.
How can I find the cause of the Hardware watchdog?
What exactly does it mean?
If some piece of code is running for over 6 seconds without calling any delay or yield, it will trigger a hardware watchdog, which performs a reset.
So there is some code in your setup either waiting for that long, or running an "infinite loop".
Could you please give more info on your setup?
Also do not set the "MessageDelay" too high, nor use "delay" in the rules.
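For illustration, here is a minimal sketch in plain C++ (so it can run off-device) of why long work needs to be cut into slices with a yield in between. `FakeWatchdog`, `kTimeoutMs` and `processInSlices` are hypothetical names made up for this example, and the 6 second figure is simply the one quoted above. On the ESP the "feed" step would be a `delay(0)` or `yield()` call, which hands control back to the system.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the hardware watchdog: it "bites" when more than
// kTimeoutMs of simulated time passes without a feed.
struct FakeWatchdog {
    static constexpr uint32_t kTimeoutMs = 6000;  // figure quoted in this thread
    uint32_t sinceFeedMs = 0;
    bool bitten = false;

    void advance(uint32_t ms) {
        sinceFeedMs += ms;
        if (sinceFeedMs > kTimeoutMs) bitten = true;
    }
    void feed() { sinceFeedMs = 0; }
};

// Runs `items` work items that each take `costMs` of simulated time.
// yieldEvery == 0 means "never yield" (the problematic pattern).
// Returns true if the loop finished without the watchdog biting.
bool processInSlices(FakeWatchdog& wd, uint32_t items, uint32_t costMs,
                     uint32_t yieldEvery) {
    for (uint32_t i = 0; i < items; ++i) {
        wd.advance(costMs);  // do one unit of work
        if (yieldEvery != 0 && (i + 1) % yieldEvery == 0) {
            wd.feed();       // on the ESP: delay(0) / yield()
        }
    }
    return !wd.bitten;
}
```

With 100 items of 100 ms each, yielding every 10 items keeps the longest uninterrupted stretch at 1 second, while the same loop without any yield runs for 10 seconds in one go and trips the simulated watchdog.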
@TD-er: FYI, the latest crash/reboot I mentioned in #1643 was reported in the GUI as _Reset Reason | Hardware Watchdog_. This was the test system that had run for a day, then the WiFi went offline for a couple hours, then the board rebooted on its own. Might be related to this issue, or maybe not.
Here is my config:


See my comment here: https://github.com/letscontrolit/ESPEasy/issues/1659#issuecomment-414047835
Looks like my nodes are also "affected", which is good :)
@TD-er: Yesterday I loaded the ESP_Easy_mega-20180815_test_ESP8266_4096 build on a NodeMCU. It ran great for 18 hours then rebooted. System Info says: Boot : Manual reboot (1), Reset Reason : Hardware Watchdog.
A second duplicate NodeMCU is still running fine. But it has only been running for 17 hrs, so it may face the reboot dance soon.
Would be great to see if it occurs at the same interval, or time of day.
Maybe it is some NTP refresh, or something else, who knows.
@TD-er: That would be great. But so far I have not seen a pattern that indicates it is triggered by run duration or time of day.
My hunch is that it is something related to WiFi, such as a reconnect. But I have tried to torture the WiFi connection (force router offline, create weak RF signal levels) and nothing bad happened. So my hunch seems to be nonsense. Hopefully you find the cause and save us.
It rebooted again after running for about 2 hours. Now reports _Manual reboot (2), Reset Reason : Hardware Watchdog_.
The second duplicate NodeMCU is still running fine. About 19+ hours so far.
I found an issue with the handling of UDP traffic (when C013 is used). That could cause Exception crashes. (not likely a Watchdog reset)
I also added some checks when creating a UDP client for NTP, to see if that may cause infinite waiting.
Those can cause a watchdog reset.
When tested, I will make a commit for it.
@TD-er
I've been testing the new builds as they are released. So far none have solved the Watchdog reset. However, the latest ESP_Easy_mega-20180922_dev_ESP8266_4096.bin build seems better than the last release.
The latest firmware has been successfully running on one of my NodeMCU's (current duration 1 day 22 hrs, no reboots). But the second NodeMCU has rebooted several times.
I have a feeling the rebooting is related to the WiFi access since the latest reboot occurred when I accessed the device from my browser. I also noticed a previous reboot occurred at a time when the good "working" device reported a beacon timeout. But I can't replicate the reboots on demand.
SysInfo reports this:
Boot: Manual reboot (8)
Reset Reason: Hardware Watchdog
Not sure if you are interested in this feedback. But it's been a month+ since my last update and I thought I'd keep you posted on my findings.
I'm always interested in feedback, especially when there seems to be a bit of improvement :)
The reboot issue certainly has been a tough nut to crack.
My winning streak with the "working" device just ended. After 48 hours runtime it rebooted. Sysinfo reports _Manual reboot (1) Hardware Watchdog_
See PR #1834
I just merged a change which moves the address space of the Arduino stack to be on top of the System stack.
The latest core library appears to have shifted the Arduino stack to overlap a bit with the System stack to save about 4k of memory.
But since we're allocating quite a lot on the System stack, this may have led to an increase in reports of HW Watchdog resets.
So please test with the October 1st build, as soon as that one is ready.
I've installed ESP_Easy_mega-20181001_dev_ESP8266_4096.bin on two NodeMCUs. I will report back tomorrow (or sooner) if they experience a reboot.
Some initial comments:
Thanks again for your efforts. Fingers are crossed that this merge helps cure the W-dog reboot issue.
Hmm I had the impression it was loading a bit faster with the changed Arduino stack location.
But maybe getting the free stack for statistics is using quite some resources and is called a bit more in some functions
@TD-er: I don't know why, but the slow page loading has gone away and refresh is OK now. If there are no other reports of it then I'd say it was an isolated issue related to my WiFi router.
Phew, I was afraid you had to report a crash/reboot already.
I was thinking about this....
The webserver has some mechanism to free memory when it is too low.
This freeing memory may take some time which makes the webinterface slow down.
Interesting, I wasn't aware the webserver could do that kind of magic. It may have been involved because during the slow browser response the system load was about 40%. But after the problem went away it settled down to about 30% load.
Now the bad news. One of the test devices rebooted after 10 hours.
System info report:
Boot: Manual reboot (1)
Reset Reason: Software Watchdog
I'll let the other device continue to run until it reboots. Maybe the force is stronger with this one.
It is not that bad, since a software watchdog is different from a hardware watchdog.
The software version means it is still doing stuff
You shouldn't have said that it is not that bad because Murphy is watching us. The second device rebooted at 19 hours due to hardware reset.
System info report:
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog
BTW, the other device rebooted again a few minutes ago. Another Software reset.
System info report:
Boot: Manual reboot (3)
Reset Reason: Software Watchdog
OK, so that's not the fix :(
Can you also/already test using this PR: https://github.com/letscontrolit/ESPEasy/pull/1838 ?
I will later this evening add extra settimeout calls to other WiFi client instances, so it is not complete yet.
No problem, I'll test PR #1838 after it is incorporated in the nightly build.
I'm running both devices on ESP_Easy_mega-20181004_dev_ESP8266_4096.bin. Twelve hours so far, no reboots. Fingers crossed.
I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.
@TD-er do you think this could help?
Somebody just needs to figure out how to conveniently link the .elf file from the build date to the exception decoder...
Another thought on this: If it is possible to catch the exception and save stuff, can't we just catch that exception too and do something with it ? Save some text, send an email, ignore it and carry on ?
@thomastech Just curious, what are the memory and stack stats of that node running the dev build?
@s0170071 That's a very nice library.
I think we could try it, to see what's happening.
Also I could read the last crash log at boot and write it to SPIFFS.
Or just add a 'crash log report' option to send the crash to a server
I think it deserves its own issue. (to make it easier to find)
> Just curious, what are the memory and stack stats of that node running the dev build?
Both NodeMCU devices have the same ESPEasy configuration. Here's a System Info snapshot.
_Device 1 ("production" NodeMCU):_
Load 30.60% (LC=10052)
Free Mem 10328 (8968 - sendContentBlocking)
Free Stack 3584 (144 - LoadTaskSettings)
_Device 2 ("Test" NodeMCU):_
Load: 27.40% (LC=10648)
Free Mem: 9920 (8272 - sendContentBlocking)
Free Stack: 3520 (144 - LoadTaskSettings)
OK, so the Arduino stack is dangerously low at some point.
The lowest in both configurations is 144 bytes.
I will increase the Arduino stack to 5k (default = 4k) just to be sure and then start hunting for stack usage.
Also the heap is used quite intensively, so that one should also be looked at.
> I will increase the Arduino stack to 5k (default = 4k) just to be sure and then start hunting for stack usage.
That sounds like a good thing to try.
One device rebooted after 19 hours.
Sys Info on rebooted unit:
Load: 27.80% (LC=10652)
Free Mem: 11064 (9528 - sendContentBlocking)
Free Stack 3584 (144 - LoadTaskSettings)
Boot: Manual reboot (1)
Reset Reason: Hardware Watchdog
The second test unit is still OK, 22 hrs run time so far.
About the stack:
Normally, if you return from somewhere back to the main loop, the stack should only hold local variables defined by the main loop.
Anything that comes on top, i.e. drains the stack is then either
Did I forget something ?
How can it then be possible to have a leak on the stack? It's just not how it works.
It just can be a local variable that "grows", i.e. a string or a list.
Lists and Strings grow only on the heap, even when declared on the stack.
And I don't think we have a leak, but we are just using too much in some calls.
Either the depth of calls or number of stack allocated variables (or their size) is the problem.
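To make the point about strings concrete, here is a small sketch using `std::string` as a stand-in for the Arduino `String` class (an assumption for the sake of a runnable example; the statement above is that Arduino Strings behave the same way). The object declared on the stack is a fixed-size handle, and only its character buffer, which lives on the heap, grows. `capacityAfterGrowth` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends `n` characters to a string declared on the stack and returns the
// capacity of its buffer afterwards. The string object itself keeps the same
// size the whole time; only the heap-allocated buffer behind it grows.
std::size_t capacityAfterGrowth(std::size_t n) {
    std::string s;                             // fixed-size handle on the stack
    const std::size_t handleSize = sizeof(s);
    for (std::size_t i = 0; i < n; ++i) s += 'x';
    assert(sizeof(s) == handleSize);           // stack footprint did not change
    return s.capacity();                       // heap buffer did
}
```

So a growing string puts pressure on the heap, while the stack is consumed by call depth and by the number and size of local variables, as described above.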
> I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.
Unfortunately one of units just rebooted with Hardware Watchdog when connected to Mikrotik router.
The unit connected to ASUS router is still working after several days.
@TD-er is this fixed already ? Certainly WiFi related ;-)
> I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.
> Unfortunately one of units just rebooted with Hardware Watchdog when connected to Mikrotik router.
> The unit connected to ASUS router is still working after several days.
The same unit that had HW now does not respond to web server (but works "internally"). This unit is connected to Mikrotik.
So the recent changes definitely do not solve the WiFi connection issues.
please see #1640
@s0170071
> TD-er is this fixed already ? Certainly WiFi related ;-)
Forgot it, will add it now to merge with the stack5k fix.
Sorry to say: The dog is still there.
2 units running the daily build:

Wemos D1 mini, running rcwl-0516.


Wemos D1 mini, running a 1602 LCD, rcwl-0516 as Display button and 2# DS18B20.

In the Log:

Seems not correlated to the Watchdog.
I found something very strange this evening.
I have installed the Chrome plugin "JSONView", which will check and format JSON.
This makes it very easy to check if JSON is valid.
I set the log level of the web log to "Debug More" (to do a lot of memory allocations) and then loading the JSON via
On almost every reload I have corrupted JSON.
The output is filled with strings from other webserver requests.
This can also explain some of the issues related to some controllers and perhaps also lead to watchdog resets.
{"Log": {"Entries": [{"timestamp":373907,
"text":"UDP : 2C:3A:E8:32:90:8B,192.168.1.92,7",
"level":4},
{"timestamp":377450,
"text":"SYS : 6.00",
"level":2},
{"timestamp":377453,
"text":"EVENT: uptime#uptime=6.00",
"level":2},
{"timestamp":377455,
"text":"EVENT: uptime#uptime=6.00 Processing time:2 milliSeconds",
"level":3},
{"timestamp":377456,
"text":" Domoticz: Sensortype: 1 idx: 209 values: 6.00",
"level":2},
{"timestamp":377487,
"text":"HTTP : C001 connecting to domoticz:8080",
"level":3},
{"timestamp":377494,
"text":"GET /json.htm?type=command&param=udevice&idx=209&nvalue=0&svalue=6.00&rssi=7 HTTP/1.1
^Host: domoticz:8080
^User-Agent: ESP Eas",
"level":3},
{"timestamp":377494,
"text":"/json.htm?type=command&param=udevice&idx=209&nvalue=0&svalue=6.00&rssi=7",
"level":3},
{"timestamp":377504,
"text":"HTTP/1.1 200 OK
",
"level":4},
{"timestamp":377504,
"text":"HTTP : C001 Success! HTTP/1.1 200 OK
",
"level":3},
{"timestamp":377504,
"text":"HTTP : C001 closing connection",
"level":3},
{"timestamp":378549,
"text":"WD : Uptime 6 ConnectFailures 0 FreeMem 10208",
"level":2},
{"timestamp":378550,
"text":"UDP : Send Sysinfo message",
"level":4}],
"TTL":1000,
"timeHalfBuffer":2708,
"nrEntries":13,
"SettingsWebLogLevel":4,
"logTimeSpan":4643
}}
As can be seen, it outputs a lot of other return values, like HTTP/1.1 200 OK
Nothing wrong with this 200 OK. It is just a log entry from C001 with level 4.
@uzi18
You're right.
I found (and fixed) the bug where the left over \r wasn't replaced when converting it to JSON.
I will have a look at the recent changes that do a lot with memory allocations and also try to find allocations on the stack.
@TD-er I am currently updating the checkram function to monitor stack too. First output did not yield useful results.
If you go hunting for stack usage, how about wrapping function calls in a "SPIFFS_CHECK"-like define that compares the stack pointer before and after the function call? That should reveal stack eaters...
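A rough sketch of what such a wrapper could look like, written in plain C++ so it can run off-device. Everything here is hypothetical: `STACK_CHECK`, `freeStackLowWater()` and the two `...Sim` functions are made-up names, and the "free stack" value is a simulated variable. One detail worth noting: the stack pointer itself is back at its old value once a call has returned, so this sketch compares a "lowest free stack seen so far" reading before and after the call instead. On the device that reading would have to come from whatever stack statistics the core provides.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for a "lowest free stack seen so far" reading.
// On the device this would come from the core's stack statistics instead.
static uint32_t g_freeStackLowWater = 4096;
uint32_t freeStackLowWater() { return g_freeStackLowWater; }

// Last drop recorded by STACK_CHECK, kept so it can be inspected or logged.
static uint32_t g_lastStackDrop = 0;

// Wraps a call and reports how far the low-water mark fell during it.
#define STACK_CHECK(call)                                              \
    do {                                                               \
        const uint32_t before_ = freeStackLowWater();                  \
        call;                                                          \
        const uint32_t after_ = freeStackLowWater();                   \
        g_lastStackDrop = (after_ < before_) ? (before_ - after_) : 0; \
        if (g_lastStackDrop != 0)                                      \
            std::printf("%s used %u more bytes of stack\n", #call,     \
                        static_cast<unsigned>(g_lastStackDrop));       \
    } while (0)

// Simulated "stack eater": pretends the call dipped to 144 bytes free.
void loadTaskSettingsSim() { g_freeStackLowWater = 144; }

// Simulated call that does not set a new low-water mark.
void cheapCallSim() {}
```

Usage would be `STACK_CHECK(loadTaskSettingsSim());`. A low-water mark only reports calls that set a new minimum, so a stack eater that runs after an even hungrier one stays invisible until the mark is reset.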
That's a good one.
I also want to know more about the heap, for example if it gets really fragmented or not. (e.g. what's the max continuous block)
Later, once it is known what requires substantial amounts of stack (e.g. SPIFFS), the available amount can be checked before making the call, and then? Reboot instead of just crash?
Fragmentation may be an issue too, but I have seen crashes with 17k heap free and no activity. So it's not too likely to be the bug we're after.
That last remark sounds reasonable and may save a lot of searching :)
If we had heap fragmentation, what would happen if you cannot allocate new memory ? I would assume the pointer returned by new() to be null. Is this correct ?
If so, a viable test could be to now and then try to allocate some useful heap (leave 3k free for wifi) and see if that worked. And then just free it again.
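A minimal sketch of such a probe, in plain C++ so it can run off-device. `canAllocate` and `largestBlock` are hypothetical helper names. The sketch uses `new (std::nothrow)` so that a failed allocation yields a null pointer rather than an exception, which is the behaviour assumed in the question above.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Tries to allocate `bytes` as one contiguous block and frees it right away.
// Returns true if the allocation succeeded. std::nothrow makes a failed
// allocation return nullptr instead of throwing.
bool canAllocate(std::size_t bytes) {
    char* probe = new (std::nothrow) char[bytes];
    if (probe == nullptr) return false;
    delete[] probe;
    return true;
}

// Finds the largest block, searching downwards from `limit` in steps of
// `step` bytes, that can currently be allocated in one piece.
// Returns 0 if nothing fits or if `step` is 0.
std::size_t largestBlock(std::size_t limit, std::size_t step) {
    if (step == 0) return 0;
    for (std::size_t size = limit; size >= step; size -= step) {
        if (canAllocate(size)) return size;
    }
    return 0;
}
```

On a fragmented heap the value returned by `largestBlock` can be much smaller than the total free memory, which is the difference this test is meant to expose. Leaving a reserve free (the 3k for WiFi mentioned above) is left to the caller, for example by keeping `limit` that much below the reported free heap.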
In the staged version of the core lib there is some development on that: https://github.com/esp8266/Arduino/pull/5090/commits
So I can have a look at that and make a test build which will also show the heap statistics when available.
Just to get some idea on what's happening.
Allocations with new should indeed return a NULL pointer, but String will fail silently.
very good. Seems to be an issue then.
About the strings failing silently: if you allocate / new() some memory block, free() it again you should be able to string.reserve() it afterwards without trouble, right ?
Sounds like a wrapper function #define for String.reserve()
std::string is traditionally (in the STL library for C++) a standard container, which does the allocation/deallocation for you. The Arduino String class is loosely based on the same principle, only with some extras and also some other functions missing.
Maybe we can check for the actual capacity of the String after calling a reserve. Not sure yet if those are publicly accessible. But you shouldn't do new and free/delete on the members of String or else members will get out of sync.
No free on the string. I meant to check if there is heap available, try to allocate it, free it and then reserve the string buffer.
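A sketch of the "reserve, then verify" idea, using `std::string` as a runnable stand-in for the Arduino `String` class. `reserveChecked` is a hypothetical helper name. `std::string` reports failure by throwing, so the sketch catches that and additionally checks the real capacity. As far as I know the Arduino `String::reserve()` reports success through its return value, which would be the thing to check there, but that is worth verifying against the core version in use.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>

// Reserves `chars` characters in `s` and reports whether the buffer is
// really there afterwards, instead of assuming the reserve worked.
bool reserveChecked(std::string& s, std::size_t chars) {
    try {
        s.reserve(chars);
    } catch (const std::bad_alloc&) {
        return false;              // heap could not provide the block
    } catch (const std::length_error&) {
        return false;              // request larger than the type allows
    }
    return s.capacity() >= chars;  // verify the actual capacity
}
```

A caller can then skip building the string, or bail out of the request, when `reserveChecked` returns false, rather than continuing with a string that silently stayed too small.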
@TD-er maybe it is possible to just allocate/reserve some big buffer 200+ chars and use it as static place to manipulate with strings?
Then you have to implement a lot of operations yourself.
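For comparison, a small sketch of the static-buffer approach in plain C++. `g_scratch` and `formatReading` are hypothetical names, and the 200 byte size is just the figure suggested above. It shows both the benefit (no heap allocation at all) and part of the cost: each operation, here a single formatting function, has to be written and bounds-checked by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// One statically allocated scratch buffer, reused for every message.
// 200 bytes is simply the size suggested in the comment above.
static char g_scratch[200];

// Formats "name=value" into the shared buffer without touching the heap.
// Returns nullptr if the text did not fit, so truncation is never silent.
const char* formatReading(const char* name, float value) {
    const int n = std::snprintf(g_scratch, sizeof(g_scratch), "%s=%.2f",
                                name, value);
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof(g_scratch)) {
        return nullptr;
    }
    return g_scratch;
}
```

Because every caller shares the same buffer, the result has to be consumed before the next call overwrites it.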
There is no mention of MQTT in this thread as far as
Yesterday I added a delay(1) to the readByte part of MQTT client PubSubClient.
Can you please test if this is now still an issue?
And if another plugin is active, please mention that one too.
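To show the idea behind that change, here is a sketch of a bounded wait loop that pauses on every iteration. This is not the PubSubClient code: `waitForByte` is a hypothetical function, and the clock and pause are passed in as callbacks so the sketch runs off-device. On the ESP they would be `millis()` and `delay(1)`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Waits until `available()` reports data or `timeoutMs` has elapsed.
// `nowMs` and `pause` are injected so the sketch runs off-device; on the ESP
// they would be millis() and delay(1).
bool waitForByte(const std::function<bool()>& available,
                 const std::function<uint32_t()>& nowMs,
                 const std::function<void()>& pause,
                 uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    while (!available()) {
        // Unsigned subtraction stays correct when the counter wraps around.
        if (nowMs() - start >= timeoutMs) return false;  // no endless wait
        pause();                                         // let the system run
    }
    return true;
}
```

The timeout guarantees the loop ends even if the broker never answers, and the pause in each iteration means the wait never becomes a long stretch of code without a delay.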
I've been running ESP_Easy_mega-20181023_dev_ESP8266_4096.bin on two NodeMCU devices. One rebooted today, hardware Wdog reset.
Load: 23.20% (LC=9683)
Free Mem: 10520 (7232 - ruleMatch2)
Free Stack: 3536 (640 - LoadTaskSettings)
Boot: Manual reboot (3)
Reset Reason: Hardware Watchdog


This could be a major game changer:
Release mega-20181025:
[WDT] Change yield() to delay(0)
@Domosapiens: Thanks for the heads-up. I will flash ESP_Easy_mega-20181025_dev_ESP8266_4096.bin into my two devices.
Yep, we hope to close this one. :+1:
@thomastech What uptime did you get on your node?
And please have a look at the controller settings.
Especially those that may increase memory usage, like Max Queue depth and minimum send interval.
@TD-er: The device that rebooted had been manually reset (RST button press). Then about 18 hours later it rebooted due to hardware wdog.
MQTT controller settings:

Hmm, those are "interesting" settings.
No retries, no queue and "ignore new".
So in other words, a new sample will be tried once and kept in the queue when there is no wifi connection.
Also at first attempt it will be removed from the queue.
I would expect "delete oldest" when using no queue, or else you may prefer an older value when the broker has been unreachable for a while.
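The difference between the two overflow policies can be sketched with a tiny bounded queue. This is an illustration only: `SampleQueue` and `FullPolicy` are hypothetical names, the real controller queue in ESPEasy may be implemented differently, and treating "no queue" as a depth of 1 is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

enum class FullPolicy { IgnoreNew, DeleteOldest };

// Minimal bounded queue of samples with a selectable overflow policy.
class SampleQueue {
public:
    SampleQueue(std::size_t maxDepth, FullPolicy policy)
        : maxDepth_(maxDepth), policy_(policy) {}

    // Returns true if `value` ended up in the queue.
    bool push(float value) {
        if (maxDepth_ == 0) return false;
        if (items_.size() >= maxDepth_) {
            if (policy_ == FullPolicy::IgnoreNew) return false;  // keep old data
            items_.pop_front();                                  // drop oldest
        }
        items_.push_back(value);
        return true;
    }

    float front() const { return items_.front(); }
    std::size_t size() const { return items_.size(); }

private:
    std::size_t maxDepth_;
    FullPolicy policy_;
    std::deque<float> items_;
};
```

With a depth of 1, "ignore new" keeps holding the first sample taken while the broker was unreachable, whereas "delete oldest" always holds the most recent one.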
> Hmm, those are "interesting" settings.
They were the defaults when I originally installed the Controller. What should all the settings be for a typical OpenHab MQTT controller?
You can delete the controller and re-add it.
Then you have the new defaults. (make sure to press save after adding it)
Proper defaults are:

You may lower the minimum send interval if your broker is fast enough.
I run 10 msec here on a Raspberry Pi 3.
Installed Release mega-20181025 yesterday (because it was not available earlier ;)
No conclusions yet, but with the last daily releases, I have seen no memory nor stack problems.


(so great that you just can paste a snapshot!)

Uptime still seems to be a problem.
But ... I'm also hunting for the cause of excessive RCWL-0516 detections (multiple units in the lab interfering?).
As with #1857, I need to use a rule for LCD On/Off, resulting in excessive rule calls. So no conclusions yet.
Nice to see the free stack is also increasing a few bytes at a time on new builds :)
> You can delete the controller and re-add it. Then you have the new defaults.
@TD-er: Thanks, MQTT controller has been updated with new defaults.
@Domosapiens I have a set of nodes running for several days now. The build from yesterday evening was running all night.
Your uptime problems must be due to something else. Try fresh hardware, another power supply, and no devices/plugins. Please report back if that worked better.
@s0170071 Thanks for your advice.
I have 4 boxes under test as described here:
https://www.letscontrolit.com/forum/viewtopic.php?f=2&t=5955&sid=db230a574377fbb18394ecdcb9e9b75a
So fresh HW is not an option, power supply is sufficient and clean, and with no devices/plugins they are useless.
Yes, I can understand your positive experience ....
Without hardware there is no reason for the Hardware Watchdog to reboot ;)
But I will flash a few bare Wemos units.
One unit is running mega-2080322 for over 141 hr !!! No reboot. No DS18B20 NAN.
With the other 3, I follow the latest developments.
One unit did 40 hr, the others less.
Still hunting for the dog!
Feedback on ESP_Easy_mega-20181025_dev_ESP8266_4096.bin
One NodeMCU still running without reboot. ~28 hrs.
Second NodeMCU rebooted at 27 hrs. Details below.
Load: 25.50% (LC=9670)
Free Mem: 10848 (8144 - sendContentBlocking)
Free Stack: 3584 (720 - LoadTaskSettings)
Boot: Manual reboot (2)
Reset Reason: Hardware Watchdog
Thomas
I think this is no longer an issue. If it still is, please open a new issue.
I will close this one now, since its last post was a year ago.