Arduino: Random WDT after periodically connecting and disconnecting Wifi.

Created on 1 Jun 2019  ·  98 Comments  ·  Source: esp8266/Arduino

Basic Infos

  • [X] This issue complies with the issue POLICY doc.
  • [X] I have read the documentation at readthedocs and the issue is not addressed there.
  • [X] I have tested that the issue is present in current master branch (aka latest git).
  • [X] I have searched the issue tracker for a similar issue.
  • [n/a ] If there is a stack dump, I have decoded it.
  • [X] I have filled out all fields below.

Platform

  • Hardware: NodeMCU 1.0 or ESP-12E
  • Core Version: 2.5.1, git 2.6.0-dev #455583b from 5-30-2019
  • Development Env: Arduino IDE
  • Operating System: Windows

Settings in IDE

  • Module: NodeMCU or Generic ESP8266 Module
  • Flash Mode: qio
  • Flash Size: 4MB (1MB SPIFFS)
  • lwip Variant: v2 Lower Memory
  • Reset Method: nodemcu
  • Flash Frequency: 40MHz
  • CPU Frequency: 80MHz or 160MHz
  • Upload Using: SERIAL
  • Upload Speed: 921600
  • SSL Support: either all or basic SSL ciphers

Problem Description

I am seeing unstable behavior on the ESP8266 when I turn the WiFi on and off frequently.
The example sketch is taken from my code and greatly simplified. The point is to run the ESP8266 with WiFi turned off (low power consumption), accumulate data, and send it to a server periodically. Here the period is 30 seconds; in reality it is much longer, of course.
With axTLS the sketch easily runs overnight (e.g. 12 hours without a reset; I didn't test longer). With BearSSL it generally does not survive 15 minutes.
If you uncomment the prints in loop(), you can see that the wdt fires outside the sketch: the last character printed is always '-'. In that case use PuTTY as the serial monitor; the Arduino serial monitor does not interpret the BS character, so you will get a full screen of dots and hyphens :)
I tried adding delays in the wifiDisconnect function, but nothing changed.

I know your time is precious, so I tried to be as specific as I could. I am lost in debugging a wdt outside the loop function, though.

Do you have any idea what to test? I hope the sketch is ok.
The same WiFi-disconnect mechanism has been running for months without an issue in my devices based on v2.5.0 dated 8-2018 (commit 641c5cd) with axTLS.
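
For reference, the wake-up test in loop() (millis() - lastSleep > MILLIS_TO_WAKE_UP) depends on unsigned 32-bit subtraction, which keeps giving the right answer when millis() rolls over after about 49.7 days. Below is a minimal host-side sketch of that check. The helper name intervalElapsed is hypothetical and is not part of the sketch or the core.

```cpp
#include <cstdint>

// Hypothetical helper (not from the sketch): true once more than
// `interval` milliseconds have passed since `last`.
// Unsigned subtraction wraps modulo 2^32, so the difference is still the
// real elapsed time when `now` has rolled over past zero and `last` has not.
constexpr bool intervalElapsed(std::uint32_t now, std::uint32_t last,
                               std::uint32_t interval) {
    return static_cast<std::uint32_t>(now - last) > interval;
}
```

With now = 5 and last = 0xFFFFFFF0 the difference is 21, so a 20 ms interval counts as elapsed even though now is numerically smaller than last.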

MCVE Sketch

//#define USING_AXTLS

#include <ESP8266WiFi.h>
#ifdef USING_AXTLS
    #include "WiFiClientSecureAxTLS.h"
    using namespace axTLS;
#endif

#define SSID "***"
#define PASSWORD "***"

static uint32_t MILLIS_TO_WAKE_UP = 30*1000;  // wake up after 30 seconds
const char* SERVERNAME = "www.example.com";
uint16_t PORT = 443;

#ifdef USING_AXTLS
    axTLS::WiFiClientSecure wifiClient;
#else
    BearSSL::WiFiClientSecure wifiClient;
#endif

uint32_t lastSleep = 0;  // millis of the last time when forced to sleep

void setup() {
    WiFi.setAutoConnect(false);
    Serial.begin(115200);

    pinMode(2, OUTPUT);
    digitalWrite(2, LOW);  // Start indicator

    // Wait for GPIO0 down as a start condition (we want to stop here after a wdt reset)
/*    Serial.println(F("\nConnect GPIO0 to GND to start"));
    pinMode(0, INPUT_PULLUP);
    while (digitalRead(0) == HIGH)
        delay(100);
*/
    pinMode(2, INPUT);  // Back to default

    Serial.print(ESP.getSdkVersion());
#ifdef USING_AXTLS
    Serial.println(F(", axTLS"));
#else
    Serial.println(F(", BearSSL"));
#endif

    wifiConnect();
    wifiSend(3);
    wifiDisconnect();
}

void loop() {
    //Serial.print("\x08.");

    if (millis() - lastSleep > MILLIS_TO_WAKE_UP) {
        wifiConnect();
        wifiSend(2);
        wifiDisconnect();
    }

    //Serial.print("\x08-");
}

void wifiConnect(void) {
    WiFi.mode(WIFI_STA);
    delay(100);

    WiFi.begin(SSID, PASSWORD);
    Serial.printf_P(PSTR("Connecting to %s "), SSID);

    while (WiFi.status() == WL_DISCONNECTED) {
        Serial.write('.');
        delay(500);
    }
    Serial.println();

    if (WiFi.status() == WL_CONNECTED) {
        Serial.printf_P(PSTR("WiFi connected (RSSI %d), IP address: %s, "), WiFi.RSSI(), WiFi.localIP().toString().c_str());
        Serial.printf_P(PSTR("mem: %d\r\n"), ESP.getFreeHeap());

        if (time(nullptr) < 100000000)
            readTime();
    }
}

void wifiDisconnect(void) {
    // Disconnecting wifi
    Serial.print(F("Disconnecting client"));
    wifiClient.stop();

    Serial.print(F(", wifi"));
    WiFi.disconnect();
    WiFi.mode(WIFI_OFF);
    delay(100);  // FIXME

    Serial.println(F(", sleeping"));
    WiFi.forceSleepBegin();  // turn off ESP8266 RF
    delay(100);  // FIXME

    lastSleep = millis();
}

boolean wifiSend(int8_t status) {
    // Check the wifi
    if (WiFi.status() != WL_CONNECTED) {
        Serial.println(F("[WiFi] Not connected to AP"));
        return false;
    }

#ifndef USING_AXTLS
    wifiClient.setInsecure();  // for testing ok
#endif

    if (wifiClient.connect(SERVERNAME, PORT)) {
        Serial.println(F("[WiFi] Connected to server"));
    }
    else {
        Serial.println(F("[WiFi] Connection to server failed"));
        return false;
    }

    if (wifiClient.connected()) {
        // GET /test HTTP/1.1
        wifiClient.printf_P(PSTR("GET /test HTTP/1.1\nHost: %s\n\n"), SERVERNAME);

        Serial.print(F("[WiFi] Data sent, waiting for response ... "));

        // Wait max 5 seconds for server response
        long m = millis();
        while (millis() - m < 5000 && !wifiClient.available()) {
            delay(100);
        }

        // Read the response header
        Serial.println();
        while (wifiClient.connected()) {
            String line = wifiClient.readStringUntil('\n');
            Serial.println(line);
            if (line == "\r") {
//                Serial.println("headers received");
                break;
            }
            yield();
        }

        // Read and discard the data
        while (wifiClient.available() && wifiClient.connected()) {
            String line = wifiClient.readStringUntil('\n');
            yield();
        }
    }
    return true;
}

void readTime(void) {
    if (WiFi.status() != WL_CONNECTED) {
        return;
    }

    Serial.print(F("Setting time using SNTP "));

    configTime(1 * 3600, 0, "tik.cesnet.cz", "pool.ntp.org");

    // Read time, wait 5 seconds
    uint32_t m = millis();
    time_t now = time(nullptr);
    while (now < 100000000 && millis() - m < 5000) {
        delay(100);
        Serial.write('.');
        now = time(nullptr);
    }
    Serial.println();

    if (now < 100000000) {
        Serial.println(F("Time was not set."));
    }
    else {
        Serial.print(F("Current time: "));
        Serial.println(ctime(&now));
    }
}

Debug Messages

Note that the 404 response is OK; we are using the www.example.com server for testing.

Connecting to BILNet ..........
WiFi connected (RSSI -52), IP address: 192.168.1.102, mem: 44784
[WiFi] Connected to server
[WiFi] Data sent, waiting for response ... 
HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sat, 01 Jun 2019 09:14:29 GMT
Expires: Sat, 08 Jun 2019 09:14:29 GMT
Last-Modified: Tue, 28 May 2019 06:46:04 GMT
Server: ECS (dcb/7EA6)
Vary: Accept-Encoding
X-Cache: 404-HIT
Content-Length: 1270

Disconnecting client, wifi, sleeping
 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
vffffffff
~ld
2.2.1(cfd48f3), BearSSL
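
For readers decoding the boot banner above: the number after "rst cause:" is the boot ROM's reset reason, and 4 matches the "wdt reset" line that follows it. Below is a minimal host-side sketch of pulling that code out of a captured serial log. It assumes the code table from the ESP8266 Arduino core's boot-message documentation (0 unknown, 1 normal boot, 2 reset pin, 3 software reset, 4 watchdog reset), so check it against your core version. The enum and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Assumed mapping of the boot ROM "rst cause" codes (verify for your core).
enum class BootResetCause {
    Unknown = 0, NormalBoot = 1, ResetPin = 2, SoftwareReset = 3, Watchdog = 4
};

// Hypothetical helper: extracts the code after "rst cause:" from one line
// of the boot banner. Returns Unknown if the tag or a known code is missing.
BootResetCause parseBootResetCause(const char* line) {
    assert(line != nullptr);
    const char* tag = std::strstr(line, "rst cause:");
    if (tag == nullptr) {
        return BootResetCause::Unknown;
    }
    int code = 0;
    if (std::sscanf(tag, "rst cause:%d", &code) != 1) {
        return BootResetCause::Unknown;
    }
    if (code < 1 || code > 4) {
        return BootResetCause::Unknown;
    }
    return static_cast<BootResetCause>(code);
}
```

Feeding each captured line through this helper makes it easy to count watchdog resets separately from resets caused by firmware upload or the reset pin.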

waiting for feedback

All 98 comments

The first non-wdt crash with a stack trace appeared just now. Perhaps it will be useful; it crashes in malloc:

Exception (2):
epc1=0x3ffeec3c epc2=0x00000000 epc3=0x00000000 excvaddr=0x3ffeec3c depc=0x00000000

Exception 2: InstructionFetchError: Processor internal physical address or data error during instruction fetch
Decoding 85 results
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x4010458c: lmacProcessAckTimeout at ?? line ?
0x4020f763: new_linkoutput at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/glue-lwip/lwip-git.c line 235
0x4020fb54: ethernet_output at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/netif/ethernet.c line 312
0x40105159: ets_timer_disarm at ?? line ?
0x402054c3: loop_task(ETSEventTag*) at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266/core_esp8266_main.cpp line 140
0x40216d7f: etharp_raw at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 1161
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x40211264: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 233
0x40216f7a: etharp_request at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 1202
0x4020fba8: do_memp_malloc_pool at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/memp.c line 254
0x40216fe4: etharp_tmr at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 203
0x40211264: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 233
0x40211264: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 233
0x40211274: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 243
0x4020fc0e: memp_free at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/memp.c line 447
0x4021140c: sys_check_timeouts at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 390
0x4023b434: ets_timer_handler_isr at ?? line ?
0x4023b441: ets_timer_handler_isr at ?? line ?
0x4023b486: ets_timer_handler_isr at ?? line ?
0x402054c3: loop_task(ETSEventTag*) at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266/core_esp8266_main.cpp line 140
0x40104980: call_user_start_local at ?? line ?
0x40104986: call_user_start_local at ?? line ?
0x4010000d: call_user_start at ?? line ?
0x4024a9e8: node_remove_from_list at ?? line ?
0x401026ee: wDev_ProcessFiq at ?? line ?
0x4021b5c7: sha2small_out at /home/earle/Arduino/hardware/esp8266com/esp8266/tools/sdk/ssl/bearssl/src/hash/sha2small.c line 249
0x4024a8b0: node_remove_from_list at ?? line ?
0x4021b608: br_sha256_out at /home/earle/Arduino/hardware/esp8266com/esp8266/tools/sdk/ssl/bearssl/src/hash/sha2small.c line 305
0x4024a8b0: node_remove_from_list at ?? line ?
0x40221f55: br_hmac_out at /home/earle/Arduino/hardware/esp8266com/esp8266/tools/sdk/ssl/bearssl/src/mac/hmac.c line 120
0x4024a8b0: node_remove_from_list at ?? line ?
0x4023545b: pp_attach at ?? line ?
0x402354aa: pp_attach at ?? line ?
0x402355b6: pp_attach at ?? line ?
0x4023545b: pp_attach at ?? line ?
0x402354aa: pp_attach at ?? line ?
0x402355b6: pp_attach at ?? line ?
0x40101482: pp_post at ?? line ?
0x40234567: ppTxPkt at ?? line ?
0x40227947: ieee80211_output_pbuf at ?? line ?
0x40104eff: wdt_feed at ?? line ?
0x40101482: pp_post at ?? line ?
0x40104877: lmacRxDone at ?? line ?
0x40102199: trc_NeedRTS at ?? line ?
0x40101482: pp_post at ?? line ?
0x4010236a: trc_NeedRTS at ?? line ?
0x4020f763: new_linkoutput at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/glue-lwip/lwip-git.c line 235
0x401027aa: wDev_ProcessFiq at ?? line ?
0x40102544: wDev_ProcessFiq at ?? line ?
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x40101482: pp_post at ?? line ?
0x40100daf: pp_soft_wdt_feed_local at ?? line ?
0x4010065c: _umm_free at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1304
0x40102522: wDev_MacTim1Arm at ?? line ?
0x40102586: wDev_ProcessFiq at ?? line ?
0x4020f4e1: glue2esp_linkoutput at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/glue-esp/lwip-esp.c line 299
0x402133f6: pbuf_free_LWIP2 at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c line 786 (discriminator 1)
0x40102544: wDev_ProcessFiq at ?? line ?
0x402187d4: mem_malloc at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c line 210
0x40249f70: node_remove_from_list at ?? line ?
0x40105159: ets_timer_disarm at ?? line ?
0x40105159: ets_timer_disarm at ?? line ?
0x4010065c: _umm_free at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1304
0x40100aa4: free at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1764
0x401001c0: millis at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266/core_esp8266_wiring.cpp line 186
0x40205540: esp_yield at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266/core_esp8266_main.cpp line 97
0x40205561: esp_schedule at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266/core_esp8266_main.cpp line 102
0x402055f9: loop_wrapper() at D:\Programy\arduino\hardware\esp8266com-git_version\esp8266\cores\esp8266/core_esp8266_main.cpp line 134

The info below might be of interest:

With what might be a similar issue, my sketches get Hardware Watchdog reboots. If I add a "delay(2000)" immediately after the "WiFi.mode(WIFI_OFF)", the Hardware Watchdog always triggers during this delay. Without this "delay(2000)", the Hardware Watchdog triggers a dozen or more program steps after the "WiFi.mode(WIFI_OFF)" and the subsequent "WiFi.forceSleepBegin()".

I reconnect over SSL every 8 minutes or so (my SSL data exchanges always succeed and look the same as far as I can tell), and the crashes are intermittent (sometimes 3 in an hour, sometimes none for 24 hours).

My free heap does not appear to drop below approx. 21000 bytes, and my free stack does not appear to drop below 1360 bytes.

[I am using an ESP8266 D1 mini with the "ESP8266 Arduino Github software" (as at 31May19) with "BearSSL" and "lwIP variant v2 lower memory".]

I have not yet found a solution!

@Rob58329, thanks for the information.
Have you tried using axTLS? My devices with the old core and axTLS work quite reliably (they connect to wifi once every 10 minutes). But merely switching from axTLS to BearSSL triggers the issue.

@JiriBilek: My original ESP8266 sketch used the “ESP8266 Arduino Github Software core” from about 18 months ago (i.e. with axTLS and lwIP v1.4, or perhaps earlier), and my ESP8266 units were running for more than 3 months without issue. Annoyingly, I have not yet been able to work out exactly which version date of the Github software I was using 18 months ago...

But if the same sketch is compiled with the current (v31May19) “ESP8266 Arduino Github Software” (or in fact any of the Github versions from the last couple of months), it produces the intermittent Hardware Watchdog crashes detailed above. I still get the same crashes with the current Github software (v31May19) whether I use axTLS or BearSSL (for which the sketch needed slight modification), and with lwIP v1.4 or lwIP v2. I note that the current Github core compiles my sketches to use a bit more RAM when running (vs. 18 months ago), but I currently don't think this is the issue.

What's interesting here is that you're turning off WiFi without ever closing the SSL connection.

Would you be able to check if the same thing happens if you add a wifiClient.close() at the end of your send routine?

It's possible that on the next wake-up, or when the WiFi power-off event happens, some part of LWIP closes everything while the client still holds a pointer to something that's no longer valid. Then, when you reconnect, the first thing the client does is try to ::close() itself using this pointer, and boom: memory corruption (== crash, WDT, whatever).

Thanks for the idea, but I am getting a compilation error: 'class BearSSL::WiFiClientSecure' has no member named 'close'
I can't find a close() function in WiFiClientSecure, WiFiClient or Client either.

Oops, it's stop(), not close(), that I meant to type there.

I see. I am stopping the client in wifiDisconnect(). It is called immediately after wifiSend().

There goes the easy bit. :( Can you make the WiFiClientSecure a local variable in your send routine? If that works reliably then it means there is a data lifetime issue in the client that needs looking at. If it fails, too, then there's something very strange going on.
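
The lifetime idea can be sketched on a host machine without any ESP8266 code. The types below (EventLog, MockRadio, MockClient) are hypothetical stand-ins that only record call order. They are not the real WiFi or WiFiClientSecure classes, and the sketch only illustrates scoping, under the assumption that the client must be gone before the radio is powered down.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins, not ESP8266 classes: they only log what happens.
struct EventLog {
    std::vector<std::string> events;
};

struct MockRadio {
    EventLog& log;
    void on()  { log.events.push_back("radio on"); }
    void off() { log.events.push_back("radio off"); }
};

class MockClient {
public:
    explicit MockClient(EventLog& log) : log_(log) {
        log_.events.push_back("client created");
    }
    ~MockClient() { log_.events.push_back("client destroyed"); }
    void stop() { log_.events.push_back("client stopped"); }

private:
    EventLog& log_;
};

// One send cycle: the client exists only inside the inner scope, so it is
// stopped and destroyed before the radio is switched off.
void sendCycle(MockRadio& radio, EventLog& log) {
    radio.on();
    {
        MockClient client(log);
        // ... connect and exchange data here ...
        client.stop();
    }  // client is destroyed at this brace
    radio.off();
}
```

After one call to sendCycle the log reads: radio on, client created, client stopped, client destroyed, radio off. A global client, by contrast, outlives every power-down of the radio.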

I replaced wifiClient with a temporary object that was created with new at the end of wifiConnect() and deleted immediately after its stop() call. So there was zero possibility of the object using something invalidated after WiFi was turned off.

It still crashed. I don't think this is related in any way to the WiFiClientSecure. I think it's something in the SDK blob at this point going weird.

@d-a-v, @devyte, anything suspicious in the code here?

I replaced WiFiClientSecure with plain WiFiClient.

Even plain WiFiClient has the same crash in the waiting portion, so it's not HTTPS related, even. Either WiFi powerdown/up causes the crashes or LWIP does (maybe trying something that's not valid to close a TCP connection buffer or something?)

````
Disconnecting client, wifi, sleeping
Connecting to NOBABIES ...........
WiFi connected (RSSI -47), IP address: 192.168.1.154, mem: 50848
[WiFi] Connected to server
[WiFi] Data sent, waiting for response ...
HTTP/1.1 404 Not Found

Server: nginx

Date: Wed, 05 Jun 2019 23:31:31 GMT

Content-Type: text/html

Content-Length: 162

Connection: keep-alive

Disconnecting client, wifi, sleeping

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v7d5343a6
````

Using the base WiFiClient and only doing the WiFiClient.connect() and WiFiClient.stop() in the loop (all data transmission is removed) also results in occasional WDTs.
````
...
WiFi connected (RSSI -43), IP address: 192.168.1.154, mem: 50024
[WiFi] Connected to server
Disconnecting client, wifi, sleeping
Connecting to NOBABIES .......
WiFi connected (RSSI -44), IP address: 192.168.1.154, mem: 50352
[WiFi] Connected to server
Disconnecting client, wifi, sleeping
Connecting to NOBABIES .......
WiFi connected (RSSI -45), IP address: 192.168.1.154, mem: 50024
[WiFi] Connected to server
Disconnecting client, wifi, sleeping

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v7d5343a6
~ld
````

Thanks. Not being related to SSL makes it even more annoying.
Is there a chance to trace the wdt? I mean a stack dump or any more information?

It's not in LWIP, either. It's in the core blob. I commented out everything related to the WiFiClient:
````
void wifiDisconnect(void) {
    // Disconnecting wifi
    Serial.print(F("Disconnecting client"));
    //wifiClient->stop();
    delete wifiClient;

    Serial.print(F(", wifi"));
    WiFi.disconnect();
    WiFi.mode(WIFI_OFF);
    delay(100);  // FIXME

    Serial.println(F(", sleeping"));
    WiFi.forceSleepBegin();  // turn off ESP8266 RF
    delay(100);  // FIXME

    lastSleep = millis();
}

boolean wifiSend(int8_t status) {
    // Check the wifi
    if (WiFi.status() != WL_CONNECTED) {
        Serial.println(F("[WiFi] Not connected to AP"));
        return false;
    }

#if 0
#ifndef USING_AXTLS
    //wifiClient->setInsecure();  // for testing ok
#endif

    if (wifiClient->connect(SERVERNAME, PORT)) {
        Serial.println(F("[WiFi] Connected to server"));
    }
    ...
    }
#endif
    return true;
}
````

I still got a WDT overnight:
````
g to NOBABIES .......
WiFi connected (RSSI -49), IP address: 192.168.1.154, mem: 51464
Disconnecting client, wifi, sleeping
Connecting to NOBABIES .......
WiFi connected (RSSI -49), IP address: 192.168.1.154, mem: 51464
Disconnecting client, wifi, sleeping

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v7d5343a6
~ld
2.2.1(cfd48f3), BearSSL
Connecting to NOBABIES ......
````

There was another issue with the same problem, but I can't seem to find it now.

WDT's handled by the RTC block, so I don't think you can get any info on where it happened. To the CPU it looks like a simple reset.

Just to be clear, the actual guts of WiFi power off/power on stuff is in the Espressif binary-only blob. So there is no way to debug or anything we can do here about it.

@earlephilhower, thanks a lot for your effort. I changed the title of this issue because the old one had become misleading.

Can you try changing
    Serial.println(F(", sleeping"));
    WiFi.forceSleepBegin(); // turn off ESP8266 RF
    delay(100); // FIXME
to
    Serial.println(F(", sleeping"));
    delay(100); // FIXME
    WiFi.forceSleepBegin(); // turn off ESP8266 RF

and see if that fixes it?
I suspect that the Serial.print is still executing in the background while you turn off the RF. The delay is there to ensure the Serial.print has finished before the RF is turned off.
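
As a rough check on how long pending serial output can keep the UART busy, the drain time can be estimated from the baud rate. Below is a minimal host-side sketch, assuming 8N1 framing (10 bits on the wire per byte) and, for the worked number, a 128-byte transmit FIFO. Both are assumptions here and are not taken from this thread, and uartDrainMicros is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: microseconds needed to shift `bytes` out of a UART
// running at `baud`, assuming 8N1 framing (1 start + 8 data + 1 stop bit).
// The result is rounded up to the next whole microsecond.
std::uint64_t uartDrainMicros(std::uint64_t bytes, std::uint32_t baud) {
    assert(baud > 0);
    const std::uint64_t bits = bytes * 10u;
    return (bits * 1000000u + baud - 1u) / baud;
}
```

At 115200 baud a single byte takes about 87 µs and a full 128-byte buffer about 11.1 ms, so a 100 ms delay leaves a wide margin for the short messages printed here. If the only goal is to let the output finish, Arduino's Serial.flush() waits for pending transmit data; whether that alone changes the watchdog behavior is untested here.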

Yesterday, I compiled the original test sketch with an old core (2.5.0 dev, commit https://github.com/esp8266/Arduino/commit/641c5cd) and BearSSL.
It ran for more than 18 hours without a problem. This confirms that the problem is not in the SSL library.

@tbdltee: I tried what you suggested. Unfortunately, the wdts and even exceptions are not gone, although they appear much less frequently (approximately once per 2 hours).
The test setup was: the git version of the library, with BearSSL used to open a connection and send a GET request.

The exception stack I received:

Exception 2: InstructionFetchError: Processor internal physical address or data error during instruction fetch
Decoding 103 results
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x4010458c: lmacProcessAckTimeout at ?? line ?
0x4020fa23: new_linkoutput at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/glue-lwip/lwip-git.c line 235
0x4020fe14: ethernet_output at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/netif/ethernet.c line 312
0x40205593: loop_task(ETSEventTag*) at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_main.cpp line 140
0x40205670: loop_wrapper() at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_main.cpp line 124
0x4021703b: etharp_raw at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 1161
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x40211524: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 233
0x40217236: etharp_request at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 1202
0x4020fe68: do_memp_malloc_pool at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/memp.c line 254
0x402172a0: etharp_tmr at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 203
0x40211524: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 233
0x40211524: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 233
0x40211534: lwip_cyclic_timer at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 243
0x4020fece: memp_free at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/memp.c line 447
0x402116cc: sys_check_timeouts at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/timeouts.c line 390
0x40245974: ets_timer_handler_isr at ?? line ?
0x40245981: ets_timer_handler_isr at ?? line ?
0x402459c6: ets_timer_handler_isr at ?? line ?
0x40205593: loop_task(ETSEventTag*) at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_main.cpp line 140
0x40104980: call_user_start_local at ?? line ?
0x40104986: call_user_start_local at ?? line ?
0x4010000d: call_user_start at ?? line ?
0x40101482: pp_post at ?? line ?
0x40104877: lmacRxDone at ?? line ?
0x40102199: trc_NeedRTS at ?? line ?
0x4010236a: trc_NeedRTS at ?? line ?
0x401027aa: wDev_ProcessFiq at ?? line ?
0x40102544: wDev_ProcessFiq at ?? line ?
0x4021bc2c: br_sha2small_round at /home/earle/Arduino/hardware/esp8266com/esp8266/tools/sdk/ssl/bearssl/src/hash/sha2small.c line 101 (discriminator 2)
0x4021ba9c: br_sha2small_round at /home/earle/Arduino/hardware/esp8266com/esp8266/tools/sdk/ssl/bearssl/src/hash/sha2small.c line 85
0x40254f08: node_remove_from_list at ?? line ?
0x401037c5: lmacProcessTXStartData at ?? line ?
0x401037c2: lmacProcessTXStartData at ?? line ?
0x40101482: pp_post at ?? line ?
0x40104877: lmacRxDone at ?? line ?
0x40102199: trc_NeedRTS at ?? line ?
0x40105159: ets_timer_disarm at ?? line ?
0x4010236a: trc_NeedRTS at ?? line ?
0x401027aa: wDev_ProcessFiq at ?? line ?
0x40254fa8: node_remove_from_list at ?? line ?
0x40254df0: node_remove_from_list at ?? line ?
0x40105159: ets_timer_disarm at ?? line ?
0x4023f92f: pp_attach at ?? line ?
0x4023f97e: pp_attach at ?? line ?
0x4023fa8a: pp_attach at ?? line ?
0x40101482: pp_post at ?? line ?
0x4023ea27: ppTxPkt at ?? line ?
0x40231def: ieee80211_output_pbuf at ?? line ?
0x40104eff: wdt_feed at ?? line ?
0x40101482: pp_post at ?? line ?
0x40101482: pp_post at ?? line ?
0x40104877: lmacRxDone at ?? line ?
0x40101482: pp_post at ?? line ?
0x40101482: pp_post at ?? line ?
0x40104877: lmacRxDone at ?? line ?
0x40102199: trc_NeedRTS at ?? line ?
0x4010236a: trc_NeedRTS at ?? line ?
0x4010236a: trc_NeedRTS at ?? line ?
0x401027aa: wDev_ProcessFiq at ?? line ?
0x40102544: wDev_ProcessFiq at ?? line ?
0x40101482: pp_post at ?? line ?
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x40102199: trc_NeedRTS at ?? line ?
0x4010065c: _umm_free at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1304
0x40100aa4: free at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1764
0x401027aa: wDev_ProcessFiq at ?? line ?
0x40218ab8: mem_free at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c line 237
0x402136b6: pbuf_free_LWIP2 at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c line 786 (discriminator 1)
0x40218a90: mem_malloc at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c line 210
0x401008ac: malloc at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1677
0x40218a90: mem_malloc at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c line 210
0x4020fe68: do_memp_malloc_pool at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/memp.c line 254
0x4020fea4: memp_malloc at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/memp.c line 356
0x40217443: etharp_query at /home/gauchard/dev/esp8266/esp8266/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/etharp.c line 1031
0x40253340: sleep_reset_analog_rtcreg_8266 at ?? line ?
0x40105159: ets_timer_disarm at ?? line ?
0x40105159: ets_timer_disarm at ?? line ?
0x4010018a: millis at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_wiring.cpp line 180
0x40100165: millis at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_wiring.cpp line 174
0x40100aa4: free at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266\umm_malloc/umm_malloc.cpp line 1764
0x401001c0: millis at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_wiring.cpp line 186
0x4020161a: ESP8266WiFiGenericClass::forceSleepBegin(unsigned int) at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\libraries\ESP8266WiFi\src/ESP8266WiFiGeneric.cpp line 484
0x40205601: esp_schedule at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_main.cpp line 102
0x40205699: loop_wrapper() at D:\Programy\arduino\hardware\esp8266com-git_455583b\esp8266\cores\esp8266/core_esp8266_main.cpp line 134

@JiriBilek, generally speaking, I used to have the wdt reset situation, and I found that I need to call the ESP functions in a strict order. Since then, wdt resets have never happened to me again.
I also turn WiFi on and off very often. When the ESP wakes up from deep sleep, it executes setup(), acquires data, turns on WiFi, sends the data, and goes back to sleep.

Here is my general code:

  1. After Serial.print or Serial.println, wait at least 80 ms before calling any WiFi.xxxx command.
  2. Sequence to turn on WiFi, in this case your wifiConnect() function:
    void wifiConnect() {
        WiFi.forceSleepWake();
        delay (1);
        WiFi.mode (WIFI_STA);
        WiFi.begin (ssid, pass);
        ...}
  3. Sequence to turn off WiFi, in this case your wifiDisconnect() function:
    void wifiDisconnect(void) {
        Serial.print(F("Disconnecting client"));
        wifiClient.stop();
        Serial.print(F(", wifi"));
        delay (100);
        WiFi.disconnect();
        Serial.println(F(", sleeping"));
        delay (100);
        lastSleep = millis();
        WiFi.mode (WIFI_OFF);
        WiFi.forceSleepBegin();
        delay (1);
    }

The delay (1); must be added. I don't know why, but the ESP does not seem to work without it.
Actually, I'm using ESP.deepSleep(), so my code is a little different, but the sequence for turning WiFi on and off is the same.

See if it helps?

@tbdltee: Thanks for the information, but unfortunately the proposed changes didn't work either. The device fires a wdt about every 1.5 hours.
The only solution I see now is to revert to the old core.

The relevant part of the code:

void wifiConnect(void) {
    WiFi.forceSleepWake();
    delay(100);

    WiFi.mode(WIFI_STA);
    //delay(100);

    WiFi.begin(SSID, PASSWORD);
    Serial.printf_P(PSTR("Connecting to %s "), SSID);

    while (WiFi.status() == WL_DISCONNECTED) {
        Serial.write('.');
        delay(500);
    }
    Serial.println();

    if (WiFi.status() == WL_CONNECTED) {
        Serial.printf_P(PSTR("WiFi connected (RSSI %d), IP address: %s, "), WiFi.RSSI(), WiFi.localIP().toString().c_str());
        Serial.printf_P(PSTR("mem: %d\r\n"), ESP.getFreeHeap());

        if (time(nullptr) < 100000000)
            readTime();
    }
}

void wifiDisconnect(void) {
    // Disconnecting wifi
    Serial.print(F("Disconnecting client"));
    wifiClient.stop();

    Serial.print(F(", wifi"));
    delay(100);  // FIXME

    WiFi.disconnect();

    Serial.println(F(", sleeping"));
    delay(100);  // FIXME

    lastSleep = millis();

    WiFi.mode(WIFI_OFF);
    WiFi.forceSleepBegin();  // turn off ESP8266 RF
    delay(1);
}
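
A note on the wait loop in wifiConnect() above: it polls for as long as WiFi.status() reports WL_DISCONNECTED, so a failed wake-up leaves the sketch waiting with no upper bound. Below is a minimal, hardware-independent sketch of the same loop with a timeout. The names waitForLink and LinkStatus and the two callbacks are hypothetical and not part of the ESP8266 core; on the device the callbacks would presumably wrap WiFi.status() and delay().

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the wl_status_t values the sketch checks.
enum class LinkStatus { Disconnected, Connected, Failed };

// Polls status() every stepMs until it stops reporting Disconnected
// or until at least timeoutMs of waiting has accumulated.
// Returns the last status seen, so the caller can tell a timeout
// (still Disconnected) from a real result.
LinkStatus waitForLink(const std::function<LinkStatus()>& status,
                       const std::function<void(uint32_t)>& sleepMs,
                       uint32_t timeoutMs, uint32_t stepMs = 500) {
    uint32_t waited = 0;
    LinkStatus s = status();
    while (s == LinkStatus::Disconnected && waited < timeoutMs) {
        sleepMs(stepMs);  // assumption: delay(stepMs) on the ESP8266
        waited += stepMs;
        s = status();
    }
    return s;
}
```

If the function returns Disconnected, the caller can put the radio back to sleep and retry at the next wake-up instead of blocking.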

@JiriBilek. Thanks for update.
Currently, I'm using core 2.4.2, SDK 2.2.1. Core 2.5.x also has a lot of problems for me. I'll wait for a while until it's more stable.

But, after all, it is the best setup for the new core (2.6.0-dev). After two wdts in the evening it ran overnight without problems (8.5 hours).

Just for information, the star in the graph is a reset and the gray circle is a successful transmission. The first two stars in the picture are resets caused by firmware upload.
[image: graph of resets and successful transmissions]

I can own this problem too. I spent ages trying everything. The issue is intermittent and I found my 'solution' was just to comment out my WiFi.disconnect(); line, and then it runs perfectly stably. I have not tested any effect on power consumption. Note the crash site is variable intermittent and some lines after the disconnect call. As I tinkered the problem got worse, not better and removing that line made all the difference. I can't explain it, but it is a simple thing to try and report back on. PaulS. 12TA.

@JiriBilek: I note that [earlephilhower] above said that the guts of the "WiFi.mode(WIFI_OFF)" and similar commands are actually in the Espressif SDK.

Also that the latest esp8266 github (eg v31May19) for the Arduino IDE has a "Generic ESP8266 Module" with the options to select 3 different versions of the SDK, namely: "nonos-sdk 2.2.1 (legacy)=v2.1.0-10-g509eae8", "2.2.2-190313 (testing)=v2.2.1-61-gc7b580c" and "sdk pre-3 (known issues)=v2.2.0-28-g89920dc".

For my current sketch which has the "Hardware Watchdog being intermittently triggered by (shortly after) WiFi.mode(WIFI_OFF)" issue:

  • "nonos-sdk 2.2.1 (legacy)=v2.1.0-10-g509eae8" - causes intermittent Hardware Watchdog crashes.
  • "sdk pre-3 (known issues)=v2.2.0-28-g89920dc" - causes intermittent Hardware Watchdog crashes
  • "2.2.2-190313 (testing)=v2.2.1-61-gc7b580c" - still causes intermittent Hardware Watchdog crashes

So, as earlier github software using “sdk pre-3” works fine, I do not think the SDK version is causing my Hardware Watchdog crash issue.

Instead, I note that:

Update (28Jul19):
I note that my post below, which detailed the solution I found to my specific issue (intermittent Hardware WDT crashes after "WiFi.mode(WIFI_OFF)"), appears to have been hidden by an admin. Therefore, if anyone else has the same issue, the (“temporary”) solution which worked for me (and I believe also worked for @JiriBilek) was to reverse the following commit:

  • “stop lwIP dhcp client when WiFi goes off #5703” (1Feb19)
    (https://github.com/esp8266/Arduino/pull/5703/files/60560e634284a5beb194588bdb5696e396824044)
    Which added the line “wifi_station_dhcpc_stop();”

Specifically, I edited the file “libraries/ESP8266WiFi/src/ESP8266WiFiGeneric.cpp” and commented out relevant 2 lines so they say:

// if (m != WIFI_STA && m != WIFI_AP_STA) // commented out as causing intermittent WDT
// wifi_station_dhcpc_stop(); // commented out as causing intermittent Hardware WDT

And this solved my intermittent Hardware WDT crashes (I have now been running 8 sensors connecting to a remote server every 10 minutes approx for over 4 weeks using the 31May19 github software without a single WDT crash).

I hope this info will be useful to someone.

@Rob58329: I think I am using the legacy core in all my tests. Not sure because now, I don't have my computer with me. Will be back in July.

WiFi.forceSleepBegin() must be accompanied by WiFi.forceSleepWake().

I could reproduce the bug, and it vanished with a call to WiFi.forceSleepWake() just before WiFi.begin().

sketch (same as above without any form of WiFiClient like @earlephilhower did), with the fix:

#include <ESP8266WiFi.h>

#define SSID STASSID
#define PASSWORD STAPSK

static uint32_t MILLIS_TO_WAKE_UP = 30 * 1000; // wake up after 30 seconds
uint32_t lastSleep = 0;  // millis of the last time when forced to sleep

void setup() {
  WiFi.setAutoConnect(false);
  Serial.begin(115200);

  pinMode(LED_BUILTIN, OUTPUT);
  digitalWrite(LED_BUILTIN, LOW);  // Start indicator

  // Wait for GPIO0 down as a start condition (we want to stop here after a wdt reset)
  /*    Serial.println(F("\nConnect GPIO0 to GND to start"));
      pinMode(0, INPUT_PULLUP);
      while (digitalRead(0) == HIGH)
          delay(100);
  */
  pinMode(LED_BUILTIN, INPUT);  // Back to default

  Serial.print(ESP.getSdkVersion());

  wifiConnect();
  wifiSend(3);
  wifiDisconnect();
}

void loop() {
  //Serial.print("\x08.");
  static int count = 0;
  if (((++count) % (1024 * 16)) == 0)
  {
    Serial.print("x");
    count = 0;
  }

  if (millis() - lastSleep > MILLIS_TO_WAKE_UP) {
    wifiConnect();
    wifiSend(2);
    wifiDisconnect();
  }
}

void wifiConnect(void) {
  Serial.println("start connecting");
  WiFi.mode(WIFI_STA);
  delay(100);
  WiFi.forceSleepWake();  // <============================================ FIX
  WiFi.begin(SSID, PASSWORD);
  Serial.printf_P(PSTR("Connecting to %s "), SSID);

  while (WiFi.status() == WL_DISCONNECTED) {
    Serial.write('.');
    delay(500);
  }
  Serial.println("not disconnected");

  if (WiFi.status() == WL_CONNECTED) {
    Serial.printf_P(PSTR("WiFi connected (RSSI %d), IP address: %s, "), WiFi.RSSI(), WiFi.localIP().toString().c_str());
    Serial.printf_P(PSTR("mem: %d\r\n"), ESP.getFreeHeap());

    if (time(nullptr) < 100000000)
      readTime();
  }
}

void wifiDisconnect(void) {
  // Disconnecting wifi

  Serial.print(F(", wifi"));
  WiFi.disconnect();
  WiFi.mode(WIFI_OFF);
  delay(100);  // FIXME

  Serial.println(F(", sleeping"));

  //WiFi.forceSleepBegin();  // turn off ESP8266 RF
  wifi_set_opmode_current(WIFI_OFF);
  //WiFi.forceSleepBegin(/*default*/0) equivalent:
  // sleep forever until wifi_fpm_do_wakeup() is called
  wifi_fpm_set_sleep_type(MODEM_SLEEP_T);
  wifi_fpm_open();
  wifi_fpm_do_sleep(0xFFFFFFF);

  delay(100);  // FIXME

  lastSleep = millis();
}

boolean wifiSend(int8_t status) {
  // Check the wifi
  if (WiFi.status() != WL_CONNECTED) {
    Serial.println(F("[WiFi] Not connected to AP"));
    return false;
  }
  return true;
}

void readTime(void) {
  if (WiFi.status() != WL_CONNECTED) {
    return;
  }

  Serial.print(F("Setting time using SNTP "));

  configTime(1 * 3600, 0, "tik.cesnet.cz", "pool.ntp.org");

  // Read time, wait 5 seconds
  uint32_t m = millis();
  time_t now = time(nullptr);
  while (now < 100000000 && millis() - m < 5000) {
    delay(100);
    Serial.write('.');
    now = time(nullptr);
  }
  Serial.println();

  if (now < 100000000) {
    Serial.println(F("Time was not set."));
  }
  else {
    Serial.print(F("Current time: "));
    Serial.println(ctime(&now));
  }
}
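
A side note on the wake-up test in loop() above: millis() returns an unsigned 32-bit counter that wraps after about 49.7 days, and the form millis() - lastSleep > MILLIS_TO_WAKE_UP stays correct across that wrap because unsigned subtraction is modular. The sketch below isolates that check so it can be tried off-device; wakeDue is a hypothetical helper name, not something from the sketch or the core.

```cpp
#include <cstdint>

// True once more than intervalMs have elapsed since lastMs.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct
// when the millisecond counter overflows between lastMs and nowMs.
constexpr bool wakeDue(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastMs) > intervalMs;
}
```

Comparing absolute timestamps instead (for example nowMs > lastMs + intervalMs) would misfire around the wrap, which is why the subtraction form is the usual choice.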

In my opinion, which comes from experimentation too, the single command
WiFi.disconnect();
is the cause of the intermittent hardware resets since commenting it out cured the problem for me. Can someone else verify this? Paul

I believe that my intermittent Hardware WDT crashes after "WiFi.mode(WIFI_OFF)" are caused by a commit dated 1Feb19, being:

  • “stop lwIP dhcp client when WiFi goes off #5703”
    (https://github.com/esp8266/Arduino/pull/5703/files/60560e634284a5beb194588bdb5696e396824044)

Which added the line “wifi_station_dhcpc_stop();”

(The above line is part of “bool ESP8266WiFiGenericClass::mode(WiFiMode_t m)”, so I assume this is what the arduino command "WiFi.mode(WIFI_OFF)" calls.)

Specifically, if I edit “libraries/ESP8266WiFi/src/ESP8266WiFiGeneric.cpp” and comment out the relevant two lines so they say:
// if (m != WIFI_STA && m != WIFI_AP_STA) // commented out as causing intermittent WDT
// wifi_station_dhcpc_stop(); // commented out as causing intermittent Hardware WDT

Then my sketches no longer have these intermittent Hardware WDTs!

  • My sketches now work fine using https://github.com/esp8266/Arduino/tree/2.5.0
  • They also work fine using the recent github (31May19) with either “nonos-sdk 2.2.1 (legacy)=v2.1.0-10-g509eae8" or “sdk pre-3 (known issues)=v2.2.0-28-g89920dc", and either axTLS or BearSSL, and either lwIP v1.4 or lwIP v2 - all work fine for me now!

FYI: I have 8 ESP8266-D1-Mini sensors, which connect to a remote server using SSL every 8 minutes or so. Interestingly, each ESP8266 seemed to have these “intermittent Hardware WDT” crashes at different rates (NB. they all auto-recover from these crashes and log the crash-details to eeprom): I have one sensor, which would run OK for a few hours, and then crash as above after every single connection (every 8 minutes) for 2+ hours until I removed its power. However, most of my ESP8266 only crashed once every 12 to 24 hours or so, and one of my sensors only crashed after 7 days. (NB. But they all crashed with a Hardware WDT in the “delay(2000)” which I had added after my "WiFi.mode(WIFI_OFF)"). Now with the above “wifi_station_dhcpc_stop();” removal, all 8 sensors have been running without issues for 3 days (ie 8*3=24 sensor-days) using the recent github (of 31May19).

Update: Now all have been continuously running without crashes for 10 days (ie. 8*10=80 sensor-days).

(PS. My sketches use both WiFi.forceSleepBegin() and WiFi.forceSleepWake() in the correct places.)

(PPS. Similar to @tbdltee: I have also for some time used a "delay(1)" after WiFi.forceSleepWake() & WiFi.forceSleepBegin(); and a "delay(5)" after WiFi.mode(WIFI_AP_STA) & WiFi.mode(WIFI_OFF) & WiFi.disconnect(). I found this was necessary to stop "Exception28" crashes - but note that ALL of these delay()s are probably not necessary, I just left them all in until I get time to test which ones are essential and which ones I can remove!)

@d-a-v: In last 3 days, I was testing your suggestion in various setups and it works! The key is to change wifi mode with WiFi.mode(WIFI_STA); first and then after a small delay, wake up the radio WiFi.forceSleepWake();. The reverse case that I unfortunately had in my projects does not work in newer releases of esp8266/Arduino and fires intermittent wdts outside the loop() function.
Thank you, I would never have discovered it.
I will try to find out where delays are needed and where not but it doesn't change the basic idea.

@Rob58329: your investigation is great, please check if your app is working when you change the mode first and only then wake up the radio.

The relevant code once more:

void wifiConnect(void) {
  WiFi.mode(WIFI_STA);
  delay(100);
  WiFi.forceSleepWake();
  WiFi.begin(SSID, PASSWORD);
...

For me the problem is solved, thank you all for helping.

Sad to say, but I was too optimistic. The intermittent wdts disappeared in my simple test, but with network traffic they are back. The wdts do not fire as frequently as before, though.
I am trying @Rob58329 tip with removing wifi_station_dhcpc_stop().
Will keep posting here.

Try to do a WiFi disconnect from the AP and see if the node crashes.
See: https://github.com/esp8266/Arduino/issues/6266

Ideas to help force a disconnect:

  • Change channel on AP
  • Reboot AP
  • Kick node from AP (only a few can do that)

@TD-er: does it make sense to force disconnect when the node is almost all the time disconnected (low power) and only once per 5 minutes connects, sends data and then disconnects again?

@JiriBilek the thing is, a forced disconnect can happen almost any time.
For example when the AP does change channel or when it is set to do a force disconnect at N received packet errors. (depends on brand/model/setting of AP)

And as far as I know, the problem is not really with the disconnect itself, but somehow the ESP does crash with a WD reboot at this (re)connect stage.
This does not happen always, but as far as I have seen, it can be triggered by just sending a disconnect message to the ESP.
With normal use, a disconnect from the WiFi doesn't happen that often and it also depends on a lot of factors.
Every now and then, the ESP doesn't even know it is being disconnected and just continues doing its work.
I don't know if in the background it may do something, but at least none of the WiFi events are fired to let me know something has changed.
Maybe it does do something in the background, because it sometimes also crashes with a WD reboot right after a WiFi disconnect message was sent to the ESP node, but no event was reported on the ESP itself (at least not an event I have a callback function for)

Before I said this.
" In my opinion, which comes from experimentation too, the single command
WiFi.disconnect();
is the cause of the intermittent hardware resets since commenting it out cured the problem for me. Can someone else verify this? Paul "
The reason, I think, that people want to disconnect from WiFi is often to reduce power consumption by the ESP8266. Anyway, that is what I wanted. I was copying code snippets with that intention. Since then, I can confirm by experiment that power consumption falls to around 10 mA on my NodeMCU board, so I assume WiFi is off and that my project really has stopped resetting. If others use that command in their projects, can they report on whether removing it fixes it?
FYI the reset occurred some time after the command and did seem to be intermittent and random. As the problem has gone away and my program is massive, I can't easily post it all. Paul

FYI the reset occurred some time after the command and did seem to be intermittent and random. As the problem has gone away and my program is massive

Well I am no stranger to "massive" regarding sketch size :)
Just removing the disconnect calls does not fix it here.
Still WD reboots every 2 or 3 times I force a WiFi disconnect from the AP.

Also removing references to the somewhat related function to set the WiFi mode to WIFI_OFF does not fix this issue.

@TD-er I am not sure if I understood your post. Does it mean that, while connected, I can get intermittent WD reboots just by disconnecting the client from the AP, and that the problem is not in these libraries but in the core?

Anyway, the day before yesterday, I changed my sketch as @Rob58329 suggested. The application works since then without a problem. I am continuing testing but starting to think the problem lies in commit https://github.com/esp8266/Arduino/pull/5703 that solves one issue and creates another one, unfortunately.

@JiriBilek: Many thanks for above info: So I understand the suggested code changes the order of commands to:

void wifiConnect(void) {
  WiFi.mode(WIFI_STA);
  delay(100);
  WiFi.forceSleepWake(); delay(1);
  WiFi.begin(SSID, PASSWORD);
}

but still using:

void wifiOff(void) {
  WiFi.disconnect();
  WiFi.mode(WIFI_OFF);
  delay(2000); // increased to 2 seconds for testing to ensure crashes happen here & not later
  WiFi.forceSleepBegin(); delay(1);
}

[ I hadn't spotted that the @d-a-v code in his above post had "WiFi.forceSleepWake()" AFTER "WiFi.mode(WIFI_STA)"!! ]

I have tried changing my code to match the above, but sadly it makes no difference for me (ie. same as for you I understand): my code still has intermittent Hardware WDT crashes in the "delay(2000)" I have added after my "WiFi.mode(WIFI_OFF)". (In my short tests the frequency of crashes seems approximately the same as before.)

HOWEVER, as I detailed previously, editing "ESP8266WiFiGeneric.cpp" to comment out the "wifi_station_dhcpc_stop()" line solves my WDT crash issue, with both the above suggested order, and also my original "WiFi.forceSleepWake(); delay(1); WiFi.mode(WIFI_STA); delay(5);" order.

I would be interested to know what the "official(??)" suggested order for "WiFi.forceSleepWake()" and "WiFi.mode(WIFI_STA)" is though, as the @d-a-v post above is the only one I can currently find which has it in this "WiFi.mode(WIFI_STA) first and THEN WiFi.forceSleepWake()" order, and all the other posts I can find use the opposite order?

About wifi_station_dhcpc_stop(): a test is probably missing prior to this call. I'll work on that asap.
It was added because a bug was suspected (the dhcp timer still being around and waking up when STA wasn't anymore). But a check of whether STA is up before that call is missing.

About WiFi.forceSleepWake() order I am no authority.
I think it would be nice to put inside the core API the way to go to radio-sleep and to wake up from radio-sleep. So we can have a common basis to debug with and to improve.

@Rob58329 @JiriBilek @TD-er

Then my sketches no longer have these intermittent Hardware WDTs!

Can you please try this in ESP8266WiFiGeneric.cpp:

declare just before bool ESP8266WiFiGenericClass::mode(WiFiMode_t m):

extern "C" struct netif* eagle_lwip_getif(uint8 index);

then replace

    if (m != WIFI_STA && m != WIFI_AP_STA)

by

    if (m != WIFI_STA && m != WIFI_AP_STA && eagle_lwip_getif(STATION_IF) != nullptr)

(and leave wifi_station_dhcpc_stop(); in place)
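
To make the proposed condition easier to reason about, here is a small off-device model of it. The enum values mirror WiFiMode_t in the ESP8266 core (WIFI_OFF = 0, WIFI_STA = 1, WIFI_AP = 2, WIFI_AP_STA = 3); the names WiFiModeModel, staRequested and shouldStopDhcp are hypothetical, and stationIfExists stands in for eagle_lwip_getif(STATION_IF) != nullptr.

```cpp
#include <cstdint>

// Values mirror the WiFiMode_t enum in the ESP8266 core.
enum WiFiModeModel : uint8_t { MODE_OFF = 0, MODE_STA = 1, MODE_AP = 2, MODE_AP_STA = 3 };

// The station bit is set in both STA and AP_STA, so this is equivalent
// to the test (m == WIFI_STA || m == WIFI_AP_STA).
constexpr bool staRequested(WiFiModeModel m) { return (m & MODE_STA) != 0; }

// Stop DHCP only when the new mode drops the station interface
// AND a station interface currently exists.
constexpr bool shouldStopDhcp(WiFiModeModel m, bool stationIfExists) {
    return !staRequested(m) && stationIfExists;
}
```

The only behavioural difference from the unguarded version is the case where the mode drops STA but no station interface is present: the guarded version skips the DHCP stop there.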

@d-a-v
I will try the suggested check for the existence of the network adapter object (by lack of a better term).

And about moving stuff into the core; I think if steps are needed to have something functional, then yes!
Just because the core libraries should be as simple to operate as possible and thus remove the burden from the user to make explicit calls to some functions in a specific order.
But on the other hand, if it is a truly optional step with some specific use cases where it can be useful to not perform the steps then it should not be a fixed step in the core libraries.
If for some reason it is only needed to exclude a step in very specific use cases, then wrapping it in an #ifdef or if (<some defined boolean>) is also an option.

I am trying the fix and it's running smoothly for 90 minutes now. I plan to test it for at least 24 hours. Fingers crossed.

I just tested the proposed change, but it does not fix the WD reboots I am seeing when the unit disconnects.
The code change suggests that the case the extra check guards against would crash with a null pointer dereference, but what I'm seeing is that the ESP simply halts, or at least stops running the loop() function.

@TD-er when #6283 is solved maybe we can understand better your WDT issue.

@TD-er I was thinking of a new WiFi.mode(WIFI_SHUTDOWN) which would not overlap current API, but which can be taken into account by the core API when setting wifi up to another mode.

edit
Also there is this FW api we don't so far know/use (not present in official user_interface.h but described in latest api reference datasheet).

After 3 hours, I've got first reset. I'm keeping running the node to collect more data.
The previous configuration (dhcpc stopping commented out) ran 3 days without a problem.

Does your sketch include the "WiFi.disconnect" command? If no, just say so,
if it does, have you tried commenting that command out. That was all I
needed to do. After, I had called
"WiFi.mode ( WIFI_OFF );"
but that alone seems enough to disconnect WiFi and reduce the power
consumption and no more random resets, which were driving me bonkers.

Dr Paul Stross.

So it doesn't work the way removing wifi_station_dhcpc_stop() does, @JiriBilek.
Now can you please try and replace the test by:

    if (m != WIFI_STA && m != WIFI_AP_STA)
        // stop dhcp client on STA (safe to call if not started)
        for (netif* i = netif_list; i; i = i->next)
            if (i->num == STATION_IF)
                dhcp_stop(i);

and

extern "C" struct netif* eagle_lwip_getif(uint8 index);

by

#include <lwip/dhcp.h>
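
For readers who want to see why this loop is safe when the station interface is absent, here is a hardware-independent model of it. NetifModel is a hypothetical stand-in for lwIP's struct netif that keeps only what the loop needs, plus a dhcpRunning flag in place of the real dhcp_stop(i) call; STATION_IF is 0 in the NONOS SDK headers.

```cpp
#include <cstdint>

// Minimal stand-in for lwIP's struct netif: the interface number,
// the list link, and a flag modelling whether DHCP is active.
struct NetifModel {
    uint8_t num;
    NetifModel* next;
    bool dhcpRunning;
};

constexpr uint8_t STATION_IF_MODEL = 0;  // STATION_IF is 0 in the NONOS SDK headers

// Walks the list and stops DHCP on every interface numbered as the station.
// Returns how many interfaces were touched; 0 means no station netif was found.
int stopStationDhcp(NetifModel* list) {
    int touched = 0;
    for (NetifModel* i = list; i; i = i->next) {
        if (i->num == STATION_IF_MODEL) {
            i->dhcpRunning = false;  // the real code calls dhcp_stop(i) here
            ++touched;
        }
    }
    return touched;
}
```

With an empty list, or a list that only holds the AP interface, the loop body never runs, so nothing is dereferenced.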

@12TA
It's interesting that you state that WiFi.mode(WIFI_OFF) works where WiFi.disconnect() doesn't (wdt...).
The underlying API is different for the same desired effect and we might consider transparently replacing WiFi.disconnect() code (which disconnects STA) by a call to WiFi.mode(AP or OFF) (according to current mode).

@12TA: Yes, I have WiFi.disconnect(), see the fragment of the code:

    WiFi.disconnect();
    WiFi.mode(WIFI_OFF);
    Serial.println(F(", sleeping"));
    delay(100);
    WiFi.forceSleepBegin();  // turn off ESP8266 RF

@d-a-v: This time I started to get a wdt reset on every disconnect. Then it somehow settled (really strange, I did nothing and it started to work after a couple of minutes) and now it is running. I will keep watching it.
Note, after flashing the code I did a power cycle to be sure there is nothing left in RAM.
The relevant code from ESP8266WiFiGeneric.cpp:

bool ESP8266WiFiGenericClass::mode(WiFiMode_t m) {
    if(_persistent){
        if(wifi_get_opmode() == (uint8) m && wifi_get_opmode_default() == (uint8) m){
            return true;
        }
    } else if(wifi_get_opmode() == (uint8) m){
        return true;
    }

    bool ret = false;

    if (m != WIFI_STA && m != WIFI_AP_STA)
        // calls lwIP's dhcp_stop(),
        // safe to call even if not started
        for (netif* i = netif_list; i; i = i->next)
            if (i->num == STATION_IF)
                dhcp_stop(i);

    ETS_UART_INTR_DISABLE();
    if(_persistent) {
        ret = wifi_set_opmode(m);
    } else {
        ret = wifi_set_opmode_current(m);
    }
    ETS_UART_INTR_ENABLE();

    return ret;
}

Edit: oh no, another wdt :(

For clarity, I also used
"WiFi.mode ( WIFI_OFF ); delay ( 10 ); WiFi.forceSleepBegin (); "
and removing a prior
"WiFi.disconnect ( ); delay ( 10 );"
fixed the reboots issue while still reducing power input and allowing a subsequent
"WiFi.forceSleepWake (); delay (1); WiFi.mode ( WIFI_STA ); delay (1); "
to resume WiFi functionality just prior to a quick post of data, then back to sleep.

I was copying code snippets I had found online. Dr Paul Stross.

@12TA: I removed WiFi.disconnect() like you suggested. The MCU is entering low power mode (< 20 mA) and wakes up correctly. So far so good.
Testing with @d-a-v's latest fix.

Thanks. I feel validated. The 'fix' for me has stood the test of time but
on 1 sketch and NodeMCU only (as that is all I have tried).

Dr Paul Stross.

@TD-er when #6283 is solved maybe we can understand better your WDT issue.

I agree.
Not sure how long that issue may be present in the code, since the WD reboots did appear somewhere in mid 2018.
But the WD reboots may have several causes, and the symptoms described in that issue look very similar to what I'm seeing.

@TD-er I was thinking of a new WiFi.mode(WIFI_SHUTDOWN) which would not overlap current API, but which can be taken into account by the core API when setting wifi up to another mode.

That sounds very good, since there is quite a lot of energy saving possible when turning off the WiFi, but right now I am not using that potential because of the reboots.
Should I add already an issue for it in the form of a feature request to keep this thread a bit more clean?

_edit_
Also there is this FW api we don't so far know/use (not present in official user_interface.h but described in latest api reference datasheet).

I will look at what these do.
Is it part of the SDK3.x only?

The testing node is now running for 8 hours, no problem.
I really don't understand why removing WiFi.disconnect() could fix the issue :(
I am open to test whatever ideas.

@d-a-v: Your first "extern "C" struct...eagle_lwip_getif..." suggestion does not work for me either. If anything it seems to produce more frequent WDT crashes for me.

Update: Your second "#include <lwip/dhcp.h>" suggestion does seem to reduce the WDT crashes for me by the order of about 50% (ie. I get them on average only once every 24hrs instead of once every 12hours), but I do still get WDT crashes.

Interestingly, in addition to the above "WiFi.disconnect()" perhaps not being needed, I have just noticed that "WiFi.mode(WIFI_STA)" and "WiFi.forceSleepWake()" no longer appear to be necessary either. It appears that "WiFi.begin(ssid, password)" automatically does this on its own!?? (Although exactly which "WiFi.mode()" it uses I'm not sure.) An example sketch is:

#include <ESP8266WiFi.h>
const char* ssid     = "SSID";     // your network SSID (name of wifi network)
const char* password = "password"; // your network password
void setup() {
  Serial.begin(74880); delay(100);
  Serial.println("\nBooted ok - typical USB current now=75mA"); // typically for ESP8266-D1-mini
  WiFi.mode(WIFI_OFF); delay(100);
  Serial.println("-> WiFi.mode(WIFI_OFF) - typical USB current now=75mA");
  WiFi.forceSleepBegin(); delay(1);
  Serial.println("-> WiFi.forceSleepBegin() - typical USB current now=20mA");
  delay(1000);
  Serial.print("Attempting to connect to SSID: "); Serial.println(ssid);
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {Serial.print("."); delay(500);}
  Serial.print("\nConnected to "); Serial.print(ssid); Serial.println(" - typical USB current now=75mA"); 
  Serial.println("\nDemonstration that 'WiFi.mode(WIFI_STA)' and 'WiFi.forceSleepWake()' are no longer required!?"); 
}
void loop() {}

Which connects to my Wifi SSID fine, even without "WiFi.mode(WIFI_STA)" and "WiFi.forceSleepWake()"!? (This is using the ESP8266-arduino-github software as at 31May19).

since the WD reboots did appear somewhere in mid 2018.

#4482 was in March 2018 and it seems from that time assert/abort/panic are seen as WDT.

WiFi.mode()

Should I add already an issue

Why not, it would open the discussion

Is it part of the SDK3.x only?

These symbols are present in our current firmware.
They will be useful when waking up from WIFI_SHUTDOWN.

since the WD reboots did appear somewhere in mid 2018.

#4482 was in March 2018 and it seems from that time assert/abort/panic are seen as WDT.

OK, that does explain a lot.
I wasn't sure when the WD reboots did start to occur, since around that time we also had a lot of other issues at hand regarding build issues. (Builds would break at random so it seems. Change a single line (even comments) and it would not connect to WiFi for example)
So if it was a commit from March 2018, I would say please make it the highest priority bug fix and it will probably eliminate lots and lots of other reported issues.

I quite agree with you and don't understand the logic, as it is so intermittent
and random, BUT I grappled with it for many hours trying workarounds, most
of which did not work, and the problem seemed to get worse. Several of others'
'Solved' solutions did not stand the test of time. At one stage it got to
be every loop, till I removed the
WiFi.disconnect ( );
and suddenly it was stable! :-) My project is a GPS logger, I feared
interference between WiFi and the GPS signals. Also it's battery powered,
so current saving was important. I also found the ESP8266 could not handle
the GPS serial stream while doing WiFi connections and postings. So, using
the WiFi and GPS alternately was desired, with a 30 second loop between
ThingSpeak postings. It recovered from the reboots BUT I could not measure
cumulative distance. Now, I can.
I bought a crude USB current monitor and proved it had achieved a lower
power state. As I was confident my 'fix' worked, I wanted to share it with
people who might be able to explain it to me, and turn magic into logic,
and help others who had the same problem.

Paul Stross.

Can someone with the WDT errors please try #6288 and report back? It's got a fix to catch some more panic/asserts.

Do you mean to apply https://github.com/esp8266/Arduino/pull/6288 in my code?

Yes, #6288. Very small change, you can apply by hand if needed or standard Github PR process. Sorry bout that, got issue and PR mixed up.

The testing node setup now:

Running.

2 wdt resets so far :(
10 minutes testing, nothing special in serial port (should I have turned on debugging?)

Edit: not working, many wdts

Thanks for the update. No, there's no need for debugging. This only catches assert()/panic()/etc. in an IRQ, so this at least lets us rule that possibility out.

About my earlier remark here: https://github.com/esp8266/Arduino/issues/6172#issuecomment-510815344

#4482 was merged on March 9 2018, which is a day after release of core 2.4.1.

I just tested a core 2.4.1 build and it also does a WDT reset as soon as I force a WiFi reconnect.
So fixing code related to that PR may not fix the WiFi issues :(

Just to give a short heads-up.
I built a version based on the current master branch and the WD-reboot on reconnect still exists.
It was already expected given earlier tests, but just to be sure.

System Libraries | ESP82xx Core 2.6.0-dev, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support
-- | --
Build Time| Jul 16 2019 01:16:29

Thanks for reports.
Then #5667 should be solved in another way.

Just some more debug info:

I now have a setup that connects to WiFi only on the first attempt (after flashing); on (almost) every reboot after that it fails to connect.
Only after 10 - 50 attempts may it succeed in connecting.

This is the correct sequence (with Serial debug enabled):

4961 : Waiting for WiFi connect
scandone
state: 0 -> 2 (b0)
5859 : Waiting for WiFi connect
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 8
cnt

connected with MikroTik, channel 8
dhcp client start...
ip:192.168.1.75,mask:255.255.255.0,gw:192.168.1.1
5985 : WIFI : Entering processConnect()

This is what happens in the next reboot attempts:

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
54 : INIT : Free RAM:33568
55 : INIT : Warm boot #113 Last Task: Background Task - Restart Reason: External System
57 : FS   : Mounting...
82 : FS   : Mount successful, used 77559 bytes of 957314
86 : CRC  : No program memory checksum found. Check output of crc2.py
121 : CRC  : SecuritySettings CRC   ...OK
161 : INIT : Free RAM:30392
163 : INIT : I2C
164 : INIT : SPI not enabled
2793 : INFO : Plugins: 48 [Normal] (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
2796 : WIFI : Current mode 0
2797 : mode : sta(a0:20:a6:19:2a:bf)
add if0
fpm close 3 
del if0
usl
mode : null
WIFI : Set WiFi to STA
2934 : mode : sta(a0:20:a6:19:2a:bf)
add if0
WIFI : Connecting MikroTik attempt #0
3038 : WIFI : WiFi.status() = WL_DISCONNECTED  SSID: MikroTik
3039 : Waiting for WiFi connect
3141 : Waiting for WiFi connect
3243 : Waiting for WiFi connect
3344 : Waiting for WiFi connect
3445 : Waiting for WiFi connect
3548 : Waiting for WiFi connect
3649 : Waiting for WiFi connect
3750 : Waiting for WiFi connect
3852 : Waiting for WiFi connect
3954 : Waiting for WiFi connect
4055 : Waiting for WiFi connect
4158 : Waiting for WiFi connect
4259 : Waiting for WiFi connect
4360 : Waiting for WiFi connect
4462 : Waiting for WiFi connect
4564 : Waiting for WiFi connect
4665 : Waiting for WiFi connect
4766 : Waiting for WiFi connect
4867 : Waiting for WiFi connect
4968 : Waiting for WiFi connect
5069 : Waiting for WiFi connect
5170 : Waiting for WiFi connect
5272 : Waiting for WiFi connect
5373 : Waiting for WiFi connect
5474 : Waiting for WiFi connect
5574 : Waiting for WiFi connect
5675 : Waiting for WiFi connect
5776 : Waiting for WiFi connect
scandone

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U55 : 

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)

I also have this issue, also with heap decreasing.
It is "solved" after a local timeout and

WiFi.mode(WIFI_OFF);
delay(1000); // mandatory... (madness, maybe less)
// retry

@TD-er can you enable core debug messages with PIO ?
(edit: I mean other debug options than core only)

I also have this issue, also with heap decreasing.
It is "solved" after a local timeout and

WiFi.mode(WIFI_OFF);
delay(1000); // mandatory... (madness, maybe less)
// retry

Is this delay needed, or is it enough to make sure I don't set WiFi.mode again within 1000 msec after switching it off?

And I don't know if I can enable core debug messages.
Is Serial.setDebugOutput(true); not enough?

Edit:
Had not yet noticed the decrease in heap, but you're right.

After adding a delay(1000) after each call to WiFi.mode(WIFI_OFF), it is indeed connecting again.

Is this delay needed?
After adding a delay(1000) after each call to WiFi.mode(WIFI_OFF), it is indeed connecting again.
Had not yet noticed the decrease in heap, but you're right.

... currently trying to understand this dark side of the core.

is Serial.setDebugOutput(true); not enough?

No, choose some among

 -DDEBUG_ESP_SSL -DDEBUG_ESP_TLS_MEM -DDEBUG_ESP_HTTP_CLIENT -DDEBUG_ESP_HTTP_SERVER -DDEBUG_ESP_CORE -DDEBUG_ESP_WIFI -DDEBUG_ESP_HTTP_UPDATE -DDEBUG_ESP_UPDATER -DDEBUG_ESP_OTA -DDEBUG_ESP_OOM

maybe at least

-DDEBUG_ESP_CORE -DDEBUG_ESP_OOM

I am now building with -DDEBUG_ESP_CORE -DDEBUG_ESP_OOM

By the way, if I add a delay(1000) to every switch of the WiFi.mode, then I get the same as before. It can't connect very well and takes 10+ attempts to connect to WiFi.
So I'm now using a delay of 1000 after switching to WIFI_OFF and delay(30) otherwise.

Still nothing with these debug flags set and processing a disconnect:

209155 : Waiting for WiFi connect
209258 : Waiting for WiFi connect
scandone

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U1061 : 

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
1062 : INIT : Free RAM:32672

Output from a successful boot + connect:

6942 : Waiting for WiFi connect
scandone
state: 0 -> 2 (b0)
7838 : Waiting for WiFi connect
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 8
cnt

connected with MikroTik, channel 8
dhcp client start...
ip:192.168.1.75,mask:255.255.255.0,gw:192.168.1.1
7964 : WIFI : Entering processConnect()
8068 : WIFI : Connected! AP: MikroTik (B8:69:F4:9F:21:FA) Ch: 8 Duration: 3786 ms
8070 : WIFI : Entering processGotIP()
8076 : :urn 48
:urd 48, 48, 0
WIFI : DHCP IP: 192.168.1.75 (breadboard-1) GW: 192.168.1.1 SN: 255.255.255.0   duration: 30 ms
8123 : :ur 1
NTP  : NTP replied: delay 22 mSec Accuracy increased by 0.900 seconds

With the debug mode enabled, I do get some stack traces with indication of a failed alloc call.

While processing a WiFi disconnect:

SYS  : 3360.00
110709 :  Domoticz: Sensortype: 1 idx: 216 values: 3360
:wr 65 0
:wrc 65 65 0
:ack 65
:rn 258
:wr 64 0
:wrc 64 64 0
:ack 64
:rch 258, 356
:c 1, 258, 614
:rch 356, 266
:wr 64 0
:wrc 64 64 0
:ack 64
:rch 622, 266
:c 1, 356, 888
:wr 66 0
:wrc 66 66 0
:ack 66
:c 1, 266, 532
:rch 266, 272
:rch 538, 536
:rch 1074, 41
:wr 63 0
:wrc 63 63 0
:ack 63
:rch 1115, 229
:c 1, 266, 1344
:rch 1078, 266
:urn 80
:urd 80, 80, 0
:c 1, 272, 1344
:rch 1072, 247
:rch 1319, 332
:urn 80
:urd 80, 80, 0
:c 1, 536, 1651
:c 1, 41, 1115
:c 1, 229, 1074
:c 1, 266, 845
:c 1, 247, 579
:urn 80
:urd 80, 80, 0
:c0 1, 332
SPIFFS_close: fd=1
SPIFFS_close: fd=1
113602 : SYS  : 2.00
113615 :  Domoticz: Sensortype: 1 idx: 209 values: 2.00
SPIFFS_close: fd=1
SPIFFS_close: fd=1
113756 : DS   : Temperature: 24.87 (28-ff-69-de-30-17-4-83)
113768 :  Domoticz: Sensortype: 1 idx: 219 values: 24.87
:rn 268
:c0 1, 268
:rn 260
:rch 260, 536
:rch 796, 92
:rch 888, 284
:c 1, 260, 1172
SPIFFS_close: fd=1
SPIFFS_close: fd=1
115121 : SPIFFS_close: fd=1
SPIFFS_close: fd=1
SYS  : 4048.00
115153 : :oom(512)@?
:rch 912, 276
:rch 1188, 278
:oom(348)@?
:oom(312)@:0
state: 5 -> 2 (1c0)
rm 0
pm close 7
:close
state: 2 -> 0 (0)

Abort called

>>>stack>>>

ctx: cont
sp: 3fff1e70 end: 3fff23b0 offset: 01b0
3fff2020:  3fffbf6c 0000001f 3fff01d4 4022ca52
3fff2030:  3fff3108 00000080 3fffbf8b 4026c36e
3fff2040:  00000000 00000000 3fff2150 3fffa574
3fff2050:  3fffbf6c 00000002 3fff2150 40233cc3
3fff2060:  00000000 00000000 ffff20b0 4026c8f0  
3fff2070:  fefefeff 80808080 3fff20b0 3fffa574  
3fff2080:  3fff2150 3fff20b0 3fff2150 3fffa574
3fff2090:  00000007 3fff2250 3fff2150 40233dbf
3fff20a0:  00000007 3fff2250 3fff2150 40243f68
3fff20b0:  3fffbf6c 0031003f ffffaa00 3fff20f4
3fff20c0:  402b8600 00000020 3fff2104 4022b606
3fff20d0:  00000000 000000c8 3fff20f0 3fffa574
3fff20e0:  3fff2250 3fff2150 3fff2250 40244481  
3fff20f0:  00000000 3fffaa54 3fffaa64 3fffaaec  
3fff2100:  3fffab1c 3fffab0c 3fffaaec 402712a0
3fff2110:  3fff0114 3fff2210 3fff812c 3fff03d6
3fff2120:  3ffe9087 00000000 3fff2210 4026c36e
3fff2130:  3ffefe6c 3fff2210 3fff2210 4026c3ce
3fff2140:  3fff0308 3fff2210 3fff2210 4026c410
3fff2150:  38343034 0030302e 00ff2100 4026c268  
3fff2160:  00000001 00000008 00000008 4023500d  
3fff2170:  3fff1260 00000000 3ffefb00 3fffa574
3fff2180:  3fff80f4 00001800 3fff21b0 4026f324
3fff2190:  00000000 00000000 3ffefbe0 402a5575
3fff21a0:  4026a678 3fff21f0 3ffefbe0 3fff03d6
3fff21b0:  3fff0114 3fff22d8 402c2090 06dcb75b
3fff21c0:  3fff2250 00000001 0000002b 402297dd
3fff21d0:  ff000000 3fff22d8 402c2090 4026c268  
3fff21e0:  3ffefe6c 3fff22d8 402c2090 00000480
3fff21f0:  00000001 3fff2250 00000000 00000480
3fff2200:  00000000 3fff04e0 3fff2250 4024d56a
3fff2210:  00000000 00000000 ffff2280 4026c410
3fff2220:  06dca56f ffdc0041 402c2990 40216ab6
3fff2230:  00000000 00000000 3fff2250 3fff03d6
3fff2240:  3fff0114 3fff22d8 3fff7c08 4024d73a
3fff2250:  00000000 00000000 00dc0041 00000000  
3fff2260:  00000000 00ff2260 00000000 00000000
3fff2270:  00ff2270 00000000 00000000 00ff2278
3fff2280:  00000000 00000000 00ff2288 00000000
3fff2290:  0000008d 00000000 00000000 00000000
3fff22a0:  00000000 00000000 01000800 00012000
3fff22b0:  45d08000 00000000 00000000 00000000
3fff22c0:  00000009 4bc6a7f0 94fdf3b6 00000000
3fff22d0:  06dbfc3a 000004bd 000004bd 4027129c
3fff22e0:  3fff000c 3fff002c 3fff51b4 00000008  
3fff22f0:  06dca55a 3fff08c8 00000008 402295f9
3fff2300:  0001d4c9 3ffefff0 3ffe86f8 00000008  
3fff2310:  3ffefd6c 06dc0041 00000008 4024d803
3fff2320:  3fffdad0 06dbf999 00000000 4026eacb
3fff2330:  3fffdad0 06dbf999 00000003 40268625
3fff2340:  00000027 0001c141 0000a09f 4022927e
3fff2350:  3fff625c 40203240 3ffe85cc 3fff23e8
3fff2360:  3fffdad0 3ffef2de 3ffe85cc 40268889
3fff2370:  00000000 00000000 ffefeffe feefeffe
3fff2380:  00000000 00000000 00000001 4026db55
3fff2390:  3fffdad0 00000000 3fff23b8 4026dbe4
3fff23a0:  feefeffe feefeffe 3ffe86f8 401017c9  
<<<stack<<<

last failed alloc call: 4026F1DC(324)

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U1061 : 

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
1062 : INIT : Free RAM:32672
1063 : INIT : Warm boot #197 Last Task: Task Device timer, id: 8 - Restart SPIFFSImpl: allocating 512+240+1400=2152 bytes
SPIFFSImpl: mounting fs @300000, size=fb000, block=2000, page=100
SPIFFSImpl: mount rc=0
Reason: Software Watchdog

And a sudden crash:

83582 :  Domoticz: Sensortype: 1 idx: 209 values: 1.00
:rch 372, 306
SPIFFS_close: fd=1
SPIFFS_close: fd=1
83771 : DS   : Temperature: 24.87 (28-ff-69-de-30-17-4-83)
83787 :  Domoticz: Sensortype: 1 idx: 219 values: 24.87
:c 1, 372, 678
:c0 1, 306
:urn 80
:urd 80, 80, 0
SPIFFS_close: fd=1
SPIFFS_close: fd=1
85072 : :oom(748)@?
:rn 268
:oom(608)@?
:oom(608)@?
:oom(608)@?
:oom(608)@?
:oom(608)@?
:urn 80
:oom(608)@?

Abort called

>>>stack>>>

ctx: cont
sp: 3fff1f30 end: 3fff23b0 offset: 01b0
3fff20e0:  402b2460 3fff01d1 3fff2250 40244336
3fff20f0:  00000000 00000000 3ffefb80 402a5575
3fff2100:  3fff26f8 000011d8 000011d8 4027129c
3fff2110:  3fff0114 3fff2210 3fffba8c 3fff03d6
3fff2120:  3ffe9087 00000000 3fff2210 4026c36e
3fff2130:  3ffefe6c 3fff2210 3fff2210 4026c3ce
3fff2140:  3fff0308 3fff2210 3fff2210 4026c410
3fff2150:  3ffe9087 00000000 ffff2190 4026c268
3fff2160:  00000001 00000008 00000008 4023500d  
3fff2170:  3fff1260 00000000 3ffefb00 4029cb6c  
3fff2180:  00000001 00001800 3fff21b0 4026f324
3fff2190:  00000000 00000000 3ffefb68 402a5575
3fff21a0:  4026a678 3fff21f0 3ffefb68 3fff03d6
3fff21b0:  3fff0114 3fff22d8 402c2090 05123f62
3fff21c0:  3fff2250 00000001 0000002b 402297dd
3fff21d0:  ff000000 3fff22d8 402c2090 4026c268  
3fff21e0:  3ffefe6c 3fff22d8 402c2090 00000480
3fff21f0:  00000001 3fff2250 00000000 00000480  
3fff2200:  00000000 3fff04e0 3fff2250 4024d56a
3fff2210:  00000000 00000000 ffff2280 4026c410
3fff2220:  05122cdc ff1169cc 402c2990 40216ab6
3fff2230:  00000000 00000000 3fff2250 3fff03d6
3fff2240:  3fff0114 3fff22d8 3fff7c08 4024d73a
3fff2250:  00000000 00000000 001169cc 00000000  
3fff2260:  00000000 00ff2260 00000000 00000000
3fff2270:  00ff2270 00000000 00000000 00ff2278
3fff2280:  00000000 00000000 00ff2288 00000000
3fff2290:  0000008d 00000000 00000000 00000000
3fff22a0:  00000000 00000000 01000800 00012000
3fff22b0:  45d24000 00000000 00000000 00000000
3fff22c0:  00000009 4bc6a7f0 9645a1ca 00000000
3fff22d0:  051165be 000004bd 000004bd 4027129c  
3fff22e0:  3fff000c 3fff002c 3fff51b4 00000008
3fff22f0:  05122cbf 3fff08c8 00000008 402295f9
3fff2300:  00015f7c 3ffefff0 3ffe86f8 00000008
3fff2310:  3ffefd6c 051169cc 00000008 4024d803
3fff2320:  3fffdad0 0511632e 00000000 4026eacb
3fff2330:  3fffdad0 0511632e 00000003 40268625
3fff2340:  00000027 00014bf4 00009c42 4022927e
3fff2350:  3fff625c 40203240 3ffe85cc 3fff23e8  
3fff2360:  3fffdad0 3ffef2de 3ffe85cc 40268889
3fff2370:  00000000 00000000 ffefeffe feefeffe
3fff2380:  00000000 00000000 00000001 4026db55
3fff2390:  3fffdad0 00000000 3fff23b8 4026dbe4
3fff23a0:  feefeffe feefeffe 3ffe86f8 401017c9
<<<stack<<<

last failed alloc call: 4026F139(620)

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v8b899c12
~ld
�U1060 : 

INIT : Booting version:  (ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support)
1061 : INIT : Free RAM:32672
1063 : INIT : Warm boot #198 Last Task: Task Device timer, id: 8 - Restart SPIFFSImpl: allocating 512+240+1400=2152 bytes
SPIFFSImpl: mounting fs @300000, size=fb000, block=2000, page=100
SPIFFSImpl: mount rc=0
Reason: Software Watchdog

@TD-er, without decoding, the stack trace itself isn't much use.

That said, the last lines of them are, and they indicate you're out of memory.

last failed alloc call: 4026F1DC(324)

last failed alloc call: 4026F139(620)

Since you're running with exceptions disabled, the "Abort called" message most likely means that a call to new somewhere in your code (of the size indicated) failed.

I know, and I will look into the stack trace decoder you mentioned yesterday.
Although, I have no clue how to use that one with PlatformIO and I have not built this using the ArduinoIDE.

Oh and the out-of-memory may now trigger a stacktrace, but without the OOM/CORE debug flags set it was not giving any.

Sorry, my only experience w/PIO is through the ATOM IDE for building Marlin.

However, you can run GDB manually and get the same info:
From a command line, run GDB: xtensa-*-gdb
At the GDB prompt, run "file /full.path.to/sketch.elf"
Then, just "l *0x40....." (look for stack values and OOM addresses and the reported PC) and GDB will give you the line of the error.

That's what the ESPExceptionDecoder is doing, anyway. A CLI utility might be handy if there really is no way to debug stack traces with PIO.

As for OOM, a failed malloc/new does not by itself crash the machine. With debugging enabled, the address of the caller is logged, so that if you later use the returned pointer (0) without checking it, the resulting crash will be easier to debug.

Example GDB session for one of the failing mallocs:
````
xtensa-*gdb (will need full path to the tools dir where xtensa-gcc and xtensa-gdb are stored)
file "/tmp/arduino_build/sketch.ino.elf"
l *0x4026F139
````

And, of course, only the xtensa-*gdb distributed with the Arduino code can be used because your native gdb on Linux or Mac will only understand x86 instructions.

The failure to allocate 620 bytes was when running this macro in my code:

typedef std::shared_ptr<ControllerSettingsStruct> ControllerSettingsStruct_ptr_type;
#define MakeControllerSettings(T) ControllerSettingsStruct_ptr_type ControllerSettingsStruct_ptr(new ControllerSettingsStruct());\
                                    ControllerSettingsStruct& T = *ControllerSettingsStruct_ptr;

This is called from:

        MakeControllerSettings(ControllerSettings);
        LoadControllerSettings(event->ControllerIndex, ControllerSettings);

That's something that may be happening now, with the core debug strings active.
At reboot it does then show a "Software Watchdog" reboot.
Good to know these can be an issue, but not really the problem here I guess.
I will at least add some check in this macro to see if the pointer is valid.
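
A minimal sketch of such a check, assuming standard C++ behaviour for new (std::nothrow), which reports a failed allocation as a null pointer. The struct layout below is a made-up stand-in for the real ControllerSettingsStruct, and makeControllerSettings() and loadAndUse() are hypothetical helper names, not ESPEasy code:

```cpp
#include <memory>
#include <new>

// Stand-in for the real struct; the fields here are invented for illustration.
struct ControllerSettingsStruct {
    char host[64];
    unsigned port;
};

typedef std::shared_ptr<ControllerSettingsStruct> ControllerSettingsStruct_ptr_type;

// Hypothetical helper: returns an empty shared_ptr when the allocation fails.
inline ControllerSettingsStruct_ptr_type makeControllerSettings() {
    // In standard C++, the nothrow form returns nullptr on failure instead of throwing.
    ControllerSettingsStruct* raw = new (std::nothrow) ControllerSettingsStruct();
    if (raw == nullptr) {
        return ControllerSettingsStruct_ptr_type();
    }
    return ControllerSettingsStruct_ptr_type(raw);
}

// Hypothetical caller: bail out early instead of dereferencing a null pointer.
inline bool loadAndUse() {
    ControllerSettingsStruct_ptr_type settings = makeControllerSettings();
    if (!settings) {
        return false;  // out of memory: skip this cycle and try again later
    }
    settings->port = 8080;  // placeholder for the real load/use step
    return true;
}
```

The caller decides what a failed allocation means (here: skip the cycle), which avoids the later crash on a null pointer that the OOM debug output is meant to help track down.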

I really don't get it.
Yesterday I was sure I had it working: in every build I made, the node could connect to WiFi on the first attempt.
Now I am using almost the same code, including 1000 msec wait after WIFI_OFF, but now with the debug stuff removed.
And it is now failing to connect even the first time. (waiting for wifi connect => WDT reboot)

This is really frustrating.

The only things changed are:

  • Remove core & OOM debug
  • change a bit of totally unrelated code (move the scope of a variable unrelated to WiFi code)

What if you enable core & OOM debug back again ? (Heisenberg effect)

I was curious whether the OOM debug would reveal anything in my setup. You may be chasing a different bug: I ran my node for one day and no OOM messages appeared. There were WDT resets as usual, though.

The OOM was happening on my system with the core debug enabled.
And that's indeed another issue, totally unrelated to what we're discussing here.
But I had to mention it, since OOM stuff may clutter the log reports.

@d-a-v I did enable CORE and OOM debug last night and was still not able to connect, so I was tracking down some of the other changes to get something which makes it reproducible. I stopped at 2am for obvious reasons :)
So the only changes now present are some that just change the scope of variables in loops totally unrelated to WiFi code and the removal of a String allocation which appeared not to be used. (I was running CPPcheck on my code and followed its suggestions)
One of them may actually save some time when not yet performed, so indirectly may have an effect on WiFi connectivity. (reading settings from SPIFFS)

Just to be sure I am not missing anything else, I do clean builds for every attempt, so it may take a while to track all, even when doing a binary search on the files checked out.

The workarhack I was using for the failing connection was to call WiFi.mode(WIFI_OFF) after a timeout, then delay(1000) and retry the connection. That is only until we find a way to detect when a connection attempt is in a bad state.

I do have the 1000 msec delay, but what do you mean by "after a timeout" ?

    if (!started)
        WiFi.mode(WIFI_STA)
        WiFi.begin; start=millis; started=true
    if (started and !connected and (millis-start>timeout))
        WiFi.mode(WIFI_OFF)
        delay(1000)
        started=false

(finished editing # 1) (that's not python :)
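
The pseudocode above could be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: Radio, ConnectRetry and FakeRadio are hypothetical names introduced here so the logic can be checked off-target, and the comments show which ESP8266 calls they are assumed to map to.

```cpp
#include <stdint.h>

// Hypothetical minimal radio interface so the retry logic can run off-target.
// On an ESP8266 the methods would roughly map to:
//   startStation() -> WiFi.mode(WIFI_STA); WiFi.begin(...);
//   connected()    -> WiFi.status() == WL_CONNECTED
//   off()          -> WiFi.mode(WIFI_OFF)
//   wait(ms)       -> delay(ms)
struct Radio {
    virtual ~Radio() = default;
    virtual void startStation() = 0;
    virtual bool connected() = 0;
    virtual void off() = 0;
    virtual void wait(uint32_t ms) = 0;
};

class ConnectRetry {
public:
    ConnectRetry(Radio& radio, uint32_t timeoutMs, uint32_t offDelayMs = 1000)
        : radio_(radio), timeoutMs_(timeoutMs), offDelayMs_(offDelayMs) {}

    // Call regularly with the current millis() value.
    // Returns true once the radio reports a connection.
    bool poll(uint32_t nowMs) {
        if (!started_) {
            radio_.startStation();
            startMs_ = nowMs;
            started_ = true;
            ++attempts_;
        }
        if (radio_.connected()) {
            return true;
        }
        // Unsigned subtraction keeps the comparison valid across millis() wrap-around.
        if (nowMs - startMs_ > timeoutMs_) {
            radio_.off();
            radio_.wait(offDelayMs_);  // the pause after switching the radio off
            started_ = false;          // the next poll() starts a fresh attempt
        }
        return false;
    }

    unsigned attempts() const { return attempts_; }

private:
    Radio& radio_;
    uint32_t timeoutMs_;
    uint32_t offDelayMs_;
    uint32_t startMs_ = 0;
    unsigned attempts_ = 0;
    bool started_ = false;
};

// Fake radio for desk-checking: reports "connected" from the Nth start onwards.
struct FakeRadio : Radio {
    explicit FakeRadio(unsigned n) : connectOnStart(n) {}
    void startStation() override { ++starts; }
    bool connected() override { return starts >= connectOnStart; }
    void off() override { ++offCalls; }
    void wait(uint32_t ms) override { waitedMs += ms; }
    unsigned connectOnStart;
    unsigned starts = 0;
    unsigned offCalls = 0;
    uint32_t waitedMs = 0;
};
```

In a real sketch, poll(millis()) would be called from loop() with a Radio implementation wrapping the WiFi object. Whether the 1000 ms pause is actually required is exactly what the thread leaves open.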

I removed all code related to WIFI_OFF and then it was capable of connecting with a very small custom build (only including a few plugins in ESPeasy), but the same code running a "normal build" (just more plugins) cannot connect to WiFi anymore. (The least number of reboots until it finally succeeded was 55.)
So I think things are now way too time critical to be useful.

Tomorrow I will strip all fancy WiFi related code and just use something related to WiFiMulti class (and will make a pull request to allow working with hidden SSIDs and allow to do wifi off between reconnects)
I am not sure what's going on here, but this just isn't usable anymore with it being Russian Roulette between builds whether wifi will connect.

I also mentioned it here: https://github.com/platformio/platform-espressif8266/issues/166#issuecomment-513547150
But I guess this may be the more appropriate place to ask...

Just to be sure, since it does often result in WiFi connect issues.

  • Variables used in (wifi) event callback functions, do they need to be declared volatile? (tested with it and does not seem to make any difference)
  • Callback functions for WiFi events, do they need IRAM attributes? (not tested yet) And do all functions called from IRAM attr marked function also need to be marked as such?
  • Are there functions that should not be called from callback functions? (e.g. millis())

Variables used in (wifi) event callback functions, do they need to be declared volatile

Should not be needed

Callback functions for WiFi events, do they need IRAM attributes

No, they execute in SYS

do all functions called from IRAM attr marked function also need to be marked as such

Yes, the entire call tree needs to be in iram, which is why ISRs should be kept simple and isolated

Are there functions that should not be called from callback functions

Depends on the callback. Functions that execute in CONT are the most relaxed. Functions that execute in SYS, such as Ticker and the wifi events, can't call delay, yield, blocking functions, etc. and are subject to stricter timing requirements vs. CONT.
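
One common way to respect those limits is to let the SYS-side callback only record what happened and do the slow work later from loop() (CONT). The sketch below is an assumption-laden illustration, not code from the core: PendingWifiEvents, onDisconnected() and processPending() are hypothetical names. On the ESP8266 the recording function would be called from a handler registered with, for example, WiFi.onStationModeDisconnected.

```cpp
#include <stdint.h>

// Hypothetical event record filled in by a WiFi event callback (SYS context).
struct PendingWifiEvents {
    bool disconnected = false;
    uint32_t disconnectedAtMs = 0;
};

// Meant to be called from the event callback: no delay(), no yield(),
// no blocking I/O. It only records the event and returns.
inline void onDisconnected(PendingWifiEvents& pending, uint32_t nowMs) {
    pending.disconnected = true;
    pending.disconnectedAtMs = nowMs;
}

// Meant to be called from loop() (CONT context), where blocking work is allowed.
// Returns true when a disconnect was pending and has now been consumed;
// handledAtMs then holds the time the event was recorded.
inline bool processPending(PendingWifiEvents& pending, uint32_t& handledAtMs) {
    if (!pending.disconnected) {
        return false;
    }
    pending.disconnected = false;
    handledAtMs = pending.disconnectedAtMs;
    // Slow work (logging, file access, reconnect with delay()) would go here.
    return true;
}
```

Keeping the callback this small also keeps it within the stricter timing requirements mentioned for SYS.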

Check.
So if millis() can be called from sys, then we're fine.
I was just wondering, but since the examples were also not showing the volatile attributes, I also had not used them.
Since debugging has again taken many hours today, I just want to make sure I don't waste more time on things that are simply not right in the examples.

Here's an attempt to work on a common basis #6356

Just to confirm that the @d-a-v commit of 5Sept19, "Experimental: add new WiFi (pseudo) modes: WIFI_SHUTDOWN & WIFI_RESUME #6356", fixes my issue of intermittent Hardware WDT crashes after WiFi.mode(WIFI_OFF). (I have now been running with the GitHub version of 6Sep19 for nearly 3 weeks without a crash!)

Many thanks!

@Rob58329 Your instability was probably fixed by #6484, but thanks for testing it anyway.

Is this issue still relevant after #6484?
An installable snapshot including #6484 is available on d-a-v.github.io.
Closing, please create a new issue if needed.

Haven't been here for a while, sorry.
I checked out the git version of the library and my device is now running fine for 2 days. It seems to me the issue is fixed.
Thanks.
