Arduino: Exception in WiFiClient when WiFi is disconnected

Created on 3 Jan 2018 · 22 Comments · Source: esp8266/Arduino

Basic Infos

Hardware

Hardware: ESP-WROOM-02
Core Version: 2.4.0

Description

WiFiClient causes exception 28 in connect function.

Settings in IDE

Module: Generic ESP8266 Module
Flash Size: 2M
CPU Frequency: 80MHz
Flash Mode: DIO
Flash Frequency: 40MHz
Upload Using: SERIAL
Reset Method: ck

Sketch


#include <ESP8266WiFi.h>
#include <Esp.h>

const char *ssid      = "MySSID";
const char *password  = "MyPass";

const char *serverIP  = "192.168.1.106";
const int  serverPort = 1883;

WiFiClient client;

void toggleLED()
{
  pinMode(5, OUTPUT);
  digitalWrite(5, HIGH);
  delay(250);
  digitalWrite(5, LOW);
  delay(250);
}

void setup() {
  // put your setup code here, to run once:
    Serial.begin(115200);
    WiFi.setOutputPower(20.5);
    WiFi.mode(WIFI_STA);
    WiFi.begin(ssid, password);
}

void loop() {
  if(WiFi.status() == WL_CONNECTED)
  {
    Serial.println("Connected to WiFi");
    if(!client.connect(serverIP, serverPort))
    {
      Serial.println("Can't connect client.");
    }
    else
    {
      Serial.println("Client connected."); 
    }
  }
  else
  {
     Serial.println("Not connected to WiFi");
  }

  toggleLED();
}

Debug Messages

Exception (28):
epc1=0x40202af1 epc2=0x00000000 epc3=0x00000000 excvaddr=0x000000bb depc=0x00000000

ctx: cont 
sp: 3ffefcd0 end: 3ffeff00 offset: 01a0
>>>stack>>>
3ffefe70:  3ffeec58 00000000 3fff112c 40202aea 
3ffefe80:  6a01a8c0 00000010 3ffe8880 3ffeeecc  
3ffefe90:  3fffdad0 00000011 3ffeeea0 3ffeeecc  
3ffefea0:  3fffdad0 0000075b 3ffeec58 402026a8 
3ffefeb0:  3ffe8bf0 6a01a8c0 3ffe8bf0 6a01a8c0  
3ffefec0:  3ffe88c8 40202460 3ffeeea0 40203198 
3ffefed0:  00000000 00000000 3ffeeea0 40202132
3ffefee0:  3fffdad0 00000000 3ffeeec4 40203404
3ffefef0:  feefeffe feefeffe 3ffeeee0 40100710 
<<<stack<<<<
 ets Jan  8 2013,rst cause:1, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d

I'm trying to connect to an MQTT server. It connects to the server and works as expected, but when I turn off my router and the connection is lost, it crashes with exception 28. Am I using something wrong, or is this a bug?
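One detail of the stack dump above can be checked by hand: the word 6a01a8c0 appears several times, and read byte by byte from the low end it matches the serverIP value in the sketch. The following is a minimal sketch of that decoding in plain C++; the helper names `wordToDottedQuad` and `checkAgainstDump` are hypothetical, and it assumes the address is stored in network byte order and printed as a little-endian 32-bit word.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: interpret a 32-bit stack word as an IPv4 address whose
// first octet sits in the lowest byte (network order read on a little-endian CPU).
std::string wordToDottedQuad(uint32_t word) {
    std::string out;
    for (int i = 0; i < 4; ++i) {
        if (i > 0) out += '.';
        out += std::to_string((word >> (8 * i)) & 0xFFu);
    }
    return out;
}

// Self-check against the word from the stack dump and the sketch's serverIP.
void checkAgainstDump() {
    assert(wordToDottedQuad(0x6a01a8c0u) == "192.168.1.106");
}
```

This suggests the frames near the top of the dump belong to the connect call that was handed the server address, which agrees with the report that the exception is raised in the connect function.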

network bug

All 22 comments

I can confirm this. Doesn't happen with LwIP v1.4 selected.

Log output with 1.4 when WiFi disconnects:

:ref 1
Client connected.
state: 5 -> 0 (0)
rm 0
:er -8 0x00000000
del if0
usl
mode : null
wifi evt: 1
STA disconnect: 8
:ur 1
:del

With 2.0, this results either in an exception or a hardware WDT (still wondering how that happens):

:ref 1
Client connected.
state: 5 -> 0 (0)
rm 0
del if0
usl
mode : null
wifi evt: 1
STA disconnect: 8
:ur 1
:close

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset

Seems that if TCP disconnect is triggered after WiFi goes down, this results in an invalid memory access somewhere.

With LwIP 1.4, the TCP PCB receives an error callback (the log line with :er -8) and closes the connection, whereas with LwIP 2 this error callback is not sent. When WiFiClient.connect is called again, the old connection is closed (tcp_close), but the network interface is already dead. Why this triggers a hardware WDT is something I don't entirely understand, though.
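The difference described here can be modelled in a few lines of plain C++. This is only a toy model under stated assumptions, not lwIP or core code: `FakePcb`, `ToyClient` and `closeTouchesDeadPcb` are hypothetical names, and the "dead" flag stands in for memory the stack has already released.

```cpp
#include <cassert>

// Hypothetical stand-in for a TCP control block owned by the network stack.
struct FakePcb {
    bool alive = true;   // false once the stack has torn the connection down
};

// Toy client that keeps a raw, non-owning pointer to a PCB.
class ToyClient {
public:
    explicit ToyClient(FakePcb* pcb) : _pcb(pcb) {}

    // Error callback: the stack has already dropped the PCB, so forget it.
    void onError() { _pcb = nullptr; }

    // Close the old connection before reconnecting. Records whether the
    // pointer still referred to a PCB the stack had already torn down.
    void close() {
        if (_pcb == nullptr) return;          // already cleaned up, nothing to do
        if (!_pcb->alive) _touchedDeadPcb = true;
        _pcb->alive = false;
        _pcb = nullptr;
    }

    bool touchedDeadPcb() const { return _touchedDeadPcb; }

private:
    FakePcb* _pcb;
    bool _touchedDeadPcb = false;
};

// Simulates WiFi loss followed by a reconnect that closes the old connection.
// Returns true if close() operated on a PCB that was already dead.
bool closeTouchesDeadPcb(bool deliverErrorCallback) {
    FakePcb pcb;
    ToyClient client(&pcb);
    pcb.alive = false;                        // interface goes down
    if (deliverErrorCallback) client.onError();
    client.close();                           // what a second connect() would do first
    assert(!pcb.alive);
    return client.touchedDeadPcb();
}
```

In this model, `closeTouchesDeadPcb(true)` corresponds to the 1.4 behaviour, where the callback clears the pointer first, and `closeTouchesDeadPcb(false)` to the 2.0 behaviour, where the later close still sees the stale pointer.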

The "del if0" is treated differently in lwip2.
I'll check asap.


Thanks for reproducing!

That's the same behavior I described here a few days ago.

Thank you for your explanation! Where can I switch to LwIP v1.4, if that is possible without changing the core version?

@pakokol In the Arduino IDE you can change it in the Tools menu.
[Screenshot: Arduino IDE Tools menu]

@jp112sdl Thank you!

about lwip2:
I think I have fixed the WDT problem, but I still need to understand what determines the result of wifi_station_get_connect_status(). It does not go back to STATION_GOT_IP even though DHCP is re-triggered. I will dig deeper.

@pakokol Please retest with the referenced PR.

@devyte I have cloned and tested your commit, but it still crashes after a few connection drops.

Connected to WiFi 
ap_probe_send over, rest wifi status to disassoc 
state: 5 -> 0 (1) 
rm 0 
pm close 7 
ip:0.0.0.0,mask:255.255.255.0,gw:192.168.1.1 
Fatal exception 28(LoadProhibitedCause): 
epc1=0x40202af9, epc2=0x00000000, epc3=0x00000000, excvaddr=0x000002cb, depc=0x00000000 

Exception (28): 
epc1=0x40202af9 epc2=0x00000000 epc3=0x00000000 excvaddr=0x000002cb depc=0x00000000 

ctx: cont  
sp: 3ffefcd0 end: 3ffeff00 offset: 01a0

>>>stack>>> 
3ffefe70:  3ffeec58 00000000 3fff1044 40202af2   
3ffefe80:  6a01a8c0 00000010 3ffe8880 3ffeeecc   
3ffefe90:  3fffdad0 00000011 3ffeeea0 3ffeeecc   
3ffefea0:  3fffdad0 0000075b 3ffeec58 402026b0   
3ffefeb0:  3ffe8bf0 6a01a8c0 3ffe8bf0 6a01a8c0   
3ffefec0:  3ffe88c8 40202468 3ffeeea0 402031dc   
3ffefed0:  00000000 00000000 3ffeeea0 4020213a   
3ffefee0:  3fffdad0 00000000 3ffeeec4 40203448   
3ffefef0:  feefeffe feefeffe 3ffeeee0 40100710   
<<<stack<<< 

 ets Jan  8 2013,rst cause:1, boot mode:(3,7) 

load 0x4010f000, len 1384, room 16  
tail 8 
chksum 0x2d 
csum 0x2d 
v00000000
~ld
scandone
Not connected to WiFi


@pakokol
Can you retry with a clean installation from git (the referenced PR is merged)
and run the stack decoder on your stack dump?

@d-a-v I have retested it with a clean installation from git and also removed my printouts from the above code. The exception occurred after the second connection drop. The result from stack decoder is:

0x40202ab6: WiFiClient::connect(IPAddress, unsigned short) at C:\Users\patrik.kokol\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.0\libraries\ESP8266WiFi\src/WiFiClient.cpp line 329
0x40202674: WiFiClient::connect(char const*, unsigned short) at C:\Users\patrik.kokol\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.0\libraries\ESP8266WiFi\src/WiFiClient.cpp line 329
0x4020242c: ESP8266WiFiSTAClass::status() at C:\Users\patrik.kokol\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.0\libraries\ESP8266WiFi\src/ESP8266WiFiSTA.cpp line 483
0x40202085: toggleLED() at C:\Users\patrik.kokol\Documents\Arduino\sketch_jan05a/sketch_jan05a.ino line 19
0x4020211d: loop at C:\Users\patrik.kokol\Documents\Arduino\sketch_jan05a/sketch_jan05a.ino line 48
0x402033c8: loop_wrapper at C:\Users\patrik.kokol\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.0\cores\esp8266/core_esp8266_main.cpp line 57
0x40100710: cont_norm at C:\Users\patrik.kokol\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.0\cores\esp8266/cont.S line 109

And the printouts:

scandone 
state: 0 -> 2 (b0) 
state: 2 -> 3 (0) 
state: 3 -> 5 (10) 
add 0 
aid 1 
cnt  
pm open,type:2 0 
state: 5 -> 0 (2) 
rm 0 
pm close 7 
reconnect 
scandone 
state: 0 -> 2 (b0) 
state: 2 -> 3 (0) 
state: 3 -> 5 (10) 
add 0 
aid 1 
cnt  

connected with QuickEagle, channel 1 
dhcp client start... 
ip:192.168.1.107,mask:255.255.255.0,gw:192.168.1.1 
pm open,type:2 0 
bcn_timout,ap_probe_send_start 
ap_probe_send over, rest wifi status to disassoc 
state: 5 -> 0 (1) 
rm 0 
pm close 7 
ip:0.0.0.0,mask:255.255.255.0,gw:192.168.1.1 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
no QuickEagle found, reconnect after 1s 
reconnect 
scandone 
state: 0 -> 2 (b0) 
state: 2 -> 3 (0) 
state: 3 -> 5 (10) 
add 0 
aid 1 
cnt  
pm open,type:2 0 
state: 5 -> 0 (2) 
rm 0 
pm close 7 
reconnect 
scandone 
state: 0 -> 2 (b0) 
state: 2 -> 3 (0) 
state: 3 -> 5 (10) 
add 0 
aid 1 
cnt  

connected with QuickEagle, channel 1 
dhcp client start... 
ip:192.168.1.107,mask:255.255.255.0,gw:192.168.1.1 
pm open,type:2 0 
bcn_timout,ap_probe_send_start 
ap_probe_send over, rest wifi status to disassoc 
state: 5 -> 0 (1) 
rm 0 
pm close 7 
ip:0.0.0.0,mask:255.255.255.0,gw:192.168.1.1 
Fatal exception 28(LoadProhibitedCause): 
epc1=0x40202abd, epc2=0x00000000, epc3=0x00000000, excvaddr=0x000001c1, depc=0x00000000 

Exception (28):  
epc1=0x40202abd epc2=0x00000000 epc3=0x00000000 excvaddr=0x000001c1 depc=0x00000000  

ctx: cont  
sp: 3ffefc60 end: 3ffefe90 offset: 01a0  

>>>stack>>> 
3ffefe00:  3ffeebe8 00000000 3fff0fa4 40202ab6    
3ffefe10:  6a01a8c0 02f64e96 3ffeec50 00000000    
3ffefe20:  3ffee610 3ffeec50 3ffeee70 3ffeee5c    
3ffefe30:  3fffdad0 0000075b 3ffeebe8 40202674    
3ffefe40:  3ffe8b98 6a01a8c0 3ffe8b98 6a01a8c0    
3ffefe50:  3ffe8870 4020242c 3ffeee54 40202085    
3ffefe60:  00000000 00000000 3ffeee54 4020211d    
3ffefe70:  3fffdad0 00000000 3ffeee54 402033c8    
3ffefe80:  feefeffe feefeffe 3ffeee70 40100710    
<<<stack<<< 

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)  

load 0x4010f000, len 1384, room 16   
tail 8  
chksum 0x2d  
csum 0x2d  
v00000000  
~ld 

I hope I did everything the right way, because I'm a little confused about what you meant by the referenced PR being merged. Isn't the PR just printouts on the serial port, or did I miss something?

A PR is a "Pull Request", meaning a pending source code update that becomes part of the core once it is "merged". See the git and GitHub documentation for more information.

Could you please add client.println("hello"); client.stop(); right after your "Client connected" and retest?
It works for me (10 times AP off and on).
Double-check that you are using the latest master branch of the core from git (not the latest release, 2.4.0, which of course will not work; that is why you created this issue), with

git pull origin master

Thanks for the explanation, I will look up in the docs so that I won't be surprised next time. I added the client.println("hello"); client.stop(); after the printout client connected and the output is:
```
Connected to WiFi
bcn_timout,ap_probe_send_start
ap_probe_send over, rest wifi status to disassoc
state: 5 -> 0 (1)
rm 0
pm close 7
ip:0.0.0.0,mask:255.255.255.0,gw:192.168.1.1
Fatal exception 28(LoadProhibitedCause):
epc1=0x40202bbd, epc2=0x00000000, epc3=0x00000000, excvaddr=0x002b00e3, depc=0x00000000

Exception (28):
epc1=0x40202bbd epc2=0x00000000 epc3=0x00000000 excvaddr=0x002b00e3 depc=0x00000000

ctx: cont
sp: 3ffefd80 end: 3ffeffb0 offset: 01a0

>>>stack>>>
3ffeff20: 3ffeed08 00000000 3fff0934 40202bb6
3ffeff30: 6a01a8c0 00000010 3ffe8880 3ffeef7c
3ffeff40: 3fffdad0 00000011 3ffeef50 3ffeef7c
3ffeff50: 3fffdad0 0000075b 3ffeed08 402026dc
3ffeff60: 3ffe8ca8 6a01a8c0 3ffe8ca8 6a01a8c0
3ffeff70: 3ffe88d0 4020248c 3ffeef50 40203364
3ffeff80: 00000000 3ffeed08 3ffeef50 40202144
3ffeff90: 3fffdad0 00000000 3ffeef74 402035f8
3ffeffa0: feefeffe feefeffe 3ffeef90 40100710
<<<stack<<<

ets Jan 8 2013,rst cause:1, boot mode:(1,6)

ets Jan 8 2013,rst cause:4, boot mode:(1,6)

wdt reset
```
I also checked with git that I'm on the latest master. Maybe my settings are wrong.
[Screenshot: Arduino IDE 1.8.1 settings for sketch_jan05a, 2018-01-12]

Later today I will also try to change my router.

Hi, sorry for the late response. I have changed my router and the exception is still triggered.

I have been able to reproduce, but it is honestly difficult to do.
I currently use my phone's AP and the latter occasionally reboots more often than the esp fails :]
I will try with another AP.

I could capture interesting logs.
Since isolating the problem, I have seen this behaviour every time, with both lwip v1 and v2.
I instrumented ClientContext as follows:

--- a/libraries/ESP8266WiFi/src/include/ClientContext.h
+++ b/libraries/ESP8266WiFi/src/include/ClientContext.h
@@ -129,13 +129,18 @@ public:
         }
         _connect_pending = 1;
         _op_start_time = millis();
+os_printf(":x1 %p\n", _pcb);
         // This delay will be interrupted by esp_schedule in the connect callback
         delay(_timeout_ms);
+os_printf(":x2 %p\n", _pcb);
         _connect_pending = 0;
         if (state() != ESTABLISHED) {
+os_printf(":x3 %p\n", _pcb);
             abort();
+os_printf(":x4 %p\n", _pcb);
             return 0;
         }
+os_printf(":x5 %p\n", _pcb);
         return 1;
     }

and the log:

Connected to WiFi
:x1 0x3fff1394
state: 5 -> 2 (3c0)
rm 0
:x2 0x017500ad
Fatal exception 28(LoadProhibitedCause):

We can see that ClientContext's members are modified (the this pointer is untouched, but the _pcb address is corrupted) when WiFi is lost during the delay (and possibly when *_connected() callbacks run during that delay).
In my tests, this behaviour occurs very rarely. I will dig further.
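The observation that members change while connect() is waiting fits a general hazard: an object being released by another code path while one of its own member functions is suspended. The sketch below shows one common guard, pinning the object with an extra reference across the wait. It is a plain C++ illustration with hypothetical names (`ToyContext`, `demoSurvivesDisconnect`), and it is not claimed to be the change that was eventually made in the core.

```cpp
#include <cassert>
#include <functional>

// Toy reference-counted context, loosely modelled on a ref()/unref() scheme.
class ToyContext {
public:
    static int liveObjects;              // how many contexts currently exist

    ToyContext() { ++liveObjects; }
    ~ToyContext() { --liveObjects; }

    void ref() { ++_refs; }
    void unref() {
        assert(_refs > 0);
        if (--_refs == 0) delete this;   // last reference gone
    }

    // `duringWait` stands in for whatever runs while the real code sits in
    // delay(), for example a WiFi-disconnect handler that drops the owner's
    // reference. Returns true if someone else still holds the context.
    bool connect(const std::function<void()>& duringWait) {
        ref();                           // pin the object for the wait
        duringWait();
        bool stillShared = _refs > 1;    // read members before unpinning
        unref();                         // may delete this; no member access after
        return stillShared;
    }

private:
    int _refs = 0;
};

int ToyContext::liveObjects = 0;

// The owner drops its reference in the middle of connect(); thanks to the pin,
// the object is destroyed only after connect() has finished with its members.
bool demoSurvivesDisconnect() {
    ToyContext* ctx = new ToyContext();
    ctx->ref();                                         // owner's reference
    bool shared = ctx->connect([ctx] { ctx->unref(); });
    return !shared && ToyContext::liveObjects == 0;
}
```

Without the `ref()` at the top of `connect()`, the callback's `unref()` would delete the object mid-call and the later member reads would hit freed memory, which is the kind of corruption the log shows for _pcb.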

Thanks to an added delay(), I can now more systematically trigger the issue.
AP disconnection must happen during the second delay.

sources:

--- a/libraries/ESP8266WiFi/src/include/ClientContext.h
+++ b/libraries/ESP8266WiFi/src/include/ClientContext.h
@@ -129,13 +129,26 @@ public:
         }
         _connect_pending = 1;
         _op_start_time = millis();
+void* pcbsave = _pcb;
+os_printf(":x1 %p\n", _pcb);
         // This delay will be interrupted by esp_schedule in the connect callback
         delay(_timeout_ms);
+// this delay should not be interrupted if connection occured
+os_printf(":x1b %p rx=%p this=%p\n", _pcb, _rx_buf, this);
+malloc(0);//umm's integrity check
+delay(2000);
+malloc(0);//umm's integrity check
+os_printf(":x2 %p =? %p rx=%p this=%p\n", _pcb, pcbsave, _rx_buf, this);
+assert(_pcb == pcbsave);
+os_printf(":x2b %p\n", _pcb);
         _connect_pending = 0;
         if (state() != ESTABLISHED) {
+os_printf(":x3 %p\n", _pcb);
             abort();
+os_printf(":x4 %p\n", _pcb);
             return 0;
         }
+os_printf(":x5 %p\n", _pcb);
         return 1;
     }

logs:

Connected to WiFi
:x1 0x3fff1604
:x1b 0x3fff1604 rx=0x00000000 this=0x3fff115c
:oom(0)@.../src/include/ClientContext.h:138
:oom(0)@.../src/include/ClientContext.h:140
:x2 0x3fff1604 =? 0x3fff1604 rx=0x3fff0e1c this=0x3fff115c
:x2b 0x3fff1604
:x5 0x3fff1604
Client connected.
Connected to WiFi
:x1 0x3fff16b4
:x1b 0x3fff16b4 rx=0x00000000 this=0x3fff115c
:oom(0)@.../src/include/ClientContext.h:138
:oom(0)@.../src/include/ClientContext.h:140
:x2 0x3fff16b4 =? 0x3fff16b4 rx=0x3fff0e1c this=0x3fff115c
:x2b 0x3fff16b4
:x5 0x3fff16b4
Client connected.
Connected to WiFi
:x1 0x3fff1764
:x1b 0x3fff1764 rx=0x00000000 this=0x3fff115c
:oom(0)@.../src/include/ClientContext.h:138
:oom(0)@.../src/include/ClientContext.h:140
:x2 0x3fff1764 =? 0x3fff1764 rx=0x3fff0e1c this=0x3fff115c
:x2b 0x3fff1764
:x5 0x3fff1764
Client connected.
Connected to WiFi
:x1 0x3fff1814
:x1b 0x3fff1814 rx=0x00000000 this=0x3fff115c
:oom(0)@.../src/include/ClientContext.h:138
<---------------------------------- disconnection here
state: 5 -> 2 (3c0)
rm 0
pm close 7
ip:0.0.0.0,mask:255.255.255.0,gw:192.168.43.1
:oom(0)@.../src/include/ClientContext.h:140
:x2 0x0a64253a =? 0x3fff1814 rx=0xa5a5a500 this=0x3fff115c
:x2b 0x0000008e
Fatal exception 28(LoadProhibitedCause):
epc1=0x40202d92, epc2=0x00000000, epc3=0x00000000, excvaddr=0x0a64254e, depc=0x00000000

Panic .../src/include/ClientContext.h:142 int ClientContext::connect(ip_addr_t*, uint16_t)

L142 is the assert line.
I have enabled umm's integrity check (I will make a PR for that).
umm's poison is also enabled (debug mode), and we can see that the _rx_buf member is overwritten with a5a5, a5 being the poison byte.
At this point I have still not figured out why the _pcb member is modified, but in light of the poison byte, a free() could be occurring at some point (in WiFi callbacks on disconnection, maybe).
It should also be noted that in normal operation there seems to be a memory leak: in the log above, the _pcb address grows by 0xb0 with each successful connection before the failure.
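The poison-byte reasoning can be turned into a small check. In the log, rx=0xa5a5a500 has three of its four bytes equal to a5, whereas the healthy value 0x3fff0e1c has none. The following is a minimal sketch of such a heuristic in plain C++; the function names and the default threshold of three matching bytes are assumptions made for illustration, not part of umm_malloc.

```cpp
#include <cassert>
#include <cstdint>

// Poison value named in the discussion above.
constexpr uint8_t kPoisonByte = 0xA5;

// Count how many of the four bytes of `value` equal the poison byte.
int poisonByteCount(uint32_t value) {
    int count = 0;
    for (int i = 0; i < 4; ++i) {
        if (((value >> (8 * i)) & 0xFFu) == kPoisonByte) ++count;
    }
    return count;
}

// Hypothetical heuristic: flag a pointer-sized value as suspicious when at
// least `threshold` of its bytes match the poison byte.
bool looksPoisoned(uint32_t value, int threshold = 3) {
    assert(threshold >= 1 && threshold <= 4);
    return poisonByteCount(value) >= threshold;
}
```

A check like this only hints that the memory behind a member was handed back to the allocator; it cannot say which code path released it.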

@pakokol Do you think you could try the pull request?

edit: ~You can simply replace your libraries/ESP8266WiFi/src/include/ClientContext.h by this one and recompile.~ try latest git master.

The fix is not good (GitHub should monitor contributors' sleep).
edit:
@pakokol Do you think you could try this fix?
You can simply replace

  • libraries/ESP8266WiFi/src/include/ClientContext.h by this
  • libraries/ESP8266WiFi/src/WiFiClient.cpp by that

and recompile.

@d-a-v Hi, sorry for the late response, I was away for the weekend. I retested with your changes and could not reproduce the exception. Thank you for your support on this issue! :)

The reason is found, this proposed fix is valid but a better one is coming.

What are the build flags for lwip, using platformio?

I am having a problem with this fix. After 5 days of working well, communication with the access point is still OK and I can ping the device; the device connects to the server, but no data is exchanged with it. The sketch stops when calling httpclient. The timeout doesn't work and the hardware watchdog doesn't restart the device. After rebooting either the access point or the device, everything goes back to normal.

I'm connecting to an IIS 8.5 server, and I got error 1232:
ERROR_HOST_UNREACHABLE
The network location cannot be reached.
