Arduino: OTA update fails every other attempt after reset (> r2.4.2)

Created on 6 Apr 2019  路  25Comments  路  Source: esp8266/Arduino

Basic Infos

  • [x] This issue complies with the issue POLICY doc.
  • [x] I have read the documentation at readthedocs and the issue is not addressed there.
  • [x] I have tested that the issue is present in current master branch (aka latest git).
  • [x] I have searched the issue tracker for a similar issue.
  • [x] If there is a stack dump, I have decoded it.
  • [x] I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [2.5.0-b3, 2.5.0, 2.4.2]
  • Development Env: [Arduino IDE, cmd line]
  • Operating System: [MacOS]

Settings in IDE

  • Module: [Nodemcu]
  • Flash Mode: [qio]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Lower Memory]
  • Reset Method: [ck]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [80Mhz]
  • Upload Using: [OTA]
  • Upload Speed: OTA

Problem Description

Hello, Arduino OTA fails every other attempt after rest, after much searching and debugging I believe this is a side effect of issue #3481. Arduino OTA uses UDP to initiate the transfer and espota.py expects a reply to the UDP Flash command packet within a timeout. If the esp8266 does not reply with an UDP "OK" payload the script will timeout with the famous "No Answer".

Upon doing various tests, wireshark proves the above behaviour. The first UDP packet issue does not seem to exist in r2.4.2, where OTA update works all the time. The OTA UDP "OK" reply is always sent after an esp8266 reset.

I am not familiar with esp8266 src code, but i've looked at the diffs for 'ArduinoOTA.cpp' of 2.4.2 and 2.5.0, and although i know NULL about C++, it does not seem the issue is introduced from all those updates.

I also quickly diffed lwip v2 of 2.4.2 vs 2.5.0, there is a bunch of patches that also likely did not affect the UDP first packet issue.

I am willing to continue to dig further, if someone could be kind to point me in some direction! A temporary workaround could be to implement a retry on the sendto socket call https://github.com/esp8266/Arduino/blob/77451d6ec9c4c8ce014704214aa98252a37f3d76/tools/espota.py#L95 in espota.py script.

Steps to reproduce

  • Use either 2.5.0 or later installed from IDE board manager.
  • Load BasicOTA sketch via serial, then reset
  • Attempt OTA update , it will fail on first try, observe that there is NO UDP OK packet sent from esp8266
  • Attempt another OTA update, it will work as the UDP OK packet is sent.

OTA_Fail_NoUDP_Reply
OTA_success

OTA enhancement

All 25 comments

Me, I would replace this by

  for (int i = 0; i < 10; i++)
  {
    _udp_ota->append("OK", 2);
    _udp_ota->send(ota_ip, _ota_udp_port);
    delay(10);
  }

and replace python's if (data != "OK"): by if ("OK" not in data):.

Better, I would replace this UDP message by a TCP one.
Because UDP has nothing to do here. A TCP message, would be always sent, received and acknowledged reliably.

@d-a-v , merci 馃槃 your workaround should work as we wait for a conclusion on issue #3481 . I wrongly thought a temporary workaround to retry on revc fail in espota.py would work, but noticed that the ArduinoOTA code initiates the TCP transfer irregardless if the UDP OK is sent (and it can鈥檛 know as it鈥檚 UDP), whereafter espota times out on UDP recv.

I agree whole heartedly that the OTA update init sequence (Flash command/OK) is best served using TCP, but I can鈥檛 comment why UDP was chosen.

PS: I just tried your code and it works as intended.

Salut @d-a-v , is the enhancement tag for implementing a TCP solution for OTA session initiation instead of UDP? If yes, and if this is accepted, I would like to 芦聽attempt聽禄 implementing it. I know it would take you guys literally hours to do. I have much spare time and would be nice to make this tiny contribution. I think the scope is mostly limited to ArduinoOTA.cpp and espota.py?

What do you think?

Yes go ahead!

(Keep backward compatibility in mind though)

That was quick 馃榾

By backward compatibility, do you mean a solution that initiates the OTA handshake based on a common transport between client and esp?

A scenario I can think of, client has an old version of esports.py which does not support TCP - so switch to UDP, How valid is this?

Afraid I don鈥檛 much insight or history about all this so I may be way off... my first attempt will be a simple TCP transport implemented in both .py and .cpp.

PS: heavy focus on attempt as very new to esp code base and C++/python

@igrr , do you have any insight why UDP was chosen as the transport for the OTA Flash command? As you are one of the authors for OTA, do you have any additional thoughts regarding our proposal to use TCP for that initial OTA setup?

As this is my first public contrib, I just want to consult and include all relevant folks. Thank you.

UDP was chosen due to resources used, mostly. Keeping a TCP PCB (~socket) listening needed more memory than an open UDP PCB. Also the total number of TCP sockets was limited to 4 or so, in earlier versions of the LwIP (which came pre-compiled with the SDK).

Although on one hand I agree with the "eh, UDP is not reliable, let's slap on TCP here" approach, it might be worth actually looking into the UDP loss issue, especially if it happens predictably ("on every other reset").

Thank you for the very valuable insight and quick reply.
One point to add, and I know very little, after the U_FLASH command exchange which uses UDP, a TCP session is used for the firmware transfer.
Couldn鈥檛 this same TCP socket be also used for the initial commands too? This way there is no additional resource usage.

The UDP first packet loss is reproduced in every other OTA attempt, predictably, which I believe is related to issue #3481. I can also attempt to look into this but some guidance would be helpful.

it might be worth actually looking into the UDP loss issue, especially if it happens predictably ("on every other reset").

I agree with that. I got back to #3481 and found that without some delays, UDP packets are not even sent out.

Couldn鈥檛 this same TCP socket be also used for the initial commands too? This way there is no additional resource usage.

I agree with that too.

@d-a-v @igrr , Small update, lwip v1.4 (high BW) does NOT exhibit the issue compared to lwip v2 (high BW) in same release 2.5.0. OTA transfer (and hence UDP exchange) works everytime.

The lwip v1.4 -> v2 evolution is convoluted to my noob eyes, I don't think I can narrow it further as the lwip codebase is large and don't have much history.

@d-a-v , need some help svp, trying to use lwip debugging by setting ULWIPDEBUG, using your instructions from https://github.com/esp8266/Arduino/issues/3884#issuecomment-347537526, running the sketch outputs nothing related to those defines. I verified that sdk/lib contains the newly compiled -llwip* (am using git checkout 2.5.0)

Other defines that I set, e.g. DEBUG_ESP_MDNS_RESPONDER works fine and show up on serial port.

And I believe if I set both defines UDEBUG & ULWIPDEBUG, compilations breaks ...


lwip2 make install

```make -f makefiles/Makefile.glue target=arduino BUILD=build-536-feat-v6 TCP_MSS=536 LWIP_FEATURES=1 LWIP_IPV6=1 V=0
C build-536-feat-v6/glue/doprint.o
C build-536-feat-v6/glue/uprint.o
C build-536-feat-v6/glue-lwip/lwip-git.o
glue-lwip/lwip-git.c: In function 'new_display_netif':
glue-lwip/lwip-git.c:144:37: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" ip=", netif->ip_addr.addr);
^
glue-lwip/lwip-git.c:145:39: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" mask=", netif->netmask.addr);
^
glue-lwip/lwip-git.c:146:32: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" gw=", netif->gw.addr);
^
glue-lwip/lwip-git.c: At top level:
glue-lwip/lwip-git.c:467:9: note: #pragma message:

-------- TCP_MSS = 536 --------
-------- LWIP_FEATURES = 1 --------
-------- LWIP_IPV6 = 1 --------

#pragma message "nn" VAR_NAME_VALUE(TCP_MSS) VAR_NAME_VALUE(LWIP_FEATURES) VAR_NAME_VALUE(LWIP_IPV6)
^
make[3]: * [build-536-feat-v6/glue-lwip/lwip-git.o] Error 1
make[2]: *
[liblwip6-536-feat.a] Error 2
make -f makefiles/Makefile.glue-esp BUILD=build-1460-feat-v6 V=0
C build-1460-feat-v6/glue-esp/lwip-esp.o
AR liblwip6-1460-feat.a
make -f makefiles/Makefile.glue target=arduino BUILD=build-1460-feat-v6 TCP_MSS=1460 LWIP_FEATURES=1 LWIP_IPV6=1 V=0
C build-1460-feat-v6/glue/doprint.o
C build-1460-feat-v6/glue/uprint.o
C build-1460-feat-v6/glue-lwip/lwip-git.o
glue-lwip/lwip-git.c: In function 'new_display_netif':
glue-lwip/lwip-git.c:144:37: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" ip=", netif->ip_addr.addr);
^
glue-lwip/lwip-git.c:145:39: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" mask=", netif->netmask.addr);
^
glue-lwip/lwip-git.c:146:32: error: 'ip_addr_t' has no member named 'addr'
display_ip32(" gw=", netif->gw.addr);
^
glue-lwip/lwip-git.c: At top level:
glue-lwip/lwip-git.c:467:9: note: #pragma message:

-------- TCP_MSS = 1460 --------
-------- LWIP_FEATURES = 1 --------
-------- LWIP_IPV6 = 1 --------

#pragma message "nn" VAR_NAME_VALUE(TCP_MSS) VAR_NAME_VALUE(LWIP_FEATURES) VAR_NAME_VALUE(LWIP_IPV6)
^
make[3]: * [build-1460-feat-v6/glue-lwip/lwip-git.o] Error 1
make[2]:
[liblwip6-1460-feat.a] Error 2
make[1]:
[install] Error 2
make: *
[install] Error 2```

This bug is already fixed in master (and I just fixed the previous fix), in tools/sdk/lwip2/:

cd builder
git fetch
git checkout master

edit: Sorry for being unclear. Only the compilation error is fixed.

Sorry, unsure to which bug you say is fixed, no lwip debugging output or setting both defines that breaks compilation? Ok the latter.

In any case I鈥檒l checkout builder master and go from there. Thank you.
PS: found it https://github.com/d-a-v/esp82xx-nonos-linklayer/commit/10c1fe160a2c5f69613a05722116e3deda920968 Oh its fresh off the oven!

You might also have a look to https://github.com/esp8266/Arduino/issues/3481#issuecomment-480643361

Thanks I saw it when you commented there. For this OTA usecase, it鈥檚 just one single UDP packet not being sent (the first one post boot). I鈥檝e done about 100鈥檚 of uploads, and I can鈥檛 put my finger on it, but probably there鈥檚 some interaction with the first ARP. I鈥檝e noticed in some case esp does not respond to ARP request. And one person in that issue also mentioned ARP...

It would really help me if I can activate lwip2 debugging, does your latest fix has anything to do with ULWIPDEBUG ?
PS: I will run your script as this netdump may be helpful

does your latest fix has anything to do with ULWIPDEBUG ?

No, it's supposed to work. You need to enable debug on serial in IDE Tools menu.

does your latest fix has anything to do with ULWIPDEBUG ?

No, it's supposed to work. You need to enable debug on serial in IDE Tools menu.

Yes, I did that, and also added (in case) Serial.setdebugouput(true) , no additional debug info is shown from whatever I set in debug level (OTA, wifi, core,...). Sadly I spent like 4 hours trying to figure it out before I asked for help!

Ah!
That was working with lwIP-2.0.x but when I updated to lwIP-2.1.2 I reworked the whole lwipopts.h
and I didn't change the debug default options.
You need to change all (or some) LWIP_DBG_OFF to LWIP_DBG_ON in:
https://github.com/d-a-v/esp82xx-nonos-linklayer/blob/25d5e8186f710a230221021cba97727dbfdfd953/glue-lwip/arduino/lwipopts.h#L3259

@d-a-v , I needed to enable both UDEBUG & ULWIPDEBUG to get debug info printed. In your old comment https://github.com/esp8266/Arduino/issues/3884#issuecomment-347537526 it seems to imply that one can define only ULWIPDEBUG ?

In any case, there's tons of debug info, and once when I attempted an OTA transfer, i got a stack dump and most other times during the upload. Not sure if its a side effect of the vvv debug logs or not ... and there are periodic Exception (28) crashes by just letting the sketch run. I just attached FYI.

Maybe I'll use master so I am at par w most of you and take it from there.


Crash during OTA update and UDEBUG/ULWIPDEBUG = 1

Decoding stack results 0x4020414d: ClientContext::_s_connected(void*, tcp_pcb*, long) at $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h line 605 0x40219475: tcp_input at core/tcp_in.c line 408 0x4021dd4d: ip_reass_free_complete_datagram at core/ipv4/ip4_frag.c line 190 0x40100630: _umm_free at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1300 0x40215a25: dns_recv at core/dns.c line 1239 0x4021581f: dns_check_entry at core/dns.c line 1010 0x40235474: wifi_softap_dhcps_client_update at glue-lwip/esp-dhcpserver.c line 1239 0x4020c4fc: esp_yield() at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 91 0x40201ad4: uart0_write_char at $ESP8266/cores/esp8266/uart.c line 450 0x4020c400: String::trim() at $ESP8266/cores/esp8266/WString.cpp line 776 0x4020c4fc: esp_yield() at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 91 0x4020157f: delay at $ESP8266/cores/esp8266/core_esp8266_wiring.c line 51 0x40201ad4: uart0_write_char at $ESP8266/cores/esp8266/uart.c line 450 0x40100d84: free at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1760 0x40100298: vPortFree at $ESP8266/cores/esp8266/heap.c line 59 0x40100630: _umm_free at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1300 0x401002a8: pvPortMalloc at $ESP8266/cores/esp8266/heap.c line 153 0x40100b66: malloc at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1681 0x402150f0: netif_init_ap at glue-lwip/lwip-git.c line 344 0x4021560e: lwip_htonl at core/def.c line 92 0x4021a84c: sys_timeouts_init at core/timeouts.c line 271 0x4021df68: ip4_reass at core/ipv4/ip4_frag.c line 558 0x4021a84c: sys_timeouts_init at core/timeouts.c line 271 0x4021d0ac: icmp_input at core/ipv4/icmp.c line 250 0x4021a84c: sys_timeouts_init at core/timeouts.c line 271 0x4021b1d8: autoip_tmr at core/ipv4/autoip.c line 242 0x4021df68: ip4_reass at core/ipv4/ip4_frag.c line 558 0x4021b258: autoip_tmr at core/ipv4/autoip.c line 424 0x4021dfb0: ip4_reass at core/ipv4/ip4_frag.c line 289 0x402162d0: pbuf_free_ooseq at core/pbuf.c line 131 0x402162d0: pbuf_free_ooseq at core/pbuf.c line 131 0x4021dfd6: ip4_reass at core/ipv4/ip4_frag.c line 301 0x4021a766: lwip_cyclic_timer at core/timeouts.c line 244 0x40219a0c: tcp_write at core/tcp_out.c line 781 0x40217344: tcp_close_shutdown at core/tcp.c line 352 0x40204120: ClientContext::_s_connected(void*, tcp_pcb*, long) at $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h line 635 0x40100b66: malloc at $ESP8266/cores/esp8266/umm_malloc/umm_malloc.c line 1681 0x402048ab: WiFiClient::connect(IPAddress const&, unsigned short) at $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h line 127 0x40203e80: WiFiClient::connect(char const*, unsigned short) at $ESP8266/libraries/ESP8266WiFi/src/WiFiClient.cpp line 133 0x4020b460: Print::println() at $ESP8266/cores/esp8266/Print.cpp line 178 0x4020cbe4: PubSubClient::connected() at $Arduinolibraries/PubSubClient/src/PubSubClient.cpp line 606 0x40204d03: PubSubClient::connect(char const*, char const*, char const*, char const*, unsigned char, unsigned char, char const*, unsigned char) at $Arduinolibraries/PubSubClient/src/PubSubClient.cpp line 129 0x4020c61d: optimistic_yield(uint32_t) at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 115 0x40204ef4: PubSubClient::connect(char const*, char const*, unsigned char, unsigned char, char const*) at $Arduinolibraries/PubSubClient/src/PubSubClient.cpp line 114 0x401001b6: millis at $ESP8266/cores/esp8266/core_esp8266_wiring.c line 183 0x402028fa: mqttreconnect() at $Arduinoscripts/rssi/rssi.ino line 95 0x40202a10: loop() at $Arduinoscripts/rssi/rssi.ino line 202 0x4020c5a8: loop_wrapper() at $ESP8266/cores/esp8266/core_esp8266_main.cpp line 125


PS: just saw your above post after this one. I'll follow-up, and selectively use the debug option I need.

it seems to imply that one can define only ULWIPDEBUG

It was enough when I tried today

periodic Exception (28) crashes

Can you please try and change https://github.com/esp8266/Arduino/blob/77451d6ec9c4c8ce014704214aa98252a37f3d76/cores/esp8266/debug.h#L8 to
#define DEBUGV(fmt, ...) ::printf(fmt, ## __VA_ARGS__) and report ?

Do you want me to try that on release 2.5.0 (which I am on now), or checkout master and apply that define change ? but first I need to digest the debug options inlwipopts.h

I recommend using git when it comes to debugging.
On 2.5.0 you don't have the same line but you can try this change. Please note that on 2.5.0 the firmware is quite slower (loop() called less often, and there are not-understood issues with some boards)

Ok I set ETHARP_DEBUG, I got nada in the serial console. so there still must be something else to set... :triumph: darn darn darn, need to set both ULWIPDEBUG && the individual defines !

I just got a nasty panic, i'll implement your one liner and try again.
Panic $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h:604 err_t ClientContext::_connected(tcp_pcb*, err: Assertion '_connect_pending' failed.

PS: from now on, I am on master.

ok I implemented your one liner above, recompiled, uploaded... run a bit, then I reset the board, I got 3 panics in a row... in between resets. Now, is this related to the debug defines ???
Panic 1,2,3:
Panic $ESP8266/libraries/ESP8266WiFi/src/include/ClientContext.h:604 err_t ClientContext::_connected(tcp_pcb*, err: Assertion '_connect_pending' failed.

Ok solution and problem details in my comment https://github.com/esp8266/Arduino/issues/3481#issuecomment-482293996. In the end, we will not change the OTA transport to TCP given ARP_QUEUEING=1 solves the OTA UDP packet loss.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mechanic98 picture mechanic98  路  3Comments

mreschka picture mreschka  路  3Comments

rudydevolder picture rudydevolder  路  3Comments

treii28 picture treii28  路  3Comments

Marcelphilippeandrade picture Marcelphilippeandrade  路  3Comments