Arduino: OTA of large files results in device hangs due to bootloader compare error

Created on 15 Jul 2020  Â·  31Comments  Â·  Source: esp8266/Arduino

After a recent core change Tasmota fails to load binaries over 560k due to new bootloader functionality.

After the file is loaded this is reported:

load 0x4010f000, len 3664, room 16 
tail 0
chksum 0xee
csum 0xee
v39c79d9b
@cp:0
cmp:
 ets Jan  8 2013,rst cause:3, boot mode:(3,7)

ets_main.c 

It seems the bootloader fails to compare the copied binary with the uploaded binary (for obvious reasons as part of the uploaded file are being overwritten by the bootloader copy routine).

A recent commit 51daecc236fbff0d1ed30a9c63a3af6e8b5c7392 introduces this cmp functionality not present in previous bootloader versions and leads to the device hangs.

Pls remove the compare functionality as it breaks OTA functionality

EDIT: Using core 2.7.1 the OTA uploads work fine (as it has no compare functionality)

Most helpful comment

As a sidenote, I was wondering about caching mechanisms, ...

@mcspr, As I understand it, ICACHE is out of the picture at boot time. You have 64K of IRAM with eboot sitting at the end. At least until Cache_Read_Enable is called by the NONOS-SDK initialization. By the time the SDK calls user_init in the core, Cache_Read_Enable has been called and IRAM is now 32K and ICACHE is 32K. (At least for now :smiley:)

After this, to do flash read/writes you need to use SDK system APIs to handle flash read/writes. Direct ROM flash API calls will fail. The SDK system calls use calls to Cache_Read_Enable_2 and Cache_Read_Disable_2 as needed.

My memory is a little fuzzy, Cache_Read_Enable_2 and Cache_Read_Disable_2 are more like a suspend/resume kind of operation for ICACHE. Where Cache_Read_Enable and Cache_Read_Disable have more logic dedicated to setup and teardown.

If you want to see some examples using it, I have 3 active PRs that turn on Cache_Read_Enable early, as in before the SDK runs and turns it off before the SDK is launched. I try not to surprise the SDK. Method 3 in _PoC for handling Erase WiFi Setting after OTA #6965_ uses Cache_Read_Enable_2 and Cache_Read_Disable_2.

One note, Cache_Read_Disable is not a 100% restore to the previous state. The cache window size does not shrink; however, it can grow. If you were at 32K when you disable you cannot go back to 16K at a new enable. For this reason, I only use the 16K cache size option. All is properly restored with a reboot.

In summary, while eboot is running there should not be any cache interference while accessing the flash with ROM API calls.

All 31 comments

Oh poop, I think I see what you're saying.

You have a firmware of 100K and a 1MB flash chip. You OTA a 200K app placed at 100K-300K and copied down to 0-200K...overwriting the 1st 100K of the original OTA space. Is that the general gist of the issue?

From what I see here:

    if (cmd.action == ACTION_COPY_RAW) {
        ets_putc('c'); ets_putc('p'); ets_putc(':');

        ets_wdt_disable();
        res = copy_raw(cmd.args[0], cmd.args[1], cmd.args[2], false);
        ets_wdt_enable();

        ets_putc('0'+res); ets_putc('\n');

        // Verify the copy
        ets_putc('c'); ets_putc('m'); ets_putc('p'); ets_putc(':');
        if (res == 0) {
            ets_wdt_disable();
            res = copy_raw(cmd.args[0], cmd.args[1], cmd.args[2], true);
            ets_wdt_enable();
            }

it first does a copy_raw loop moving the downloaded data to it's destination. It then calls the copy_raw loop again comparing the data.

Yes. That's what I meant

I suggest to get rid of the verify anyway. If verify fails it will stall as it does now. Without verify it may run but terminate with exceptions allowing to re-ota a correct binary. It provides additional functionality which a simple verify doesn't.

BTW until now, three years in the process, I never observed any bootloader binary failure so I think there is no need for this functionality at all.

As an additional advantage it also reboots faster without the verify ;-)

If you still want verify why not make a hash of the source binary while copying it to it's final destination and then re-read the copied locations and compare to the hash.

If you still want verify why not make a hash of the source binary while copying it to it's final destination and then re-read the copied locations and compare to the hash.

It feels like this also could be immediate read of the written data? By changing ‘verify’ option to select between ‘just write the data’ and ‘write the data, read it back, compare’

I never observed any bootloader binary failure

Have you encountered flash chips like Puya or XMC? Paranoia like this wasn't needed until write failures were reported, and after lots of pain we figured out it had to do with the chip brand.

why not make a hash

Because the hash algorithm would need to be in the eboot code, and there's almost no more space (4KB size max). In theory we could go above 4KB, but that would jump to 8KB of bootloader, and most of it would be empty. And we're not sure it will work.

It feels like this also could be immediate read of the written data? By changing ‘verify’ option to select between ‘just write the data’ and ‘write the data, read it back, compare’

It could but then it might be you're reading the ram part (cache) of the flash (if it exists) instead of the real flash area. The hash is a better verify process I think. It shouldn't be that tricky something like:

uint32_t GetCfgCrc32(uint8_t *bytes, uint32_t size)
{
  // https://create.stephan-brumme.com/crc32/#bitwise
  uint32_t crc = 0;

  while (size--) {
    crc ^= *bytes++;
    for (uint32_t j = 0; j < 8; j++) {
      crc = (crc >> 1) ^ (-int(crc & 1) & 0xEDB88320);
    }
  }
  return ~crc;
}

Have you encountered flash chips like Puya or XMC? Paranoia like this wasn't needed until write failures were reported, and after lots of pain we figured out it had to do with the chip brand.

I have met Puya but they never failed me on restart. They missed the binary OTA upload indeed but then they still rebooted as the current implementation will halt waiting for a power cycle.

The point is that the verify is supposed to detect other potential cases like that which are currently unknown. New flash brands are showing up every so often, they can have quirks like this, and we've already bled twice as a result, so removing all verification is not an option.
Having said that, overwriting the beginning of the source image in the empty space may be valid. HOWEVER: you should be aware that after such an OTA upgrade, the _next_ OTA must be a _smaller_ image, and you may not be able to OTA again if you don't prepare for that (see below).

About what @earlephilhower said:

You have a firmware of 100K and a 1MB flash chip. You OTA a 200K app placed at 100K-200K, and copied down to 0-200K

The received image is end-aligned in the empty space area, so what would happen is something like this:
"You have a firmware of 100K and a 1MB flash chip. You OTA a 200K app placed at 800K-1000K (minus FS + config block, etc), and copied down to 0-200K"

So, in your case:

  • 1MB flash size
  • 560K bin size
  • assume no FS and negligible config etc
  • assume 420KB current bin size (just some number)
  • assume no compression (to simplify)

Then:

  • Your current bin is placed at 0-420KB
  • Your new image is received at 440-1000KB (highest 560KB)
  • New image is copied to 0-560KB (overwrites beginning of empty area image 440-560KB => verify fails)
  • Your new empty space is 560-1000KB = 440KB

So for your next OTA you're constrained to 440KB size, down from 560KB. Can you handle that? ln general, hitting this case means that future updates must be done with a 2-step process by first flashing a small stub app, in which case you won't have this verify problem anyways.

@devyte , FWIW there are some hashing functions in ROM (SHA1, MD5) which could be used. Might still be issues with fitting in the code on our end, but it's not clear it's going to cross the 4k boundary.

@devyte Yes I can handle that.

Tasmota uses OTAMagic. This detects that the requested image won't fit in the space allocated and switches to an intermediate binary called tasmota-minimal which then in turn loads the requested binary. All without user intervention.

This mechanism worked fine for over a year until the bootloader verify option came in play; with the minimal binary active all requested OTA binaries would be larger and were bound to fail.

It could but then it might be you're reading the ram part (cache) of the flash (if it exists) instead of the real flash area. The hash is a better verify process I think. It shouldn't be that tricky something like:

Just to be on the same page, the idea is for eboot to continuously update hash based on write buffer? Doesn't this still hit the caching problem when re-reading the data to calculate the hash for a 2nd time? Speed is another question though, as right now it is only a single memcmp.
I have tested my own suggestion with write->read->check with 1MB board, so at least that works. I would agree that new behavior is not really user friendly :/

As a sidenote, I was wondering about caching mechanisms, it looks like SDK rom .ld has something that looks like related to the flash operations - Cache_Read_{Enable,Disable}. There are also same functions with _2 suffix, without arguments, but those are not listed in headers, but are included in the spi_flash.o of the SDK. Cache_Read_Enable(u8, u8, u8) needs some specific arguments which does not seem to be documented anywhere, and _2-suffixed also don't really work with bootleg version I c/p from the RTOS sources

Even then, is caching even enabled by default (i.e. is it a hw property) or does sdk 'enable' it only after the app is loaded?

As a sidenote, I was wondering about caching mechanisms, ...

@mcspr, As I understand it, ICACHE is out of the picture at boot time. You have 64K of IRAM with eboot sitting at the end. At least until Cache_Read_Enable is called by the NONOS-SDK initialization. By the time the SDK calls user_init in the core, Cache_Read_Enable has been called and IRAM is now 32K and ICACHE is 32K. (At least for now :smiley:)

After this, to do flash read/writes you need to use SDK system APIs to handle flash read/writes. Direct ROM flash API calls will fail. The SDK system calls use calls to Cache_Read_Enable_2 and Cache_Read_Disable_2 as needed.

My memory is a little fuzzy, Cache_Read_Enable_2 and Cache_Read_Disable_2 are more like a suspend/resume kind of operation for ICACHE. Where Cache_Read_Enable and Cache_Read_Disable have more logic dedicated to setup and teardown.

If you want to see some examples using it, I have 3 active PRs that turn on Cache_Read_Enable early, as in before the SDK runs and turns it off before the SDK is launched. I try not to surprise the SDK. Method 3 in _PoC for handling Erase WiFi Setting after OTA #6965_ uses Cache_Read_Enable_2 and Cache_Read_Disable_2.

One note, Cache_Read_Disable is not a 100% restore to the previous state. The cache window size does not shrink; however, it can grow. If you were at 32K when you disable you cannot go back to 16K at a new enable. For this reason, I only use the 16K cache size option. All is properly restored with a reboot.

In summary, while eboot is running there should not be any cache interference while accessing the flash with ROM API calls.

@devyte

Have you encountered flash chips like Puya or XMC? Paranoia like this wasn't needed until write failures were reported, and
after lots of pain we figured out it had to do with the chip brand.

Yes, problems in normal use of the flash. NEVER when writting with the bootloader routines.
See the revert of the uneeded XMC PR for eboot.

Is it sufficient to revert #7317 from 2.7.2 for a clean 2.7.3 ?

Fine with me.

~Can someone with an XMC flash chip can check if https://github.com/esp8266/Arduino/commit/a4b31052e443c02aaf98f48f8cb47b3ba9379ff4 is working with OTA (where 2.7.2 is failing) ?~

~That one is ready to be published and is 2.7.2 without #7317.~

As discussed internally, reverting #7317 entirely is not an option. Only the verify step added in that merge must be disabled.

2.7.3 branch has been force-updated.
Testers are welcome before publishing the hotfix release !

Edit: link to the changes (including recompiled eboot.elf)

No issues so far, OTA or otherwise. Both xmc d1mini clone and esp8285 board

@mcspr did you try the new 2.7.3 branch (not the link provided above which I just ~"disabled "~ ?

Last one posted, where https://github.com/esp8266/Arduino/commit/59bbfc70610c68e2728175cd5ac9f0519bb10992 is the last commit.

$ git describe --dirty --abbrev=40
2.7.2-4-g59bbfc70610c68e2728175cd5ac9f0519bb10992

@d-a-v the branch 2.7.3 need to set build flags for the compiler version.

build_unflags        -std=c17 -std=gnu++17

build_flags            -std=gnu99 -std=c++11

The branch generates warnings which are not in core 2.7.2.

In file included from /home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.cpp:26:0:
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h: In member function 'void Ticker::attach(float, void (*)(TArg), TArg)':
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h:78:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wcast-function-type"
                                ^
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h: In member function 'void Ticker::attach_ms(uint32_t, void (*)(TArg), TArg)':
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/libraries/Ticker/src/Ticker.h:89:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wcast-function-type"
                                ^
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/gdb_hooks.cpp:37:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wattribute-alias"
                                ^
/home/gitpod/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/gdb_hooks.cpp:38:32: warning: unknown option after '#pragma GCC diagnostic' kind [-Wpragmas]
 #pragma GCC diagnostic ignored "-Wmissing-attributes"
                                ^

Sorry guys. Forget my posts above. I messed up something in my fork with git commands...
Everything is working without a warning!

With your two confirmations we are going to release it.
It's an unusual (not automated) release process so let's cross fingers.

A few thoughts - yes just commenting the verify call is the way to go at least for a quick fix.
Having some form of verify is probably a good idea though.... how about generating a hash from the write buffer and then generating a second hash from reading the flash and comparing. ROM code could be used, or there were about 200 bytes of eboot space gained from removing the XMC code from eboot (which was never actually needed in the end). Its not much but might be enough.
The alternative is to store a hash of the uncompressed data in the downloaded image reducing the workload during the write, however about 90% of the write time is actually spent waiting for the flash chip to store the data so I see no real gain in that.

Alternately - what about simply flagging that data was written beyond the start address of the compressed image, and skipping the verify when that happens?

Sorry, I've been offline a few days. I'll test it this evening, as I have a bunch of modules with XMC flash.

Haven't had OTA issues since the last XMC fix, but I'm not up to current build.

Earle rebuilt the eboot.elf with current git + https://github.com/esp8266/Arduino/pull/7468 and I've tested it with two different D1 Minis with XMC, both at 80 & 160 and going back and forth. No errors with my test code, which is nowhere near as complex as Tasmota is. Mine's only 326032 bytes, 31% of 1meg program storage space on 4meg XMC flash.

I also tested with d-a-v's eboot.elf, same results: no problem with 2 boards at any compile speed.

2.7.3 is now released with the patch in #7468 (it's also in master), so I'm closing this. If there still is any issue, please let us know.

Thx.

Was this page helpful?
0 / 5 - 0 ratings