Nodemcu-firmware: SDK 3.0 crashes during file.format()

Created on 19 May 2019 · 22Comments · Source: nodemcu/nodemcu-firmware

@EgorOA @NicolSpies reported in PR https://github.com/nodemcu/nodemcu-firmware/pull/2757#issuecomment-493603965 that the build craters when trying to format the SPIFFS.

Test code

Boot image with preformatted SPIFFS and then execute file.format(). I've added some diagnostic code to (temporarily) remove the start-up automatic format and entry /exit bookends around the call to spi_flash_erase_sector(sector_id) in the platform_flash_erase_sector() wrapper. This shows that the cpu resets during the spi_flash_erase_sector() call typically after 10-30 erase calls.

NodeMCU version

Current dev

Hardware

Wemos D1 Mini

Discussion

I have reverse compiled the spi_flash_erase_sector() in libmain.a(user_interface.o) and this does what I would expect:

int spi_flash_erase_sector(int n) {
  if (!protect_flag || !spi_flash_erase_sector_check(n))
    return SPI_FLASH_RESULT_ERR;
  spi_flash_check_wr_protect();
  system_soft_wdt_feed();
  Cache_Read_Disable_2();
  int s = SPIEraseSector(n);
  Cache_Read_Enable_2();
  return s;
}

except that it uses the libmain.a(spi_flash.o) Cache_Read_Disable_2() and Cache_Read_Enable_2() instead of the equivalent BootROM Cache_Read_Disable() and Cache_Read_Enable(). These seem to be more limited in scope in that they apparently disable/enable the Icache without flushing it.

The stochastic nature of the crash strongly suggests to me that we have an ISR firing which includes Icached code. More investigation is needed

Source

TerryE

All 22 comments

OK, I have found the issue. In merging the V3.0 sdk/ld/eagle.app.v6.ld changes into ld/nodemcu.ld, I had missed forcing libpp.a code into iRAM. This (wrongly) saves a load of iRAM by moving this code into irom1, and this also means that the interrupt and other time critical code in libpp.a which services the WiFi can crater when the Icache is disabled -- hence the reboots during long flash erases.

So we are now back to the situation where we have slightly less RAM available than with SDK 2.2.1. Sorry guys. I'll do the PR tomorrow.

TerryE on 19 May 2019

One aside issue is that now there are two malloc variants dictated by the extra parameter to pvPortMalloc(); if this is true then iRAM memory is allocated preferentially before dRAM. This does beg the Q about the behaviour of os_realloc() which doesn't pass the parameter, so another bit of reverse compilation gives:

#define os_realloc(p, s)  pvPortRealloc(p, s, "", __LINE__)

void *pvPortRealloc (void *p, size_t n, const char *srcFile, unsigned srcLine) {
  if (n == 0)
    vPortFree(p, srcFile, srcLine);
    return 0;
  }
  void *q = pvPortMalloc(n, srcFile, srcLine, false);
  if (q && n) {
    ets_memcpy(q, p, n);
    vPortFree(p, srcFile, srcLine);
    }
  return q;
}

in other words it forces dRAM allocation which is what I want. We have about 5K of iRAM free in standard builds, but including gdbstub uses all of this up, so I can't really rely on being able to allocating data in iRAM statically. However we might be able to use explicit os_malloc_iram() during startup to allocate what are effectively static structures in iRAM if memory is available.

TerryE on 19 May 2019

👍1

Looking at the content of iRAM above _text_end in the debugger, it appears that the SDK does its own os_malloc_iram() calls to allocate resources so it is probably not going to free up any memory doing any os_malloc_iram() calls in our startup code.

TerryE on 19 May 2019

OK, I've done the PR for this bugfix. I've also merged it, since the previous dev was broken.

TerryE on 19 May 2019

👍1

@TerryE , ~~the crash during file system formatting is fixed.~~

I am now experiencing the wdt reset as reported after the SDK 3.0 change as reported by Egor. This was not a problem under SDK 2.2.1. I understand that SDK 3.0 moved a lot of constant data from RAM to Flash ROM to free some RAM with the possible negative effect of the overhead of the exception handler firing .
The latest build reports about the same RAM as the last commit before the SDK 3.0 build, b6cd2c3e dated 5 April without the wdt reset problem.

The problem appears to be related to the tls module where extensive security related processing are required.

Is this the unfortunate reality that is here to stay under SDK 3.0 ?

NicolSpies on 19 May 2019

the crash during file system formatting is fixed.

No, there are some problems stayed - there are a few rebooting during format till the task done.
And this can be done by file.format()

file.format()
................
ets Jan 8 2013,rst cause:2, boot mode:(3,6)
load 0x40100000, len 28600, room 16
tail 8
chksum 0x8d
load 0x3ffe8000, len 2592, room 0
tail 0
chksum 0x1e
load 0x3ffe8a20, len 8, room 8
tail 0
chksum 0x71
csum 0x71

Formatting file system. Please wait...
...
ets Jan 8 2013,rst cause:2, boot mode:(3,6)
load 0x40100000, len 28600, room 16
tail 8
chksum 0x8d
load 0x3ffe8000, len 2592, room 0
tail 0
chksum 0x1e
load 0x3ffe8a20, len 8, room 8
tail 0
chksum 0x71
csum 0x71
Formatting file system. Please wait...
........................................................
NodeMCU custom build by frightanic.com
branch: dev
commit: d6980ad802d58ecc083f953bfc83d5b6ec226c18
SSL: false
modules: adc,bit,bme280,dht,enduser_setup,file,gpio,i2c,mqtt,net,node,ow,rtctime,sntp,spi,tmr,uart,ucg,wifi
build created on 2019-05-19 17:06
powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)
lua: cannot open init.lua

=node.heap()
=node.heap()
41456

And free heap size is dramatically low :( - only 41456 free bytes
nodemcu-dev-19-modules-2019-05-19-17-07-58-integer.zip

EgorOA on 19 May 2019

@EgorOA, I confirm the rebooting during formatting.

NicolSpies on 19 May 2019

And free heap size is dramatically low ... only 41456 free bytes

When I first got involved with the project we had ~15Kb heap for data _and_ code. :rofl:

With each SDK up-issue Espressif add features and functionality. These constantly generate code growth within the SDK libraries. Everyone is bouncing against the iRAM limit so they've been systematically moving code from iRAM to iROM to keep applications linkable. This is undoubtedly impacting system stability either because of SDK internal ISR-triggered code being hoisted into iROM or breaking simple time-critical assumptions because of the slower execution when moving instruction fetch from iRAM to iROM.

As the Espressif codebase is closed source, we as a community have no ability to work with the Espressif maintainers on this.

From a project PoV, moving from 2.2.1 to 3.0 seems to have introduced a step drop in stability for no functional gain. I do wonder if the wise course here would be to move dev to dev-3.0 and regress the mainline dev to a 2.2.1 baseline. It would be trivial to backport a simple partition table for our partitions (the SDK effectively ignores these anyway) so that we could still offer the extra runtime / provision-time configuration of LFS, SPIFFS, etc.

TerryE on 19 May 2019

👍1

I do wonder if the wise course here would be to move dev to dev-3.0 and regress the mainline dev to a 2.2.1 baseline.

Wise idea

EgorOA on 19 May 2019

Not a bad idea, I have been developing against the last pre-3.0 commit full time for the last 1,1/2 months without any of the stability issues. I assume there are some improvements in SDK 3.0 that we will not be benefiting from. Does that mean the current heap is the maximum "safe" amount ?

NicolSpies on 19 May 2019

The main functional addition in SDK 3.0 is the ability to drop the ICache from 32Kb to 16Kb and thus use the extra 16Kb page as iROM + the ability to do os_malloc_iram() calls to allocate this dynamically. This may have advantages for traditional small C applications which need a lot of RAM, but for large code applications with components like the Lua VM and TLS, dropping the ICache from 32Kb to 16Kb absolutely kills performance -- so the tl;dr is that SDK 3.0 doesn't really offer anything material to NodeMCU.

TerryE on 19 May 2019

@devsaurus @HHHartmann @jmattsson @marcelstoer @nwf @pjsg, do you guys have a view on this?

TerryE on 19 May 2019

wrt to the technical "details" in this issue: it's all Greek to me, sorry.

tl;dr is that SDK 3.0 doesn't really offer anything material to NodeMCU.

Considering that 3.x is the only version still being maintained by Espressif that in itself is a benefit - better features or not aside. So, I wouldn't take the decision to skip this major lightheartedly. However, this is a more general remark not based on a solid technical foundation (see Greek comment).

I do wonder if the wise course here would be to move dev to dev-3.0 and regress the mainline dev to a 2.2.1 baseline.

Oh no, not another branch to maintain. Who will have time & energy to look after it?

my 2 cents: we _can_ ignore 3.x but then we do that for better or for worse.

marcelstoer on 19 May 2019

Oh no, not another branch to maintain. Who will have time & energy to look after it?

I've spent a few of weeks of solid work getting to the SDK 3.0 version of dev that we have now and it still isn't usable in its current form; I don't know how to fix it. The problems seem to be in a closed source set of libraries, and I've already started disassembling bits of it to work out how and why it does what it does. I shouldn't need to do this. We already use a fork of the open source LwIP code and Espressif have some changes to their version and I haven't rolled these forward into our fork. I don't know if that is a factor here.

I am not sure what "being maintained by Espressif" means if the current SDK is unreliable when the previous version was.

If we vote to stay with 3.0.1-dev then perhaps the best option that I can see is for me to stop work on dev until we have a more stable SDK 3.x. I would be more than happy for one of the other committers to take the lead for a bit.

TerryE on 19 May 2019

Considering 3.0 is the only thing Espressif will maintain, I'm reluctant to go back to 2.2.1, but I don't have a feel for just how bad things are on 3.0 atm.

jmattsson on 20 May 2019

dev is currently broken. I will do an MCV test case later today, which is just a flash erase loop. AFAIK, the only think that could cause this to crash is if some ISR code is invoking iRom0 routines. But I can't see where any of our ISRs are firing in this loop and I've reviewed our ISR code, so I can only guess that the problem lies somewhere in the SDK code base.

If this is the case then dev so stay broken whilst it is based on SDK 3.x.

TerryE on 20 May 2019

👀1

Can you get gdb breakpoints onto the ISRs? Or maybe the exception-wrapper? Do the crashes go away if we close interrupts before commencing the erase? Does the bootinfo after the crash yield anything interesting after the reboot? Something in EPC perhaps?

jmattsson on 20 May 2019

I have been retesting for the last hour inserting watchdog feeds after every wdt reset. I do sequential TLS operations connecting to cloud servers (Telegram and BitBucket) doing messaging and file up and download. With SDK 2.2.1 I get no wdt resets and with latest dev I get one every time in one of the above actions, inserting wdt feeds does not make a difference. @TerryE, I will not make any code changes and test against new builds when they drop.

NicolSpies on 20 May 2019

Thanks @marcelstoer @jmattsson for staying on my case. I've found it. Here is my MCV

static int test_bulk_erase (lua_State *L) {
  uint32_t i, j, sa, ss;
  ss = platform_flash_get_partition (NODEMCU_SPIFFS0_PARTITION, &sa);
  sa /= INTERNAL_FLASH_SECTOR_SIZE; 
  ss /= INTERNAL_FLASH_SECTOR_SIZE;
  for (j=1; j<10;j++) {
    // print prolog and wait until UART empty 
    system_set_os_print(1);
    os_printf ("Try %u: erasing sectors %u to %u\n", j, sa, sa+ss-1);
    os_delay_us(50000); system_soft_wdt_feed();
    // now erase the spiffs partition
    for (i = sa; i < sa+ss; i++) {
      spi_flash_erase_sector(i);
    }
  }
  return 0;
}

Running test stub creates the crash with the fatal exception reported.

> test.bulk_erase() -- This will crash with a ~50% chance in any one loop
Try 1: erasing sectors 116 to 1023
Fatal exception 0(IllegalInstructionCause):
�pc1=0x4023b774, epc2=0x00000000, epc3=0x00000000, excvaddr=0x402711a8, depc=0x00000000
 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

And rebooting into GDB with an x/i 0x4023b380 shows that this is chm_get_current_channel is a .text psect entry in libnet80211.a(wl_chm.o). One of the things that I had to do was to regress the eagle.app.v6.ld changes into nodemcu.ld. This moved libnet80211.a into irom0.

_Well not quite_ -- in that it actually moved .literal.* .text.* into irom0, but there is a single .text entry in libnet80211.a _for chm_get_current_channel_ which is clearly called by one of the libphy or libpp ISRs.

Fix this one-line in nodemcu.ld and my test.bulk_erase() now executes solidly. I'll do the push. Sorry guys.

TerryE on 20 May 2019

🎉1

OK PR #2762 in and merged.

@marcelstoer, this really does underline the MCV approach: by distilling the issue down to its bare essentials, this enabled me to isolate the problem.

@NicolSpies, this might or might not fix your TLS problems, because this fault will only crater the system if an ISR calls chm_get_current_channel() in a time window when the ICache is disabled -- for example during SPIFFS I/O.

I still have the heap reported at 42,744 with LFS._dummy_strings loaded, which is a couple of kB less than with SDK 2.2.1

TerryE on 20 May 2019

@TerryE, thanks for the feedback, started the Docker build when you merged, will test soon.

NicolSpies on 20 May 2019