Nodemcu-firmware: mysterious crash "already freed"

Created on 22 Mar 2018 · 30 comments · Source: nodemcu/nodemcu-firmware

Expected behavior

The Lua program does not crash because a table has mysteriously been "already freed".

Actual behavior

The Lua program reports "3fff5168 already freed" (the address is arbitrary, of course) and crashes right afterwards when the code tries to use a value from a table that has already been freed.

Test code

Provide a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) which will reproduce the problem.
I know what you need but I cannot provide it, unfortunately. My program started crashing only recently, after it grew to a certain size. Even though it's modularized, I am afraid that the crash is related to the total size.

Some examples that do crash:

local names = {}
dofile("names.lc").load(names)
local name = names[1]

The names table is somehow freed right after the dofile() call, so the subsequent names[1] access crashes. However, if I try to print the names address (to see if it matches the address reported before the crash) it no longer crashes and the code continues to run perfectly well!

local names = {}
dofile("names.lc").load(names)
print(names)
local name = names[1]

Yes, one single print() added in the "right" place makes the crash go away.

Another example of the same issue:

local ok, rules = pcall(sjson.decode, data)
local x = rules[1]

Again it crashes because the rules table has been freed. But if I insert any print() there it starts working correctly:

local ok, rules = pcall(sjson.decode, data)
print("Please do not crash!")
local x = rules[1]

I must say this makes developing with NodeMCU kinda nightmare-ish.

BTW, if I replace the print("Please do not crash") with print(node.heap()) it also "fixes" the issue and shows that there's about 23 kB free on the heap.

Any idea why a Lua table gets freed at unexpected places for no apparent reason? And why inserting a random print() fixes it?

FWIW, replacing the "right" print() with collectgarbage() makes the crash disappear as well.

NodeMCU version

Current master.

Hardware

Stock ESP8266-07.

bug

All 30 comments

BTW, I don't know what print() does that "fixes" it, but I can imagine it delays the program execution a little bit (just as collectgarbage() causes a small delay, I imagine), so perhaps this is a race condition, a timing issue?

One could also suspect file-related stuff, because there is code for reading a small text file being executed both in the load() function of "names.lc" and also a few lines above the sjson.decode() call.

@joysfera Petr, I need a complete example that does systematically crash -- for example (1) + names.lua. Sanitize it of personal data; put it in a gist and link to it here, complete with the config info that I need to build it (e.g. a git diff of app/include/user*.h or just a verbal description of the changes that you made).

There is zero chance of tracking this down with the info that you've given. Clearly it is going to help to use the minimal library set and application. I'll first try this with my LFS version (because I've fixed the remote debugger in this patch).

This sort of thing is normally down to a library function not following the rules for handling Lua resources. //Terry

I'll do my best to isolate the crash, but if it turns out that the issue disappears when I remove some completely unrelated code from a different module then I'd have to share the whole project, which is not an open-source-ready thing. So I could either send it to you privately or I'm on my own with this issue, I'm afraid.

I have just noticed one more suspicious thing:
in the following code:

local r = rules[1]

The crash report is this:

3fff4fe0 already freed
PANIC: unprotected error in call to Lua API (attempt to index a boolean value)

Please note the index a boolean value - Lua identifies the rules table as a boolean value now. Scary, isn't it?

Petr, I've sent you my email details. I am particularly interested in any modules that you are using, including any of your own that you might have developed. In my experience the underlying issue is usually that the Lua GC can kick off on any alloc / free / pcall, and if the C code is using any GCObjects that aren't properly hooked into one of the Lua root nodes (the environment, the registry and the stack) then the GC will decide on its sweep that they are unused and free them. The C code then comes along later and frees them again -- or it doesn't, and the allocator reallocates the memory so it is now doubly used, which causes this type of error.
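For illustration only (the names here are hypothetical, not taken from any real NodeMCU module), this is the sort of anchoring that a C library is expected to do: anything that has to survive beyond the current call gets a reference in the registry, and that reference is released when the module is finished with it.

#include "lua.h"
#include "lauxlib.h"

typedef struct { int table_ref; } mymod_ctx;         /* hypothetical module context */

static int mymod_start (lua_State *L) {
  mymod_ctx *ctx = (mymod_ctx *)lua_newuserdata(L, sizeof(mymod_ctx));
  lua_newtable(L);                                    /* a table the C code wants to keep */
  ctx->table_ref = luaL_ref(L, LUA_REGISTRYINDEX);    /* anchor it in the registry (pops it) */
  /* Without that anchor, any later alloc/free/pcall could trigger a sweep that frees
     the table while C still holds it -- giving exactly this "already freed" panic.  */
  return 1;                                           /* the userdata is the only result */
}

static int mymod_use (lua_State *L) {
  mymod_ctx *ctx = (mymod_ctx *)lua_touserdata(L, 1);
  lua_rawgeti(L, LUA_REGISTRYINDEX, ctx->table_ref);  /* fetch the anchored table back */
  return 1;
}

static int mymod_stop (lua_State *L) {
  mymod_ctx *ctx = (mymod_ctx *)lua_touserdata(L, 1);
  luaL_unref(L, LUA_REGISTRYINDEX, ctx->table_ref);   /* release the anchor when finished */
  ctx->table_ref = LUA_NOREF;
  return 0;
}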

These potential issues have been well hosed out of the Lua core which is why my first instinct is to look at the C library code.

But send me your stuff and I will take a look.

@TerryE Terry, after removing as much code as possible I've got a 175-line main.lua that still crashes, but as soon as I remove any line (one that is not even executed prior to the crash!) it stops crashing. I'm going to email it to you, thank you.

@TerryE FYI, I have just tried firmware built from the very latest dev (commit: f8311855d11df25274290c8245c8b2c0ecd4826d) and the crash is reproducible there, too. That's good news, I think.

OK, I've also reproduced this in my LFS debug build. When you compile in debug there are all sorts of extra asserts going on and this traps out messily on these checks unless you replace the line

local ok, rules = pcall(sjson.decode, data)

with

local ok, rules = pcall(function() return {{mode = "auto", rules = {{ "10deadbabe01", "20.5", "0.3", 0 }}, ["and"] = true }} end)

(The native Lua equivalent.)

This is really one for @pjsg, but I _think_ that the issue is that the SJSON routine is overrunning the Lua stack. Basically C library routines have a "contract" with the RTS. The RTS ensures that there are LUA_MINSTACK (20) slots free before calling a C library routine, but in return the C routine _must_ tell the RTS if it needs more. It can in fact do this at any time, but the RTS may actually realloc (move) the stack to do this, so any StkId local variables might get invalidated. The issue here is that your JSON is four tables deep, so this uses a lot of stack entries.
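To make that contract concrete, here is a hypothetical sketch (not the actual sjson code) of a C function that builds a nested result and therefore has to ask for stack space up front:

#include "lua.h"
#include "lauxlib.h"

/* Build 'depth' nested tables { { { ... } } } the way a parser would, outside-in,
   so every open level sits on the Lua stack at the same time. */
static int build_nested (lua_State *L) {
  int depth = (int)luaL_checkinteger(L, 1);
  int i;
  /* Beyond the guaranteed LUA_MINSTACK (20) slots the routine must ask for more;
     luaL_checkstack raises a clean error if the space cannot be provided.  Growing
     the stack may realloc (move) it, so any cached TValue* / StkId pointers into
     the old stack are invalid after this call. */
  luaL_checkstack(L, depth + 2, "nested result too deep");
  for (i = 0; i < depth; i++)
    lua_newtable(L);          /* open one table per level; all of them stay on the stack */
  for (i = 1; i < depth; i++)
    lua_rawseti(L, -2, 1);    /* close the innermost level: parent[1] = child (pops child) */
  return 1;                   /* the outermost table is the single result */
}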

So nothing to do with the Lua core -- phhheewwww.

@pjsg, would you like me to send you my archive that reproduces the crash? Or would @TerryE send it to you eventually?

Also, Terry, please note that the first time I got this kind of panic crash it was not related to JSON in any way; it was just about reading a plain text file and parsing it using string.match (in the names.lc module). But it would be far harder to create a reproducible case, I am afraid (because it required an interaction with a web server). I would still give it a try because I would like to help you iron out such a nasty bug from NodeMCU.

BTW, if such a simple JSON file cannot be decoded then I think there's something seriously wrong with the SJSON module:

[ { "mode": "auto", "rules": [ [ "10deadbabe01", "20.5", "0.3", 0 ] ], "and": true } ]

Phil, I'll email you the local dir and a git diff from the latest LFS build so that you can just apply it as a patch and do the make. I've been debugging by having the test driven non-interactively from the init.lua.

One PITA that I've noticed is that there is no simple test to see if you are going to run out of RAM0; you just get loads of diagnostics at some baud rate other than 115200, so this is unreadable.

Petr, the JSON might be simple but it is four levels deep, e.g. [{{{ "10deadbabe01" }}}], and RAM resources always have to be carefully managed on the ESP.

It seems the four-level-deep structure takes about 800 bytes of RAM:

> =node.heap()
44416
> a={{mode = "auto", rules = {{ "10deadbabe01", "20.5", "0.3", 0 }}, ["and"] = true }}
> =node.heap()
43592

That could/should fit, shouldn't it? In my application the print(node.heap()) always shows at least 20-23 kB.

I tried to be careful when parsing the json to do it right. I may have made a mistake (or not understood the constraints exactly). I'll take a look...

The sjson code runs on my host system under valgrind with no errors even when passed the object above.

And how many stack slots does it require to parse that single line?

@pjsg Philip, I am feeling rather shit at the mo thanks to a dose of flu, so I am back off to bed. I'll have a bit more of a nose around tomorrow and email you a tarball / patch that you can use to test this with the debugger enabled. I admit that I might be talking out of my backside, but one thing that I've also learnt is that whilst bootstrapping through a vanilla Lua on the host can help a lot, the eLua patches and the far more aggressive GC on the ESPs really add another dimension of complexity.

Petr, you've found a serious bug that merits Philip and me looking into it, but this is all nasty internals stuff :(

Terry & Philip, I'm not pushing you or anything, just curious and willing to help. I wish you the best of luck debugging it.

Petr, to be honest, we've only just got to the point where we can sensibly debug this type of issue. I had to sort out a few gremlins that made the remote debugger not usable in practice in order to sort out the GC issues in LFS, but the upside is that this capability is now available for other issues.

Getting there slowly. What I've done is to document my debug session (less all of the embarrassing blind alleys): Commented debug log.

I have just bashed the comments in so there are loads of typos in the commentary that need fixing. I also want to include the final reveal of the actual bug, so this needs adding to. But the important point here is that we just couldn't do this sort of debugging 4 months ago, and having this sort of tool is pretty much essential for hunting down complex or subtle bugs.

OK, I've found it. It's an incredibly subtle bug in the RTS introduced as part of the EGC. The nub is this routine at ldo.c:144-152:

void luaD_reallocstack (lua_State *L, int newsize) {
  TValue *oldstack = L->stack;
  int realsize = newsize + 1 + EXTRA_STACK;
  lua_assert(L->stack_last - L->stack == L->stacksize - EXTRA_STACK - 1);
  luaM_reallocvector(L, L->stack, L->stacksize, realsize, TValue);
  L->stacksize = realsize;
  L->stack_last = L->stack+newsize;
  correctstack(L, oldstack);
}

Experienced C programmers will realise that this code is not reentrant. However, with eLua and under bizarre circumstances, it can be reentered. The luaM_reallocvector() call can trigger a full GC, which can involve calling routines like sjson_decode_destructor() which are sufficiently complex that they also trigger a luaD_reallocstack() -- and at this point everything goes horribly wrong. And here is the proof:

Breakpoint 1, luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:145
145   TValue *oldstack = L->stack;
(gdb) bt
#0  luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:145
#1  0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#2  0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a98, nresults=0) at ldo.c:322
#3  0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a98, nResults=0) at ldo.c:392
#4  0x4024bed5 in GCTM (L=0x3fff00e8) at lgc.c:495
#5  0x4024c3db in singlestep (L=0x3fff00e8) at lgc.c:626
#6  0x4024c64e in luaC_fullgc (L=0x3fff00e8) at lgc.c:703
#7  0x40246be2 in l_alloc (ud=0x3fff00e8, ptr=0x3fff49b8, osize=432, nsize=912) at lauxlib.c:940
#8  0x4024cdc9 in luaM_realloc_ (L=0x3fff00e8, block=0x3fff49b8, osize=432, nsize=912) at lmem.c:78
#9  0x40248ad8 in luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:148
#10 0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#11 0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:322
#12 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#13 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#14 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#15 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#16 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#17 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#18 0x4026709e in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#19 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#20 0x40000f49 in ?? ()

Compare frames 0 and 9. I think that the simplest fix is to disable the GC during a stack reallocation, by bracketing the luaM_reallocvector() call as follows:

  int block_status = is_block_gc(L);
  set_block_gc(L);
  luaM_reallocvector(L, L->stack, L->stacksize, realsize, TValue);
  if (!block_status) unset_block_gc(L);
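
For completeness, a sketch of how that bracketing slots into the routine quoted above (using the same is_block_gc / set_block_gc / unset_block_gc helpers as in the fragment):

void luaD_reallocstack (lua_State *L, int newsize) {
  TValue *oldstack = L->stack;
  int realsize = newsize + 1 + EXTRA_STACK;
  int block_status = is_block_gc(L);
  lua_assert(L->stack_last - L->stack == L->stacksize - EXTRA_STACK - 1);
  set_block_gc(L);    /* stop the EGC running finalizers (e.g. the sjson destructor)
                         that could grow the stack and re-enter this routine */
  luaM_reallocvector(L, L->stack, L->stacksize, realsize, TValue);
  if (!block_status) unset_block_gc(L);
  L->stacksize = realsize;
  L->stack_last = L->stack+newsize;
  correctstack(L, oldstack);
}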

We should get this core bugfix into the 2.2 release.

Wow -- that is amazing. Congratulations on finding it. However, it does raise a whole load of issues surrounding destructors being called at unlikely times and what trouble that can cause.

I think that disabling the GC during a stack realloc is a good idea, but I wonder how many other places deserve the same treatment.

Yup. A corollary to this is that we forked from eLua over 3 years ago; it might be worth going through all of the eLua bugfixes relating to the Lua core and merging the appropriate ones across.

I'll be restarting my 5.3 port soon, and as I explained in the whitepaper, Lua 5.3 had already pinched the bits of eLua that we really use and has been hammered in general distribution. So my 5.3 dumps most of the NodeMCU / eLua modifications to the Lua core and only applies the essential stuff that we need to run efficiently out of flash: LFS, ROtables, and the module API, so that we don't have to rewrite all of our modules.

But 5.3 also includes some really nice performance enhancements to cache string lookups, and I've added ROtable lookup to this as well.

@pjsg, I've also updated that _Using the debugger_ gist. You might have a scan. I think that this gives our C library developers an idea of how you can use the remote debugger very effectively despite its limitations.

I've also had a look at the eLua issues / PR list and deciphering this is going to be a dog. Simpler to pull the repo and do a git diff of the core against the point when Zeroday first cloned it.

Terry, amazing work on finding this!

May I ask why inserting a print() at a certain place seemed to fix it? Or did it just hide the issue, and the stack was still corrupted but I just hadn't noticed it (because it overwrote something other than the array that got indexed)?

Also, does the collectgarbage() that I currently use at several places in my application to overcome these crashes really help to avoid this dangerous issue, or is it again just masking it and I still get corrupted memory?

Two days ago I tried to reproduce the other crash I documented above, the one right after dofile('names.lc').load(names) - I didn't succeed. It's apparently a sensitive thing that shows up only under the "right" circumstances. However, there wasn't any "sufficiently complex" destructor called - unless the dofile() itself does that.

Please note that I experience instability and crashes at various moments of using NodeMCU - sometimes even while uploading new Lua files. I also have my own "updating system" (that downloads updated files and compiles them) and it is also crashing almost randomly. I'm wondering if this crucial fix will solve all these issues.

Looking forward to using a new 'master' firmware with the K.R.A.C.K. fix and this one - what are you going to call this bugfix? (You know, every bug must get a name nowadays :)

Thanks again! NodeMCU is going to be much better now (as I know people who gave up on using it because of the random crashes - hopefully these are gone now).

Petr, the crash occurs on an incredibly rare combination of circumstances where the VM needs to dynamically increase the stack to meet the needs of your application. What eLua (which is what NodeMCU is forked from) adds is logic to execute a GC before trying to extend the stack. In your test case, you had just called sjson, so some sjson resources were there to be collected, but in order to collect them the system executes the sjson_decode_destructor(), which triggers a stack resize inside the stack resize, and the core Lua code wasn't designed to handle this. On NodeMCU, the GC is called extremely often, so the sjson udata would normally be long gone.

So just to be absolutely clear, there is nothing wrong with sjson that I know of, nor with the main Lua core: the bug was introduced with the eLua EGC patch, and because it is very rare, it hasn't been hunted down. What you brought to the party was a repeatable test case that I could use to hunt it down.

Terry, I do understand that well. What I'm saying is that I experienced the very same kind of crash in completely different Lua code, without any SJSON, and was able to reproduce it and "fix" it by adding print(), so this is not as incredibly rare as you might think. I also experience other (not necessarily related) instabilities and crashes/reboots of NodeMCU rather often, so I was curious whether they will all be magically fixed by your fix or not.

Petr, as a general point, upvalues and locals are extremely cheap in terms of runtime and memory footprint -- certainly a lot cheaper than the convoluted process that the VM goes through to resolve a ROM global such as print or collectgarbage. Have a look at that ds18b20 example module that I sent you, which takes advantage of this. In this case you use the following globals in your main.lua: cjson, dofile, file, _G, gpio, LEDon, node, pcall, pinButton, pinCidlo, pinLED, pinsRele, print, program_override, sjson, string, string_to_temp, teploty, tonumber, tostring. You will find that your code runs faster if any frequently used base functions and ROtables are mapped to locals, e.g. at the top of your module insert:

local collectgarbage, file, gpio, node, print, sjson, string, tonumber, tostring = 
      collectgarbage, file, gpio, node, print, sjson, string, tonumber, tostring

And never use globals where a local or upval will do. And now you can do the same. You can also issue a collectgarbage() immediately after your pcall(sjson.decode, data) so the destructor runs before any more ROM global resolutions.
40 [41] GETGLOBAL 6 -12  ; pcall
41 [41] GETUPVAL  7 3    ; -
42 [41] GETTABLE  7 7 -13 ; "decode"
43 [41] MOVE      8 5
44 [41] CALL      6 3 3
45 [42] GETUPVAL  8 4    ; collectgarbage  // The closure is already resolved in a local function upval.
46 [42] CALL      8 1 1

As to other mysterious crashes, to be brutally honest there is no use in Lua developers posting here "my code mysteriously crashes". The _only_ time that someone like me, Philip or Johny will be able to hunt them down is if we have a stand-alone reproducible test case which we can use to recreate the crash and start investigating it, as I explained in my worked example of this type of session.

I don't have my ESP Lua code crashing, so I can't hunt bugs that I am not experiencing.

I happen to be a programmer as well and I debug user issues daily so I do understand that. I provided a reproducible test case above in one case where I was able to isolate the bug. The other issues happen so randomly that I cannot provide anything useful (yet), sorry.

Thanks for your suggestion regarding locals, I use them whenever possible. Though I never thought that local print = print could help; this is almost counter-intuitive. Is there an RTFM with best practices for eLua/NodeMCU programming, apart from http://nodemcu.readthedocs.io/en/master/en/lua-developer-faq/ which I need to re-read, as it's been updated a lot since I last read it a few years ago?

Re the FAQ, I wrote the first version about 3 years ago and it's been through a few update cycles as I continue to learn the subtleties of Lua over the SDK. It is really too monolithic, and I feel that the strict review lifecycle of our release process doesn't suit an FAQ. But LFS is going to represent a step change, as does the 5.3 platform, so I will synchronise any updates with the releases.
