Nodemcu-firmware: Error Handling in Lua

Created on 28 Apr 2020 · 26Comments · Source: nodemcu/nodemcu-firmware

3075 introduced the new `luaL_callx()` to facilitate improved Panic Handling. I want to use this issue to discuss error handling in more general terms and why the focus on panic handling.

Caught Errors. The Lua architecture allow the application so establish an error handler via the pcall() and xpcall(), thus creating a _protected environment_ for executing a code hierachy. In such cases the error is returned to the application and the application code itself determines how to process or log the error.
Uncaught Errors. Any Lua code executed outside such a protected environment results in a Lua panic, which in standard Lua coding typically terminates the application. In the case of NodeMCU, this triggers a processor restart.

One of the main issues that catches new developers is that whilst the interactive thread established a default protected environment to catch and print errors interactively, any Lua callback runs as separate execution thread and therefore is not protected; any errors here will panic and reboot the processor. The issue that I want to consider here is whether this is the correct behaviour for NodeMCU and if not then how we can improve this in a way which doesn't break existing applications.

We have three broad mechanisms for reporting errors in the firmware:

Print direct to the UART. We adopt this approach during Lua startup because this is the simplest and most robust mechanism that avoids dependencies on services that might not be yet running.
Print the error using the standard print() function. Prior to SDK 3.0 releases, we always did (1) but this meant that errors generated during node.output() where reported only on the UART. print() now sends output to the STDOUT which is in turn emptied to the UART or the node.output() reader using the Pipe module, and this enables error reporting in a terminal session to occur through the session.
Use a separate panic process manger to process uncaught errors. The problem with (2) during panics is that a simple "print error and restart" will queue the error message to STDOUT then restart the processor before the printing task has executed. At a minimum, the restart operation should wait until the STDOUT pipe is empty (or a maximum time elapsed) before restarting the ESP.

In case (3) the default panic process should still print the error, but at least wait until the STDOUT pipe is empty or a maximum time elapsed before restarting the ESP. We should also provide a node.setonerror() call to enable the application to override this default (say to add some form of error logging).

Scope of Work

So I suggest that a good next PR would be to tidy up such error handling:

Review Lua code for complaince to above (1)-(3) separation.
Modify node.restart() functionality to wait until STDOUT or N sec elapsed before restarting ESP. Add extra delay boolean parameter to enable this mode and document this. _Q: Should this wait until errors flushed be thedefault?_
Add node.setonerror() call and document.
Modify all module CBs to use luaL_pcallx()
~~Modify debug.debug() function to read from STDIN pipe, and output to print() (Maybe a separate PR.)~~

Source

TerryE

Most helpful comment

@nwf wrote:

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon.

I will formulate the PR on this basis. Also since I am making changes to lauxlib and the modules to add the luaL_pallx() calls and l, I will also add the luaL_unref2() API as discussed here in #1028 and tidy up tmr.c as a worked example.

TerryE on 13 May 2020

🎉1 👍1

All 26 comments

~~Modify debug.debug() function to read from STDIN pipe, and output to print() (Maybe a separate PR.)~~

This one needs a bit of explanation. One of the ways (in standard Lua at least) to debug is use debug.debug as an xpcall() error handler (see PiL 8.5). So if the error occurs 5 Lua call frames down, then the stripped down debug.debug() interpreter runs at call frame 6.

The issue here with NodeMCU is that SDK tasks always start at call stack frame 0 and grow / release the frames until a return from stack frame 0 completes the task returning control to the SDK, and allowing the next callback task to again execute from Lua call-frame zero. In order to make thestack diagnostics available, we would need the interpreter to run at stack frame 6 not 0. This is going to be really difficult to achieve without botching the Lua architecture. This needs quite a bit more though before I could consider doing this. In the meantime we have a lot more valuable low hanging fruit to pick.

TerryE on 28 Apr 2020

One of the points of fragility in the C module is that luaL_ref can throw on OOM (if it fails to expand the Lua table that is the registry, e.g.). Can we catch that exception in C, too?

nwf on 30 Apr 2020

One of the points of fragility in the C module

@nwf , I hold a dissenting view here. The underlying fragility is that people write apps that exhaust memory. If you've run out of memory to the point that the application can't even grow registry (which should be a fairly small table if the application is correctly written), and it is in this exhausted RAM state, then it is truly f**ked; it isn't going to recover. Returning a status for this one API call when every other Lua API call throws the error isn't going to make anything any less fragile. The best thing to do is to throw an error so that error reporting can give a decent traceback.

As I asked offline give me the link to such an example and I can give you my feedback.

When I develop my apps, I continuously check the registry for growth, e.g.

r=debug.getregistry()
for k,v in pairs(r) do if type(v) == "function" then print (k,v) end end
-- this will list off registered functions and this list is growing then you've got a leak
-- With Lua 5.3 builds you can also get debug info like line numbers and upvalues on these
t=tmr.create()
local cnt = 0; local function counter(t) cnt=cnt+1 end; t:alarm(5000,1,counter)
-- if this is in r[5], say, then
for i=1,10 do local u,v=debug.getupvalue(r[5],i);if v then print(u,v) end end
cnt 16
for i=1,10 do local u,v=debug.getupvalue(r[5],i);if v then print(u,v) end end
cnt 19

TerryE on 1 May 2020

An example is the proposed #2854 PR: the constructor began with a call to malloc and then luaL_ref'd a callback. I think the solution there is to carefully stage the creation of a userdata with a __gc metamethod that unhooks things and to immediately install any luaL_ref'd result into said userdata, and I've suggested that change.

As the code was written, a failure of luaL_ref would have exacerbated an OOM situation by leaking the ping_t structure allocated at the beginning.

nwf on 2 May 2020

@nwf, Thanks Nathaniel. I will do a review of the module and feed in comments.

TerryE on 2 May 2020

One of the points of fragility in the C module is that luaL_ref can throw on OOM

@nwf, I've been considering this issue. This "throw on memory error" is the same for all C API calls which can allocate collectable resources such as extra stack, strings, tables or additional table entries and userdata. In general resources created on the stack are cleaned up by the Lua CG after exiting the routine (and using a userdata this way is actually the simplest way of mallocing temporary data).

The issue with creating resources in the registry is that these will persist if the routine throws an error for whatever reason, in that registry entries will persist even if an error is thrown, leading to memory leakage of dead resources. Something that I propose to clean up in #1028, but you and I should agree some best practice patterns for coding this. But another peeve of mine if that a lot of our code tests for error values and error logic on return even though these call throw an error and will _never_ return an error value.

TerryE on 6 May 2020

node.setonerror() (with no function argument) sets the default on error option, which is to print the error, but also to kick off a 2sec timer which restarts the processor after this delay This is to ensure that error can printed or be sent to output redirection before the restart occurs. If you are debugging and watching UART0, then you also have the option of doing a node.setonerror(print) which will print the error without restarting, or even use your own Lua function to do something else.

TerryE on 8 May 2020

One of the complications here is that we need a unified error architecture across both the Lua 5.1 and 5.3 VMs so that the rules are the same for modules and independent of which Lua version used. I've had this challenge in other functional areas as well such as table and function handling. A consequence of this is that for from Lua 5.1 being in "frozen" support, it is receiving a constant "bleed-in" of Lua 5.3 goodies.

TerryE on 10 May 2020

@nwf, @HHHartmann, I've set up node.setonerror() as discussed above, and also review all of the lua_call() uses in the modules. The majority just do the call and then immediately exit, so we can just do a global substitute in these for luaL_pcallx(). There are a few (dcc,file, net, sjson, softuart, wifi) which do a bit of cleanup after the call which need to be guarded by the pcallx status test as this now returns instead of an immediate panic.

I can include all of these changes as a single batch PR or alternatively just change one module on this PR (e.g. tmr)so that you can evaluate its behaviour and then do all of the others as a separate PR. What are your thoughts?

TerryE on 11 May 2020

Well this seems to working fine. No more undiagnosed tracebacks in CBs and optionally being able to turn off panic reboot.

> file.putcontents('f.lua', 'function f() g() end')
> dofile'f.lua'
> node.setonerror(print)
> tmr.create():alarm(1000,0,f)
> f.lua:1: attempt to call global 'g' (a nil value)
stack traceback:
    f.lua:1: in function <f.lua:1>

> node.setonerror()
> tmr.create():alarm(1000,0,f)
> f.lua:1: attempt to call global 'g' (a nil value)
stack traceback:
    f.lua:1: in function <f.lua:1>

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x40100000, len 31864, room 16 
...

TerryE on 12 May 2020

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon (@marcelstoer, yes?).

nwf on 12 May 2020

we are currently obligated to not merge anything to dev so that we can cut master soon

https://github.com/nodemcu/nodemcu-firmware#releases
https://github.com/nodemcu/nodemcu-firmware/issues/3066#issuecomment-626370485

marcelstoer on 13 May 2020

@marcelstoer we are on the same page. A target date would be nice.

TerryE on 13 May 2020

@nwf wrote:

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon.

TerryE on 13 May 2020

🎉1 👍1

Adding the luaL_reref API would also be nice; it would let us tidy up a lot of places. :)

nwf on 13 May 2020

Already done :smile:

I will tidy up a couple of the modules using luaL_reref and luaL_unref2 as worked examples but let's leave the bulk until after you have had a review of these and are happy.

TerryE on 13 May 2020

@nwf, IIRC you are working on mqtt, so I haven't done the luaL_pallx() changes to this module. Ditto tls

TerryE on 13 May 2020

That's fine; I can take those. :)

nwf on 14 May 2020

@nwf One note whilst I think on. We don't want to convert all lua_call() instances, just those (and in the majority) that are executed via a SDK callback, that is those where the execution path uses lua_getstate() to acquire the lua state. The remainder (for example in sjson) are invoked from the Lua application and may already be protected by a pcall() so we don't want to turn these into a panic.

TerryE on 14 May 2020

👍1

@nwf I have come across an issue in my testing and that is when the error is triggered memory exhaustion, for example:

> a={}; function f() for i=1,3000 do a[i]=0 end end 
> f()
E:M 32784
E:M 32784
Lua error:  not enough memory
> print(#a) a=nil  -- still got a in memory
2048
> collectgarbage();  print(node.heap())
40784
> tmr:create():alarm(500,0,f) -- now run out mem in a CB
> E:M 32784
E:M 32784
> -- this should have done the TB and printed it, then rebooted but it didn't
> a= nil; collectgarbage(); return node.heap()
40760

The issue is that the error handler is running at the deepest stack frame and I think that I need to run the GC before I try to create the closure for the post and do the reboot.

From the manual:

LUA_ERRMEM: memory allocation error. For such errors, Lua does not call the message handler.

It looks like we just need to call the onerror directly after the call.

TerryE on 14 May 2020

OK, that works.

> a={}  collectgarbage();  print(node.heap())
40752
> tmr:create():alarm(500,0,f)
> E:M 32784
E:M 32784
out of memory

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)
...

It's a feature of Lua that event handlers are _not_ called on "out of memory" conditions, but at least node.setonerror() is still honoured, so if you do the same with it set to print then you have lost the call stack, but you can still try to diagnose what triggered the exhaustion at the interactive prompt.

PS: the above diagnostics were with Lua 5.3, but its the same with Lua 5.1.

TerryE on 14 May 2020

👍1

BTW, the above comments had absolutely nothing to do with this issue so I have moved them onto a separate one: #3101

TerryE on 19 May 2020

hi when will node.setonerror() be available in custom build online...?

chathurangawijetunge on 28 May 2020

@chathurangawijetunge There's no better answer than "when it's ready". @TerryE is moving heaven and earth with his changes to the Lua core of NodeMCU and we're all eagerly awaiting his work landing on dev.

nwf on 28 May 2020

@chathurangawijetunge, I've got this all working on my working branch, but we want to do the next cut to master before adding this PR.

TerryE on 28 May 2020

PR has been merged. On to next functional tranche :smile:

TerryE on 16 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

ws2812.init() cause esp12-e reboot

sivix · 4Comments

Additional documentation for Lua 5.3

TerryE · 7Comments

After second reboot/reset modules lose their settings obtain by enduser_setup

Michal78S · 6Comments

Bump SDK to 3.0.4

marcelstoer · 4Comments

SNTP Feature Request

ShAzmoodeh · 6Comments

Nodemcu-firmware: Error Handling in Lua

3075 introduced the new luaL_callx() to facilitate improved Panic Handling. I want to use this issue to discuss error handling in more general terms and why the focus on panic handling.

Scope of Work

Most helpful comment

All 26 comments

Related issues

3075 introduced the new `luaL_callx()` to facilitate improved Panic Handling. I want to use this issue to discuss error handling in more general terms and why the focus on panic handling.