Nodemcu-firmware: Error Handling in Lua

Created on 28 Apr 2020  路  26Comments  路  Source: nodemcu/nodemcu-firmware

3075 introduced the new luaL_callx() to facilitate improved Panic Handling. I want to use this issue to discuss error handling in more general terms and why the focus on panic handling.

  • Caught Errors. The Lua architecture allow the application so establish an error handler via the pcall() and xpcall(), thus creating a _protected environment_ for executing a code hierachy. In such cases the error is returned to the application and the application code itself determines how to process or log the error.
  • Uncaught Errors. Any Lua code executed outside such a protected environment results in a Lua panic, which in standard Lua coding typically terminates the application. In the case of NodeMCU, this triggers a processor restart.

One of the main issues that catches new developers is that whilst the interactive thread established a default protected environment to catch and print errors interactively, any Lua callback runs as separate execution thread and therefore is not protected; any errors here will panic and reboot the processor. The issue that I want to consider here is whether this is the correct behaviour for NodeMCU and if not then how we can improve this in a way which doesn't break existing applications.

We have three broad mechanisms for reporting errors in the firmware:

  1. Print direct to the UART. We adopt this approach during Lua startup because this is the simplest and most robust mechanism that avoids dependencies on services that might not be yet running.
  2. Print the error using the standard print() function. Prior to SDK 3.0 releases, we always did (1) but this meant that errors generated during node.output() where reported only on the UART. print() now sends output to the STDOUT which is in turn emptied to the UART or the node.output() reader using the Pipe module, and this enables error reporting in a terminal session to occur through the session.
  3. Use a separate panic process manger to process uncaught errors. The problem with (2) during panics is that a simple "print error and restart" will queue the error message to STDOUT then restart the processor before the printing task has executed. At a minimum, the restart operation should wait until the STDOUT pipe is empty (or a maximum time elapsed) before restarting the ESP.

In case (3) the default panic process should still print the error, but at least wait until the STDOUT pipe is empty or a maximum time elapsed before restarting the ESP. We should also provide a node.setonerror() call to enable the application to override this default (say to add some form of error logging).

Scope of Work

So I suggest that a good next PR would be to tidy up such error handling:

  • Review Lua code for complaince to above (1)-(3) separation.
  • Modify node.restart() functionality to wait until STDOUT or N sec elapsed before restarting ESP. Add extra delay boolean parameter to enable this mode and document this. _Q: Should this wait until errors flushed be thedefault?_
  • Add node.setonerror() call and document.
  • Modify all module CBs to use luaL_pcallx()
  • Modify debug.debug() function to read from STDIN pipe, and output to print() (Maybe a separate PR.)

Most helpful comment

@nwf wrote:

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon.

I will formulate the PR on this basis. Also since I am making changes to lauxlib and the modules to add the luaL_pallx() calls and l, I will also add the luaL_unref2() API as discussed here in #1028 and tidy up tmr.c as a worked example.

All 26 comments

Modify debug.debug() function to read from STDIN pipe, and output to print() (Maybe a separate PR.)

This one needs a bit of explanation. One of the ways (in standard Lua at least) to debug is use debug.debug as an xpcall() error handler (see PiL 8.5). So if the error occurs 5 Lua call frames down, then the stripped down debug.debug() interpreter runs at call frame 6.

The issue here with NodeMCU is that SDK tasks always start at call stack frame 0 and grow / release the frames until a return from stack frame 0 completes the task returning control to the SDK, and allowing the next callback task to again execute from Lua call-frame zero. In order to make thestack diagnostics available, we would need the interpreter to run at stack frame 6 not 0. This is going to be really difficult to achieve without botching the Lua architecture. This needs quite a bit more though before I could consider doing this. In the meantime we have a lot more valuable low hanging fruit to pick.

One of the points of fragility in the C module is that luaL_ref can throw on OOM (if it fails to expand the Lua table that is the registry, e.g.). Can we catch that exception in C, too?

One of the points of fragility in the C module

@nwf , I hold a dissenting view here. The underlying fragility is that people write apps that exhaust memory. If you've run out of memory to the point that the application can't even grow registry (which should be a fairly small table if the application is correctly written), and it is in this exhausted RAM state, then it is truly f**ked; it isn't going to recover. Returning a status for this one API call when every other Lua API call throws the error isn't going to make anything any less fragile. The best thing to do is to throw an error so that error reporting can give a decent traceback.

As I asked offline give me the link to such an example and I can give you my feedback.

When I develop my apps, I continuously check the registry for growth, e.g.

r=debug.getregistry()
for k,v in pairs(r) do if type(v) == "function" then print (k,v) end end
-- this will list off registered functions and this list is growing then you've got a leak
-- With Lua 5.3 builds you can also get debug info like line numbers and upvalues on these
t=tmr.create()
local cnt = 0; local function counter(t) cnt=cnt+1 end; t:alarm(5000,1,counter)
-- if this is in r[5], say, then
for i=1,10 do local u,v=debug.getupvalue(r[5],i);if v then print(u,v) end end
cnt 16
for i=1,10 do local u,v=debug.getupvalue(r[5],i);if v then print(u,v) end end
cnt 19

An example is the proposed #2854 PR: the constructor began with a call to malloc and then luaL_ref'd a callback. I think the solution there is to carefully stage the creation of a userdata with a __gc metamethod that unhooks things and to immediately install any luaL_ref'd result into said userdata, and I've suggested that change.

As the code was written, a failure of luaL_ref would have exacerbated an OOM situation by leaking the ping_t structure allocated at the beginning.

@nwf, Thanks Nathaniel. I will do a review of the module and feed in comments.

One of the points of fragility in the C module is that luaL_ref can throw on OOM

@nwf, I've been considering this issue. This "throw on memory error" is the same for all C API calls which can allocate collectable resources such as extra stack, strings, tables or additional table entries and userdata. In general resources created on the stack are cleaned up by the Lua CG after exiting the routine (and using a userdata this way is actually the simplest way of mallocing temporary data).

The issue with creating resources in the registry is that these will persist if the routine throws an error for whatever reason, in that registry entries will persist even if an error is thrown, leading to memory leakage of dead resources. Something that I propose to clean up in #1028, but you and I should agree some best practice patterns for coding this. But another peeve of mine if that a lot of our code tests for error values and error logic on return even though these call throw an error and will _never_ return an error value.

node.setonerror() (with no function argument) sets the default on error option, which is to print the error, but also to kick off a 2sec timer which restarts the processor after this delay This is to ensure that error can printed or be sent to output redirection before the restart occurs. If you are debugging and watching UART0, then you also have the option of doing a node.setonerror(print) which will print the error without restarting, or even use your own Lua function to do something else.

One of the complications here is that we need a unified error architecture across both the Lua 5.1 and 5.3 VMs so that the rules are the same for modules and independent of which Lua version used. I've had this challenge in other functional areas as well such as table and function handling. A consequence of this is that for from Lua 5.1 being in "frozen" support, it is receiving a constant "bleed-in" of Lua 5.3 goodies.

@nwf, @HHHartmann, I've set up node.setonerror() as discussed above, and also review all of the lua_call() uses in the modules. The majority just do the call and then immediately exit, so we can just do a global substitute in these for luaL_pcallx(). There are a few (dcc,file, net, sjson, softuart, wifi) which do a bit of cleanup after the call which need to be guarded by the pcallx status test as this now returns instead of an immediate panic.

I can include all of these changes as a single batch PR or alternatively just change one module on this PR (e.g. tmr)so that you can evaluate its behaviour and then do all of the others as a separate PR. What are your thoughts?

Well this seems to working fine. No more undiagnosed tracebacks in CBs and optionally being able to turn off panic reboot.

> file.putcontents('f.lua', 'function f() g() end')
> dofile'f.lua'
> node.setonerror(print)
> tmr.create():alarm(1000,0,f)
> f.lua:1: attempt to call global 'g' (a nil value)
stack traceback:
    f.lua:1: in function <f.lua:1>

> node.setonerror()
> tmr.create():alarm(1000,0,f)
> f.lua:1: attempt to call global 'g' (a nil value)
stack traceback:
    f.lua:1: in function <f.lua:1>

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x40100000, len 31864, room 16 
...

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon (@marcelstoer, yes?).

we are currently obligated to not merge anything to dev so that we can cut master soon

https://github.com/nodemcu/nodemcu-firmware#releases
https://github.com/nodemcu/nodemcu-firmware/issues/3066#issuecomment-626370485

@marcelstoer we are on the same page. A target date would be nice.

@nwf wrote:

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon.

I will formulate the PR on this basis. Also since I am making changes to lauxlib and the modules to add the luaL_pallx() calls and l, I will also add the luaL_unref2() API as discussed here in #1028 and tidy up tmr.c as a worked example.

Adding the luaL_reref API would also be nice; it would let us tidy up a lot of places. :)

Already done :smile:

I will tidy up a couple of the modules using luaL_reref and luaL_unref2 as worked examples but let's leave the bulk until after you have had a review of these and are happy.

@nwf, IIRC you are working on mqtt, so I haven't done the luaL_pallx() changes to this module. Ditto tls

That's fine; I can take those. :)

@nwf One note whilst I think on. We don't want to convert all lua_call() instances, just those (and in the majority) that are executed via a SDK callback, that is those where the execution path uses lua_getstate() to acquire the lua state. The remainder (for example in sjson) are invoked from the Lua application and may already be protected by a pcall() so we don't want to turn these into a panic.

@nwf I have come across an issue in my testing and that is when the error is triggered memory exhaustion, for example:

> a={}; function f() for i=1,3000 do a[i]=0 end end 
> f()
E:M 32784
E:M 32784
Lua error:  not enough memory
> print(#a) a=nil  -- still got a in memory
2048
> collectgarbage();  print(node.heap())
40784
> tmr:create():alarm(500,0,f) -- now run out mem in a CB
> E:M 32784
E:M 32784
> -- this should have done the TB and printed it, then rebooted but it didn't
> a= nil; collectgarbage(); return node.heap()
40760

The issue is that the error handler is running at the deepest stack frame and I think that I need to run the GC before I try to create the closure for the post and do the reboot.

From the manual:

LUA_ERRMEM: memory allocation error. For such errors, Lua does not call the message handler.

It looks like we just need to call the onerror directly after the call.

OK, that works.

> a={}  collectgarbage();  print(node.heap())
40752
> tmr:create():alarm(500,0,f)
> E:M 32784
E:M 32784
out of memory

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)
...

It's a feature of Lua that event handlers are _not_ called on "out of memory" conditions, but at least node.setonerror() is still honoured, so if you do the same with it set to print then you have lost the call stack, but you can still try to diagnose what triggered the exhaustion at the interactive prompt.

PS: the above diagnostics were with Lua 5.3, but its the same with Lua 5.1.

BTW, the above comments had absolutely nothing to do with this issue so I have moved them onto a separate one: #3101

hi when will node.setonerror() be available in custom build online...?

@chathurangawijetunge There's no better answer than "when it's ready". @TerryE is moving heaven and earth with his changes to the Lua core of NodeMCU and we're all eagerly awaiting his work landing on dev.

@chathurangawijetunge, I've got this all working on my working branch, but we want to do the next cut to master before adding this PR.

PR has been merged. On to next functional tranche :smile:

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Michal78S picture Michal78S  路  6Comments

HHHartmann picture HHHartmann  路  7Comments

tibinoo picture tibinoo  路  5Comments

dicer picture dicer  路  6Comments

ShAzmoodeh picture ShAzmoodeh  路  6Comments