luaL_callx() to facilitate improved Panic Handling. I want to use this issue to discuss error handling in more general terms and why the focus on panic handling.pcall() and xpcall(), thus creating a _protected environment_ for executing a code hierachy. In such cases the error is returned to the application and the application code itself determines how to process or log the error.One of the main issues that catches new developers is that whilst the interactive thread established a default protected environment to catch and print errors interactively, any Lua callback runs as separate execution thread and therefore is not protected; any errors here will panic and reboot the processor. The issue that I want to consider here is whether this is the correct behaviour for NodeMCU and if not then how we can improve this in a way which doesn't break existing applications.
We have three broad mechanisms for reporting errors in the firmware:
print() function. Prior to SDK 3.0 releases, we always did (1) but this meant that errors generated during node.output() where reported only on the UART. print() now sends output to the STDOUT which is in turn emptied to the UART or the node.output() reader using the Pipe module, and this enables error reporting in a terminal session to occur through the session.In case (3) the default panic process should still print the error, but at least wait until the STDOUT pipe is empty or a maximum time elapsed before restarting the ESP. We should also provide a node.setonerror() call to enable the application to override this default (say to add some form of error logging).
So I suggest that a good next PR would be to tidy up such error handling:
node.restart() functionality to wait until STDOUT or N sec elapsed before restarting ESP. Add extra delay boolean parameter to enable this mode and document this. _Q: Should this wait until errors flushed be thedefault?_node.setonerror() call and document.luaL_pcallx()debug.debug() function to read from STDIN pipe, and output to print() (Maybe a separate PR.)
Modifydebug.debug()function to read from STDIN pipe, and output toprint()(Maybe a separate PR.)
This one needs a bit of explanation. One of the ways (in standard Lua at least) to debug is use debug.debug as an xpcall() error handler (see PiL 8.5). So if the error occurs 5 Lua call frames down, then the stripped down debug.debug() interpreter runs at call frame 6.
The issue here with NodeMCU is that SDK tasks always start at call stack frame 0 and grow / release the frames until a return from stack frame 0 completes the task returning control to the SDK, and allowing the next callback task to again execute from Lua call-frame zero. In order to make thestack diagnostics available, we would need the interpreter to run at stack frame 6 not 0. This is going to be really difficult to achieve without botching the Lua architecture. This needs quite a bit more though before I could consider doing this. In the meantime we have a lot more valuable low hanging fruit to pick.
One of the points of fragility in the C module is that luaL_ref can throw on OOM (if it fails to expand the Lua table that is the registry, e.g.). Can we catch that exception in C, too?
One of the points of fragility in the C module
@nwf , I hold a dissenting view here. The underlying fragility is that people write apps that exhaust memory. If you've run out of memory to the point that the application can't even grow registry (which should be a fairly small table if the application is correctly written), and it is in this exhausted RAM state, then it is truly f**ked; it isn't going to recover. Returning a status for this one API call when every other Lua API call throws the error isn't going to make anything any less fragile. The best thing to do is to throw an error so that error reporting can give a decent traceback.
As I asked offline give me the link to such an example and I can give you my feedback.
When I develop my apps, I continuously check the registry for growth, e.g.
r=debug.getregistry()
for k,v in pairs(r) do if type(v) == "function" then print (k,v) end end
-- this will list off registered functions and this list is growing then you've got a leak
-- With Lua 5.3 builds you can also get debug info like line numbers and upvalues on these
t=tmr.create()
local cnt = 0; local function counter(t) cnt=cnt+1 end; t:alarm(5000,1,counter)
-- if this is in r[5], say, then
for i=1,10 do local u,v=debug.getupvalue(r[5],i);if v then print(u,v) end end
cnt 16
for i=1,10 do local u,v=debug.getupvalue(r[5],i);if v then print(u,v) end end
cnt 19
An example is the proposed #2854 PR: the constructor began with a call to malloc and then luaL_ref'd a callback. I think the solution there is to carefully stage the creation of a userdata with a __gc metamethod that unhooks things and to immediately install any luaL_ref'd result into said userdata, and I've suggested that change.
As the code was written, a failure of luaL_ref would have exacerbated an OOM situation by leaking the ping_t structure allocated at the beginning.
@nwf, Thanks Nathaniel. I will do a review of the module and feed in comments.
One of the points of fragility in the C module is that
luaL_refcan throw on OOM
@nwf, I've been considering this issue. This "throw on memory error" is the same for all C API calls which can allocate collectable resources such as extra stack, strings, tables or additional table entries and userdata. In general resources created on the stack are cleaned up by the Lua CG after exiting the routine (and using a userdata this way is actually the simplest way of mallocing temporary data).
The issue with creating resources in the registry is that these will persist if the routine throws an error for whatever reason, in that registry entries will persist even if an error is thrown, leading to memory leakage of dead resources. Something that I propose to clean up in #1028, but you and I should agree some best practice patterns for coding this. But another peeve of mine if that a lot of our code tests for error values and error logic on return even though these call throw an error and will _never_ return an error value.
node.setonerror() (with no function argument) sets the default on error option, which is to print the error, but also to kick off a 2sec timer which restarts the processor after this delay This is to ensure that error can printed or be sent to output redirection before the restart occurs. If you are debugging and watching UART0, then you also have the option of doing a node.setonerror(print) which will print the error without restarting, or even use your own Lua function to do something else.
One of the complications here is that we need a unified error architecture across both the Lua 5.1 and 5.3 VMs so that the rules are the same for modules and independent of which Lua version used. I've had this challenge in other functional areas as well such as table and function handling. A consequence of this is that for from Lua 5.1 being in "frozen" support, it is receiving a constant "bleed-in" of Lua 5.3 goodies.
@nwf, @HHHartmann, I've set up node.setonerror() as discussed above, and also review all of the lua_call() uses in the modules. The majority just do the call and then immediately exit, so we can just do a global substitute in these for luaL_pcallx(). There are a few (dcc,file, net, sjson, softuart, wifi) which do a bit of cleanup after the call which need to be guarded by the pcallx status test as this now returns instead of an immediate panic.
I can include all of these changes as a single batch PR or alternatively just change one module on this PR (e.g. tmr)so that you can evaluate its behaviour and then do all of the others as a separate PR. What are your thoughts?
Well this seems to working fine. No more undiagnosed tracebacks in CBs and optionally being able to turn off panic reboot.
> file.putcontents('f.lua', 'function f() g() end')
> dofile'f.lua'
> node.setonerror(print)
> tmr.create():alarm(1000,0,f)
> f.lua:1: attempt to call global 'g' (a nil value)
stack traceback:
f.lua:1: in function <f.lua:1>
> node.setonerror()
> tmr.create():alarm(1000,0,f)
> f.lua:1: attempt to call global 'g' (a nil value)
stack traceback:
f.lua:1: in function <f.lua:1>
ets Jan 8 2013,rst cause:2, boot mode:(3,6)
load 0x40100000, len 31864, room 16
...
@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon (@marcelstoer, yes?).
we are currently obligated to not merge anything to
devso that we can cutmastersoon
https://github.com/nodemcu/nodemcu-firmware#releases
https://github.com/nodemcu/nodemcu-firmware/issues/3066#issuecomment-626370485
@marcelstoer we are on the same page. A target date would be nice.
@nwf wrote:
@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon.
I will formulate the PR on this basis. Also since I am making changes to lauxlib and the modules to add the luaL_pallx() calls and l, I will also add the luaL_unref2() API as discussed here in #1028 and tidy up tmr.c as a worked example.
Adding the luaL_reref API would also be nice; it would let us tidy up a lot of places. :)
Already done :smile:
I will tidy up a couple of the modules using luaL_reref and luaL_unref2 as worked examples but let's leave the bulk until after you have had a review of these and are happy.
@nwf, IIRC you are working on mqtt, so I haven't done the luaL_pallx() changes to this module. Ditto tls
That's fine; I can take those. :)
@nwf One note whilst I think on. We don't want to convert all lua_call() instances, just those (and in the majority) that are executed via a SDK callback, that is those where the execution path uses lua_getstate() to acquire the lua state. The remainder (for example in sjson) are invoked from the Lua application and may already be protected by a pcall() so we don't want to turn these into a panic.
@nwf I have come across an issue in my testing and that is when the error is triggered memory exhaustion, for example:
> a={}; function f() for i=1,3000 do a[i]=0 end end
> f()
E:M 32784
E:M 32784
Lua error: not enough memory
> print(#a) a=nil -- still got a in memory
2048
> collectgarbage(); print(node.heap())
40784
> tmr:create():alarm(500,0,f) -- now run out mem in a CB
> E:M 32784
E:M 32784
> -- this should have done the TB and printed it, then rebooted but it didn't
> a= nil; collectgarbage(); return node.heap()
40760
The issue is that the error handler is running at the deepest stack frame and I think that I need to run the GC before I try to create the closure for the post and do the reboot.
From the manual:
LUA_ERRMEM: memory allocation error. For such errors, Lua does not call the message handler.
It looks like we just need to call the onerror directly after the call.
OK, that works.
> a={} collectgarbage(); print(node.heap())
40752
> tmr:create():alarm(500,0,f)
> E:M 32784
E:M 32784
out of memory
ets Jan 8 2013,rst cause:2, boot mode:(3,6)
...
It's a feature of Lua that event handlers are _not_ called on "out of memory" conditions, but at least node.setonerror() is still honoured, so if you do the same with it set to print then you have lost the call stack, but you can still try to diagnose what triggered the exhaustion at the interactive prompt.
PS: the above diagnostics were with Lua 5.3, but its the same with Lua 5.1.
BTW, the above comments had absolutely nothing to do with this issue so I have moved them onto a separate one: #3101
hi when will node.setonerror() be available in custom build online...?
@chathurangawijetunge There's no better answer than "when it's ready". @TerryE is moving heaven and earth with his changes to the Lua core of NodeMCU and we're all eagerly awaiting his work landing on dev.
@chathurangawijetunge, I've got this all working on my working branch, but we want to do the next cut to master before adding this PR.
PR has been merged. On to next functional tranche :smile:
Most helpful comment
@nwf wrote:
I will formulate the PR on this basis. Also since I am making changes to lauxlib and the modules to add the
luaL_pallx()calls and l, I will also add theluaL_unref2()API as discussed here in #1028 and tidy uptmr.cas a worked example.