Although this issue links to earlier discussions in #1289 and #1661, I see this as a _policy_ issue mainly for the committers, so can you all read this and give your comments so we can move forward on the basis of some form of consensus?
Of the ~45Kb RAM available on the ESP8266, typically half or more of this RAM is Lua compiled code and constant data as opposed to true R/W data. The facility to move Lua binary code into Flash _will more than double the effective RAM available to programmers_.
A hierarchy of function prototypes and their associated vectors (constants, instructions, metadata for debug) is loaded into RAM when any Lua source or lc file is loaded into memory. Because, in the Lua architecture, each Proto hierarchy can be bound to multiple closures (this closure creation is only done by executing the CLOSURE statement at runtime), such hierarchies are intrinsically read-only and therefore in principle ROMable.
The main complication here is that, like all other Lua resources, Proto hierarchies are garbage-collectable (and advanced Lua programmers exploit this collection). So IMO, the difficulties arise when devising the details of how any compiled Lua in ROM interacts nicely and stably with the GC: it's fairly straightforward to implement a scheme which works _mostly_, but we need one which works _all of the time_ in a well determined manner if we proceed with this.
I haven't worked out a robust way of doing an incremental storage system, as Phil discusses in #128, and IMO this will be hard to realise. What I have worked out how to do is essentially a "freeze into flash, then reboot" approach.
- [...] `lua` or `lc` files.
- [...] `node.rebuild_flash()` function supplying a list of lua files that you want included in the ROM. This `rebuild_flash` call should preferably be made just after reboot. This call then _either_ rebuilds the flash block and reboots the ESP immediately on completion, _or_ leaves the flash block unchanged and reboots with an error status.
- [...] `require` path and so can be executed by a `require` statement; `loadfile` and `dofile` will also parse `rom:module` syntax and return or execute a closure accordingly.
- [...] `rebuild_flash` routine unhooks the current ROM table, and does two passes of loading the modules.
- [...] `Proto` hierarchies; (ii) to fill the RAM string table with all the strings needed to store the hierarchies.
- [...] `Proto` hierarchies are now in ROM, these hierarchies can now persist over reboot, and only the closure-based resources will occupy RAM.

This process is simple and robust, but the Lua RTS is built around the assumption that collectable objects don't move their location and that strings are interned. It will be impossible to return control to the invoking Lua after a successful load, and difficult to return control after a failed one, which is why this "reload flash and immediately reboot" option is the most robust.
This system would enable Lua programmers to compile and execute significantly larger Lua programs within the ESP resources.
There are some extra wrinkles for the Lua 5.3 environment but I will park these for now. So comments so far?
With the version that I built, I wasn't convinced that you could entirely rom-ify some functions (without a lot of hacking in the GC). The issue is that some objects need to be followed by the GC process, and so cannot be romified. [My memory is a little hazy, and I don't know if you can actually write such a function. My implementation had all sorts of corner cases where some of the internal vectors could not be rommed]
It's quite straightforward to get the Proto hierarchy depending only on ROM resources, so you can safely short circuit the GC sweeping here. IIRC, your implementation also pushes closures into ROM, and this is bad news, IMO, because the upvalue chain will invariably point back into RAM.
I'll do the usual trick of adding this code to a vanilla Lua 5.1 version first and hammering it there on the PC before porting it to NodeMCU. It's just a pity that MMAPing a file into an absolute address window with the MAP_FIXED attribute is such a dog on modern OS kernels which use address space randomisation. But getting around this is only a few dozen lines of code. At least this way you can hammer out the GC interactions using a decent gdb implementation.
But to my core Qs, if we can get a robust approach here, should we push it through dev to master?
As someone who often exhausts the heap, I'd be for such a change, assuming it's not crazy to support. The only downside, other than code complexity, perhaps, that leaps to mind is that running from flash means that it's less clear ahead of time when the flash chip will be engaged, so anyone trying to use the flash-associated GPIOs might be in for a surprise.
Not to ask a stupid question, but does anything in lua 5.2 or 5.3 make this easier? (I recall reading somewhere that there was an effort to bring nodemcu over to 5.3; am I just making that up?)
Getting Lua running out of flash would be a big boon given the RAM constraints.
My Qs/concerns are:
- [...] .text" or so)? If the former, would we need to do something to deal with uneven wear? I'd very much like to avoid that if possible. Having the "frozen" flash block just hanging off the end of the firmware would move it about depending on which modules were compiled in at least, which is probably Good Enough(tm).
- [...] `require('rom:mymodule')` to look into flash? If not, can we make it so? I could foresee a case where you have a read-only filesystem from which you've "frozen" some code over into flash, and if you then just `require('mymodule')` you'd need to search the frozen area first, but if we do that then you're likely to end up surprised because you're loading old code even though your .lua (or .lc) has been updated. Of course, it could be argued we're sticking to tradition given we already have the issue with .lc vs .lua...
- [...] `rom:module` syntax, would it be possible to hook the "freezer" into the VFS read-only? That way you could simply do `require('/freezer/module')` or such, and would avoid introducing another namespace.
- [...] `Proto` in order to be able to read them straight from flash? How much impact would that have in cases where the freezer isn't used?

Sorry for the long reply guys, but I've tried to cover all of Johny's and Nathaniel's points below.
I will do a separate update on the 5.3 work on #1661, but as to the specifics of this functionality, like Johny I see addressing the RAM constraints as a key criterion for the success of NodeMCU Lua, so my original intent was only to add this functionality to 5.3. What I've raised here is essentially a backport of the technology to 5.1. Unlike the rest of the 5.3 work, which from a user perspective is either out-of-the-box 5.3 functionality or compatibility with the existing NodeMCU/eLua module API, this is new and pretty decoupled from the rest of the 5.3 work.
Moving it into 5.1 allows a more vigorous engagement with the current developer community to get a consensus on how this API should work -- as well as bringing forward the benefits to the community.
I see this as less of an issue for a number of reasons. The current Winbond chips such as the W25QxxFV series quote a life of more than 100,000 erase/program cycles, and even if the modules have the earlier generation NAND flash chips with a 10K cycle life, say, the mode that I suggest we use here, which is essentially a reboot-reload-reboot cycle, envisages a use case more similar to the conventional C rebuild-and-flash life cycle. Even during active development the module might see 10 reloads a day, and in production maybe 1 a month, so this isn't going to be an issue. It would also be trivial to consider refinements such as the SPIFFS_FIXED_LOCATION type parameter.
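The wear estimate above is simple division, and a rough sketch makes the margin explicit. The helper name is hypothetical, and it assumes each reload costs one erase cycle of the block, which may understate the real figure if a rebuild erases the block more than once.

```c
#include <assert.h>

/* Days until a flash block reaches its rated erase/program cycle count.
 * Assumption (not from the thread): one erase of the block per reload. */
static long days_until_worn(long rated_cycles, long reloads_per_day)
{
    assert(rated_cycles >= 0 && reloads_per_day > 0);
    return rated_cycles / reloads_per_day;
}
```

With the figures quoted above, a 10K-cycle part reloaded 10 times a day lasts 1,000 days of continuous development, and a 100,000-cycle part lasts ten times that.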
The Lua require system uses a set of helpers defined in the package.loaders array. These can even be changed at a Lua programming level. Even so I recommend that the default order should be that the ROM should be searched first, for performance reasons, but that in a development mode the developer would be free to do in init.lua
```lua
local pl=package.loaders; pl[3],pl[2] = pl[2],pl[3]
```
to reverse this search order. And note that you can _only_ specify the module name as a require parameter; it is the loaders (or searchers in Lua 5.3) that determine where to look.
The load functions are different in that these don't have a searcher concept and so we need some simple method of encapsulating access to the ROM store at a Lua API level. Also accessing the ROM store is fundamentally different from any of the load functions (load, loadfile, loadstring, dofile) as these all execute a load operation which is expensive at runtime. The ROM store contains a set of compiled Proto hierarchies in memory, and all that is needed to convert them into a closure (which is a "function" in Lua terms) is to execute the CLOSURE VM statement, and this needs a few lines of C to be executed as Protos are hidden from the Lua execution world.
This is why I would promote the use of modules rather than functions in the store, as this is more transparent and fits better into the Lua paradigm. Nonetheless if we want to make a more transparent method of loading functions from the store or VFS, whatever we do is going to be slightly a botch because internally within the relevant load function this isn't encapsulated within the vfs, because you don't actually do a load with a stored routine. _It's already loaded, just not bound to a closure._
We have exactly the same issue today with lc vs lua loads. The standard API leaves the handling of this and any precedence issues to the Lua programmer. I would just add rom to the list and still leave it to the programmer. My own standard template is to use an autoloader for my functions to hide all of the error handling and precedence issues. I myself would just extend this with one line:
```lua
setmetatable( self, {__index=function(self, func) --upval: loadfile
    func = self.prefix .. func
    local f, msg
    if not skiprom or not skiprom[func] then f = getrom(func) end -- handle ROM load
    if not f then f,msg = loadfile( func..".lc") end
    if msg then f, msg = loadfile(func..".lua") end
    if msg then error (msg,2) end
    if func:sub(8,8) ~= "_" then self[func] = f end
    return f
  end} )
```
or actually the ROM optimised version which saves about 300 bytes RAM:
```lua
-- skiprom if defined is a global
setmetatable( self, {__index=getrom("autoloader")} )
```
The getrom function returns nil if the Proto isn't in the store, so this is the cheapest method of checking for existence. However we could also have a debug.getromprotos in the same way we have debug.getregistry, though this would have to return a list of names since the Proto values are meaningless in Lua.
Yes, access from Flash is roughly 13-25× slower than from RAM in the case of a cache miss, but at the moment executing every Lua VM instruction (4 bytes) involves reading 100s of bytes of xtensa instructions from Flash to interpret this one instruction. However, flash access is RAM cached and this reduces the overall impact, though accessing from scattered flash address regions will increase cache fault rates, so IMO moving code into rom will slightly decrease instruction execution performance.
But we also have to balance the slight increase of runtime in accessing rom-based constants and strings with the fact that all of these resources are in ROM and have been removed from the scope of the GC, so GC sweeps will be a lot shorter and required less often. A big runtime saving.
Also, the RAM limitations mean that non-trivial Lua programs involve a lot of dynamic loading of code from SPIFFS which is _slow_ because of the double whammy of SPIFFS overheads and the Lua load process. Converting ROM Protos to encapsulated functions is _fast_.
So I believe that the average Lua application will run faster overall.
Unaligned (in the Lua RTS nearly all byte) fetches are slow because of the overhead of the unaligned exception handler. -O2 instead of -Os helps for general string access but not for this. However there are (inline macro assembler) techniques we can use to replace unaligned fetches by a two instruction aligned fetch and extract. But I see this as a second order optimisation for later.
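As an illustration of the "aligned fetch and extract" idea, here is a minimal C sketch rather than the inline assembler version mentioned above. The function name is hypothetical, and it assumes a little-endian target (which the ESP8266 is) and that the whole aligned word containing the byte may be read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one byte using a 32-bit aligned load followed by shift-and-truncate,
 * so that no byte-wide load is issued against the flash address range.
 * Assumes little-endian byte order. */
static inline uint8_t aligned_read_byte(const void *p)
{
    assert(p != NULL);
    uintptr_t addr = (uintptr_t)p;
    /* Round the address down to the enclosing 4-byte boundary. */
    const uint32_t *word = (const uint32_t *)(addr & ~(uintptr_t)3);
    /* Position of the wanted byte within that word, in bits. */
    unsigned shift = (unsigned)(addr & 3u) * 8u;
    return (uint8_t)(*word >> shift);
}
```

A macro wrapping this could stand in for plain `*p` reads in the hot paths, leaving the exception handler as the fallback for everything else.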
Thanks much for the very detailed response!
Nice comprehensive response, cheers. Just a minor comment regarding unaligned stuff; I really did mean unaligned (exception code 9), rather than the sub-32bit-wide loads ("load/store error", code 3). The latter we have our custom exception handler to patch up and recover with. Unaligned 32bit access however would still be fatal. It may not be an issue as you say _"nearly all byte"_, but worth keeping an eye on.
I'm in favour of the approach outlined here. Obviously we'd need to have good docs explaining how to use it, when the time comes :)
Ah, and one more question: a 64k block is obviously larger than the amount of free RAM we currently have, and thus would likely go partially unused no matter how badly one tries to move code to it. Any ideas on how to get the best use out of it? I know you ruled out incremental freezes above...
It's late for me so a quick response. I understand your exception code 9 point and I will check this, but I don't believe that this is an issue.
Re 64Kb, the reason for two passes is to serialise the load process. There are two constraining factors: the size of the string table, and the size of the largest module that you need to load, because each file is loaded into RAM, then cloned into flash. We need to play to see how much of a constraint this is in practice, but whatever it is, it's still a _lot_ better than current constraints.
For those concerned with performance, I would argue that one of the key usecases for the ESP8266 (& ESP32) is IoT applications. For the majority of cases that I can think of, fast execution is not critical. So even if this resulted in a slightly slower execution time, I would be ok with it.
I can see great benefit from this feature:
These are just 3 benefits that completely justify such an initiative.
I would be curious what the effort estimate is for completing this task. Are we talking weeks or months (for a developer)?
I believe we run the risk of losing many community members for the ESP8266 if we don't solve the RAM shortage which appears to affect proper support of secured connections.
The issue isn't so much absolute hours, but elapsed. The internals of the Lua engine are both subtle and complex, and I (TerryE) seem to have drawn the short straw to get to grips with all this. The issue is that all of this work is unfunded and done in my spare time, threaded amongst my other commitments like finishing off a house that my wife and I are building -- and doing the Home Automation for the same which needs its own ESP code. But as to your core point it's man-days of work (my being a male) rather than person-weeks, as a lot of the foundation work is already done as part of my Lua 5.3 upgrade for NodeMCU.
Would it help you if some of us were helping you with the funding? We could gather a few volunteers to help out with donations. I am willing to help if I know this feature will result in fixing the current problem with secured tcp connection that started with the SDK 2.x on the ESP8266.
@dtran123, nah. don't need the dosh. I spent 35 years in IT and ended up on top of the techie shit-heap. I am now a gentleman living on a sinecure (pension). It's hours in the day and priorities that are my problems :wink: Let me crack on, whilst the brain is still working.
@dtran123 Increasing available heap may help with secure SSL, but as I reported in #1707, I think it is already viable to work with (and verify) ECC keys instead of RSA keys. If you control both ends, I think this is the most immediate path forward. You'll have to tweak the mbedtls configuration file as done in https://github.com/nwf/nodemcu-firmware/commit/c1ed48c09a2fafc85e53febf1298ae945da09531 (and likely want to cherry-pick @djphoenix's update to mbedtls first, https://github.com/djphoenix/nodemcu-firmware/commit/4958a4a12a16d91d58ec73652a0b1ddc4df4f6fa) and/or see if @marcelstoer can add some checkboxes to the web builder to achieve the same effect.
@TerryE Please don't take any of that to mean that I amn't rooting for your success. If not donations, perhaps a beverage of your choosing if we're ever in the same place. :)
I want to support this in any way I can.
Donations, testing, beer... please let me know if I can help! 👍
At the moment, I am thinking about work-arounds for some interesting catch-22s thrown up in standard Lua testing. The clone to flash process destructively overwrites the old version of the cloned ROstrt, but this in turn was a clone of an earlier version of the strt, so contains all of the strings like package.loaders keys, and I need these to persist in the same locations, so that the loading code itself doesn't fall over. So I can't quite use a simple serial allocator for flash. It's an issue that I will need to address anyway with 5.3, but this is one to solve during some ZZZZZZZ or over a glass of wine, and not in the editor :smile:
Would it help to have two segments of the in-flash data? One which was objects whose positions needed to remain invariant across updates, and one which could be overwritten at will at each clone? I presume the former can be relatively small and so loaded into RAM at the start of cloning, and then written back to flash only after the second segment has been constructed and any requisite additions made?
Close. My current approach is to treat the first boot after flashing the Lua firmware as special. This is partly to solve some issues in the NodeMCU 5.3 version where you can declare TStrings at compile time. The RTS performs library initialisation then executes a clone before starting Lua execution. This just clones the base string table. The addresses of this first tranche of TStrings are then preserved across subsequent clones, so the tables which use them are OK.
Time for bed for me, as I am on UTC+0
Please see my paper on this approach: LRO Functions in NodeMCU Lua. Sorry, it includes some typos and other errors, but I will fix these if I update it following any review comments.
@TerryE This looks really well thought-out. I very much like the flash-block lifecycle tracking trick (1F -> 7 -> 3) and the multi-reboot design seems like it will work well without being too complicated.
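For readers wondering why a 1F -> 7 -> 3 sequence suits a flash-resident status flag: programming flash without an erase can only clear bits, and each value in that sequence is reachable from the previous one by clearing bits alone. The sketch below models this; the function names are hypothetical and the meaning of each state is not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Model of programming a flash byte without erasing it first:
 * bits can go 1 -> 0 but never 0 -> 1, so the result is an AND. */
static uint8_t flash_program(uint8_t stored, uint8_t value)
{
    return stored & value;
}

/* A transition needs no erase only if it clears bits and sets none. */
static int reachable_without_erase(uint8_t from, uint8_t to)
{
    return (from & to) == to;
}

/* Walk the 1F -> 7 -> 3 sequence and check each step. */
void lifecycle_demo(void)
{
    uint8_t flag = 0x1F;
    flag = flash_program(flag, 0x07);
    assert(flag == 0x07);
    flag = flash_program(flag, 0x03);
    assert(flag == 0x03);
    /* Going backwards would require erasing the block. */
    assert(!reachable_without_erase(0x03, 0x07));
}
```

This lets the lifecycle state survive each of the reboots without costing an extra erase cycle per state change.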
ETA: Is there any way we could compute the flash block on the host as part of the image build? Obviously not exclusively, given the intended node.rebuild_flash() API, but in addition, perhaps?
I already have a lot of code that can do this in a 5.3 environment, but then again the standard NodeMCU make generates a luac which runs on the host and does the same as the current luac.cross, except for a -X option which allows you to run NodeMCU Lua on the host. This makes it simpler to implement this, but there's no reason in principle why the equivalent shouldn't be done for 5.1, but let's get the on-chip version working and released first.
I have been thinking about this cross build issue. It would in principle be a straightforward variation. The way to do this would be to extend luac with an option -F <stringlist> where the string list would just be a text file list of strings to be included in the ROstrt, so
```
luac.cross -O flash.bin -F default_strings.txt MyProj/*.lua
```
might build a flash image for downloading based on the Lua files in MyProj. A couple of wrinkles here: (i) luac.cross uses host-native pointers for its in-memory data references, which are typically 64bit these days, not 32, so some resizing would need to be done. (ii) I'd need a relocatable format for these flash binaries. These are both trivial to address, but I don't want to get sidetracked on this just yet.
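One way to picture both wrinkles at once is to store every reference as a 32-bit offset from the start of the image and add a base address only when the target location is known. This is a sketch under that assumption, not the format luac.cross uses; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a host pointer into the image buffer to a 32-bit target value.
 * With target_base == 0 the result is a relocatable offset that a loader
 * can fix up later; with a real base it is an absolute target address. */
static uint32_t host_ptr_to_target(const void *p, const void *image,
                                   size_t image_size, uint32_t target_base)
{
    const uint8_t *bp = (const uint8_t *)p;
    const uint8_t *bi = (const uint8_t *)image;
    /* The pointer must lie inside the image being built. */
    assert(bp >= bi && (size_t)(bp - bi) < image_size);
    return target_base + (uint32_t)(bp - bi);
}
```

Keeping offsets in the file and applying the base at load time is what makes the same binary usable at different flash addresses.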
Just throwing another thought into the pot here.
For the modules with a mere 512k flash, would it be feasible to exclude the "freezer" support? Would it make sense to have the interface sitting in e.g. a freezer module with load(), isempty() and clear() functions, and then either #ifdef out or use a zero-size page/area for the storage if the module is not enabled? Just thinking that 64k might prevent people from using the old ESP01 modules with modern NodeMCU otherwise.
@jmattsson, I'd already decided to do that (and also make this option disabled by default in early releases, at least) for two reasons: first, so that those that don't want it don't have the flash overhead, and second just in case we find issues in early testing. If we can conditionally remove the code then we can safely release it into dev.
A second point: what do we formally call this? Philip first proposed the idea and called it "freezer". Do we use that, or do we use "flash"?
I've got the vanilla PC 5.1.5 version working fine now. This does a mmap() of a pseudo flash area into the VM, and then uses mprotect() to turn off write access whilst the VM is running.
I had a bit of fun with the GC as this still tries to mark fixed resources during GC sweeps even if it isn't doing this in string sweeps. This relates to its algo for weak tables, especially kv mode tables that are typically used for memoized functions and ephemeron tables (see PiL 17.2 and 17.3), but I can't see where strings in Flash being fixed would break the application behaviour. More to the point, I very much doubt that _any_ IoT Lua application would fall foul of this, and having double the RAM available would help sweeten the pill if it does.
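A minimal sketch of the mmap()/mprotect() arrangement on Linux follows. It is not the code from the PC build: it uses an anonymous mapping rather than a file, lets the kernel choose the address, and the size and function names are assumptions for illustration.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define PSEUDO_FLASH_SIZE (64 * 1024)   /* assumed size, for illustration */

/* Map a region standing in for flash, copy an image into it, then drop
 * write access so that any stray write faults while the VM is running. */
static void *pseudo_flash_create(const void *image, size_t len)
{
    if (len > PSEUDO_FLASH_SIZE) return NULL;
    void *base = mmap(NULL, PSEUDO_FLASH_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    memcpy(base, image, len);
    if (mprotect(base, PSEUDO_FLASH_SIZE, PROT_READ) != 0) {
        munmap(base, PSEUDO_FLASH_SIZE);
        return NULL;
    }
    return base;
}

/* Load a tiny "image" and read it back through the read-only mapping. */
void pseudo_flash_demo(void)
{
    const char msg[] = "proto";
    const char *rom = pseudo_flash_create(msg, sizeof msg);
    assert(rom != NULL && strcmp(rom, "proto") == 0);
    munmap((void *)rom, PSEUDO_FLASH_SIZE);
}
```

Re-enabling writes for a rebuild would be another mprotect() call with PROT_READ | PROT_WRITE around the clone step.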
The other area is in the cascade clean-up of Proto hierarchies when a closure is GCed, but here the Proto fixing works as intended.
The PC-based version has to support PIC Flash because of Linux address randomisation, and if we start looking at @nwf Nathaniel's suggestion of Host buildable images then we might want to do the same for the NodeMCU version. However, I suggest that we keep the NodeMCU version as simple as possible in its first iteration. I am not going to include byte-access optimisation in this first version so it will hit the unaligned handler, but adding this as a second pass is pretty straightforward.
One thing that did strike me is that as soon as the VM starts running, the minimal string table is around 10Kb. Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app even if it hasn't frozen any code into the flash.
> Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app

👍
I've just been doing an L8UI code review. (This is the instruction that triggers 99+% of the unaligned fetch from flash exceptions.) It isn't too bad at all: the 'hot' modules, lobject.c, lstring.c, ltable.c have only 8, 7, 11 respectively and a couple of simple macro changes will avoid the material ones here, and in one case (luaO_log2()) a reasonable chunk of code is replaced by a single asm instruction.
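To show why a single count-leading-zeros style instruction can replace a table-driven log2, here is a portable sketch. It uses the GCC/Clang builtin `__builtin_clz` as a stand-in for the Xtensa instruction, and the function name is hypothetical. The -1 result for zero is chosen to match what I understand Lua 5.1's `luaO_log2` to return, which is an assumption worth checking against the source.

```c
#include <assert.h>

/* floor(log2(x)) computed from the number of leading zero bits.
 * __builtin_clz is undefined for 0, so that case is handled separately. */
static int log2_via_clz(unsigned int x)
{
    if (x == 0) return -1;
    return (int)(sizeof x * 8) - 1 - __builtin_clz(x);
}

/* Spot checks around byte and power-of-two boundaries. */
void log2_demo(void)
{
    assert(log2_via_clz(1) == 0);
    assert(log2_via_clz(255) == 7);
    assert(log2_via_clz(256) == 8);
}
```

Besides being shorter, this avoids the byte-wide reads of a 256-entry lookup table, which is the part that would trip the flash exception handler.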
lgc.c does a lot of marking and sweeping so accesses here should really avoid byte-based bit diddling in flag byte fields. For example, the compiler generates a 3 instruction sequence to test a single bit: load a byte; shift left the bit into the sign bit and branch if negative -- and this generates an unaligned exception. But the equivalent load 32bit; shift left the bit into the sign bit and branch if negative also takes 3 instructions, executes as fast and doesn't generate the exception.
lstrlib.c has a _lot_ of character based manipulation, especially in the pattern-based searching and matching so ROM-based patterns will be bad news, but it is pretty straightforward to define a macro to clone the string into an alloca()ed copy (if the string is in ROM and less than some safe limit long) and this would be a single line addition per parameter to such hot routines.
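The clone-to-stack idea could look roughly like the macro below. Everything here is a hypothetical sketch: the ROM window variables, the `is_in_rom` test, the 256-byte limit and the demo routine are all assumptions, and on the real target the copy itself would need an aligned-access-friendly copy routine rather than plain `memcpy`.

```c
#include <alloca.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAFE_CLONE_LIMIT 256           /* assumed "safe limit" in bytes */

static uintptr_t rom_begin, rom_end;   /* hypothetical ROM address window */

static int is_in_rom(const char *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= rom_begin && a < rom_end;
}

/* Must be a macro: alloca() memory is released when the function that
 * called it returns, so the allocation has to happen in the hot routine. */
#define CLONE_IF_ROM(s, len) do {                              \
        if (is_in_rom(s) && (len) < SAFE_CLONE_LIMIT) {        \
            char *copy_ = alloca((len) + 1);                   \
            memcpy(copy_, (s), (len));                         \
            copy_[(len)] = '\0';                               \
            (s) = copy_;                                       \
        }                                                      \
    } while (0)

/* Example hot routine: one extra line per string parameter. */
size_t count_char(const char *s, size_t len, char c)
{
    CLONE_IF_ROM(s, len);
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == c) n++;
    return n;
}
```

Strings longer than the limit simply stay in ROM and take the slower path, so the macro never risks a large stack allocation.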
But we can do this sort of optimisation once we've got the basic code working.
Testing this lot is a total bitch. If you use the gdbstub then you can't use uart0 for Lua input or output. So you have to hook up a second USB serial chip to the UART1 and get debug logging that way. I've got two methods of loading code: a RAD cycle based on spiffsimg'ing a small 32Kb FS with various test stubs, and potentially a telnet stub, but you've got to get your execution past the basic bootstrapping processes. I am still fighting EGC issues which are subtly different to the standard Lua, and I am juggling this with all of my other time pressures.
At least the PC version works OK.
It doesn't help that the gdbstub is very fragile: if you can get to a breakpoint then you can examine RAM, but flash-based exceptions just seem to bypass the GDB exception handler entirely, and panic the CPU, so there's no opportunity for PM diagnosis. :disappointed:
Any pearls of wisdom or even sarcasm welcomed :smile:
@TerryE I am afraid I have no wisdom to add, and sarcasm seems like it won't help much. I don't suppose the ESP8266 believes in JTAG?
Sure @TerryE , how's "if it was easy then some other idiot would already have done it"? ;)
Is the gdb stub being bypassed because we've already hooked the flash exception? I think Philip changed it so the handlers would chain for anything we didn't handle though, so I might be way off.
@nwf As far as I know JTAG is a BITE interfacing technology. I've got more than that level of access and diagnostics; as Johny says: if it was easy ... We're working way up the stack here. Philip has already done some extremely valuable ground-breaking to help. It's a balancing act: I accept what we've got for now and work around it, or get sidetracked in improving and integrating the built-in test. What is clear is that we should do a major rewrite of the Extension Developer FAQ to include stuff like using the gdb stub, logging to UART1, using the mapfile, ...
Yes, Johny, perhaps we need to make the flash exception handler gdb aware in the die branch. I will think about this.
As it stands I have Lua 5.3 working as a NodeMCU host build and ditto the flash variant of std Lua 5.1.5, but bootstrapping this into the ESP8266 just takes time and perseverance. It's just that my other commitments mean that the elapsed time is more than I'd prefer. Luckily I am an old enough fart that I've done quite a bit of this low level hacking professionally back in the day, so it's just a matter of dusting off the cobwebs.
Well, JTAG is used for BITE/BIST access to be sure, but some MCUs have mechanisms for opcode-level debugging via JTAG. The Atmel AVRs, for example, IIRC, can be single-stepped and support hardware breakpoints through gdb when connected via JTAG (or their own "DebugWire" interface). There's no need for a gdbstub on the core, and no risk of missing exceptional control transfers.
> Is the gdb stub being bypassed because we've already hooked the flash exception?

Johny, I think that this is the magic Q. gdbstub.c:install_exceptions() sets the exception handler for a bunch of exceptions including EXCCAUSE_LOAD_STORE_ERROR -- which fails silently because this has already been hooked by user_exceptions.c. This handler catches unaligned fetches and returns to the interrupted code after emulating the fetch, but it goes into a while(1) {} loop to force a reboot if not. What it should do is to daisy chain to the gdb handler otherwise.
Ah, it looks like even though https://github.com/nodemcu/nodemcu-firmware/commit/2dacec156a8cf1f39c56bfb056e493f36aba50cf introduced the chaining, it continues on to the while(1). Looks like there should be a return after the load_store_handler(ef, cause) call then. After all, if someone else is claiming to handle the exception, we shouldn't have to babysit it afterwards.
@nwf I was going to write that there is no known JTAG support for the esp8266, but it seems I would've been wrong to say that.
@jmattsson Intriguing! Still, if it's possible to do all that's needed by the gdb stub, that's likely simpler. :)
The asm break 1,1 causes entry to the gdbstub (as I recall). You can't continue from it -- I never found a way to continue after an exception (though it ought to be possible somehow).
Let's ignore JTAG in this issue, please, as it isn't relevant to it. As far as the gdbstub / unaligned fetch handler interoperation goes, then surely the main thing to do is to ensure that the two handlers interoperate properly?
The GDB stub uses an extended Xtensa OS HAL exception frame structure (see gdbstub_entry.S:L34-L45) and polls back to the remote debugger until a continue command is received in which case it updates the exception frame and attempts to return execution to the application.
The exception handler catches all Flash exceptions and does something similar for L8UI, L16UI and L16SI instructions, emulating the instruction. Otherwise it chains to the GDB stub handler and this is where we could be going wrong. I'll have to try some use cases, but if the GDB stub does try to unroll the exception and continue, then shouldn't the unaligned fetch handler honour this and return control?
```C
if (load_store_handler) {
  load_store_handler(ef, cause);
  return;
}
```
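To make the control flow under discussion concrete, here is a host-side model of the chaining behaviour: emulate what can be emulated, otherwise hand over to a previously installed handler and return, and only spin when nobody else claims the exception. The frame type, counters and function names are all hypothetical and stand in for the real firmware structures.

```c
#include <assert.h>
#include <stddef.h>

struct frame { int emulable; };          /* stand-in for the exception frame */
typedef void (*exc_handler_t)(struct frame *ef, int cause);

static exc_handler_t load_store_handler; /* previously installed handler, if any */
static int emulated, chained, fatal;     /* counters for the demo only */

static void flash_exception(struct frame *ef, int cause)
{
    if (ef->emulable) {                  /* e.g. a byte load from flash */
        emulated++;                      /* patch up the result and resume */
        return;
    }
    if (load_store_handler) {
        load_store_handler(ef, cause);   /* let the debugger stub deal with it */
        return;                          /* and do not fall through afterwards */
    }
    fatal++;                             /* models the while(1){} forced reboot */
}

static void fake_gdb_handler(struct frame *ef, int cause)
{
    (void)ef; (void)cause;
    chained++;
}

/* Exercise all three paths. */
void chain_demo(void)
{
    struct frame ok = { 1 }, bad = { 0 };
    flash_exception(&ok, 3);
    flash_exception(&bad, 3);            /* no chained handler yet */
    load_store_handler = fake_gdb_handler;
    flash_exception(&bad, 3);            /* now deferred, not fatal */
    assert(emulated == 1 && fatal == 1 && chained == 1);
}
```

The third call is the case the `return` is meant to fix: once a chained handler exists, an unhandled exception reaches it and the fatal path is never entered.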
That's precisely what I was trying to say above. Try it!! :)
Sorry Johny, I am just being slow tonight. I will do :smile:
Hello Terry,
Have you been able to make some progress? I think this feature will be a game changer for the ESP8266. My biggest hope is that this will resolve our secured tcp connection issue with the latest SDK version (possibly due to lack of memory).
I am on holiday at our house on one of the Greek islands at the moment. Along with my laptop and ESP chip. And this break has given me the bandwidth to iron out the final wrinkles.
I'll push an Alpha version in the next day or so.
Super! We appreciate all the work here and owe you a lot of beers! Enjoy your vacation too I hope :) Will the Alpha version be on the dev branch so some of us can test it out? (I am not in a position to make my own builds)
The patch uses a conditional to enable / disable this functionality so that we can quickly and safely add it to dev as this will be disabled by default in the first instance. It is up to @marcelstoer as to whether he adds the extra switch to the Cloud Builder so users can request cloud builds with this enabled. We will be updating the dev default once some of the other committers have had a chance to evaluate and test the patch.
Seeing that dev usage on the Cloud Builder is down to 5% (https://nodemcu-build.com/stats.php) I'd send people over to using my, now further improved, Docker image during the transition phase.
Thx. I will follow the docker instructions at https://hub.docker.com/r/marcelstoer/nodemcu-build/ to build the image. Once you have the code ready, please give us the details such as which conditional to set, etc. To get us started maybe you could share a link to a build with basic modules included that allows testing of secured tcp connections via tls. That is enough for me to do most of the critical tests.
@dtran123 As discussed in other threads, I wouldn't expect TLS to work out of the box, or at least, not well, even after more heap is made available by storing RO contents in flash. (Though it will certainly improve matters, but IIRC the TLS issues are not exclusively of the "out of heap space" mode.)
OK, I have an Alpha version working quite well. To do a node.rebuildflash(...) only takes seconds, even if it involves 3 reboots of the chip. Once it's loaded, the CPU comes back with ~45K heap available, but the main difference is that at the mo' this then rapidly drops to 20Kb or whatever as soon as you start loading modules. When they are coming from Flash, you only lose heap for genuine R/W variables, so my rough 2x estimate seems to be holding up.
I could push this to my repo, but I've had to do quite a few other hacks to be able to test it. See #1862 for the back-story. I need to make the diagnostics conditional.
On another note, what this also throws up is how the GC hammers default performance, especially GC of strings. Need to think about this further.
If any of the committers want access to the alpha code for their own evaluation, then I will do a push, but @dtran123 et al: sorry guys there's just too much learning curve at the moment for me to push something that you could usefully use :disappointed:
@TerryE I amn't a committer, but I wouldn't object to seeing the bits. Congrats on hitting Alpha; it all sounds very exciting! :)
I'll have a slightly polished version in a day or so.
I have split the debugging discussions into a separate issue / PR #2146. If I assume that is committed to dev first, then the LFS (Lua Flash Store) specific changes are as follows:
lua/lflash.c, lua/lflash.h The new module implementing the LFS functionality and the luaN_* entry points.
lua/ldblib.c. Add extra debug functions getstrings and getflashmodules to return an array of string table entries and of modules in LFS.
lua/lgc.c, lua/lgc.h. Prevent the marking or GC of ROM based GObjects. Also some GC diagnostics (to be removed).
lua/lobject.c, lua/lobject.h, lua/lrotable.h. Remove LUA_PACK_VALUE conditionals. eLua option not supported on NodeMCU and also incompatible with LFS. Some changes to macros to prevent the modification of ROM GCObjects. Also inlining a "nsau %0, %1;" instruction to remove an equivalent C function which hit the flash unaligned exception handler badly.
lua/lstate.c, lua/lstate.h. Add initialisation of G(L)->ROstrt plus LFS hook.
lua/lstring.c, lua/lstring.h. Add scanning of G(L)->ROstrt and also some macro tweaks to prevent marking of RO strings.
lua/lua.c and user/user_main.c. Hooks into LFS initialisation and to allow fast reboot from LFS phase without going into init.lua. I also removed some of the #if 0 crud which confused things.
modules/node.c. Added function entries for getflash and rebuildflash to call luaN_getfunction and luaN_setfilelist_reboot respectively.
platform/common.c, platform/platform.c, platform/platform.h. Extend platform flash interface to enable reservation of flash areas, plus platform_flash_mapped2phys and platform_flash_phys2mapped functions.
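As a rough illustration of the nsau change listed for lobject.c above: NSAU is the Xtensa "normalization shift amount unsigned" instruction, which in effect counts leading zero bits. The sketch below is an assumption about how such a replacement might look, not the code from the patch. The function names are made up, and the guess that the replaced C function was a table-driven integer log2 (whose lookup table would sit in flash) is mine.

```c
#include <stdint.h>

/* Count the leading zero bits of v; yields 32 for v == 0.
 * On Xtensa this is a single instruction, so no flash-resident
 * lookup table has to be read. Elsewhere a portable loop is used. */
static inline uint32_t count_leading_zeros(uint32_t v) {
#if defined(__XTENSA__)
  uint32_t n;
  __asm__ ("nsau %0, %1;" : "=r"(n) : "r"(v));
  return n;
#else
  uint32_t n = 0;
  if (v == 0) return 32;
  while ((v & 0x80000000u) == 0) { v <<= 1; n++; }
  return n;
#endif
}

/* floor(log2(v)) for v > 0, and -1 for v == 0 (hypothetical helper). */
static inline int floor_log2(uint32_t v) {
  return 31 - (int)count_leading_zeros(v);
}
```

The point of the design is simply that a register-only instruction never touches flash, so it cannot trip the unaligned-access exception handler.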
I still need to test my variants to ensure that a debug-free version and one without the -DLUA_FLASH_STORE option compile and run as intended. This last option essentially drops the LFS functionality from a build, thus allowing us to promote the patch to dev earlier, since the default build will conditionally remove the LFS code.
I've still got a lot of testing to do, and possibly some performance issues to address that testing the current version has thrown up. Addressing these could significantly improve runtime performance, but even in its current form, the patch in practice allows Lua developers to deploy far larger embedded Lua applications.
I will give an example of the sort of wrinkle that I am debugging.
Lua stores strings in the strings table as a TString header immediately followed by the string literal. eLua introduces the concept of ROM string constants: if the string being interned is a ROM string and is longer than 3 bytes, then instead of storing the string itself, it appends a pointer to the string. This saves ALIGN(strlen) - 4 bytes of RAM for each string optimised this way, albeit at the cost of extra testing on every string access.
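To make the ALIGN(strlen) - 4 figure concrete, here is a minimal sketch of the arithmetic. It assumes that ALIGN rounds up to a 4-byte word and that a pointer is 4 bytes, as on the ESP8266. The function names are illustrative only, and whether the real macro also counts the trailing NUL is not stated above.

```c
#include <stddef.h>

/* Round n up to the next multiple of 4 (assumed meaning of ALIGN). */
static size_t align4(size_t n) { return (n + 3u) & ~(size_t)3u; }

/* RAM saved when a ROM string of length len is interned as a pointer
 * rather than inline. Strings of 3 bytes or fewer stay inline, so
 * they save nothing. */
static size_t rom_string_ram_saving(size_t len) {
  const size_t pointer_size = 4;  /* 32-bit pointer on the target */
  if (len <= 3) return 0;
  return align4(len) - pointer_size;
}
```

Under these assumptions a 10-byte string saves 8 bytes and a 4-byte string saves nothing, so the gain only matters for longer constants.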
There is a +/- debate as to whether this is worthwhile for LFS strings, since the string is already in Flash. OK it _slightly_ improves flash usage, but at a cost of extra cache faults to access the string if needed: inline storage performs better.
However there is also an unintended side effect: the LFS store can persist between different S/W builds so long as the size doesn't move the page boundaries, however all such eLua address indirections might well no longer point to the correct strings in the new build: therefore you must rebuild the flash after downloading a new build. Bugger.
Not an issue that an end user will face, but a bear trap during development. So do I:
_etext and invalidate the LFS store if this happens;
At the moment, I am doing the last, but it did give rise to some head scratching until I realised what the issue was.
I have added/extended the following functions:
node.flash sub-table with functions getfunction, rebuild and info. See my changes to the node documentation for more info.
debug. getstrings takes a single argument "RAM" or "ROM" and returns a sorted list of the strings in the relevant string table. getflashmodules is an introspection function which lists the modules in the LFS.
Another issue that I need to address is the Lua loaders used to require modules. NodeMCU keeps to the standard list (see loadlib.c) even though 3 aren't used / don't work:
1. loader_preload. This looks up the module in package.preload and loads it. eLua/NodeMCU _should_ use this to do its ROM modules resolution, but instead ignores this and has a patch in the require function itself to first check for the ROM table.
2. loader_Lua. This is the standard Lua module loader which uses the package.path to search for the file corresponding to the module.
3. loader_C. On std Lua, this uses the OS hooks to load a C shareable library dynamically using package.cpath. On NodeMCU, dynamic overlays don't work so this code just fails.
4. loader_Croot. A variant of (3). Ditto on the failure.
The require function tries each loader in turn and if successful, it then returns the Closure (a.k.a. function in Lua). Incidentally, I use package.loaders[2] to load my Lua functions from SPIFFS as this handles the lc/lua precedence for me. The main advantage of this system is that both package.path and package.loaders can be updated by the application.
I think that the best approach is to replace (1) with a loader_flash and remove (3) and (4), so that require can transparently load modules either from flash or SPIFFS. Using separate loaders allows the application to configure the precedence. I can't change the ROM library lookup patch because this implements a non-standard behaviour of not adding the module to package.loaded; using a loader would add the table, and thus break backwards compatibility.
Comments please.
I still have some odd gremlins with GC paths during the store rebuild that I am chasing down. Once I've done this, I'll raise a PR.
Regarding the bear trap, I think it might be better to at least start off with having the strings inline. Unless I'm mistaken, you already have plenty of gremlins to deal with, so whacking one off the table for the time being would seem to be helpful. This is an optimisation we could look at including further down the track, when the dust has settled a bit.
As to the loaders, yes, what you suggest sounds sane and proper.
This is a bit like praying in that I have a problem and putting it into words helps. The act of posting it here also helps, but I doubt that anyone will answer with a miracle.
The issue that I have is that I made some change and the GC is corrupting my RAM, and I am trying to diagnose the wheres and whys. So I am now in the remote debugger within the bowels of the Lua VM. The symptom that I've diagnosed so far is that the Lua Closure for init.lua is getting stomped on after the VM executes the first instruction in the code, which is a GETGLOBAL 0 -1 ; node, though this doesn't cause an exception until the next GETGLOBAL at instruction six.
Well, not so much the Closure: it is the contents of its env field that get overwritten. (The cl->env points to the Lua Table for its Lua environment, which is the main global table _G at this stage.) It is not the pointer, but the actual Table structure contents. Working this all out is a pain because of largely undocumented remote debugger constraints:
break only works if the code is running out of RAM, but since this almost never happens and it is a pain to toggle code into RAM, it's just easier to say that it doesn't work.
step and next work so long as you don't hit a L8 or L8.i instruction which is addressing Flash. The debugger can't deal with these since we've added this as another S/W emulation. A nice to do for Johny or me would be to add the unaligned fetch functionality into the debugger to handle these so that stepping works properly.
hbr. awatch, rwatch or multiple (serially single) so long as you remember to delete the previous one.
xtensa-lx106-elf-gdb.
Anyway, it looks like one of the memory locations is getting stomped before my Flash code starts to run. Sounded like a double free or the like. So I moved a cut-down version of the debug realloc (from the Lua test suite) into the NodeMCU code in lauxlib.c. This bookends each Lua memory block with a 0x55555555 marker and a check length. It issues a break 0,0 to throw you into the debugger if anything is wrong. The watch on this location when I add it isn't firing. Need to think about this, but I am tired and I have to get up in 6 hrs to catch a plane.
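For readers who haven't seen this kind of allocator, here is a minimal sketch of the bookending idea described above. It is not the cut-down routine from the Lua test suite. The names GuardHeader, guard_check, guard_alloc and guard_trap are hypothetical, and a counter stands in for the break 0,0 so that the sketch can be run on a host. The allocator follows the standard lua_Alloc signature and assumes Lua 5.1 semantics, where osize is the exact old block size.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_MARK 0x55555555u

/* Header placed in front of every block: the check length plus a marker. */
typedef struct { size_t size; uint32_t mark; } GuardHeader;

static int guard_failures = 0;                 /* stand-in for "break 0,0" */
static void guard_trap(void) { guard_failures++; }

/* Return 1 if the markers on both sides of the block are intact. */
static int guard_check(const void *block) {
  const GuardHeader *h = (const GuardHeader *)block - 1;
  uint32_t tail;
  memcpy(&tail, (const char *)block + h->size, sizeof tail);
  return h->mark == GUARD_MARK && tail == GUARD_MARK;
}

/* lua_Alloc-compatible allocator that validates a block before it is
 * resized or freed, then writes fresh markers around the new block. */
static void *guard_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
  GuardHeader *h = ptr ? (GuardHeader *)ptr - 1 : NULL;
  const uint32_t mark = GUARD_MARK;
  (void)ud;
  if (h && (!guard_check(ptr) || h->size != osize)) guard_trap();
  if (nsize == 0) { free(h); return NULL; }
  h = realloc(h, sizeof *h + nsize + sizeof mark);
  if (h == NULL) return NULL;
  h->size = nsize;
  h->mark = mark;
  memcpy((char *)(h + 1) + nsize, &mark, sizeof mark);
  return h + 1;
}
```

A scheme like this only reports damage when the block is next resized or freed, which is why it tends to be paired with a hardware watchpoint to catch the actual write.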
Even so, the extra level of diagnostics seems worthwhile for locating this sort of nasty problem. Literally, upwards and onwards.
Sorry guys, I've been a bit distracted since I've got back to the UK. I'm commissioning some elements of the Home Automation system in our new build, and need to bring the heating and Direct H/W online -- both non-standard since the house is a passive house, plus lots of other jobs. So this work has been on the back-burner a bit.
As to the bug, what is happening is that there seems to be some subtle interaction between the GC, the eLua emergency GC patches and the flash "don't try to mark ROM" ones, so the white bit marker gets out of sync with the GC sweep and the GC starts to free resources still in use. I need to add a bit of instrumentation to work out what is failing and why.
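The "don't try to mark ROM" rule can be pictured with a toy model like the one below. This is a sketch under assumptions, not the lgc.c change. RomRegion, GcObject, in_rom and mark_object are invented names, and the real collector tracks white/black colour bits rather than a single flag.

```c
#include <stdint.h>

typedef struct { uintptr_t start, end; } RomRegion;   /* [start, end) */
typedef struct { unsigned char marked; } GcObject;    /* toy object header */

/* Return 1 if p lies inside the read-only region. */
static int in_rom(const RomRegion *rom, const void *p) {
  uintptr_t a = (uintptr_t)p;
  return a >= rom->start && a < rom->end;
}

/* Mark an object as reachable unless it lives in ROM, in which case
 * its header must never be written. Returns 1 if the object was marked. */
static int mark_object(const RomRegion *rom, GcObject *o) {
  if (in_rom(rom, o)) return 0;
  o->marked = 1;
  return 1;
}
```

Because ROM objects never change colour, any sweep logic that assumes every live object was recoloured in the current cycle has to skip them as well, which is the kind of mismatch described above.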
No sweat :) We appreciate your work. This piece is a game changer for me...more RAM opens up new possibilities for sure and help with secured connections. Just putting a little note to show that we care about your efforts here. High on my radar is this (PR #2068) and PR #1707. Those are two key PRs to be resolved for me to invest more time on the ESP8266.
So, so great to hear that this is progressing!! Thank you very much @TerryE for your continued support!
We all completely understand Terry that you are under no obligation to meet deadlines or schedules and that this is a 'done in your spare time' kinda thing, however, for those of us that are eagerly awaiting a dev version of this, do you have a rough estimate of when you think you'll be able to release something? Not looking for a commitment in any shape or form, just a realistic prediction of when we are likely to be able to start using NodeMCU again! :)
I understand you "don't need the dosh" but if a donation would sweeten the deal, then please let us all know. Unfortunately money is all I can offer, I wish it was technical support, but this is all a little beyond me!
Hope it's going well!
No money. Just priorities, I'm sorry to say. I am up to my eyeballs in Lua and Node Red commissioning my home automation system for my new house. I'll take a break soon and spend a half a day getting to the bottom of this GC issue.
As soon as I have a stable build, I will push a commit to my github fork.
Hi @TerryE - Any update on this please?
I am desperate to get my hands on a version of NodeMCU that allows me to comfortably connect securely and also have enough heap for other stuff too!
Again, I know you have no obligation here, but do you have a rough idea of when you'll be able to find some time to complete this?
I am currently looking to invest in a developer to get a usable version up and running and wanted to see what stage you were at first, before I engage them...?
Many thanks :)
Hi, @georeb. I've just moved into the house that I've been building for the last few years and am typing this in my office on the first floor (in England the ground floor = zeroth). I hope to work on this over the holiday break and get a version out for evaluation. As to your investing in a developer to do this, my advice is: don't bother. This is complex stuff because you've got standard Lua, the eLua hacks, and all of the ESP issues interplaying. The learning curve is huge.
Any progress over the Christmas break @TerryE ?
I understand that it'll be a learning curve employing a developer, but I don't have much choice. You are unfortunately the bottleneck and as I cannot interest you in payment, I have to pay someone that will. As always, I understand that you have no obligation; however I (and others) have been waiting 5 months for this now and I have to do something before it all gets superseded by something else :/
If others want to chip in to help out with developer costs, please get in touch.
@georeb, we sold our old house and moved into the one we built on the 19th Dec and I've just been getting the HA system to the point where it will run the house's heating and environmental controls. After working 7 days a week on the new build, we've got to the point where my wife and I both have time available for our interests.
By all means employ a programmer to do this work, but don't underestimate the learning curve. You will probably waste your money as I will beat her or him to this deliverable.
Congrats on moving into your new build! :) How far off the deliverable would you say you are @TerryE ?
Any update please? @TerryE
Yup. Long story, short. I ran into a bit of a show stopper that has forced me to change my implementation strategy. The problem wasn't that the approach doesn't work but more of a scaling issue because of how the GC (and the EGC modifications) interact with the build process. The EGC includes extra GC pause / restart directives around some operations and nested pause / restarts aren't honoured, so these would always restart the GC. The Lua GC will aggressively scan all collectables and mark any that aren't in Lua scope for GC.
What this means is that the build process doesn't scale robustly, and above a certain size of flash image the GC could come in and collect elements that I was assembling for the flash. Getting around this by referencing them was creating extra overheads which hit the scaling issue even more. Either that or start making fundamental changes to the GC, which I just am not willing to do.
So this issue was about robustly building a flash image on device, rather than executing it on the node once built.
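The nested pause / restart problem can be shown with a toy model. This is only an illustration with invented names, not the EGC code: a single flag loses the outer pause as soon as an inner restart runs, whereas a depth counter (which would have meant changing the GC) would keep it.

```c
/* Flag style, as described above: an inner restart always wins. */
typedef struct { int paused; } FlagGC;
static void flag_pause(FlagGC *g)   { g->paused = 1; }
static void flag_restart(FlagGC *g) { g->paused = 0; }

/* Counted style: the GC resumes only when every pause has been undone. */
typedef struct { int depth; } CountedGC;
static void counted_pause(CountedGC *g)   { g->depth++; }
static void counted_restart(CountedGC *g) { if (g->depth > 0) g->depth--; }
static int  counted_is_paused(const CountedGC *g) { return g->depth > 0; }
```

With the flag style, an image builder that pauses the GC and then calls any operation with its own pause / restart pair comes back with the GC running again, so half-assembled objects that are not yet referenced from Lua scope can be collected.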
My alternative approach is to move the flash image building into the luac.cross build, so that doing a cross luac on a collection of files with the correct switch builds a PIC flash image, which you can copy into the SPIFFS on the target and execute a single API call to reload the LFS with this image and restart the node.
The only complication is that the host environment must be a little-endian architecture such as Intel or ARM, but the code has to cope with both 32-bit and 64-bit host environments.
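A minimal sketch of what "PIC image built on the host" implies, under assumptions of my own: references inside the image are stored as 32-bit byte offsets rather than host pointers (which may be 64 bits wide), and a small loader on the target adds the load address to each flagged cell. The reloc flag array, the function names and the example address are all hypothetical; the real image format is not described in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 1 if the host stores the least significant byte first. */
static int host_is_little_endian(void) {
  const uint32_t probe = 1;
  unsigned char first;
  memcpy(&first, &probe, 1);
  return first == 1;
}

/* Host side: turn a pointer into the image buffer into a 32-bit offset,
 * so the image layout does not depend on the host's pointer width. */
static uint32_t to_image_offset(const void *base, const void *p) {
  return (uint32_t)((const char *)p - (const char *)base);
}

/* Target side: reloc[i] != 0 marks a cell that holds an offset from the
 * start of the image; adding the load address makes it a real pointer. */
static void relocate_image(uint32_t *image, const uint8_t *reloc,
                           size_t words, uint32_t base_addr) {
  size_t i;
  for (i = 0; i < words; i++)
    if (reloc[i]) image[i] += base_addr;
}
```

The endianness check is what a build tool could use to refuse to run on a big-endian host, since the image cells are written in host byte order.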
I am junking the current eLua-based cross-lua.lua approach, and the standard make now builds luac.cross as well (as I do with 5.3) so long as the host includes the standard build-essential toolchain.
The luac side is working. I am in the middle of stripping out the rebuild stuff from the ESP end and adding the small PIC loader. Another few days of dev work.
You're right, sounds extremely complicated @TerryE !!
So, this is sort of good news then? It's hard to tell, not being technical!
Are we close?! :)
OK, it looks as if I have ironed out most of the issues and can put together an evaluation PR. I just need to check that my build without all of the debug hooks works as anticipated. We will clearly need a tweak of the API stuff and I still have some bits to add. But the highlights so far are:
luac.cross
luac.cross can build a PIC flash image from a file list of lua files.
local/lua directory, placing the luac.cross generated flash image in local/fs and this is then included in the SPIFFS image
user_config.h and honours the SPIFFS_MAX_FILESYSTEM_SIZE and SPIFFS_FIXED_LOCATION defines. (I have these at 32Kb and 1Mb resp for my testing.)
node.flash.reload(filename) will reload the LFS and reboot the ESP.
node.flash.reload() returns a list of top level function names in the cache.
node.flash.reload(functionName) returns the corresponding top level function.
debug.getstrings() which can return a list of strings in either the ROM or RAM table.
There's still a TODO list, for example:
__index metafield so that node.flash.someFunction(args) has the expected result.
I've just been playing with a test LFS which has 7 function files loaded, has 135 string constants in the ROM table, 22 are in the RAM string table and there is over 39Kb heap still available for the App, so this is all looking promising.
I've also fixed a bug in the remote debugger and become adept at using this. I've also added some gdb macros which will help library developers examine the Lua stack, and I need to write all of this up in the developer guide sometime.
How does the following get built?
local M = {}
M.add1 = function(x) return x + 1 end
return M
I'm hoping that it will be possible to write a wrapper for require that searches the file system first, and if not found, uses the version in the node.flash. (The rationale for that order is that it allows easy development by only having to upload the one file that you want to change.)
> I'm hoping that it will be possible to write a wrapper for require
Phillip, there's no need. Read up on package.loaders. The require loader passes the module name to each in turn, and this handler then either
TValue containing a LClosure in API terms) or
The package.loaders table is in RAM so the application can reorder the handlers or replace/add one. (Search for lua_CFunction loaders in app/lua/loadlib.c.) We only use the second, loader_Lua, in NodeMCU, so you can replace any of the other 3 with your own Lua function:
local index = node.flash.index
local function loader_flash(module)
  local r = index(module)  -- index is the one upvalue of this closure
  return type(r) == 'function' and r or nil
end
if index then package.loaders[2] = loader_flash end
If you have some init module in flash then you can stick this fragment in it; then the only RAM overhead is the loader_flash LClosure with its one upval.
As far as how it gets built, you can either just stick the modules in fs/lua and do a make, or you can do your own process. I am going to update my own provisioning system to be LFS aware, so this will all be seamless for me.
Another trick is that I include a dummy module preload which is just a single lua line:
-- preload a bunch of strings into the ROstrt and avoid the RAM overhead.
-- use debug.getstrings('RAM') to work out which you might want to add
-- for your application
local preload = "?.lc;?.lua", "@init.lua" -- , ... extend as you need
or add more preload = .... if you have lots of strings that you want to preload into ROM. This creates a dummy module with just a load of LOADK instructions and a constant list of all of these strings, which luac.cross will then preload in the ROstrt, so you won't chew up your RAMstrt and have all of the associated GC overhead. You never need to call this; just including it in the compile is enough.
OK you are wasting n × (TValue + Instruction) in the LFS to do this, but with up to 256Kb available and it never being called, do you care?
I was thinking about reverse engineering the compiler to preload all of the common strings used during compilation to drop the compilation overhead.
@TerryE Makes sense. Looking forward to seeing this in action!
Incidentally one of the best tricks to do with the debugger is to add a macro for lua_assert which does a debugger break and then enable this for your test code. The Lua API macros use lua_assert a lot to do validation so this will pick up a lot of consistency errors. You can also make heavy use of lua_assert in your own code. If not enabled then this all gets optimised away / removed by the GCC code generator at -O2. The real PITA with using the debugger is that you lose the ability to input strings through the UART input, so you need to use a telnet stub for interactive testing.
I am thinking of having a variant assert stub which puts out a warning message to come out of your UART terminal session and start xtensa-lx106-elf-gdb before itself starting the GDB remote stub then issuing a break so that the host and target can rendezvous in a debug session, and this way you get the best of both interactive and debug use.
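A minimal sketch of the lua_assert trick mentioned above, assuming nothing about the actual NodeMCU headers: LUA_ASSERT_BREAK is a made-up build switch, dbg_break is a made-up helper, and off-target a counter replaces the break 0,0 instruction so the macro can be exercised on a host.

```c
#define LUA_ASSERT_BREAK 1   /* hypothetical switch; leave undefined to disable */

static int dbg_break_hits = 0;   /* host-side stand-in for the debugger break */

/* Drop into the remote debugger on the target; count the hit elsewhere. */
static inline void dbg_break(void) {
#if defined(__XTENSA__)
  __asm__ volatile ("break 0,0");
#else
  dbg_break_hits++;
#endif
}

#ifdef LUA_ASSERT_BREAK
#define lua_assert(c) ((c) ? (void)0 : dbg_break())
#else
#define lua_assert(c) ((void)0)   /* compiles away when not enabled */
#endif
```

With the switch undefined the macro expands to nothing, which matches the remark that the checks are removed entirely in a normal build.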
This is GREAT news @TerryE !! :) Thankyou.
Is the plan to release a DEV version that will eventually be merged with the MASTER branch?
The Alpha version will stay in my fork until at least one other committer has checked it out. Then it will be pulled into dev. It will go into master on the following release cycle, but with the LUA_FLASH_STORE define in user_config.h commented out so that builds won't have LFS enabled by default. However individual developers will be able to enable it for their builds. We might subsequently switch it on by default, but that will be up to a consensus of the committers, not just me.
Excellent!
Will the version in your fork be an adapted version of NodeMCU MASTER branch?
Sorry for the, perhaps, obvious questions.
The way that the release cycle works is that we commit to dev, then batches of commits to dev once stable are then committed to Master. The only path to updating master is to move dev patches into it. So I am not sure what you mean by your repeated Q. There should be a master version with LFS support in the next 2-3 months, but the delay is only because of the dev to master promotion cycle.
About half the community use dev builds to take advantage of the latest bug fixes etc. The delay ensures that we have a reasonable chance to give good usage coverage to any changes before moving them into master.
Will the version in your fork be an adapted version of NodeMCU MASTER branch?
What I meant was, will your version be a standard copy of the current MASTER with the addition of LFS?
I hope this is a little clearer?
@georeb Unlikely; it's more likely to be a fork of dev, rather than master, since that's the target for merge.
Okay, understood. Thanks
I have just updated my Lua Flash Store (LFS) whitepaper so it now reflects the current LFS implementation. _Anyone interested in this, please reread carefully_. The LFS patch is so large that I have also had to split it into 5 commits, each of which is larger than a typical PR here.
For those who are wondering about my delays here, I find it quite time consuming to cover all of the base test cases and their variants: float vs Integer build; host (luac) vs target (lua) firmware; without LFS; with but no LFS used; with LFS used. In my testing, I have come across a subtle architectural issue which relates to my implementation of GC marking, and this really needed reworking before I release this.
We made quite a few compromises in getting the 0.9x versions of Lua out within the timescales that zeroday achieved. By now we have the luxury of a robust working 2.1 version. I don't want to compromise this by rushing out an LFS version too soon.
See #2292 for further discussion.