When NodeMCU modules are used in the correct way using LFS together with the overlay table method used by @nwf to load flash components, the components does not "stick" in RAM even when components are called again over time.
This results that RAM is constantly released as the lua application runs for an extended period of time.
The releasing of RAM holds true for all NodeMCU modules except when using the tls module. After every use of the tls module functionality, the available RAM decreases by about 2 KB.
The nodemcu-firmware/lua_examples/lfs/HTTP_OTA.lua from @TerryE has been modified by replacing local con = net.createConnection(net.TCP,0) with local con = tls.createConnection().
The specific example appears not to display this problem when it is used without the tls functionality. This strengthens the suspicion that it it might be related to a memory leak in the tls module.
NodeMCU 3.0.0.0 built with Docker provided by frightanic.com
branch: dev
commit: c116d9d25f5d15e2b1d0d52ff4c8ebdfd18f75c3
release: 2.0.0-master_20170202 +477
release DTS: 202003062124
SSL: true
build type: float
LFS: 0x20000 bytes total capacity
modules: adxl345,bit,cron,crypto,file,gpio,http,i2c,net,node,ow,rtcmem,rtctime,sjson,sntp,softuart,tls,tmr,uart,wifi
build 2020-03-09 11:52 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)
Standard ESP-12 is used.
While I'm flattered, I cannot take credit for the overlay table technique. As with so many things LFS, full credit is due to @TerryE.
Memory leaking, especially at that rate, doesn't sound stellar... 2K is an awkward size to explain, too, since it's too small to be the full TLS connection state object (with its associated buffers of much fame) but too large to be a LwIP structure or anything of that sort, I think. Would you mind repeating your experiment on top of #3060? I don't expect that it will fix the problem, but there was some minor re-jiggling of tls (de)allocation associated with that.
Since this seems easy for you to reproduce, can you tell me if the Lua registry is growing as memory seeps away? Dumping something like for k,v in ipairs(debug.getregistry()) do print(k,v) end when you sample node.heap() is likely informative. (It's not as simple as #debug.getregistry() because the registry has a free list plumbed through it, and so peaks of registered entities might obfuscate what's going on.)
@nwf I have confirmed the following before using #3060:
The Lua functions in registry grows consistently every time the tls module is used while the userdata stays constant.
I used r=debug.getregistry() and called r[k]() for every function indicated in registry to get the designated code line of the indicated function to identify the functions growing in the registry.
It appears every time a thetls.socket:on() event is triggered the callback function(tls.socket[, string]) stays behind in the registry.
Ah ha! Thanks for having a deeper look. A callback leak is certainly possible, and would explain why this isn't found with tcp. I will sniff around.
I didn't see any overt reference leaks, but I did see some dodgy corner-case behavior which I've now taken a stab at fixing. Could you try again with e38ce43dcfa5c3ad3f5a87e7db31e2edb03d6cfb from my for-upstream branch (which should now be part of #3060)?
Sorry, grabbed the wrong commit hash; I meant 0cd7c9409e8b2005134a0fad10ce33c6d5bc785c, but it, too, is part of #3060, so if you've just pulled that whole mess you've got the right thing.
No harm done, i will grab #3060. Sorry it is taking a bit longer, struggling to do a Docker build from your commit, I need to figure out if it is Docker or the build make file.
@nwf Ok, I get a tls.c Docker build failure. Maybe it is because of the shortcut I tried. I have used my existing nodemcu-firmware build pull (c116d9d2 6 March) and just replaced the tls.c file with the file in your 0cd7c94 commit. This is to ensure that when I test again everything stays the same except for your changes to tls.c
My shortcut was too short :roll_eyes:, I did a "proper" pull, builds fine, will continue to start testing soon.
@nwf, Ok, completed the same tests as yesterday with the commit as detailed below that includes the tls.c changes:
NodeMCU 3.0.0.0 built with Docker provided by frightanic.com
branch: for-upstream
commit: e38ce43dcfa5c3ad3f5a87e7db31e2edb03d6cfb
release: 2.0.0-master_20170202 +500
release DTS: 202004050230
SSL: true
build type: float
LFS: 0x20000 bytes total capacity
modules: adxl345,bit,cron,crypto,file,gpio,http,i2c,net,node,ow,rtcmem,rtctime,sjson,sntp,softuart,tls,tmr,uart,wifi
build 2020-04-05 18:39 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)
The results are unfortunately exactly the same as before. The functions in the registry increases in exactly the same way as before with the userdata staying constant.
I am very confused. I have attempted to run HTTP_OTA thus: LFS.HTTP_OTA(server,"/~nwf/","foobar") and have discovered that...
tls:on() did not allow :on(..., nil) to unregister callbacks, so it always crashed; I am not sure how this worked for you.tls sockets without callbacks registered showed a sort of stylistic bug in HTTP_OTA: the connection callback should be registered before the call to :connect().HTTP_OTA once per boot; later invocations apparently do not do anything, even if the previous attempt bailed out. I am not sure what's going on, but that seems unintentional.Specifically, I am running HTTP_OTA.lua with the following patch applied:
diff --git a/lua_examples/lfs/HTTP_OTA.lua b/lua_examples/lfs/HTTP_OTA.lua
index 513771b3..a4e3201a 100644
--- a/lua_examples/lfs/HTTP_OTA.lua
+++ b/lua_examples/lfs/HTTP_OTA.lua
@@ -12,8 +12,7 @@ local n, total, size = 0, 0
doRequest = function(socket, hostIP) -- luacheck: no unused
if hostIP then
- local con = net.createConnection(net.TCP,0)
- con:connect(80,hostIP)
+ local con = tls.createConnection()
-- Note that the current dev version can only accept uncompressed LFS images
con:on("connection",function(sck)
local request = table.concat( {
@@ -28,6 +27,7 @@ doRequest = function(socket, hostIP) -- luacheck: no unused
sck:send(request)
sck:on("receive",firstRec)
end)
+ con:connect(443,hostIP)
end
end
atop the full #3060 series (which includes, now, fb24db482537411beacb5fc595a5aaa96242b577, the fix to the TLS :on issue). A debug build yields, trimming out everything except the TLS chatter, ...
> LFS.HTTP_OTA(server,"/~nwf/","foobar")
tls_socket_connect 3fff05f0
tls_socket_ondns 3fff05f0
client handshake start.
client handshake ok!
tls_socket_onconnect 3fff05f0
tls_socket_onsent 3fff05f0
tls_socket_onrecv 3fff05f0
0 of 12, tls_socket_onrecv 3fff05f0
12 of 12, tls_socket_close 3fff05f0
client's data invalid protocol
Reason:[-0x45]
tls_socket_last_call 3fff05f0 -2 'Unknown error'
tls_socket_cleanup 3fff05f0
tls_socket_delete 3fff05f0
I'm not sure what that Reason is, but it's entirely unrelated; the file transfer has taken place successfully (all 12 bytes copied OK).
For clarity the following:
I have modified nodemcu-firmware/lua_examples/lfs/HTTP_OTA.lua from @TerryE as follows to create a Post and Get function to download and upload a file to a cloud repository via TLS.
Both works well but leaves functions behind in the registry making it to grow.
The trimmed down Post function is as follows:
function Post()
local Rec = function (sck,rec)
local i = rec:find('\r\n',1,true) or 1
if rec:find("201",1,i)==nil then
return false
else
print("uploaded")
end
sck:close()
sck,rec,con,res=nil,nil,nil,nil
return true
end
local doRequest = function(conn,hostIP)
if hostIP then
local request = ........
local con = tls.createConnection()
con:on("connection",function(sck)
sck:send(request)
request = nil
collectgarbage();collectgarbage();
con:on("receive",Rec)
end)
con:connect(443,hostIP)
end
end
local res = net.dns.resolve("bitbucket.org", doRequest)
end --Post
The Post function leaves one function behind in the registry from callback function(tls.socket[, string]) of the tls.on event.
The trimmed down Get function is as follows:
function Get()
local n,total=0,0
local status,size
local finalise = function(sck)
file.close()
sck:on("receive",function(sck,c) print(c) end)
sck:close()
sck,con,firstRec,subsRec,c=nil,nil,nil,nil,nil
collectgarbage();collectgarbage()
local s = file.stat(fname)
if (s and size == s.size) then
print("pass")
return true
else
print("fail")
return false
end
end
local subsRec = function(sck,rec)
total, n = total + #rec, n + 1
if n % 4 == 1 then
sck:hold()
node.task.post(0, function() sck:unhold() end)
end
file.write(rec)
if total == size then finalise(sck) end
end
local firstRec = function (sck,rec)
print("firstRX")
-- Process the headers; only interested in content length
local i = rec:find('\r\n\r\n',1,true) or 1
local header = rec:sub(1,i+1):lower()
status = tonumber(header:match('http/1.1 *(%d+)%s.+\r') or 0)
size = tonumber(header:match('\ncontent%-length: *(%d+)\r') or 0)
if status==200 and size > 0 then --200=OK
sck:on("receive",subsRec)
file.open(fname, 'w')
subsRec(sck, rec:sub(i+4)) --pass content without header to subsRec
else
if status==404 then --404=not file found
sck:on("receive", function() end)
sck:close()
sck,rec,con=nil,nil,nil
return false
end
end
end
local doRequest = function(conn,hostIP)
if hostIP then
local request = ........
local con = tls.createConnection()
con:on("connection",function(sck)
print("Connected")
sck:send(request)
request = nil
collectgarbage();collectgarbage()
sck:close()
sck,con=nil,nil
con:on("receive",firstRec)
end)
con:connect(443,hostIP)
end
end
net.dns.resolve("bitbucket.org", doRequest)
end --Get
The Get function leaves five functions behind in the registry from callback function(tls.socket[, string]) of the tls.on events.
I'll offer commentary on Post.
function Post() local Rec = function (sck,rec) local i = rec:find('\r\n',1,true) or 1 if rec:find("201",1,i)==nil then return false
This and the other return from this function are discarded.
print("uploaded") end sck:close() sck,rec,con,res=nil,nil,nil,nil
This does not do anything except delete any globals incidentally named con or res; sck and rec are local variables in this function's scope, and this code only introduces con and res as local identifiers, so they are not reachable here.
return true end local doRequest = function(conn,hostIP)
conn is confusingly named and threw me for a moment; it is just nil for reasons that escape me. Another wart of our API.
if hostIP then local request = ........ local con = tls.createConnection() con:on("connection",function(sck) sck:send(request) request = nil collectgarbage();collectgarbage(); con:on("receive",Rec)
Here is the problem! Given all of these...
con C object holds reference to itself in the registry because it is OK for Lua to drop it and rely on callbacks.functions in the registry, as that is the only way it has to refer to them.:on("connection") callback makes direct reference to con (as a Lua upval) rather than using the passed sck.There are now cyclic references through the registry, and the garbage collector is unable to reclaim any of the objects.
end) con:connect(443,hostIP) end end local res = net.dns.resolve("bitbucket.org", doRequest)
net.dns.resolve does not return a value; res serves no purpose (and is, indeed, unused anyway).
I am not particularly enamored of our APIs for C objects that pin themselves to the registry. Unfortunately, any real fix is going to be far-reaching and invasive (see #3062 for what might be the first step). Until then, we will just have to be very careful of what upvalues we capture and what C objects we retain in Lua when.
Thanks for your detailed feedback and explanation.
My code "grew" a bit from the nodemcu-firmware/lua_examples/lfs/HTTP_OTA.lua example that I used as a starting point. Hence the unnecessary additions in my attempt to isolate the "leak" since I mentioned this problem about a year ago. I will clean up and get rid of all the "stuff" not doing anything.
I understand and get it 馃憤 as explained in #3062, the fundamental problem, as brought to the surface in the lfs example code, exposes the quirk of our APIs related to C objects when callbacks are unregistered leaving the underlying C object pinned to registry out of reach of the gc.
I also need to re-look upvalues for a better understanding.
Thank you for triggering a deeper investigation into the code!
@nwf , Eureka !!, The sticky tls event callback functions do no stick anymore.
I have used your latest reworked commit fb24db4 ("tls: allow :on(...,nil) to unregister a callback") and repeated the tests without any improvement. Everything still sticking exactly as before.
I then decided to move local con = tls.createConnection() out of the doRequest function in both the Post() and Get() functions to be first in the main functions to ensure that all tls sockets will be destroyed when the calling sck:close(). This seem to work. The registry does not grow anymore when tls is repeatedly used. I need to try this with the unchanged tls.c still. I believe the combination of your and my changes did the trick.
The fix for Post should be as simple as replacing con:on("receive",Rec) with sck:on("receive",Rec); moving con to a higher scope means that the later con = nil assignments will clobber it and thereby sort of fix the problem, but it does not change the underlying cyclicity of references.
Ok, I will move the definition back, do the change and test first thing tomorrow morning.