Nodemcu-firmware: TLS Module Memory Leak ?

Created on 3 Apr 2020  路  16Comments  路  Source: nodemcu/nodemcu-firmware

Expected behavior

When NodeMCU modules are used in the correct way using LFS together with the overlay table method used by @nwf to load flash components, the components does not "stick" in RAM even when components are called again over time.
This results that RAM is constantly released as the lua application runs for an extended period of time.

Actual behavior

The releasing of RAM holds true for all NodeMCU modules except when using the tls module. After every use of the tls module functionality, the available RAM decreases by about 2 KB.

Test code

The nodemcu-firmware/lua_examples/lfs/HTTP_OTA.lua from @TerryE has been modified by replacing local con = net.createConnection(net.TCP,0) with local con = tls.createConnection().

The specific example appears not to display this problem when it is used without the tls functionality. This strengthens the suspicion that it it might be related to a memory leak in the tls module.

NodeMCU startup banner

NodeMCU 3.0.0.0 built with Docker provided by frightanic.com branch: dev commit: c116d9d25f5d15e2b1d0d52ff4c8ebdfd18f75c3 release: 2.0.0-master_20170202 +477 release DTS: 202003062124 SSL: true build type: float LFS: 0x20000 bytes total capacity modules: adxl345,bit,cron,crypto,file,gpio,http,i2c,net,node,ow,rtcmem,rtctime,sjson,sntp,softuart,tls,tmr,uart,wifi build 2020-03-09 11:52 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)

Hardware

Standard ESP-12 is used.

All 16 comments

While I'm flattered, I cannot take credit for the overlay table technique. As with so many things LFS, full credit is due to @TerryE.

Memory leaking, especially at that rate, doesn't sound stellar... 2K is an awkward size to explain, too, since it's too small to be the full TLS connection state object (with its associated buffers of much fame) but too large to be a LwIP structure or anything of that sort, I think. Would you mind repeating your experiment on top of #3060? I don't expect that it will fix the problem, but there was some minor re-jiggling of tls (de)allocation associated with that.

Since this seems easy for you to reproduce, can you tell me if the Lua registry is growing as memory seeps away? Dumping something like for k,v in ipairs(debug.getregistry()) do print(k,v) end when you sample node.heap() is likely informative. (It's not as simple as #debug.getregistry() because the registry has a free list plumbed through it, and so peaks of registered entities might obfuscate what's going on.)

@nwf I have confirmed the following before using #3060:
The Lua functions in registry grows consistently every time the tls module is used while the userdata stays constant.
I used r=debug.getregistry() and called r[k]() for every function indicated in registry to get the designated code line of the indicated function to identify the functions growing in the registry.
It appears every time a thetls.socket:on() event is triggered the callback function(tls.socket[, string]) stays behind in the registry.

Ah ha! Thanks for having a deeper look. A callback leak is certainly possible, and would explain why this isn't found with tcp. I will sniff around.

I didn't see any overt reference leaks, but I did see some dodgy corner-case behavior which I've now taken a stab at fixing. Could you try again with e38ce43dcfa5c3ad3f5a87e7db31e2edb03d6cfb from my for-upstream branch (which should now be part of #3060)?

Sorry, grabbed the wrong commit hash; I meant 0cd7c9409e8b2005134a0fad10ce33c6d5bc785c, but it, too, is part of #3060, so if you've just pulled that whole mess you've got the right thing.

No harm done, i will grab #3060. Sorry it is taking a bit longer, struggling to do a Docker build from your commit, I need to figure out if it is Docker or the build make file.

@nwf Ok, I get a tls.c Docker build failure. Maybe it is because of the shortcut I tried. I have used my existing nodemcu-firmware build pull (c116d9d2 6 March) and just replaced the tls.c file with the file in your 0cd7c94 commit. This is to ensure that when I test again everything stays the same except for your changes to tls.c

My shortcut was too short :roll_eyes:, I did a "proper" pull, builds fine, will continue to start testing soon.

@nwf, Ok, completed the same tests as yesterday with the commit as detailed below that includes the tls.c changes:

NodeMCU 3.0.0.0 built with Docker provided by frightanic.com branch: for-upstream commit: e38ce43dcfa5c3ad3f5a87e7db31e2edb03d6cfb release: 2.0.0-master_20170202 +500 release DTS: 202004050230 SSL: true build type: float LFS: 0x20000 bytes total capacity modules: adxl345,bit,cron,crypto,file,gpio,http,i2c,net,node,ow,rtcmem,rtctime,sjson,sntp,softuart,tls,tmr,uart,wifi build 2020-04-05 18:39 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)

The results are unfortunately exactly the same as before. The functions in the registry increases in exactly the same way as before with the userdata staying constant.

I am very confused. I have attempted to run HTTP_OTA thus: LFS.HTTP_OTA(server,"/~nwf/","foobar") and have discovered that...

  1. tls:on() did not allow :on(..., nil) to unregister callbacks, so it always crashed; I am not sure how this worked for you.
  2. My patch to drop tls sockets without callbacks registered showed a sort of stylistic bug in HTTP_OTA: the connection callback should be registered before the call to :connect().
  3. I can only run HTTP_OTA once per boot; later invocations apparently do not do anything, even if the previous attempt bailed out. I am not sure what's going on, but that seems unintentional.
  4. On bailing out, it sure looks like we tear down the context correctly and don't leave anything lingering in the registry.

Specifically, I am running HTTP_OTA.lua with the following patch applied:

diff --git a/lua_examples/lfs/HTTP_OTA.lua b/lua_examples/lfs/HTTP_OTA.lua
index 513771b3..a4e3201a 100644
--- a/lua_examples/lfs/HTTP_OTA.lua
+++ b/lua_examples/lfs/HTTP_OTA.lua
@@ -12,8 +12,7 @@ local n, total, size = 0, 0

 doRequest = function(socket, hostIP) -- luacheck: no unused
   if hostIP then
-    local con = net.createConnection(net.TCP,0)
-    con:connect(80,hostIP)
+    local con = tls.createConnection()
     -- Note that the current dev version can only accept uncompressed LFS images
     con:on("connection",function(sck)
       local request = table.concat( {
@@ -28,6 +27,7 @@ doRequest = function(socket, hostIP) -- luacheck: no unused
         sck:send(request)
         sck:on("receive",firstRec)
       end)
+    con:connect(443,hostIP)
   end
 end

atop the full #3060 series (which includes, now, fb24db482537411beacb5fc595a5aaa96242b577, the fix to the TLS :on issue). A debug build yields, trimming out everything except the TLS chatter, ...

> LFS.HTTP_OTA(server,"/~nwf/","foobar")
tls_socket_connect 3fff05f0
tls_socket_ondns 3fff05f0
client handshake start.
client handshake ok!
tls_socket_onconnect 3fff05f0
tls_socket_onsent 3fff05f0
tls_socket_onrecv 3fff05f0
0 of 12, tls_socket_onrecv 3fff05f0
12 of 12, tls_socket_close 3fff05f0
client's data invalid protocol
Reason:[-0x45]
tls_socket_last_call 3fff05f0 -2 'Unknown error'
tls_socket_cleanup 3fff05f0
tls_socket_delete 3fff05f0

I'm not sure what that Reason is, but it's entirely unrelated; the file transfer has taken place successfully (all 12 bytes copied OK).

For clarity the following:
I have modified nodemcu-firmware/lua_examples/lfs/HTTP_OTA.lua from @TerryE as follows to create a Post and Get function to download and upload a file to a cloud repository via TLS.

Both works well but leaves functions behind in the registry making it to grow.

The trimmed down Post function is as follows:

function Post()

  local Rec = function (sck,rec)
    local i = rec:find('\r\n',1,true) or 1
    if rec:find("201",1,i)==nil then
      return false
    else
      print("uploaded")
    end
    sck:close()
    sck,rec,con,res=nil,nil,nil,nil
    return true
  end

  local doRequest = function(conn,hostIP)
    if hostIP then
      local request = ........

      local con = tls.createConnection()

      con:on("connection",function(sck)
        sck:send(request)
        request = nil
        collectgarbage();collectgarbage();
        con:on("receive",Rec) 
      end)

      con:connect(443,hostIP)
    end
  end

  local res = net.dns.resolve("bitbucket.org", doRequest) 

end --Post

The Post function leaves one function behind in the registry from callback function(tls.socket[, string]) of the tls.on event.

The trimmed down Get function is as follows:

function Get()
  local n,total=0,0
  local status,size

  local finalise = function(sck)
    file.close()
    sck:on("receive",function(sck,c) print(c) end)
    sck:close()
    sck,con,firstRec,subsRec,c=nil,nil,nil,nil,nil
    collectgarbage();collectgarbage()

    local s = file.stat(fname)
    if (s and size == s.size) then
      print("pass")
      return true
    else
      print("fail")
      return false
    end
  end

  local subsRec = function(sck,rec)
    total, n = total + #rec, n + 1
      if n % 4 == 1 then
        sck:hold()
        node.task.post(0, function() sck:unhold() end)
      end
      file.write(rec)
      if total == size then finalise(sck) end
  end

  local firstRec = function (sck,rec)
    print("firstRX")
    -- Process the headers; only interested in content length

    local i = rec:find('\r\n\r\n',1,true) or 1
    local header = rec:sub(1,i+1):lower()
    status = tonumber(header:match('http/1.1 *(%d+)%s.+\r') or 0)
    size = tonumber(header:match('\ncontent%-length: *(%d+)\r') or 0)

    if status==200 and size > 0 then --200=OK
      sck:on("receive",subsRec)
      file.open(fname, 'w')
      subsRec(sck, rec:sub(i+4)) --pass content without header to subsRec
    else
      if status==404 then --404=not file found
        sck:on("receive", function() end)
        sck:close()
        sck,rec,con=nil,nil,nil        
        return false
      end
    end
  end

  local doRequest = function(conn,hostIP)
    if hostIP then
      local request = ........

      local con = tls.createConnection()

      con:on("connection",function(sck)
        print("Connected")
        sck:send(request)
        request = nil
        collectgarbage();collectgarbage()
        sck:close()
        sck,con=nil,nil

        con:on("receive",firstRec)
      end)

      con:connect(443,hostIP)
    end
  end

  net.dns.resolve("bitbucket.org", doRequest)

end --Get

The Get function leaves five functions behind in the registry from callback function(tls.socket[, string]) of the tls.on events.

I'll offer commentary on Post.

function Post()

  local Rec = function (sck,rec)
    local i = rec:find('\r\n',1,true) or 1
    if rec:find("201",1,i)==nil then
      return false

This and the other return from this function are discarded.

      print("uploaded")
    end
    sck:close()
    sck,rec,con,res=nil,nil,nil,nil

This does not do anything except delete any globals incidentally named con or res; sck and rec are local variables in this function's scope, and this code only introduces con and res as local identifiers, so they are not reachable here.

    return true
  end

  local doRequest = function(conn,hostIP)

conn is confusingly named and threw me for a moment; it is just nil for reasons that escape me. Another wart of our API.

    if hostIP then
      local request = ........

      local con = tls.createConnection()

      con:on("connection",function(sck)
        sck:send(request)
        request = nil
        collectgarbage();collectgarbage();
        con:on("receive",Rec)

Here is the problem! Given all of these...

  1. The con C object holds reference to itself in the registry because it is OK for Lua to drop it and rely on callbacks.
  2. The C must also persist its callback functions in the registry, as that is the only way it has to refer to them.
  3. This :on("connection") callback makes direct reference to con (as a Lua upval) rather than using the passed sck.

There are now cyclic references through the registry, and the garbage collector is unable to reclaim any of the objects.

      end)

      con:connect(443,hostIP)
    end
  end

  local res = net.dns.resolve("bitbucket.org", doRequest)

net.dns.resolve does not return a value; res serves no purpose (and is, indeed, unused anyway).

I am not particularly enamored of our APIs for C objects that pin themselves to the registry. Unfortunately, any real fix is going to be far-reaching and invasive (see #3062 for what might be the first step). Until then, we will just have to be very careful of what upvalues we capture and what C objects we retain in Lua when.

Thanks for your detailed feedback and explanation.

My code "grew" a bit from the nodemcu-firmware/lua_examples/lfs/HTTP_OTA.lua example that I used as a starting point. Hence the unnecessary additions in my attempt to isolate the "leak" since I mentioned this problem about a year ago. I will clean up and get rid of all the "stuff" not doing anything.

I understand and get it 馃憤 as explained in #3062, the fundamental problem, as brought to the surface in the lfs example code, exposes the quirk of our APIs related to C objects when callbacks are unregistered leaving the underlying C object pinned to registry out of reach of the gc.
I also need to re-look upvalues for a better understanding.

Thank you for triggering a deeper investigation into the code!

@nwf , Eureka !!, The sticky tls event callback functions do no stick anymore.

I have used your latest reworked commit fb24db4 ("tls: allow :on(...,nil) to unregister a callback") and repeated the tests without any improvement. Everything still sticking exactly as before.

I then decided to move local con = tls.createConnection() out of the doRequest function in both the Post() and Get() functions to be first in the main functions to ensure that all tls sockets will be destroyed when the calling sck:close(). This seem to work. The registry does not grow anymore when tls is repeatedly used. I need to try this with the unchanged tls.c still. I believe the combination of your and my changes did the trick.

The fix for Post should be as simple as replacing con:on("receive",Rec) with sck:on("receive",Rec); moving con to a higher scope means that the later con = nil assignments will clobber it and thereby sort of fix the problem, but it does not change the underlying cyclicity of references.

Ok, I will move the definition back, do the change and test first thing tomorrow morning.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

pastukhov picture pastukhov  路  6Comments

Michal78S picture Michal78S  路  6Comments

fsch2 picture fsch2  路  7Comments

marcelstoer picture marcelstoer  路  4Comments

adamdyga picture adamdyga  路  4Comments