Riot: gnrc: Packet buffer full errors after a few hours of uptime

Created on 21 Nov 2017 · 21 Comments · Source: RIOT-OS/RIOT

After running for an hour or so, one of my nodes can't ping the border router anymore; the ping fails with the message "error: packet buffer full" printed three times. Pinging from the border router to the node also doesn't work anymore (I assume this is also due to the packet buffer being full).

If there are steps I can take to debug this, I'd like to know. For now I can keep the GDB session attached.

setup:
2 nucleo-f446, mrf24j40 radio
1 configured as gnrc_border_router example with mrf24j40 settings
1 configured as gnrc_networking example with mrf24j40 settings

network bug

All 21 comments

Using the following patch I can definitively see that there are memory leaks. Will investigate!

diff --git a/examples/gnrc_border_router/main.c b/examples/gnrc_border_router/main.c
index 654b411..e37a2c8 100644
--- a/examples/gnrc_border_router/main.c
+++ b/examples/gnrc_border_router/main.c
@@ -22,10 +22,22 @@

 #include "shell.h"
 #include "msg.h"
+#include "net/gnrc/pktbuf.h"

 #define MAIN_QUEUE_SIZE     (8)
 static msg_t _main_msg_queue[MAIN_QUEUE_SIZE];

+int pktbuf_cmd(int argc, char **argv)
+{
+    gnrc_pktbuf_stats();
+    return 0;
+}
+
+static const shell_command_t shell_commands[] = {
+    { "pktbuf", "", pktbuf_cmd },
+    { NULL, NULL, NULL }
+};
+
 int main(void)
 {
     /* we need a message queue for the thread running the shell in order to
@@ -36,7 +48,7 @@ int main(void)
     /* start shell */
     puts("All up, running the shell now");
     char line_buf[SHELL_DEFAULT_BUFSIZE];
-    shell_run(NULL, line_buf, SHELL_DEFAULT_BUFSIZE);
+    shell_run(shell_commands, line_buf, SHELL_DEFAULT_BUFSIZE);

     /* should be never reached */
     return 0;
diff --git a/examples/gnrc_networking/main.c b/examples/gnrc_networking/main.c
index 6301f42..ce44400 100644
--- a/examples/gnrc_networking/main.c
+++ b/examples/gnrc_networking/main.c
@@ -22,14 +22,22 @@

 #include "shell.h"
 #include "msg.h"
+#include "net/gnrc/pktbuf.h"

 #define MAIN_QUEUE_SIZE     (8)
 static msg_t _main_msg_queue[MAIN_QUEUE_SIZE];

 extern int udp_cmd(int argc, char **argv);

+int pktbuf_cmd(int argc, char **argv)
+{
+    gnrc_pktbuf_stats();
+    return 0;
+}
+
 static const shell_command_t shell_commands[] = {
     { "udp", "send data over UDP and listen on UDP ports", udp_cmd },
+    { "pktbuf", "", pktbuf_cmd },
     { NULL, NULL, NULL }
 };

(I'm pinging from the host node btw. Hope this doesn't make a difference, but only this way I can analyse the packet buffer on the border router)

(The od module needs to be included for the border router, btw)

Ok... This only happens if I spam the node while resetting it, so this seems to be a different issue :-/.

(the leaks appear immediately).

I can't reproduce any of my issues from today either, both my nodes are currently rock stable. :(

Maybe some of the other bugfixes "accidentally" fixed it ^^

From what I can see, I suspect that there is some very old, non-deterministic bug in GNRC (or the packet buffer) that pops up from time to time:

  • It causes (seemingly) randomly appearing memory leaks in the packet buffer.
  • The data in those leaks also seems to be quite random; at least it does not look like packet data or data contained in gnrc_pktsnip_t instances.

Due to the non-deterministic nature of the bug, hunting for it is quite a strain, so I'm basically just hoping I find the culprit someday. But maybe it's a good way to ensure randomness on embedded hardware 😜

I'm starting to suspect the RPL code on this one. Just had the second node in the chain of 3 quit responding due to full packet buffers. I'm going to investigate some more tomorrow.
If I remember correctly I had RPL enabled when I discovered this issue, but had RPL disabled here.

It took around 45 minutes before the packet buffer filled up again on the second node in the chain. The third node in the chain is also spewing packet buffer full errors, but the first node (the 6lbr) appears fine. This might point to the DAO send functions, since the 6lbr doesn't use them (but this is just a hunch).

Confirmed, after initializing RPL I get memory leaks. Will investigate.

So far I've noticed this only on non-6lbr nodes. That's why my initial suspicion is the DAO code. Also, could #8126 be related? Options appear to be missing from the packets there.

Could be :-/ but I'm not deep enough into the RPL code to make a more definite statement. It might be that @cgundogan can find something, but he is currently on vacation :-/

Since we could isolate it to RPL and the RPL changes in #7925 don't touch the packet buffer, I would prefer not to treat this issue as blocking for #7925. What do you think?

@miri64 I might have found something:

Here the RPL options are added to the DAO packets for every route known in the routing table: the code iterates over the routes in the routing table and adds them as snips to the chain. I think the problem is that _dao_transit_build is passed NULL as its previous gnrc_pktsnip_t pkt argument. The previously allocated options then no longer have a reference to them and thus are never freed. This also explains why the DAO frame was missing target options (#8126).
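
To illustrate the kind of leak described above, here is a minimal, self-contained sketch in plain C. It is not the RIOT code: snip_t, snip_add(), chain_release() and dao_leaked_snips() are hypothetical stand-ins for gnrc_pktsnip_t and the packet buffer functions, and the sketch assumes (as a simplification) that one transit option is built after all per-route target options. It only shows why building a snip with NULL as its previous snip orphans the chain built so far.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for gnrc_pktsnip_t: only the chain pointer matters. */
typedef struct snip {
    struct snip *next;
} snip_t;

/* Number of snips that are allocated and not yet freed. */
static int live_snips = 0;

/* Allocate a snip and put it in front of the chain `next`
 * (assumed to mimic how a new snip is prepended to an existing packet). */
static snip_t *snip_add(snip_t *next)
{
    snip_t *s = malloc(sizeof(*s));
    assert(s != NULL);
    s->next = next;
    live_snips++;
    return s;
}

/* Free every snip reachable from `head`. */
static void chain_release(snip_t *head)
{
    while (head != NULL) {
        snip_t *next = head->next;
        free(head);
        live_snips--;
        head = next;
    }
}

/* Build one "target option" per route, then one "transit option", release
 * the resulting packet and return how many snips are still allocated.
 * pass_previous == 0 models the suspected bug: the transit option is built
 * with NULL instead of the chain of target options. */
static int dao_leaked_snips(unsigned routes, int pass_previous)
{
    snip_t *pkt = NULL;

    live_snips = 0;
    for (unsigned i = 0; i < routes; i++) {
        pkt = snip_add(pkt);                     /* target option for route i */
    }
    pkt = snip_add(pass_previous ? pkt : NULL);  /* transit option */
    chain_release(pkt);  /* only frees what is reachable from pkt */
    return live_snips;
}
```

In this model, dao_leaked_snips(3, 0) reports 3 leaked snips (one per route), while dao_leaked_snips(3, 1) reports 0. The unreachable target options are also absent from the released packet, which matches the missing options observed in #8126.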

True. Nice catch! Will fix.

(or are you already on it?)

(i have some unrelated RPL fixes incoming anyway ;-))

No significant code here; so far I've only verified that it fixes #8126. I'll wait for your PR :)

@bergzand please close this if your long-term RPL tests prove this to be fixed.

https://github.com/RIOT-OS/RIOT/pull/7925#issuecomment-346792780 tells me this can be closed.
