Riot: gnrc: Packet buffer full errors after few hours of uptime

Created on 21 Nov 2017 · 21 Comments · Source: RIOT-OS/RIOT

After running for an hour or so, one of my nodes can't ping the border router anymore; the ping fails and prints "error: packet buffer full" three times. Ping from the border router to the node also doesn't work anymore (I assume also due to the packet buffer being full).

If there are steps I can take to debug this, I'd like to know. For now I can keep the GDB session attached.

Setup:
2 nucleo-f446 boards, each with an mrf24j40 radio
1 configured as the gnrc_border_router example with mrf24j40 settings
1 configured as the gnrc_networking example with mrf24j40 settings

network bug

bergzand

All 21 comments

Using the following patch I can definitively see that there are memory leaks. Will investigate!

diff --git a/examples/gnrc_border_router/main.c b/examples/gnrc_border_router/main.c
index 654b411..e37a2c8 100644
--- a/examples/gnrc_border_router/main.c
+++ b/examples/gnrc_border_router/main.c
@@ -22,10 +22,22 @@

 #include "shell.h"
 #include "msg.h"
+#include "net/gnrc/pktbuf.h"

 #define MAIN_QUEUE_SIZE     (8)
 static msg_t _main_msg_queue[MAIN_QUEUE_SIZE];

+int pktbuf_cmd(int argc, char **argv)
+{
+    gnrc_pktbuf_stats();
+    return 0;
+}
+
+static const shell_command_t shell_commands[] = {
+    { "pktbuf", "", pktbuf_cmd },
+    { NULL, NULL, NULL }
+};
+
 int main(void)
 {
     /* we need a message queue for the thread running the shell in order to
@@ -36,7 +48,7 @@ int main(void)
     /* start shell */
     puts("All up, running the shell now");
     char line_buf[SHELL_DEFAULT_BUFSIZE];
-    shell_run(NULL, line_buf, SHELL_DEFAULT_BUFSIZE);
+    shell_run(shell_commands, line_buf, SHELL_DEFAULT_BUFSIZE);

     /* should be never reached */
     return 0;
diff --git a/examples/gnrc_networking/main.c b/examples/gnrc_networking/main.c
index 6301f42..ce44400 100644
--- a/examples/gnrc_networking/main.c
+++ b/examples/gnrc_networking/main.c
@@ -22,14 +22,22 @@

 #include "shell.h"
 #include "msg.h"
+#include "net/gnrc/pktbuf.h"

 #define MAIN_QUEUE_SIZE     (8)
 static msg_t _main_msg_queue[MAIN_QUEUE_SIZE];

 extern int udp_cmd(int argc, char **argv);

+int pktbuf_cmd(int argc, char **argv)
+{
+    gnrc_pktbuf_stats();
+    return 0;
+}
+
 static const shell_command_t shell_commands[] = {
     { "udp", "send data over UDP and listen on UDP ports", udp_cmd },
+    { "pktbuf", "", pktbuf_cmd },
     { NULL, NULL, NULL }
 };

miri64 on 21 Nov 2017

(I'm pinging from the host node btw. Hope this doesn't make a difference, but only this way I can analyse the packet buffer on the border router)

miri64 on 21 Nov 2017

(The od module needs to be included for the border router, btw)

miri64 on 21 Nov 2017

Ok... This only happens if I spam the node while resetting it, so this seems to be a different issue :-/.

miri64 on 21 Nov 2017

(the leaks appear immediately).

miri64 on 21 Nov 2017

I can't reproduce any of my issues from today either, both my nodes are currently rock stable. :(

bergzand on 21 Nov 2017

Maybe some of the other bugfixes "accidentally" fixed it ^^

miri64 on 21 Nov 2017

From what I can see, I suspect that there is some very old, non-deterministic bug in GNRC (or the packet buffer) that pops up from time to time:

It causes (seemingly) randomly appearing memory leaks in the packet buffer.
The data in those leaks also seems to be quite random; at least it does not look like packet data or data contained in gnrc_pktsnip_t instances.

Due to the non-deterministic nature of the bug it is quite straining to hunt for, so I'm basically just hoping I find the culprit someday. But maybe it's a good way to assure randomness on embedded hardware 😜 *runs*

miri64 on 21 Nov 2017

I'm starting to suspect the RPL code on this one. I just had the second node in a chain of 3 quit responding due to full packet buffers. I'm going to investigate some more tomorrow.
If I remember correctly, I had RPL enabled when I discovered this issue, but had RPL disabled here.

bergzand on 23 Nov 2017

It took around 45 minutes before the packet buffer filled up again on the second node in the chain. The third node in the chain is also spewing packet buffer full errors, but the first node (the 6lbr) appears fine. This might point to the DAO send functions, as the 6lbr doesn't use them (but this is just a hunch).

bergzand on 23 Nov 2017

Confirmed, after initializing RPL I get memory leaks. Will investigate.

miri64 on 23 Nov 2017

So far I've noticed this only on !6lbr nodes. That's why my initial suspicion is with the DAO code. Also, could #8126 be related? Options appear to be missing from the packets there.

bergzand on 23 Nov 2017

Could be :-/ but I'm not deep enough into the RPL code to make a more definite statement. It might be that @cgundogan can find something, but he is currently on vacation :-/

miri64 on 23 Nov 2017

Since we could isolate it to RPL and the RPL changes in #7925 don't touch the packet buffer, I would prefer to make this issue not blocking for #7925, what do you think?

miri64 on 23 Nov 2017

@miri64 I might have found something:

Here the RPL options are added to the DAO packets for every route known in the routing table: the code iterates over the routes in the routing table and adds them as snips to the chain. I think the problem is that _dao_transit_build is passed NULL as the previous gnrc_pktsnip_t pkt argument. The previously allocated options then no longer have a reference to them and thus are never freed. This also explains why the DAO frame was missing target options (#8126).
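
The leak pattern described in this comment can be illustrated with a minimal sketch in plain C. It does not use the RIOT API: snip_t, snip_add, chain_release and build_dao_leak_count are hypothetical stand-ins for gnrc_pktsnip_t, the option build functions and the DAO send code, and malloc/free stand in for the packet buffer. It only assumes what the comment states, namely that each new snip takes the previous chain as an argument and that releasing the head releases everything still reachable from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for gnrc_pktsnip_t: one element of a linked chain. */
typedef struct snip {
    struct snip *next;
    int id;
} snip_t;

/* Number of snips currently allocated and not yet released. */
static int live_snips = 0;

/* Allocate a snip and link it in front of `next` (the "previous pkt"). */
static snip_t *snip_add(snip_t *next, int id)
{
    snip_t *s = malloc(sizeof(*s));
    if (s == NULL) {
        return NULL;
    }
    s->next = next;
    s->id = id;
    live_snips++;
    return s;
}

/* Release every snip reachable from `head`. */
static void chain_release(snip_t *head)
{
    while (head != NULL) {
        snip_t *next = head->next;
        free(head);
        live_snips--;
        head = next;
    }
}

/*
 * Build one target option per route, then a final transit option, and
 * release the packet. Returns how many snips are still allocated afterwards
 * (or -1 if an allocation failed).
 * pass_null != 0 models the suspected bug: the transit option is built with
 * NULL instead of the existing chain, so the target options are orphaned.
 */
int build_dao_leak_count(int n_routes, int pass_null)
{
    assert(n_routes >= 0);
    int before = live_snips;
    snip_t *pkt = NULL;

    for (int i = 0; i < n_routes; i++) {
        snip_t *opt = snip_add(pkt, i);
        if (opt == NULL) {
            chain_release(pkt);
            return -1;
        }
        pkt = opt;
    }

    snip_t *transit = snip_add(pass_null ? NULL : pkt, -1);
    if (transit == NULL) {
        chain_release(pkt);
        return -1;
    }

    /* Only snips reachable from `transit` are released here. */
    chain_release(transit);
    return live_snips - before;
}
```

In this model, build_dao_leak_count(3, 1) leaves three snips allocated on every call, which is how a fixed-size packet buffer would slowly fill up with each DAO sent, while build_dao_leak_count(3, 0) releases everything.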

bergzand on 23 Nov 2017

True. Nice catch! Will fix.

miri64 on 23 Nov 2017

(or are you already on it?)

miri64 on 23 Nov 2017

(i have some unrelated RPL fixes incoming anyway ;-))

miri64 on 23 Nov 2017

No significant code here; so far I have only verified that it fixes #8126. I'll wait for your PR :)

bergzand on 23 Nov 2017

@bergzand please close this if your long-term RPL tests prove this to be fixed.

miri64 on 24 Nov 2017

https://github.com/RIOT-OS/RIOT/pull/7925#issuecomment-346792780 tells me, this can be closed.

miri64 on 24 Nov 2017
