RIOT 🚀 - gnrc: crash with (excessive) traffic in native

I have a hunch that this _might_ be related to #5748 (or at least fixable with #6086). Can you try with #6086?

miri64 on 15 Nov 2016

(it might be, that gnrc_pktbuf_start_write() needs fixing in #6086 too)

miri64 on 15 Nov 2016

okay does not seem to be related. Here's a more detailed GDB dump:

Program received signal SIGSEGV, Segmentation fault.
0x0804e026 in gnrc_pktbuf_start_write (pkt=0x20) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_static/gnrc_pktbuf_static.c:256
256     if ((pkt == NULL) || (pkt->size == 0)) {
(gdb) where
#0  0x0804e026 in gnrc_pktbuf_start_write (pkt=0x20) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_static/gnrc_pktbuf_static.c:256
#1  0x0805c447 in _send (pkt=0x8083ef4 <_pktbuf+756>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:684
#2  0x0805bd8e in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:279
#3  0xf7e3116b in makecontext () at ../sysdeps/unix/sysv/linux/i386/makecontext.S:91
#4  0x00000000 in ?? ()

miri64 on 16 Nov 2016

Or maybe it is.. sorry about the noise. pkt=0x20 comes from the packet in _send, so this still might be a packet buffer problem...

miri64 on 16 Nov 2016

(but #6086 does not solve the problem)

miri64 on 16 Nov 2016

(but #6086 does not solve the problem)

Too bad. Thanks for testing, I didn't get around, yet. So you can reproduce this?

kaspar030 on 16 Nov 2016

So you can reproduce this?

Yepp. But since #6086 doesn't fix #5748 either the problems still seem to be related. But don't have the time atm to look into that. :(

miri64 on 16 Nov 2016

cannot reproduce with 2018.01-RC2 on macOS native, flood pinged for >20min with 8 instances against 1 RIOT still works

smlng on 31 Jan 2018

Still happens here, current master, on linux.

kaspar030 on 31 Jan 2018

Still happens here, current master, on linux.

8 instances, started manually. after 6-7 first packets get lost and riot prints a couple of "gnrc_netif: possibly lost interrupt.", like every second, then it crashes:

> gnrc_netif: possibly lost interrupt. gnrc_netif: possibly lost interrupt. gnrc_netif: possibly lost interrupt. gnrc_netif: possibly lost interrupt. Segmentation fault (core dumped)

kaspar030 on 31 Jan 2018

my test still runs on macOS, RIOT native not crashed after > 1h with 8 parallel ping6 -f on its link-local address. Though, according to Wireshark not all ping request get a reply but many do.

smlng on 31 Jan 2018

cannot reproduce with 2018.01-RC2 on macOS native

That might hint that the problem is linux specific. What machine did you run the test on? Maybe this only happens when RIOT doesn't get scheduled enough? My thinkpad is fairly slow by today's standards.

kaspar030 on 31 Jan 2018

my test still runs on macOS, RIOT native not crashed after > 1h with 8 parallel ping6 -f on its link-local address.

Can you try increasing to 16 or more?

kaspar030 on 31 Jan 2018

latest iMac with i7 @ ~4Ghz and 16GB RAM

Can you try increasing to 16 or more?

sure!

smlng on 31 Jan 2018

latest iMac with i7 @ ~4Ghz and 16GB RAM

That's a 4GHz quad core? Eight ping6 -f are ~~enough~~ needed to break native on dual 1.6GHz. Do the math, we need to starve the RIOT native thread. ;)

kaspar030 on 31 Jan 2018

so 16 parallel pings on RIOT native, still working - but lots of `ping6: sendmsg: No buffer space available" from macOS

smlng on 31 Jan 2018

and Wireshark has a hard time to capture all those ping requests and replies ...

smlng on 31 Jan 2018

The other difference between macOS and Linux is that the async_read used by netdev_tap has a different implementation. Maybe that is causing the different observations here.

bergzand on 31 Jan 2018

I tried to reproduce this issue and it really took longtime before crashing, though I didn't exaggerate the number of pings, only 4 instances flooding on vagrant.

Is it expected that RIOT ensures an "eternal" functioning state?

kYc0o on 20 Jul 2018

It is expected that a system exposed to a network is able to either handle the packets or drop them. Crashing is unexceptable as this leaves doors for DoS attacks open. It is unclear though if it is actually GNRC, RIOT, the native sublayer or even Linux' TAP infrastructure that is crashing here. Maybe some simple, complexity-increasing tests (first test a TAP application on Linux, than just netdev_tap, then netdev_tap in a non-main thread, then go to setup explained above) could pin-point the cause of the issue.

miri64 on 20 Jul 2018

Is it expected that RIOT ensures an "eternal" functioning state?

Yes.

kaspar030 on 21 Jul 2018

I still can reproduce btw. Maybe in September I find some time to bite myself into this like I did with the leak in gnrc_pktbuf_static.

miri64 on 21 Jul 2018

This issue was re-reported in #10875 with a bit more detailed test description. If the description above is not enough for a tester, please refer to #10875.

miri64 on 26 Jan 2019

In #10875 I said

[…] in [the GDB dump in this issue and and in #10875] the packet seems to get corrupted while being in gnrc_ipv6's message queue (possibly due to a too early release)

I can confirm this now at least for one isolated case (note that pkt in _send is at an unused spot):

Program received signal SIGSEGV, Segmentation fault.
0x5656d0a2 in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) call gnrc_pktbuf_stats()
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 6144
=========== chunk   0 (0x5659cd80 size:   96) ===========
00000000  98  CD  59  56  B0  D3  59  56  DC  05  00  00  01  00  00  00
00000010  01  00  00  00  00  00  00  00  00  00  00  00  B0  CD  59  56
00000020  14  00  00  00  01  00  00  00  FF  FF  FF  FF  3A  40  FE  80
00000030  06  06  06  00  00  00  00  00  1E  02  ED  E1  3E  C0  1E  02
00000040  ED  E1  3E  C1  00  00  1C  02  08  CE  59  56  80  DF  59  56
00000050  DC  05  00  00  01  00  00  00  01  00  00  00  00  00  DD  63
~ unused: 0x5659cde0 (next: 0x5659ce38, size:   40) ~
=========== chunk   1 (0x5659ce08 size:   48) ===========
00000000  00  00  00  00  20  CE  59  56  14  00  00  00  01  00  00  00
00000010  FF  FF  FF  FF  46  47  48  49  06  06  06  00  00  00  00  00
[…]
(gdb) where
#0  0x5656d0a2 in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dcde in _send (pkt=0x5659cde0 <_pktbuf+96>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:539
#2  0x5656d4a2 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()

I'll try to reproduce it a few times more now.

miri64 on 26 Jan 2019

Reproduced it (with different pointers) three more times and one time even with sudo ping -s0 -f fe80::1c02:edff:fee1:3ec1%tapbr0, so size doesn't matter here as assessed by @gschorcht in https://github.com/RIOT-OS/RIOT/issues/10875#issuecomment-457830031 is also confirmed. I'll try to reproduce with gnrc_pktbuf_malloc and valgrind now.

miri64 on 26 Jan 2019

Was able to reproduce with valgrind, but I'm confused by the output.

==25483== Invalid read of size 4
==25483==    at 0x120672: _send (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:533)
==25483==    by 0x11FE55: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)
==25483==  Address 0x5fd6c98 is 16 bytes inside a block of size 20 free'd
==25483==    at 0x4836857: free (/build/valgrind/src/valgrind/coregrind/m_replacemalloc/vg_replace_malloc.c:530)
==25483==    by 0x1120E5: _release_error_locked (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:214)
==25483==    by 0x112165: gnrc_pktbuf_release_error (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:228)
==25483==    by 0x12B753: gnrc_pktbuf_release (sys/include/net/gnrc/pktbuf.h:171)
==25483==    by 0x12B7F0: gnrc_pktbuf_remove_snip (sys/net/gnrc/pktbuf/gnrc_pktbuf.c:59)
==25483==    by 0x1206DC: _send (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:558)
==25483==    by 0x11FE55: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)
==25483==  Block was alloc'd at
==25483==    at 0x483562B: malloc (/build/valgrind/src/valgrind/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
==25483==    by 0x112301: _create_snip (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:276)
==25483==    by 0x111BD8: gnrc_pktbuf_add (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:97)
==25483==    by 0x11158D: gnrc_netif_hdr_build (sys/net/gnrc/netif/hdr/gnrc_netif_hdr.c:20)
==25483==    by 0x11E7E9: gnrc_icmpv6_echo_req_handle (sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c:94)
==25483==    by 0x11E252: gnrc_icmpv6_demux (sys/net/gnrc/network_layer/icmpv6/gnrc_icmpv6.c:91)
==25483==    by 0x11FC3F: _demux (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:112)
==25483==    by 0x120CF5: _receive (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:799)
==25483==    by 0x11FE42: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:188)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)

So it crashes in https://github.com/RIOT-OS/RIOT/blob/bdd2d52fd57c8cc87f9705b2e31c32f58a123a06/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c#L533 because it was already released in https://github.com/RIOT-OS/RIOT/blob/bdd2d52fd57c8cc87f9705b2e31c32f58a123a06/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c#L558 (which comes after...). Either the packet was put into the queue twice or it did not properly dequeue for some reason. Could that be possible @kaspar030?

miri64 on 26 Jan 2019

Since I am not yet able to reproduce this by adding some printfs around the msg_receive in gnrc_ipv6:c:_event_loop, I'm wondering if this might be related to https://github.com/RIOT-OS/RIOT/issues/1811 (edit: which was only closed because vtimer got kicked out so the issue wasn't reproducable any more as described)... did hit the race condition now after some time

miri64 on 26 Jan 2019

I was able to reproduce the crash with the following patch:

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..d42696c18 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,7 +180,11 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        printf("%d\n", cib_peek((cib_t *)&sched_active_thread->msg_queue));
         msg_receive(&msg);
+        printf("%d (%p)\n", cib_peek((cib_t *)&sched_active_thread->msg_queue),
+               msg.content.ptr);
+

         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:

Shortly before the crash I get the following output (using gnrc_pktbuf_malloc again, but just plane gdb):

[…]
7
0 (0x565ac0a0)
0
1 (0x565ab1d0)
1
2 (0x565ac060)
2
3 (0x565ab1f0)
3
4 (0x565acd60)
4
5 (0x565ac740)
5
6 (0x565ab190)
6
7 (0x565abfc0)
7
0 (0x565ab210)
0
1 (0x565ac0a0)
1
2 (0x565ab1d0)
2
3 (0x565ac060)
3
4 (0x565ab1f0)
4
5 (0x565acd60)
5
6 (0x565ac740)
6
7 (0x565ab190)
7
-1 (0x565ab230)
0
-1 (0x565a9e50)
-1
-1 (0x565a9e50)

Program received signal SIGSEGV, Segmentation fault.
0x5655efa7 in _release_error_locked (pkt=0x20, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:210
210             tmp = pkt->next;
(gdb) where
#0  0x5655efa7 in _release_error_locked (pkt=0x20, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:210
#1  0x5655f05e in gnrc_pktbuf_release_error (pkt=0x565a9de8, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:228
#2  0x5656d6e9 in _send (pkt=0x565a9de8, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:566
#3  0x5656ce44 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#4  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#5  0x00000000 in ?? ()

Notice the last for lines before the SIGSEGV. Weirdly, msg.content.ptr also is different from the pointer handed to _send as pkt (which should be msg.content.ptr).

miri64 on 26 Jan 2019

7
-1 (0x565ab230)
0
-1 (0x565a9e50)
-1
-1 (0x565a9e50)

I think that can be seen as significant, because these are the first -1 that comes up once the ping starts.

miri64 on 26 Jan 2019

Also able to reproduce with gnrc_pktbuf_static. This time the pointers match.

[…]
5
6 (0x5659cdb0)
6
7 (0x5659ce28)
7
0 (0x5659ce40)
0
1 (0x5659d1d0)
1
2 (0x5659cf88)
2
3 (0x5659d1e8)
3
4 (0x5659cf68)
4
5 (0x5659cff8)
5
6 (0x5659d080)
6
7 (0x5659d098)
7
0 (0x5659d118)
0
1 (0x5659ce50)
1
2 (0x5659cd80)
2
3 (0x5659cff8)
3
-1 (0x5659d118)
-1
-1 (0x5659d118)

Program received signal SIGSEGV, Segmentation fault.
0x5656d13f in gnrc_netif_hdr_get_netif (hdr=0x1468) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d13f in gnrc_netif_hdr_get_netif (hdr=0x1468) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656ddd4 in _send (pkt=0x5659d118 <_pktbuf+920>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:543
#2  0x5656d598 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb) call gnrc_pktbuf_stats()
gnrc_netif: possibly lost interrupt.
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 2984
=========== chunk   0 (0x5659cd80 size:  224) ===========
00000000  B8  CF  59  56  50  CE  59  56  0E  00  00  00  01  00  00  00
00000010  FF  FF  FF  FF  00  00  00  00  00  00  00  00  B0  CE  59  56
[…]
=========== chunk   6 (0x5659d0f0 size:   40) ===========
00000000  60  00  00  00  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C1  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C0
~ unused: 0x5659d118 (next: (nil), size: 5224) ~

miri64 on 26 Jan 2019

New patch, same result (but interesting inside):

diff --git a/core/msg.c b/core/msg.c
index a46875f16..61de2e7a2 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -310,6 +310,9 @@ static int _msg_receive(msg_t *m, int block)
         DEBUG("_msg_receive: %" PRIkernel_pid ": _msg_receive(): We've got a queued message.\n",
               sched_active_thread->pid);
         *m = me->msg_array[queue_index];
+        if (sched_active_pid == 4) {
+            printf("idx: %d (%p)\n", queue_index, m->content.ptr);
+        }
     }
     else {
         me->wait_data = (void *) m;
@@ -330,6 +333,9 @@ static int _msg_receive(msg_t *m, int block)
             thread_yield_higher();

             /* sender copied message */
+            if (sched_active_pid == 4) {
+                printf("blk %d => %p\n", queue_index, m->content.ptr);
+            }
         }
         else {
             irq_restore(state);
@@ -353,6 +359,7 @@ static int _msg_receive(msg_t *m, int block)
         /* copy msg */
         msg_t *sender_msg = (msg_t*) sender->wait_data;
         *m = *sender_msg;
+        printf("sbl %d => %p\n", queue_index, m->content.ptr);

         /* remove sender from queue */
         uint16_t sender_prio = THREAD_PRIORITY_IDLE;

idx: 0 (0x5659d210)
idx: 1 (0x5659cd80)
idx: 2 (0x5659cd98)
idx: 3 (0x5659cfe0)
idx: 4 (0x5659d0f0)
idx: 5 (0x5659cf40)
blk -1 => 0x5659cda8
idx: 6 (0x5659cdc0)
idx: 7 (0x5659cd80)
idx: 0 (0x5659cdd8)
idx: 1 (0x5659ce50)
idx: 2 (0x5659cec8)
idx: 3 (0x5659cf80)
idx: 4 (0x5659d178)
idx: 5 (0x5659cfe0)
idx: 6 (0x5659d108)
idx: 7 (0x5659ce38)
idx: 0 (0x5659ceb0)
idx: 1 (0x5659cf68)
idx: 2 (0x5659cff8)
idx: 3 (0x5659cd80)
idx: 4 (0x5659d130)
blk -1 => 0x5659d130

Program received signal SIGSEGV, Segmentation fault.
0x5656d114 in gnrc_netif_hdr_get_netif (hdr=0x1450) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d114 in gnrc_netif_hdr_get_netif (hdr=0x1450) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dd50 in _send (pkt=0x5659d130 <_pktbuf+944>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:543
#2  0x5656d514 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb) call gnrc_pktbuf_stats()
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 2944
=========== chunk   0 (0x5659cd80 size:   48) ===========
00000000  08  CE  59  56  D8  CD  59  56  30  00  00  00  01  00  00  00
00000010  01  00  00  00  00  00  00  00  80  CE  59  56  50  CE  59  56
[…]
~ unused: 0x5659d0d8 (next: 0x5659d148, size:   16) ~
=========== chunk   4 (0x5659d0e8 size:   96) ===========
00000000  60  09  3B  94  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C0  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C1  80  00  54  C2  46  29  58  86
00000030  00  00  00  00  30  D1  59  56  14  00  00  00  01  00  00  00
00000040  FF  FF  FF  FF  3A  40  FE  80  06  06  06  00  00  00  00  00
00000050  1E  02  ED  E1  3E  C0  1E  02  ED  E1  3E  C1  00  00  1C  02
~ unused: 0x5659d148 (next: 0x5659d1a0, size:   40) ~
=========== chunk   5 (0x5659d170 size:   48) ===========
00000000  60  09  3B  94  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C0  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C1  80  00  55  43  46  2B  58  03
~ unused: 0x5659d1a0 (next: (nil), size: 5088) ~

First interesting difference: this time pkt does not refer to an unused space for the first time, but I think this might be irrelevant for now. More interesting is, that the first time 0x5659d130 was taken from the queue in https://github.com/RIOT-OS/RIOT/blob/bdd2d52fd57c8cc87f9705b2e31c32f58a123a06/core/msg.c#L312 while the second time it was copied directly from the sending thread to wait_data set here
https://github.com/RIOT-OS/RIOT/blob/bdd2d52fd57c8cc87f9705b2e31c32f58a123a06/core/msg.c#L315

I'm not 100% sure, but I think that puts the odds more to the side of the packet being somehow ~~queued~~ dispatched for send twice in a row... I'll investigate further.

miri64 on 26 Jan 2019

@Good job :smile:

gschorcht on 26 Jan 2019

(btw the IPv6 thread itself is the only one in this scenario sending packets to IPv6)

miri64 on 26 Jan 2019

(btw the IPv6 thread itself is the only one in this scenario sending packets to IPv6)

Here is where I'm confused... msg_send_to_self() only uses the queue appearently:

https://github.com/RIOT-OS/RIOT/blob/bdd2d52fd57c8cc87f9705b2e31c32f58a123a06/core/msg.c#L174-L183

so how does the message end up in wait_data?

miri64 on 26 Jan 2019

Yepp.. definitely no double dispatch:

echo:112 0x5659d230
echo:112 0x5659d168
echo:112 0x5659cf28
echo:112 0x5659cf78
echo:112 0x5659d030
echo:112 0x5659d0e0
echo:112 0x5659cef8
echo:112 0x5659d180
echo:112 0x5659d180
echo:112 0x5659d0a0
echo:112 0x5659d198
echo:112 0x5659cf70
echo:112 0x5659d200
echo:112 0x5659cff8
echo:112 0x5659d218
echo:112 0x5659d230
echo:112 0x5659d238
echo:112 0x5659ce88
echo:112 0x5659d078
echo:112 0x5659d0d0
echo:112 0x5659d130
echo:112 0x5659cff0
echo:112 0x5659cfd0
echo:112 0x5659cf80
echo:112 0x5659d1c8
echo:112 0x5659ced0
echo:112 0x5659d1e0
echo:112 0x5659d068
echo:112 0x5659d0c8
echo:112 0x5659d1f8
echo:112 0x5659cf60
echo:112 0x5659d210
echo:112 0x5659d1f8
echo:112 0x5659cfe0
echo:112 0x5659d140
echo:112 0x5659d058
echo:112 0x5659d180
echo:112 0x5659d098
echo:112 0x5659d0b0
echo:112 0x5659cf58
echo:112 0x5659d270
echo:112 0x5659ceb0
echo:112 0x5659d1f0
echo:112 0x5659d0c8
echo:112 0x5659d010
echo:112 0x5659d098
echo:112 0x5659d208
echo:112 0x5659d1b8
echo:112 0x5659d1d0
echo:112 0x5659d1a8
echo:112 0x5659d0d8
echo:112 0x5659ce20
echo:112 0x5659ce98
echo:112 0x5659cf38
echo:112 0x5659cfd0
echo:112 0x5659d048
echo:112 0x5659d0c0
echo:112 0x5659d1d8
echo:112 0x5659d060
echo:112 0x5659d160
echo:112 0x5659ce98
echo:112 0x5659cf38
echo:112 0x5659cf10
echo:112 0x5659d108
echo:112 0x5659d048
echo:112 0x5659d198
echo:112 0x5659d128

Program received signal SIGSEGV, Segmentation fault.
0x5656d0fc in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d0fc in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dd38 in _send (pkt=0x5659d128 <_pktbuf+936>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:539
#2  0x5656d4fc in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb)

(this was the patch I used)

https://gist.github.com/miri64/a812f7b71fcf6dc71c70d8a9d79b24a9

I'll retry that with the msg.content.ptr print from above again, to make 100% sure.

miri64 on 26 Jan 2019

Yepp no double dispatch...

diff --git a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
index a48ac04f3..0b6f9a73c 100644
--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -109,6 +109,7 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,

     LL_PREPEND(pkt, hdr);

+    printf("echo:112 %p\n", (void *)pkt);
     if (!gnrc_netapi_dispatch_send(GNRC_NETTYPE_IPV6, GNRC_NETREG_DEMUX_CTX_ALL,
                                    pkt)) {
         DEBUG("icmpv6_echo: no receivers for IPv6 packets\n");
diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..e5dc0f32f 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -181,6 +181,7 @@ static void *_event_loop(void *args)
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
         msg_receive(&msg);
+        printf("ipv6:184 %p\n", msg.content.ptr);

         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:

[…]
ipv6:184 0x5659d0f0
ipv6:184 0x5659cd80
echo:112 0x5659d148
ipv6:184 0x5659ce08
echo:112 0x5659cea8
ipv6:184 0x5659d148
ipv6:184 0x5659cea8
ipv6:184 0x5659cea8
<crash>

miri64 on 26 Jan 2019

@kaspar030 it looks more and more that this is some issue in either core or native, since the message somehow ends up in wait_data I rather suspect the first.

miri64 on 26 Jan 2019

Summary

To summarize my findings from above (so you don't have to read all that):

A GNRC_NETAPI_MSG_TYPE_SND message is received twice in a row by the gnrc_ipv6 thread from itself at some point causing the thread to access parts of the packet released in https://github.com/RIOT-OS/RIOT/blob/bdd2d52fd57c8cc87f9705b2e31c32f58a123a06/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c#L558
The first time the message is retrieved from the thread's message queue. At the second receive the queue is empty and the message comes from thread_t::wait_data.
msg_send_to_self() only queues the packet.
It was confirmed that the packet is not dispatched twice for sending by the IPv6 thread, so at the moment the bug seems to be somewhere in the IPC state machine.
Bonus (not found above): There is no other msg_send() that could cause this (confirmed with git grep msg_send)

miri64 on 26 Jan 2019

That was bullshit... forget what I said. I should stop now.

Edit: I deleted that stupid comment...

miri64 on 26 Jan 2019

While the ping is running:

^C
Program received signal SIGINT, Interrupt.
native_isr_entry (sig=29, info=0x5659a5cc <sigalt_stk+6956>, context=0x5659a64c <sigalt_stk+7084>) at /home/mlenders/Repositories/RIOT-OS/RIOT/cpu/native/irq_cpu.c:287
287     {
(gdb) print sched_threads[4]->wait_data
$1 = (void *) 0x565a7480 <_stack+7648>
(gdb) print cib_avail(&sched_threads[4]->msg_queue)
$2 = 6
(gdb) call ps()
        pid | name                 | state    Q | pri | stack  ( used) | base addr  | current
          - | isr_stack            | -        - |   - |   8192 (   -1) | 0x56596500 | 0x56596500
          1 | idle                 | pending  Q |  15 |   8192 (  436) | 0x56594220 | 0x56596084
          2 | main                 | pending  Q |   7 |  12288 ( 1688) | 0x56591220 | 0x56594084
          3 | pktdump              | bl rx    _ |   6 |  12288 ( 1008) | 0x5659e5a0 | 0x565a1404
          4 | ipv6                 | running  Q |   4 |   8192 ( 1864) | 0x565a56a0 | 0x565a7504
          5 | udp                  | bl rx    _ |   5 |   8192 ( 1008) | 0x565a15c0 | 0x565a3424
          6 | gnrc_netdev_tap      | bl rx    _ |   2 |   8192 ( 2380) | 0x565a3680 | 0x565a54e4
          7 | RPL                  | bl rx    _ |   5 |   8192 (  944) | 0x565a7aa0 | 0x565a9904
            | SUM                  |            |     |  73728 ( 9328)

Why is wait_data set, when the thread is running? This has IMHO potential for a race condition.

Edit: 0x565a7480 <_stack+7648> in the pointer to the msg variable in gnrc_ipv6.c:_event_loop.

miri64 on 26 Jan 2019

I definitely can cause some funky behavior:

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..0c3fa97c4 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,6 +180,7 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        memset(&msg, 0, sizeof(msg));
         msg_receive(&msg);

         switch (msg.type) {
@@ -220,6 +221,7 @@ static void *_event_loop(void *args)
                 gnrc_ipv6_nib_handle_timer_event(msg.content.ptr, msg.type);
                 break;
             default:
+                printf("ipv6: unknown message type 0x%04x\n", msg.type);
                 break;
         }
     }

yields

[…]
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
ipv6: unknown message type 0x0000
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
[…]

miri64 on 27 Jan 2019

BTW with that patch, it took way longer (~1.5h) to crash the node and it crashed somewhere else (in some _receive() but don't know which, since I forgot to compile with debugging symbols, but I guess it is not in gnrc_ipv6.c ;-)).

miri64 on 27 Jan 2019

I've opened an issue btw (#10881) to have discussions regarding the potential msg_receive() bug there.

miri64 on 27 Jan 2019

:dancer:

miri64 on 1 Feb 2019

🚀1 🎉1

Riot: gnrc: crash with (excessive) traffic in native

2071 seems to have fixed native interrupt handling. But I can still crash a riot native instance by flood pinging.

Most helpful comment

All 44 comments

Summary

Related issues