Riot: Global Address remains in state TNT[3] forever

Created on 8 Aug 2020  路  13Comments  路  Source: RIOT-OS/RIOT

Description

The issue described in #13838 is still there, but the original issue also uncovered a different bug that is now fixed.
So I'm creating a new issue to reduce the noise.

Steps to reproduce the issue

Have a node sleep for a certain period of time.
After a couple of cycles the public address of the node will not become valid anymore.
This is remedied by a restart of the border router.

main.c

#include <stdio.h>

#include "periph/rtc.h"
#include "periph/pm.h"
#include "msg.h"
#include "net/gnrc/netif.h"

#define NETIF_NUM_MAX               (2)
#define BORDER_ROUTER_TIMEOUT_US    (1000 * US_PER_MS)

/* ifconfig */
int _gnrc_netif_config(int argc, char **argv);

static void _rtc_alarm(void* arg)
{
    (void) arg;

    pm_reboot();
    return;
}

static void _gnrc_netapi_set_all(netopt_state_t state)
{
    gnrc_netif_t* netif = NULL;
    while ((netif = gnrc_netif_iter(netif))) {
        /* retry if busy */
        while (gnrc_netapi_set(netif->pid, NETOPT_STATE, 0,
                               &state, sizeof(state)) == -EBUSY) {}
    }
}

static void pm_hibernate_for(unsigned sec)
{
    struct tm now;

    /* disable all network interfaces */
    _gnrc_netapi_set_all(NETOPT_STATE_SLEEP);

    rtc_get_time(&now);
    now.tm_sec += sec;
    rtc_set_alarm(&now, _rtc_alarm, NULL);
    pm_set(0);
}

int main(void)
{
    msg_t _main_msg_queue[4];
    msg_bus_entry_t subs[NETIF_NUM_MAX];
    unsigned count;
    gnrc_netif_t *netif;

    msg_init_queue(_main_msg_queue, ARRAY_SIZE(_main_msg_queue));

    /* enable network interfaces */
    gnrc_netif_init_devs();

    /* subscribe to all interfaces */
    count = 0;
    netif = NULL;
    while ((netif = gnrc_netif_iter(netif)) && count < NETIF_NUM_MAX) {
        msg_bus_t *bus = gnrc_netif_get_bus(netif, GNRC_NETIF_BUS_IPV6);

        msg_bus_attach(bus, &subs[count]);
        msg_bus_subscribe(&subs[count], GNRC_IPV6_EVENT_ADDR_VALID);

        ++count;
    }

    /* wait for global address */
    msg_t m;
    do {
        if (xtimer_msg_receive_timeout(&m, BORDER_ROUTER_TIMEOUT_US) < 0) {
            puts(">> border router timeout <<");
            _gnrc_netif_config(0, NULL);
            break;
        }
    } while (ipv6_addr_is_link_local(m.content.ptr));

    /* unsubscribe */
    count = 0;
    netif = NULL;
    while ((netif = gnrc_netif_iter(netif)) && count < NETIF_NUM_MAX) {
        msg_bus_t *bus = gnrc_netif_get_bus(netif, GNRC_NETIF_BUS_IPV6);
        msg_bus_detach(bus, &subs[count++]);
    }

    pm_hibernate_for(5);
    puts(">> hibernate failed? <<");

    return 0;
}

Makefile

APPLICATION = sleepy_node

# add "DEVELHELP=1" to the call to the make command
# to enable safety checking and error reporting
DEVELHELP=1

BOARD ?= samr21-xpro

# This has to be the absolute path to the RIOT base directory:
RIOTBASE ?= $(CURDIR)/../..

# We neet RTC to wake from hibernate
FEATURES_REQUIRED += periph_rtc

# Modules to include:
USEMODULE += shell_commands

# gnrc is a meta module including all required, basic gnrc networking modules
USEMODULE += gnrc
USEMODULE += gnrc_netdev_default

# manually initialize the network interface
USEMODULE += gnrc_netif_init_devs

# pull in the GNRC IP stack
USEMODULE += gnrc_ipv6_default

# enable GNRC notifications
USEMODULE += gnrc_netif_bus

# only print on error
CFLAGS += -DCONFIG_SKIP_BOOT_MSG

include $(RIOTBASE)/Makefile.include

# Set a custom channel if needed
include $(RIOTMAKE)/default-radio-settings.inc.mk

Some trial & error showed that this behavior only occurs if there is a second node that is on the whole time.
This node is just idling with the gnrc_networking and did receive a valid public address.

edit I could also reproduce the behavior with only one sleepy node and a border router.
edit2 a second node sure helps in provoking the behavior. This also happens if the second node just uses gnrc_ipv6_default instead of gnrc_ipv6_router_default and no RPL a second gnrc_ipv6_router_default is certainly required for this behavior to occur.

edit3: see https://github.com/RIOT-OS/RIOT/issues/14735#issuecomment-674147653 on how to reproduce this on native

Expected results

On success, the above program will produce no output.

Actual results

after a couple iterations:

2020-08-08 17:20:27,121 #  >> border router timeout <<
2020-08-08 17:20:27,128 # Iface  5  HWaddr: 71:A1  Channel: 26  Page: 0  NID: 0x23  PHY: O-QPSK 
2020-08-08 17:20:27,129 #           
2020-08-08 17:20:27,133 #           Long HWaddr: 5A:06:59:C9:2C:36:7D:78 
2020-08-08 17:20:27,141 #            TX-Power: 0dBm  State: IDLE  max. Retrans.: 3  CSMA Retries: 4 
2020-08-08 17:20:27,149 #           AUTOACK  ACK_REQ  CSMA  L2-PDU:102  MTU:1280  HL:64  6LO  
2020-08-08 17:20:27,150 #           IPHC  
2020-08-08 17:20:27,153 #           Source address length: 8
2020-08-08 17:20:27,156 #           Link type: wireless
2020-08-08 17:20:27,162 #           inet6 addr: fe80::5806:59c9:2c36:7d78  scope: link  VAL
2020-08-08 17:20:27,169 #           inet6 addr: 2001:db8::5806:59c9:2c36:7d78  scope: global  TNT[3]
2020-08-08 17:20:27,171 #           inet6 group: ff02::1

and if the second node was unplugged

2020-08-08 17:24:18,174 #  >> border router timeout <<
2020-08-08 17:24:18,181 # Iface  5  HWaddr: 71:A1  Channel: 26  Page: 0  NID: 0x23  PHY: O-QPSK 
2020-08-08 17:24:18,182 #           
2020-08-08 17:24:18,186 #           Long HWaddr: 5A:06:59:C9:2C:36:7D:78 
2020-08-08 17:24:18,193 #            TX-Power: 0dBm  State: IDLE  max. Retrans.: 3  CSMA Retries: 4 
2020-08-08 17:24:18,200 #           AUTOACK  ACK_REQ  CSMA  L2-PDU:102  MTU:1280  HL:64  6LO  
2020-08-08 17:24:18,202 #           IPHC  
2020-08-08 17:24:18,205 #           Source address length: 8
2020-08-08 17:24:18,207 #           Link type: wireless
2020-08-08 17:24:18,213 #           inet6 addr: fe80::5806:59c9:2c36:7d78  scope: link  VAL
2020-08-08 17:24:18,216 #           inet6 group: ff02::1

On the border router:

ping ff02::1
12 bytes from fe80::1075:7bff:fe4d:6a7f%5: icmp_seq=0 ttl=64 time=27.100 ms
12 bytes from fe80::1075:7bff:fe4d:6a7f%5: icmp_seq=1 ttl=64 time=19.020 ms
12 bytes from fe80::5806:59c9:2c36:7d78%6: icmp_seq=2 ttl=64 rssi=-51 dBm time=8.265 ms

--- ff02::1 PING statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 8.265/18.128/27.100 ms

fe80::1 dev #5 lladdr 12:75:7B:4D:6A:7F  STALE GC
fe80::1075:7bff:fe4d:6a7f dev #5 lladdr 12:75:7B:4D:6A:7F  REACHABLE GC

Iface  6  HWaddr: F6:FB  Channel: 26  Page: 0  NID: 0x23 
          Long HWaddr: 00:12:4B:00:18:F1:D6:7A 
           TX-Power: 0dBm  State: IDLE 
          AUTOACK  L2-PDU:102  MTU:1280  HL:64  RTR  
          RTR_ADV  6LO  IPHC  
          Source address length: 8
          Link type: wireless
          inet6 addr: fe80::212:4b00:18f1:d67a  scope: link  VAL
          inet6 addr: 2001:db8::212:4b00:18f1:d67a  scope: global  VAL
          inet6 group: ff02::2
          inet6 group: ff02::1
          inet6 group: ff02::1:fff1:d67a

Iface  5  HWaddr: D2:70:31:23:59:F5 
          L2-PDU:1500  MTU:1500  HL:64  RTR  
          Source address length: 6
          Link type: wired
          inet6 addr: fe80::d070:31ff:fe23:59f5  scope: link  VAL
          inet6 addr: fe80::2  scope: link  VAL
          inet6 group: ff02::2
          inet6 group: ff02::1
          inet6 group: ff02::1:ff23:59f5
          inet6 group: ff02::1:ff00:2

Versions

Riot Master

network bug

Most helpful comment

Found the real solution! RFC 6775, section 5.2:

   The routers SHOULD NOT garbage-collect Registered NCEs (see
   Section 3.4), since they need to retain them until the Registration
   Lifetime expires.  Similarly, if NUD on the router determines that
   the host is UNREACHABLE (based on the logic in [RFC4861]), the NCE
   SHOULD NOT be deleted but rather retained until the Registration
   Lifetime expires.  A renewed ARO should mark the cache entry as
   STALE.  Thus, for 6LoWPAN routers, the Neighbor Cache doesn't behave
   like a cache.  Instead, it behaves as a registry of all the host
   addresses that are attached to the router.

Will provide a PR to do that.

All 13 comments

Some debug output:

nib: get next hop link-layer address of fe80::5077:c931:e1b:1b8e%5
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: fe80::5077:c931:e1b:1b8e is on-link or in NC, start address resolution
nib: Handle packet (icmpv6->type = 133)
nib: Received valid router solicitation:
     - Source address: fe80::5077:c931:e1b:1b8e
     - Destination address: ff02::2
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: get next hop link-layer address of fe80::5077:c931:e1b:1b8e%5
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: fe80::5077:c931:e1b:1b8e is on-link or in NC, start address resolution
nib neigh
nib neigh
> nib: Handle packet (icmpv6->type = 133)
nib: Received valid router solicitation:
     - Source address: fe80::5077:c931:e1b:1b8e
     - Destination address: ff02::2
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: get next hop link-layer address of fe80::5077:c931:e1b:1b8e%5
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: fe80::5077:c931:e1b:1b8e is on-link or in NC, start address resolution
nib: Handle packet (icmpv6->type = 133)
nib: Received valid router solicitation:
     - Source address: fe80::5077:c931:e1b:1b8e
     - Destination address: ff02::2
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: get next hop link-layer address of fe80::5077:c931:e1b:1b8e%5
nib: Getting on-link node entry (addr = fe80::5077:c931:e1b:1b8e, iface = 5)
  No suitable entry found
nib: fe80::5077:c931:e1b:1b8e is on-link or in NC, start address resolution

While when I restart the other mode:

nib: fe80::d0ae:c1b:2054:58c is on-link or in NC, start address resolution
nib: Handle packet (icmpv6->type = 135)
nib: Received valid neighbor solicitation:
     - Target address: fe80::8d1b:5102:e402:d05e
     - Source address: 2001:db8::d0ae:c1b:2054:58c
     - Destination address: fe80::8d1b:5102:e402:d05e
nib: Storing SL2AO for later handling
nib: Storing ARO for later handling
nib: Getting on-link node entry (addr = 2001:db8::d0ae:c1b:2054:58c, iface = 5)
  Found 0x200011dc
nib: Getting on-link node entry (addr = 2001:db8::d0ae:c1b:2054:58c, iface = 5)
  Found 0x200011dc
nib: get next hop link-layer address of 2001:db8::d0ae:c1b:2054:58c%5
nib: Getting on-link node entry (addr = 2001:db8::d0ae:c1b:2054:58c, iface = 5)
  Found 0x200011dc
nib: 2001:db8::d0ae:c1b:2054:58c is on-link or in NC, start address resolution
nib: resolve address 2001:db8::d0ae:c1b:2054:58c%5 from neighbor cache
ninib: Handle timer event (ctx = 0x200011dc, type = 0x4fc8, now = 162178ms)
nib: Timeout DELAY state
nib: Probing nib: unicast to 2001:db8::d0ae:c1b:2054:58c (retrans. timer = 1000ms)
nib: get next hop link-layer address of 2001:db8::d0ae:c1b:2054:58c%5
nib: Getting on-link node entry (addr = 2001:db8::d0ae:c1b:2054:58c, iface = 5)
b  Found 0x200011dc
nib: 2001:db8::d0ae:c1b:2054:58c is on-link or in NC, start address resolution
nib: resolve address 2001:db8::d0ae:c1b:2054:58c%5 from neighbor cache
nib: Handle packet (icmpv6->type = 136)
nib: Received valid neighbor advertisement:
     - Target address: 2001:db8::d0ae:c1b:2054:58c
     - Source address: 2001:db8::d0ae:c1b:2054:58c
     - Destination address: 2001:db8::8d1b:5102:e402:d05e
     - Flags: RS-
nib: Getting on-link node entry (addr = 2001:db8::d0ae:c1b:2054:58c, iface = 5)
  Found 0x200011dc
nib: Set 2001:db8::d0ae:c1b:2054:58c%5 to REACHABLE for 15825ms
nib: No ARO in NA, falling back to classic DAD
 neigh
nib neigh
2001:db8::d0ae:c1b:2054:58c dev #5 lladdr D2:AE:0C:1B:20:54:05:8C  REACHABLE REGISTERED

I extended the Demo to also connect to a CoAP server behind the border router (well, that was the original setup and I stripped it down for the issue :wink: ) and I found that even if the node receiving a public IP, it sometimes would not receive an answer from the CoAP server.

The server would send out the answer as apparent from the log output, but the packet would not leave the border router. (I did not catch it on the air with a sniffer).

This might be the same issue or a completely different one, I'm not sure. Just posting it here to maybe narrow down the bug.

I pushed the complete demo client code to https://github.com/benpicco/sleepy_node

edit managed to reproduce this with just two samr21-xpro. (One border router, one sleepy node)
Once we get to

2001:db8::80b4:739d:2f34:7a79 dev #6 lladdr 82:B4:73:9D:2F:34:7A:79  UNREACHABLE REGISTERED

we don't go back

I can now reproduce this on native.

  • check out my zep_integration branch
  • check out the sleepy_node test application
  • run the border router:

    make -C examples/gnrc_border_router BOARD=native all term

    • send something over the 6lo interface to make sure the border router got connected to the ZEP dispatcher

      ping6 ff02::1

    • start the sleepy node

      make BOARD=native term

This gives

/home/benpicco/dev/sleepy_node/bin/native/sleepy_node.elf -z [::1]:17754 -Z 00:AA:BB:CC:DD:EE:FF:23  
RIOT native interrupts/signals initialized.
Native RTC initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

timeout
>> get time failed << // we got a valid address but could not connect to the CoAP server because we didn't start one
>> hibernate failed? <<


        !! REBOOT !!

RIOT native interrupts/signals initialized.
Native RTC initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

>> border router timeout << // we don't get a valid address ever again
Iface  7  HWaddr: FF:23  Channel: 26  NID: 0x23 
          Long HWaddr: 00:AA:BB:CC:DD:EE:FF:23 
           State: IDLE 
          L2-PDU:102  MTU:1280  HL:64  6LO  
          IPHC  
          Source address length: 8
          Link type: wireless
          inet6 addr: fe80::2aa:bbcc:ddee:ff23  scope: link  VAL
          inet6 addr: 2001:db8::2aa:bbcc:ddee:ff23  scope: global  TNT[3]
          inet6 group: ff02::1

>> hibernate failed? <<


        !! REBOOT !!

RIOT native interrupts/signals initialized.
Native RTC initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

>> border router timeout <<
Iface  7  HWaddr: FF:23  Channel: 26  NID: 0x23 
          Long HWaddr: 00:AA:BB:CC:DD:EE:FF:23 
           State: IDLE 
          L2-PDU:102  MTU:1280  HL:64  6LO  
          IPHC  
          Source address length: 8
          Link type: wireless
          inet6 addr: fe80::2aa:bbcc:ddee:ff23  scope: link  VAL
          inet6 addr: 2001:db8::2aa:bbcc:ddee:ff23  scope: global  TNT[3]
          inet6 group: ff02::1

>> hibernate failed? <<

Enabling Debugging in _nib-arsm.c 'fixes' the issue

--- a/sys/net/gnrc/network_layer/ipv6/nib/_nib-arsm.c
+++ b/sys/net/gnrc/network_layer/ipv6/nib/_nib-arsm.c
@@ -26,7 +26,7 @@
 #include "_nib-router.h"
 #include "_nib-6lr.h"

-#define ENABLE_DEBUG    (0)
+#define ENABLE_DEBUG    (1)
 #include "debug.h"

 static char addr_str[IPV6_ADDR_MAX_STR_LEN];

What I found is that next_ns in _probe_nbr() is always 4294967295 in the GNRC_IPV6_NIB_NC_INFO_NUD_STATE_UNREACHABLE case:

multicast to 2001:db8::204:2519:1801:c8c5's solicited nodes (retrans. timer = 11512ms)
10678 > 1000
multicast to 2001:db8::204:2519:1801:c8c5's solicited nodes (retrans. timer = 2646ms)
17028 > 1000
multicast to 2001:db8::5077:c931:e1b:1b8e's solicited nodes (retrans. timer = 1746ms)
4294967295 > 1000
multicast to 2001:db8::5077:c931:e1b:1b8e's solicited nodes (retrans. timer = 1836ms)
4294967295 > 1000
multicast to 2001:db8::204:2519:1801:c8c5's solicited nodes (retrans. timer = 1072ms)
4294967295 > 1000
multicast to 2001:db8::204:2519:1801:c8c5's solicited nodes (retrans. timer = 2568ms)
4294967295 > 1000
multicast to 2001:db8::5077:c931:e1b:1b8e's solicited nodes (retrans. timer = 4752ms)
4294967295 > 1000
multicast to 2001:db8::204:2519:1801:c8c5's solicited nodes (retrans. timer = 8744ms)
4294967295 > 1000
multicast to 2001:db8::5077:c931:e1b:1b8e's solicited nodes (retrans. timer = 8064ms)
nib neigh
nib neigh
2001:db8::204:2519:1801:c8c5 dev #5 lladdr 00:04:25:19:18:01:C8:C5  UNREACHABLE REGISTERED
2001:db8::5077:c931:e1b:1b8e dev #5 lladdr 52:77:C9:31:0E:1B:1B:8E  UNREACHABLE REGISTERED

(I selectively only enabled this print statement)

What I found is that next_ns in _probe_nbr() is always 4294967295 in the GNRC_IPV6_NIB_NC_INFO_NUD_STATE_UNREACHABLE case:

I don't quite understand why you identified this as the cause of the problem. If next_ns is 4294967295 (= UINT32_MAX, so there is no NS scheduled in the future) it means it passes this if-condition (according to how I interpret your output with the minimal information you provided netif->ipv6.retrans_time is 1000)

https://github.com/RIOT-OS/RIOT/blob/61f37ce39577e9e8f040d6167399558bce45fd8d/sys/net/gnrc/network_layer/ipv6/nib/_nib-arsm.c#L283

so eventually it will reach the _snd_ns() here

https://github.com/RIOT-OS/RIOT/blob/61f37ce39577e9e8f040d6167399558bce45fd8d/sys/net/gnrc/network_layer/ipv6/nib/_nib-arsm.c#L302

and since the destination address dst is multicast it should just skip this whole block

https://github.com/RIOT-OS/RIOT/blob/61f37ce39577e9e8f040d6167399558bce45fd8d/sys/net/gnrc/network_layer/ipv6/nib/_nib-arsm.c#L39-L61

So a multicast neighbor solicitation should be issued unless the packet buffer is full

https://github.com/RIOT-OS/RIOT/blob/61f37ce39577e9e8f040d6167399558bce45fd8d/sys/net/gnrc/network_layer/ndp/gnrc_ndp.c#L311-L313

I don't quite understand why you identified this as the cause of the problem.

It鈥檚 not the cause of the problem, I was just poking around and writing down what I thought might be odd.

I think I also tried replacing the broadcast with another Unicast but that didn鈥檛 work, but I鈥檒l try again.

I think I also tried replacing the broadcast with another Unicast but that didn鈥檛 work, but I鈥檒l try again.

That wouldn't work also if you go to Ethernet or WiFi rather than 6LoWPAN, because there link-layer address determination only works via NDP, not back-translation from the IID.

Yea I'd also rather find the root cause than work around the issue.
Unfortunately I'm at a loss here, only tracing the steps I've taken so far.

I think there is a conceptual error resulting from the combination of RFC 7048 and RFC 6775 happening here. RFC 6775 allows nodes not to be a member of the solicited nodes multicast group, while RFC 7048 expects them to. I think an non-standard solution (but effectively emulating the creation of a new entry) could be to take UNREACHABLE for existing entries on a 6LN interface into account in _handle_sl2ao() and set it to STALE when an NS was received.

Normally,

!INCOMPLETE NS, RS, RA, Redirect - unchanged
Same link-layer
address as cached.

From RFC 4861, appendix C would apply there, but since the node will never reply to the multicast probings of the border router, it won't work. Alternatively, we could make RFC 7048 optional and deactivate it for 6Lo-ND altogether.

RFC 7048 explicitly allows the alternative:

If the Neighbor Cache Entry was created by a Redirect message, a node MAY delete the NCE instead of changing its state to UNREACHABLE.

Not quite sure what I read there this afternoon.

Found the real solution! RFC 6775, section 5.2:

   The routers SHOULD NOT garbage-collect Registered NCEs (see
   Section 3.4), since they need to retain them until the Registration
   Lifetime expires.  Similarly, if NUD on the router determines that
   the host is UNREACHABLE (based on the logic in [RFC4861]), the NCE
   SHOULD NOT be deleted but rather retained until the Registration
   Lifetime expires.  A renewed ARO should mark the cache entry as
   STALE.  Thus, for 6LoWPAN routers, the Neighbor Cache doesn't behave
   like a cache.  Instead, it behaves as a registry of all the host
   addresses that are attached to the router.

Will provide a PR to do that.

Will provide a PR to do that.

See https://github.com/RIOT-OS/RIOT/pull/15077

Was this page helpful?
0 / 5 - 0 ratings