Zigbee2mqtt: Enddevices unreachable because route not updating

Created on 11 Apr 2019 · 27 Comments · Source: Koenkk/zigbee2mqtt

Bug Report

Some of my end devices (Hue motion sensor/dimmer switch) frequently lose their connection. I did some sniffing, and it seems to be a shepherd (or firmware) issue: some required route requests are missing.

What happened

Hue end devices seem to switch their parent frequently. That is fine as long as the coordinator notices the change.
The problem is that shepherd does not seem to handle "lazy" route updates, i.e. the case where the "parent switch" message got lost (due to a weak link or whatever). The coordinator does not handle this correctly.

Here is an example:

  • The first parent of device 0xf0ce (blue) was 0x35a3 (red). This is the parent the coordinator has in its routing table.

  • Then blue changed its parent to 0xbb5e (green)

  • but the coordinator (CO) missed this change! (still a valid scenario)
    This is the result:
    (screenshot)
    blue sends a request to CO (destination broadcast) through its parent green (WPAN layer). green forwards it to CO. CO responds (destination blue), but sends the packet to the wrong/old parent red!

  • the old parent red replies with Source Route Failure, since it does not "own" the child anymore

What did you expect to happen

CO should search for the new parent of blue after it receives a "Source Route Failure" (using a "Route Request"). But it does not, so all following requests fail. All requests initiated by CO also fail permanently.

Workaround:
If I restart and replug the CC2531, the first interaction with blue is the expected "Route Request".

(screenshot)
The new parent green replies, CO now knows the correct parent, and the "Read Attribute" request is sent to the correct parent and successfully delivered. The device is working again.

How to reproduce it (minimal and precise)

My bulbs are still connected to a normal light switch. Sometimes someone flips it accidentally, so my routers may go offline and the motion sensor is forced to search for a new parent.

Debug Info

latest github ioBroker adapter (should be same for zigbee2mqtt)
CC2531_MAX_STABILITY_20190315
("Issues" is not enabled in the forked shepherd repository, which is why I posted it here)

All 27 comments

I think this happens because the routes are never expired by the coordinator (https://github.com/Koenkk/Z-Stack-firmware/blob/master/coordinator/firmware.patch#L246). Could you compile a new firmware with #define SRC_RTG_EXPIRY_TIME 255?

I'm currently working on some Z-Stack 3 firmware so I will also take this into account.

I tried to compile, but Workbench gives me the following error after buildAll:

Error[e104]: Failed to fit all segments into specified ranges. Problem discovered in segment XDATA_N. Unable to place 2 block(s) (0x200 byte(s) total) in 0x1aa byte(s) of memory. The problem occurred while processing the segment placement command  
"-P(XDATA)XDATA_N=_XDATA_START-_XDATA_END", where at the moment of placement the available memory ranges were "XDATA:1d56-1eff" 
Error while running Linker 

Any suggestions?

git patch applied (some whitespace errors)
CC2531-ProdHex
FIRMWARE_CC2531
FIRMWARE_MAX_STABILITY

Can you reduce MAXMEMHEAP by 150?

Seems to help. Thanks.
Now I get "znp.js not found" at "Perform Post-Build Action".

Strange, is the .hex created?

No.
But znp.js exists in ZNP\CC253x\tools.

Hoped you had this error before already ;-)

Compile working now.
I moved the Z-Stack..1.2.2.a... folder to the root of drive C and renamed it to a short name. That way it worked.

It seems Xiaomi end devices show the same behavior. I keep losing some of my sensors on a regular basis. I can reset them so they work for some time, but eventually they lose their route and become unreachable. @allofmex Did the adjusted firmware solve your issue?

Didn't have much time to test yet. But I don't see much change as of now :-(

Tried SRC_RTG_EXPIRY_TIME = 255 and 2, also CONCENTRATOR_ROUTE_CACHE FALSE

Looks the same as before: the sensor works for some time, then it changes parent (rejoin request/response), followed by a Device Announcement.
Next is a Network Address Request, but the coordinator's Network Address Response fails because it uses the wrong/old parent.
What follows is a permanent loop of Network Address Request/failed response. No route requests are visible.

It seems to me that SRC_RTG_EXPIRY_TIME just means the route entry is marked as "expired" and may be discarded once the route count exceeds MAX_RTG_SRC_ENTRIES. It does not seem to trigger a new route request.

Is there a COMPLETE summary of all the possible compiler options anywhere?
I found "Breaking the 400-Node ZigBee Network Barrier", but that's not the complete list.

Got my network 99% stable now, but it's more a workaround than a solution :-(

I am using these firmware settings:

  #define MAXMEMHEAP 3229
  ...
  #define CONCENTRATOR_ENABLE TRUE
  #define CONCENTRATOR_ROUTE_CACHE FALSE
  #define CONCENTRATOR_DISCOVERY_TIME 120
  #define MAX_RTG_SRC_ENTRIES 1
  #define SRC_RTG_EXPIRY_TIME 2
  #undef MAX_RTG_ENTRIES
  #define MAX_RTG_ENTRIES 4
  #undef ROUTE_EXPIRY_TIME
  #define ROUTE_EXPIRY_TIME 2

Summary:

  • force the coordinator to use the MAX_RTG_ENTRIES table instead of MAX_RTG_SRC_ENTRIES (by setting MAX_RTG_SRC_ENTRIES to its minimum value).
  • force the coordinator to update the routing table frequently for fast recovery from outdated entries (EXPIRY_TIME 2 and small MAX_RTG_ENTRIES)

Disadvantage:

  • higher network load
  • will not scale well for larger networks

Explanation:
There are 2 different routing tables used in firmware

MAX_RTG_ENTRIES
is the standard table. It is built by CO broadcasting to all end devices; every router forwards any broadcast to all its siblings, and the shortest route to the device gets selected.
-> Causes a lot of network traffic

MAX_RTG_SRC_ENTRIES
is a second table (source routing). The end device sends a route record message. While this travels through the mesh, every router it passes adds its address to the message. When it arrives at the coordinator, the message contains a valid route.
-> More efficient, only one message needed to get a route.
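
As a rough illustration of the route record idea described above, here is a minimal C sketch. The struct, the function names and MAX_RELAYS are made up for this example and are not Z-Stack identifiers; the real frame layout and relay ordering are defined by the Zigbee specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_RELAYS 8  /* illustrative limit, not a Z-Stack constant */

/* Simplified route record: the originator plus the routers it passed through. */
typedef struct {
    uint16_t source;              /* end device that originated the record  */
    uint8_t  relay_count;         /* number of valid entries in relays[]    */
    uint16_t relays[MAX_RELAYS];  /* routers in the order they were visited */
} route_record_t;

/* Start an empty record at the originating device. */
void route_record_init(route_record_t *rec, uint16_t source)
{
    rec->source = source;
    rec->relay_count = 0;
}

/* Called by each router that forwards the record: append its own address.
 * Returns false if the relay list is already full. */
bool route_record_add_relay(route_record_t *rec, uint16_t router_addr)
{
    if (rec->relay_count >= MAX_RELAYS) {
        return false;
    }
    rec->relays[rec->relay_count++] = router_addr;
    return true;
}
```

With the addresses from the first post, blue (0xf0ce) would start the record and green (0xbb5e) would append itself, so the coordinator receives a record whose relay list already names the current parent.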

As I read in the docs, the coordinator first looks in the MAX_RTG_ENTRIES table, then in MAX_RTG_SRC_ENTRIES for a route. If nothing is found, it does the broadcast route query to find a route for the MAX_RTG_ENTRIES list.
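
The lookup order described above could be sketched like this in C. This only models the order as stated in this comment; the types and the function select_route are hypothetical and do not come from the Z-Stack sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ROUTE_FROM_RTG_TABLE,   /* found in the regular routing table        */
    ROUTE_FROM_SRC_TABLE,   /* found in the source routing table         */
    ROUTE_NEEDS_DISCOVERY   /* nothing cached: broadcast a route request */
} route_source_t;

typedef struct {
    bool     in_use;
    uint16_t dst;       /* destination short address   */
    uint16_t next_hop;  /* router to hand the frame to */
} route_entry_t;

/* Linear search for a used entry with the given destination. */
static const route_entry_t *find_entry(const route_entry_t *table, size_t len,
                                       uint16_t dst)
{
    for (size_t i = 0; i < len; i++) {
        if (table[i].in_use && table[i].dst == dst) {
            return &table[i];
        }
    }
    return NULL;
}

/* Regular table first, then source routing table, else route discovery. */
route_source_t select_route(const route_entry_t *rtg, size_t rtg_len,
                            const route_entry_t *src, size_t src_len,
                            uint16_t dst, uint16_t *next_hop)
{
    const route_entry_t *e = find_entry(rtg, rtg_len, dst);
    if (e != NULL) {
        *next_hop = e->next_hop;
        return ROUTE_FROM_RTG_TABLE;
    }
    e = find_entry(src, src_len, dst);
    if (e != NULL) {
        *next_hop = e->next_hop;
        return ROUTE_FROM_SRC_TABLE;
    }
    return ROUTE_NEEDS_DISCOVERY;
}
```

Note that in this model a stale source routing entry prevents the ROUTE_NEEDS_DISCOVERY branch from ever being reached, which matches the behavior seen in the sniffer.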

ROUTE_EXPIRY_TIME and SRC_RTG_EXPIRY_TIME mean: mark the route in the list as "can be deleted if space is needed". They do NOT mean the route entry is discarded after this time. It stays in the list and will be used. It is only replaced if the route is updated, or if the list is full and another entry needs to be added.
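
To make the "expired but still used" semantics concrete, here is a small C model of such a table. It is a sketch of the behavior as I understand it from the description above, not the actual firmware code; aging_entry_t, aging_tick, aging_lookup and aging_insert are invented names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXPIRY_TIME 2  /* seconds, mirrors SRC_RTG_EXPIRY_TIME 2 above */

typedef struct {
    bool     in_use;
    bool     expired;  /* slot may be reused, but lookups still return it */
    uint8_t  age;      /* seconds since the entry was last updated        */
    uint16_t dst;
    uint16_t parent;
} aging_entry_t;

/* Advance time by one second: old entries are only flagged, never removed. */
void aging_tick(aging_entry_t *t, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (t[i].in_use && !t[i].expired && ++t[i].age >= EXPIRY_TIME) {
            t[i].expired = true;
        }
    }
}

/* Lookup ignores the expired flag, so a stale entry keeps being used. */
const aging_entry_t *aging_lookup(const aging_entry_t *t, size_t len,
                                  uint16_t dst)
{
    for (size_t i = 0; i < len; i++) {
        if (t[i].in_use && t[i].dst == dst) {
            return &t[i];
        }
    }
    return NULL;
}

/* Refresh the entry for dst, or take a free slot, or reuse an expired one.
 * Returns false if every slot is occupied by a non-expired entry. */
bool aging_insert(aging_entry_t *t, size_t len, uint16_t dst, uint16_t parent)
{
    aging_entry_t *slot = NULL;
    for (size_t i = 0; i < len; i++) {
        if (t[i].in_use && t[i].dst == dst) {
            slot = &t[i];
            break;
        }
        if (slot == NULL && (!t[i].in_use || t[i].expired)) {
            slot = &t[i];
        }
    }
    if (slot == NULL) {
        return false;
    }
    slot->in_use = true;
    slot->expired = false;
    slot->age = 0;
    slot->dst = dst;
    slot->parent = parent;
    return true;
}
```

With a table of size 1, an expired entry for the sensor keeps being returned by aging_lookup until some other device inserts its own entry, which is exactly why a one-entry table with a 2 second expiry lets the stale route disappear quickly.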

CONCENTRATOR_ROUTE_CACHE is connected to source routing. FALSE just means that the coordinator tells devices "expect that I do not remember the route record, so always send your route record along with any (to-be-acknowledged) message".

This is the scenario that causes problems with the hue motion sensor (and maybe others):

  • there is no route entry for the device in MAX_RTG_ENTRIES, but there is an entry in MAX_RTG_SRC_ENTRIES
  • the end device changes its parent
  • the device must send a new route record message to update the coordinator's routing table
  • if this message is not sent or does not reach the coordinator, the old source routing entry will be used. -> Fails because of the wrong/old parent
  • sniffing the motion sensor gives the result shown in the screenshot in the first post. The sensor never sends a route record again, and the coordinator keeps using the failing source route entry.
  • The workaround above: since only one entry is possible in MAX_RTG_SRC_ENTRIES and it is marked as "replaceable" after 2 seconds, as soon as another device sends a route record it replaces the faulty entry and CO is forced to do the route broadcast. The sensor "recovers" after a short time.

In my opinion the real solution should be:
If the coordinator receives a 'Source route failure' message, it must discard its entry in the MAX_RTG_SRC_ENTRIES table. That way it would broadcast for a new MAX_RTG_ENTRIES route and get a valid new route.
Does anyone have an idea how we can achieve this behavior?
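
The proposed behavior could look roughly like the following C sketch, which drops only the cached entry of the device whose route failed. The table layout and the name on_source_route_failure are assumptions made for illustration; the real source routing table in Z-Stack has a different entry type (the patch further down in this thread shows entries owning a relayList that would have to be freed as well).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     in_use;
    uint16_t dst;     /* end device the cached route leads to */
    uint16_t parent;  /* last known parent of that device     */
} cached_src_route_t;

/* Drop the cached source route for one destination only.
 * Returns true if an entry was removed, i.e. the next transmission to
 * failed_dst has no cached route and must trigger route discovery. */
bool on_source_route_failure(cached_src_route_t *table, size_t len,
                             uint16_t failed_dst)
{
    bool removed = false;
    for (size_t i = 0; i < len; i++) {
        if (table[i].in_use && table[i].dst == failed_dst) {
            table[i].in_use = false;
            removed = true;
        }
    }
    return removed;
}
```

Routes to all other devices stay cached, so only the affected sensor causes a new broadcast.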

Then it will be possible to use better settings like
  #define CONCENTRATOR_ROUTE_CACHE TRUE
  #define MAX_RTG_SRC_ENTRIES 40
  #define SRC_RTG_EXPIRY_TIME 2
  #define MAX_RTG_ENTRIES 4
  #define ROUTE_EXPIRY_TIME 2

@allofmex thanks for the investigation, I'm wondering if such an issue has been addressed in Z-Stack 3 (see #1445)

@allofmex based on your comment, I would propose to disable source routing.

Reasons:

  • Like you mentioned, it gives problems, I think this is also related: https://github.com/Koenkk/zigbee2mqtt/issues/775#issuecomment-486215623
  • In order to get the current firmware with source routing stable, we need to decrease the NWK_MAX_DEVICE_LIST to a very low value (5), disabling source routing allows for the NWK_MAX_DEVICE_LIST to be increased.
  • Given this, we could provide a single Zigbee 1.2 firmware, which clears up a lot of confusion.
  • Users with big networks (where big seems to be 40 according to the XBEE docs) could use the CC2652R (#1429), which has enough memory for source routing.

EDIT: Perhaps we should also try to increase MAX_RTG_ENTRIES to compensate for the disabled source routing.

What do you think?

@Koenkk

I would propose to disable source routing

Do you know how to completely disable it? MAX_RTG_SRC_ENTRIES 0 does not work (some invalid array size crash)

increase MAX_RTG_ENTRIES to compensate for the disabled source routing

I thought to try this too, since I experience (sometimes) slow response of bulbs with my test firmware.

CC2652R

There is no CC253x-comparable hardware with this chip available yet, right? I mean a USB-stick-like version, not a developer board.

I've published the new firmwares: https://github.com/Koenkk/Z-Stack-firmware/tree/dev/coordinator/Z-Stack_Home_1.2/bin, could you test if that fixes this problem?

The CC2652R is only available as a usb key now.

Is it working fine with the latest dev firmware?

Seems to work. CC2531_20190425 has been running for a few days without major issues.

Hopefully Stack 3 will work with SourceRouting...

Seems to work. CC2531_20190425 has been running for a few days without major issues.

With this firmware my environment is now stable. Thanks!

The CC2652R is only available as a usb key now.

@Koenkk you meant "_is NOT available as usb..._" right?
I can find only this red TI developer board.

@allofmex typo indeed, so far it's only available as the board.

@allofmex I agree with you.

@Koenkk Could you provide a patch for your firmware modification here?

Looking into a similar issue on our 3.0 fork.

Righto @Koenkk, never mind, I found your patch.

This is what I ended up with. Currently it's untested (I'll need to get out the JTAG in the office this week) but in theory should behave in a much better way. src route tables will be cleared if a MTO network status is received (hooked in ZDApp.c). Please note we have previously patched ZDApp.c for other functionality, you may need to tweak or manually work the patch to get it to apply.

3ea6c68c8516f70325f1779981f6e3eeb9d18027.diff.txt

If someone with a JTAG (or CC-Debugger on applicable devices) could get me the binary value of the srcRoute table when empty I could verify the osal_memset portion. I've made an educated guess that it's 0 initialized.

Some debugging occurred yesterday and testing today. This is our subsequent patch.

diff --git a/Components/stack/zdo/ZDApp.c b/Components/stack/zdo/ZDApp.c
index 20082d4..254b30e 100644
--- a/Components/stack/zdo/ZDApp.c
+++ b/Components/stack/zdo/ZDApp.c
@@ -3342,12 +3342,20 @@ void ZDO_NetworkStatusCB( uint16 nwkDstAddr, uint8 statusCode, uint16 dstAddr )
 {
   (void)dstAddr;     // Remove this line if this parameter is used.

-  if (nwkDstAddr == NLME_GetShortAddr()){
+  if (nwkDstAddr == NLME_GetShortAddr() || NLME_IsAddressBroadcast(nwkDstAddr) == ADDR_BCAST_FOR_ME){
     MT_ZdoNetworkStatus(statusCode, dstAddr);
-    if ( statusCode == NWKSTAT_NONTREE_LINK_FAILURE )
+
+    if ( statusCode == NWKSTAT_MANY_TO_ONE_ROUTE_FAILURE )
     {
-      // Routing error for dstAddr, this is informational and a Route
-      // Request should happen automatically.
+      // MH: Need to confirm this does what we need it to, how is the entry marked as NULL?
+      //     I am making a guess here that its the fact that one or more of the fields are NULL.
+      for(unsigned int i = 0; i < MAX_RTG_SRC_ENTRIES; i++){
+        if(rtgSrcTable[i].relayList){
+          osal_mem_free(rtgSrcTable[i].relayList);
+        }
+      }
+      osal_memset(&rtgSrcTable, 0, MAX_RTG_SRC_ENTRIES * sizeof(rtgSrcEntry_t));
+      // It may be better to use RTG_nextHopIsBad ?
     }
   }
 }
diff --git a/Projects/zstack/Tools/CC2538DB/f8wConfig.cfg b/Projects/zstack/Tools/CC2538DB/f8wConfig.cfg
index 236690f..aa1eeaa 100644
--- a/Projects/zstack/Tools/CC2538DB/f8wConfig.cfg
+++ b/Projects/zstack/Tools/CC2538DB/f8wConfig.cfg
@@ -89,9 +89,6 @@
  */
 -DLINK_STATUS_JITTER_MASK=0x007F

-/* in seconds; set to 0 to turn off route expiry */
--DROUTE_EXPIRY_TIME=30
-
 /* This number is used by polled devices, since the spec'd formula
  * doesn't work for sleeping end devices.  For non-polled devices,
  * a formula is used. Value is in 2 milliseconds periods
@@ -126,11 +123,6 @@
 /* The maximum number of groups in the groups table */
 -DAPS_MAX_GROUPS=16

-/* Number of entries in the regular routing table plus additional
- * entries for route repair
- */
--DMAX_RTG_ENTRIES=40
-
 /* Maximum number of entries in the Binding table. */
 -DNWK_MAX_BINDING_ENTRIES=4

@@ -197,4 +189,54 @@
 -DCONCENTRATOR_ENABLE

 /* Versioning based off GIT commit hash */
--DINCLUDE_REVISION_INFORMATION
\ No newline at end of file
+-DINCLUDE_REVISION_INFORMATION
+
+/****************************************
+ * Routing Control
+ ***************************************/
+
+/*
+* NOTE:
+* Source routing is broken in Z-Stack. Related Issue and interesting read: https://github.com/Koenkk/zigbee2mqtt/issues/1408
+* When a route breaks, it is marked as expired. A route effectively lasts forever. Also MTO errors should clear the route cache.
+* - MH
+*/
+
+/* Number of entries in the regular routing table plus additional
+ * entries for route repair
+ *
+ * This is standard table, created by ZC broadcasting to all end devices, every router forwards any broadcast to all its siblings, 
+ * the shortest route to devices gets selected. This Causes a lot of network traffic.
+ */
+-DMAX_RTG_ENTRIES=64
+
+/* This is a second table (source routing), when the enddevice sends a route record message. 
+ * While this is traveling through the mesh, all used router add there address to the message. 
+ * When it arrives at coordinator, the message contains a valid route.
+ * -> More efficient, only one message needed to get a route.
+ *
+ * Unreliable in standard Z-Stack, and a must for larger networks to prevent overload of RREQ messages.
+ * This bug has been patched, however the table is reduced as this data has a tendency to get stale, and for our purposes increasing RTG entries is sufficient.
+ */
+-DMAX_RTG_SRC_ENTRIES=4
+
+/* Number of missed Link status messages before considering the route to a specific ZR dead.
+ * Default (3) decreased for a more responsive network in case of ZR failure (2x15=30s failure time by default). 
+ */
+-DNWK_ROUTE_AGE_LIMIT=2
+
+/* Number of seconds before a route is considered expired within its respective table (standard or source).
+ * This does not remove or invalidate the respective route, instead the slot is available for re-use if MAX_*_ENTRIES is reached.
+ */
+-DSRC_RTG_EXPIRY_TIME=2
+-DROUTE_EXPIRY_TIME=90
+
+/* Number of devices this ZC knows are within radio range and can transmit to directly without performing route discovery. 
+ * Default 16 increased for easier Zigbee routing for all devices within range of ZC
+ */ 
+-DMAX_NEIGHBOR_ENTRIES=32
+
+/* The number of seconds a MTO routing entry will last. Default to not expiring.
+ * Not sure about anything being indefinite, so I set a high limit (10 minutes) - MH
+ */
+-DMTO_ROUTE_EXPIRY_TIME=600
\ No newline at end of file
diff --git a/Projects/zstack/ZNP/Source/znp.cfg b/Projects/zstack/ZNP/Source/znp.cfg
index aec2c79..1c8b49b 100644
--- a/Projects/zstack/ZNP/Source/znp.cfg
+++ b/Projects/zstack/ZNP/Source/znp.cfg
@@ -77,7 +77,6 @@
 //-DSRC_RTG_EXPIRY_TIME=255
 //-DCONCENTRATOR_ENABLE=TRUE
 //-DCONCENTRATOR_DISCOVERY_TIME=60
--DMAX_RTG_SRC_ENTRIES=50

 // Define this flag to enable ZNP implementation of the ZCL_KEY_ESTABLISHMENT_ENDPOINT and task.
 //-DTC_LINKKEY_JOIN

If anyone has any interesting results please do mention me :)

@splitice
Thank you very much for the patch. Didn't have time to test yet :-(

for(unsigned int i = 0; i < MAX_RTG_SRC_ENTRIES; i++){
    if(rtgSrcTable[i].relayList){
        osal_mem_free(rtgSrcTable[i].relayList);
    }
}

You are trying to clear the whole source-routing table, right?
Is there a way to clear only the entry for the failed device (nwkDstAddr)?

@allofmex Possibly. It's not something we tested.

@splitice Can you share all the modifications made for the CC2538? I am trying to build an optimized ZNP 3.0.2 firmware.
I could not find any function like MT_ZdoNetworkStatus; did you write your own?

@dzungpv Everything that is suitable for release has been released. I am working for a commercial client at the end of the day.

MT_ZdoNetworkStatus is not required for this patch, it's part of a different feature.
