Esp-idf: Low Mesh network performances (IDFGH-1158)

Created on 14 May 2019  ·  38Comments  ·  Source: espressif/esp-idf

Problem Description

Hi everybody,
We are currently making an estimation of the ESP MESH network to evaluate if the technology could be interesting for our company.
Starting from the "internal_communication" example, i build 2 FW:

ROOT:

  • This FW connects to the given Router and acts as fixed root.
  • Every 5 seconds dumps a map of the current network (each node with current layer and parent)
  • Periodically sends a packet to each device.
  • Listen for any packet answer and compute RTT of the given packet. On packet reception updates the from-station layer and parent in the network map.

CLIENT:

  • Answers to any received Unicast packet with the same SSC (source sequence counter), including current mesh layer and parent node.

HW:

  • We use 10 devC kits

PROBLEM:
Although layering is fixed to a maximum of 6 and the network seems to build up to 4,
performances are really bad (RTT of 20s!) and many packets get simply lost (after ~30sec the packet is considered lost)

On the developer site
https://docs.espressif.com/projects/esp-idf/en/latest/api-guides/mesh.html#performance
there is written that with 100 devices, hopping would result in a 30ms delay.

Since we have no control over the Mesh library, cause it's closed sourced, we would appreciate any explanation! :)

Steps to reproduce

I created a repository with the code used.
My Wifi connects to channel 10.
The code is already buildable.
https://github.com/KonssnoK/esp_mesh_performance

Flash 1 device with the mesh_server, 9 devices with the mesh_client.
Place them in 4 different rooms and switch on, this should be enough.

Logs for 8 devices (after trying to make things be better by moving devices)

W (13965) mesh_root: Layer 0 means unknown! ME(3305)
W (13965) mesh_root: layer 0 (0<-3304)
W (13965) mesh_root: layer 2 (3305<-3120) (3305<-387C) (3305<-45CC) (3305<-2ECC)
W (13975) mesh_root: layer 3 (387D<-4620)
W (13975) mesh_root: layer 4 (40C1<-38EC) (4621<-40C0) (40C1<-38FC)
I (14005) mesh_root: 2ECC: RTT 23. Elapsed 41ms
I (16105) mesh_root: 38FC: RTT 21. Elapsed 2626ms
I (16105) mesh_root: 387C: RTT 25. Elapsed 11ms
I (17015) mesh_root: 38EC: RTT 3. Elapsed 12798ms
I (17045) mesh_root: 38EC: RTT 10. Elapsed 7234ms
I (17175) mesh_root: 3120: RTT 24. Elapsed 1108ms
I (17185) mesh_root: 38EC: RTT 18. Elapsed 4459ms
I (17235) mesh_root: 38EC: RTT 26. Elapsed 219ms
I (18375) mesh_root: 38EC: RTT 26. Elapsed 1362ms
I (18495) mesh_root: 38EC: RTT 26. Elapsed 1485ms
I (18735) mesh_root: 45CC: RTT 30. Elapsed 37ms
I (18835) mesh_root: 40C0: RTT 27. Elapsed 1674ms
W (19585) mesh_root: Layer 0 means unknown! ME(3305)
W (19585) mesh_root: layer 0 (0<-3304)
W (19585) mesh_root: layer 2 (3305<-3120) (3305<-387C) (3305<-45CC) (3305<-2ECC)
W (19585) mesh_root: layer 3 (387D<-4620)
W (19595) mesh_root: layer 4 (40C1<-38EC) (4621<-40C0) (40C1<-38FC)
I (19655) mesh_root: 2ECC: RTT 31. Elapsed 71ms
I (21695) mesh_root: 3120: RTT 32. Elapsed 28ms
I (21705) mesh_root: 387C: RTT 33. Elapsed 17ms
I (23765) mesh_root: 4620: RTT 28. Elapsed 5445ms
I (23765) mesh_root: 38FC: RTT 29. Elapsed 5187ms
I (24145) mesh_root: 45CC: RTT 38. Elapsed 58ms
I (24155) mesh_root: 2ECC: RTT 39. Elapsed 10ms
I (24265) mesh_root: 40C0: RTT 35. Elapsed 730ms
I (24265) mesh_root: 4620: RTT 36. Elapsed 380ms
I (24285) mesh_root: 40C0: RTT 35. Elapsed 746ms
I (24305) mesh_root: 38FC: RTT 37. Elapsed 228ms
I (26205) mesh_root: 3120: RTT 40. Elapsed 64ms
I (26875) mesh_root: 387C: RTT 41. Elapsed 715ms
I (26955) mesh_root: 4620: RTT 44. Elapsed 177ms
W (28135) mesh_root: Layer 0 means unknown! ME(3305)
W (28135) mesh_root: layer 0 (0<-3304)
W (28135) mesh_root: layer 2 (3305<-3120) (3305<-387C) (3305<-45CC) (3305<-2ECC)
W (28145) mesh_root: layer 3 (387D<-4620)
W (28145) mesh_root: layer 4 (40C1<-38EC) (4621<-40C0) (40C1<-38FC)
I (28195) mesh_root: 45CC: RTT 46. Elapsed 79ms
I (28355) mesh_root: 2ECC: RTT 47. Elapsed 225ms
I (30175) mesh_root: 3120: RTT 48. Elapsed 23ms
I (31375) mesh_root: 40C0: RTT 43. Elapsed 4692ms
I (31475) mesh_root: 387C: RTT 49. Elapsed 510ms
I (31705) mesh_root: 45CC: RTT 54. Elapsed 98ms
I (32675) mesh_root: 2ECC: RTT 55. Elapsed 105ms
I (32905) mesh_root: 40C0: RTT 43. Elapsed 6224ms
I (33115) mesh_root: 38EC: RTT 34. Elapsed 11329ms
I (34905) mesh_root: 3120: RTT 56. Elapsed 309ms
I (34905) mesh_root: 387C: RTT 57. Elapsed 268ms
I (34945) mesh_root: 38FC: RTT 45. Elapsed 8058ms
W (35235) mesh_root: Layer 0 means unknown! ME(3305)
W (35235) mesh_root: layer 0 (0<-3304)
W (35235) mesh_root: layer 2 (3305<-3120) (3305<-387C) (3305<-45CC) (3305<-2ECC)
W (35245) mesh_root: layer 3 (387D<-4620)
W (35245) mesh_root: layer 4 (40C1<-38EC) (4621<-40C0) (40C1<-38FC)
I (35395) mesh_root: 2ECC: RTT 63. Elapsed 166ms
I (35465) mesh_root: 4620: RTT 52. Elapsed 4203ms
I (35465) mesh_root: 45CC: RTT 62. Elapsed 251ms
I (35475) mesh_root: 40C0: RTT 51. Elapsed 4365ms
I (35825) mesh_root: 4620: RTT 60. Elapsed 672ms
I (37495) mesh_root: 3120: RTT 64. Elapsed 226ms
I (37725) mesh_root: 387C: RTT 65. Elapsed 33ms
I (38815) mesh_root: 45CC: RTT 70. Elapsed 153ms
I (38825) mesh_root: 2ECC: RTT 71. Elapsed 105ms
I (38965) mesh_root: 4620: RTT 68. Elapsed 388ms
I (39065) mesh_root: 40C0: RTT 59. Elapsed 4161ms
I (39185) mesh_root: 38FC: RTT 53. Elapsed 7616ms
I (39195) mesh_root: 40C0: RTT 67. Elapsed 824ms
I (39205) mesh_root: 38FC: RTT 61. Elapsed 4029ms
I (39205) mesh_root: 38FC: RTT 69. Elapsed 611ms
I (41015) mesh_root: 387C: RTT 73. Elapsed 257ms
I (41075) mesh_root: 3120: RTT 72. Elapsed 322ms
W (41425) mesh_root: Layer 0 means unknown! ME(3305)
W (41425) mesh_root: layer 0 (0<-3304)
W (41435) mesh_root: layer 2 (3305<-3120) (3305<-387C) (3305<-45CC) (3305<-2ECC)
I (41435) mesh_root: 2ECC: RTT 79. Elapsed 7ms
W (41435) mesh_root: layer 3 (387D<-4620)
W (41445) mesh_root: layer 4 (40C1<-38EC) (4621<-40C0) (40C1<-38FC)

All 38 comments

I did some testing with your example, and I think this is primarily caused by AMPDU.

You can disable AMPDU in menuconfig: make menuconfig -> Component config -> Wi-Fi -> WiFi AMPDU TX & WiFi AMPDU RX

LOG (disable AMPDU, 8 devices):

I (425847) mesh_root: 73A0: RTT 1344. Elapsed 27ms
I (427837) mesh_root: 1044: RTT 1345. Elapsed 8ms
I (427837) mesh_root: 620: RTT 1346. Elapsed 3ms
I (427847) mesh_root: 1AF8: RTT 1347. Elapsed 4ms
I (427847) mesh_root: ECC4: RTT 1348. Elapsed 6ms
I (427867) mesh_root: 73A0: RTT 1351. Elapsed 11ms
I (427877) mesh_root: 60F8: RTT 1349. Elapsed 30ms
I (427887) mesh_root: 1160: RTT 1350. Elapsed 25ms
I (429867) mesh_root: 1044: RTT 1352. Elapsed 6ms
I (429867) mesh_root: 1AF8: RTT 1354. Elapsed 8ms
I (429867) mesh_root: 620: RTT 1353. Elapsed 11ms
I (429877) mesh_root: ECC4: RTT 1355. Elapsed 10ms
I (429887) mesh_root: 60F8: RTT 1356. Elapsed 13ms
I (429887) mesh_root: 73A0: RTT 1358. Elapsed 6ms
I (429887) mesh_root: 1160: RTT 1357. Elapsed 13ms
I (431877) mesh_root: 1044: RTT 1359. Elapsed 6ms
I (431887) mesh_root: 620: RTT 1360. Elapsed 15ms
I (431897) mesh_root: ECC4: RTT 1362. Elapsed 4ms
I (431897) mesh_root: 1AF8: RTT 1361. Elapsed 8ms
W (431897) mesh_root: Layer 0 means unknown! ME(6849)
W (431907) mesh_root: layer 0 (0<-6848) 
I (431907) mesh_root: 60F8: RTT 1363. Elapsed 12ms
W (431917) mesh_root: layer 2 (6849<-1044) (6849<-620) (6849<-1AF8) (6849<-ECC4) (6849<-60F8) (6849<-73A0) 
I (431927) mesh_root: 73A0: RTT 1365. Elapsed 21ms
W (431927) mesh_root: layer 3 (60F9<-1160) 
I (431937) mesh_root: 1160: RTT 1364. Elapsed 32ms
I (433947) mesh_root: 1044: RTT 1366. Elapsed 6ms
I (433957) mesh_root: 1AF8: RTT 1368. Elapsed 12ms
I (433967) mesh_root: ECC4: RTT 1369. Elapsed 12ms
I (433967) mesh_root: 620: RTT 1367. Elapsed 21ms
I (433977) mesh_root: 73A0: RTT 1372. Elapsed 5ms
I (434137) mesh_root: 60F8: RTT 1370. Elapsed 181ms
I (434147) mesh_root: 1160: RTT 1371. Elapsed 182ms
I (436017) mesh_root: 1044: RTT 1373. Elapsed 41ms
I (436017) mesh_root: 620: RTT 1374. Elapsed 10ms
I (436017) mesh_root: 1AF8: RTT 1375. Elapsed 5ms
I (436027) mesh_root: ECC4: RTT 1376. Elapsed 7ms
I (436087) mesh_root: 73A0: RTT 1379. Elapsed 24ms
I (436097) mesh_root: 60F8: RTT 1377. Elapsed 71ms
I (436137) mesh_root: 1160: RTT 1378. Elapsed 103ms
I (438077) mesh_root: 1044: RTT 1380. Elapsed 11ms
I (438077) mesh_root: 620: RTT 1381. Elapsed 9ms
I (438077) mesh_root: ECC4: RTT 1383. Elapsed 8ms
I (438077) mesh_root: 1AF8: RTT 1382. Elapsed 12ms
I (438087) mesh_root: 60F8: RTT 1384. Elapsed 11ms
W (438087) mesh_root: Layer 0 means unknown! ME(6849)
I (438097) mesh_root: 1160: RTT 1385. Elapsed 15ms

Hi there,
thanks for the fast response.
I tried removing the option (git repo updated) and did (make clean && make flash + make flash for all 10 devices)

This is the configuration of the network (Still using the simplified one, not the complete one)
MeshWifi_8device

And below there is the log.
I see 2 problems:

  • Not all devices are seen by the network (even if the handler shows 10 devices at start) This could be for low transmitting power, i guess can be delayed for analysis.
  • RTT times are still very high (up to 5seconds in early tests, to check if they degrade)

W (253346) mesh_root: Layer 0 means unknown! ME(3305)
W (253346) mesh_root: layer 0 (0<-3304)
W (253346) mesh_root: layer 2 (3305<-2ECC) (3305<-387C)
W (253356) mesh_root: layer 3 (387D<-4620) (2ECD<-3120)
W (253356) mesh_root: layer 4 (40C1<-3DBC) (4621<-40C0)
I (253386) mesh_root: 40C0: RTT 894. Elapsed 78ms
I (253386) mesh_root: 4620: RTT 895. Elapsed 63ms
I (253476) mesh_root: 3120: RTT 899. Elapsed 134ms
I (253666) mesh_root: 45CC: RTT 896. Elapsed 340ms
I (253666) mesh_root: 38FC: RTT 897. Elapsed 335ms
I (255436) mesh_root: 387C: RTT 900. Elapsed 34ms
I (255456) mesh_root: 2ECC: RTT 907. Elapsed 5ms
I (255556) mesh_root: 3120: RTT 908. Elapsed 104ms
I (257536) mesh_root: 387C: RTT 909. Elapsed 50ms
I (257556) mesh_root: 2ECC: RTT 916. Elapsed 4ms
I (257606) mesh_root: 3120: RTT 917. Elapsed 56ms
I (258676) mesh_root: 38EC: RTT 901. Elapsed 3270ms
I (258686) mesh_root: 4620: RTT 904. Elapsed 3248ms
I (258686) mesh_root: 3DBC: RTT 902. Elapsed 3277ms
I (259006) mesh_root: 4620: RTT 913. Elapsed 1508ms
I (259016) mesh_root: 40C0: RTT 903. Elapsed 3577ms
I (259566) mesh_root: 387C: RTT 918. Elapsed 7ms
W (259926) mesh_root: Layer 0 means unknown! ME(3305)
W (259926) mesh_root: layer 0 (0<-3304)
W (259926) mesh_root: layer 2 (3305<-2ECC) (3305<-387C)
W (259936) mesh_root: layer 3 (387D<-4620) (2ECD<-3120)
W (259936) mesh_root: layer 4 (40C1<-3DBC) (4621<-40C0)
I (260006) mesh_root: 3120: RTT 926. Elapsed 79ms
I (260026) mesh_root: 45CC: RTT 905. Elapsed 4587ms
I (260036) mesh_root: 4620: RTT 922. Elapsed 277ms
I (260326) mesh_root: 40C0: RTT 912. Elapsed 2835ms
I (260326) mesh_root: 38FC: RTT 906. Elapsed 4885ms
I (260386) mesh_root: 40C0: RTT 921. Elapsed 689ms
I (260386) mesh_root: 45CC: RTT 905. Elapsed 4941ms
I (260396) mesh_root: 3DBC: RTT 911. Elapsed 2904ms
I (262016) mesh_root: 387C: RTT 927. Elapsed 45ms
I (262066) mesh_root: 2ECC: RTT 934. Elapsed 18ms
I (262106) mesh_root: 3120: RTT 935. Elapsed 62ms
I (262486) mesh_root: 4620: RTT 931. Elapsed 455ms
I (262656) mesh_root: 40C0: RTT 930. Elapsed 658ms
I (264246) mesh_root: 387C: RTT 936. Elapsed 194ms
I (264496) mesh_root: 2ECC: RTT 943. Elapsed 21ms
I (264606) mesh_root: 3120: RTT 944. Elapsed 108ms
I (265006) mesh_root: 4620: RTT 940. Elapsed 875ms
I (265266) mesh_root: 40C0: RTT 939. Elapsed 1210ms
I (266086) mesh_root: 38EC: RTT 910. Elapsed 8602ms
I (266696) mesh_root: 2ECC: RTT 952. Elapsed 6ms
I (266706) mesh_root: 387C: RTT 945. Elapsed 112ms

As foreseen also in the previous test, conditions degrade over time, although it seems to be able to retrieve a normal RTT timing randomly... Really strange behavior ..

W (752827) mesh_root: Layer 0 means unknown! ME(3305)
W (752827) mesh_root: layer 0 (0<-3304)
W (752827) mesh_root: layer 2 (3305<-2ECC) (3305<-3120) (3305<-387C)
W (752827) mesh_root: layer 3 (387D<-4620) (387D<-4620) (387D<-40C0)
W (752837) mesh_root: layer 5 (3DBD<-38FC) (40C1<-3DBC) (3DBD<-45CC) (3DBD<-38EC)
W (752867) wifi: [beacon]interval:1000ms
I (753467) mesh_root: 4620: RTT 2816. Elapsed 683ms
I (754347) mesh_root: 40C0: RTT 2810. Elapsed 3651ms
I (754367) mesh_root: 38FC: RTT 2799. Elapsed 5697ms
I (754367) mesh_root: 3DBC: RTT 2809. Elapsed 3671ms
I (754387) mesh_root: 40C0: RTT 2819. Elapsed 1587ms
I (754857) mesh_root: 2ECC: RTT 2822. Elapsed 17ms
I (754877) mesh_root: 3120: RTT 2823. Elapsed 18ms
I (754877) mesh_root: 387C: RTT 2824. Elapsed 19ms
I (755247) mesh_root: 4620: RTT 2825. Elapsed 381ms
I (755947) mesh_root: 40C0: RTT 2828. Elapsed 1048ms
I (757017) mesh_root: 2ECC: RTT 2831. Elapsed 5ms
I (757017) mesh_root: 3120: RTT 2832. Elapsed 5ms
I (757047) mesh_root: 38FC: RTT 2799. Elapsed 8378ms
I (757117) mesh_root: 387C: RTT 2833. Elapsed 93ms
I (757227) mesh_root: 3DBC: RTT 2818. Elapsed 4433ms
I (757287) mesh_root: 45CC: RTT 2802. Elapsed 8599ms
I (757357) mesh_root: 4620: RTT 2834. Elapsed 322ms
I (757357) mesh_root: 3DBC: RTT 2827. Elapsed 2484ms
I (757427) mesh_root: 40C0: RTT 2837. Elapsed 389ms
I (757437) mesh_root: 38EC: RTT 2803. Elapsed 8742ms
I (757457) mesh_root: 3DBC: RTT 2836. Elapsed 420ms
I (757457) mesh_root: 38FC: RTT 2808. Elapsed 6767ms
I (757467) mesh_root: 45CC: RTT 2811. Elapsed 6761ms
I (757467) mesh_root: 38EC: RTT 2812. Elapsed 6705ms
I (757477) mesh_root: 38FC: RTT 2817. Elapsed 4684ms
I (757497) mesh_root: 38EC: RTT 2821. Elapsed 4673ms
I (757507) mesh_root: 38EC: RTT 2830. Elapsed 2482ms
I (757527) mesh_root: 38EC: RTT 2839. Elapsed 473ms
I (757897) mesh_root: 45CC: RTT 2811. Elapsed 7195ms
I (757897) mesh_root: 45CC: RTT 2820. Elapsed 5101ms
I (757927) mesh_root: 45CC: RTT 2829. Elapsed 2909ms
I (757937) mesh_root: 45CC: RTT 2838. Elapsed 889ms
I (758067) mesh_root: 38FC: RTT 2826. Elapsed 3194ms
I (758077) mesh_root: 38FC: RTT 2835. Elapsed 1041ms
I (759067) mesh_root: 2ECC: RTT 2840. Elapsed 9ms
I (759077) mesh_root: 3120: RTT 2841. Elapsed 14ms
W (759087) mesh_root: Layer 0 means unknown! ME(3305)
I (759087) mesh_root: 387C: RTT 2842. Elapsed 23ms
W (759097) mesh_root: layer 0 (0<-3304)
W (759097) mesh_root: layer 2 (3305<-2ECC) (3305<-3120) (3305<-387C)
W (759107) mesh_root: layer 3 (387D<-4620) (387D<-4620) (387D<-40C0)
W (759107) mesh_root: layer 5 (3DBD<-38FC) (40C1<-3DBC) (3DBD<-45CC) (3DBD<-38EC)
I (759437) mesh_root: 40C0: RTT 2846. Elapsed 351ms
I (759507) mesh_root: 4620: RTT 2843. Elapsed 430ms
I (761117) mesh_root: 2ECC: RTT 2849. Elapsed 5ms
I (761127) mesh_root: 387C: RTT 2851. Elapsed 5ms
I (761127) mesh_root: 3120: RTT 2850. Elapsed 8ms
I (761387) mesh_root: 4620: RTT 2852. Elapsed 254ms
I (761447) mesh_root: 40C0: RTT 2855. Elapsed 169ms
I (762227) mesh_root: 38FC: RTT 2844. Elapsed 3150ms
I (762267) mesh_root: 3DBC: RTT 2845. Elapsed 3182ms
I (762267) mesh_root: 45CC: RTT 2847. Elapsed 3179ms
I (762277) mesh_root: 3DBC: RTT 2854. Elapsed 1130ms
I (762297) mesh_root: 38EC: RTT 2848. Elapsed 3202ms
I (762337) mesh_root: 38FC: RTT 2853. Elapsed 1210ms
I (762357) mesh_root: 45CC: RTT 2856. Elapsed 1068ms
I (762367) mesh_root: 38EC: RTT 2857. Elapsed 987ms
I (763397) mesh_root: 2ECC: RTT 2858. Elapsed 10ms
I (763397) mesh_root: 3120: RTT 2859. Elapsed 10ms
I (763477) mesh_root: 387C: RTT 2860. Elapsed 86ms
I (763497) mesh_root: 4620: RTT 2861. Elapsed 100ms
I (763877) mesh_root: 3DBC: RTT 2863. Elapsed 402ms
I (763877) mesh_root: 38FC: RTT 2862. Elapsed 480ms
I (763897) mesh_root: 40C0: RTT 2864. Elapsed 402ms
I (763917) mesh_root: 45CC: RTT 2865. Elapsed 426ms

Sorry, It's difficult for me to build such an environment.

I think one possibility is the poor signal between devices. Maybe you could do a similar test between two of the devices using UDP rather than ESP-MESH.

What would be the purpose of that? We target up to 20 devices in the same environment, can't use UDP.

And how would low signal influence the RTT of close devices (such as the ones in the next room)?
A low signal could trigger retries on the line but this should impact only leaf hopping nodes (the RTT of the poor connected being high and rest being normal)..

I think he means to establish a baseline of rtt based on environment and independent of mesh protocol.

W (752827) mesh_root: layer 3 (387D<-4620) (387D<-4620) (387D<-40C0)
W (752837) mesh_root: layer 5 (3DBD<-38FC) (40C1<-3DBC) (3DBD<-45CC) (3DBD<-38EC)

How can 3dbc be connected to a layer 3 node and be in layer 5 with other layer 5 nodes connected to it?

Yes, I mean, do a communication test between two nodes without starting mesh. So that we can know the signal influence between every two nodes. It's just a test.

ESP-MESH is a Cluster-Tree network, and messages are forwarded through intermediate nodes. So the low signal could impact all the nodes that need to pass this path.

For example:

  • If now we get a network, A is connected to B, B is connected to C, C is connected to D...
  • if the signal between C and D is bad.
  • The RTT between C and D could be affected.
  • The RTT of A-D or B-D is affected also.

I think he means to establish a baseline of rtt based on environment and independent of mesh protocol.

W (752827) mesh_root: layer 3 (387D<-4620) (387D<-4620) (387D<-40C0)
W (752837) mesh_root: layer 5 (3DBD<-38FC) (40C1<-3DBC) (3DBD<-45CC) (3DBD<-38EC)

How can 3dbc be connected to a layer 3 node and be in layer 5 with other layer 5 nodes connected to it?

That is a very nice question, no idea.
Layer is retrieved directly by each node, who sends its own "layer level", retrieved by MESH events (connected, disconnected, layer_change etc).
So, either the node changed layer and is not any longer reachable (you don't see the real layer cause you don't get any packet from that node) or the "get_layer" function has something strange in it.

Yes, I mean, do a communication test between two nodes without starting mesh. So that we can know the signal influence between every two nodes. It's just a test.

ESP-MESH is a Cluster-Tree network, and messages are forwarded through intermediate nodes. So the low signal could impact all the nodes that need to pass this path.

For example:

* If now we get a network, A is connected to B, B is connected to C, C is connected to D...

* if the signal between C and D is bad.

* The RTT between C and D could be affected.

* The RTT of  A-D or B-D is affected also.

Sure, the problem is here we are having high RTT between A and B :) !
From the tree you can see that
layer 2 if made by 3 devices: 2ECC 3120 and 387C.

Although they seem to work better than the rest

I (761117) mesh_root: 2ECC: RTT 2849. Elapsed 5ms
I (761127) mesh_root: 387C: RTT 2851. Elapsed 5ms
I (761127) mesh_root: 3120: RTT 2850. Elapsed 8ms

you can spot strange delays which should be independent from other nodes

I (763477) mesh_root: 387C: RTT 2860. Elapsed 86ms

So, by looking at layer 3 we can see that all the children are connected to this delayed node (4620 and 40C0).
Physical distance between 387C and 4620 is about 2m in the same room.

So 4620 has to hop through 387C.

I (753467) mesh_root: 4620: RTT 2816. Elapsed 683ms

Therefore 4620 is impacting 387C ability to communicate (maybe retries).
Please note that 4620 is a leaf node! So there is (theoretically) no other impact on the hops a packet would do when coming from that node.

TOP

Here is the topology of your network in your previous test. And I have marked all the nodes with bad RTT. All nodes except 4620 have to hop through 387C and 40C0. Also the result seems to fit this topology.

4620 is closer to 387C than 40C0, also it has better RTT.
Others has bad RTT because they need to hop through 40C0

I have a mesh network with the same issue as KonssnoK
Some messages have a bad RTT, and a few seconds later they are fine.

Also sending an ota (using MDF-upgrade), to a single device takes a lot of time (more than 20 minutes!!). Updated device is on layer2, connection to root has an RSSI of -61.

TOP

Here is the topology of your network in your previous test. And I have marked all the nodes with bad RTT. All nodes except 4620 have to hop through 387C and 40C0. Also the result seems to fit this topology.

4620 is closer to 387C than 40C0, also it has better RTT.
Others has bad RTT because they need to hop through 40C0

Sorry i don't get the point of this reply.. How is the high RTT of 4620 justified? (if that was the point) :)

4620 doesn't have a very bad RTT, it's worse than the three in layer2 but better than the other's.

I (753467) mesh_root: 4620: RTT 2816. Elapsed 683ms
I (755247) mesh_root: 4620: RTT 2825. Elapsed 381ms
I (757357) mesh_root: 4620: RTT 2834. Elapsed 322ms
I (761387) mesh_root: 4620: RTT 2852. Elapsed 254ms
I (763497) mesh_root: 4620: RTT 2861. Elapsed 100ms

The specs...

Function | Description
-- | --
Networking Building Time | < 60 seconds
Healing time | Root Node Break Down: < 10 secondsChild Node Break Down: < 5 seconds
Per-hop latency | 10 to 30 milliseconds

683ms is bad

Is this related to the esp-now issue of 1mbps phy rate and ampdu?

4620 doesn't have a very bad RTT, it's worse than the three in layer2 but better than the other's.

I (753467) mesh_root: 4620: RTT 2816. Elapsed 683ms
I (755247) mesh_root: 4620: RTT 2825. Elapsed 381ms
I (757357) mesh_root: 4620: RTT 2834. Elapsed 322ms
I (761387) mesh_root: 4620: RTT 2852. Elapsed 254ms
I (763497) mesh_root: 4620: RTT 2861. Elapsed 100ms

Actually in this quote you see exactly why the RTT is bad.. 4620 should have a 2hops RTT -> ~60ms + margin.
We are one order of magnitute out of specs :D

EDIT:
actually 4020 should be considered 1 hop.. therefore -> ~30ms + margin

Ah, the result in document was measured by several devices put together. The RTT was not only affected by ESP-MESH when the devices far away.

By the way, I think actually 4620 should be considered 4 hop: ROOT->387C, 387C->4620, 4620 ->387C, 387C -> ROOT.

mmm i guess (but i cannot know) that a hop is considered getting data from the AP and repeating from the STA or viceversa...
Therefore you get ROOT -> Node1 -> 4620 = 1 hop
Since this is a RTT, you need to do x2 -> 2hops (i was right the first time :P )
Hop is the handling in a temporary device, not the link between 2 devices :)

@dongdong004
How is it possible that a layer2 device has delays of minimal 5 seconds by sending a single packet?
most of the time messages are fast...

Distance is about 15meter
Outdoors
RSSI around -60
TX-power 20dB
AMPDU is disabled

@KonssnoK
In document 1 hop is from one device to another, so ROOT -> LAYER2 is 1 hop.

@edjeed

How is it possible that a layer2 device has delays of minimal 5 seconds by sending a single packet?
most of the time messages are fast...

Sorry, I don't know. Can you know how long esp_mesh_send took? Is it more than 5 sec? ( from call to return)

Also, you may need to use e.g. Wireshark to find out what exactly is going on.

sry @dongdong004 but this doesn't make any sense.. :)
Can you point me where in the doc it is stated something like that?
The link can be considered transparent in terms of time, since data propagates at lightspeed -> ~0ms

What can delay the execution is the following:

  • Send in the ROOT node -> should be negligible cause of dedicated core and no other generated traffic than the one issued by the device (10 packets every X seconds)
  • Reception in the Layer2 node -> Can be delayed if the Layer2 node is processing a lot of packets
  • Send in the Layer2 node -> Can be delayed if the Layer2 node is processing a lot of packets
  • Reception in the leaf node -> negligible, since the node should not processing any traffic
  • Now go up in the bullets replacing "Reception" with "Send"

You can see that the only stressed part is the bold one, corresponding to the handling of packets of an intermediate node.
One theory could be that the packets coming from the non-reachable/hardly-reachable nodes are flooding the second layer, but the contained data is not meaningful (network state packets i.e.)
This would mean that the handling of the network can heavily impact the network itself, which is actually not nice if it doesn't have a fixed window to let packets go through...

But the time couldn't be zero when we send a packet from ROOT to LAYER2, right?

The meaning of Wireless Hop in document

Data Stream

Performance

Per-hop latency: The latency of data transmission over one wireless hop. In other words, the time taken to transmit a data packet from a parent node to a child node or vice versa.

Thanks for the reference.
The speed of actual transmission is negligible (speed is 3*10^8m/s .. ~ 3us/Km).
What could impact timing is:

  • usage of frames (the device can send once every Xms)
  • HW processing (expected to be less than 1ms)
  • filled queues (either sending or receiving)
  • Other interrupts/tasks delaying SW processing (considering an RTOS)

Sadly i don't know the WiFi standard to say what could apply in this case..
And i cannot investigate on my own because the Wifi module is close sourced.. hence this issue :)

You are right.

But there are other things (about MAC layer) :

  • A device is not allowed to send packet at any time.
  • Not all packets could be sent successfully, there could be failure.

Refer to ieee80211 protocol.

To be useful average performance has to be much closer to stated 30ms/hop. Multiple second rtt is like worst case and should only happen with very bad rf environment or very crowded channel. Need some sniffer captures.

The following table lists the common performance figures of an ESP-MESH network. However users should note that performance numbers can vary greatly between installations based on network configuration and operating environment.

Found this in the documentation...

It seems that 30ms per-hop latency does not fit your environment.

@KonssnoK

The speed of actual transmission is negligible (speed is 3*10^8m/s .. ~ 3us/Km).

It's right that radio waves propagate at ~300 m/μs (speed of light) in free space. But before that nodes must obtain the access of medium first. All nodes listen to the channel before transmission to determine whether some one else is transmitting. Even nodes got the opertunity of transmission, the transmission rate is not the speed of light. The transmission rate can be seen from the wireless captures.

Per-hop latency: 10 to 30 milliseconds

This data is mainly for downstream.
As upstreams are limited by the memory size of received nodes and the number of its Sibling Nodes. The upstream latency variation might be greater. See Upstream Flow Control.

In your case,

CLIENT:
Answers to any received Unicast packet with the same SSC (source sequence counter), including current mesh layer and parent node.

Can you include the RSSI value? The RSSI value can be obtained by the esp_wifi_sta_get_ap_info().

Every 5 seconds dumps a map of the current network (each node with current layer and parent)

Can you also dump the RSSI value for each node?

Periodically sends a packet to each device.

How much is the period time?
How many connections you set?

From the discussions, most likely it's RSSI and intra-mesh congestion. Please include the RSSI and let me know the test results, thanks.

Hello @ESP-MESH-QI :)
here are the logs with RSSI for each station [in brackets].

Also i updated the repository

W (1196679) mesh_root: Layer 0 means unknown! ME(3305)
W (1196679) mesh_root: layer 0 (0<-3304[0])
W (1196679) mesh_root: layer 2 (3305<-3120[-61]) (3305<-2ECC[-67]) (3305<-387C[-78])
W (1196689) mesh_root: layer 3 (387D<-4620[-67])
W (1196699) mesh_root: layer 4 (4621<-40C0[-62])
W (1196699) mesh_root: layer 5 (40C1<-38FC[-70]) (40C1<-3DBC[-77])
I (1196749) mesh_root: 38FC: RTT 54. Elapsed 1120ms
I (1196779) mesh_root: 38FC: RTT 54. Elapsed 1147ms
I (1196789) mesh_root: 45CC: RTT 55. Elapsed 1117ms
I (1197689) mesh_root: 3DBC: RTT 56. Elapsed 1414ms
I (1197709) mesh_root: 3DBC: RTT 56. Elapsed 1431ms
I (1197709) mesh_root: 3DBC: RTT 56. Elapsed 1433ms
I (1197839) mesh_root: 38EC: RTT 57. Elapsed 1157ms
I (1198839) mesh_root: 3120: RTT 58. Elapsed 85ms
I (1198909) mesh_root: 387C: RTT 60. Elapsed 71ms
W (1206119) mesh_root: Layer 0 means unknown! ME(3305)
W (1206119) mesh_root: layer 0 (0<-3304[0])
W (1206119) mesh_root: layer 2 (3305<-3120[-61]) (3305<-2ECC[-68]) (3305<-387C[-80])
W (1206129) mesh_root: layer 3 (387D<-4620[-67])
W (1206129) mesh_root: layer 4 (4621<-40C0[-62])
W (1206139) mesh_root: layer 5 (40C1<-38FC[-70]) (40C1<-3DBC[-77])
I (1206199) mesh_root: 4620: RTT 61. Elapsed 2732ms
I (1206259) mesh_root: 4620: RTT 61. Elapsed 2785ms
I (1206489) mesh_root: 4620: RTT 61. Elapsed 3019ms
I (1207019) mesh_root: 4620: RTT 61. Elapsed 3544ms
I (1208179) mesh_root: 3120: RTT 67. Elapsed 3ms
I (1208509) mesh_root: 2ECC: RTT 68. Elapsed 332ms
I (1211149) mesh_root: 40C0: RTT 62. Elapsed 6850ms
I (1211149) mesh_root: 387C: RTT 69. Elapsed 2670ms
I (1211379) mesh_root: 4620: RTT 61. Elapsed 7912ms
I (1211379) mesh_root: 387C: RTT 69. Elapsed 2905ms
I (1211549) mesh_root: 40C0: RTT 62. Elapsed 7254ms
I (1211879) mesh_root: 4620: RTT 61. Elapsed 8406ms
I (1211909) mesh_root: 40C0: RTT 62. Elapsed 7615ms
I (1211919) mesh_root: 4620: RTT 61. Elapsed 8451ms
I (1212019) mesh_root: 38FC: RTT 63. Elapsed 7307ms
I (1212019) mesh_root: 4620: RTT 70. Elapsed 3346ms
I (1212059) mesh_root: 38FC: RTT 63. Elapsed 7350ms
I (1212089) mesh_root: 40C0: RTT 71. Elapsed 3328ms
I (1212099) mesh_root: 3DBC: RTT 65. Elapsed 6058ms
I (1212109) mesh_root: 38EC: RTT 66. Elapsed 5986ms
I (1212239) mesh_root: 45CC: RTT 64. Elapsed 6322ms
I (1212279) mesh_root: 45CC: RTT 64. Elapsed 6361ms
I (1212459) mesh_root: 3DBC: RTT 74. Elapsed 2444ms
I (1212509) mesh_root: 38EC: RTT 75. Elapsed 1827ms
I (1212589) mesh_root: 38EC: RTT 75. Elapsed 1908ms
I (1212599) mesh_root: 45CC: RTT 64. Elapsed 6674ms
I (1212609) mesh_root: 38FC: RTT 72. Elapsed 2906ms
I (1212659) mesh_root: 45CC: RTT 73. Elapsed 2856ms
I (1212689) mesh_root: 2ECC: RTT 77. Elapsed 4ms
I (1212699) mesh_root: 3120: RTT 76. Elapsed 16ms
I (1216069) mesh_root: 38FC: RTT 72. Elapsed 6365ms
I (1216069) mesh_root: 387C: RTT 78. Elapsed 3281ms
I (1226289) mesh_root: 38FC: RTT 72. Elapsed 16590ms
I (1226499) mesh_root: 38EC: RTT 75. Elapsed 15813ms
W (1227419) mesh_root: Layer 0 means unknown! ME(3305)
W (1227419) mesh_root: layer 0 (0<-3304[0])
W (1227419) mesh_root: layer 2 (3305<-3120[-61]) (3305<-2ECC[-68]) (3305<-387C[-75])
W (1227429) mesh_root: layer 3 (387D<-4620[-67])
W (1227429) mesh_root: layer 4 (4621<-40C0[-64])
W (1227439) mesh_root: layer 5 (40C1<-38FC[-70]) (40C1<-3DBC[-78])
I (1227509) mesh_root: 38EC: RTT 75. Elapsed 16822ms

Hi @KonssnoK , thanks for your quick reply.

From the logs, the RSSI values seem bad. Acturally, -70 is minimum signal strength for reliable packet delivery. <-80 is only for basic connectivity, packet delivery may be unreliable. If you happen to capture the wireless packets, you may see a lot of retransmission. That's the root cause of low performances.

To obtain a good performance, I suggest you add more intermidiate nodes or move current nodes closer to obtain greater RSSI. BTW, you mentioned these nodes are placed in four rooms, what's the wall material? After the RF signal is radiated into the air via the antenna, the signal will attenuate due to absorption, distance, or possibely the negative effetcs of multipath. Different materials typically yield different attenuation results. e.g. through a concrete wall will attnuate -12dB.

@KonssnoK
Before placing extra devices, increase the Tx power of the node's.
you are setting esp_wifi_set_max_tx_power to 7 (the minimum).
0 -> Highest Tx power
7 -> Lowest Tx power

Thanks @edjeed .
Hi @KonssnoK , you can also use menuconfig -> Component config -> PHY ->Max WiFi TX power (dBm) to change the TX power. 20 dBm is the default value.

Hi guys, I actually change the TX power inside the code itself (you can check the github).
The reason is i want to force multiple sublayers to be created in order to test the devices.
I already have on my list to do such a think (increase TX pwr by 1).
We always want to test out the worst condition environment (anybody is good with ideal conditions).
I can understand that -80dB is a bad signal, but this shouldn't influence the other nodes, meaning that i should see delays only for the nodes at -80dB.
So this could point to some underlying issue in handling the net under heavy losses.

I will post results after increasing by 1 the TX power as soon as I'll make the test (I have multiple topics to follow @ work ;) )

I can understand that -80dB is a bad signal, but this shouldn't influence the other nodes, meaning that i should see delays only for the nodes at -80dB.

387C[-80] is the only parent node for layer 3, so the only nodes not affected by it are the other layer 2 nodes, which are fast

I (1212689) mesh_root: 2ECC: RTT 77. Elapsed 4ms
I (1212699) mesh_root: 3120: RTT 76. Elapsed 16ms

1 bad node can have an influence on all other nodes.
When sending a message to a node with bad reception, the sender will try multiple times until getting an ack or timeout, during this time other messages will stay in the send queue.

Hello guys,
since i was tired of going around the office flashing devices I implemented the TX pwr as part of the packet sent by the ROOT. (github code updated)
In order to see a decent packet loss rate i have to raise the wifi TX power to 50 (level5 - 2dBm).

1 bad node can have an influence on all other nodes.
When sending a message to a node with bad reception, the sender will try multiple times until getting an ack or timeout, during this time other messages will stay in the send queue.

This should be true just for children. Architecturally it would make sense to handle traffic in a split way, so that 2 channels are used: children and parent. This way a leaking child will not impact any other communication to the parent node. You can call it priority queue if you want.

Below the logs at -2dBm after some time running, where you can find

I (837138) mesh_root: 2ECC: RTT 2550. Elapsed 1220ms

for a node which is layer2 -55 rssi

Also, using 50 as TX pwr seems to enforce the system to use 3 layers (at ~-75 rssi) when it would have a much stronger signal by adding one more layer..
How is 1 hop considered worse than a -75 rssi connection?

W (827998) mesh_root: Layer 0 means unknown! ME(3305)
W (827998) mesh_root: layer 0 (0<-3304[0])
W (827998) mesh_root: layer 2 (3305<-3120[-61]) (3305<-2ECC[-53]) (3305<-40C0[-81])
W (828008) mesh_root: layer 3 (3121<-387C[-73]) (2ECD<-4620[-74]) (40C1<-38FC[-64]) (40C1<-45CC[-74]) (40C1<-38EC[-78]) (40C1<-3DBC[-74])
I (828038) mesh_root: 38EC: RTT 2527. Elapsed 272ms
I (828048) mesh_root: 38EC: RTT 2527. Elapsed 276ms
I (828058) mesh_root: 3DBC: RTT 2528. Elapsed 235ms
I (828108) mesh_root: 45CC: RTT 2529. Elapsed 107ms
I (828158) mesh_root: 38EC: RTT 2527. Elapsed 390ms
I (830038) mesh_root: 3120: RTT 2530. Elapsed 16ms
I (830138) mesh_root: 40C0: RTT 2534. Elapsed 15ms
I (830298) mesh_root: 2ECC: RTT 2532. Elapsed 261ms
I (830298) mesh_root: 4620: RTT 2533. Elapsed 249ms
I (830338) mesh_root: 387C: RTT 2531. Elapsed 305ms
I (830578) mesh_root: 38FC: RTT 2535. Elapsed 398ms
I (830608) mesh_root: 38EC: RTT 2536. Elapsed 291ms
I (830898) mesh_root: 3DBC: RTT 2537. Elapsed 380ms
I (830908) mesh_root: 45CC: RTT 2538. Elapsed 92ms
I (832848) mesh_root: 3120: RTT 2539. Elapsed 41ms
I (832858) mesh_root: 387C: RTT 2540. Elapsed 44ms
I (832868) mesh_root: 2ECC: RTT 2541. Elapsed 16ms
I (833168) mesh_root: 4620: RTT 2542. Elapsed 274ms
I (833238) mesh_root: 40C0: RTT 2543. Elapsed 325ms
I (833248) mesh_root: 38FC: RTT 2544. Elapsed 13ms
W (833888) mesh_root: Layer 0 means unknown! ME(3305)
W (833888) mesh_root: layer 0 (0<-3304[0])
W (833888) mesh_root: layer 2 (3305<-3120[-60]) (3305<-2ECC[-54]) (3305<-40C0[-80])
W (833898) mesh_root: layer 3 (3121<-387C[-75]) (2ECD<-4620[-75]) (40C1<-38FC[-64]) (40C1<-45CC[-75]) (40C1<-38EC[-75]) (40C1<-3DBC[-77])
I (833978) mesh_root: 3DBC: RTT 2546. Elapsed 137ms
I (833988) mesh_root: 45CC: RTT 2547. Elapsed 93ms
I (835988) mesh_root: 3120: RTT 2548. Elapsed 83ms
I (835998) mesh_root: 387C: RTT 2549. Elapsed 83ms
I (837138) mesh_root: 38EC: RTT 2545. Elapsed 3577ms
I (837138) mesh_root: 40C0: RTT 2552. Elapsed 92ms
I (837138) mesh_root: 2ECC: RTT 2550. Elapsed 1220ms
I (837158) mesh_root: 4620: RTT 2551. Elapsed 1231ms
I (837448) mesh_root: 38FC: RTT 2553. Elapsed 293ms
I (837468) mesh_root: 38EC: RTT 2554. Elapsed 54ms
I (837888) mesh_root: 45CC: RTT 2556. Elapsed 52ms
I (837888) mesh_root: 3DBC: RTT 2555. Elapsed 86ms
I (839848) mesh_root: 3120: RTT 2557. Elapsed 16ms
I (839858) mesh_root: 2ECC: RTT 2559. Elapsed 8ms
I (840118) mesh_root: 4620: RTT 2560. Elapsed 258ms
I (840118) mesh_root: 387C: RTT 2558. Elapsed 284ms
E (840158) mesh_root: Overwriting pkt 2462 for 40C0 after 42293ms
I (840198) mesh_root: 40C0: RTT 2561. Elapsed 165ms
I (840708) mesh_root: 38FC: RTT 2562. Elapsed 542ms
W (840798) mesh_root: Layer 0 means unknown! ME(3305)
W (840798) mesh_root: layer 0 (0<-3304[0])
W (840798) mesh_root: layer 2 (3305<-3120[-58]) (3305<-2ECC[-54]) (3305<-40C0[-88])
W (840808) mesh_root: layer 3 (3121<-387C[-80]) (2ECD<-4620[-80]) (40C1<-38FC[-66]) (40C1<-45CC[-74]) (40C1<-38EC[-76]) (40C1<-3DBC[-76])

Based on the previous discussion, the document is only a reference. The RTT may be affected by many situations:

  • Test environment
  • RSSI / tx_power
  • Upstream Flow Control
  • Packet sending frequency
  • etc.

@dongdong004 我同样遇到了这个问题 数据下发没有任何延迟 数据向root发送会阻塞 并且阻塞长达3秒左右 3秒后数据蜂拥而至到达root,这个问题对我而言是致命的 不知道怎么改进这个 阻塞发生在esp_mesh_send

@dongdong004 忘了说了 我的rssi非常棒 在-15 - -20之间 依然出现这个问题

Was this page helpful?
0 / 5 - 0 ratings