Gluon: significantly increased management traffic after upgrade to Gluon v2017.1.x

Created on 24 Jun 2018 · 33 Comments · Source: freifunk-gluon/gluon

this week we upgraded around 75% of our nodes to Gluon v2017.1.8, while about 20% were using this version already.
before, the nodes were running Gluon v2016.2.7.
(we still have ~20 nodes running Gluon v2016.1.x - i have no control over those, no autoupdate and no SSH)
in total, we have about 540 online nodes.

starting with the upgrade, the mgmt_tx of a single idle node without clients increased from 153 kbit/s to 254 kbit/s (by 66%) and the mgmt_rx from 57 kbit/s to 90 kbit/s (58%)

see the attached graph which shows the increased traffic with lowered spikes over all nodes, also the single node graph. the "all nodes" graph is capped at 1000 kbit/s to be able to see something useful, the spikes would make that impossible without the cap.

@ecsv also reported similar data from his community.

2018-06-24_15-39_gluon_network_mgmt_traffic_increase
2018-06-24_15-41_gluon_node_v201718_traffic_rates

bug

All 33 comments

It is currently unknown why this happens. I have two main suspects

  • batman-adv sends more ogms for an unknown reason
  • the changes in the wireless layer increased the reliability of broadcasts (which are used to transport OGMs)

The effect can only be seen when a significant portion of the nodes is updated from 2016.2.x to 2017.1.x

Regarding the first point (batman-adv): Freifunk Vogtland changed its batman-adv version during the update from 2017.0.1+norebroadcast to 2018.1+maint+norebroadcast patches. Rotanid's update changed batman-adv from 2016.2.x+maint patches+norebroadcast to 2017.2+maint patches. I would therefore guess that the culprit (if batman-adv is the reason) has to be searched for between batman-adv v2017.0.1..v2017.2


Maybe somebody is able to reproduce the problem and can install a modified Gluon 2017.1.8 version on a lot of nodes in their network. It would be good if this person could then check whether returning to batman-adv 2016.2.x (from Gluon 2016.2.x) reduces the amount of mgmt overhead again. The three patches to revert to the older batman-adv version can be found in the branch https://github.com/FreifunkVogtland/gluon/commits/batadv/2016.2.x

FYI, my "spikes" are back (like before the upgrade) - the management traffic remains at the higher level

I think this also applies to v2018.1. We did a segmentation together with the v2018.1 rollout: half the nodes, but increased management traffic.

Update:

image

This describes it quite well. It is a node with nearly no client traffic. The originator interval has not changed (5s).

after updating almost all nodes (over 500) to Gluon v2018.1.x with batman-adv 2018.1, mgmt_tx increased by ~10kbit/s and mgmt_rx by ~40kbit/s, so it increased again, but not nearly as much as last time.

@ecsv Have there been any new insights regarding this issue?

At least I received no info and I also cannot provide any

We (Ffmuc) also see a heavy increase of mgmt traffic after updating from Gluon v2016.2.7 to Gluon v2018.2.1.

We have about 1500 nodes and our mgmt traffic nearly doubled.

mgmt-traffic

@T-X Is it possible that this is related to the removal of the no_rebroadcast hack from our batman-adv?

@NeoRaider: Hm, no, the no_rebroadcast patch should only have touched layer 2 broadcast frames, but not mgmt traffic (OGMs). The no_rebroadcast patch was only applied on "batadv_send_outstanding_bcast_packet()".

Similarly, the new, automatic approach in batman-adv is only applied to layer 2 broadcast frames and to BATMAN_V OGMs. But not BATMAN_IV.

@awlx: Urgh, yeah, that graph looks ugly... Is that a capture from a Freifunk router or from a gateway? Would it be possible for you to provide a capture from the VPN interface, filtered by OGMs (with wireshark or tshark you should be able to filter by batadv packet type)?

Also, do you have a link to the site/domain config used before and after the update, @awlx?

Hi @T-X, we as ffmuc had the following configs:

Hoping to reduce that traffic, we then disabled ULA and IBSS, but the benefits of v2019.0.2 were minimal to not observable.

Concerning the graph @awlx posted: It was from our grafana: The average reported mgmt traffic of all nodes collected via respondd. So it was only observable for us as the release got rolled out to more nodes...

Concerning a capture: I will try to do a capture on the vpn interface. Do you have a link or a simple command I can use to create it?

Hi @krombel,

You can filter just for the mgmt (== OGMs, originator messages) with this command:

$ tcpdump -i mesh-vpn0 'ether proto 0x4305 and ether[14] = 0x00 and ether[15] = 0x0f' -w /tmp/mgmt.cap

The capture filter here means:

  • ether proto 0x4305: The batman-adv ethernet frame type
  • ether[14] = 0x00: The batman-adv packet type for OGMs (v1), which is the first byte after the ethernet frame header
  • ether[15] = 0x0f: The batman-adv compatibility version number (15, used since 2014), which is the second byte after the ethernet frame header

(An equivalent display filter for tshark or Wireshark would be eth.type == 0x4305 && batadv.batman.packet_type == 0x00 && batadv.iv_ogm.version == 15 in case you might be more familiar with those tools.)
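For anyone who would rather post-process raw frames than filter at capture time, the same three checks can be sketched in Python. This is only a minimal illustration: it assumes untagged ethernet frames (no VLAN header, so the batman-adv header starts at byte 14), and the function name `is_batadv_iv_ogm` is made up for this example.

```python
import struct

BATADV_ETHERTYPE = 0x4305    # batman-adv ethernet frame type
PACKET_TYPE_IV_OGM = 0x00    # packet type for OGMs (v1)
COMPAT_VERSION = 15          # compatibility version number


def is_batadv_iv_ogm(frame: bytes) -> bool:
    """Return True if a raw, untagged ethernet frame carries an OGM (v1).

    Mirrors the capture filter:
    ether proto 0x4305 and ether[14] = 0x00 and ether[15] = 0x0f
    """
    if len(frame) < 16:
        return False  # too short to hold the two batman-adv header bytes
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return (
        ethertype == BATADV_ETHERTYPE
        and frame[14] == PACKET_TYPE_IV_OGM
        and frame[15] == COMPAT_VERSION
    )
```

Frames with a VLAN tag would shift all offsets by four bytes, which this sketch does not handle.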

Hoping to reduce that traffic, we then disabled ULA and IBSS, but the benefits of v2019.0.2 were minimal to not observable.

Hm, weird. Extra wifi interfaces in the new firmware could have been an explanation as the "mgmt" count is the sum of originator messages sent and received over all interfaces used by batman-adv. The mgmt_tx counter for instance is increased here but the caller, batadv_iv_ogm_send_to_if() is called for each outgoing interface.

Also, keeping in mind that we send OGMs three times on a wifi interface to compensate for potential wifi packet loss, versus once on any other interface (including VPN interfaces), extra wifi interfaces could have roughly matched your total increase: 3+1=4 / 3+3+1=7 packets (single radio / dual radio) before, versus 3+3+1=7 / 3+3+3+3+1=13 packets (single radio / dual radio) after. This disregards a few other factors like OGM packet aggregation and rebroadcast suppression, which depend on the topology.

Have you updated all your nodes to your site v2019.0.2 by now? Could you check whether the Gluon upgrade scripts have successfully removed the IBSS interface with your v2019.0.2 upgrade?

@ecsv

It is currently unknown why this happens. I have two main suspects

  • batman-adv sends more ogms for an unknown reason

I did some simple, two node tests in virtual machines. At least in this most simple setup there was no difference in OGMs between batman-adv v2017.0.1, v2018.1 and the current master branch:

https://gist.github.com/T-X/90cda122ae30ddd5b860a6df0987fc77

  • the changes in the wireless layer increased the reliability of broadcasts (which are used to transport OGMs)

That's what I would also tend towards. Gluon v2016.2.x was pre-FQ-Codel and Gluon 2017 introduced FQ-Codel, right? That would probably have made a difference. Also, at some point the airtime fairness patch was added (to ath9k?) in OpenWrt (which version? And which Gluon version?).

Also note that we should have reduced layer 2 broadcast overhead with gluon-ebtables-limit-arp, the IGMP/MLD segmentation and the batman-adv broadcast avoidance patches. This would lead to fewer pesky, small broadcast packets and therefore fewer opportunities for wifi packets, including OGMs, to collide.

Ideally to conclude that the increased mgmt packets are caused by increased wifi reliability: Any chance anyone has TQ values before and after their update in their database? An overall average (and/or median) of that before and after the update would be interesting for comparison.
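If someone has such values in their database, the comparison itself is simple. A rough sketch, assuming the TQ values (0..255) have already been exported as two plain lists; the function name and the input format are made up for illustration:

```python
from statistics import mean, median


def summarize_tq(before, after):
    """Compare TQ samples taken before and after an update.

    before/after: non-empty sequences of TQ values (0..255).
    Returns the mean and median of each set for a side-by-side comparison.
    """
    return {
        "before": {"mean": mean(before), "median": median(before)},
        "after": {"mean": mean(after), "median": median(after)},
    }
```

A clearly higher mean and median after the update would support the "more reliable broadcasts" theory; similar values would speak against it.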

Ok, we have now had another community reporting an increased OGM/mgmt traffic:

Freifunk Kiel, running Gluon v2018.1.4, switched from batman-adv-legacy (compat 14) to batman-adv (compat 15) and observed a 4x mgmt traffic increase:

https://grafana.freifunk.in-kiel.de/d/000000003/nodeinfo?orgId=1&panelId=3&width=1200&height=600&from=1558188586085&to=1558433114387&var-node=704f57455064&fullscreen

The interesting thing is that with this update they did not update the Gluon version, just switched the BATMAN variant.

Looking at pcap dumps we observed that for one thing the average OGM size has about doubled due to the added TVLVs.

The other 2x factor currently seems to point to a too short OGM interval. The interval is configured to 5 seconds, but in practice OGMs seem to be transmitted at intervals of about 2-3 seconds. We picked two random nodes and both showed the same behavior. Does anyone else observe something similar?

I'll see whether I can write something to measure this in more detail and to make an overall statistic.
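One way such a measurement could look, as a rough sketch: given (timestamp, originator) pairs extracted from a capture, compute the mean gap between consecutive OGMs per originator and compare it to the configured 5 seconds. The input format and the function name are assumptions for illustration; extracting the pairs from the pcap is not shown. Forwarded copies of the same OGM would presumably have to be filtered out first, otherwise they shorten the apparent interval.

```python
from collections import defaultdict
from statistics import mean


def mean_ogm_intervals(samples):
    """samples: iterable of (timestamp_seconds, originator) pairs.

    Returns {originator: mean gap in seconds between consecutive OGMs}.
    Originators seen fewer than twice are omitted.
    """
    times = defaultdict(list)
    for timestamp, originator in samples:
        times[originator].append(timestamp)

    result = {}
    for originator, stamps in times.items():
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if gaps:
            result[originator] = mean(gaps)
    return result


# Synthetic example data, not taken from a real capture:
samples = [(0.0, "aa"), (2.5, "aa"), (5.0, "aa"), (0.0, "bb"), (5.0, "bb")]
print(mean_ogm_intervals(samples))  # {'aa': 2.5, 'bb': 5.0}
```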

So far from looking at the code I can't find anything weird yet. Nor any specific changes related to the BATMAN IV OGM scheduler between compat 14 and compat 15.

Also when keeping in mind that we send OGMs three times on a wifi interface to compensate potential wifi packet loss

And sorry, that was actually wrong. We do the 3x broadcast for broadcast data packets only. And not for OGMs.

Please test the changes proposed here (there are changes for all kinds of batman-adv versions and OpenWrt/LEDE versions): https://www.open-mesh.org/issues/380#note-4

And for the record, the issue was introduced by batman-adv v2016.3.

  • Gluon v2016.2.7: batman-adv v2016.2 => unaffected
  • Gluon v2017.1 (or greater): batman-adv v2017.1 (and greater) => affected

Thanks to everyone for the incredibly helpful feedback (graphs, statistics and even Lua evaluation scripts for Wireshark (kudos to @sargon for the latter), ...) and thanks to all these amazing people that are having a watchful eye on how our mesh networks behave. You guys and gals are amazing :-).

I have now installed it here in a domain and attached is an image with the state for 3 randomly chosen nodes before the patch (left part of the graph) and after the patch (smaller right part of the graph). The gap in between is the time when the nodes were updating.

2019-06-02_mgmt-before-patch_after-patch

Looking forward to a backport of that patch!

The patch is now part of Gluon master and comparing a few graphs above I think that fixes it.

If for any reason people think otherwise please comment and we can reopen.

We should backport this and do a v2018.2.2 release soon.

Here is the graph for a server which is connected to all domains (so you see a combined graph of everything). 88% of the 451 nodes were updated to the new firmware:

2019-06-05_ffv-all-domains

The mgmt (down/up) went from 284.2/210.4 kbps to 153.1/111.5 kbps.
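For comparing such before/after numbers across communities, the relative change is easy to compute. A small sketch (the helper name `percent_change` is made up for this example) applied to the rates quoted above:

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


# mgmt down/up rates in kbps, before and after the patched firmware:
print(round(percent_change(284.2, 153.1), 1))  # -46.1
print(round(percent_change(210.4, 111.5), 1))  # -47.0
```

So in this case both directions dropped by a bit less than half, with 88% of the nodes updated.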

after having had ~70 nodes out of ~650 online nodes with the fix for a few weeks, we finally deployed to stable branch.
the load average of all nodes peaked to 0.42 before the rollout and is now at 0.36 in the last two days, although the overall load looks more like having dropped 30-40%
2019-07-19_23-43-26_load_avg
mesh mgmt traffic AVG is down from 292/364 to 162/201 kbit/s in our network (note: the previous value already had around 10% of the nodes fixed)
2019-07-19_23-47-53_mesh_traffic

We are also seeing significantly reduced (to less than 60%) management traffic in our 600-node-mesh. Thanks a lot to everyone involved in finding this!

Awesome! Btw, do any communities have load statistics before and after this fix?

Ideally I'd be interested in overall load average+median before+after, together with the number of nodes in the domain. And averages/median filtered for 32MB flash devices.

(I want to know whether the number of "background packets" has a noticeable impact on a node's overall load.)

@T-X here you go, I think you can figure out when the update had happened ;-)
https://grafana.freifunk.in-kiel.de/d/yA3Quidmk/node-overview?orgId=1&from=1561766040369&to=1563337377334

Awesome! Btw, do any communities have load statistics before and after this fix?

although i don't have it as detailed as you request it, my comment contained a screenshot of the load graph.

Is the load back to the value of Gluon 2016.2.x now?

Is the load back to the value of Gluon 2016.2.x now?

i'm not sure how this could be reproduced in a comparable way.
you would need a larger mesh with hundreds of nodes and test it with v2018.2.2 and v2016.2.x while not changing anything else, this is basically impossible.

why should this be impossible? there are some communities still running 2016.2.x and they could monitor the change while updating to 2018.2.2

why should this be impossible? there are some communities still running 2016.2.x and they could monitor the change while updating to 2018.2.2

"while not changing anything else" - i doubt this will apply.

and waiting for someone else to upgrade while still running v2016.2.x which didn't get any security update for a long time.... is not really responsible.
poor node owners...

On 2019-07-27 02:28, Andreas Ziegler wrote:

why should this be impossible? there are some communities still running 2016.2.x and they could monitor the change while updating to 2018.2.2

"while not changing anything else" - i doubt this will apply.

and waiting for someone else to upgrade while still running v2016.2.x which didn't get any security update for a long time.... is not really responsible.
poor node owners...

To get a cleaner environment, it should be possible to just deploy hundreds of VMs in an isolated v2016.2.x setup and the same for v2018.2.x.

vg
Tarek

To get a cleaner environment, it should be possible to just deploy hundreds of VMs in an isolated v2016.2.x setup and the same for v2018.2.x. vg Tarek

hm... without any clients? and does someone already have an automation for this?
