Gluon: System instability above specific point

Created on 5 May 2016 · 82 Comments · Source: freifunk-gluon/gluon

As I reported on IRC, we at FFRN are seeing higher load and frequent reboots on nearly 30% of our nodes. This happens once we reach a specific number of nodes and clients in our network. We have been debugging this issue for over a month now and think we have narrowed down the possible sources.

Let's start with our observations. We first became really aware of the problem when we reached 1500 clients in our network, spread over nearly 700 nodes. This happened around the first of April this year. After analyzing the problem, though, we think it started even earlier, with some "random" reboots that we were also analyzing.

The first thing we observed was that the majority of the affected nodes are the small tl-wr841 devices. This does not mean that bigger nodes like a tl-wr1043 are not affected; the problem just doesn't have a big enough impact on them. Interestingly, not all of the small nodes were affected: only about 30% show all the characteristics of this problem. All other nodes run without any interruptions.

On an affected node we can see the following: if we reach more than 3000 entries in the transglobal table, the problems start, and if the number falls below this mark, the problems are mostly gone. Such a node shows an increased average load of around 0.45 to 0.9, compared with 0.2 to 0.25 on an unaffected node. On the problematic nodes the load also starts to peak at values of 2-4 while we are above the mark. And sometimes, every few hours or even every few minutes, the node reboots.
Another interesting observation is that affected nodes get a lot more free RAM when the problems start. The RAM usage decreases from the healthy default of around 85% (on an 841) to 75%-80%.

On a TL-1043 it looks like this:

On a TL-841v9 it looks like this:

At no point could we see a single process causing problems or using more RAM than usual; only the load and the system CPU utilization showed that something was wrong. So we thought that the problem has to be in the kernel or related to the RAM.

So we started to debug the problems. First we tried to find a pattern in our statistics to limit the number of possible sources. We tested a lot of other ideas, but all with nearly no effect, so the most promising lead was the TG table. It is not the number of entries itself, though, because we can't find any limit near this number in the sources, and some other things also speak against it. So the problem has to be in the processing of the entries or somewhere else.

After that we found out that something in combination with the TG table (we think it's the writing of the table to tmpfs) was causing new page allocations. These page allocations couldn't be satisfied by the available RAM, so some parts of the page cache were dropped. This cache holds the frequently running scripts of the system, so afterwards the system has to reread them. And here the first problem starts: the system keeps rereading the disk without end. I've attached a log file for an affected and an unaffected node.
notbroken.log.txt
broken.log.txt

If this continues for a while and we then try to write our TG table again, we may run into the vm.dirty_ratio limit, which blocks the IO of all processes and makes everything even worse.

So to solve this problem we started tuning a lot of sysctl parameters. Here is a list of all the additional options we currently set.

net.ipv6.neigh.default.gc_interval=60
net.ipv6.neigh.default.gc_stale_time=120
net.ipv6.neigh.default.gc_thresh1=64
net.ipv6.neigh.default.gc_thresh2=128
net.ipv6.neigh.default.gc_thresh3=512

net.ipv4.neigh.default.gc_interval=60
net.ipv4.neigh.default.gc_stale_time=120
net.ipv4.neigh.default.gc_thresh1=64
net.ipv4.neigh.default.gc_thresh2=128
net.ipv4.neigh.default.gc_thresh3=512

vm.min_free_kbytes=1024
vm.dirty_background_ratio=5
vm.dirty_ratio=30
vm.dirty_expire_centisecs=0

Here we save some RAM by using smaller neighbour tables. We increased min_free_kbytes to have a bigger buffer for allocation problems. We lowered dirty_background_ratio so the system starts writing data to disk in the background earlier (this is no problem because we write to a ramdisk). We raised dirty_ratio to prevent a complete IO lock. And setting dirty_expire_centisecs to 0 means that we write data back only when we reach dirty_background_ratio and not after a time limit, to prevent useless writes.
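To check that such settings are actually active, the values can be compared with the files below /proc/sys. The following is only a minimal sketch of that mapping, meant to be run on a workstation against a copied list; the helper names `parse_sysctl` and `proc_path` are made up for illustration and are not part of Gluon.

```python
from pathlib import PurePosixPath


def parse_sysctl(text):
    """Parse sysctl.conf-style 'key=value' lines into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key):
    """Map a sysctl key to its file below /proc/sys (dots become slashes)."""
    return str(PurePosixPath("/proc/sys", *key.split(".")))


# The vm.* options from the list above.
EXAMPLE = """
vm.min_free_kbytes=1024
vm.dirty_background_ratio=5
vm.dirty_ratio=30
vm.dirty_expire_centisecs=0
"""

settings = parse_sysctl(EXAMPLE)
for key, wanted in settings.items():
    print(proc_path(key), "should contain", wanted)
```

On a node, reading each printed path (for example with cat) and comparing it with the wanted value shows whether the option was applied.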

With these changes we could increase the performance: we have decreased the load of affected nodes even below the average value of unaffected nodes. So maybe some of these options are also relevant without this issue. To get even more free RAM, some people tried disabling haveged; this also makes the node more stable, because more RAM is free.

Then we saw that the community in Hamburg has a bigger network with a lot more clients and nodes but doesn't seem to be affected by the problem. So we analysed the site.conf and found out that the mesh VLAN feature we were using (Hamburg does not) was causing double entries for every node: one with VID -1 and one with VID 0. This isn't great.

Then we flashed a test node with the firmware from Hamburg. The first difference is that our firmware is based on gluon 2016.1.3 and the one from Hamburg is the version 2015.1.2. So there are a few differences.

But briefly back to the TG table. The TG table from Hamburg was around 3700 entries long without the problem occurring. So the problem must be something that changed between the versions. As the 2016 versions are based on OpenWRT CC and not on BB like 2015, this could be a lot. But we think it's not something in the OpenWRT base system; it has to be something more Freifunk-specific.
So we looked again at the process list of a node with our firmware and one with the firmware from Hamburg.
Here we found the following differences (first value for FFRN, second for FFHH):

/sbin/procd 1408 vs 1388
/sbin/ubusd 896 vs 888
/sbin/logd -S 16 1044 vs 1036
/sbin/netifd 1568 vs 1608
/usr/sbin/sse-multiplexd 780 vs 0 (doesn't exist)
radvd versionen 1108 vs 1104
/usr/sbin/uhttpd 1132 vs 1140
/usr/sbin/dnsmasq 1076 vs 916
/usr/bin/fastd 3316 vs 3300
odhcp6c 800 vs 812
/usr/sbin/batman-adv-visdata 784 vs 0 (doesn't exist)
/usr/sbin/dnsmasq 932 vs 924
respondd 2000 vs 2152

difference sum: 16844 vs 14728 = 2116

This means the newer firmware uses over 2MB more RAM. But this can't be the only source either, because we have nodes without the problem.
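The totals can be rechecked with a few lines of Python. This is just a sketch that copies the rows exactly as listed; the row labels are shortened and the units are assumed to be KB as printed by ps. With the rows as listed, the FFRN column adds up to the stated 16844, but the FFHH column adds up to 15268 rather than 14728, so one of the listed values may have been mistyped.

```python
# (FFRN, FFHH) sizes per process, copied from the list above.
ROWS = {
    "procd": (1408, 1388),
    "ubusd": (896, 888),
    "logd": (1044, 1036),
    "netifd": (1568, 1608),
    "sse-multiplexd": (780, 0),      # does not exist on FFHH
    "radvd": (1108, 1104),
    "uhttpd": (1132, 1140),
    "dnsmasq (first)": (1076, 916),
    "fastd": (3316, 3300),
    "odhcp6c": (800, 812),
    "batman-adv-visdata": (784, 0),  # does not exist on FFHH
    "dnsmasq (second)": (932, 924),
    "respondd": (2000, 2152),
}

ffrn_total = sum(ffrn for ffrn, _ in ROWS.values())
ffhh_total = sum(ffhh for _, ffhh in ROWS.values())
print("FFRN:", ffrn_total, "FFHH:", ffhh_total, "difference:", ffrn_total - ffhh_total)
```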

Then we started thinking about what we have, and also started writing some documentation of the work for the community. This gave us the idea that the sudden decrease in RAM usage could be caused by the OOM killer. The RAM graph also showed some characteristics of a memory leak. But again, this couldn't be the only problem. So we thought a little further and now think it's a combination of a memory leak and memory corruption that causes the endless rereading of the flash storage. With all this information, we think that the only service that is really close to the problem is batman-adv-visdata, so this would be the first place to dig deeper. But here we are reaching the limit of our resources and knowledge about the system, and we hope to find someone who can help us find a solution for this problem.

We know that this is a lot of information, and maybe a lot of information is still missing. Please ask if you need something.

You can find a German version including the discussions here: https://forum.ffrn.de/t/workaround-841n-load-neustart-problem/1167/29?u=ben

bug

leahoswald

👍9

Most helpful comment

Hey @rotanid, thanks for this information. This makes it even more important to find the bug. The problem is that splitting the network is only a workaround for an important problem. So I think we should find and solve this bug.

@Adorfer I know your point, so please don't tell me this every time. This is not how new solutions are found. We are just trying to find a solution for problems instead of running from one workaround to the next. We would also like to experiment with the technology and find such limitations in order to find new ways to handle them. And yes, we all know about the limitations of a batman-adv L2 network; this is the reason why a lot of people are experimenting with new solutions. But they need their time to become good and stable. So please let us discuss the problem here without your focus on small community networks. And hey, it looks like a software bug, and such bugs can be fixed.

leahoswald on 7 May 2016

👍6

All 82 comments

Which version of batman-adv do you use?

RubenKelevra on 6 May 2016

We use the batman-adv-15 package from gluon. So it is batman-adv 2015.1

leahoswald on 6 May 2016

Freifunk München also had problems like yours in 2015 with a similar node count, so they split their network into three segments. AFAIK they didn't do such a detailed analysis as you did, and I was told today that the problems are starting again because of the growth of the segments.

rotanid on 7 May 2016

Thanks for all the work.
But for me this is just another confirmation that the existing batman-adv does not scale properly for networks of more than 300 nodes.

Adorfer on 7 May 2016

👎4 👍1

leahoswald on 7 May 2016

👍6

Even if the workaround helps a lot ... a solution would be great!

bitboy0 on 13 May 2016

That sounds plausible to me. We've seen similar behaviour in the Regio Aachen network when we reached about 900 nodes with 3000 clients.
At a very specific time of day the load started to increase significantly, and in the evening, with clients leaving the network, the load went back to a normal value.

load-haag
load-haag2

We thought that the mesh table just got too big for the little routers. A remarkable point was that a strong offloader in front of these little devices protected them, maybe because the table got simpler.
With this in mind I tried removing some of the mesh links in the core network, changing the connection between the four gateways from full mesh to a ring. This seemed to help a little as well.
But adding two more gateways and additional fastd instances on the gateways (one for incoming IPv6 and one for IPv4) had a much bigger impact, resulting in a load of ~0.5 per core on the gateways.
Fast gateways with nearly no packet loss -> lower load on the mesh nodes.

This got us to around 1,100 nodes with 3,500 clients, but network performance was dropping.

(A few months ago we finally split our network into many small networks, using multiple fastd instances attached to different batman devices. One firmware, multiple whitelists for fastd.)

mmalte on 13 May 2016

By the way, we are using gluon-mesh-batman-adv-14.

mmalte on 13 May 2016

Would it be possible to get a dump from dmesg or /proc/vmstat once the issues start to occur on a node? A /proc/slabinfo would be great too, but it seems it's not available on Gluon images by default. Finally, just to verify that it's a memory issue, the output of /sys/kernel/debug/crashlog from a node that just crashed and rebooted would be great, too.

@nazco: Thanks for the very thorough, analytical report! By the way, one more thing which, next to batman-adv and the IP neighbor caches, needs memory relative to the number of clients is the bridge. Its forwarding database (fdb for short) keeps a list of MACs behind ports, too.

Speaking of the bridge, I noticed that the bridge kernel code does not use kmalloc() for its fdb entries, but kmem_cache_*() function calls. Maybe we are having similar issues as we had with the debugfs output until they added their fallback from kmalloc() to vmalloc(), namely a very fragmented RAM. It could be interesting whether kmem_cache_alloc() might not just speed up memory allocation but also help in getting a less fragmented RAM (if that's the issue here).

Regarding the VLANs, that indeed sounds odd. I queried ordex, the guy behind the TT and its VLAN support, on IRC. Btw., I just checked in a VM with one isolated node, and with or without VLANs I'm seeing a weird, additional local TT entry with VID 0 which has the MAC address of bat0. Do you have VID 0 entries without the P flag? How many have VID 0, how many VID 1 exactly?

T-X on 13 May 2016

Hey, thanks for the reply. I'll try to get these info for you.

leahoswald on 13 May 2016

And one more thing which would be interesting: Running wirerrd to see whether something weird happens on the network when the load is high.

Currently I'm running this for Freifunk Hamburg and Freifunk Lübeck and that's usually one of the first places we look at when something behaves oddly.

T-X on 13 May 2016

Regarding the process table, I'm currently wondering about two things:

Isn't haveged missing?
Is this really a 2016.1.3 device? That version should have the C-rewrite of respondd and I'm a little astonished that it allegedly still takes about 2MB of RAM.

T-X on 14 May 2016

Isn't haveged missing?

The process table in my initial post only shows the differences from the 2015.2 firmware of Hamburg.

Is this really a 2016.1.3 device?

Yes, the device is running 2016.1.3

leahoswald on 14 May 2016

@T-X, I think these numbers are virtual memory (as that is the number shown by ps). The new respondd still uses about 2 MB of virtual memory, as it uses dlopen a lot (and at least uClibc will use a lot of virtual memory per dlopened object).

@nazco, if the numbers you compared are virtual memory, they are meaningless, as virtual memory is often never actually allocated. AFAIK, the number VmRSS in /proc/$PID/status is the most relevant for actual RAM usage of a process.
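For comparing resident rather than virtual memory, the VmRSS line can be pulled out of a saved copy of /proc/$PID/status. A minimal sketch, assuming the status text was copied off the node; the function name `vm_rss_kb` and the sample text are invented for illustration and are not measured data.

```python
import re


def vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/$PID/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


# Invented sample in the format of /proc/$PID/status (not measured data).
SAMPLE = "Name:\trespondd\nVmSize:\t    2000 kB\nVmRSS:\t     412 kB\n"

print("VmRSS:", vm_rss_kb(SAMPLE), "kB")
```

Kernel threads have no VmRSS line at all, which is why the function returns None instead of raising an error.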

I don't think any of the processes make much difference; the most important change from Barrier Breaker to Chaos Calmer is the newer kernel. I think the new kernel might work a bit worse under memory pressure, although it's hard to tell for sure.

NeoRaider on 14 May 2016

👍2

Unfortunately, kmem_cache_alloc() isn't really documented in the kernel, so we are unsure whether it'd help in any way with this problem. Looking at other parts of the kernel, it seems to be common to use dedicated caches for larger numbers of objects which change frequently.

Would anyone be willing to give this patch a try to see whether it makes any difference?

https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2016-May/015368.html

T-X on 15 May 2016

Btw., regarding the fiddling with the neighbor tables in the first post. @ohrensessel and I noticed yesterday after applying #674 and #688 that the multicast snooping takes place before any additions to the bridge forwarding database. For instance, "$ bridge fdb show" had no more entries towards bat0, while before it had one MAC entry for nearly every client in the mesh. For the IP neighbor tables it should be similar.

@ohrensessel wanted to test and observe further, whether having these two patches makes any difference for the load peaks at Freifunk Hamburg in the evening. He'll probably report back later.

T-X on 15 May 2016

How could that be an explanation for the strict limit of 3000 entries in the transglobal table?
Below that limit nothing happens... above it the problems are there. I'm just interested in how that can be.

bitboy0 on 19 May 2016

@bitboy0: There is no strict limit for the global translation table. There is just a limit for the local translation table of a node (= the number of clients a node can serve; ~120 with batman-adv 2013.4, 16x that much with a recent version of batman-adv / since fragmentation v2 ).

The reports so far, backed by the observation that only 32MB devices are affected, seem to point to a simple out-of-memory problem on such devices (though I'm still waiting for a /sys/kernel/debug/crashlog or dmesg output from someone to confirm). When a device starts to get low on memory, then the Linux kernel memory allocator will have more and more trouble to serve requests and might even need to move some objects to be able to have consecutive, spare memory areas available again. Thus resulting in high load first and at some point even a reboot.

In my x86/amd64 VMs with many kernel debugging options enabled, a global TT entry had allocated about 200 bytes. Sven has mentioned a raw size of 48 bytes on OpenWRT ar71xx. Which will probably be aligned to 64 bytes. So 4000 entries times 64 bytes would result in about 250KB of RAM usage. Which doesn't seem much.

Of course, if the RAM is already full through kernel and userspace programs, then even a few additional hundred KB in the afternoon/evening through the batman-adv global TT, the bridge forwarding database or IP neighbor tables might be the straw to break the camel's back.
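The estimate above is simple enough to recompute for other table sizes. A small sketch using only the per-entry sizes mentioned in this comment (64 bytes aligned on ar71xx, about 200 bytes in the x86 debug VM); the function name is made up.

```python
def tt_table_bytes(entries, bytes_per_entry):
    """Rough memory needed for a number of global TT entries."""
    return entries * bytes_per_entry


for label, size in (("ar71xx, aligned to 64 bytes", 64), ("x86 VM with debugging", 200)):
    total = tt_table_bytes(4000, size)
    print(f"{label}: {total} bytes = {total / 1024:.0f} KiB")
```

For 4000 entries at 64 bytes this gives 256000 bytes, which is exactly 250 KiB.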

T-X on 19 May 2016

Regarding that hypothesis, it might further be interesting, whether:

devices are affected immediately or after a certain uptime
whether disabling some/any userspace service has a positive effect
- does anyone have a 32MB node affected without fastd or haveged running?

T-X on 19 May 2016

@T-X From the uptime of my little WA701ND I can see that devices are affected immediately (less than 10 minutes after restart).
As mentioned in the workaround, some people disabled haveged, which results in better behaviour due to lower RAM usage in total, but nevertheless this is only a workaround.

I added a serial connection to some nodes and will try to get the remaining logs before a crash.

Nurtic-Vibe on 20 May 2016

Thanks @Nurtic-Vibe! What exactly do you mean by "better behaviour"? No more out-of-memory crashes, or just less often? And what do you mean by "this is only a workaround"? If a high static memory footprint (85% was mentioned in the initial post) were the issue, then reducing that would be a valid fix, wouldn't it?

Btw., it's probably not that well known, but OpenWRT has a great feature to preserve crashlogs over a non-power-cycled reboot. After a crash & reboot you should have a new file in /sys/kernel/debug/crashlog. So it would be great if anyone, even with no serial-console access, could have a look at that after a crash.

T-X on 20 May 2016

And one more question @nazco: For the 85% you mentioned, what does the graph show if you are running the same node just like that but cut the uplink? What is the memory footprint without this node seeing the rest of the network?

It'd be interesting to test whether it stays relatively high even without any other mesh participants. That could back or dismiss the too-much-static-memory-usage theory.

T-X on 20 May 2016

@T-X With haveged disabled we get OOM crashes less often, but they still occur regularly.

Nurtic-Vibe on 20 May 2016

Looking at @nazco's broken.log.txt again, @NeoRaider, do you know whether the vmalloc fallback for debugfs access made its way to Gluon yet? It seems that batman-adv-visd accesses the global translation table via debugfs first to translate the alfred server's MAC address to an originator address. Which then results in yet another debugfs originator table lookup to check the TQ and determine the best alfred server.

To inform others (@NeoRaider found this issue a while ago): Without the vmalloc fallback, that results in the need for a large consecutive memory area to be allocated upon accessing a debugfs file. The allocation strategy is a naive one: first try x bytes, and if that turns out to be insufficient while copying, double the size and copy again. That could explain why a certain threshold of global TT entries might cause a jump in load.

If all that were the case, then it'd be a mixture of high static memory usage and many small, scattered allocations in the remaining memory, which makes trouble for a large, consecutive allocation for debugfs access.
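The try-then-double strategy can be simulated to see where the jumps would fall. This is only a sketch of the behaviour as described here, not the kernel code; the starting size of 4096 bytes (one page) is an assumption, and the function name is made up.

```python
def buffer_attempts(needed_bytes, initial_size=4096):
    """Return the buffer sizes tried until the output fits.

    initial_size is an assumption (one 4 KiB page), not taken from the thread.
    """
    sizes = [initial_size]
    while needed_bytes > sizes[-1]:
        sizes.append(sizes[-1] * 2)  # too small: throw away, double, copy again
    return sizes


for needed in (100_000, 200_000, 300_000):
    sizes = buffer_attempts(needed)
    print(needed, "bytes:", len(sizes), "attempts, final buffer", sizes[-1])
```

Under these assumptions the final buffer jumps from 131072 to 262144 to 524288 bytes as the table output grows, so a few extra entries near a power of two double the size of the consecutive area that has to be found.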

T-X on 20 May 2016

@T-X, the vmalloc patch is included since Gluon v2016.1.

NeoRaider on 20 May 2016

@T-X one of my nodes just crashed but there is no file like /sys/kernel/debug/crashlog

leahoswald on 20 May 2016

@T-X No, the bigger nodes do have the same RAM-eating problem. Because they have more RAM, they don't care, but the bug itself is the same. With "strict limit" I don't mean that there is a visible limit like a maximum table size, but that the problems occur once this specific number of entries is in the list. Maybe it's better to say: the bug is only visible if the TG has 3000 entries.

And the problem starts immediately on all nodes in the network at the same time when the "limit" is reached. Some nodes can't even get back to work; they restart directly after a reboot, again and again.
Once the number of entries in the TG table is below 3000 again, the nodes suddenly work properly again.

bitboy0 on 20 May 2016

@T-X "Better behavior" means: the additional space gained by disabling and stopping haveged gives slightly more room for allocations. Because the sysctl changes prevent the kernel from writing dirtied blocks with high priority, it can simply handle the lack of memory more smoothly. This doesn't stop the problem, but the kernel can manage for longer before the OOM killer triggers a panic.

bitboy0 on 20 May 2016

@nazco: hm, okay, thanks. And you didn't power-cycle the device, right? Then maybe crashlog is unreliable in some OOM cases :(. Keep looking out for it though :).

Btw., you can easily check whether your OpenWRT image supports crashlog by triggering a crash through "echo c > /proc/sysrq-trigger". The device should then reboot and there should be a new file in /sys/kernel/debug/crashlog (until you reboot again or power-cycle it).

I also just tried simply doing a "dd if=/dev/urandom of=/tmp/foo.bin" and after a few seconds the NanoStationM2 with a Freifunk Hamburg image rebooted here. Then I had a nice Out-of-Memory trace in /sys/kernel/debug/crashlog.

Here's the crash before any uplink connectivity: crashlog-841-no-uplink.txt.
And here after: crashlog-841-with-uplink.txt.

Though the userspace programs do not seem to show any suspiciously high memory usage, at least at that point of time (taken between 19:00 and 20:00).

T-X on 20 May 2016

Interesting: For a Freifunk Hamburg node with currently 3370 clients (batctl tg | wc -l) the byte count currently is 259407 (batctl tg | wc -c). Which is very close to 2^18. Not sure whether that might still be a relevant number with the vmalloc patch for debugfs.
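How close that is to 2^18 can be put into numbers. A rough sketch using only the two counts above; note that the line count from wc -l includes any header lines of the batctl output, so the per-line average is approximate.

```python
lines, total_bytes = 3370, 259407  # from "batctl tg | wc -l" and "wc -c" above
limit = 2 ** 18                    # 262144 bytes

avg_line = total_bytes / lines     # average bytes per output line
headroom = limit - total_bytes     # bytes left before the output exceeds 2^18

print(f"average line: {avg_line:.1f} bytes")
print(f"headroom: {headroom} bytes, about {int(headroom // avg_line)} more lines")
```

So roughly 35 additional lines of output would push the text past 2^18 bytes.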

T-X on 20 May 2016

So regarding this 2^18, it'd be very interesting if someone could check whether deactivating alfred and especially batman-adv-visd would have a more positive effect than deactivating haveged or fastd. batman-adv-visd was the one triggering the 2^n allocations upon accessing the translation table via debugfs at least before the vmalloc() patch for debugfs.

T-X on 21 May 2016

@T-X Deactivating alfred makes no big difference.
Deactivating haveged or fastd only frees up some more memory. This of course helps to survive the problem better, but the problem itself is independent of that.

On the bigger nodes (e.g. TP-Link 1043) the RAM is also under pressure and the load also increases, but there is so much reserve RAM that the problems are not as apparent.

bitboy0 on 25 May 2016

I started scripting a test setup to simply and gradually increase the number of TT entries and see when things crash due to an out-of-memory condition. In the tests I had ten batman-adv instances on an x86 PC (without Gluon, plain batman-adv 2016.1) which fed a poor 32MB RAM TP-Link TL-WR841ND v8. That router is running the latest Freifunk Rhein-Neckar stable firmware, but was otherwise completely isolated from the FFRN network.

On average, the TP-Link router crashes at about 13500 global TT entries for me. While moving from zero to 13500 entries, the unreclaimable slab size according to /proc/meminfo rises from about 5MB to 10-11MB.

Having bridge fdb learning turned on or off for bat0 does not make much of a difference to the number of TT entries when it crashes. With or without learning it crashes at about 13500 entries (ref. #780).

Detailed meminfo logs can be found here (still I or someone else needs to write something to turn that into graphs :) ) -> http://metameute.de/~tux/Freifunk/mass-tt-memory-test-logs/
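As a possible starting point for turning such logs into graphs, the relevant field can be extracted from each /proc/meminfo snapshot. This is a minimal sketch that does not assume anything about the layout of the log directory; the function name and the sample values are invented and are not taken from the logs.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a {field: value} dict (values in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Invented sample values in /proc/meminfo format, not taken from the logs.
SAMPLE = """MemTotal:          28564 kB
MemFree:            3120 kB
Slab:               7340 kB
SReclaimable:       1220 kB
SUnreclaim:         6120 kB
"""

info = parse_meminfo(SAMPLE)
print("unreclaimable slab:", info["SUnreclaim"], "kB")
```

Collecting the SUnreclaim value per snapshot together with the TT entry count at that time gives the two columns needed for a plot.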

Either I'm misinterpreting the meminfo output, or fdb entries are much smaller than global TT entries. I'm going to rebuild the image with @ecsv's patch applied, once with SLAB_HW_ALIGN and once without, to see whether that makes a difference to the number of supported global TT entries.

Test script is here: https://gist.github.com/T-X/1bea06b7abcf63de143ddcbe0a2aca56

T-X on 26 May 2016

Hey, I have some new information, even though I couldn't check all the things you requested. The first thing is that I never see a crash log after my node has crashed. The other thing is that someone bridged our network with the network of another big community, so for a significant time we had a lot more entries in the TG table. This was interesting because there was no new jump in the load. We could see that, above the "magic" point, the load is again directly correlated with the number of entries in the TG, just as it is below that point. Here are some graphs:

The network bridge starts:
screenshot2016-05-30_22-16-17

We've found the problem:
screenshot2016-05-30_22-19-54

And to show the "magic" point again (from today):
screenshot2016-05-30_22-24-44

leahoswald on 30 May 2016

Wow, good catch! Looks like some good detective work you've done there :). Just out of curiosity, what kind of bridging was it? The LAN ports of two routers from different communities connected?

Just to make sure we also understand what exactly was causing the high load after the bridging, whether it is just the number of global TT entries or something else: Do you have the batman-adv bridge-loop-avoidance activated on your nodes? If yes, would it have made a difference for the load once disabled (just on the two (or more?) nodes bridging the two communities)?

Also, it was just bridged at one point, not two or more, right? So at least no possibility for a bridge loop, right? (I guess a real bridge loop with BLA disabled would have probably had the potentially worst outcome, so probably not the case - but just to understand your setup and check)

Anyways, thanks for the update and the great graphs! So no more load issues for you for now? (though we'd still have to at least look at what's going on at Freifunk Hamburg who also had load peaks for 32MB devices)

T-X on 31 May 2016

Well, the problem was that someone flashed our firmware over the firmware of a former "Ruhrgebiet" node without ticking the option to discard the old config. So the node kept two VPN configurations and connected to both networks, one VPN connection each. Yes, BLA is activated, but I couldn't tell you whether it would have made a difference.

So no more load issues for you for now?

The problems are the same as before. The bridging of the communities only raised the TG entries to over 5000 instead of nearly 3000 in normal operation. Also, we have lost a bigger installation at the moment, so most of the time we are barely under the "magic" number of entries, but once the installation is up again the problem will occur immediately (compare the last graph in my previous post).

leahoswald on 31 May 2016

Let me guess, it was #557

mmalte on 31 May 2016

👍1

@nazco: Ah, ok, sorry, I wasn't careful enough in reading your last graph. Indeed, it still shows a rising load at 3000 clients. The cause of this is probably a doubled size of the debugfs global TT buffer (even with the vmalloc patch). Though 0.5 still seems okay to me. Let me know if you still see such frequent crashes now that the community-bridging issue is gone. By the way, what's the averaging time window for this load?

T-X on 31 May 2016

@T-X I think you misunderstood @nazco. I think what he was trying to say is this:

A few days ago, somebody bridged our community with another community, and that did not cause a second jump in the load. The slaves did not show different behaviour when the TG was at 3100, 4000 or 5000. After the bridge was removed, we got back to TG sizes that jump over the magic value a few times a day.

IIRC, the load of 0.5 is the average of all nodes, including many nodes with more than 32MB of RAM. So the load average on affected nodes usually is >1.

So the initial problem still exists, with many nodes rebooting very frequently (sometimes every 2-5 minutes).

I guess the load average is over one minute, but I am not 100% sure about this.

tobox on 2 Jun 2016

👍1

@T-X As @tobox said, the problem is not solved; it just didn't get worse with 2000 more TG entries.
Now we are at about 3000 (+/-) entries over the day, and every time we jump over 3000 the problem occurs again.

bitboy0 on 2 Jun 2016

This might not be related, but I think it is. From a client perspective, when I experience problems that look like the "net" has hit the upper limit, I seem to receive a permanent data flow of between 8 and 15 KiB/s, downlink only, with no uplink flow.

waschmi on 6 Jun 2016

Thanks for your reply, but I didn't really understand what you mean by "the net upper limit"? Are you running a node with our firmware?

leahoswald on 7 Jun 2016

@nazco
Running nodes with the latest Gluon master but with the Hamburg "site" config, within Hamburg. I ran into a curious problem about three weeks ago. When the Hamburg Freifunk net (according to the map) reached around 2600 clients, the Internet connection speed and quality suddenly took a sharp drop and a wr841ndv9 node seemed to drop wifi randomly, even though the nodes nearby didn't have (m)any clients and had either a cable or VDSL level of uplink to the Internet. But! It also affected the mobile Internet data connection of a certain telco, while other mobile data networks still worked fine.

I brought this up on the Hamburg mailing list and was told a bit later that the data-center uplink to that particular telco was indeed a bit too busy and that it was changed. After that mail it indeed worked fine; the 2600-client limit seemed lifted.

Yesterday I experienced it again, and today as well. Meaning, clients connected to the Internet over Freifunk get almost no bandwidth, often browser timeouts or the loading symbol of death ;-)
Again, that particular mobile data connection also seemed problematic. This time I don't know the client count, because the map is not reachable when that happens.

Plus the data stream of 8-15 KiB/s described in my last post, which seems to flow towards a client and is basically entirely discarded (I think).

Again, this might not be related to this bug, and/or it may be a collection of many little bad thingies, but I have the feeling it is. Sorry for all the noise.

waschmi on 7 Jun 2016

So regarding this 2^18, it'd be very interesting if someone could check whether deactivating alfred and especially batman-adv-visd would have a more positive effect than deactivating haveged or fastd.

Sorry that my answers are very late, but I'm working on my thesis (deadline 30.06), so I don't have the time to debug this as intensively as I would wish. But yesterday (around 9pm) I checked this idea, and the effect of disabling batman-adv-visdata is amazing. No more peaks, and everything runs smoothly.
screenshot2016-06-16_23-25-28

So this leads me to the question: what exactly is this service used for? Is it safe to disable?

leahoswald on 16 Jun 2016

@nazco, it was a temporary solution in v2016.1.x for querying batman-adv's tables and providing this data to respondd/alfred. In the Gluon master, respondd and alfred query it themselves again.

Without this service, there is no topology information (neighbouring nodes, etc.), so respondd/alfred are more or less useless.

NeoRaider on 16 Jun 2016

Ok, then I wonder where I should see the effects of disabling it? At the moment with v2016.1.3 everything looks ok after I disabled it and I don't miss any data.

it was a temporary solution in v2016.1.x for querying batman-adv's tables and providing this data to respondd/alfred. In the Gluon master, respondd and alfred query it themselves again.

Does this mean this software will be removed in v2016.2?

leahoswald on 17 Jun 2016

Ah, I mixed up batadv-vis and batman-adv-visdata. Which one did you disable?

NeoRaider on 17 Jun 2016

Well, and I missed the 'data' in 'batman-adv-visdata' :) (fixed it in my update)

leahoswald on 17 Jun 2016

Okay, then my explanation above is correct. Disabling the daemon will stop updates of topology data, so the TQ values announced by respondd/alfred etc. will be "frozen".

batadv-vis-data was removed shortly after the v2016.1 release; alfred and respondd query the kernel themselves again. The whole idea of the daemon was to centralize the queries of different daemons in one place to avoid OOM conditions. The underlying cause of the OOM issues was fixed in 0bd0df6f9303e5d553790ff49dc703b957fdac1d; an even better fix, further reducing the memory requirements, will follow when batman-adv gains a netlink API to query its topology data, replacing the human-readable debugfs files.

NeoRaider on 17 Jun 2016

Ok, thanks for the explanation. If I'm right, this means that the new implementation doesn't write as much data to the /tmp directory? Also, if disabling batman-adv-visdata helps now, this could mean that the problem described here might be fixed by the changes in the current master, right?

leahoswald on 17 Jun 2016

It doesn't have anything to do with /tmp. v2016.2 will probably behave the same as v2016.1.x in this regard: the data needs to be read from debugfs (which uses a big buffer that is discarded and reallocated several times for each read if the data is big, so this can cause a lot of load). It doesn't matter if the reads are done by batadv-vis-data or the other daemons.

The netlink interface will use a much more compact representation for the data (binary instead of text, thus 6 instead of 18 bytes per MAC address). Also, the netlink interface won't dump all data in a single big buffer, but split it into multiple "packets" from kernel to userspace. Experimental patches for the kernel side of this interface exist, but aren't finished yet. They might make it in time for Gluon v2016.3.
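The 6-versus-18-byte figure can be checked with a few lines of Python. This is only a sketch: the address is a made-up example, and the 18 bytes are assumed to be the 17-character colon-separated text form plus one separator character. Real debugfs lines carry further columns, and real netlink messages carry attribute headers, neither of which is modelled here.

```python
# Compare the text and binary representations of one MAC address.
# The address below is an arbitrary example, not taken from a real node.
mac_text = "02:ca:ff:ee:ba:be"

# Binary form: the six raw bytes of the address.
mac_binary = bytes.fromhex(mac_text.replace(":", ""))

# Text form: 17 ASCII characters plus one separator (space or newline).
# Counting the separator is an assumption made to arrive at 18 bytes.
text_len = len(mac_text.encode("ascii")) + 1

print(len(mac_binary), text_len)
```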

NeoRaider on 17 Jun 2016

Well, not the news I hoped for. But for now we definitely know that this piece of software is causing our problems. As you said, disabling isn't the best idea but do you have an idea why the software is causing such issues at 3000 TG entries and how we could solve this problem maybe faster?

leahoswald on 17 Jun 2016

Also, that doesn't explain why the problems only occur when there are more than 3000 entries in the TG table ... Does it?

bitboy0 on 18 Jun 2016

The allocation of the buffer for the TG table becomes bigger in powers of 2. Maybe at about 3000 entries, your transtable_global hits 262144 (= 4096*2^6) bytes. The computational cost increases in O(n^2), as the kernel code will first try with 2^0 pages, notice it is not big enough, discard the buffer, try again with 2^1 pages, discard it again, then try again with 2^2 pages ... until at some point it will succeed at 2^7 pages. It's just a theory that this increase causes the spike; maybe you could check if the size of the translation table passes such a number?
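To make the retry pattern concrete, here is a small Python model of it. It is a simplified sketch, not the kernel code: it assumes an attempt succeeds as soon as the output fits into the buffer, and the function name `doubling_attempts` is made up for this illustration.

```python
PAGE_SIZE = 4096  # bytes; the page size used in the calculation above


def doubling_attempts(needed_bytes, page_size=PAGE_SIZE):
    """Model a buffer that starts at one page and doubles until it fits.

    Returns (attempts, final_pages, total_allocated_bytes).
    Simplification: an attempt succeeds when needed_bytes <= buffer size.
    """
    pages = 1
    attempts = 1
    total = pages * page_size
    while pages * page_size < needed_bytes:
        pages *= 2          # discard the buffer and retry with twice the size
        attempts += 1
        total += pages * page_size
    return attempts, pages, total


# Just below and just above the 2^6-page boundary.
for size in (262144, 262145):
    print(size, doubling_attempts(size))
```

In this model, one extra byte beyond 262144 forces an eighth attempt with 2^7 pages and roughly doubles the memory allocated over all attempts. For scale, 262144 bytes spread over 3000 entries is about 87 bytes per entry on average.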

NeoRaider on 18 Jun 2016

It's just a theory that this increase causes the spike, maybe you could check if the size of the translation table passes such a number?

Ok, I've added it to my monitoring script, let's wait a few hours.

leahoswald on 18 Jun 2016

Maybe at about 3000 entries, your transtable_global hits 262144 (= 4096*2^6) bytes.

Well, it looks like this is exactly what happens. At the time we see the increased load starting, the size of the TG rises above 262144 bytes. To visualize it, here are some graphs:

screenshot2016-06-18_15-46-14

Is there anything we can do about this?

leahoswald on 18 Jun 2016

An easy, straightforward workaround for pre-netlink might be to split the one originator and global translation table debugfs files into maybe 16 buckets. Like cat /sys/kernel/debug/batman_adv/bat0/transtable_global_. With this new suffix being the last nibble of a MAC address ([0-9A-F]). And add a patch to alfred/batman-adv-visdata to use the matching bucket for translations instead. Hackish, but quick&dirty to implement.
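The bucketing idea can be sketched in userspace Python as follows. This only illustrates how entries would be distributed over 16 buckets by the last nibble of the MAC address; the function names and the example addresses are made up, and no such debugfs files exist in batman-adv as far as this thread shows.

```python
from collections import defaultdict


def bucket_suffix(mac):
    """Return the last hex digit (nibble) of a MAC address, upper-cased."""
    return mac.strip()[-1].upper()


def split_into_buckets(macs):
    """Group MAC addresses into at most 16 buckets keyed by last nibble."""
    buckets = defaultdict(list)
    for mac in macs:
        buckets[bucket_suffix(mac)].append(mac)
    return dict(buckets)


# Made-up addresses for illustration only.
example = ["02:00:00:00:00:1a", "02:00:00:00:00:2A", "02:00:00:00:00:3f"]
print(split_into_buckets(example))
```

A reader looking up one address would then only need the bucket whose suffix matches that address, which is the part the alfred/batman-adv-visdata patch would have to implement.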

T-X on 20 Jun 2016

Some test results for various changes:

https://www.open-mesh.org/projects/batman-adv/wiki/Kmalloc-kmem-cache-tests

Currently one raw table and only one graph for a specific test type / patch. I (or someone else :)? ) could make some more pretty graphs for the other test configurations, I guess.

T-X on 20 Jun 2016

@T-X, we wouldn't even need to split the file, seq_file is built to output data in sequential records, and only a single record needs to fit into the buffer. A very effective workaround would be to print the tables bucket by bucket, decreasing the needed buffer size by factor 256 (for uniform distribution to hash buckets) without any API change seen by userspace. I think there was even a test patch for such a change by @ecsv? I can't find it at the moment...

I can see one downside to such an approach, but it probably doesn't matter too much in practice: the debugfs interface is very racy; if a batman-adv interface is removed while the debugfs files are open, each subsequent read access will lead to a use-after-free. Splitting the read into multiple parts would widen this race window.
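The factor of 256 can be illustrated with a rough calculation in Python. This is a sketch under stated assumptions: 256 hash buckets with a perfectly uniform distribution (as in the comment above), and a hypothetical 87 bytes of text per entry (262144 / 3000, rounded down). The function name `buffer_needed` is made up.

```python
def buffer_needed(bucket_sizes, bytes_per_entry, per_bucket):
    """Bytes that have to fit into one output buffer.

    per_bucket=False: the whole table is emitted as a single record.
    per_bucket=True:  every hash bucket is its own record, so only the
                      largest bucket has to fit.
    """
    if per_bucket:
        return max(bucket_sizes) * bytes_per_entry
    return sum(bucket_sizes) * bytes_per_entry


# 256 buckets with 12 entries each = 3072 entries, uniformly distributed.
buckets = [12] * 256
whole_table = buffer_needed(buckets, 87, per_bucket=False)
largest_record = buffer_needed(buckets, 87, per_bucket=True)
print(whole_table, largest_record, whole_table // largest_record)
```

With a skewed distribution the largest bucket is bigger than the average, so the real reduction would be somewhat smaller than 256.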

NeoRaider on 20 Jun 2016

https://git.open-mesh.org/batman-adv.git/commit/ecsv/seq_operations was the test patch. It was only for originators and was not really well tested.

I haven't continued here because the netlink patches from @NeoRaider seemed to be the superior approach to the problem.

ecsv on 20 Jun 2016

The problem is not solved, is it? We really need further help on this...

bitboy0 on 18 Jul 2016

We at FFRGB also repeatedly see weird issues on a small portion of nodes.
In our case the nodes randomly stop meshing via ibss0. They are not crashing entirely.
The only way to fix this issue is to reinitialize wifi by running wifi
I've written a little watchdog which periodically checks for a path to a gateway and executes wifi if none is available. Could be another issue... Just fyi.

Sprinterfreak on 25 Jul 2016

Hey @Sprinterfreak, I think this has nothing to do with this bug. Rather, you are affected by issue #605

leahoswald on 25 Jul 2016

Sorry for pushing ... four more weeks are gone. Is someone still working on this?
It would really be great if this could eventually be solved.

bitboy0 on 19 Aug 2016

@bitboy: Various changes were made during these four weeks to reduce memory usage on the kernel side, which will hopefully trickle into Gluon soon:

gluon-mesh-batman-adv-core: disable bridge port learning on bat0
batman-adv: use kmem_cache for translation table
Various debugfs-to-netlink changes (great work by @NeoRaider and @ecsv):

Until this lands in a Gluon release, @bitboy0, would it be possible for you to give a recent batman-adv/batctl/alfred master branch and #780 a try and report back your new limits?

PS: Also, I'm still a little suspicious of the new FQ-Codel. That's one more change that came with the more recent Gluon versions. And FQ-Codel is about queueing, which means it is about memory. Maybe it needs more memory in order to achieve its incredible performance/latency improvements. (there seems to be a /proc/sys/net/core/default_qdisc, but I'm not sure right now what its value was prior to fq_codel)

T-X on 19 Aug 2016

And there is this ticket on OpenWRT still: https://dev.openwrt.org/ticket/22349

Can someone with an affected device try the patch mentioned there, "fq_codel: add batch ability to fq_codel_drop()" that is?

Also looks like it is possible to play with FQ-Codel parameters via tc (e.g. the "flows" and "limits" parameters): https://lists.openwrt.org/pipermail/openwrt-devel/2016-May/041445.html

T-X on 22 Aug 2016

@T-X I have never compiled gluon myself until now. I will try my very best, and thanks for the information!
If I get this done, I'll tell you the results of course!

bitboy0 on 24 Aug 2016

haven't looked much into fq_codel yet - in freiburg we also have this issue (with much lower numbers of nodes (330++) and clients (900++), but a complex connected network of 10 supernodes)
i am now trying some suggestions from ffrn-forum - here

which resulted in this script
or this package for gluon ...
just in case somebody wants to play with this too - i've also posted to the ffrn-forum

edit: which is basically the same patch as ffrn used (couldn't find it before)
https://github.com/Freifunk-Rhein-Neckar/ffrn-packages/tree/master/ffrn-lowmem-patches

viisauksena on 20 Sep 2016

Hey, do you have further information about the characteristics you observe and that led you to the conclusion that it could be the same bug?

leahoswald on 20 Sep 2016

not really, i would love to

fact is that for a long time we have observed a rise in reboots (up to several times a day) on 841s (or other weak devices), while other nodes (same weak class of device) seem not affected, then some days later are affected, and then not...
nothing we looked into really gave us information (statistical data from monitoring about how many mesh participants are in the network at all, how many supernodes and nodes there are, how the bandwidth was, or how many clients were on the specific nodes... we even tried to nail it down to specific routers from specific vendors (like some strange electrical failure), or to specific hw revisions of the 841: nothing)
(i left some ssh connections open and monitored dmesg and logread: nothing)

edit: one minor thing: we have a test in one mesh cloud with a bigger mcast rate; there the routers reboot very often. The local router density is high.

we don't have detailed ram usage over time, or load over time, just the observation that there is nothing out of the ordinary and some minutes later a node is rebooting again.
(we could access 100++ of the 400++ routers deployed in our network)

we have a rather complex backbone with many bridged interfaces on the supernodes, resulting in big originator tables on the supernodes (since nodes can then be reached equally well from all bridged supernodes) ... this should (so i think) have no effect on the routers, since there is nothing like that on them.
(compare: on a tp841, batctl o|wc prints 354 3644 40904;
on a supernode, batctl o|wc prints 408 33462 364424)

now i want to test this on some routers around the city, and see whether they reach an uptime of several days or not
edit: i made myself this helper list (out of jq nodes.json), which tells me which nodes of a specific type are offline and how long the others have been online - based on this list, i conclude that nodes with mesh-vpn (uplink routers) mostly work fine, while ibss0 nodes (meshing routers) tend to fail. it's not 100% but very obvious

edit2: the script does not help at all. some of our group think it could be an issue with networking and a bunch of unaligned memory accesses (there are plenty on the routers... apart from having a vague idea of it, this is beyond my (c/assembler) knowledge) .. watch this number rise into the millions: cat /sys/kernel/debug/mips/unaligned_instructions

viisauksena on 20 Sep 2016

To the comment from https://github.com/freifunk-gluon/gluon/issues/753#issuecomment-241115746:

My gluon 2016.1.6-based repository now has:

added batman-adv 2016.4
- not yet released
- includes batadv netlink in alfred+batctl+batman-adv and kmem_cache for TT
removed batman-adv-visdata
converted gluon-mesh-batman-adv-core+gluon-status-page-api to batadv netlink

Interested people can just try https://github.com/FreifunkVogtland/gluon/tree/v2016.1.6-1 if they think that the memory usage caused by debugfs is the culprit behind this problem.

ecsv on 24 Sep 2016

@T-x, the fq_codel stuff can really take a lot of memory. We should check out the following patches for 2016.2 to reduce the impact with the new wifi driver:

These things were used to fix OOM problems in a test I did with Toke (using a 32MB device and 30 clients). Reducing the limit for the qdisc might also be a possibility that can be tested, because this is the part which is already in 2016.1.x. For example, right now on LEDE it is using only 4Mb per qdisc:

tc -s -d qd sh dev eth1
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

(The output was generated with the include/linux/pkt_sched.h part of https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/patch/?id=31ce6e010195d049ec3f8415e03d2951f494bf1d + https://patchwork.ozlabs.org/patch/628682/raw/ applied on iproute2.)

But OpenWrt CC doesn't yet have this memory_limit implementation because it was first introduced in 95b58430abe7 ("fq_codel: add memory limitation per queue"). So backporting the patches from LEDE (033-fq_codel-add-memory-limitation-per-queue.patch + 660-fq_codel_defaults.patch) could also be a good idea.
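A back-of-the-envelope calculation in Python shows why a per-qdisc memory limit matters on small devices. It is a sketch with loud assumptions: every queued packet is taken to be a full frame of `quantum` bytes (1514), skb bookkeeping overhead is ignored, and tc's "4Mb" is read as 4 * 1024 * 1024 bytes. The parsed line is the fq_codel line from the tc output above.

```python
import re

# The fq_codel line from the tc output quoted above.
line = ("qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 "
        "target 5.0ms interval 100.0ms memory_limit 4Mb ecn")

# \b keeps "memory_limit" from matching the packet limit pattern.
limit_packets = int(re.search(r"\blimit (\d+)p", line).group(1))
quantum_bytes = int(re.search(r"\bquantum (\d+)", line).group(1))
memory_limit_mb = int(re.search(r"memory_limit (\d+)Mb", line).group(1))

# Assumption: a completely full queue of maximum-size frames.
worst_case_bytes = limit_packets * quantum_bytes
# Assumption: "Mb" in tc output means 1024 * 1024 bytes.
memory_limit_bytes = memory_limit_mb * 1024 * 1024

print(worst_case_bytes, memory_limit_bytes)
```

Under these assumptions, the packet limit alone would allow about 15.5 million bytes of payload per qdisc, several times the 4Mb memory_limit and a large share of the RAM of a 32MB device.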

ecsv on 29 Sep 2016

I think it would be better not to use fq-codel as long as we're on CC, there are too many fixes that would need to be backported...

I've avoided backporting https://git.lede-project.org/?p=source.git;a=commitdiff;h=c4bfb119d85bcd5faf569f9cc83628ba19f58a1f , so fq shouldn't be effective for mac80211 anyways; I have no idea though if this shortcut in fq_flow_classify() will prevent it from buffering (too many) packets.

NeoRaider on 1 Oct 2016

Ok, didn't check whether fq_codel was active for the mac80211 queuing with 2016.2. So we can forget the point about the wifi driver and its internal queuing.

But just for clarification: fq_codel is still used as default queuing discipline on OpenWrt CC (and thus it most likely is also used by gluon 2016.1.x/2016.2.x). So the patch 033-fq_codel-add-memory-limitation-per-queue.patch + 660-fq_codel_defaults.patch may still be interesting for OpenWrt CC (2016.1.x and 2016.2.x) to reduce the chance that the normal qdiscs take up too much memory.

ecsv on 1 Oct 2016

There are existing patches at "https://github.com/Freifunk-Rhein-Neckar/ffrn-packages/tree/master/ffrn-lowmem-patches" which help nodes with little RAM to stay stable. While discussing whether to include the patches in our FFS firmware, we wondered why they are not included in the official Gluon code base, given the good experience at FFRN. Are there specific reasons?

FFS-Roland on 9 Nov 2016

@FFS-Roland the simplest reason might be: no one created a pull request to include them.

rotanid on 10 Nov 2016

Well, we developed them as a workaround for some problems we see in our (big) network. So we are not aware of the side effects these options might have in other setups. This is the main reason why we haven't created a PR so far. If you see good results in Stuttgart too, then I think we can talk about a regular PR with these patches.

leahoswald on 10 Nov 2016

Meanwhile we have been testing a patched Gluon 2016.2.1 on the WR841N and found some side effects of the sysctl modifications. Nodes (not clients) cannot be accessed reliably via IPv6, and the CPU load rises. Therefore we will not use the complete patch in our build, only the haveged-related part.

FFS-Roland on 21 Dec 2016

I'm surprised the neighbor table garbage collection in that patch set helps at all, because nodes should not have to manage so many neighbor entries anyway. My node currently has 25 entries for IPv6 and 3 for IPv4, in a mesh with 650 nodes and 1000 clients.

jplitza on 11 Feb 2017

Our latest discussions in Stuttgart seem to result in not using the patch at all, because disabling haveged would limit entropy on the nodes significantly. So we will not benefit from gaining 1 MB of RAM, but with our reduced subnet sizes we expect not to run into trouble.

FFS-Roland on 11 Feb 2017

@Nurtic-Vibe reported on IRC that even FFRN doesn't use the lowmem-pkg anymore as it doesn't help much.
it also looks like the issue is even more pressing when running Gluon v2017.1.x

rotanid on 6 Nov 2017

closing in favor of #1243, although this one also describes problems of which some are already solved.
if you still have similar issues, please open a new issue with detailed information when running a current gluon master branch build. master branch has more fixes that can't be backported to older releases like v2017.1.x or v2016.2.x

rotanid on 5 Jun 2018
