Gluon: Extensive Memory Leaks on Archer C5/C7 devices

Created on 16 Mar 2016 · 24 Comments · Source: freifunk-gluon/gluon

On Archer C5/C7 devices (ath10k-based) there are extensive memory leaks, leading to a reboot every 0.5 to 1.5 days.

This happens with Gluon versions since at least 2016.1, and it can be seen on the current master branch with firmware built as recently as 2016/03/13.

@andir has looked at the problem with kmemleak, since the problem does not seem to be in userspace. As far as I remember, however, those logs looked inconclusive.

All 24 comments

There are 71 Archers running in the Regio Aachen network. I haven't received any complaints.

Even in refugee dormitories they are running like a charm. Uptimes are up to 51 days; most of them have been up since the last update.

Most of them are running 2015.1.2, but the 24 devices running newer versions are showing no problems either.

You can filter for the Archer under "Statistiken" (statistics):
http://map.freifunk-aachen.de/
http://map.freifunk-aachen.de/nodelist/

I can't find any C7 devices running 2016.x firmware in Aachen's node list, so perhaps this doesn't affect the C5?
@mmalte, do you have memory graphs similar to those from @mweinelt that would "prove" the C5 isn't affected with your firmware and community?

I could create them with our Graphite data, but I don't see the point: the highest RAM consumption among Archers running 2016.1.x is 57.5%.
Only 2 devices have an uptime below 24h.
http://map.freifunk-aachen.de/nodelist/

Did you add any packages?

The Archer C5v1 and C7v2 are 100% identical, see:
https://wiki.openwrt.org/toh/tp-link/tl-wdr7500#hardware

I did some mining on the Aachen data and found that the node
http://map.freifunk-aachen.de/#!v:m;n:c4e984ad6f6a reboots quite frequently:
https://stats.darmstadt.freifunk.net/dashboard/db/c5-c7-aache?from=1458159639644&to=1458332439644&var-node=c4e984ad6f6a
There are also some memory spikes just before the reboot.
@mmalte, do you have any information about this particular node?

Edit:

After fixing the graph setup it is not all that obvious anymore; it might be a false positive.

RAM usage looks good; I don't know why it is rebooting:

[Graph: c4e984ad6f6a-free-ram-uptime]

Plenty of free RAM before the reboot, and load looks good too:
[Graph: c4e984ad6f6a-free-ram-load]
[Graph: c4e984ad6f6a-load-uptime]

Can you please also test if the _v2016.1.x-mac80211-test_ branch solves this issue?

Random reboots are back on half of our C5/7 in Aachen (8 out of 17) since they were updated to 2016.1.4.

Crashlog reports "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled" and htop shows high overall memory usage right before the reboot (but not for the processes listed).
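For anyone wanting to confirm why an out-of-memory condition ends in a reboot rather than a killed process, the kernel's `vm.panic_on_oom` setting can be read from `/proc/sys/vm/panic_on_oom`. The following is a minimal sketch; the function name `oom_policy` and its labels are made up for illustration, and the optional file argument exists only so the function can be tried against a test file.

```shell
#!/bin/sh
# oom_policy [FILE]: describe the kernel's panic_on_oom setting.
# FILE defaults to the real sysctl path; the labels are illustrative only.
oom_policy() {
    case "$(cat "${1:-/proc/sys/vm/panic_on_oom}")" in
        0) echo "kill" ;;          # the OOM killer picks a process to kill
        1) echo "panic" ;;         # panic on a system-wide OOM
        2) echo "panic-always" ;;  # panic even for constrained allocations
        *) echo "unknown" ;;
    esac
}
```

Running `oom_policy` on a node prints one of the labels above; any value other than `kill` means an out-of-memory condition can end in a kernel panic like the one in the crashlog.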

Neoraider, I will test the suggested build and report back. Let me know if I can provide any additional helpful info (as a Linux noob).

What is quite interesting is that these nodes were not affected until v2016.1.4 (#758 is applied via cherry-pick, as documented in our wiki).

Tested with v2016.1.x-mac80211-test. Still the same behavior (memory usage increasing till reboot).

After trying every build on my C5, including 2016.1.5, and still getting high memory usage (around 80%) and several reboots every day (2-3 times between 19:30 and 22:00, which is pretty strange), I reverted to 2016.1.2. Since then there has been no reboot for 2 days, and memory usage has never gone above 45%.
I used sysupgrade to downgrade, keeping all settings untouched.
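Percentages like the 80% and 45% above are easier to compare between builds when they are logged over time. Here is a minimal sketch that computes a usage percentage from `/proc/meminfo`. It assumes that "used" means MemTotal minus MemFree, Buffers and Cached; the status page may calculate its figure differently, and the function name `mem_used_pct` is hypothetical.

```shell
#!/bin/sh
# mem_used_pct [FILE]: print used memory as an integer percentage.
# Assumption: used = MemTotal - MemFree - Buffers - Cached.
# FILE defaults to /proc/meminfo; another file can be passed for testing.
mem_used_pct() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        END {
            # Print nothing if MemTotal was not found.
            if (total > 0)
                printf "%d\n", (total - free - buffers - cached) * 100 / total
        }
    ' "${1:-/proc/meminfo}"
}
```

A loop such as `while true; do echo "$(cut -d' ' -f1 /proc/uptime) $(mem_used_pct)"; sleep 60; done >> /tmp/memlog` would record one sample per minute. Note that `/tmp` lives in RAM on these devices and is lost on reboot, so the log has to be copied off the node before it crashes.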

My tests of v2016.1.4-15-gcebb753 on an Archer C7 v2 (with and without ad hoc mesh on 2.4/5 GHz, with and without private WLAN, ad hoc mesh only on 2.4 GHz) suggest that the memory leak is caused by ad hoc mesh on 5 GHz without any mesh partner.
Sorry, I can't test 5 GHz mesh with a partner router.
The client throughput (short and long term) seems to be OK, with no memory shortage.
Is it possible that meshing creates a lot of small connections that are not reused, and that the timeout for freeing their memory (like garbage collection) is too long?
The memory usage seems to be a function of the ambient noise level (mesh).

If this is true, what has changed since 2016.1.2, which is running without trouble on 61 Archer C5/C7 in Aachen?

Of the Archers running experimental/beta, we have some with an uptime of several weeks (some with no reboot since the autoupdate) with mesh enabled on both wifi bands. For example:
https://map.aachen.freifunk.net/#!v:m;n:c4e9846e7cda
https://map.aachen.freifunk.net/#!v:m;n:c4e984ad6cca
https://map.aachen.freifunk.net/#!v:m;n:c4e984cdde22

It looks like calling wifi frees up the leaked memory:

root@ffac-jannic-test11:~# free
             total         used         free       shared      buffers
Mem:        126056       119184         6872          192         1308
-/+ buffers:             117876         8180
Swap:            0            0            0
root@ffac-jannic-test11:~# wifi
root@ffac-jannic-test11:~# free
             total         used         free       shared      buffers
Mem:        126056        34728        91328          192         1516
-/+ buffers:              33212        92844
Swap:            0            0            0

This was observed after 1 day of uptime, during which the memory usage reported on the web page had increased to 90%. After calling wifi it was back down to 50%. Commit d31c1c9 on branch v2016.1.x-mac80211-test. Mesh links on both 2.4 and 5 GHz, with little load (sent 60 MB, received 420 MB, forwarded 1 MB in 24h according to the web interface).
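The before/after comparison can be reduced to a single number by saving both `free` outputs and subtracting the "used" values of the `Mem:` row. This is a minimal sketch; the function names `used_kb` and `reclaimed_kb` and the file paths in the usage note are hypothetical.

```shell
#!/bin/sh
# used_kb: read `free` output on stdin and print the "used" column
# (third field) of the "Mem:" row, in kB.
used_kb() {
    awk '$1 == "Mem:" { print $3 }'
}

# reclaimed_kb BEFORE AFTER: print how many kB of used memory
# disappeared between two saved `free` outputs.
reclaimed_kb() {
    before=$(used_kb < "$1")
    after=$(used_kb < "$2")
    echo $((before - after))
}
```

On a node this could be run as `free > /tmp/before; wifi; sleep 30; free > /tmp/after; reclaimed_kb /tmp/before /tmp/after`, keeping in mind that `wifi` reconfigures the radios and briefly interrupts clients. For the readings above, the difference is 119184 - 34728 = 84456 kB, roughly 82 MiB.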

I'm able to confirm this for a node under quite heavy usage in refugee housing:

root@ffac-franzstrasse_aquarium:~# free 
             total         used         free       shared      buffers
Mem:        126056        72468        53588          228         2588
-/+ buffers:              69880        56176
Swap:            0            0            0
root@ffac-franzstrasse_aquarium:~# wifi
root@ffac-franzstrasse_aquarium:~# free
             total         used         free       shared      buffers
Mem:        126056        36980        89076          228         2588
-/+ buffers:              34392        91664
Swap:            0            0            0
root@ffac-franzstrasse_aquarium:~# uptime
 10:16:24 up 12:08,  load average: 0.19, 0.10, 0.12
root@ffac-franzstrasse_aquarium:~# cat /lib/gluon/gluon-version 
v2016.1.4-18-g7c0879c
root@ffac-franzstrasse_aquarium:~# cat /lib/gluon/release 
2016.1.4-4~mac8021120160520

I've also tested the behaviour on our stable version, which is known for unlimited uptimes:

root@mon1-haag-c5:~# uptime 
 10:22:02 up 3 days, 10:25,  load average: 0.16, 0.07, 0.05
root@mon1-haag-c5:~# free 
             total         used         free       shared      buffers
Mem:        126004        60040        65964          200         2540
-/+ buffers:              57500        68504
Swap:            0            0            0
root@mon1-haag-c5:~# wifi
root@mon1-haag-c5:~# free 
             total         used         free       shared      buffers
Mem:        126004        31444        94560          200         2540
-/+ buffers:              28904        97100
Swap:            0            0            0
root@mon1-haag-c5:~# cat /lib/gluon/release 
2016.1.2-1-stable
root@mon1-haag-c5:~# cat /lib/gluon/gluon-version 
v2016.1.2

mon1-haag-c5 is under considerable use:

Uptime    3 days, 10:30

Traffic
Sent 3.64 GB
Received 38.4 GB
Forwarded 518 MB

We found out that our smoothly running 2016.1.2-stable firmware is using a different ath10k firmware than the newer versions:

2016.1.2-1-stable: firmware-2.bin (10.1.467-ct-com-full-014-07e794)
2016.1.4-2-beta: firmware-5.bin (10.2.4.97-1-ct-com-F-002-5b119c)

@Metatron321 replaced the firmware on his 2016.1.4-2-beta node two days ago; it is running smoothly, with memory at 53%.

I've attached the firmware. Could you please try to reproduce this by replacing the files in /lib/firmware/ath10k/QCA988X/hw2.0/ and rebooting the node?

ath10k-firmware.zip

Just to check whether the issue is limited to the currently used version of firmware-5.bin, I tried two other versions of the Candelatech bins I could find (10.2.4.97-1-ct-com-F-002-1177c5 and 10.2.4.97-1-ct-com-F-001-b0f9b0). Both resulted in higher overall memory usage (around 65% instead of 45% with 10.1) and reboots within 24 hours.
I would like to test some older versions to narrow down where the problem started, but I could not find any (the Candelatech download repository and changelogs are quite confusing to me).

You should try a firmware-2.bin from http://www.candelatech.com/downloads/firmware-2-ct-full-community-16.bin instead (and delete firmware-5.bin, so the driver uses firmware-2.bin instead).
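The swap suggested above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `use_firmware2` is hypothetical, and moving firmware-5.bin aside under a `.disabled` suffix (instead of deleting it, so the change can be undone) is a choice made here, relying on the driver only looking for the exact file names mentioned in this thread.

```shell
#!/bin/sh
# Firmware directory as given earlier in this thread.
FW_DIR_DEFAULT=/lib/firmware/ath10k/QCA988X/hw2.0

# use_firmware2 NEW_FW [FW_DIR]
# Install NEW_FW as firmware-2.bin and move firmware-5.bin out of the way.
use_firmware2() {
    new_fw=$1
    fw_dir=${2:-$FW_DIR_DEFAULT}

    # Refuse to touch anything if the replacement is missing or empty.
    [ -s "$new_fw" ] || { echo "missing or empty: $new_fw" >&2; return 1; }

    cp "$new_fw" "$fw_dir/firmware-2.bin" || return 1

    # Rename rather than delete, so the original can be restored later.
    if [ -f "$fw_dir/firmware-5.bin" ]; then
        mv "$fw_dir/firmware-5.bin" "$fw_dir/firmware-5.bin.disabled" || return 1
    fi
}
```

On a node, one would first fetch the file, for example with `wget -O /tmp/firmware-2.bin http://www.candelatech.com/downloads/firmware-2-ct-full-community-16.bin`, then run `use_firmware2 /tmp/firmware-2.bin` and reboot. Flash space is tight on these devices, so deleting the renamed file once the node is stable may be preferable.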

That is what I did before testing the other 10.2.4 firmware-5 versions (that's what mmalte reported on my behalf) :)

Okay. We'll have to replace the CT firmware with a 11s-capable firmware in the near future anyways, so I'm not sure how much sense more tests with CT make...

I built a gluon based on v2016.1.5, but with ath10k firmware from http://www.candelatech.com/downloads/firmware-2-ct-full-community-16.bin, as NeoRaider suggested. Seems to work well. (After 15 hours of uptime, no sign of a memory leak.) I'll keep watching it, as 15h is a little bit too short to be sure.

This mail by Ben Greear (candelatech) suggests that the 10.2 firmware is known to be worse than the 10.1 one (firmware-2): http://lists.infradead.org/pipermail/ath10k/2016-June/007743.html
So perhaps going 'back' to firmware-2 is the way to go, regardless of whether it solves the Archer issue?

Seems like a good option for v2016.1.x. As I wrote before, v2016.2 won't use a CT firmware at all, as the CT firmware doesn't support 11s.

Fixed in fd237f6f43d4ad36e1987bf7461a16647287db18.
