Gluon: ath9k devices run out of memory above certain client count

Created on 25 Jun 2019  Â·  139Comments  Â·  Source: freifunk-gluon/gluon

Bug report

Welcome to "Debugging out of memory situations" 2019 Edition!

What is the problem?

Devices with ath9k radios seem to be quickly running out of memory as soon as they reach a certain client count.

Running wifi causes the memory to be temporarily freed again, but does not prevent the device from leaking.

Example: Archer C7 v2 looses ~68 MB of normally free RAM in roughly 20 minutes, then OOM reboots, multiple times

https://stats.darmstadt.freifunk.net/d/000000021/router-meshviewer-export?orgId=1&var-node=f4f26d707c53&refresh=15m

c7v2-clients
c7v2-loadmemory
c7v2-uptime

Other affected nodes are:

What is the expected behaviour?

Memory usage should increase at a reasonable rate per client

Gluon Version:

v2018.2.1-7-geed810aa

Site Configuration:

https://git.darmstadt.ccc.de/ffda/site/tree/v2018.2.x

Custom patches:

none

bug

Most helpful comment

We were experiencing this issue since last year, so probably everything based on LEDE 17.01 onwards is affected.

All 139 comments

We were experiencing this issue since last year, so probably everything based on LEDE 17.01 onwards is affected.

A log would be great. This should really not happen...

So far we have not seen anything meaningful in dmesg/logread. Errors/warnings in there are only symptomatic, as they happen late when little memory is left.

We have not found any issue in userspace and therefore continued looking into the kernel space.

I've asked tackin on IRC to provide use with the outputs of /proc/slabinfo while a router is leaking memory, which he will likely provide tomorrow.

We have currently only seen this kind of memory leak on devices featuring ath9k radios, but that might be totally circumstantial.

I'd be totally happy if people could check their high-client APs for frequent out-of-memory reboots. I'm especially interested in devices that are not ar71xx/ath79 based.

We currently assume IPQ40xx (dual ath10k) not to be affected, since we have this sample data:

FRITZ!Box 4040
0.11.8+tackin / gluon-v2018.2.1-3-g015b0d68+
https://maps.tackin.de/#!/en/map/f0b014618973
https://draco.freifunk-trier.starletp9.de:3000/d/q8Ud1fJiz/single-nodes?orgId=1&var-node_name=54309_Fritz4040_DigiLay_Test&from=1561088751091&to=1561341537117

image
image
image

Great, so this is likely an ath9k bug. The ath9k changes are sparse, but it does happen often that there are memory leaks.
It's important to know the first Gluon version with this issue. Then we can easily find the commit. The biggest change in ath9k recently was the introduction of airtime fairness. I remember there were many leaks which were fixed one after another, but maybe the haven't found all of them.

This is surely a leak:
leak

I think Ubiquiti also had some major memory leaks with ath9k and airtime fairness ... I just don't find the release notes for it anymore. Because this could give as hint in which kernel version it was fixed :/.

Or they fixed it for themselves and didn't upstream it :/.

Is it possible to disable airtime fairness at runtime? If the problem vanishes then that might be a first hint.

echo 0 > /sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags

@nbd168 I know you're very busy, but I believe to remember that you commited a fix of a memory leak introduced with airtime fairness support in ath9k.

Can you give us a hint on what to do?

BTW do you go to Paris?

echo 0 > /sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags

Thx, I hope I get some people to try this :).

We've disabled airtime fairness on device mentioned in https://github.com/freifunk-gluon/gluon/issues/1768#issuecomment-506291460 and https://github.com/freifunk-gluon/gluon/issues/1768#issuecomment-506291951. Let's wait and see.

Like to add one:
TP-Link Archer C7 v2 on 0.11.8+tackin / gluon-v2018.2.1-3-g015b0d68+
ArcherC7_v2_02
ArcherC7_v2_01

echo 0 > /sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags
ArcherC7_v2 says:
cat /sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags
cat: can't open '/sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags': No such file or directory

Am I wrong or is this disabled by default in the releases 2018.2.x ?

@tackin I'm on v2018.2.1 and have this entry.

@CodeFetch ergo: not the reason for the memory leak, as we don't seem to have this entry.

@tackin That doesn't mean it's not activated. Just that you can't disable it. Airtime fariness is activated since its introduction by default and a very nice feature BTW.

On an Archer C7 V2 the ath9k radio could as well be phy1, so check /sys/kernel/debug/ieee80211/phy1/ath9k/airtime_flags instead. But please don't do it on the device you are currentyl collecting slabinfo from! :)

I've checked our network. We don't have this issue but we have at most 35 clients on ath9k devices. It seems that this issue only happens when there are many, many clients...

Thus I'd recommend someone to create a lot of STA VIFs with an ath9k device and stress another ath9k device with that. ath9k devices can create more than 2000 STA VIFs as far as I understand.

On an Archer C7 V2 the ath9k radio could as well be phy1, so check /sys/kernel/debug/ieee80211/phy1/ath9k/airtime_flags instead. But please don't do it on the device you are currentyl collecting slabinfo from! :)
@mweinelt:
cat /sys/kernel/debug/ieee80211/phy1/ath9k/airtime_flags
7

Soooorry! Thought I tried both.

We have some Ubiquiti Bullets with many clients and Reboots. If you provide exact steps I would provide logs/monitor and test whatever you like. just need detailed instructions. I could also give direct access to these nodes, because they belong to our Verein and are placed in Streetlights

http://map.freifunk-bochum.de:3000/d/000000004/node-byid?orgId=1&var-nodeid=68725106af6e&from=now%2Fd&to=now

Here is a simple ash script to create STA VIFs for simulating WLAN clients. It should be used on a clean OpenWrt. I haven't really tested it, because I've other things to do atm.
With "scriptname down" you can remove the interfaces again.

#!/bin/sh

ssid=Freifunk

vifs=20
radio=phy0
prefix=test

if [ $1 = "down" ]; then
    i=0
    while [ $i -lt $vifs ]; do        
            ip link set $prefix$i down
            iw dev $prefix$i del
            i=$(( $i + 1 ))
    done
    exit 1
fi

i=0
while [ $i -lt $vifs ]; do
    mac=DE:AD:01:$(hexdump -n3 -e '/1 ":%02X"' /dev/urandom|sed s/^://g)
    iw phy $radio interface add $prefix$i type station
    ip link set $prefix$i address $mac
    ip link set $prefix$i up
    iw $prefix$i connect $ssid
    i=$(( $i + 1 ))
done

WR841N v9 with 64MB RAM in the city centre's pedestrian area. lots of cafes, bars, restaurants.
Memory usage peaks before every reboot.

Bildschirmfoto_2019-06-27_17-14-00

https://map.freifunk-3laendereck.net/grafana/d/000000006/freifunk-dreilandereck-einzelansicht?orgId=1&var-Knotenid=14cc204bfa5e&from=1559076137990&to=1561609364656

For those who want to check their network here's a very simple python3-script to extract 10 nodes with the highest client count from a meshviewer.json-file:

import json

print([[n['hostname'],n['clients'],n['clients_wifi24'],n['clients_wifi5']] for n in sorted(json.load(open('meshviewer.json'))['nodes'], key=lambda n: n['clients'])[::-1][:10]])

An extract of several /proc/slabinfo dumps during RAM-leaking:

Working fine:

while true; do echo '-------------------------------------------------------' ; date ; cat /proc/slabinfo ; sleep 30 ; done

Thu Jun 27 08:23:54 CEST 2019
slabinfo - version: 2.1
name : tunables : slabdata
batadv_tt_roam_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
batadv_tt_req_cache 0 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
batadv_tt_change_cache 0 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
batadv_tt_orig_cache 815 1120 128 32 1 : tunables 0 0 0 : slabdata 35 35 0
batadv_tg_cache 378 640 128 32 1 : tunables 0 0 0 : slabdata 20 20 0
batadv_tl_cache 9 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
bridge_fdb_cache 55 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
nf-frags 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
ip6-frags 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
fib6_nodes 39 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0
ip6_dst_cache 63 108 320 12 1 : tunables 0 0 0 : slabdata 9 9 0
PINGv6 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
RAWv6 8 18 864 9 2 : tunables 0 0 0 : slabdata 2 2 0
UDPLITEv6 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 23 45 864 9 2 : tunables 0 0 0 : slabdata 5 5 0
tw_sock_TCPv6 0 17 232 17 1 : tunables 0 0 0 : slabdata 1 1 0
request_sock_TCPv6 0 14 280 14 1 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 18 30 1632 10 4 : tunables 0 0 0 : slabdata 3 3 0
nf_conntrack_expect 0 0 208 19 1 : tunables 0 0 0 : slabdata 0 0 0
nf_conntrack 6 56 288 14 1 : tunables 0 0 0 : slabdata 4 4 0
sgpool-128 2 15 2144 15 8 : tunables 0 0 0 : slabdata 1 1 0
sgpool-64 2 14 1120 14 4 : tunables 0 0 0 : slabdata 1 1 0
sgpool-32 2 13 608 13 2 : tunables 0 0 0 : slabdata 1 1 0
sgpool-16 2 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
sgpool-8 2 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_xattr_ref 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
jffs2_xattr_datum 0 0 112 36 1 : tunables 0 0 0 : slabdata 0 0 0
jffs2_inode_cache 134 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
jffs2_node_frag 60 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
jffs2_refblock 92 104 304 13 1 : tunables 0 0 0 : slabdata 8 8 0
jffs2_tmp_dnode 0 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_raw_inode 0 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_raw_dirent 0 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_full_dnode 120 168 72 56 1 : tunables 0 0 0 : slabdata 3 3 0
jffs2_i 79 80 392 10 1 : tunables 0 0 0 : slabdata 8 8 0
squashfs_inode_cache 707 711 448 9 1 : tunables 0 0 0 : slabdata 79 79 0
dio 0 0 392 10 1 : tunables 0 0 0 : slabdata 0 0 0
fasync_cache 4 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
posix_timers_cache 0 0 216 18 1 : tunables 0 0 0 : slabdata 0 0 0
UNIX 32 48 672 12 2 : tunables 0 0 0 : slabdata 4 4 0
ip4-frags 0 0 200 20 1 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 736 11 2 : tunables 0 0 0 : slabdata 0 0 0
tcp_bind_bucket 4 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
inet_peer_cache 2 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
secpath_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 352 11 1 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 8 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
ip_fib_alias 9 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
ip_dst_cache 6 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
PING 0 0 704 11 2 : tunables 0 0 0 : slabdata 0 0 0
RAW 2 11 704 11 2 : tunables 0 0 0 : slabdata 1 1 0
UDP 8 22 736 11 2 : tunables 0 0 0 : slabdata 2 2 0
tw_sock_TCP 0 0 232 17 1 : tunables 0 0 0 : slabdata 0 0 0
request_sock_TCP 0 0 280 14 1 : tunables 0 0 0 : slabdata 0 0 0
TCP 5 10 1536 10 4 : tunables 0 0 0 : slabdata 1 1 0
eventpoll_pwq 44 92 88 46 1 : tunables 0 0 0 : slabdata 2 2 0
eventpoll_epi 44 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
inotify_inode_mark 2 39 104 39 1 : tunables 0 0 0 : slabdata 1 1 0
request_queue 6 8 1016 8 2 : tunables 0 0 0 : slabdata 1 1 0
blkdev_requests 24 30 264 15 1 : tunables 0 0 0 : slabdata 2 2 0
blkdev_ioc 6 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
bio-0 14 50 384 10 1 : tunables 0 0 0 : slabdata 5 5 0
biovec-max 14 20 3168 10 8 : tunables 0 0 0 : slabdata 2 2 0
biovec-128 0 0 1632 10 4 : tunables 0 0 0 : slabdata 0 0 0
biovec-64 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
biovec-16 0 0 288 14 1 : tunables 0 0 0 : slabdata 0 0 0
uid_cache 1 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 127 152 416 19 2 : tunables 0 0 0 : slabdata 8 8 0
skbuff_fclone_cache 3 8 480 8 1 : tunables 0 0 0 : slabdata 1 1 0
skbuff_head_cache 517 644 288 14 1 : tunables 0 0 0 : slabdata 46 46 0
file_lock_cache 0 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
file_lock_ctx 50 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
shmem_inode_cache 216 220 392 10 1 : tunables 0 0 0 : slabdata 22 22 0
pool_workqueue 6 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0
proc_inode_cache 457 490 392 10 1 : tunables 0 0 0 : slabdata 49 49 0
sigqueue 0 20 200 20 1 : tunables 0 0 0 : slabdata 1 1 0
bdev_cache 4 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
kernfs_node_cache 7331 7350 136 30 1 : tunables 0 0 0 : slabdata 245 245 0
mnt_cache 21 28 288 14 1 : tunables 0 0 0 : slabdata 2 2 0
filp 352 414 224 18 1 : tunables 0 0 0 : slabdata 23 23 0
inode_cache 1592 1606 352 11 1 : tunables 0 0 0 : slabdata 146 146 0
dentry 3926 3948 192 21 1 : tunables 0 0 0 : slabdata 188 188 0
names_cache 0 7 4192 7 8 : tunables 0 0 0 : slabdata 1 1 0
key_jar 2 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
buffer_head 2176 2176 120 34 1 : tunables 0 0 0 : slabdata 64 64 0
nsproxy 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
vm_area_struct 476 672 144 28 1 : tunables 0 0 0 : slabdata 24 24 0
mm_struct 25 56 512 8 1 : tunables 0 0 0 : slabdata 7 7 0
fs_cache 24 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
files_cache 25 56 288 14 1 : tunables 0 0 0 : slabdata 4 4 0
signal_cache 53 78 608 13 2 : tunables 0 0 0 : slabdata 6 6 0
sighand_cache 53 80 3168 10 8 : tunables 0 0 0 : slabdata 8 8 0
task_struct 53 92 1392 23 8 : tunables 0 0 0 : slabdata 4 4 0
cred_jar 92 126 224 18 1 : tunables 0 0 0 : slabdata 7 7 0
anon_vma_chain 372 506 88 46 1 : tunables 0 0 0 : slabdata 11 11 0
anon_vma 292 414 88 46 1 : tunables 0 0 0 : slabdata 9 9 0
pid 61 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
radix_tree_node 149 154 360 11 1 : tunables 0 0 0 : slabdata 14 14 0
idr_layer_cache 98 112 1120 14 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-8192 12 15 8448 3 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-4096 134 245 4352 7 8 : tunables 0 0 0 : slabdata 35 35 0
kmalloc-2048 73 98 2304 14 8 : tunables 0 0 0 : slabdata 7 7 0
kmalloc-1024 164 180 1280 12 4 : tunables 0 0 0 : slabdata 15 15 0
kmalloc-512 273 280 768 10 2 : tunables 0 0 0 : slabdata 28 28 0
kmalloc-256 461 472 512 8 1 : tunables 0 0 0 : slabdata 59 59 0
kmalloc-128 7654 7680 384 10 1 : tunables 0 0 0 : slabdata 768 768 0
kmem_cache_node 109 112 256 16 1 : tunables 0 0 0 : slabdata 7 7 0
kmem_cache 109 110 384 10 1 : tunables 0 0 0 : slabdata 11 11 0


RAM-leak starting:


Thu Jun 27 15:06:18 CEST 2019
slabinfo - version: 2.1
name : tunables : slabdata
batadv_tt_roam_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
batadv_tt_req_cache 0 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
batadv_tt_change_cache 0 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
batadv_tt_orig_cache 905 1120 128 32 1 : tunables 0 0 0 : slabdata 35 35 0
batadv_tg_cache 462 640 128 32 1 : tunables 0 0 0 : slabdata 20 20 0
batadv_tl_cache 135 160 128 32 1 : tunables 0 0 0 : slabdata 5 5 0
bridge_fdb_cache 109 128 128 32 1 : tunables 0 0 0 : slabdata 4 4 0
nf-frags 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
ip6-frags 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
fib6_nodes 39 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0
ip6_dst_cache 74 108 320 12 1 : tunables 0 0 0 : slabdata 9 9 0
PINGv6 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
RAWv6 8 18 864 9 2 : tunables 0 0 0 : slabdata 2 2 0
UDPLITEv6 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 23 45 864 9 2 : tunables 0 0 0 : slabdata 5 5 0
tw_sock_TCPv6 0 17 232 17 1 : tunables 0 0 0 : slabdata 1 1 0
request_sock_TCPv6 0 14 280 14 1 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 18 30 1632 10 4 : tunables 0 0 0 : slabdata 3 3 0
nf_conntrack_expect 0 0 208 19 1 : tunables 0 0 0 : slabdata 0 0 0
nf_conntrack 6 56 288 14 1 : tunables 0 0 0 : slabdata 4 4 0
sgpool-128 2 15 2144 15 8 : tunables 0 0 0 : slabdata 1 1 0
sgpool-64 2 14 1120 14 4 : tunables 0 0 0 : slabdata 1 1 0
sgpool-32 2 13 608 13 2 : tunables 0 0 0 : slabdata 1 1 0
sgpool-16 2 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
sgpool-8 2 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_xattr_ref 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
jffs2_xattr_datum 0 0 112 36 1 : tunables 0 0 0 : slabdata 0 0 0
jffs2_inode_cache 134 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
jffs2_node_frag 60 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
jffs2_refblock 92 104 304 13 1 : tunables 0 0 0 : slabdata 8 8 0
jffs2_tmp_dnode 0 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_raw_inode 0 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_raw_dirent 0 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_full_dnode 120 168 72 56 1 : tunables 0 0 0 : slabdata 3 3 0
jffs2_i 79 80 392 10 1 : tunables 0 0 0 : slabdata 8 8 0
squashfs_inode_cache 707 711 448 9 1 : tunables 0 0 0 : slabdata 79 79 0
dio 0 0 392 10 1 : tunables 0 0 0 : slabdata 0 0 0
fasync_cache 4 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
posix_timers_cache 0 0 216 18 1 : tunables 0 0 0 : slabdata 0 0 0
UNIX 32 48 672 12 2 : tunables 0 0 0 : slabdata 4 4 0
ip4-frags 0 0 200 20 1 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 736 11 2 : tunables 0 0 0 : slabdata 0 0 0
tcp_bind_bucket 4 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
inet_peer_cache 4 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
secpath_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 352 11 1 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 8 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
ip_fib_alias 9 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
ip_dst_cache 6 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
PING 0 0 704 11 2 : tunables 0 0 0 : slabdata 0 0 0
RAW 2 11 704 11 2 : tunables 0 0 0 : slabdata 1 1 0
UDP 8 22 736 11 2 : tunables 0 0 0 : slabdata 2 2 0
tw_sock_TCP 0 0 232 17 1 : tunables 0 0 0 : slabdata 0 0 0
request_sock_TCP 0 0 280 14 1 : tunables 0 0 0 : slabdata 0 0 0
TCP 5 10 1536 10 4 : tunables 0 0 0 : slabdata 1 1 0
eventpoll_pwq 44 92 88 46 1 : tunables 0 0 0 : slabdata 2 2 0
eventpoll_epi 44 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
inotify_inode_mark 2 39 104 39 1 : tunables 0 0 0 : slabdata 1 1 0
request_queue 6 8 1016 8 2 : tunables 0 0 0 : slabdata 1 1 0
blkdev_requests 24 30 264 15 1 : tunables 0 0 0 : slabdata 2 2 0
blkdev_ioc 6 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
bio-0 14 50 384 10 1 : tunables 0 0 0 : slabdata 5 5 0
biovec-max 14 20 3168 10 8 : tunables 0 0 0 : slabdata 2 2 0
biovec-128 0 0 1632 10 4 : tunables 0 0 0 : slabdata 0 0 0
biovec-64 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
biovec-16 0 0 288 14 1 : tunables 0 0 0 : slabdata 0 0 0
uid_cache 1 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 127 152 416 19 2 : tunables 0 0 0 : slabdata 8 8 0
skbuff_fclone_cache 4 16 480 8 1 : tunables 0 0 0 : slabdata 2 2 0
skbuff_head_cache 5887 6104 288 14 1 : tunables 0 0 0 : slabdata 436 436 0
file_lock_cache 0 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
file_lock_ctx 50 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
shmem_inode_cache 216 220 392 10 1 : tunables 0 0 0 : slabdata 22 22 0
pool_workqueue 6 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0
proc_inode_cache 457 490 392 10 1 : tunables 0 0 0 : slabdata 49 49 0
sigqueue 0 20 200 20 1 : tunables 0 0 0 : slabdata 1 1 0
bdev_cache 4 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
kernfs_node_cache 7331 7350 136 30 1 : tunables 0 0 0 : slabdata 245 245 0
mnt_cache 21 28 288 14 1 : tunables 0 0 0 : slabdata 2 2 0
filp 352 414 224 18 1 : tunables 0 0 0 : slabdata 23 23 0
inode_cache 2402 2431 352 11 1 : tunables 0 0 0 : slabdata 221 221 0
dentry 4736 4788 192 21 1 : tunables 0 0 0 : slabdata 228 228 0
names_cache 0 7 4192 7 8 : tunables 0 0 0 : slabdata 1 1 0
key_jar 2 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
buffer_head 2176 2176 120 34 1 : tunables 0 0 0 : slabdata 64 64 0
nsproxy 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
vm_area_struct 476 672 144 28 1 : tunables 0 0 0 : slabdata 24 24 0
mm_struct 25 56 512 8 1 : tunables 0 0 0 : slabdata 7 7 0
fs_cache 24 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
files_cache 25 56 288 14 1 : tunables 0 0 0 : slabdata 4 4 0
signal_cache 54 78 608 13 2 : tunables 0 0 0 : slabdata 6 6 0
sighand_cache 54 80 3168 10 8 : tunables 0 0 0 : slabdata 8 8 0
task_struct 54 92 1392 23 8 : tunables 0 0 0 : slabdata 4 4 0
cred_jar 93 126 224 18 1 : tunables 0 0 0 : slabdata 7 7 0
anon_vma_chain 372 506 88 46 1 : tunables 0 0 0 : slabdata 11 11 0
anon_vma 292 414 88 46 1 : tunables 0 0 0 : slabdata 9 9 0
pid 62 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
radix_tree_node 149 154 360 11 1 : tunables 0 0 0 : slabdata 14 14 0
idr_layer_cache 98 112 1120 14 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-8192 57 60 8448 3 8 : tunables 0 0 0 : slabdata 20 20 0
kmalloc-4096 1700 1764 4352 7 8 : tunables 0 0 0 : slabdata 252 252 0
kmalloc-2048 1127 1148 2304 14 8 : tunables 0 0 0 : slabdata 82 82 0
kmalloc-1024 713 732 1280 12 4 : tunables 0 0 0 : slabdata 61 61 0
kmalloc-512 2138 2150 768 10 2 : tunables 0 0 0 : slabdata 215 215 0
kmalloc-256 529 536 512 8 1 : tunables 0 0 0 : slabdata 67 67 0
kmalloc-128 8593 8710 384 10 1 : tunables 0 0 0 : slabdata 871 871 0
kmem_cache_node 109 112 256 16 1 : tunables 0 0 0 : slabdata 7 7 0
kmem_cache 109 110 384 10 1 : tunables 0 0 0 : slabdata 11 11 0


More than 50% RAM lost (about 50MB)


Thu Jun 27 15:15:19 CEST 2019
slabinfo - version: 2.1
name : tunables : slabdata
batadv_tt_roam_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
batadv_tt_req_cache 0 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
batadv_tt_change_cache 3 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
batadv_tt_orig_cache 848 1120 128 32 1 : tunables 0 0 0 : slabdata 35 35 0
batadv_tg_cache 414 640 128 32 1 : tunables 0 0 0 : slabdata 20 20 0
batadv_tl_cache 154 192 128 32 1 : tunables 0 0 0 : slabdata 6 6 0
bridge_fdb_cache 100 128 128 32 1 : tunables 0 0 0 : slabdata 4 4 0
nf-frags 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
ip6-frags 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
fib6_nodes 39 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0
ip6_dst_cache 76 108 320 12 1 : tunables 0 0 0 : slabdata 9 9 0
PINGv6 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
RAWv6 8 18 864 9 2 : tunables 0 0 0 : slabdata 2 2 0
UDPLITEv6 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 23 45 864 9 2 : tunables 0 0 0 : slabdata 5 5 0
tw_sock_TCPv6 0 17 232 17 1 : tunables 0 0 0 : slabdata 1 1 0
request_sock_TCPv6 0 14 280 14 1 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 18 30 1632 10 4 : tunables 0 0 0 : slabdata 3 3 0
nf_conntrack_expect 0 0 208 19 1 : tunables 0 0 0 : slabdata 0 0 0
nf_conntrack 7 56 288 14 1 : tunables 0 0 0 : slabdata 4 4 0
sgpool-128 2 15 2144 15 8 : tunables 0 0 0 : slabdata 1 1 0
sgpool-64 2 14 1120 14 4 : tunables 0 0 0 : slabdata 1 1 0
sgpool-32 2 13 608 13 2 : tunables 0 0 0 : slabdata 1 1 0
sgpool-16 2 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
sgpool-8 2 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_xattr_ref 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
jffs2_xattr_datum 0 0 112 36 1 : tunables 0 0 0 : slabdata 0 0 0
jffs2_inode_cache 134 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
jffs2_node_frag 60 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
jffs2_refblock 92 104 304 13 1 : tunables 0 0 0 : slabdata 8 8 0
jffs2_tmp_dnode 0 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_raw_inode 0 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_raw_dirent 0 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
jffs2_full_dnode 120 168 72 56 1 : tunables 0 0 0 : slabdata 3 3 0
jffs2_i 79 80 392 10 1 : tunables 0 0 0 : slabdata 8 8 0
squashfs_inode_cache 707 711 448 9 1 : tunables 0 0 0 : slabdata 79 79 0
dio 0 0 392 10 1 : tunables 0 0 0 : slabdata 0 0 0
fasync_cache 4 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
posix_timers_cache 0 0 216 18 1 : tunables 0 0 0 : slabdata 0 0 0
UNIX 32 48 672 12 2 : tunables 0 0 0 : slabdata 4 4 0
ip4-frags 0 0 200 20 1 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 736 11 2 : tunables 0 0 0 : slabdata 0 0 0
tcp_bind_bucket 4 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
inet_peer_cache 4 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
secpath_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 352 11 1 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 8 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
ip_fib_alias 9 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
ip_dst_cache 6 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
PING 0 0 704 11 2 : tunables 0 0 0 : slabdata 0 0 0
RAW 2 11 704 11 2 : tunables 0 0 0 : slabdata 1 1 0
UDP 8 22 736 11 2 : tunables 0 0 0 : slabdata 2 2 0
tw_sock_TCP 0 0 232 17 1 : tunables 0 0 0 : slabdata 0 0 0
request_sock_TCP 0 0 280 14 1 : tunables 0 0 0 : slabdata 0 0 0
TCP 5 10 1536 10 4 : tunables 0 0 0 : slabdata 1 1 0
eventpoll_pwq 44 92 88 46 1 : tunables 0 0 0 : slabdata 2 2 0
eventpoll_epi 44 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
inotify_inode_mark 2 39 104 39 1 : tunables 0 0 0 : slabdata 1 1 0
request_queue 6 8 1016 8 2 : tunables 0 0 0 : slabdata 1 1 0
blkdev_requests 24 30 264 15 1 : tunables 0 0 0 : slabdata 2 2 0
blkdev_ioc 6 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
bio-0 14 50 384 10 1 : tunables 0 0 0 : slabdata 5 5 0
biovec-max 14 20 3168 10 8 : tunables 0 0 0 : slabdata 2 2 0
biovec-128 0 0 1632 10 4 : tunables 0 0 0 : slabdata 0 0 0
biovec-64 0 0 864 9 2 : tunables 0 0 0 : slabdata 0 0 0
biovec-16 0 0 288 14 1 : tunables 0 0 0 : slabdata 0 0 0
uid_cache 1 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 127 152 416 19 2 : tunables 0 0 0 : slabdata 8 8 0
skbuff_fclone_cache 3 16 480 8 1 : tunables 0 0 0 : slabdata 2 2 0
skbuff_head_cache 13035 13048 288 14 1 : tunables 0 0 0 : slabdata 932 932 0
file_lock_cache 0 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
file_lock_ctx 50 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
shmem_inode_cache 216 220 392 10 1 : tunables 0 0 0 : slabdata 22 22 0
pool_workqueue 6 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0
proc_inode_cache 457 490 392 10 1 : tunables 0 0 0 : slabdata 49 49 0
sigqueue 0 20 200 20 1 : tunables 0 0 0 : slabdata 1 1 0
bdev_cache 4 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
kernfs_node_cache 7331 7350 136 30 1 : tunables 0 0 0 : slabdata 245 245 0
mnt_cache 21 28 288 14 1 : tunables 0 0 0 : slabdata 2 2 0
filp 352 414 224 18 1 : tunables 0 0 0 : slabdata 23 23 0
inode_cache 2510 2541 352 11 1 : tunables 0 0 0 : slabdata 231 231 0
dentry 4844 4893 192 21 1 : tunables 0 0 0 : slabdata 233 233 0
names_cache 0 7 4192 7 8 : tunables 0 0 0 : slabdata 1 1 0
key_jar 2 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
buffer_head 2176 2176 120 34 1 : tunables 0 0 0 : slabdata 64 64 0
nsproxy 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
vm_area_struct 476 672 144 28 1 : tunables 0 0 0 : slabdata 24 24 0
mm_struct 25 56 512 8 1 : tunables 0 0 0 : slabdata 7 7 0
fs_cache 24 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
files_cache 25 56 288 14 1 : tunables 0 0 0 : slabdata 4 4 0
signal_cache 54 78 608 13 2 : tunables 0 0 0 : slabdata 6 6 0
sighand_cache 54 80 3168 10 8 : tunables 0 0 0 : slabdata 8 8 0
task_struct 54 92 1392 23 8 : tunables 0 0 0 : slabdata 4 4 0
cred_jar 93 126 224 18 1 : tunables 0 0 0 : slabdata 7 7 0
anon_vma_chain 372 506 88 46 1 : tunables 0 0 0 : slabdata 11 11 0
anon_vma 292 414 88 46 1 : tunables 0 0 0 : slabdata 9 9 0
pid 62 96 128 32 1 : tunables 0 0 0 : slabdata 3 3 0
radix_tree_node 149 154 360 11 1 : tunables 0 0 0 : slabdata 14 14 0
idr_layer_cache 98 112 1120 14 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-8192 63 69 8448 3 8 : tunables 0 0 0 : slabdata 23 23 0
kmalloc-4096 5532 5551 4352 7 8 : tunables 0 0 0 : slabdata 793 793 0
kmalloc-2048 2322 2338 2304 14 8 : tunables 0 0 0 : slabdata 167 167 0
kmalloc-1024 1141 1152 1280 12 4 : tunables 0 0 0 : slabdata 96 96 0
kmalloc-512 3847 3850 768 10 2 : tunables 0 0 0 : slabdata 385 385 0
kmalloc-256 535 544 512 8 1 : tunables 0 0 0 : slabdata 68 68 0
kmalloc-128 6592 8670 384 10 1 : tunables 0 0 0 : slabdata 867 867 0
kmem_cache_node 109 112 256 16 1 : tunables 0 0 0 : slabdata 7 7 0
kmem_cache 109 110 384 10 1 : tunables 0 0 0 : slabdata 11 11 0
^C
root@54309BesDigiLayxx2:~# wifi

Extract:

kmalloc-4096 110 133 4352 7 8 : tunables 0 0 0 : slabdata 19 19 0
kmalloc-4096 134 147 4352 7 8 : tunables 0 0 0 : slabdata 21 21 0
kmalloc-4096 134 245 4352 7 8 : tunables 0 0 0 : slabdata 35 35 0
kmalloc-4096 139 147 4352 7 8 : tunables 0 0 0 : slabdata 21 21 0
kmalloc-4096 1700 1764 4352 7 8 : tunables 0 0 0 : slabdata 252 252 0
kmalloc-4096 1967 1995 4352 7 8 : tunables 0 0 0 : slabdata 285 285 0
kmalloc-4096 4146 4151 4352 7 8 : tunables 0 0 0 : slabdata 593 593 0
kmalloc-4096 5118 5222 4352 7 8 : tunables 0 0 0 : slabdata 746 746 0
kmalloc-4096 5337 5355 4352 7 8 : tunables 0 0 0 : slabdata 765 765 0
kmalloc-4096 5532 5551 4352 7 8 : tunables 0 0 0 : slabdata 793 793 0
kmalloc-512 273 280 768 10 2 : tunables 0 0 0 : slabdata 28 28 0
kmalloc-512 295 320 768 10 2 : tunables 0 0 0 : slabdata 32 32 0
kmalloc-512 296 320 768 10 2 : tunables 0 0 0 : slabdata 32 32 0
kmalloc-512 335 400 768 10 2 : tunables 0 0 0 : slabdata 40 40 0
kmalloc-512 2138 2150 768 10 2 : tunables 0 0 0 : slabdata 215 215 0
kmalloc-512 2401 2420 768 10 2 : tunables 0 0 0 : slabdata 242 242 0
kmalloc-512 2496 2500 768 10 2 : tunables 0 0 0 : slabdata 250 250 0
kmalloc-512 3611 3620 768 10 2 : tunables 0 0 0 : slabdata 362 362 0
kmalloc-512 3687 3700 768 10 2 : tunables 0 0 0 : slabdata 370 370 0
kmalloc-512 3847 3850 768 10 2 : tunables 0 0 0 : slabdata 385 385 0

skbuff_head_cache 517 644 288 14 1 : tunables 0 0 0 : slabdata 46 46 0
skbuff_head_cache 519 644 288 14 1 : tunables 0 0 0 : slabdata 46 46 0
skbuff_head_cache 522 644 288 14 1 : tunables 0 0 0 : slabdata 46 46 0
skbuff_head_cache 522 798 288 14 1 : tunables 0 0 0 : slabdata 57 57 0
skbuff_head_cache 5887 6104 288 14 1 : tunables 0 0 0 : slabdata 436 436 0
skbuff_head_cache 6275 6356 288 14 1 : tunables 0 0 0 : slabdata 454 454 0
skbuff_head_cache 8442 8456 288 14 1 : tunables 0 0 0 : slabdata 604 604 0
skbuff_head_cache 11574 11676 288 14 1 : tunables 0 0 0 : slabdata 834 834 0
skbuff_head_cache 12080 12236 288 14 1 : tunables 0 0 0 : slabdata 874 874 0
skbuff_head_cache 13035 13048 288 14 1 : tunables 0 0 0 : slabdata 932 932 0

btw:
I can confirm LAN-Clients never cause this problem. I have an ArcherC7v2 on gluon-v2018.2.1+ with more than 70 LAN-Clients peaks each day, that never leaked.

@tackin Thank you very much. I need to see the whole output. If that is the whole output then the cache names are missing and it is quite useless. Nonetheless I can't see anything suspicious here.

BTW we need one output from the device when it's working fine and one when it is leaking.

@CodeFetch : I edited the previous comment.

@tackin Great! Now I understand why you posted the extract. So there seem to be skbs aggregating. I'll have a look what can cause this...

It could be possible that the ath9k driver can't send them out fast enough, because the driver's TX-queue is not triggered by mac80211 often enough. (retry packets aggregating)

There has been a fix for this by nbd168: https://github.com/torvalds/linux/commit/2b4a66980217332d91ab1785e1750857d6d52bc8#diff-ccb93195360636a322b558d121709687

But that's only a guess...

EDIT: BAD guess as we are running the stone-old 4.9 kernel on ar71xx...

I'm not in the details of the programming but my thinking is: 40+ clients may stuff the air down to a minimum throughput and make the driver buffer to increase more and more.

On my Fritzbox4040 I had only 1MB/s Traffic in the uplink but was not able to stream a mini video anymore when 40 other clients are around. They all did not produce much traffic (1 MB) or couldn't anymore because airtime is stuffed.
But on a Fritzbox4040 we don't see that leaking even with 70 Clients.

@tackin
I haven't found anything that was fixed that could lead to this behavior. Thus either the issue still persists with the kernel master or it was fixed incidentally. The only way to track this issue down is to use kmemleak.

Setting /sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags from 7 to 0 and restarting wifi does not affect the oom behaviour on an CPE210v2 with 2017.1.8

@TomSiener Turning off airtime fairness does not need a WiFi restart. Is it still set to 0 afterwards?

yes. the value 0 is remaining after "wifi"

We rolled out Gluon v2018.2.2 over the night let's see if this changes something.
Previously affected nodes:
https://map.ffmuc.net/#!/en/map/98ded0a79164

Please try Gluon master, merge this commit https://github.com/freifunk-gluon/gluon/commit/65d1f4ea8f82 and build with GLUON_DEBUG=true.

@awlx My RAM-leak-test-node is on gluon 2018.2.2 as well. ... so upgrading will not help.

According to @T-X kmemleak will do us no good:

17:06 hm, I don't expect CONFIG_DEBUG_KMEMLEAK to reveal anything here. the "wifi" command is able to free the memory, so there must be some references left
17:07 KMEMLEAK can only detect memory leaks which have no more references

Maybe it's something like what I fixed in OpenWrt commit 0c285bd081da55bd63da18f7596e7107a46bb798

@mweinelt Right, I overestimated what kmemleak can do :(... I expected it to detect stale sk_buffs, but it is just really stupid...

@nbd168 As this issue seems to only happen on ath9k devices I suspect the TX-retry-queue. I can't find a case in ath9k similar to your commit. It does not really explain why this happens with so many clients (maybe collisions). Why is there no limit for the TX queues' sizes in ath9k? ATH_MAX_SW_RETRIES is set to 30 by default. Is it possible that it just fills the memory?

Edit: Seems I don't understand the queues, yet, but I still suspect there's something wrong.

@tackin @awlx @TomSiener Can you please try an image built with CONFIG_SLUB_DEBUG=y?

Then execute echo 1 > /sys/kernel/slab/kmalloc-4096/trace and post the dmesg output before and when the memory increases.

sorry ...
Try to build 2018.2.2 and add in file /gloun/targets/generic

config 'CONFIG_SLUB_DEBUG=y'

Configuration failed:

  • unable to set 'CONFIG_SLUB_DEBUG=y'
    Makefile:126: die Regel für Ziel „config“ scheiterte

Maybe CONFIG_SLUB_DEBUG_ON ?

config 'CONFIG_KERNEL_SLABINFO=y'

(see openwrt/config/Config-kernel.in)

@NeoRaider @rubo77 @CodeFetch
I already build a firmware with config 'CONFIG_KERNEL_SLABINFO=y' and posted the output of /proc/slabinfo above.

Then execute echo 1 > /sys/kernel/slab/kmalloc-4096/trace and post the dmesg output before and when the memory increases.

So do I understand you right, that I can/should do this on that build without adding something else to the build?

Need to say, I build the slabinfo-build with:
diff --git a/targets/generic b/targets/generic
index 517603a6..a05296a3 100644
--- a/targets/generic
+++ b/targets/generic
@@ -51,6 +51,7 @@ if [ "$GLUON_DEBUG" = 1 ]; then
config '# CONFIG_USE_SSTRIP is not set'
fi

+config 'CONFIG_KERNEL_SLABINFO=y'

packages '-odhcpd-ipv6only' '-ppp' '-ppp-mod-pppoe' '-wpad-mini'
packages 'gluon-core' 'ip6tables'

but without the make OPTION GLUON_DEBUG=1
Is this important?

@NeoRaider I'd like to see a SLUB trace. For that CONFIG_SLUB_DEBUG needs to be set.

@tackin SLUB != SLAB
I don't know why CONFIG_SLUB_DEBUG fails to build. Maybe because the symbol is missing in the ar71xx-config. But as you can see here this entry exists. I'll try later, but I'm very busy atm.

@tackin @awlx @TomSiener

I've looked at the dependencies of CONFIG_SLUB_DEBUG and there should be no reason why it fails. Please build with V=s and this commit https://github.com/freifunk-gluon/gluon/commit/6c4f98944d38db42f2d851e50b3647489351dcb0 and GLUON_DEBUG=1.

@tackin Without GLUON_DEBUG the symbols might be stripped which gives unusable output.

@CodeFetch You are mixing up OpenWrt and kernel .config. CONFIG_SLUB_DEBUG is a kernel config symbol, you can not directly affect those in the target config.

Fortunately, OpenWrt provides CONFIG_KERNEL_SLABINFO, which also enables CONFIG_KERNEL_SLUB_DEBUG (which can't be set directly, as it is a hidden config symbol):

config KERNEL_SLABINFO
        select KERNEL_SLUB_DEBUG
        select KERNEL_SLUB_DEBUG_ON
        bool "Enable /proc slab debug info"

@NeoRaider Oh damned... You're so right. Thank you.

@tackin Can you please provide the dmesg output before and while the memory increases? echo 1 > /sys/kernel/slab/kmalloc-4096/trace
and GLUON_DEBUG needs to be set to 1 or the output will be useless.

@CodeFetch New build is in progress. I will probably have your requested data with the first oom-reboot tomorrow afternoon.

I'm so sorry people. In the night I bricked the WDR4300-Test-node with my flashing (wget an uncomplete Image and sysupgraded). This morning I replaced the WDR4300 with an older WR1043v1.
Each time I try to do on the WR1043

cat  /sys/kernel/slab/kmalloc-4096/trace
0
echo 1 > /sys/kernel/slab/kmalloc-4096/trace

This kills the router and need to powercyle.

After all I'm also not able to get the WDR4300 in recovery-mode to tftp a gluon-factory-image as in the howto https://openwrt.org/toh/tp-link/tl-wdr4300.
For the PC it looks like the node-swichtports are "dead" most times.

@tackin The need to power-cycle sounds like it is still working, but overwhelmed with the debugging. That's a good sign. If you could connect it using the serial console, we'd likely see some useful messages.

I set /sys/kernel/debug/ieee80211/phy0/ath9k/airtime_flags to 0 on a few testnodes with very frequent OOM events since updated to 2018.2.1.

It did a great impact on the uptime of two Ubiquiti NSM2 loco with xm hardware. After suffering OOMs about every hour, both nodes now have an uptime of 4 days. Right now both have >40 clients and are still healthy.

On a TP-Link Archer C7 v5 the setting on phy1 (phy0 does not have airtime_flags) also noticeably improved the stability. Only a single OOM event occurred afterwards, probably caused by the ath10k device. The prior uptime was about 50 minutes with 40-50 clients.

What does the "airtime fairness" do and how big is it's impact on throughput?

@CodeFetch
The device is about 20km away from my home and has no Serial-connector. I also have no serial-UART.
A friend is back form sailing next sunday. If he has time he can help with uart and soldering a serial.
The WR1043v1 is this one: https://draco.freifunk-trier.starletp9.de:3000/d/q8Ud1fJiz/single-nodes?orgId=1&var-node_name=54309_Test_1043_xx3
with very less free RAM atm. Not sure if you might see anything important. OOM occured with in seconds, not minutes like on the 4300.

@DL3DCF Airtime fairness is a great feature, but the kernel version 4.9 that we currently use does only support it, because nbd patched it. Thus it might be more stable with a newer kernel version. It measures the airtime it uses to send data to the clients and distributes the times equally among the clients. This is useful, because a client with a very low data rate/bad signal could take up most of the airtime, while transferring very few data. When you have many clients there likely is one that has a bad signal and all the other clients suffer from it. With airtime fairness it's like: Your signal sucks. That's your fault. I don't spend all my airtime on you, because I would let others suffer who can receive a 100 times more data in that same time frame.

Edit: The impact on total throughput is an increase of about 80% in worst-case scenarios. In average it's about 30-40%. If all clients have a similar signal strength there is no gain at all.

Edit2: I think that 80% number is wrong. I think it could even be more than 100%, but I'm not sure. I've got those figures from a talk.

@tackin Did it reboot on its own or did you need to power-cycle it?

@CodeFetch most times it reboots on it's own. One single time I needed to power-cycle (after 12:30Uhr) because of the "echo 1" command.

@tackin

One single time I needed to power-cycle (after 12:30Uhr) because of the "echo 1" command.

That's what I wanted to know. The command makes the router unresponsive, but does not cause a reboot -> it's still running -> the command works, but makes everything super-slow. Thus we should be able to still get an output via serial.

Actually that's why I wanted someone to reproduce this issue in a "lab environment" using fake clients as suggested in https://github.com/freifunk-gluon/gluon/issues/1768#issuecomment-506368366

It does not really make sense to try to hunt such a bug down on a device deployed in production.

4 days later: Does anybody currently work on the issue? what is required?

1) If it's just lacking a few Archers: let me know a shipping address.
2) if it's testbed: I can put 2-3 devices with RS232 attached on a vlan switch plus a vm with access to tagged IF, so you can do whatever you like with those units remote (just powercycling will be a manual process for the moment. But wait, i have so TP-Link Wifi things...
3) no, i can't compile with all suggested debug options and read logs.

@Adorfer: we will need about 45 WLAN-Clients simultaneously as well to trigger the problem.
My last Test-WDR4300 in the crowded public pool is dead, sorry. A friend might be able to fix it. So I'm out of the game as long as there is no replace for a device.
But after all, I'm only able to connect via ssh to that node. For a laboratory condition not the best idea.
Perhaps someone else is closer to a needed environment.
Let me know.

@tackin @Adorfer @DL3DCF @awlx @TomSiener It would be very helpful, if someone could recreate this issue in a controlled environment.

Thus I've written this script: https://github.com/freifunk-gluon/gluon/issues/1768#issuecomment-506368366
Take a bunch of WR841Ns with plain OpenWrt and use them to simulate clients with this script. This script creates STA VIFs. I don't know how many it can create. Theoretically about 2000.

Maybe this is not enough to trigger the bug. Maybe some "clients" need to have a bad signal, maybe it needs a specific traffic flow. Please experiment and reproduce this issue in a controlled environment.

@CodeFetch I don't have the skills nor the hardware to build that up. But I will ask my friend if we perhaps could create a labor in the pool-area next week for a day with a serial-connector connected to a device and see what the node tells us. And just hope it's crowded enough for the bug to occur. But please don't rely on that to happen.

@CodeFetch I'm not very familiar with debugging but I have several devices lying around and could do a little test setup next week. Could you please give a small summary of

  • what branch should I use (master, 2018.2.2 ?)
  • what compiler options should be used
  • what configs should I modify
  • what logs and outputs do you need

So we have a common ground for testing.

@DL3DCF

  • what branch should I use (master, 2018.2.2 ?)

This commit

  • what compiler options should be used

GLUON_DEBUG=1

  • what configs should I modify

none

  • what logs and outputs do you need

To start logging execute echo 1 > /sys/kernel/slab/kmalloc-4096/trace
and then I need the dmesg output while the router works good and when it starts filling its memory

echo 1 > /sys/kernel/slab/kmalloc-4096/trace
makes my TP-Link WDR4300 immediately unresponsive in the terminal of a ssh session. ssh-session get shut down/cut, but the node still runs and don't reboot. Is this normal/expected for this device?
So the only way is a serial-connector to get the dmesg-data requested?

So the only way is a serial-connector to get the dmesg-data requested?

this is what i understood from irc.

Building a firmware with GLUON_DEBUG=1 fails with the following warning in my environment:
WARNING: Image file /.../gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/tmp/openwrt-ar71xx-generic-dap-1330-a1-squashfs-factory.img is too big
LastLinesOfBuild.log

"Image too big" is pretty clear. Maybe you can remove some superfluous packages?

Setting GLUON_DEBUG=1 requires a device or target with at least 16 MB of flash space. The build of devices with a smaller flash space might not work. https://gluon.readthedocs.io/en/latest/user/getting_started.html?highlight=GLUON_DEBUG#special-variables

@CodeFetch @mweinelt @Adorfer I need to come back to my last problem/question:

echo 1 > /sys/kernel/slab/kmalloc-4096/trace
makes my TP-Link WDR4300 immediately unresponsive in the terminal of a ssh session. ssh-session get shut down/cut, but the node still runs and don't reboot. Is this normal/expected for this device?
So the only way is a serial-connector to get the dmesg-data requested?

If so and the node (a WDR4300 with 100MB free RAM) is not responsive any more, I do not have any output/info of how many clients are on the node as I don't get any further respondd-messages to yanic, nor can I see if a client is still served / logged in. I simply cannot see if the node is working in "ok-mode" or in the "RAM-leak-mode". And if the node is not responsive, my guess is, clients are no longer served, get locked out and RAM-leak will never happen in this phase.
So what do you expect me to see on such a non responsive/blind node?

@tackin It does not make sense without a serial console.

When you see that the memory is increasing, you need to turn tracing on using echo 1 > /sys/kernel/slab/kmalloc-4096/trace, get the output of dmesg and you can turn it off again using echo 0 > /sys/kernel/slab/kmalloc-4096/trace afterwards.

And the same output I need when the node is running smoothly.

Is there a way for an automatic start of the trace if free memory is under xy and sending it to a remote logserver or something?

@A-Kasper Someone would need to write a script for that. What I actually want is someone reproducing this issue in a controlled environment. Everything else is Mist.

"Mist" :-)
All I can offer is a public pool. But how about debuging this on an upcoming congress or meeting event?

I've got some devices installed in streetslights with this behaviour. I can't install a serial console and wondered if I can help somehow with what I've got... ;)

echo 1 > /sys/kernel/slab/kmalloc-4096/trace results in not repsonding and rebooting after a few seconds on my WR1043Nv5 with master-build and GLUON_DEBUG enabled.

@TomSiener Do you use a serial console?

No, I thought there's no need for this effort, if the router stops working before I can connect with one or many wireless clients.

@TomSiener Okay, now I understand. The reason why it hangs is likely that it's a huge amount of debug messages, but you should see them on the serial console before the router reboots.

@CodeFetch Here's the trace captured by serial console from boot until reboot (about 20s after activating the trace).
There was no wireless client connected.
WR1043v5 with build from master branch and GLUON_DEBUG=1
WR1043v5_Debug_ TRACE_kmalloc-4096.txt

@TomSiener Thank you very much! It is likely rebooting, because it is dumping the memory content and the speed of the UART interface is very limited...

Please test a build with this commit.
It should turn off the object memory dumping.

@CodeFetch :
My TP-Link-WDR4300 reboots after echo 1 > /sys/kernel/slab/kmalloc-4096/trace
in the serial console.
I compiled the firmware with your patch "patches/openwrt/0024-kernel-turn-off-object-dumps-on-SLUB-debug-tracing.patch"
(I guess it's suffisant to throw the patch-file in the dir before compiling.)
My serial output:
https://gist.github.com/tackin/8943bbab19790abec7e1c54e0ca3e3df#file-tp-link_wdr4300_kmalloc-4096_trace-txt

Okay. Thank you very much for your help. As there are many, many allocations the serial output will always crash the router after some time. Thus the next step is that we need to rewrite the tracing function to check the calling function of the call stack and instead of dumping the information to the serial console, collect statistics about freed and allocated slabs. Then we can see which slabs aggregate respectively the call stack that does this allocation. This is a bit of work and I'm disappointed that such a feature does not exist in the linux kernel already, but bringing this upstream is out of scope. Thus consider it as a hack. This way you might use this feature on your production nodes without them crashing...

@T-X I just wonder... Where do the batadv_algo_dump get called all the time?

Yes, I can confirm. Same behaviour as before even with no object debug infos.
WR1043v5_Debug_ TRACE_kmalloc-4096_v2.txt

@CodeFetch Could you tell me how to fetch your commit with git ?
git checkout with the commitID is obviously not then right way:
fatal: reference is not a tree: b6b61d6999af3d853d1789e4eeac5d6689477e5b

Could you tell me how to fetch your commit with git

You have to switch to branch dbg/memleak https://github.com/freifunk-gluon/gluon/tree/dbg/memleak

It's not the memleak branch.

On Github you can always append the ".patch" suffix to get a commit as patch file.
Just do git am filename.patch.

Alternatively clone my branch with git clone https://github.com/codefetch/gluon -b patch-10.

Or even to add complexity add it as remote upstream and cherry pick the commit.

Too bad, that this needs serial access. This router is now under heavy stress:
https://map.ffmuc.net/#!/en/map/98ded0a79164

The tracing function needs to be rewritten to become a profiling function instead and only output statistics sporadically. Until then even the serial output will not help, as it will crash the router...

will an x86 (futro...) with ath9k-PCI card perhaps help, e.g. dumping to an attached HDD?

(some TPlink 2x2 2.4GH can be found for 20€ on ebay)

It's not nice to write to disk from such a kernel module. Ordinarily SLAB caches have a specific name and you could easily use the tracing function, because they wouldn't be allocated/freed so often, but the problem we have here is that these are packets. This is the most time-critical part in the Linux kernel and there is just no function to do profiling for that, yet. And writing a function for this is not easy as you get in locking hell if you do not work cleanly.

if it doesn't have other reasons, i can see this too on a ubiquiti rocket m2 with 64m ram:
https://stats.tecff.de/d/000000003/node?orgId=1&var-node_id=68725160becc&from=1563442025404&to=1563484908887
the threshold is around 30 i guess

@CodeFetch:
Today I could dump two serial-console-outputs.

  1. after starting the device (a WDR4300)
  2. after prox. 50MB RAM-loss
    https://pastebin.com/UqkQyruT
    https://pastebin.com/7FAhbZdv

@tackin Wow! Great... This could be very helpful.

@nbd168 It seems that there are being copied much more skbs than ever get freed. It seems to be a static function called by ieee80211_mark_rx_ba_filtered_frames somewhere in net/mac80211/rx.c doing a skb_copy. Do you have a clue how to find this out without doing it manually?

For reference in the dump provided by tackin the following functions allocated that number of skbs that were not freed afterwards and I see this in all dumps so far:

pskb_expand_head : 8
ieee80211_probereq_get : 0
batadv_add_bcast_packet_to_list : 4
ieee80211_mark_rx_ba_filtered_frames : 162

And here's the trace:

[ 1802.452167] TRACE kmalloc-4096 alloc 0x83796680 inuse=7 fp=0x  (null)
[ 1802.458715] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.182 #0
[ 1802.464627] Stack : 805476c2 00000032 00000000 00000000 804c8fc4 804c8c27 80472060 00000000
[ 1802.473171]         80543800 802afc34 00210d00 83b606e0 00000000 800a9a6c 00000000 00000006
[ 1802.481706]         00000006 84b24ec0 80475b0c 87c09bcc 00000000 800d88dc 805476c2 00000039
[ 1802.490242]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1802.498778]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1802.507314]         ...
[ 1802.509807] Call Trace:
[ 1802.512299] [<8006b210>] show_stack+0x70/0x8c
[ 1802.516727] [<80117260>] alloc_debug_processing+0x98/0x1d8
[ 1802.522294] [<80117c58>] ___slab_alloc.constprop.27+0x1e8/0x290
[ 1802.528303] [<80117d20>] __slab_alloc.constprop.26+0x20/0x44
[ 1802.534051] [<8011a58c>] __kmalloc_track_caller+0xf8/0x1b0
[ 1802.539628] [<802afb44>] __kmalloc_reserve.isra.3+0x40/0xa4
[ 1802.545289] [<802afc34>] __alloc_skb+0x8c/0x17c
[ 1802.549890] [<802b0110>] skb_copy+0x48/0xbc
[ 1802.554276] [<876a4678>] ieee80211_mark_rx_ba_filtered_frames+0xac4/0x141c [mac80211]
[ 1802.562263] [<802b363c>] __build_skb+0x40/0xc8

Hm... I don't see where this function should be called at all. As far as I can see it is called by an Intel WLAN driver only.

@CodeFetch +0xac4 is a huge offset, I assume it actually points to a different function that was stripped from the symbol table.

Backtraces in OpenWrt up to 18.06 are more ore less useless because of target/linux/generic/hack-4.9/202-reduce_module_size.patch, unless CONFIG_PROFILING is set (CONFIG_KERNEL_PROFILING in OpenWrt .config). Either set the config option, or remove that patch (as it was done in the OpenWrt master).

Hm. I guessed so and wondered about the strange call stack... I've thought the stripping was deactivated by GLUON_DEBUG. Maybe we should do this by default. Thank you for the hint!

@tackin Please build a firmware with this patch. And just do a trace when the router is up and running. I expect to recognize the problematic function in the idle state, too.

@CodeFetch Sorry but the build refused to finish. It ends with the errors below:
The only thing I added after the last successfull run with your patch "0024" was: config 'CONFIG_KERNEL_PROFILING=y'
My branch is v2018.2.x

Without "CONFIG_KERNEL_PROFILING=y" the build works.

In targets/generic:
...
if [ "$GLUON_DEBUG" = 1 ]; then
config 'CONFIG_DEBUG=y'
config 'CONFIG_NO_STRIP=y'
config 'CONFIG_KERNEL_SLABINFO=y'
config '# CONFIG_USE_STRIP is not set'
config '# CONFIG_USE_SSTRIP is not set'
config 'CONFIG_KERNEL_PROFILING=y'
fi
....


my make-gluon.sh:

!/bin/bash

VERSION=0.11.9.d02+tackin
OPTIONS="GLUON_RELEASE=$VERSION GLUON_BRANCH=stable GLUON_DEBUG=1"
cd gluon
make update
set -x

for ARCH in ar71xx-generic ; do
trap ": user abort; exit;" SIGINT SIGTERM # so CTRL+C will exit the loop
make GLUON_TARGET=$ARCH $OPTIONS V=s || exit 1
done


Build errored with:
...
[======================================================================================================================================|] 1258/1258 100%
Exportable Squashfs 4.0 filesystem, xz compressed, data block size 65536
compressed data, compressed metadata, compressed fragments, no xattrs
duplicates are removed
Filesystem size 6728.33 Kbytes (6.57 Mbytes)
30.97% of uncompressed filesystem size (21728.38 Kbytes)
Inode table size 10020 bytes (9.79 Kbytes)
22.06% of uncompressed inode table size (45418 bytes)
Directory table size 13816 bytes (13.49 Kbytes)
45.80% of uncompressed directory table size (30165 bytes)
Number of duplicate files found 166
Number of inodes 1335
Number of files 987
Number of fragments 34
Number of symbolic links 212
Number of device nodes 1
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 135
Number of ids (unique uids + gids) 1
Number of uids 1
root (0)
Number of gids 1
root (0)
[ -f /home/tackin/freifunk/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/koala-kernel.bin -a -f /home/tackin/freifunk/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/root.squashfs+pkg=c902f989 ]
Makefile:68: recipe for target '/home/tackin/freifunk/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/tmp/openwrt-ar71xx-generic-koala-squashfs-sysupgrade.bin' failed
make[6]: * [/home/tackin/freifunk/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/tmp/openwrt-ar71xx-generic-koala-squashfs-sysupgrade.bin] Error 1
make[6]: Leaving directory '/home/tackin/freifunk/gluon/openwrt/target/linux/ar71xx/image'
Makefile:24: recipe for target 'install' failed
make[5]:
[install] Error 2
make[5]: Leaving directory '/home/tackin/freifunk/gluon/openwrt/target/linux/ar71xx'
Makefile:13: recipe for target 'install' failed
make[4]:
[install] Error 2
make[4]: Leaving directory '/home/tackin/freifunk/gluon/openwrt/target/linux'
time: target/linux/install#160.95#15.30#208.63
target/Makefile:23: recipe for target 'target/linux/install' failed
make[3]:
[target/linux/install] Error 2
make[3]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
target/Makefile:19: recipe for target '/home/tackin/freifunk/gluon/openwrt/staging_dir/target-mips_24kc_musl/stamp/.target_install' failed
make[2]:
[/home/tackin/freifunk/gluon/openwrt/staging_dir/target-mips_24kc_musl/stamp/.target_install] Error 2
make[2]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
/home/tackin/freifunk/gluon/openwrt/include/toplevel.mk:216: recipe for target 'world' failed
make[1]:
[world] Error 2
make[1]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
Makefile:150: die Regel für Ziel „all“ scheiterte
make: *
* [all] Fehler 2

Okay, I'll try a build tomorrow.

Hmm yes, the full CONFIG_KERNEL_PROFILING does a lot more than only fixing the missing symbols, this may push some images over their kernel size limit. Removing target/linux/generic/hack-4.9/202-reduce_module_size.patch should work better.

@NeoRaider
Removing target/linux/generic/hack-4.9/202-reduce_module_size.patch and build without "config 'CONFIG_KERNEL_PROFILING=y'" leads to the error:
...
make[3]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
export MAKEFLAGS= ;make -w -r checksum
make[3]: Entering directory '/home/tackin/freifunk/gluon/openwrt'
make[3]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
make[2]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
make[1]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
cp: cannot stat 'openwrt/bin/targets/ar71xx/generic/openwrt-ar71xx-generic-hornet-ub-squashfs-factory.bin': No such file or directory
Makefile:150: die Regel für Ziel „all“ scheiterte
make: * [all] Fehler 1

  • exit 1

cp: cannot stat 'openwrt/bin/targets/ar71xx/generic/openwrt-ar71xx-generic-hornet-ub-squashfs-factory.bin': No such file or directory

Generated image was too big for this device.

@tackin On what device are you testing this? If it has 16 MB flash then you should find an image for it in openwrt/bin/targets/ar71xx/generic/

Upps! Didn't expect that behaviour in ar71xx-generic build. But how to make only the "tp-link-wdr4300-v1"-image to avoid the build error?

@tackin Pass DEVICES='tp-link-tl-wdr4300-v1' (v2018.2.x and older) or GLUON_DEVICES='tp-link-tl-wdr4300-v1' (Gluon master).

@NeoRaider I did:
VERSION=0.11.9.d03+tackin
OPTIONS="GLUON_RELEASE=$VERSION GLUON_BRANCH=stable GLUON_DEBUG=1 DEVICES='tp-link-tl-wdr4300-v1'"
make update
for ARCH in ar71xx-generic ; do
make GLUON_TARGET=$ARCH $OPTIONS V=s || exit 1
done

But get an error:
Make[1]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
Error: unknown devices given: 'tp-link-tl-wdr4300-v1'
Makefile:126: die Regel für Ziel „config“ scheiterte
make: * [config] Fehler 1

  • exit 1

@tackin Because of the quoting you use, the single quotes around the device name seem to be passed as a part of the argument... just remove them. Passing multiple arguments through a single shell variable is always weird, and sometimes requires 'eval' to work at all.

ah ... needed to make it ... DEVICES=tp-link-tl-wdr4300-v1

yes, got it .... sorry for that.

@NeoRaider but still fails with ...

make[3]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
export MAKEFLAGS= ;make -w -r checksum
make[3]: Entering directory '/home/tackin/freifunk/gluon/openwrt'
make[3]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
make[2]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
make[1]: Leaving directory '/home/tackin/freifunk/gluon/openwrt'
cp: cannot stat 'openwrt/bin/targets/ar71xx/generic/openwrt-ar71xx-generic-tl-wdr4300-v1-squashfs-factory.bin': No such file or directory
Makefile:150: die Regel für Ziel „all“ scheiterte
make: * [all] Fehler 1

  • exit 1

But it worked before removing the patch? CONFIG_KERNEL_PROFILING is unset again? We'll need a more complete log to see what's going on here.

this bug isn't the place to discuss build issues, please use the IRC chat for this, i'll hide the comments as off-topic.

@rotanid Although I fully understand and support your aim, it is totally necessary to solve that build problem simply to deliver CodeFetch the damn dump from me to probably understand/solve the bug. And as far as I see, the build error is not that trivial.
So, any help in solving this particular buildproblem, helps us to solve the bug itself.
I would not like to only have this builderror in the irc-history, moving out of the focus the next hours.
For now I'd like to keep this bloody buildthing a part of the issue solution, but can be deleted/marked as off topic as soon as I'm able to serve you that single dump.

@NeoRaider
[…] "But it worked before removing the patch?" -> YES
"CONFIG_KERNEL_PROFILING is unset again?" -> YES
A build with all make parameters/options and openwrt/target/linux/generic/hack-4.9/202-reduce_module_size.patch in place runs perfect. My branch is 2018.2.x I try a new build now without the 202-reduce_module_size.patch an it failed again. Build-Log: http://tackin.de/debug_01

@tackin *** error: images are too big by 151131 bytes

On what device are you testing this? If it has 16 MB flash then you should find an image for it in openwrt/bin/targets/ar71xx/generic/

Unfortunately the WDR4300 only has 8 MB flash.

Is this a mesh-only router? Because then you could get rid of fastd, status page etc. and it might fit.

@CodeFetch Thank you for the kind help. I managed to make it fit, but needed to get rid of nearly all features.
Here is your dump: http://tackin.de/dump_01

@rotanid You may mark all my build-comments as off topic now. Thank you.

Okay... This seems to be the trace that is allocating all the frames that do not get processed in time.
This trace allocates 37 skbs.

alloc_debug_processing+0x98/0x1d8
___slab_alloc.constprop.27+0x1e8/0x290
__slab_alloc.constprop.26+0x20/0x44
__kmalloc_track_caller+0xf8/0x1b0
__kmalloc_reserve.isra.3+0x40/0xa4
__alloc_skb+0x8c/0x17c
skb_copy+0x48/0xbc
ieee80211_prepare_and_rx_handle+0x890/0x11e8
ieee80211_rx_napi+0x7d8/0x8c8
ath_rx_tasklet+0xc0c/0xdac
ath9k_tasklet+0x23c/0x2ac
tasklet_action+0x8c/0xc4
__do_softirq+0x164/0x294
plat_irq_dispatch+0xf8/0x110
handle_int+0x138/0x144
r4k_wait_irqoff+0x18/0x24
cpu_startup_entry+0xa8/0xe0
start_kernel+0x44c/0x470

and only frees 11 of these skbs using this trace:

show_stack+0x70/0x8c
free_debug_processing+0x2d8/0x300
__slab_free+0x5c/0x27c
kfree+0x150/0x160
__kfree_skb+0x14/0x98
ieee80211_prepare_and_rx_handle+0xd98/0x11e8
ieee80211_rx_napi+0x7d8/0x8c8
ath_rx_tasklet+0xc0c/0xdac
ath9k_tasklet+0x23c/0x2ac
tasklet_action+0x8c/0xc4
__do_softirq+0x164/0x294
plat_irq_dispatch+0xf8/0x110
handle_int+0x138/0x144
r4k_wait_irqoff+0x18/0x24
cpu_startup_entry+0xa8/0xe0
start_kernel+0x44c/0x470

So there are 26 skbs which seem to get lost somewhere in-between.

Here is a good example:

TRACE kmalloc-4096 alloc 0x86a5e680 inuse=7 fp=0x  (null)
CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.182 #0
Stack : 805276c2 00000032 00000000 00000000 804b2ce4 804b2947 8045d4c0 00000000
        80523800 802a0864 00210d00 86876140 00000000 800a9a6c 00000000 00000006
        00000006 00000001 80460f70 87c09bcc 00000000 800d7e7c 805276c2 00000039
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        ...
Call Trace:
[<8006b210>] show_stack+0x70/0x8c
[<80116800>] alloc_debug_processing+0x98/0x1d8
[<801171f8>] ___slab_alloc.constprop.27+0x1e8/0x290
[<801172c0>] __slab_alloc.constprop.26+0x20/0x44
[<80119b2c>] __kmalloc_track_caller+0xf8/0x1b0
[<802a0774>] __kmalloc_reserve.isra.3+0x40/0xa4
[<802a0864>] __alloc_skb+0x8c/0x17c
[<802a0d40>] skb_copy+0x48/0xbc
[<86924678>] ieee80211_prepare_and_rx_handle+0x890/0x11e8 [mac80211]
[<869257a8>] ieee80211_rx_napi+0x7d8/0x8c8 [mac80211]
[<87747c38>] ath_rx_tasklet+0xc0c/0xdac [ath9k]
[<87744f10>] ath9k_tasklet+0x23c/0x2ac [ath9k]
[<80083374>] tasklet_action+0x8c/0xc4
[<80082c4c>] __do_softirq+0x164/0x294
[<801ff628>] plat_irq_dispatch+0xf8/0x110
[<80066358>] handle_int+0x138/0x144
[<80067e4c>] r4k_wait_irqoff+0x18/0x24
[<800a4888>] cpu_startup_entry+0xa8/0xe0
[<804d9b98>] start_kernel+0x44c/0x470
TRACE kmalloc-4096 free 0x86a5e680 inuse=6 fp=0x86a5c480
CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.182 #0
Stack : 805276c2 00000032 00000000 00000000 804b2ce4 804b2947 8045d4c0 00000000
        80523800 80467990 80467770 86a5e680 80550000 800a9a6c 00000000 00000006
        00000006 02090220 80460f70 87c09c2c 00000000 800d7e7c 805276c2 00000038
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        ...
Call Trace:
[<8006b210>] show_stack+0x70/0x8c
[<80117cb0>] free_debug_processing+0x2d8/0x300
[<80117d34>] __slab_free+0x5c/0x27c
[<80118210>] kfree+0x150/0x160
[<8029ded0>] __kfree_skb+0x14/0x98
[<86924b80>] ieee80211_prepare_and_rx_handle+0xd98/0x11e8 [mac80211]
[<869257a8>] ieee80211_rx_napi+0x7d8/0x8c8 [mac80211]
[<87747c38>] ath_rx_tasklet+0xc0c/0xdac [ath9k]
[<87744f10>] ath9k_tasklet+0x23c/0x2ac [ath9k]
[<80083374>] tasklet_action+0x8c/0xc4
[<80082c4c>] __do_softirq+0x164/0x294
[<801ff628>] plat_irq_dispatch+0xf8/0x110
[<80066358>] handle_int+0x138/0x144
[<80067e4c>] r4k_wait_irqoff+0x18/0x24
[<800a4888>] cpu_startup_entry+0xa8/0xe0
[<804d9b98>] start_kernel+0x44c/0x470
TRACE kmalloc-4096 alloc 0x86a5e680 inuse=7 fp=0x  (null)
CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.182 #0
Stack : 805276c2 00000032 00000000 00000000 804b2ce4 804b2947 8045d4c0 00000000
        80523800 802a0864 00210d00 8687f6e0 00000000 800a9a6c 00000000 00000006
        00000006 801ff628 80460f70 87c09bcc 00000000 800d7e7c 805276c2 00000039
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
        ...
Call Trace:
[<8006b210>] show_stack+0x70/0x8c
[<80116800>] alloc_debug_processing+0x98/0x1d8
[<801171f8>] ___slab_alloc.constprop.27+0x1e8/0x290
[<801172c0>] __slab_alloc.constprop.26+0x20/0x44
[<80119b2c>] __kmalloc_track_caller+0xf8/0x1b0
[<802a0774>] __kmalloc_reserve.isra.3+0x40/0xa4
[<802a0864>] __alloc_skb+0x8c/0x17c
[<802a0d40>] skb_copy+0x48/0xbc
[<86924678>] ieee80211_prepare_and_rx_handle+0x890/0x11e8 [mac80211]
[<869257a8>] ieee80211_rx_napi+0x7d8/0x8c8 [mac80211]
[<87747c38>] ath_rx_tasklet+0xc0c/0xdac [ath9k]
[<87744f10>] ath9k_tasklet+0x23c/0x2ac [ath9k]
[<80083374>] tasklet_action+0x8c/0xc4
[<80082c4c>] __do_softirq+0x164/0x294
[<801ff628>] plat_irq_dispatch+0xf8/0x110
[<80066358>] handle_int+0x138/0x144
[<80067e4c>] r4k_wait_irqoff+0x18/0x24
[<800a4888>] cpu_startup_entry+0xa8/0xe0
[<804d9b98>] start_kernel+0x44c/0x470

A skb page's refcount is being increased and directly decreased 0.2 seconds later and then it's recount is increased again and not decreased for at least 25 seconds/until the router reboots.
Maybe the buffer is just not filled at this point, but this is very strange. Where is the queue where these packets reside?

Does it have anything to do with our newer batman version that our CPE210 does not reboot that often anymore with 40 Clients and more? Or is it just pure luck?
https://map.ffmuc.net/#!/de/map/98ded0a79164
https://stats.ffmuc.net/d/hRIn3dRWk/mesh-nodes?orgId=1&var-nodeid=98ded0a79164

Yeah ... shortly after being happy that it doesn't reboot that often anymore ;).

Would it be helpful for debugging and reproduction, when we run a x86 Futro with a ath9k pci wifi card at a place where a high number of users? If yes, we can give developers access to the systems.

Would it be helpful for debugging and reproduction, when we run a x86 Futro with a ath9k pci wifi card at a place where a high number of users? If yes, we can give developers access to the systems.

maybe someone can correct me, but afaik the ath9k chips in routers are not the same as the ones on PCI or USB cards

AFAIK, ath9k chips in routers and on pci cards are the same. USB devices using other chipsets.

i am not sure about PCI cards, but router-wifi no embedded in the SoC are mostly connected via PCIe lines. So the only question imO would be: Is there a difference for the linux-drivers if a devices is on PCI or PCIe? (PCIe on a Futro is a bit of a challenge, but i have at least one set of "x1"-to-x2 slot for S900 which i could spare. plus an S900 off course.
BTW: on the location (larger pub) there is an Archer-C7v2 at the moment with frequent OOM+load+reboot when passing 50+ clients.

I have a IGEL H700C with ath9k WiFi card:

"lspci -tvv" output:
-+-[0000:80]---01.0 VIA Technologies, Inc. VT8237A/VT8251 HDA Controller \-[0000:00]-+-00.0 VIA Technologies, Inc. CN896/VN896/P4M900 Host Bridge +-00.1 VIA Technologies, Inc. CN896/VN896/P4M900 Host Bridge +-00.2 VIA Technologies, Inc. CN896/VN896/P4M900 Host Bridge +-00.3 VIA Technologies, Inc. CN896/VN896/P4M900 Host Bridge +-00.4 VIA Technologies, Inc. CN896/VN896/P4M900 Host Bridge +-00.5 VIA Technologies, Inc. CN896/VN896/P4M900 I/O APIC Interrupt Controller +-00.6 VIA Technologies, Inc. CN896/VN896/P4M900 Security Device +-00.7 VIA Technologies, Inc. CN896/VN896/P4M900 Host Bridge +-01.0-[01]----00.0 VIA Technologies, Inc. CN896/VN896/P4M900 [Chrome 9 HC] +-02.0-[02]-- +-03.0-[03]-- +-0f.0 VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE +-10.0 VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 Controller +-10.1 VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 Controller +-10.2 VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 Controller +-10.3 VIA Technologies, Inc. VT82xx/62xx UHCI USB 1.1 Controller +-10.4 VIA Technologies, Inc. USB 2.0 +-11.0 VIA Technologies, Inc. VT8237S PCI to ISA Bridge +-11.7 VIA Technologies, Inc. VT8237/8251 Ultra VLINK Controller +-13.0 VIA Technologies, Inc. VT8237A Host Bridge \-13.1-[04]--+-03.0 Qualcomm Atheros AR9227 Wireless Network Adapter \-06.0 Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet
See dmesg output in attached file
wlanigel.txt

I once built an offloader for a friend, based on a Wyse R90LW with a PCIe to mPCIe adapter board and a dual band Atheros WiFi card on it. I can ask him to lend it back to me for tests.

ath9k WiFi card:

Which make?
(this is what we are discussing, vendors are mostly rather restrictive upon stating chipsets used)

TP-Link TL-WN851ND

Here's a picture: https://twitter.com/kpanic/status/1001420256596910081

Just to mention: one of my Archer C7 also has that same problem but is ath10k, or am I wrong?

Maybe we should close this one, too as I've summarized the observed problems in #1944 and this issue did not receive feedback for a while.

It seems the issue got resolved with Gluon 2020.2.x as we now see CPE210 with 60 clients and more without a reboot:
https://map.ffmuc.net/#!/en/map/98ded0a79164

@awlx From the statistics it does not look like this router had this problem at all.

It had last year. See comment above and next 3:
https://github.com/freifunk-gluon/gluon/issues/1768#issuecomment-525290243

But we clear the stats after (at latest) 180 days and earlier when we hit low disk space.

I agree that the issue seems to be resolved with v2020.2 and would like to close this issue if noone disagrees.

no objections, closing.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

lemoer picture lemoer  Â·  3Comments

2tata picture 2tata  Â·  3Comments

lephisto picture lephisto  Â·  5Comments

lcb01a picture lcb01a  Â·  3Comments

RalfJung picture RalfJung  Â·  5Comments