Gluon: Kernel error: unregister_netdevice: waiting for mesh-vpn to become free. Usage count = 1

Created on 12 Nov 2017  路  38Comments  路  Source: freifunk-gluon/gluon

I am seeing the following kernel error on some of my nodes:

unregister_netdevice: waiting for mesh-vpn to become free. Usage count = 1

In the past, I used to see an error very much like this on gateways; usually in the context of doing a reboot and then finding the machine stuck (and recoverable only via the server console). Batman had some bugs where it did not properly detach some interfaces. However, I haven't seen this error on gateways ever since we updated them to Batman 2017.x.

I am running two nodes, both with Gluon 2017.1.3. One, for testing, is an x86 node running in a VM via libvirt. There, I can pretty much reproduce the above error as follows:

  • Start the VM, make sure it is connected.
  • Put the host (i.e., my laptop) into stand-by.
  • Wait a little while (I usually get into the bus and drive home, or something like that).
  • Resume the host. Check the VM output, and I see the above error.

I then have to do a reboot of the VM (i.e., the node) to get it into a working state again.

Just now, I saw this error the first time on real hardware (TP-Link TL-WR841N/ND v10). I was trying to do /etc/init.d/tunneldigger restart, and it got stuck. logread shows that tunneldigger got the termination signal, but then failed to terminate. I sent kill -KILL to the result of pidof tunneldigger, to no avail. dmesg shows the above error, and it kept repeating. Ultimately I had to reboot the device.

Now I am a little worried that other devices might run into a similar issue when stopping services for an auto-update, and will just not work until someone notices and manually power cycles them.

bug upstream issue

Most helpful comment

All 38 comments

There were several such bugs found (and newly introduced?) up and down through the Linux network stack, for instance 8397ed36b7, 202f59afd44147 or f186ce61bb8235. And then there was a bigger cleanup from David Miller even: cf124db566e6b. Most of this seems to be present since 4.12 only, with no backport to stable kernels yet. So would be interesting to retry with some kernel >= 4.12.

i also thought, this error might be done.
that batman-adv behaviour is the most reason to recommend using "reboot -f " instead of "reboot", everytime batman-adv is is usage!

I have been seeing this on real hardware in the field (TP-Link TL-WR841N/ND v11) when the tunneldigger broker is restarted several times (ca. 3 times) within a period of about 30 minutes.

Just wanted to add some information: Yes, this this message points to a kernel bug, something is not decreasing a reference counter correctly there.

And yes, some years ago we had found similar issues in batman-adv and after that did a thorough review of this. People tried the according patches and reported their issue as fixed.

It seems that this issue reappears so far only for setups using tunneldigger instead of fastd? Could it be, that maybe therefore this time the l2tp kernel module might not handle its teardown properly? (as the issue is not reproduceable with fastd?)

Could it be, that maybe therefore this time the l2tp kernel module might not handle its teardown properly? (as the issue is not reproduceable with fastd?)

Sounds plausible.

I just noticed this in the firmware for Freifunk Hamburg (linux kernel 4.4.93) with two nodes. Both were affected after running "wifi".
It seems that batman was responsible for the issues there, as we do not use tunneldigger and "batctl o" and some other batman commands hung after that.
It's not (yet?) reliably reproducible though. After rebooting the node, running wifi did not trigger the issue again.

@tokudan: Which batman-adv and Gluon version were those two nodes running?

@T-X Freifunk Hamburg is at Gluon v2017.1.5 with batman-adv 2017.2 (from gluon-mesh-batman-adv-15).
The exact messages I saw on a TPLink 4900 while and after running "wifi":

Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client0: interface state ENABLED->DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client0: AP-DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client0: CTRL-EVENT-TERMINATING
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: nl80211: deinit ifname=client0 disabled_11b_rates=0
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.851473] device client0 left promiscuous mode
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.856335] br-client: port 4(client0) entered disabled state
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: interface state ENABLED->DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED c0:bd:d1:d2:f9:be
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED d8:5b:2a:84:a6:61
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED d0:65:ca:74:fd:09
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED 00:71:cc:cd:4b:8f
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: CTRL-EVENT-TERMINATING
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: nl80211: deinit ifname=client1 disabled_11b_rates=0
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.871863] device client1 left promiscuous mode
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.876851] br-client: port 5(client1) entered disabled state
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Network device 'client0' link is down
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Network device 'client1' link is down
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.937270] batman_adv: bat0: Interface deactivated: mesh1
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Network device 'mesh1' link is down
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Interface 'mesh_radio1' has link connectivity loss
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.977777] batman_adv: bat0: Interface deactivated: mesh0
Wed Feb 14 23:11:39 2018 daemon.notice netifd: Network device 'mesh0' link is down
Wed Feb 14 23:11:39 2018 daemon.notice netifd: Interface 'mesh_radio0' has link connectivity loss
Wed Feb 14 23:11:39 2018 kern.info kernel: [174634.052896] batman_adv: bat0: Removing interface: mesh1
Wed Feb 14 23:11:39 2018 daemon.notice netifd: Interface 'mesh_radio1' is disabled
Wed Feb 14 23:11:39 2018 kern.info kernel: [174634.179498] batman_adv: bat0: Removing interface: mesh0
Wed Feb 14 23:11:39 2018 daemon.info respondd[27822]: unable to read providers from '/usr/lib/respondd', ignoring
Wed Feb 14 23:11:49 2018 kern.emerg kernel: [174644.354817] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1
Wed Feb 14 23:11:59 2018 kern.emerg kernel: [174654.598617] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1
Wed Feb 14 23:12:09 2018 kern.emerg kernel: [174664.846348] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1

The last message kept repeating until a reboot and commands like "batctl o" hung.
Respondd seems to have failed as well, resulting in the node appearing offline on the map.

[Update: Gluon 2017.1.5 not ..2.5 of course...]

It seems a new refcounting issue was introduced a while ago, which would explain why reports have become more frequent recently: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=999bb66b20b03c753801ecebf1ec2a03c6a63c96

Please test if the issue is still reproducible with latest master or v2017.1.x.

I just tried with 2017.1.7 and could not reproduce this in my VM with a suspend-resume cycle. However, this is also no longer the same host machine that I used back in November.

when testing latest v2017.1.x ( 1304907106986343458ee245c7285c5156f2e5cc ) yesterday i ran into
Sat May 5 20:34:21 2018 kern.emerg kernel: [ 3944.265393] unregister_netdevice: waiting for primary0 to become free. Usage count = 1
during tests with the autoupdater-wifi-fallback and ssid-changer.
it didn't occur again after another reboot, though

Never mind, I am seeing this again in my VM. Probably the machine wasn't suspended long enough previously to trigger this, now it's been >90min and the error shows up again.

It would be great if someone who can reproduce this issue with v2017.1.x or master could check the next branch, so we know if kernel 4.9.y/4.14.y are affected as well. Also include the hardware target and the exact kernel version of your build with your feedback.

@RalfJung can reproduce it and he described in detail, how - so he or anyone else with time should be able to reproduce it

I don't have a tunneldigger setup at my disposal; fastd is not affected by the issue in the same way, as the mesh-vpn interface is not torn down on reconnects with fastd. I might be able to get the same behaviour by changing the fastd config a bit...

it happened to me in a fastd setup on 2018-05-05 while debugging another thing, but i'm not sure how to reproduce it as i was focussed on the other thing...

I might have time to try this later. @NeoRaider if you like, you could use our site file from https://git.hacksaar.de/FreifunkSaar/gluon-site and test with with our tunneldigger servers.

I compiled the current next branch (eef493d06e53e595bde1fbe30338a9b3ed90040c) and ran it in a VM. The issue occurred, as usual, after leaving the VM host suspended for the night.

i also have the error message now with my home router, but not with tunneldigger:
[336077.155375] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1
running Gluon v2017.1.7 , Kernel 4.4.129, batman-adv compat15, fastd, Router WDR4900 (dual-band ath9k), some wifi-mesh neighbors

EDIT: happened again today

on 2018/05/25 almost all of our devices running v2017.1.x were affected by this issue with mesh0/mesh1 not becoming free.
maybe commit f4e278e6c1616476d46ee732d808d182c9c76485 made things worse, not better? it's not the issue, tried without.

i have invested much more time now hoping to figure this out.
i can reproduce the issue now reliably, as i will describe below.
i could reproduce it with several Gluon versions:

  • several versions of the v2017.1.x branch, starting with the release v2017.1 up to the lastest update in branch (kernel 4.4.x, batman-adv 2017.1 and 2017.2)
  • Gluon master branch (upcoming v2018.1, kernel 4.4.x, batman-adv 2018.0)
    i couldn't reproduce it with Gluon v2016.2.7 (kernel 3.18.x, batman-adv 2016.2), although this doesn't mean the bug isn't there.

the reproduction is possibly unrelated to hardware, as i could reproduce with the following devices:

  • TP-Link WDR4900 (mpc85xx-generic dual-band)
  • TP-Link WDR3600 (ar71xx-generic ath9k dual-band)
  • TP-Link TL-WR1043ND v4 (ar71xx-generic ath9k single-band)
  • TP-Link Archer C59 (ar71xx-generic ath10k dual-band)

also, i could reproduce it with the following different settings:

  • Wireless Mesh on/off
  • LAN Mesh on/off
  • fastd VPN on/off (though it wasn't easy to test without fastd VPN enabled)

my reproduction works as follows:

  1. start (or reboot) the node
  2. wait until the gateway connection is pretty good ( batctl gwl )
  3. DROP all UDP packets coming from the Gluon node (either at your home router or on the gateway by using the public WAN IP address of the node)
  4. wait at least 8 minutes (more to be safe, wasn't always reproducible when waiting time was too short)
  5. do execute /etc/init.d/network restart on the node
  6. see dmesg | grep unregister

i used the community network of Freifunk Altdorf.
site.conf: https://github.com/tecff/site-ffa
examplary affected build: http://fw1.tecff.de/issue-1258-affected/

I just noticed the same after more and more nodes were detected as down by respondd / yanic (meshviewer) while bat-mesh was working.

Devices so far:

  • TP-LINK TL-WA801ND v2
  • Ubiquiti UniFi

Error:
unregister_netdevice: waiting for mesh0 to become free. Usage count = 1

I am using tunneldigger on latest gluon master (no fw customizing, no custom packages).
If this is an issue with reconnect to the broker because of packet loss on UDP (like test case above), there seems to be a second bug. The uplink were this node is located is stable and reliable and was not disconnected for the last two days (also there were are no errors visible in our logs and monitoring).

@rotanid You wanted to test batman-adv from gluon 2016.2.x. I've added it as branch batadv/2016.2.x in https://github.com/FreifunkVogtland/gluon/tree/batadv/2016.2.x with it.

I think it is reasonable to just use kvm for the tests as wifi mesh neighbors don't seem to be relevant here ("$" is on the host system, "#" is in OpenWrt/LEDE).

$ sudo iptables -D INPUT -p udp -j DROP
$ qemu-system-x86_64 -enable-kvm -snapshot -nographic -nic user,mac=02:11:22:33:44:01 gluon-*-x86-64.vmdk
wait 8 minutes for the node to be "stable" (check `batctl gwl`)
$ sudo iptables -A INPUT -p udp -j DROP
wait 8 minutes for the node to loose its neighbors in batman-adv and get into the bad state(?)
# /etc/init.d/network restart
wait for 60 seconds to see whether "unregister_netdevice: waiting for mesh0 to become free" appears in the kernel log

You can find more batman-adv versions between https://github.com/FreifunkVogtland/gluon/tree/batadv/2016.2.x and https://github.com/FreifunkVogtland/gluon/tree/batadv/2018.1.

  • DROP all UDP packets coming from the Gluon node (either at your home router or on the gateway by using the public WAN IP address of the node)
  • wait at least 8 minutes (more to be safe, wasn't always reproducible when waiting time was too short)

Interesting! I guess this is what I "achieved" by putting the laptop running the node (in a VM) into suspend -- effectively no packets can go in or out any more. Just putting it into suspend for a couple seconds didn't do anything, but leaving it in suspend for half an hour reliably triggers the problem.

@rotanid Can you test whether it:

  • 1e7b160c9ffc ("Switch to batman-adv 2016.4-0"): problem happens
  • 91a1dc08e3f7 ("Switch to batman-adv 2016.3-0"): problem doesn't happen

I have now also added the branches with all the patches from 2016.4 as separate commits:

@rotanid: I found a big problem in the gateway netlink code from Andrew. Not sure whether this is your problem but it is at least a problem which can cause such a problem in gluon 2017.1.x. I will post a patch later today

@ecsv thanks for your work!
i tested this by using your branches:
https://github.com/FreifunkVogtland/gluon/tree/batadv/gw_refcnt_fix
https://github.com/FreifunkVogtland/gluon/tree/batadv/gw_refcnt_fix-2017.1.x
i couldn't reproduce my issue anymore with these branches, while another device with unpatched v2017.1.x running the test at the same time still shows the issues.

@RalfJung please also test your case with at least one of those two branches

Merged for both master and v2017.1.x, thanks to everyone who helped getting to the root of this and fixing it!

and thanks to the UPS vendor used by Hetzner for the failing UPS on 2018/05/24 which lead to the exact length of downtime which triggered the bug for us ;)

I tried the 2017.1.x branch and was unable to reproduce.

I get it reproducably on a regular Linux machine with kernel

root@gw1:/home/moeller# systemctl stop fastd@$(systemd-escape ffoh-mesh-vpn)
Message from syslogd@gw1 at Jun 20 15:42:02 ...
kernel:[140542.561333] unregister_netdevice: waiting for ffoh-mesh-vpn to become free. Usage count = 1

with Debian buster, using only packages from the distribution on kernel

Linux version 4.16.0-2-amd64 ([email protected]) (gcc version 7.3.0 (Debian 7.3.0-19)) #1 SMP Debian 4.16.12-1 (2018-05-27)

Cheers,
Steffen

@smoe Kernel 4.16 doesn't have the most recent batman-adv fixes and it might take a while until it gets them, if ever.
if you want to fix this, you would have to build latest batman-adv maint branch by yourself for Debian Buster, i have only built packages for Jessie so far.

@rotanid, I was not completely sure how to interpret your reply.

Is there something to be done to shorten the time between batman-adv developments to an installable package for Debian unstable/testing? It should be possible for me to help with something along those lines, aiming at Debian experimental with it. Would like to consult you and the Debian kernel folks about best workflows, first, though.

Concerning the exact issue at hand I should not attempt to fix this myself. Would not know where to start. Was only meaning to help with an easy way to get it reproduced.

@smoe for official debian repositories, you will have to wait until there is a new official kernel including a new batman-adv release. there's little we can do to accelerate this.

the alternative is to build a debian package with the latest batman-adv kernel module by yourself, which i did for Debian Jessie with Kernel 4.9 from backports.
to do this after every batman-bugfix is enough work, i really don't want to also have to do this for more than one debian version and especially affecting testing: after each debian kernel update...
ideally, this would be automated and the packages be put into a repository, but this hasn't been done yet.

@rotanid Well, yes, if that could be automated then I offer to upload to Debian Experimental (or sign and upload elsewhere but since we have all other bits of the Freifunk gateway infrastructure in the distribution, I would like to stay close). I don't have any interaction with the Debian Kernel Team, they may have additional ideas, who I would feel like consulting with whatever plan we/you/I come up with. For instance if Debian would get the batman-adv module back as a separate package, then there could also be a batman-adv package distributed via the experimental section. Feel free contacting me via [email protected] to discuss things further.

@ecsv probably knows best, but since batadv was part of the upstream kernel the dkms package for debian was discontinued in ~2011.

fwiw: we would appreciate having a recent version packaged through dkms.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

HACKER-3000 picture HACKER-3000  路  5Comments

mweinelt picture mweinelt  路  3Comments

Nurtic-Vibe picture Nurtic-Vibe  路  5Comments

rubo77 picture rubo77  路  5Comments

Nurtic-Vibe picture Nurtic-Vibe  路  5Comments