Gluon: batman-adv crash when removing interface in certain configurations

Created on 9 Mar 2016 · 22 Comments · Source: freifunk-gluon/gluon

When I configure more than two VLANs via UCI for mesh connections and add them to bat0, the network stack stops working correctly. Any invocation of ip, ifconfig, brctl, batctl, etc. results in a hung system.
If I configure just two mesh VLANs, the system works fine. It is also possible to add more VLANs to bat0 manually via "vconfig eth0.x" and "batctl if add eth0.x".
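
The manual approach mentioned above could be sketched roughly as follows. This is only an illustration: the parent device and VLAN ID are passed in as arguments, the `run` helper and the `DRY_RUN` switch are hypothetical additions, and `vconfig add <parent> <id>` is assumed to be what the shorthand "vconfig eth0.x" refers to. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: attach an extra mesh VLAN to bat0 by hand after boot.
# Assumption: the VLAN sub-interface is named <parent>.<id> (kernel default).
# With DRY_RUN=1 (the default here) commands are printed instead of run.

DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_mesh_vlan() {
    parent="$1"
    vid="$2"
    run vconfig add "$parent" "$vid"      # create e.g. eth0.102
    run ip link set "$parent.$vid" up     # bring the VLAN interface up
    run batctl if add "$parent.$vid"      # attach it to the default mesh interface (bat0)
}

# Example (dry run): add_mesh_vlan eth0 102
```

Setting `DRY_RUN=0` would execute the commands on the node; that requires root and the `vconfig` and `batctl` tools to be installed.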

The strange part is that before the crash happens, the kernel always prints "tried to remove device eth0 from br-client", but eth0 itself shouldn't be part of any bridge. I also don't see this message when I configure just two mesh VLANs.
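
To check the claim that eth0 is not actually a bridge port, one could extract the device pair from the kernel message and compare it against sysfs. The following is a minimal sketch; the function names are made up for illustration, and it assumes the message text appears in the log exactly as quoted above.

```shell
#!/bin/sh
# Sketch: pull "tried to remove device X from Y" pairs out of a kernel log
# and check whether X is really a port of bridge Y.

# Reads log text on stdin, prints "<device> <upper-device>" per match.
find_adj_errors() {
    sed -n 's/.*tried to remove device \([^ ]*\) from \([^ ]*\).*/\1 \2/p'
}

# Succeeds if $1 is listed as a port of bridge $2 in sysfs.
is_bridge_port() {
    [ -e "/sys/class/net/$2/brif/$1" ]
}

# Reads log text on stdin and reports membership for each pair found.
check_adj_errors() {
    find_adj_errors | while read -r dev upper; do
        if is_bridge_port "$dev" "$upper"; then
            echo "$dev is a port of $upper"
        else
            echo "$dev is NOT a port of $upper"
        fi
    done
}

# Example: dmesg | check_adj_errors
```

Note that this only helps on a system that is still responsive, for example when reading a saved dmesg.txt from a serial console capture.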

See attached files for backtrace and /etc/config/network.
If you need further information please let me know.

EDIT:

  • Tested with v2016.1.1 and master.
  • Problem occurs on two pieces of hardware.

dmesg.txt
network_config.txt

Labels: bug, upstream issue

All 22 comments

Interesting, I'll try to reproduce this with other hardware.

Which batman-adv version are you using, compat 14 or 15?

root@0216088335f8:/# batctl -v
batctl 2016.0 [batman-adv: 2016.0]

As far as I understand this refers to compat 15.

EDIT: I'll also try to reproduce it on a Raspberry Pi so we know if it is hardware related or not.

It looks like this problem is somehow related to br-client. If I remove br-client completely from /etc/config/network, I'm able to configure four mesh VLANs (I didn't try more).

However, this isn't really an option, as this device is intended to be a backbone node which connects several Ubiquiti devices running stock firmware. We have a management VLAN (3) on each of them, which is then bridged with bat0 via br-client. Without this we wouldn't be able to access them remotely.

I've forwarded the issue to the batman-adv developers, they'll have a look.

Thanks. Is there a ticket I can track and contribute to?

I just found some more details that may be helpful:
I can leave br-client enabled but need to remove bat0 from it. And again, it doesn't harm the system if I add bat0 to the bridge manually once everything is up.
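
Adding bat0 to the bridge by hand after boot might look like the sketch below. The `run` helper, the `DRY_RUN` switch and the function name are hypothetical; the bridge and mesh interface default to the names used in this thread. By default the command is only printed.

```shell
#!/bin/sh
# Sketch: add bat0 to br-client manually once the mesh interfaces are up.
# With DRY_RUN=1 (the default here) the command is printed instead of run.

DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

attach_bat0_late() {
    bridge="${1:-br-client}"
    mesh="${2:-bat0}"
    # Bridge ports are listed under brif/ in sysfs; skip if already attached.
    if [ -e "/sys/class/net/$bridge/brif/$mesh" ]; then
        echo "$mesh already in $bridge"
        return 0
    fi
    run brctl addif "$bridge" "$mesh"
}

# Example (dry run): attach_bat0_late
```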

I think there is no ticket yet, I'll open one.

I wasn't able to reproduce this issue on x86-vmware with the current Gluon master.

It might be worth noting that the crash happens when the VLAN interfaces are _removed_ from bat0. netifd first sees eth0 as up, adding the 4 VLAN interfaces to bat0, one second later it is seen as down again; when eth0.102 is removed from bat0, something fails and the kernel instead tries to remove eth0 from br-client, which causes the crash as eth0 is not part of br-client.

Can confirm that. The problem happens when the interface eth0 is going down. When I set everything up manually and then power cycle the switch it crashes with the same error message.

Could this be the same problem as we have here?
http://patchwork.ozlabs.org/patch/587118/

OK, I tried the attached patch against master, and it no longer crashes because of trying to remove eth0 from br-client. Now it fails because it tries to remove eth0 from local-node. Looks like I'm digging in the right direction.

dmesg.txt
patch.txt

BTW: I wasn't able to reproduce this behavior on other hardware like the Raspberry Pi.

EDIT:
It crashes only about half the time, with the message that it tries to remove eth0 from local-node. When it boots without a crash, I haven't been able to force a crash (so far). For the moment, the combination of this patch and removing the package next-node looks like a working setup.

@belzebub40k, can you test the updated patch from http://lists.openwall.net/netdev/2016/03/25/103 ?

Ah, it can also be found in patchwork: http://patchwork.ozlabs.org/patch/602124/

Perfect, so far it seems to work without problems. I will give it some more time for testing and keep you posted.

Sadly, it doesn't work properly. It crashes with the same error as reported initially when the counterpart is power cycled. However, disconnecting it physically doesn't cause a crash.

Today I got my BananaPro and I tried unpatched (except those to profiles.mk to build the image) v2016.1.3 on it and it doesn't crash which is very strange as it is mostly a BananaPi M1 with a WiFi chip. <-- NOT TRUE

I'm at the point where I suspect faulty hardware. I'll try to get my hands on another hardware revision of the M1. The one I have is a clone from sinovoip and not the original from Lemaker.

I think the issue is that the kernel code is completely b0rked and may fail in various spectacular ways.

As there is an upstream fix now, I'll try backporting it to the Gluon kernels.

As we might see this on our gateways, do you have a link to the upstream ticket?

Two upstream fixes related to interface removal:

Still pending is this ticket, with a corresponding patchset that still needs review:

  • https://www.open-mesh.org/issues/168
  • https://patchwork.open-mesh.org/patch/16718/
  • https://patchwork.open-mesh.org/patch/16719/

(no claim that this list is complete - @neoraider, if you noticed any other patches feel free to add :-) )

EDIT: Ah, I just saw the patch by Andrew Collins which @NeoRaider referred to above. @poldy79 and @plumpudding, this might be the patch you are most urgently looking for, as it exactly matches the kernel traces in #910 and here. Though @NeoRaider seems to have an updated, newer version of this in mind?

EDIT2: This looks like the accepted patch @NeoRaider seems to be referring to:

So far it seems to be part of Linux 4.9-rc1 only but is queued for stable kernels. The patchwork ticket states that it is not completely fixed yet though and that another patch will follow.

@NeoRaider I did a dirty backport of Andrew Collins' kernel patch and it looks much better. However, now there seems to be a race condition caused by eth0 going up and down quickly. This sounds exactly like the problem addressed by the batman-adv patches mentioned by @T-X. I'll try with batman-adv master instead of 2016.2.

I did not figure out how to switch to batman-adv master, so I just applied "batman-adv: Modify mesh_iface outside sysfs context" as a patch. That seems to mostly fix the race condition. At least it does not crash the kernel when unplugging and re-plugging eth0. But I still see messages which indicate that the race condition isn't fully fixed.

batman_adv: (null): no_rebroadcast: Changing from: disabled to: enabled
NOHZ: local_softirq_pending 08

The batman_adv (null) messages appear during every restart of eth0. local_softirq_pending happened just once, and I haven't been able to reproduce it a second time so far.

console log boot and restart of the interface triggered by unplugging or rebooting the switch

applied patches

Both mentioned patches are included in Gluon now (and I've just backported the batadv patch to -legacy), so I guess we can consider this fixed.

BTW: Really annoying when this happens during an autoupdate.
https://paste.debian.net/947029/
(Gluon v2016.2.x built from https://github.com/freifunk-gluon/gluon/commit/97f44c208b4dd23a63a0069963ca04fad899bf05 )
