Core: Multi-WAN Failover fails on unplugging cable of WAN1 / WAN2

Created on 10 Jun 2020 · 66 Comments · Source: opnsense/core

Important notices
Before you add a new report, we ask you kindly to acknowledge the following:

[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[X] I have searched the existing issues and I'm convinced that mine is new.

Describe the bug
Multi-WAN Failover fails on unplugging cable (very short period) of WAN1 / WAN2.
It will do strange things and not go back to Tier1.
Hitting save on any interface will fix the issue till next time.

To Reproduce

  1. followed this guide 100%: https://docs.opnsense.org/manual/how-tos/multiwan.html
  2. Creating Failover with WAN1 being Tier1 and WAN2 being Tier2
  3. Unplugging Cable of Tier1 for only 0.1 seconds and plugging it back in
  4. cut internet connection on WAN1 without pulling the cable (link needs to stay up). You will see the trouble happening after that. Do some testing by cutting internet on Tier1 and Tier2 but do not pull the cable, try to cut the internet somehow else. Check the behavior.

Expected behavior
Failover should work as set up (Tier 1, Tier 2) even on ports flapping or some hardware rebooting which is connected to WAN1 or WAN2.
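The expected behavior can be sketched as a simple selection rule: always use the online gateway with the lowest tier. The snippet below is only an illustration of that rule, not OPNsense code; the `Gateway` class, its fields, and `select_gateway` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Gateway:
    """Hypothetical gateway record used only for this illustration."""
    name: str
    tier: int     # 1 = preferred, 2 = backup, and so on
    online: bool  # what the gateway monitor currently reports


def select_gateway(gateways: Iterable[Gateway]) -> Optional[Gateway]:
    """Return the online gateway with the lowest tier, or None if all are down."""
    candidates = [gw for gw in gateways if gw.online]
    if not candidates:
        return None
    return min(candidates, key=lambda gw: gw.tier)
```

With this rule, traffic moves to Tier 2 while Tier 1 is reported offline and returns to Tier 1 as soon as it is reported online again, which is the step that fails in this report.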

OPNsense 20.1.7
APU4 PC-Engines

I am willing to show the problems via anydesk or teamviewer.

THANK YOU!

bug

All 66 comments

This actually happened to me as well, you can read the post on it @ https://forum.opnsense.org/index.php?topic=17198.msg78209#msg78209

i tested again and can sum up:

Any physical un- and replugging of an Ethernet cable on the appliance (link going down and then back up) on any tier (EDIT: if on DHCP) will mess up failover afterwards.
(fyi: i just saw unbound dns not running for a few seconds in the dashboard.)

If i use another switch to cut the internet (so no direct un- and replugging of cable on port) the Failover will work as intended.
But keep in mind: failover will fail in the future if for example a device connected to the tier1 or tier2 WAN-interface will reboot (link down, link up).

AFAIK this only happens on dual DHCP or dual PPPoE WANs. If you use static this will not occur.

Yeah but this is very rare that you can ever get static IP's, neither of my wan providers have static IP's available OR the cost of getting one is not very enticing.

I also followed that same doc and I didn't see anywhere that static IP's were mandatory for this to function properly. It shouldn't matter though.

AFAIK this only happens on dual DHCP or dual PPPoE WANs. If you use static this will not occur.

I have dual DHCP. But why would this not be considered a bug?
The same would apply to dual PPPoE.

I will test with one of them being static.

Update:
Changed Tier2 from DHCP to static and pulled the Tier1 cable, with Tier1 still on DHCP. Same problem.
I will test with both being static right now.
Update2:
It seems that un- and replugging on either tier with DHCP enabled causes this. With static it seems to be fine. But it's not about dual DHCP, it is about DHCP in general. Using DHCP on WAN makes failover fail if that interface goes physically down and up again.

Can you let the modem do the dial-up so you can use a static IP behind it on OPNsense (for testing)? The same goes for DHCP.

Don't forget that everyone here is a volunteer; not every developer has dual DHCP or dual PPPoE available for testing all the time.

You would only need one DHCP for reproducing. See my update above.
I will try to help as well as I can.

I am not able to test PPPoE at the moment.

Final result:

WAN interface with DHCP will mess up Failover if it gets physically un- and replugged.

Easy to reproduce:

  • Do everything like said here: https://docs.opnsense.org/manual/how-tos/multiwan.html
  • Setup WAN1 (Tier1) with DHCP. Tier2 as you like, does not matter.
  • Put a simple network-switch between WAN1 and the Router you are connected to.
  • Unplug the cable on WAN1 (physically) on Opnsense-Appliance.
  • Replug the cable to WAN1 (physically) Opnsense-Appliance.
  • Wait a little bit till everything is back "normal" (gateways online).
  • Now do not physically unplug the cable on WAN1 of OPNsense; instead, unplug the cable on the switch that connects to your router, so there is no link-down event on WAN1 of OPNsense.
  • Leave this cable unplugged for about 30 seconds.
  • Replug this cable.
  • You will see that WAN1 (Tier1) is no longer used, even though it is back online after a few seconds.

If i use STATIC there is no such problem.

FYI: with DHCP i can see Gateways _disappearing_ on Dashboard. This does not happen with static.

If you are interested, this Windows Batch (.bat) will tell you your public ip again and again so you will see on which gateway you really are.

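@echo off
REM Print the public IP in an endless loop to see which gateway is in use.
:loop
SET /A XCOUNT+=1
echo %XCOUNT%
REM Reset the value so a failed request does not show the previous IP.
Set "ExtIP=unreachable"
REM Ask api.ipify.org for the public IP; give up after 1 second.
For /f %%A in (
  'powershell -command "(Invoke-WebRequest http://api.ipify.org -TimeoutSec 1).Content"'
) Do Set ExtIP=%%A
Echo External IP is : %ExtIP%
goto loop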
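For anyone not testing from Windows, roughly the same check could be written in Python. This is a minimal sketch, not part of the original report: it reuses the api.ipify.org URL from the batch file, while the function names, the `rounds` and `delay` parameters, and the choice to print only when the address changes are assumptions made here.

```python
import time
import urllib.request


def fetch_public_ip(url="http://api.ipify.org", timeout=1.0):
    """Return the public IP as reported by the given service."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def watch_public_ip(fetch=fetch_public_ip, rounds=None, delay=1.0, report=print):
    """Poll the public IP and report every change.

    rounds=None polls forever; a failed request is recorded as None.
    Returns the list of (attempt, ip) pairs at which the value changed.
    """
    changes = []
    last = object()  # sentinel, so the first result always counts as a change
    attempt = 0
    while rounds is None or attempt < rounds:
        attempt += 1
        try:
            ip = fetch()
        except OSError:  # covers URLError and socket timeouts
            ip = None
        if ip != last:
            changes.append((attempt, ip))
            report(f"{attempt}: external IP is {ip or 'unreachable'}")
            last = ip
        if delay:
            time.sleep(delay)
    return changes
```

Calling `watch_public_ip()` with no arguments polls once per second until interrupted; a line is printed whenever the address changes, which corresponds to traffic leaving through a different WAN.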

FYI: with DHCP i can see Gateways going down on Dashboard. This does not happen with static.

Isn't that the point though, detect a gateway down then switch over? If the gateways do not report down then this isn't a true failover, just a re-route.

The whole list is not seen anymore, as if there are no gateways. I changed "going down" to "disappearing".

Any news on this?
Thank you!

No, it's vacation time ..

Do my testings help if someone has spare time?
Thanks

I have the same / similar issue.

Infrastructure

  • Cable modem (Kabel-Vodafone, Germany) in bridge mode. Must use DHCP due to provider requirements.
  • PPPoE (V-DSL, Deutsche Telekom, Germany), must also use DHCP due to provider requirements.

Both Internet connections have fixed IPv4/IPv6 addresses; however, DHCP is mandatory.

Tests done

  • With PPPoE I tried having the modem do the PPPoE work instead of OPNsense; it didn't make a difference.
  • I tried also all other suggestions made here; again no difference.

Problem in addition
I also found that the result is the same as unplugging the connection between OPNsense and the modem when I do not unplug anything manually but the Internet provider has severe latency issues, or the connection simply breaks on the provider's side of the modem. With Kabel-Vodafone this has been a 'standard issue' since the beginning of this year.

Maybe this behavior is another bug, however, it may help to track down this issue since it isn't only annoying, it is simply rendering OPNsense in multi-wan environments for small-midsize-businesses worthless.

Side-Notes

  • I switched from pfSense to OPNsense in December 2019, since Kabel-Vodafone. Up to this point I can report that the issues I see with OPNsense never occurred with pfSense. I mention this because, to my understanding, there is some code sharing, so this may be another chance to track the problem down.
  • To make sure that network cards (drivers) aren't the issue, I did test this issue with a physical OPNsense machine and in addition, on same hardware, with a Hyper-V and OPNsense in a virtual machine. Same situation.

  • To test whether or not network cards are the issue, I did also use different, FreeBSD hardware list based, network cards in standard configuration and found no difference.

All cards were Intel cards.

  • I also played around with 1GBit and 10GBit NICs since at least with Kabel-Deutschland cable modems I found in the past the one or other issue whilst the DrayTek modem(s) we use are rock-solid and don't care about NICs. No difference.

The issue persists.

I can, by the way, confirm that multi-WAN with fixed IPs between modems and OPNsense, as well as multi-WAN with fixed IPs between router (DC infrastructure) and OPNsense as described here, works like a charm. In the latter environment, there is also no difference between copper and fibre, nor in speed (tested up to 40 GBit with Mellanox cards).

No, it's vacation time ..

I was just wondering: where is "vacation time"? In Germany it starts at the end of June.
And in times of Covid-19 everything is different, right?
Thanks

This and last week was homeoffice-only, no chance to test, sorry.

But you are still at it so everything is fine.
It's a big issue.
Hope to hear from you soon.
Thanks

There also needs to be a solution for those that do not have access to static IP's.

It looks as if DHCP is not re-triggered because there is no 'interface down/up' event when the cable is not removed. I'll TRY to re-create this locally, although I don't have multi-WAN per se.

Great, if you need help or inspiration: let us know.
Just one of the two WANs needs to be DHCP and you can reproduce the issue easily.

OK.. my test consists of this, and remember I'm running on 20.7b.

I have a test router set up for failover, getting its two WAN addresses from my main router: two LAN networks, so independent addresses. I have one LAN out of the test router and this is v4 only; v6 is disabled. Between the WAN input on my test router and the switch port carrying the primary router LAN I have added another switch, which we'll call 'Switch B'. This allows me to unplug that network without taking down the primary WAN interface on the test router. I think that pretty much matches what you are saying.

I've added some extra logging to the rc.syshook.d/monitor/10-dpinger script so I can see it being called; it echoes some junk to a log file for me.

It's working perfectly. I can unplug the input side of 'Switch B', thus leaving the WAN port of the test router connected, and I can watch the gateway loss increasing. I waited until it showed 100% loss and then checked my temp log file to see if the 10-dpinger script was called. I also did a tracert from my PC, which confirmed that the gateways had switched.
I then left it for around 5 minutes - I went and made a cup of tea. :) I then reconnected the input to 'Switch B' and watched as the loss started to decline. At around 30% the indicator turned amber, but I waited until it went green. I then checked whether the 10-dpinger script had run; it had. I also did a tracert from my PC and that confirmed the gateway had switched back to the primary gateway. So it appears fine on v4 only.

I might add that on one occasion it took around 30 seconds for the route to switch back to the primary, but it did switch.

Are you using v6 too? Perhaps there might be an issue when dual stack is used

No ipv6 here either.

Please keep this in mind:
First unplug and replug the WAN with DHCP _on your OPNsense_ appliance.
Then try your test again (Switch B) _without_ unplugging and replugging on the WAN of OPNsense.

Please see here (step by step):
https://github.com/opnsense/core/issues/4160#issuecomment-641789848

Thanks!

I have TWO opnsense appliances, primary router and test router. Which one do you mean?

The one that is handling the Multi-WAN.
The one you set up like this:
https://docs.opnsense.org/manual/how-tos/multiwan.html

I am just using one Opnsense with two WAN and a dumb switch for testing as explained above.

Did that... works fine.

Did anything change regarding this in the beta you are using?
Could you please test with 20.7. (non Beta)?
And are you using DHCP on the WAN-Interfaces?

No, 20.7 is not out yet. Do you mean try with 20.1.7? :) Yes, can do that. It'll take me a few minutes to back up the configs and install that version. BBS

Yup, DHCP on both interfaces.

Works fine on 20.1.1, now I'll update.

20.1.7 appears to have a problem.. I'll see if I can try and find it.

If it worked with 20.1.1 and breaks with 20.1.7, this patch can be the only reason (although it should fix the problem rather than cause it):
https://github.com/opnsense/core/issues/3961

That's in rc-linkup, which doesn't get called in this scenario as far as I can see. This is the case where the interface is still up, but the other side of the switch in the middle is down. I'm going to re-test 20.1.1 to confirm first that it was working.

This is a strange one. I haven't been able to make it fail in 20.1.1, but 20.1.7 is a bit weird. As @pete1019 says, if you flip it a couple of times, the secondary WAN monitor works fine but there's no route to host from the PC. It gets as far as OPNsense but that's it. It really doesn't matter which way you do it either: the first changeover and back appears to work every time, after that it doesn't, even if the interface goes down and back up.

@pete1019 can you re-open this please.

This one here is still open.

i am not able to open this one because i am not the owner: https://github.com/opnsense/core/issues/3961

The default route is missing..

@pete1019 can you re-open this please.
i am not able to open this one because i am not the owner: #3961

Sorry my bad... just saw closed. :)

Yup, add the default route and it's working again. @pete1019 could you check that as well please. Do a netstat -4rW, and see if the default route vanishes.
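The check suggested above (does the default route vanish?) can be scripted so it is easy to repeat during a test run. This is a minimal sketch that only inspects text in the usual `netstat -r` table layout, where the IPv4 default route appears as a row whose first column is `default`; the function name and the sample tables in the usage note are made up for illustration.

```python
def has_default_route(netstat_output: str) -> bool:
    """Return True if a routing-table dump contains a row starting with 'default'."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default":
            return True
    return False
```

On the firewall itself the input could come from something like `subprocess.run(["netstat", "-4rW"], capture_output=True, text=True).stdout`. Running the check before and after cutting the upstream link shows whether the default route was removed and never re-added.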

i am not able to. Don't have an appliance here. Could set up a VM again these days.
But i remember: just hitting save on any interface did the trick as well.

Maybe somebody else could try?
@ischilling
@arch1mede

Thanks

Yes save interface would restore the gateway. So that appears to be the issue.

@marjohn56 the system log probably contains more details if there's a race between dhcpc and the link-up event somehow.

The "default gateway" switching calls inside the filter code:
https://github.com/opnsense/core/blob/e2f6272957d8f3e60b107d3eca450929415de4cb/src/etc/inc/system.inc#L416

I don't believe it to be a DHCP issue, as when I disconnect the primary interface the secondary interface is still up and running, with an address on the interface. It's just that the default gateway doesn't get added although the old one is removed.

usually this should leave some content in the logs, but since dhcpc is responsible for providing the gateway and it doesn't exist with a static address, it sounds quite related to me.

Maybe I misunderstand then, I've never looked at failover before. My assumption was that both interfaces would have a dhcp assigned address from the ISP, in the event of the primary interface going down the secondary interface would already have the information and it would just be a case of setting the default route/filters - or am I misreading this?

I don't think you are, but at a first glance there are only a couple of things that can go wrong, either the gateways are not known (see the files in /tmp/) or the process responsible for detecting a failure doesn't provide the correct signal (which would be dpinger logically). I haven't looked into this issue, but to me it doesn't look like https://github.com/opnsense/core/issues/3961 killed a feature, more likely it worked by accident (not switching where it was supposed to).

Personally, I would start looking at the events triggered in the (system) log, currently I don't have time to test this locally.

I've checked that the 10-dpinger script is being called, at least I did on 20.7, so dpinger etc is doing its thing. I'm just trying to work my way through the gateways group stuff to work out what SHOULD happen when an interface goes down. Think I'll do a compare as this seems to be a regression but I could be wrong.

checked, dhclient is still running on the secondary interface. More mysterious still is whilst checking through the code I saw this:
if (isset($config['system']['gw_switch_default'])) {
// When gateway switching is enabled, we might consider a different default gateway.
// although this isn't really the right spot for the feature (it's a monitoring/routing decision),

I set that in general config - Allow gateway switching and it's worked every time now. Could that be the missing link?

euh, yes, if gateway switching isn't enabled it won't try to update the standard gateway. I would expect it to stay stale though (so in this case the question is which event led to gateway removal).

OK... well I've also made sure that the DNS servers in general are NOT the same as those specified in the gateways, so I'm just leaving the gateways at the default address. If this all tests out, and I'll do some more testing tomorrow, then I think we need to change the how-to, as it shows the Google DNS addresses. So it looks like you and @Franco don't need to worry about this, you can take a look at the IPv6 link-local monitoring issues. 😁

Tested this some more, and after setting gateway switching to On it appears to behave itself. I need the OP and others to confirm this.

You mean this?
When using Unbound for DNS resolution you should also enable Default Gateway Switching via System->Settings->General, as local generated traffic will only use the current default gateway which will not change without this option.
From here: https://docs.opnsense.org/manual/how-tos/multiwan.html

Nope, this was always active in my tests on 20.1.7 and i can still reproduce the problem.

Please always check which gateway it really goes by checking something like www.ipcheck.com for example

How long are you waiting for recovery? On mine it takes around 60 seconds.

you mean 60 seconds when packetloss is back under 10%?

With STATIC IP on WAN it is back instantly (after packetloss being under the threshold).
I was not checking so long (60 seconds) on my tests.

Interesting.. OK, a bit of deeper delving has maybe got me somewhere...
It would appear that the call to configctl filter reload in 10-dpinger doesn't actually do anything, changing the line to /usr/local/sbin/configctl filter reload does.

@pete1019 -Try editing the file, you'll find it in /usr/local/etc/rc.syshook.d
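One possible explanation for why the bare command does nothing while the full path works is that the hook runs with a search path that does not include /usr/local/sbin. That is an assumption; the thread does not confirm it. The sketch below only shows how name resolution depends on the search path, using Python's `shutil.which`; the `resolve` helper is a name introduced here.

```python
import shutil
from typing import Optional


def resolve(command: str, search_path: str) -> Optional[str]:
    """Return the file `command` resolves to under `search_path`, or None.

    `search_path` is a PATH-style string such as "/sbin:/bin:/usr/sbin:/usr/bin".
    A command that already contains a directory part (a full path) is checked
    directly and does not depend on `search_path` at all.
    """
    return shutil.which(command, path=search_path)
```

For example, `resolve("configctl", "/sbin:/bin:/usr/sbin:/usr/bin")` returns None unless one of those four directories contains an executable named configctl, whereas `resolve("/usr/local/sbin/configctl", ...)` succeeds whenever that file exists and is executable, whatever the search path is.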

Ok, I finally found the time to debug.

My test machine has 20.7b (ISO, not only UI), WAN1 is DHCP, gateway has prio 251, marked as upstream, monitoring enabled. WAN2 is static, 192.168.12.X, prio 255, marked as upstream, monitoring enabled. In System : Settings : General, default gateway switching is enabled. I do NOT use gateway groups or similar, just gateway switching. I shut the switchport where WAN1 sits (like unplugging the cable or a defect of modem) and it fails over to static. I reenable the port and it fails back to DHCP gateway. I did this 3 times .. always set the correct gateway.

Can't reproduce ..

Please do exactly as i stated here:
https://github.com/opnsense/core/issues/4160#issuecomment-641789848

It looks like you always physically unplug and replug.
Please do this only once with the DHCP port. The second time, please use a dumb switch and cut the connection there so you don't unplug the cable from the WAN port of OPNsense. It is important not to physically detach the cable again.

Also: i use Gateway-Group. Just like everything was explained in official Multi-WAN tutorial:
https://docs.opnsense.org/manual/how-tos/multiwan.html

The reason why this is so important to work:
imagine your Modem would reboot for some reason. It will get a link down in opnsense on your WAN.
Later, only the Internet will fail (no link down and link up again) because your provider is down.
It will switch to WAN2 but it will never (or not as soon as intended) switch back to WAN1.

I can't test this from home .. maybe next week when I get back to work ...

No dumb little switch at home?
But again: THANKS everyone for your time!

The machine is at work and needs cabling.
But I'm happy gateway selection code is fine. I never use gateway groups, but we will see next week


If i use physically unplug and replug on WAN with DHCP everything works fine for me as well.
Thats why it is important to do it like this:
https://github.com/opnsense/core/issues/4160#issuecomment-641789848

Excited on how your tests go next week.

Yes, it DOES work fine if you physically unplug; that's because a WAN down/up event is triggered. If, however, there is an upstream failure and dpinger should do the detection, THAT is where the issue is. As I said, the reason it fails is what I pointed out in an earlier message: the problem is in 10-dpinger. It doesn't run 'configctl filter reload', nor does it write anything to the log to say it hasn't. If you give the full path to configctl then it does work.

Thanks, so who is able to fix that and release it?
I think pfsense does not have this issue as someone stated that here before.

I think i should set up another test-vm here. But i need to think about how to get a second WAN since i don't have the LTE-device here anymore.
Can you please give more instructions what i should exactly do to test your fix? Log into opnsense via ssh... nano into " /usr/local/etc/rc.syshook.d", change what (line)? Will this survive an update? Thanks

You are, until an update is released. You can fix it yourself; I've posted how. I don't really see the relevance of pfSense in the conversation.

Now fixed and will be in the next release or you can patch it yourself.

So is this commit fixing the issue? Can anyone confirm?
Thank you.

Was fixed in 20.1.8 most likely. :)
