Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
[X] I have searched the existing issues and I'm convinced that mine is new.
Describe the bug
Multi WAN does not seem to work properly when two WAN interfaces (each configured as a DHCP client) are used. During a test, unplugging igb1 (WAN) leads to the expected behaviour: traffic is then routed via igb2 (WAN2). Plugging the cable back into igb1 (WAN) also works as expected (traffic goes via WAN again). But after unplugging igb1 (WAN) a second time, new connections are not routed via WAN2. Only after reloading pf (using the web interface) does traffic from LAN reach the Internet again.
To Reproduce
Install and configure OPNsense:
Steps to reproduce the behavior:
Expected behavior
When Multi WAN is configured and a working WAN is available, routing traffic should always be possible.
Relevant log files
OPNsense-Multi-WAN-Test-Logs.txt
Additional context
Using static IP does not seem to cause problems.
Another user has reported the same issue here: https://forum.opnsense.org/index.php?topic=15943.0
I can fully reproduce the behaviour that the user mentions there.
Environment
OPNsense 20.1.1-amd64, FreeBSD 11.2-RELEASE-p16-HBSD, OpenSSL 1.1.1d 10 Sep 2019
Intel(R) Celeron(R) CPU J3160 @ 1.60GHz (4 cores)
Network 4x Intel® I210-AT
In case you need further information, here is the configuration file:
I'm planning further testing tomorrow using static IP addresses on the two WAN interfaces. I will then compare the log outputs to get an idea what the root cause of the issue could be.
If anybody has an idea or tips where I should especially look, just let me know. Otherwise I will update this issue tomorrow.
@tk-wfischer Is it possible to work this out with me on a private channel? I'd like to test this with gateway monitoring to rule out whether it's the failover logic in general or something related to link loss (which is a very rare case in a dual setup).
In my tests, I always unplugged the cable to test the failover setup. But as we're connecting to our ISP with a direct fiber channel, it can happen that the whole link goes down, for instance due to a power loss in some segments of the network.
@mimugmail and I did some further testing.
We have changed the configuration beforehand:
Our results:
Details of the first unplugging of the WAN interface (igb1):
Mar 10 10:47:15 multiwantest kernel: igb1: link state changed to DOWN
Mar 10 10:47:15 multiwantest opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for wan
Mar 10 10:47:15 multiwantest dhclient[8302]: connection closed
Mar 10 10:47:15 multiwantest dhclient[8302]: exiting.
Mar 10 10:47:15 multiwantest opnsense: /usr/local/etc/rc.linkup: Clearing states for stale wan route on igb1
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: Ignore down inet gateways : WAN_DHCP
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: ROUTING: removing /tmp/igb1_defaultgw
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: ROUTING: creating /tmp/igb2_defaultgw using '10.1.102.1'
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : WAN_DHCP
Mar 10 10:47:30 multiwantest kernel: pflog0: promiscuous mode disabled
Mar 10 10:47:31 multiwantest kernel: pflog0: promiscuous mode enabled
Mar 10 10:48:05 multiwantest kernel: igb1: link state changed to UP
After plugging the cable of igb1 (WAN) in again and then unplugging it a second time, the log shows the following:
Mar 10 10:49:11 multiwantest kernel: igb1: link state changed to DOWN
Mar 10 10:49:12 multiwantest opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for wan
Mar 10 10:49:12 multiwantest dhclient[76783]: connection closed
Mar 10 10:49:12 multiwantest dhclient[76783]: exiting.
Mar 10 10:49:12 multiwantest opnsense: /usr/local/etc/rc.linkup: Clearing states for stale wan route on igb1
<--- nothing happens here --->
Mar 10 10:50:30 multiwantest kernel: igb1: link state changed to UP
So when unplugging the cable the second time, the following lines are missing:
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: Ignore down inet gateways : WAN_DHCP
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: ROUTING: removing /tmp/igb1_defaultgw
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: ROUTING: creating /tmp/igb2_defaultgw using '10.1.102.1'
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : WAN_DHCP
Mar 10 10:47:30 multiwantest kernel: pflog0: promiscuous mode disabled
Mar 10 10:47:31 multiwantest kernel: pflog0: promiscuous mode enabled
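The comparison above can also be scripted when going through longer logs. The following is a minimal sketch, not part of OPNsense: the function name `failover_per_outage` is hypothetical, and the matched strings are assumptions taken from the log lines quoted in this thread. For each link-down event of an interface, it reports whether rc.filter_configure created a new default gateway before the link came back up.

```python
import re

# Hypothetical helper, not part of OPNsense. The patterns below are assumptions
# derived from the log excerpts quoted in this thread.
LINK_EVENT = re.compile(r"kernel: (\w+): link state changed to (DOWN|UP)")
GATEWAY_SWITCH = "rc.filter_configure: ROUTING: creating"


def failover_per_outage(lines, iface):
    """Return one bool per DOWN event of iface: True if a default gateway
    was created while the link was down, False otherwise."""
    results = []
    link_down = False
    for line in lines:
        event = LINK_EVENT.search(line)
        if event and event.group(1) == iface:
            link_down = event.group(2) == "DOWN"
            if link_down:
                results.append(False)  # assume no failover until we see one
        elif link_down and GATEWAY_SWITCH in line:
            results[-1] = True
    return results


SAMPLE_LOG = """\
Mar 10 10:47:15 multiwantest kernel: igb1: link state changed to DOWN
Mar 10 10:47:30 multiwantest opnsense: /usr/local/etc/rc.filter_configure: ROUTING: creating /tmp/igb2_defaultgw using '10.1.102.1'
Mar 10 10:48:05 multiwantest kernel: igb1: link state changed to UP
Mar 10 10:49:11 multiwantest kernel: igb1: link state changed to DOWN
Mar 10 10:50:30 multiwantest kernel: igb1: link state changed to UP
"""

print(failover_per_outage(SAMPLE_LOG.splitlines(), "igb1"))  # [True, False]
```

With the shortened sample taken from the two excerpts above, the first outage shows a gateway switch and the second one does not, which matches the observed behaviour.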
I can fully confirm the behavior described by @tk-wfischer.
For us this behavior is especially critical, as we have a direct fiber connection to our OPNsense box using an SFP module.
Obviously it can happen that the entire link goes (physically) down. This is also the reason we use two links in the first place: They are connected from different sides of the building to eliminate the risk of one link being damaged resulting in our whole location going down.
We highly appreciate any workaround or bugfix to get this working properly.
@tk-wfischer can you check if dpinger reported loss around 10:47:15? (/var/log/gateways.log)
Sure - here is the log snippet from this time frame:
[...]
Mar 10 10:46:17 multiwantest dpinger: GATEWAY ALARM: WAN_DHCP (Addr: 81.24.64.2 Alarm: 0 RTT: 11240ms RTTd: 776ms Loss: 6%)
Mar 10 10:47:16 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:17 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:18 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:19 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:20 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:21 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:22 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:23 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:24 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:25 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:26 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:27 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:28 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:29 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:29 multiwantest dpinger: WAN_DHCP 81.24.64.2: Alarm latency 11252us stddev 658us loss 22%
[...]
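To condense a gateway log like this one, a small script can count the send failures and pick up the last reported loss per gateway. This is only a sketch under assumptions: the helper `summarize_dpinger` is hypothetical, and the two line formats are inferred solely from the excerpt above, so other dpinger message types are ignored.

```python
import re

# Hypothetical helper; both patterns are assumptions based only on the
# dpinger lines quoted in this thread.
SENDTO_ERROR = re.compile(r"dpinger: (\S+) \S+: sendto error: (\d+)")
ALARM = re.compile(r"dpinger: (\S+) \S+: Alarm latency \d+us stddev \d+us loss (\d+)%")


def summarize_dpinger(lines):
    """Return {gateway: {"sendto_errors": int, "last_loss_pct": int or None}}."""
    summary = {}
    for line in lines:
        error = SENDTO_ERROR.search(line)
        alarm = ALARM.search(line)
        match = error or alarm
        if not match:
            continue  # line format not covered by this sketch
        entry = summary.setdefault(
            match.group(1), {"sendto_errors": 0, "last_loss_pct": None}
        )
        if error:
            entry["sendto_errors"] += 1
        else:
            entry["last_loss_pct"] = int(alarm.group(2))
    return summary


SAMPLE_LOG = """\
Mar 10 10:47:27 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:28 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:29 multiwantest dpinger: WAN_DHCP 81.24.64.2: sendto error: 65
Mar 10 10:47:29 multiwantest dpinger: WAN_DHCP 81.24.64.2: Alarm latency 11252us stddev 658us loss 22%
"""

print(summarize_dpinger(SAMPLE_LOG.splitlines()))
# {'WAN_DHCP': {'sendto_errors': 3, 'last_loss_pct': 22}}
```

The sample contains only the last four lines of the excerpt, so the count is 3 here rather than the 14 errors shown in the full snippet.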
@tk-wfischer ok, can you try https://github.com/opnsense/core/commit/0e2751d2d15da34389f6c2923fb1215aeffd6f38 ?
opnsense-patch 0e2751d
If that solves it, I'm also interested in the gateway monitor status afterwards (is it reporting ok after a while).
@AdSchellevis and @qdrop17 - good news - the patch solves the issue :-)
Under "System: Gateways: Single" and on the dashboard, the gateway then shows the status "Online" again, too.
Here are the screenshots after unplugging igb1:


And here after plugging the cable of igb1 in again:


I think now everything is working fine as expected.
@qdrop17 can you also test whether the patch solves the issue for you?
Thank you @AdSchellevis for your great support - we appreciate it very much.
Alright, I can confirm the fix to be effective. Everything seems to be working as expected now.
Thank you very much @AdSchellevis and @tk-wfischer for your efforts to fix this!
This is good news. You are welcome.
@fichtner Should I leave this issue "Open" until the fix is included in OPNsense 20.1.2 or 20.7?
20.1.3 ;)
Please try again, but unplug and replug the cable only the first time. The second time, cut the internet connection without unplugging the cable on WAN.
See step by step here: https://github.com/opnsense/core/issues/4160#issuecomment-641789848
This seems to bring another problem to light.