Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
[x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
[x] I have searched the existing issues and I'm convinced that mine is new.
Describe the bug
A clear and concise description of what the bug is, including last known working version (if any).
Tip: to validate your setup was working with the previous version, use opnsense-revert (https://docs.opnsense.org/manual/opnsense_tools.html#opnsense-revert)
To Reproduce
Steps to reproduce the behavior:
Upstream Gateway being enabled (active) and netstat -rn showing it as default.
Expected behavior
This gateway should not be set as a default gateway.
Environment
Software version used and hardware type if relevant.
e.g.:
OPNsense 20.1.2-amd64
FreeBSD 11.2-RELEASE-p17-HBSD
OpenSSL 1.1.1d 10 Sep 2019
Background information:
We're running BGP on two paths to our upstream provider to announce certain subnets, and they originate the default route to us.
Because we didn't have a default gateway configured in the OPNsense gateways section, FRR set the default gateway being received from our ISP (this is how we wanted it by allowing the import of 0.0.0.0/0). This worked flawlessly, also the gateway failover via BGP worked, because we're getting the default originate on both links.
Also, we have multiple GRE tunnels configured to the backend, for which gateway entries are getting created automatically.
The initial setup was done on OPNsense 19.1.
A few days ago, we've upgraded to 20.1 (which almost worked flawlessly - some minor issues apart from this one).
After the upgrade, because no default gateway is configured, OPNsense chooses one of the automatic GRE tunnel gateways as the default gateway (none of them is configured as a default candidate, and they are all created with weight 255).
This is an issue because the import of the BGP-originated default gateway from our ISP no longer works: OPNsense sets one of the GRE tunnel gateways as default gateway, and because that is a static route, FRR gives it a higher weight than the one announced to us.
To work around this, we had to configure both gateways in OPNsense as well and activate dpinger on them for automatic failover, with a lower weight than the GRE tunnel gateways, but we'd like to receive the gateway through BGP again.
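The route preference described above can be illustrated with a small sketch. It assumes FRR's documented default administrative distances (kernel and connected 0, static 1, eBGP 20, iBGP 200) and assumes that the default route set by OPNsense appears to FRR as a kernel route; the route entries and names are hypothetical and not taken from the reporter's setup.

```python
# Default administrative distances as documented for FRR's zebra.
# When several protocols offer the same prefix, the lowest distance wins.
ADMIN_DISTANCE = {
    "kernel": 0,
    "connected": 0,
    "static": 1,
    "ebgp": 20,
    "ibgp": 200,
}


def best_route(candidates):
    """Return the candidate route with the lowest administrative distance."""
    return min(candidates, key=lambda route: ADMIN_DISTANCE[route["proto"]])


# Hypothetical candidates for the default route (names are illustrative only).
routes = [
    {"prefix": "0.0.0.0/0", "proto": "ebgp", "via": "isp-link-1"},
    {"prefix": "0.0.0.0/0", "proto": "kernel", "via": "gre-tunnel-gw"},
]

print(best_route(routes)["via"])
```

Under these assumptions the BGP-learned default can only win once the system no longer installs a default route of its own.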
Testing the steps above on OPNsense 19.1 works as expected.
20.1.2:


19.1.4:


duplicate https://github.com/opnsense/core/issues/3597 ?
I've stumbled across this ticket, but marking the GRE tunnel gateways as down isn't an option for us.
Also we are not able to change the GRE tunnel gateway attributes (like forcing it to down) because they're getting created with a name longer than 32 chars (which is being prevented on edit) - and changing gateway names isn't allowed.
We also have another gateway on a dedicated management link configured which is separate from the BGP links, and this would get elected as default gateway then. Forcing this down is not an option either, because that would prevent the management interface from functioning.
just to be sure, the screenshot notes a gateway "test", which will become default when there's nothing else, but there's a reason why you can't mark it as down? Just trying to figure out if you have a question about something in core or the plugin (frr), if the latter is the case I can move your ticket to plugins.
Yes, the gateway is named test and it is indeed becoming default although it is not set as a default gateway candidate.
This isn't an frr / plugin issue, because this happens as well on a base install. I've created 2 more dummy interfaces and created a gateway on each interface, with none being set as a default gateway candidate. Yet one is being selected as a default gateway:

Setting one to a higher weight but checking the default upstream candidate tickbox results in a correct behavior selecting this one as the new default gateway:

so, you force it down? (as the other issue suggests)
Forcing the one gateway with the candidate checkbox ticked to down state causes OPNsense to select again one of the non candidates:

In my opinion the only issue here is, that OPNsense should stop selecting default gateways without the Upstream Gateway checkbox ticked.
we've discussed that thoroughly in the other ticket, we're not going to add new toggles for similar behaviour as marking the gateways that are available as being down (not available for default gateway selection).
I don't think that there are any new toggles needed, only a small logic change in the gateway selection to skip gateways being elected as default without being a default candidate:
public function getDefaultGW($skip = null, $ipproto = 'inet')
{
    foreach ($this->getGateways() as $gateway) {
        if ($gateway['ipprotocol'] == $ipproto) {
            if (is_array($skip) && in_array($gateway['name'], $skip)) {
                continue;
            } elseif (!empty($gateway['disabled']) || !empty($gateway['is_loopback']) || !empty($gateway['force_down']) || empty($gateway['defaultgw'])) {
                continue;
            } else {
                return $gateway;
            }
        }
    }
    // not found
    return null;
}
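To make the effect of the proposed `empty($gateway['defaultgw'])` check concrete, here is a minimal Python model of the function above. It is only a sketch: the `require_candidate` flag is a hypothetical switch added here to compare both behaviours side by side, the gateway list is assumed to be already sorted by preference, and the gateway names are made up.

```python
def default_gw(gateways, skip=None, ipproto="inet", require_candidate=False):
    """Return the first usable gateway, or None when nothing qualifies.

    require_candidate=False models the behaviour discussed in this ticket;
    require_candidate=True models the proposed extra 'defaultgw' check.
    """
    for gw in gateways:  # assumed to be sorted by preference already
        if gw.get("ipprotocol") != ipproto:
            continue
        if skip and gw["name"] in skip:
            continue
        if gw.get("disabled") or gw.get("is_loopback") or gw.get("force_down"):
            continue
        if require_candidate and not gw.get("defaultgw"):
            continue
        return gw
    return None


# Hypothetical setup: two gateways, neither marked as default candidate.
gateways = [
    {"name": "GRE1_GW", "ipprotocol": "inet", "defaultgw": False},
    {"name": "MGMT_GW", "ipprotocol": "inet", "defaultgw": False},
]

current = default_gw(gateways)
proposed = default_gw(gateways, require_candidate=True)
print(current["name"] if current else None, proposed)
```

With no candidate ticked, the first call still returns a gateway while the second returns nothing, which would leave the default route free for a routing daemon to install.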
That would be a breaking change for a lot of other setups, hence the choice as discussed in the other ticket.
I discussed this very intensively with @AdSchellevis .. there's no downside to forcing a gateway down (only the red icon looks scary). You can set manual routes with it so this should be a clean approach.
Indeed, the red scary offline button made me wonder if there should be another way. Apart from that, everything works as intended. I just have to explain coworkers now why I am forcing gateways to down state.
Is there a chance of considering this logic change for a new major release in the future?
I would not consider a change with 20.7 since this will already bring the 12.1 major update. We shouldn't add too many changes in auto-logic (touching defaults) at the same time.
We might consider changing the presentation / legends if that helps, the logic itself is unlikely to change, since it either adds more difficult to understand toggles or has a high risk of breaking connectivity on certain setups.
Interesting discussion. So,
Did I get this correctly? If so, changing the presentation / explanation would be appreciated. The current text suggests that unchecking _Upstream Gateway_ deselects the gateway as a default gateway candidate. At least that was my interpretation.
You're right, that explains it quite well. But honestly, who reads the documentation for "basic" settings where the UI is (apparently) unambiguous? I admit I didn't.
For _Upstream Gateway_, the UI says "will select [...] as a default gateway candidate", while the doc says "marks the gateway as favourable for default gateway selection". _Selecting as a candidate_ and _marking as favourable_ are two different things. Bringing the UI more in line with the doc would be appreciated. :-)
I don't mind if one wants to extend the help text, but the standard behaviour (select a default if we have none configured) has always been there..... It's always a good idea to read the docs, we do write them for a good reason :-)
While trying to come up with a better help text I realized that I still don't fully understand the logic behind this. If unchecking _Upstream Gateway_ just moves the gateway to the bottom of the list of default gateway candidates, then why do we need this option? I mean, that's what the _Priority_ value is for. Doesn't that make this checkbox completely redundant?
@maurice-w It's all about priorities, upstream gateways are favourable first, if none is available it moves on to non upstream variants. You can easily try with the toggles available to simulate the changes, since the gateway overview is always sorted the same as the underlying logic in gateway selection.
Dhcp type interfaces for example will be prioritised a bit higher by default (gateway provided by interface), which is why all usually "just" works (and if it doesn't, you can change it).
Gateway selection is a bit complex due to the different (existing) scenarios and dynamic interface types. We just made sure you can influence all aspects of the choices manually, while maintaining backwards compatibility in most cases (theoretically it should be possible to only use priorities, but dynamic types very easily ruin your setup).
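The ordering described in these comments can be approximated by a simple sort key. This is a simplified model, not the actual OPNsense implementation: it assumes that upstream gateways sort before non-upstream ones, that a lower priority value is preferred, and that disabled or forced-down gateways are left out; the extra weighting for dynamic interface types mentioned above is omitted. All gateway names and field names are hypothetical.

```python
def selection_order(gateways):
    """Sort usable gateways: upstream ones first, then by ascending priority."""
    usable = [
        gw for gw in gateways
        if not gw.get("disabled") and not gw.get("force_down")
    ]
    # False sorts before True, so upstream gateways come first.
    return sorted(
        usable,
        key=lambda gw: (not gw.get("upstream", False), gw.get("priority", 255)),
    )


# Hypothetical gateway list.
gateways = [
    {"name": "GRE1_GW", "upstream": False, "priority": 255},
    {"name": "WAN2_GW", "upstream": True, "priority": 254},
    {"name": "MGMT_GW", "upstream": False, "priority": 10},
    {"name": "WAN1_GW", "upstream": True, "priority": 250},
]

print([gw["name"] for gw in selection_order(gateways)])

# Forcing both upstream gateways down leaves only non-upstream ones to pick from.
for gw in gateways:
    if gw["upstream"]:
        gw["force_down"] = True
print(selection_order(gateways)[0]["name"])
```

In this model the upstream flag acts as a tier above the priority value, which is why a non-upstream gateway with priority 10 still sorts after an upstream one with priority 254, and why it becomes the default once every upstream gateway is down.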
The internals can be found here https://github.com/opnsense/core/blob/e690ff6fec3de90949cae42c7620ae6ec6b8beb0/src/opnsense/mvc/app/library/OPNsense/Routing/Gateways.php#L129
Originally, upstream was called WAN, which we deemed less logical https://github.com/opnsense/core/commit/7a8b12f030a851a801f55c27ee2a90635805020c
As mentioned earlier, I don't mind extending the help text, although to me it doesn't seem very useful given the complex nature of the task (the docs should explain all, the help texts are just a summary, which I believe aren't wrong at the moment).
To avoid having to explain more about this subject and how we got here, please read (all of) https://github.com/opnsense/core/issues/2279 first, it should contain most of the choices.
From a single static interface configuration, gateway selection seems very simple, but as soon as you're adding interfaces, pulling plugs, etc, things can get complex very rapidly.
Even though the issue I just discovered isn't exactly this very same one, it appears to be very closely related.
I recently had an episode where an offline gateway - detected to be offline due to the monitoring I had configured for it - was still selected as a default gateway even though it was correctly being reported as offline by the monitoring framework.
The gateway's interface was up and fully configured with an IP address, but traffic upstream was interrupted due to an ISP issue. The monitoring IP was a well-known public IP (i.e. Google's DNS @ 8.8.8.8), such that if that IP wasn't reachable, then it would be safe to assume that the internet wasn't reachable through that ISP's link, and thus that gateway should not be used for routing.
And still OPNsense selected it as the default gateway even when there was another gateway available that was clearly online. This resulted in an interruption of service.
Thankfully the ISP's outage didn't last long and service was restored, but this highlights a problem with the gateway selection code: it appears that not all factors that should be taken into account during the selection process are being considered.
Notice the first gateway on the list:

Notice the default gateway that's active:

I wrote a gateway monitoring script that would print out to STDOUT whenever there was a change in the default gateway configuration (polling via netstat -rn4 every 0.01s), and the system does activate the available gateway (BKUP_GW), but almost immediately falls back to MAIN_GW. And it does this repeatedly:
[diego@firewall ~]$ sudo ./monitor-gateway
2020/03/13 15:14:03: GATEWAY=[<MAIN_GW_IP>]
2020/03/13 15:14:29: GATEWAY=[empty]
2020/03/13 15:14:29: GATEWAY=[192.168.200.1]
2020/03/13 15:15:01: GATEWAY=[empty]
2020/03/13 15:15:01: GATEWAY=[192.168.100.1]
2020/03/13 15:16:30: GATEWAY=[192.168.200.1]
2020/03/13 15:16:37: GATEWAY=[192.168.100.1]
2020/03/13 15:18:06: GATEWAY=[empty]
2020/03/13 15:18:06: GATEWAY=[192.168.200.1]
2020/03/13 15:18:13: GATEWAY=[192.168.100.1]
2020/03/13 15:18:17: GATEWAY=[192.168.200.1]
2020/03/13 15:18:21: GATEWAY=[192.168.100.1]
2020/03/13 15:19:20: GATEWAY=[192.168.200.1]
2020/03/13 15:19:26: GATEWAY=[empty]
2020/03/13 15:19:26: GATEWAY=[192.168.100.1]
2020/03/13 15:20:33: GATEWAY=[192.168.200.1]
2020/03/13 15:20:41: GATEWAY=[192.168.100.1]
So something within the system detects the issue, but it keeps tripping over itself when trying to fix it.
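The monitoring script itself was not posted. A minimal sketch of such a poller, assuming the FreeBSD `netstat -rn4` output format in which the default route is a line starting with `default` followed by the gateway address, could look like this; the function names are hypothetical.

```python
import subprocess
import time
from datetime import datetime


def parse_default_gateway(netstat_output):
    """Return the gateway of the first 'default' route, or None if absent."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "default":
            return fields[1]
    return None


def watch(interval=0.01):
    """Poll the IPv4 routing table and print every default gateway change."""
    last = object()  # sentinel so that the first reading is always printed
    while True:
        result = subprocess.run(
            ["netstat", "-rn4"], capture_output=True, text=True
        )
        current = parse_default_gateway(result.stdout)
        if current != last:
            stamp = datetime.now().strftime("%Y/%m/%d %H:%M:%S")
            print(f"{stamp}: GATEWAY=[{current or 'empty'}]")
            last = current
        time.sleep(interval)
```

Calling `watch()` runs until interrupted and prints lines in the same shape as the log above, including `[empty]` for the moments when no default route is installed.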
I'm looking through the gateway selection code, and I don't see anywhere where it would take into account the gateway's current online/offline status by itself, as determined by the gateway monitoring infrastructure. Then again I may just be blind/dumb :)
All that said, let me know if you'd like me to start a new ticket with this specific issue.
Cheers!
This issue has been automatically timed-out (after 180 days of inactivity).
For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.
If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.
I think that this one was fixed, I reported a duplicate issue which was fixed
I have tried to read and understand this ticket and the one it refers to, but I still cannot wrap my head around what the intended meaning or functionality is. At the very best, the wording in the UI is extremely confusing, I would claim it's outright misleading.
With the current wording and the UI elements used (big red flags when marked as down, etc.), I remain adamant that the only possible interpretation is:
"If a gateway is not marked as an upstream candidate it stands to reason it should never be used as default gateway."
This is not the observed behaviour, and the combination of FRR (for getting routes for "the internet") and static routes for internal networks is currently nowhere near intuitive, and the "solution" feels like a hack. It hit us hard in the face during our attempt at migrating one of our datacentre firewall clusters last night..
This has been addressed by allowing one to select which gateways are "upstream" and which aren't. This allows you to mark VPN gateways (for instance) as non-upstream gateways so they won't get picked for default GW. The problem had to do with interface and route reconfiguration not happening under the appropriate events (i.e. interface goes up or comes down).
The solution is to just use "Upstream Gateway" in a single gateway's definition. I can confirm that version 20.7.7_1 is not affected by this problem.
Cheers.
--
Diego Rivera │ ARMEDIA
Application Architect
e. diego
a. 8221 Old Courthouse Rd Suite 300, Vienna, VA 22181, USA
Web: https://www.armedia.com/ │ LinkedIn: https://www.linkedin.com/in/diego-rivera-23b3211/
If you use dynamic routing you don't necessarily have a default gateway at all in the routing tab, only gateways for static routes.
Then you should be able to turn off the setting of the default gateway, and this issue could be moot, right?
This issue was a side-effect of (as I said) a defect where reconfiguration wasn't being done at the appropriate moments, coupled with multi-WAN setups (primary-backup) where failover was configured. The system would fail over, but either not fail back, or the default gateway could erroneously be set to a VPN gateway.
Cheers.
What we saw yesterday - with 20.7.7_1 - is not what you describe:
Despite the above, the single static gateway we have defined is assigned as default gateway, preventing FRR from overriding this. It has to be explicitly marked as "down" in order to prevent it being used as default.
The problem @ltning describes is exactly what I was dealing with as well. Except for me, those (non) upstream gateways were added by OPNsense automatically because of some GRE tunnels we have configured. When we migrated to the new version introducing this behavior, this hit us hard as well.