Core: Gateway-group bound state flushing on failback

Created on 5 Jun 2019 · 5 Comments · Source: opnsense/core

When using a failover gateway group (Tier1/2) for VoIP,
there is a situation where SIP and RTP can be split across tiers, resulting in
VoIP becoming unavailable.

Scenario:

Tier1 UP; Tier2 UP

SIP and RTP are over Tier1, everything works well

Tier1 DOWN (fails); Tier2 UP

SIP and RTP are over Tier2, everything works well

Tier1 UP; Tier2 UP

SIP remains over Tier2, while (new) RTP goes out via Tier1.
VoIP remains broken until a manual state flush or a forced manual re-registration

Tier1 UP; Tier2 DOWN (fails)

SIP and RTP are over Tier1, everything works well
Registration now is forced to reoccur via Tier1

Basically, when the registration occurs via tier 2, and tier 1 comes back online, the registration stays on tier 2.
This results in the RTP data going out over tier 1 and hence being in a split state, ruining the system.

The ideal solution would be if the gateway group contained a feature
"Flush states for this GW group on failback"

This would allow for:

  • GW groups that keep the generally preferable soft failback, where existing connections remain
    on the backup
  • Failover groups that require a hard failback (VoIP, metered connections)
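
To make the distinction concrete, here is a minimal sketch of the decision such an option would have to make. It is not OPNsense code: the function names, the "yes"/"no" option value and the tier arguments are all hypothetical, and it assumes a lower tier number means a more preferred gateway.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OPNsense.

# True when the group moved to a more preferred (lower-numbered) tier,
# i.e. a failback rather than a failover.
is_failback() {
    prev_tier="$1"
    cur_tier="$2"
    [ "$cur_tier" -lt "$prev_tier" ]
}

# True only when the group opted in to hard failback AND a failback happened.
# $1: "yes" or "no" (the proposed per-group option), $2: previous tier, $3: current tier
should_flush() {
    hard_failback="$1"
    [ "$hard_failback" = "yes" ] && is_failback "$2" "$3"
}
```

With this, should_flush yes 2 1 succeeds (Tier2 back to Tier1 on a hard-failback group), while a soft-failback group or a plain failover from Tier1 to Tier2 leaves states alone.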

I'd be willing to poke around for implementation given some pointers.

help wanted

All 5 comments

The problem is that OPNsense/pf keeps a state for the SIP flow, and since T2 is still up when moving back to T1, SIP packets are still sent over T2 while connection-less RTP runs over T1.
We may need something similar to Kill states and/or Dynamic state reset for ANY gateway failover, with a clear warning in the description and disabled by default.

Yes, that’s the “problem” exactly.

Is there a way to match the state to the rule that passed it?
Because then the dynamic state reset could be matched to the gateway group.
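
As a rough way to explore this, the verbose state listing from pfctl includes the number of the rule that created each state (a trailing ", rule N" on the statistics line). The sketch below tallies states per rule number from that output. The function name is made up for illustration, and the exact output format is an assumption that may differ between pf versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `pfctl -vv -s state` output on stdin and prints
# "<rule number> <state count>" pairs, sorted by rule number.
# Assumes each state's statistics line ends with ", rule N".
count_states_per_rule() {
    awk 'match($0, /, rule [0-9]+/) {
             # skip the 7 characters of ", rule " to get the number
             n = substr($0, RSTART + 7, RLENGTH - 7)
             count[n]++
         }
         END { for (r in count) print r, count[r] }' | sort -n
}

# Usage on the firewall (needs root):
#   pfctl -vv -s state | count_states_per_rule
```

The rule numbers can then be compared against the numbered listing from pfctl -vv -s rules to see which rule, and therefore which gateway group, a state belongs to.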

It would be great if this can somehow be made dependent on a rule, because it’s very intrusive and only appropriate in a select few cases.

I will tinker a bit later with pfctl because I’m not really familiar with it.

As a side note, do you think that a stateless gateway rule would provide an adequate workaround or will there be other unforeseen issues?

With the new rule logic planned for 19.7 it should be possible to kill states on a per-rule basis, since we use the label field as a unique rule hash (previously the description was put there).
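
If the label really is a unique per-rule hash, killing by label could look roughly like the sketch below. It relies on the "-k label -k <label>" form described in pfctl(8). Whether the pfctl shipped with a given OPNsense release supports it is an assumption, and the function name and the PFCTL override variable are hypothetical (the override exists only so the command can be dry-run).

```shell
#!/bin/sh
# Hypothetical wrapper: kill all states created by rules carrying a given label.
# $1: rule label (on 19.7+ this would be the rule hash; value not shown here)
# PFCTL may be set to another command for a dry run; defaults to pfctl.
kill_states_for_label() {
    label="$1"
    if [ -z "$label" ]; then
        echo "usage: kill_states_for_label <rule-label>" >&2
        return 2
    fi
    "${PFCTL:-pfctl}" -k label -k "$label"
}

# Only act when a label is passed on the command line.
if [ "$#" -gt 0 ]; then
    kill_states_for_label "$1"
fi
```

This keeps the flush limited to the rule that uses the gateway group, rather than resetting every state on the firewall.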

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

Just adding this here for anyone else who stumbles across it, as the OPNsense forum post appears to have been archived. I hacked together a script, based on something I found on the pfSense forum, that does the same.

My WAN interface is lagg0_vlan18. The script checks the default route interface, and if it is set to WAN it resets the mobile data states. My OPNsense connects to a 4G/Mobile router over 192.168.54.0/24.

It uses 'pfctl -k' to only kill states for 192.168.54.102 (the NAT IP on OPNsense that goes over 4G). It runs from cron every minute and causes my L2TP (UDP) tunnels to fail back/reconnect on the active connection, rather than remaining on 4G.

As I have gateway monitoring enabled, which causes 1 ICMP session, the number of mobile states has to be greater than 1 (to allow for this) before it does any state flushing. This is the [ "$MOBILE_NSTATES" -gt 1 ] test in the script:

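#!/bin/sh
# *** kills firewall states left on failover Mobile Data when WAN is up ***

# Primary WAN interface; adjust to your setup
WAN_IF="lagg0_vlan18"

CURRENT_TIME="$(date +"%c")"
# Interface that currently carries the default route
WAN_STATUS="$(route -n show default | awk '/interface:/ {print $2}')"

if [ "$WAN_STATUS" = "$WAN_IF" ]; then
    # the following line may need to be tweaked depending on your needs
    # grep -c counts matching state lines (states touching the mobile subnet)
    MOBILE_NSTATES="$(pfctl -s state | grep -c "192\.168\.54")"
    # more than 1, because gateway monitoring always keeps one ICMP state open
    if [ "$MOBILE_NSTATES" -gt 1 ]; then
        echo "$CURRENT_TIME: WAN1 is online, but connections remain on Mobile Data. Killing states."
        # kill only the states originating from the NAT IP used on the 4G link
        pfctl -k 192.168.54.102
    fi
fi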

EDIT: This also requires gateway switching to be enabled, so that the default route/interface changes.
