Mosh: Mosh connection stalls over VPN

Created on 9 Dec 2017 · 5 Comments · Source: mobile-shell/mosh

Observed behaviour:

  • Mosh successfully connects
  • After some use, the client appears to lose its connection to the server, although the server is still running
  • Client can disconnect using C-^.

I discovered I could trigger the behaviour using cat bigfile, and after digging around some more, found this to be an MTU issue.

Using tcpdump on the client side at the VPN tunnel I see:

14:01:58.018503 IP xxxxxxx.60001 > vpn-10-50-36-1.xxxxx.57929: UDP, bad length 1272 > 1168
14:01:58.018843 IP xxxxxxx.60001 > vpn-10-50-36-1.xxxxx.57929: UDP, length 779

I suspect this is a consequence of buggy behaviour on the part of the VPN ... failing to advertise a lower MTU and simply truncating packets in flight.

This ticket might be useful in providing another data point if there's work to make mosh more tolerant of MTU in general.

All 5 comments

That’s really odd, if the VPN is truncating rather than dropping oversized packets.

What version of Mosh do you have on the server?

What is the MTU on the VPN? What type of VPN is it?

And can you get a tcpdump capture with -v so we can see if the DF bit is set?

Also, the easy way to diagnose this is to start tmux/screen, do whatever hangs your session. Then type the keystrokes to switch to a new/empty screen. If your screen comes alive, then it’s almost certainly MTU.

The VPN is Cisco AnyConnect on Mac, and the observed behaviour is very odd.

I observe this behaviour on 1.3.2. (I also reverted to my stable version of 1.2.4 to confirm.)

Recompiling the server to force a small MTU makes the problem go away.

The tcpdump expression I used was incorrect. Applying a correct pattern I see:

No.  Time      Source          Destination  Protocol  Length  Info
4    0.019986  10.184.151.232  10.50.36.1   IPv4      1200    Fragmented IP protocol (proto=UDP 17, off=0, ID=a501) [Reassembled in #5]
5    0.020046  10.184.151.232  10.50.36.1   UDP       108     60001 → 52963 Len=1252
6    0.020290  10.184.151.232  10.50.36.1   UDP       847     60001 → 52963 Len=815

The VPN interface has MTU 1200, so this looks good.
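The fragment sizes can be checked with a little arithmetic. The sketch below is only an illustration, not code from mosh or from any IP stack: `fragment_payloads` is a hypothetical helper that assumes a 20-byte IPv4 header without options and an 8-byte UDP header, and splits the way a simple fragmenter would (every fragment except the last carries a multiple of 8 bytes).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sizes of the IPv4 fragment payloads produced when a
// UDP datagram carrying `udp_payload` bytes crosses a link with this MTU.
// Assumes a 20-byte IPv4 header (no options) and the 8-byte UDP header.
std::vector<int> fragment_payloads(int udp_payload, int mtu) {
    const int kIpHeader = 20;
    const int kUdpHeader = 8;
    assert(udp_payload >= 0 && mtu >= kIpHeader + 8);

    // Bytes the IP layer has to carry: UDP header plus UDP payload.
    int remaining = udp_payload + kUdpHeader;
    // Every fragment except the last must carry a multiple of 8 bytes.
    const int max_chunk = ((mtu - kIpHeader) / 8) * 8;

    std::vector<int> chunks;
    while (remaining > mtu - kIpHeader) {  // does not fit in one packet yet
        chunks.push_back(max_chunk);
        remaining -= max_chunk;
    }
    chunks.push_back(remaining);           // last (or only) piece
    return chunks;
}
```

For the Len=1252 datagram and an MTU of 1200, `fragment_payloads(1252, 1200)` yields 1176 and 84 bytes, i.e. IP packets of 1196 and 104 bytes once the 20-byte header is added. That matches the capture lengths of 1200 and 108 if each frame also carries a 4-byte link-layer header, which is an inference from the numbers. The Len=815 datagram fits in a single packet.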

My client is running on OS X. Using dtruss, I see the following sequence repeated:

pselect(0x10, 0x10CFF6AEC, 0x0)      = 1 0
recvmsg(0xB, 0x7FFF52C56960, 0x80)       = -1 Err#35
recvmsg(0xC, 0x7FFF52C56960, 0x80)       = -1 Err#35
recvmsg(0xD, 0x7FFF52C56960, 0x80)       = 821 0
.. repeats ..

So the client receives the short packets (around 800 octets), but the reassembled UDP packet (longer than 1200 octets) is not delivered to the client even though it seems to be correctly fragmented as it comes out of the VPN.

The UDP fragments are being delivered (otherwise Wireshark wouldn't see them).

Possibilities I can think of:

  • All UDP fragments are simply discarded
  • UDP reassembly failed
  • The reassembled packet is discarded, and no attempt made to deliver it to the client

Although at this point the problem does not seem to lie with mosh directly, a capability to (gradually) reduce the size of the UDP packets transmitted by mosh would be a way to work around scenarios such as this.

Re: "And can you get a tcpdump capture with -v so we can see if the DF bit is set?"

Here are the capture details for the frames named above (4, 5 and 6), which show that DF is not set.

Frame 4: 1200 bytes on wire (9600 bits), 1200 bytes captured (9600 bits) on interface 0
Null/Loopback
Internet Protocol Version 4, Src: 10.184.151.232, Dst: 10.50.36.1
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x02 (DSCP: CS0, ECN: ECT(0))
    Total Length: 1196
    Identification: 0xc482 (50306)
    Flags: 0x01 (More Fragments)
    Fragment offset: 0
    Time to live: 245
    Protocol: UDP (17)
    Header checksum: 0x0be9 [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.184.151.232
    Destination: 10.50.36.1
    [Source GeoIP: Unknown]
    [Destination GeoIP: Unknown]
    Reassembled IPv4 in frame: 5
Data (1176 bytes)

Frame 5: 108 bytes on wire (864 bits), 108 bytes captured (864 bits) on interface 0
Null/Loopback
Internet Protocol Version 4, Src: 10.184.151.232, Dst: 10.50.36.1
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x02 (DSCP: CS0, ECN: ECT(0))
    Total Length: 104
    Identification: 0xc482 (50306)
    Flags: 0x00
    Fragment offset: 1176
    Time to live: 245
    Protocol: UDP (17)
    Header checksum: 0x2f9a [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.184.151.232
    Destination: 10.50.36.1
    [Source GeoIP: Unknown]
    [Destination GeoIP: Unknown]
    [2 IPv4 Fragments (1260 bytes): #4(1176), #5(84)]
        [Frame: 4, payload: 0-1175 (1176 bytes)]
        [Frame: 5, payload: 1176-1259 (84 bytes)]
        [Fragment count: 2]
        [Reassembled IPv4 length: 1260]
        [Reassembled IPv4 data: ea61fe2904eced8d80000000000012536457a225cc7ce3a7...]
User Datagram Protocol, Src Port: 60001, Dst Port: 65065
Data (1252 bytes)

Frame 6: 827 bytes on wire (6616 bits), 827 bytes captured (6616 bits) on interface 0
Null/Loopback
Internet Protocol Version 4, Src: 10.184.151.232, Dst: 10.50.36.1
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x02 (DSCP: CS0, ECN: ECT(0))
    Total Length: 823
    Identification: 0xc483 (50307)
    Flags: 0x00
    Fragment offset: 0
    Time to live: 245
    Protocol: UDP (17)
    Header checksum: 0x2d5d [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.184.151.232
    Destination: 10.50.36.1
    [Source GeoIP: Unknown]
    [Destination GeoIP: Unknown]
User Datagram Protocol, Src Port: 60001, Dst Port: 65065
Data (795 bytes)
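For anyone reading the raw header bytes instead of the Wireshark dissection, the DF and MF bits and the fragment offset share one 16-bit field (bytes 6 and 7 of the IPv4 header). The following is a minimal sketch of decoding it. `FragmentInfo` and `decode_fragment_field` are names made up for this example, and the field is assumed to have been converted to host byte order already.

```cpp
#include <cstdint>

// Hypothetical container for the decoded values.
struct FragmentInfo {
    bool dont_fragment;   // DF bit
    bool more_fragments;  // MF bit
    int offset_bytes;     // fragment offset, converted from 8-byte units
};

// Decode the 16-bit "flags + fragment offset" field of an IPv4 header.
// Layout, from the most significant bit: reserved, DF, MF, 13-bit offset.
FragmentInfo decode_fragment_field(std::uint16_t field) {
    FragmentInfo info;
    info.dont_fragment = (field & 0x4000) != 0;
    info.more_fragments = (field & 0x2000) != 0;
    info.offset_bytes = (field & 0x1FFF) * 8;  // offset is stored in 8-byte units
    return info;
}
```

Reconstructing the field values from the flags and offsets shown above (these are not bytes copied from the capture): frame 4 would be 0x2000 (MF set, offset 0), frame 5 would be 0x0093 (no flags, offset 147 × 8 = 1176) and frame 6 would be 0x0000. None of them has the 0x4000 (DF) bit set.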

Overall it seems to me that for mosh to be robust, it cannot rely on UDP fragmentation ever working properly across the diverse environments in which it will be used.

    /*
     * IPv4 MTU. Don't use full Ethernet-derived MTU,
     * mobile networks have high tunneling overhead.
     *
     * As of July 2016, VPN traffic over Amtrak Acela wifi seems to be
     * dropped if tunnelled packets are 1320 bytes or larger.  Use a
     * 1280-byte IPv4 MTU for now.
     *
     * We may have to implement ICMP-less PMTUD (RFC 4821) eventually.
     */
    static const int DEFAULT_IPV4_MTU = 1280;

Looking in RFC 4821, I read:

It is RECOMMENDED that search_low be initially set to an MTU size that is likely to work over a very wide range of environments. Given today's technologies, a value of 1024 bytes is probably safe enough. The initial value for search_low SHOULD be configurable.

Given that, perhaps DEFAULT_IPV4_MTU = 1024; would be a good compromise until mosh is capable of performing its own MTU probes.
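To make the trade-off concrete, here is the arithmetic as a small sketch. `max_udp_payload` is a hypothetical helper, not a mosh function, and it assumes a 20-byte IPv4 header without options plus the 8-byte UDP header. It ignores any extra encapsulation overhead a VPN adds.

```cpp
// Largest UDP payload that fits in one unfragmented IPv4 packet.
// Assumes a 20-byte IPv4 header (no options) plus the 8-byte UDP header.
constexpr int max_udp_payload(int ipv4_mtu) {
    return ipv4_mtu - 20 - 8;
}

// Candidate MTU values discussed in this thread.
static_assert(max_udp_payload(1280) == 1252, "current DEFAULT_IPV4_MTU");
static_assert(max_udp_payload(1200) == 1172, "MTU of the VPN interface");
static_assert(max_udp_payload(1024) == 996,  "RFC 4821 suggested search_low");
static_assert(max_udp_payload(576)  == 548,  "RFC 791 minimum datagram size");
```

The 1252-byte result for the current default appears consistent with the Len=1252 datagrams in the capture above, which are 80 bytes too large for the 1200-byte VPN interface. A 1024-byte default would keep them under it.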

By the book, RFC 791 only requires hosts to accept datagrams of up to 576 octets, so a host may legitimately drop anything larger, fragmented or not:

All hosts must be prepared to accept datagrams of up to 576 octets (whether they arrive whole or in fragments). It is recommended that hosts only send datagrams larger than 576 octets if they have assurance that the destination is prepared to accept the larger datagrams.

If mosh is doing its own MTU assessments, maybe consider the following:

  • Perform the probes on a separate (logical) control channel to allow the user data to use the current (usable) MTU unhindered
  • When constructing the probe, chunk the probe content into separate UDP packets so that the last UDP packet is 512 octets or less. The last UDP packet will not be fragmented and it will likely be received intact. The receiver can return a NAK if the last UDP packet was received, but the intervening ones were not.
  • After connection loss, rewind the MTU to 512, and use the MTU probes on the control channel to gradually, and quickly, find a good MTU.
  • It might be nice to periodically probe to see if the MTU can be increased even after a good MTU is established. Since these probes are sent on the control channel, user data flow will not be interrupted.
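As a rough illustration of the search step only, the sketch below does a plain binary search between a known-good size and an upper bound, loosely following RFC 4821's search_low and search_high terminology. It is not mosh code and it does not implement the chunked probe or the control channel described above. `probe` is a hypothetical callback that would send a probe of the given size and report whether the peer acknowledged it.

```cpp
#include <functional>

// Binary search for the largest packet size that `probe` reports as delivered.
// `probe(size)` is a hypothetical callback: it should send a probe of `size`
// bytes and return true only if the peer acknowledged it.
// Precondition: `search_low` is already known to work.
int find_usable_mtu(int search_low, int search_high,
                    const std::function<bool(int)>& probe) {
    while (search_low < search_high) {
        // Round up so the loop always makes progress.
        const int candidate = search_low + (search_high - search_low + 1) / 2;
        if (probe(candidate)) {
            search_low = candidate;       // candidate got through; try larger
        } else {
            search_high = candidate - 1;  // candidate was lost; try smaller
        }
    }
    return search_low;
}
```

Starting from 512 with an upper bound of 1500, a path that passes packets of up to 1200 bytes converges on 1200 after about ten probes. A real implementation would also have to tell ordinary packet loss apart from an MTU failure, for example by retrying a lost probe before lowering search_high.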

I had this same issue with OpenVPN, where I had set the MTU to 1200. I removed the setting in advanced settings and just used the default (which was 1500), and now all is well with the world.

The source where this gets defined is here for others who may need to recompile mosh to get it to work with their network setup.

It would definitely be nice for mosh to auto-detect the MTU and adjust accordingly.
