Algo: Intermittent routing failures with GCE servers

Created on 8 Apr 2017 · 17 comments · Source: trailofbits/algo

Algo (strongSwan) servers on Google Compute Engine intermittently fail to route packets from VPN clients. Typically, clients can connect to the VPN but then cannot browse some or all websites through it. For example, the rest of the internet works but google.com doesn't.

We have been under the impression that MSS/MTU issues were to blame, but adjusting these values does not appear to resolve the problem for most users. This has been reported more frequently than any other issue, and issues #358, #345, #310, #210, and #185 are likely all related.

References we've found that may be helpful:

Please add any specific troubleshooting steps you've tried that did or did not work to this issue so we can keep track of them. Thanks!

Labels: bug, cloud_provider

All 17 comments

(Disclosure - Google employee, posting in my own capacity)

Edit: MSS can be 1316 instead of 1280.
1316 = MTU (1460) - IPv6 (60) - IPsec w/ AES (60) - PPPoE (8) - NAT-T (16)
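The 1316 figure is the VM's 1460-byte MTU minus the per-layer overheads listed in the edit above. A small sketch that reproduces that subtraction makes it easy to rerun with different assumptions; the overhead values are copied from the comment and are the commenter's figures, not independently measured ones, and `clamp_mss` is a hypothetical helper name:

```python
from typing import Mapping

# Per-layer overheads in bytes, copied from the comment above.
# These are the commenter's figures, not measured values.
OVERHEADS = {
    "IPv6": 60,
    "IPsec w/ AES": 60,
    "PPPoE": 8,
    "NAT-T": 16,
}


def clamp_mss(mtu: int, overheads: Mapping[str, int]) -> int:
    """Subtract every listed encapsulation overhead from the link MTU."""
    mss = mtu - sum(overheads.values())
    if mss <= 0:
        raise ValueError("overheads leave no room for payload")
    return mss


if __name__ == "__main__":
    print(clamp_mss(1460, OVERHEADS))  # 1316
```

Dropping an entry raises the result by that many bytes; for example, leaving out the PPPoE term (only sensible if you are confident no client sits behind PPPoE) gives 1324.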

More info:

Something in the Ubuntu image is not interacting correctly with the underlying infrastructure. I managed to get the tunnel fully working by doing the following:

  1. Launch a VM using the Debian image and make sure IP forwarding is enabled
  2. SSH in and install git.
  3. Install the other Algo Python dependencies.
  4. Update pip: `sudo python -m pip install --upgrade pip`
  5. Skip the virtualenv setup and install the Python dependencies directly (`sudo pip install -r requirements.txt`). You'll have to fight your way through some problems here: rerunning the install gets you working, but it will complain about Azure dependency mismatches.
  6. Copy the keys to the client and connect. The tunnel should work, but the MTU will cause website load failures.
  7. `sudo iptables -t mangle -A FORWARD -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1316`
  8. Edit /etc/iptables/rules.v4 to persist the above rule:
```
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A FORWARD -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1316
COMMIT
```

Tunnel works fully - using one as I type this.

However, if I perform a local install (with virtualenv) on the Ubuntu image, the tunnel does not work. The client can connect, and pings to 8.8.8.8 seem to work, but DNS requests are black-holed: `dig google.com` just hangs.

I looked at tcpdumps from the servers. On Ubuntu, I can see the server forward the DNS request, and I can see the response ESP packet being sent, but it never arrives at the client.

Next I'll try to figure out how the response packets differ between Ubuntu and Debian. It's going to be difficult given the version differences.

Output from `ipsec version`:
Debian - Linux strongSwan U5.2.1/K3.16.0-4-amd64
Ubuntu - Linux strongSwan U5.3.5/K4.8.0-45-generic

strongSwan package versions:
Debian - 5.2.1-6+deb8u2
Ubuntu - 5.3.2-1ubuntu3
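Since the comparison hinges on these version strings, a tiny parser that splits the `ipsec version` line into its userland (`U...`) and kernel (`K...`) parts makes it easier to tabulate more hosts side by side. A minimal sketch; `parse_ipsec_version` is a hypothetical helper, and its regex is an assumption based only on the two sample lines above:

```python
import re
from typing import NamedTuple, Optional


class IpsecVersion(NamedTuple):
    userland: str  # strongSwan userland version (the "U..." field)
    kernel: str    # kernel release (the "K..." field)


# Written against sample lines such as "Linux strongSwan U5.2.1/K3.16.0-4-amd64".
_VERSION_RE = re.compile(r"strongSwan U(?P<userland>[^/\s]+)/K(?P<kernel>\S+)")


def parse_ipsec_version(line: str) -> Optional[IpsecVersion]:
    """Split one line of `ipsec version` output into userland and kernel versions."""
    match = _VERSION_RE.search(line)
    if match is None:
        return None
    return IpsecVersion(match.group("userland"), match.group("kernel"))


if __name__ == "__main__":
    debian = parse_ipsec_version("Linux strongSwan U5.2.1/K3.16.0-4-amd64")
    ubuntu = parse_ipsec_version("Linux strongSwan U5.3.5/K4.8.0-45-generic")
    for field in IpsecVersion._fields:
        print(f"{field}: Debian {getattr(debian, field)} vs Ubuntu {getattr(ubuntu, field)}")
```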

Note that we decreased the MSS in https://github.com/trailofbits/algo/commit/2ec6f41e0ff8fb304e300a872a6f5ffed92e236d. People can try testing Algo on GCE again to see whether IP forwarding and the reduced MSS help fix the issue.

@dguido The new solution fails to load any website at all.
I followed @kiratp's instructions and managed to get a fully operational VPN, but with Algo at https://github.com/trailofbits/algo/commit/2ec6f41e0ff8fb304e300a872a6f5ffed92e236d (installed without any errors), the VPN doesn't work at all.

EDIT: I tried the first version of @kiratp's workaround. I didn't bother to update it since it worked.

@dguido same here; with the latest commit there's no way to create a functional VPN, whether I follow @kiratp's instructions or the official one.

Actually, forwarding is not working at all with Google IP forwarding enabled. Rolled back.

`ping 208.67.220.220` works.
`dig ya.ru @208.67.220.220` does not: the packets are not returned to the client's PC, although I can see them on the interfaces:

```
16:10:09.670744 IP 10.19.48.1.64473 > 208.67.220.220.53: 59402+ A? ya.ru. (23)
16:10:09.670790 IP 10.132.0.2.64473 > 208.67.220.220.53: 59402+ A? ya.ru. (23)
16:10:09.676513 IP 208.67.220.220.53 > 10.132.0.2.64473: 59402 3/0/0 A 213.180.204.3, A 213.180.193.3, A 93.158.134.3 (71)
```
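In this trace the query leaves from the client's tunnel address (10.19.48.1), goes out from the VM's internal address (10.132.0.2), and the reply comes back to 10.132.0.2; no packet addressed to 10.19.48.1 appears in the excerpt. For longer captures, a sketch like the following lists DNS transaction IDs that a given address sent but never got a reply to. It assumes the one-line `tcpdump -n` format shown above and matches on the DNS ID field only; `unanswered_queries` is a hypothetical helper:

```python
import re
from typing import Iterable, Set

# Matches tcpdump -n lines such as:
# 16:10:09.670744 IP 10.19.48.1.64473 > 208.67.220.220.53: 59402+ A? ya.ru. (23)
_LINE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): (?P<txid>\d+)"
)


def unanswered_queries(lines: Iterable[str], client_ip: str) -> Set[int]:
    """DNS transaction IDs sent by client_ip with no reply addressed back to it."""
    sent: Set[int] = set()
    answered: Set[int] = set()
    for line in lines:
        m = _LINE.search(line)
        if m is None:
            continue  # not an IPv4 UDP line in the expected format
        txid = int(m.group("txid"))
        if m.group("src") == client_ip and m.group("dport") == "53":
            sent.add(txid)
        elif m.group("dst") == client_ip and m.group("sport") == "53":
            answered.add(txid)
    return sent - answered
```

Run against the three lines above, the internal address 10.132.0.2 has no unanswered queries, while the client address 10.19.48.1 is missing the reply to query 59402, which matches the report that responses never make it back through the tunnel.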

More info:

This is on Ubuntu, with the latest Algo build and IP forwarding enabled on the VM.

```
(env) kiratp@kiratp:algo-corp $ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=52 time=145.245 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=52 time=192.259 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=52 time=228.272 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=52 time=50.967 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=52 time=15.980 ms
^C
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 15.980/126.545/228.272/81.183 ms
(env) kiratp@kiratp:algo-corp $ dig google.com

; <<>> DiG 9.8.3-P1 <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached

(env) kiratp@kiratp-:algo-corp $ ifconfig ipsec0
ipsec0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1400
	inet 10.19.48.1 --> 10.19.48.1 netmask 0xff000000
	inet6 fe80::c6b3:1ff:febe:9493%ipsec0 prefixlen 64 scopeid 0xd
	inet6 fd9d:bc11:4020::1 prefixlen 64
	nd6 options=201<PERFORMNUD,DAD>
```
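The transcript shows ICMP getting through while DNS times out. To repeat that DNS check from a script (for example, in a loop while trying different server settings), a single UDP query with an explicit timeout is enough. A stdlib-only sketch; `build_dns_query` and `dns_probe` are hypothetical helpers, not part of Algo, and the packet layout follows RFC 1035:

```python
import random
import socket
import struct
from typing import Optional


def build_dns_query(name: str, txid: Optional[int] = None, qtype: int = 1) -> bytes:
    """Build a minimal RFC 1035 query for `name` (qtype 1 = A record)."""
    if txid is None:
        txid = random.randint(0, 0xFFFF)
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, AN/NS/ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE, then QCLASS 1 = IN
    return header + qname + struct.pack("!HH", qtype, 1)


def dns_probe(server: str, name: str, timeout: float = 3.0) -> bool:
    """Send one UDP query to server:53; True if a reply with our ID arrives in time."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False  # same symptom as dig's "connection timed out"
    # The first two bytes of a DNS message are the transaction ID.
    return reply[:2] == query[:2]
```

From a client connected to the VPN, `dns_probe("8.8.8.8", "google.com")` returning `False` would reproduce the hang above without relying on the client's resolver configuration.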

Another item: my GCE Debian instance works (using @kiratp's instructions), but it disconnects hourly. It is predictable, right around the hour mark. I tried more aggressive IPv4 TCP keepalive settings:
`sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5`
From https://cloud.google.com/compute/docs/troubleshooting
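For reference, here is what those three sysctls mean in seconds: Linux sends the first keepalive probe after `tcp_keepalive_time` seconds of idleness, then up to `tcp_keepalive_probes` probes spaced `tcp_keepalive_intvl` apart, and drops the connection if none is answered. A small sketch of that arithmetic; `keepalive_timing` is a hypothetical helper that models only these three settings:

```python
from typing import NamedTuple


class KeepaliveTiming(NamedTuple):
    first_probe_after: int  # seconds of idleness before the first probe is sent
    dead_after: int         # seconds of idleness before the connection is dropped


def keepalive_timing(time_s: int, intvl_s: int, probes: int) -> KeepaliveTiming:
    """Wait time_s, then allow `probes` unanswered probes spaced intvl_s apart."""
    return KeepaliveTiming(first_probe_after=time_s, dead_after=time_s + intvl_s * probes)


if __name__ == "__main__":
    print(keepalive_timing(60, 60, 5))    # KeepaliveTiming(first_probe_after=60, dead_after=360)
    print(keepalive_timing(7200, 75, 9))  # Linux defaults: KeepaliveTiming(first_probe_after=7200, dead_after=7875)
```

So the command above makes an idle TCP connection send its first probe after 1 minute instead of the default 2 hours.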

OK, I'm calling it. The issues here are not going to resolve themselves quickly, so we should disable deploys to GCE. @gunph1ld How do you want to handle this? Should we remove all the code or just hide it from the UI?

Put it on a branch?

@dguido I will move it from the master branch

I did not delete it.
Just tested: it works on Ubuntu 17.04.

I can also confirm strongSwan works for me on 17.04; it fails with the previous release.

+1, 17.04 working here too.

So is this issue closed now, or what's the verdict?

Ubuntu 16.04 is working too. Closing for now.
