Calico: TCP offloading on vxlan.calico adaptor causing 63 second delays in VXLAN communications node->nodeport or node->clusterip:port.

Created on 20 Jan 2020 · 26 Comments · Source: projectcalico/calico

I am experiencing a 63 second delay in VXLAN communications node->node:nodeport or node->clusterip:port. After inspecting pcaps on both sending and receiving nodes it appears related to TCP offloading on the vxlan.calico interface. Disabling this through ethtool appears to 'resolve' the issue, but I'm entirely unsure whether this is a good idea or not, or if there's a better fix?

Expected Behavior

From a node, the following should work in all cases:

curl localhost:[nodeport]
curl [nodeip]:[nodeport]
curl [clusterip]:[port]

Current Behavior

Consider the following:

A: node running pod [web], a simple web service (containous/whoami)
B: node running pod [alpine], a base container; exec sh
C: node not doing anything in particular.

With service defined:

$ kubectl describe svc whoami-cluster-nodeport
  ...
    Type:                     NodePort
    NodePort:                 web  32081/TCP
    External Traffic Policy:  Cluster

I get the following results when trying to access the web service:

| from/to | [web_ip]:80 | [cluster_ip]:80 | localhost:32081 | [a_ip]:32081 | [b_ip]:32081 | [c_ip]:32081 |
|---------|:-----------:|:---------------:|:---------------:|:------------:|:------------:|:------------:|
| node A | ok | ok | ok | ok | ok | ok |
| node B | ok | 63 seconds | 63 seconds | ok | 63 seconds | ok |
| node C | ok | 63 seconds | 63 seconds | ok | ok | 63 seconds |
| pod alpine | ok | ok | - | ok | 63 seconds | ok |
| external host | - | - | - | ok | ok | ok |

Further, if I change the replica count so that pods run on A & C (round-robin load balancing between 2 nodes), the 63 second delay occurs half of the time from those hosting nodes:

from A: curl localhost:32081
from A: curl [node_a_ip]:32081

The problem seems to stem from routing traffic that originates on one node to another node. I did a trace and tcpdump from C -> A (via localhost:32081 on C; see below). On both nodes, tcpdump shows repeated SYN packets attempting to establish the connection. They all show "bad udp cksum 0xffff -> 0x76dc!" in the results. After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.
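The 63-second figure is consistent with TCP's exponential backoff for SYN retransmissions: in the client capture in this issue the SYNs are spaced 1, 2, 4, 8, 16 and 32 seconds apart, and the seventh attempt is the one that gets through. A minimal sketch of that arithmetic, assuming an initial retransmission timeout of 1 second that doubles on every retry (the function name is made up for illustration):

```shell
# Cumulative delay before the Nth retransmitted SYN is sent, assuming an
# initial retransmission timeout of 1 second that doubles after each attempt.
syn_retry_delay() (
    retries=$1
    rto=1      # assumed initial timeout, in seconds
    total=0
    i=0
    while [ "$i" -lt "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        i=$((i + 1))
    done
    echo "$total"
)

syn_retry_delay 6   # 1 + 2 + 4 + 8 + 16 + 32
```

Six retransmissions add up to 63 seconds, which matches the gap between the first SYN (11:51:31) and the one that is finally answered (11:52:34).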

I had to disable TCP Offloading... after issuing this command, curl localhost:32081 worked consistently on all nodes.

# ethtool --offload vxlan.calico rx off tx off
Actual changes:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ip-generic: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [requested on]
        tx-tcp6-segmentation: off [requested on]
        tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
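
To check whether the workaround is in effect on a given node, the current offload state can be read back with ethtool -k. The following is a rough sketch rather than a tested tool: offload_state is a hypothetical helper that picks one feature out of that output, and it assumes the "name: on/off" line format seen in the ethtool output above.

```shell
# Print the state ("on" or "off") of one offload feature, reading the output
# of `ethtool -k <device>` on standard input. Prints nothing if the feature
# is not listed.
offload_state() {
    # $1 is the feature name as ethtool prints it, e.g. tx-checksumming
    awk -v feature="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # sub-features are indented
            prefix = feature ": "
            if (index(line, prefix) == 1) {     # line starts with "<feature>: "
                split(substr(line, length(prefix) + 1), words, " ")
                print words[1]                  # drop "[fixed]" / "[requested on]"
                exit
            }
        }'
}

# Usage on a node (needs ethtool and the vxlan.calico device):
#   ethtool -k vxlan.calico | offload_state tx-checksumming
```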

So... I'm entirely unsure whether this is a good idea or not. Is there a way to fix this through iptables instead? Or does this need to be fixed in the OS or the hosting layer (vSphere)?

TRACE from Node C (client)

# sudo perf trace --no-syscalls --event 'net:*' wget -q -O /dev/null localhost:32081
     0.000 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c6ed93d00f8 len=66
     0.033 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c6ed93d00f8 len=116
     0.084 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c6ed93d00f8 len=116 rc=0
     0.088 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c6ed93d00f8 len=66 rc=0
 63122.053 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c607b6be0f8 len=165
 63122.070 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c607b6be0f8 len=215
 63122.078 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c607b6be0f8 len=215 rc=0
 63122.080 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c607b6be0f8 len=165 rc=0
 63123.135 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c607b6b90f8 len=54
 63123.154 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c607b6b90f8 len=104
 63123.162 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c607b6b90f8 len=104 rc=0
 63123.165 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c607b6b90f8 len=54 rc=0

TCPDUMP from Node C (client)

# tcpdump -vv host [node_a_ip]
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.656624 IP (tos 0x0, ttl 64, id 59791, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8153, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:32.657185 IP (tos 0x0, ttl 64, id 60020, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8154, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:34.661166 IP (tos 0x0, ttl 64, id 61933, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8155, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:36.669116 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 28
11:51:36.669385 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 46
11:51:38.669155 IP (tos 0x0, ttl 64, id 65370, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8156, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:46.685204 IP (tos 0x0, ttl 64, id 3142, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8157, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:02.733178 IP (tos 0x0, ttl 64, id 5028, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8158, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797203 IP (tos 0x0, ttl 64, id 30608, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8159, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797731 IP (tos 0x0, ttl 64, id 33394, offset 0, flags [none], proto UDP (17), length 102)
    nodeA.local.36276 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.143.193.http > nodeC.39274: Flags [S.], cksum 0x4348 (correct), seq 4013223269, ack 2295512835, win 28000, options [mss 1400,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797854 IP (tos 0x0, ttl 64, id 30609, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8160, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xefe8 (correct), seq 1, ack 1, win 342, length 0
11:52:34.797936 IP (tos 0x0, ttl 64, id 30610, offset 0, flags [none], proto UDP (17), length 203)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8161, offset 0, flags [DF], proto TCP (6), length 153)
    nodeC.39274 > 10.244.143.193.http: Flags [P.], cksum 0x7ffc (correct), seq 1:114, ack 1, win 342, length 113: HTTP, length: 113
        GET / HTTP/1.1
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Host: localhost:32081
        Connection: Keep-Alive

11:52:34.798117 IP (tos 0x0, ttl 64, id 33395, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26251, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > nodeC.39274: Flags [.], cksum 0xeff2 (correct), seq 1, ack 114, win 219, length 0
11:52:34.798547 IP (tos 0x0, ttl 64, id 33396, offset 0, flags [none], proto UDP (17), length 458)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26252, offset 0, flags [DF], proto TCP (6), length 408)
    10.244.143.193.http > nodeC.39274: Flags [P.], cksum 0xe168 (correct), seq 1:369, ack 114, win 219, length 368: HTTP, length: 368
        HTTP/1.1 200 OK
        Date: Mon, 20 Jan 2020 16:52:34 GMT
        Content-Length: 250
        Content-Type: text/plain; charset=utf-8

        Hostname: whoami-66686d967d-mzk8p
        IP: 127.0.0.1
        IP: ::1
        IP: 10.244.143.193
        IP: fe80::d830:89ff:fe7f:2703
        RemoteAddr: 10.244.90.192:39274
        GET / HTTP/1.1
        Host: localhost:32081
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Connection: Keep-Alive

11:52:34.798602 IP (tos 0x0, ttl 64, id 30611, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8162, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xedff (correct), seq 114, ack 369, win 350, length 0
11:52:34.799551 IP (tos 0x0, ttl 64, id 30612, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8163, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [F.], cksum 0xedfe (correct), seq 114, ack 369, win 350, length 0
11:52:34.799731 IP (tos 0x0, ttl 64, id 33397, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26253, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > nodeC.39274: Flags [F.], cksum 0xee80 (correct), seq 369, ack 115, win 219, length 0
11:52:34.799779 IP (tos 0x0, ttl 64, id 30613, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8164, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xedfd (correct), seq 115, ack 370, win 350, length 0
11:52:39.805144 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 28
11:52:39.805399 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 46

TCPDUMP from Node A (hosting service pod)

# tcpdump -vv host [node_c_ip]
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.656705 IP (tos 0x0, ttl 64, id 59791, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8153, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:32.657297 IP (tos 0x0, ttl 64, id 60020, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8154, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:34.661301 IP (tos 0x0, ttl 64, id 61933, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8155, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:36.669200 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 46
11:51:36.669224 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 28
11:51:38.669297 IP (tos 0x0, ttl 64, id 65370, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8156, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:46.685305 IP (tos 0x0, ttl 64, id 3142, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8157, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:02.733268 IP (tos 0x0, ttl 64, id 5028, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8158, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797359 IP (tos 0x0, ttl 64, id 30608, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8159, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797533 IP (tos 0x0, ttl 64, id 33394, offset 0, flags [none], proto UDP (17), length 102)
    nodeA.local.36276 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.143.193.http > 10.244.90.192.39274: Flags [S.], cksum 0x4348 (correct), seq 4013223269, ack 2295512835, win 28000, options [mss 1400,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797895 IP (tos 0x0, ttl 64, id 30609, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8160, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xefe8 (correct), seq 1, ack 1, win 342, length 0
11:52:34.797960 IP (tos 0x0, ttl 64, id 30610, offset 0, flags [none], proto UDP (17), length 203)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8161, offset 0, flags [DF], proto TCP (6), length 153)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [P.], cksum 0x7ffc (correct), seq 1:114, ack 1, win 342, length 113: HTTP, length: 113
        GET / HTTP/1.1
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Host: localhost:32081
        Connection: Keep-Alive

11:52:34.797995 IP (tos 0x0, ttl 64, id 33395, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26251, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > 10.244.90.192.39274: Flags [.], cksum 0xeff2 (correct), seq 1, ack 114, win 219, length 0
11:52:34.798460 IP (tos 0x0, ttl 64, id 33396, offset 0, flags [none], proto UDP (17), length 458)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26252, offset 0, flags [DF], proto TCP (6), length 408)
    10.244.143.193.http > 10.244.90.192.39274: Flags [P.], cksum 0xe168 (correct), seq 1:369, ack 114, win 219, length 368: HTTP, length: 368
        HTTP/1.1 200 OK
        Date: Mon, 20 Jan 2020 16:52:34 GMT
        Content-Length: 250
        Content-Type: text/plain; charset=utf-8

        Hostname: whoami-66686d967d-mzk8p
        IP: 127.0.0.1
        IP: ::1
        IP: 10.244.143.193
        IP: fe80::d830:89ff:fe7f:2703
        RemoteAddr: 10.244.90.192:39274
        GET / HTTP/1.1
        Host: localhost:32081
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Connection: Keep-Alive

11:52:34.798635 IP (tos 0x0, ttl 64, id 30611, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8162, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xedff (correct), seq 114, ack 369, win 350, length 0
11:52:34.799589 IP (tos 0x0, ttl 64, id 30612, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8163, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [F.], cksum 0xedfe (correct), seq 114, ack 369, win 350, length 0
11:52:34.799655 IP (tos 0x0, ttl 64, id 33397, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26253, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > 10.244.90.192.39274: Flags [F.], cksum 0xee80 (correct), seq 369, ack 115, win 219, length 0
11:52:34.799814 IP (tos 0x0, ttl 64, id 30613, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8164, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xedfd (correct), seq 115, ack 370, win 350, length 0
11:52:39.805231 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 46
11:52:39.805256 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 28

Possible Solution

# ethtool --offload vxlan.calico rx off tx off

Steps to Reproduce (for bugs)

kubeadm init
install calico configured for vxlan
install a simple web pod and a nodeport service
attempt to access nodeport from cluster

Context

There are times when we need to be able to access a service from a node (e.g., log shipping from the node to a hosted service, hosted app API access, a k8s-hosted registry), and this defect interferes with normal communications.

Your Environment

Calico version: v3.11
Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17
Operating System and version: RHEL 7 (version 3.10.0-1062.9.1.el7.x86_64)
Link to your project (optional):

impact/high kind/bug likelihood/low

Source

jelaryma

All 26 comments

Hello,

We have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to reliably talk to the other third, I stumbled upon this issue.

I can report that disabling offloading completely works around this issue. In our case, the command is: sudo ethtool --offload flannel.1 rx off tx off (because we are running flannel).

Your Environment

Calico version: v3.11
Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17.1
Operating System and version: CentOS 7 (3.10.0-1062.9.1.el7.x86_64)

KarlHerler on 26 Jan 2020

👍3

Same issue here: we have a 63-second delay when connecting to ClusterIP services in a CentOS 7 / k8s 1.7.2 / Calico 3.12.0 cluster running on Hetzner Cloud. Disabling Ethernet offloading resolves the connection issues.

3cky on 6 Feb 2020

ethtool --offload interface rx off tx off

Thanks, it works (flannel + k8s 1.17.2 on CentOS 7).
But how did you know that disabling TCP offload would work?

phantooom on 6 Feb 2020

@jelaryma I am having a similar problem. https://github.com/kubernetes/kubernetes/issues/88986
I also measured the 63 second delay and I am using flannel.

I came to think it was a flannel issue, but you are also seeing this with calico..
https://github.com/coreos/flannel/issues/1268

Is it your thought that this is a k8s bug?

Also, would I run the following command on every node of the cluster?
sudo ethtool --offload flannel.1 rx off tx off

davesargrad on 20 Mar 2020

There's a thread in SIG network about this: https://groups.google.com/forum/#!topic/kubernetes-sig-network/JxkTLd4M8WM

Summary so far seems to be that this is a kernel bug related to VXLAN offload where the checksum calculation is not properly offloaded.

caseydavenport on 20 Mar 2020

👍1

Same issue with k8s 1.18 + Ubuntu 16.04 with Calico 3.11, although it only causes 3-second delays.

ethtool --offload interface rx off tx off works around it perfectly!

hien on 30 Mar 2020

👍1

Flannel now has an open PR addressing this. https://github.com/coreos/flannel/pull/1282

gamer22026 on 16 Apr 2020

see this https://github.com/coreos/flannel/pull/1282#issuecomment-635145841

zhangguanzhang on 28 May 2020

I have been hit by this bug as well. Does anyone know where to add the ethtool command to make it persistent after a reboot on CentOS 7? I tried adding it to rc.local, but it looks like the device is created after the script runs, because I am getting a "Cannot get device feature names: No such device" error.

giordyb on 12 Jun 2020
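
One way around the "No such device" error is to wait for the interface to exist before calling ethtool. A minimal sketch, assuming the interface shows up under /sys/class/net once the CNI plugin has created it; wait_for_iface, SYSFS_NET and the 600-second timeout are illustrative choices, not anything shipped by Calico or flannel:

```shell
# Root of the per-interface sysfs directories; overridable for testing.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Wait until the interface $1 exists, polling once a second for at most $2
# seconds. Returns 0 once the device is present, 1 on timeout.
wait_for_iface() (
    iface=$1
    timeout=$2
    waited=0
    while [ ! -e "$SYSFS_NET/$iface" ]; do
        [ "$waited" -ge "$timeout" ] && exit 1
        sleep 1
        waited=$((waited + 1))
    done
    exit 0
)

# Usage at boot (as root), once the CNI plugin has had time to start:
#   wait_for_iface vxlan.calico 600 && ethtool --offload vxlan.calico rx off tx off
```

Run from a boot-time script, this only disables offload once the device is actually there, and gives up with a non-zero exit status if it never appears.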

@jelaryma
After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.

The 63 seconds may correspond to 5 retransmissions, but I'm still confused about the cause of this issue. Do you know of any articles or blog posts about the 'no cksum' flag?
Thanks.

Bowser1704 on 1 Jul 2020

@jelaryma
After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.

The 63 seconds may correspond to 5 retransmissions, but I'm still confused about the cause of this issue. Do you know of any articles or blog posts about the 'no cksum' flag?
Thanks.

https://zhangguanzhang.github.io/2020/05/23/k8s-vxlan-63-timeout/

zhangguanzhang on 1 Jul 2020

I have been hit by this bug as well. Does anyone know where to add the ethtool command to make it persistent after a reboot on CentOS 7? I tried adding it to rc.local, but it looks like the device is created after the script runs, because I am getting a "Cannot get device feature names: No such device" error.

Did you find a solution for a persistent fix after reboot?

balleon on 4 Jul 2020

Did you find a solution for a persistent fix after reboot?

@balleon Nope, thankfully the servers don't get rebooted very often...

giordyb on 4 Jul 2020

any better solution?

xiaods on 9 Jul 2020

any better solution?

On my Kubernetes 1.18.5 / CentOS 7 cluster I use a custom kube-proxy image.

FROM k8s.gcr.io/kube-proxy:v1.18.5
RUN rm -f /usr/sbin/iptables && clean-install iptables

balleon on 9 Jul 2020

👍1

Hello,

We have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to reliably talk to the other third, I stumbled upon this issue.

I can report that disabling offloading completely works around this issue. In our case, the command is: sudo ethtool --offload flannel.1 rx off tx off (because we are running flannel).

Your Environment

Calico version: v3.11

Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17.1

Operating System and version: CentOS 7 (3.10.0-1062.9.1.el7.x86_64)

THIS WORKS!!!

ctopher7 on 12 Jul 2020

The latest Kubernetes release should fix this issue.
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changes-by-kind

Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. (#92035, @danwinship) [SIG Network and Node]

I tried it in the following environment:

CentOS 7.8 (3.10.0-1127)

Kubernetes 1.18.6

Calico 1.15.1 (VXLAN)

The problem is still there; I have to run sudo ethtool --offload vxlan.calico rx off tx off on all hosts to have a working cluster.
Did you find a fix that doesn't require disabling offload on the vxlan.calico interface?
This workaround isn't persistent after reboot, so it can't be applied in a production environment.

balleon on 9 Aug 2020

😕1 👍1

The latest Kubernetes release should fix this issue.
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changes-by-kind

Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. (#92035, @danwinship) [SIG Network and Node]

I tried it in the following environment:

CentOS 7.8 (3.10.0-1127)

Kubernetes 1.18.6

Calico 1.15.1 (VXLAN)

The problem is still there; I have to run sudo ethtool --offload vxlan.calico rx off tx off on all hosts to have a working cluster.
Did you find a fix that doesn't require disabling offload on the vxlan.calico interface?
This workaround isn't persistent after reboot, so it can't be applied in a production environment.

@danwinship PTAL

zhangguanzhang on 9 Aug 2020

Is it possible that Calico creates iptables rules of its own which run into the same problems Kubernetes was running into? (The initial comment on the kubernetes PR and this later comment give a pretty detailed explanation of the problem.)

danwinship on 9 Aug 2020

To my knowledge Calico doesn't NAT packets that would then be sent over the VXLAN device (which IIUC is the scenario that triggers this checksum bug). We should only be performing NAT on packets which are destined outside of the cluster (but I could be misremembering something).

caseydavenport on 11 Aug 2020

I don't know _exactly_ what cases were triggering that upstream. There are weird edge cases, like the rule that says if a pod connects to a service that it is an endpoint of, it has to be SNAT'ed. So given pods A and B in Service X on different nodes, a connection from pod A to Service X's Cluster IP to pod B would end up both SNAT'ed and tunneled over the VXLAN.

But in general, if you have any rules that do "if marked then masquerade", it is best to rewrite them to do "if marked then unmark and then masquerade".

danwinship on 12 Aug 2020

👍1
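
The "unmark, then masquerade" ordering described above matches the three-rule KUBE-POSTROUTING chain that appears in a listing later in this thread. As a hedged illustration only, the sketch below prints (and deliberately does not run) iptables commands with that shape; the function name and the EXAMPLE-POSTROUTING chain are hypothetical, and 0x4000 is simply the mark visible in that listing.

```shell
# Print (not run) iptables commands for the "unmark, then masquerade" pattern.
# $1 is a chain name and $2 a single-bit mark; both are placeholders here.
print_unmark_then_masquerade() (
    chain=$1
    mark=$2
    # 1. Leave packets that do not carry the mark untouched.
    echo "iptables -t nat -A $chain -m mark ! --mark $mark/$mark -j RETURN"
    # 2. Clear the mark by XOR-ing it with itself before doing NAT.
    echo "iptables -t nat -A $chain -j MARK --xor-mark $mark"
    # 3. Only then masquerade.
    echo "iptables -t nat -A $chain -j MASQUERADE"
)

print_unmark_then_masquerade EXAMPLE-POSTROUTING 0x4000
```

Printing the commands keeps the example safe to run anywhere; applying them to a real chain is a separate decision that depends on how the CNI plugin and kube-proxy manage their own rules.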

@balleon hm, did you update from a bad kubernetes release to a fixed kubernetes release without rebooting? I just realized that the new code doesn't make any effort to clean up the old broken iptables rules, so if you just installed the new release and restarted kubelet, you probably still have the bad rules in KUBE-POSTROUTING. (The new rules would get appended after the old ones, and thus would have no effect.)

danwinship on 18 Aug 2020

@balleon hm, did you update from a bad kubernetes release to a fixed kubernetes release without rebooting? I just realized that the new code doesn't make any effort to clean up the old broken iptables rules, so if you just installed the new release and restart kubelet, you probably still have the bad rules in KUBE-POSTROUTING. (The new rules would get appended after the old ones, and thus would have no effect.)

I'm still using a custom kube-proxy Dockerfile.
A fresh install of Kubernetes 1.18.6 doesn't fix the problem.
Maybe Calico 1.16 with the following option can fix the problem:
https://github.com/projectcalico/libcalico-go/pull/1264

balleon on 14 Sep 2020

👍1

I have the same issue here, tested on Kubernetes v1.18.12 with Calico v3.17.

I tried the new option FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false", but kube-proxy still generates a MASQUERADE rule with random-fully.

With FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false":

# iptables -t nat -L -n | grep fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

Without FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false":

# iptables -t nat -L -n | grep fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:e9dnSgSVNmIcpVhP */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully
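
The same check can be scripted to show which component owns each remaining random-fully rule. This is only a sketch: masq_random_fully_owners is a made-up helper, and it guesses the owner purely from the rule comments seen in the listings above ("cali:" for Calico, "kubernetes" for kube-proxy).

```shell
# Read `iptables -t nat -L -n` output on stdin and print one label per
# MASQUERADE rule that uses random-fully, based on the rule's comment.
masq_random_fully_owners() {
    grep 'MASQUERADE' | grep 'random-fully' | while IFS= read -r rule; do
        case $rule in
            *'/* cali:'*)      echo "calico" ;;
            *'/* kubernetes'*) echo "kube-proxy" ;;
            *)                 echo "other" ;;
        esac
    done
}

# Usage on a node (as root):
#   iptables -t nat -L -n | masq_random_fully_owners | sort | uniq -c
```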

I was able to solve it by manually adding a MASQUERADE rule without random-fully option:

# dig +timeout=2 @10.96.0.10 google.com
; <<>> DiG 9.16.1-Ubuntu <<>> +timeout @10.96.0.10 google.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

# iptables -t nat -L KUBE-POSTROUTING -n -v
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
 1049 59949 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
   10   807 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
   10   807 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

# iptables -t nat -I KUBE-POSTROUTING 3 -j MASQUERADE

# iptables -t nat -L KUBE-POSTROUTING -n -v
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
 1064 60774 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
   10   807 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    0     0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0
   10   807 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

# dig +timeout=2 @10.96.0.10 google.com
; <<>> DiG 9.16.1-Ubuntu <<>> +timeout @10.96.0.10 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7434
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A
;; ANSWER SECTION:
google.com.             30      IN      A       142.250.74.238
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Thu Nov 26 11:05:49 UTC 2020
;; MSG SIZE  rcvd: 65

But that's not reboot proof :( We should have an option in kube-proxy to disable the random-fully option.
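If the manual rule has to be re-applied (after a reboot, or if kube-proxy rewrites the chain when it resyncs, which I'd expect it may do), an idempotent version avoids stacking duplicates. This is only a sketch: the function name ensure_plain_masquerade is hypothetical, and the default position 3 is taken from the KUBE-POSTROUTING listing above, so it may differ on other clusters.

```shell
# Hypothetical helper: insert a plain MASQUERADE rule (no random-fully)
# into KUBE-POSTROUTING only if an identical rule is not already there.
# Assumption: position 3 places it just before kube-proxy's own rule,
# as in the chain listing above. Must run as root.
ensure_plain_masquerade() {
  pos=${1:-3}
  # -C exits 0 when a rule matching this exact specification exists
  if iptables -t nat -C KUBE-POSTROUTING -j MASQUERADE 2>/dev/null; then
    echo "plain MASQUERADE rule already present"
  else
    iptables -t nat -I KUBE-POSTROUTING "$pos" -j MASQUERADE &&
      echo "inserted plain MASQUERADE rule at position $pos"
  fi
}
```

Calling ensure_plain_masquerade twice in a row should insert the rule once and report it as present the second time.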

couloum on 26 Nov 2020

If anyone is interested in a reboot-proof way to apply the workaround (disabling offloading on vxlan.calico), you can use these 2 files:

/etc/systemd/system/disable-offloading-on-vxlan.service

[Unit]
Description=Disable offloading on vxlan.calico network interface
Requires=kubelet.service
Documentation=https://github.com/projectcalico/calico/issues/3145

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disable-offloading-on-vxlan
RemainAfterExit=yes
Restart=no
TimeoutStartSec=660

[Install]
WantedBy=multi-user.target

/usr/local/bin/disable-offloading-on-vxlan

#!/bin/bash

# This script applies a workaround for a bug encountered on Kubernetes with vxlan
# and iptables >= 1.6.2.
# You can find more details on this bug here:
# https://github.com/kubernetes/kubernetes/issues/96868
# https://github.com/projectcalico/calico/issues/3145
#
# The workaround is to disable offloading on the vxlan interface

# Wait until an interface named vxlan.calico appears
# Wait a maximum of 10 minutes (= 60 checks every 10 seconds)

sleep_interval=10
max_retries=60
nb_tries=0
nic_name="vxlan.calico"

is_nic_available() {
  ip a show dev "$nic_name" > /dev/null 2>&1
}

deactivate_offloading() {
  ethtool --offload "$nic_name" rx off tx off
}

check_offloading() {
  # Return an error if at least one offload is enabled (rx or tx)
  if ethtool --show-offload "$nic_name" | grep -E '^.x-checksumming:' | grep -q ': on'; then
    return 1
  else
    return 0
  fi
}

echo "Starting $(basename "$0")"
echo "This will disable RX and TX offloading on network interface $nic_name"

while [[ $nb_tries -lt $max_retries ]]; do
  if is_nic_available; then
    echo "Network interface $nic_name found! Disabling offloading on it..."
    deactivate_offloading
    sleep 2
    if check_offloading; then
      echo "Offloading successfully disabled on interface $nic_name"
      exit 0
    else
      echo "Offloading has not been disabled correctly on interface $nic_name. Please check what happened"
      exit 2
    fi
  fi

  nb_tries=$((nb_tries + 1))

  echo "Network interface $nic_name does not exist yet. Waiting ${sleep_interval}s for it to appear (attempt $nb_tries/$max_retries)"

  sleep $sleep_interval
done

# If we are here, then we have timed out
echo "Exiting after $nb_tries attempts to detect interface $nic_name"
exit 1
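For completeness, installing the two files could look roughly like the sketch below. The wrapper name enable_offload_workaround is hypothetical; the paths and unit name are the ones given above, and the commands need root.

```shell
# Hypothetical wrapper around the install steps for the two files above.
# Assumption: both files have already been copied to the paths shown.
enable_offload_workaround() {
  unit="disable-offloading-on-vxlan.service"
  script="/usr/local/bin/disable-offloading-on-vxlan"
  chmod 0755 "$script" &&          # the unit's ExecStart needs it executable
  systemctl daemon-reload &&       # make systemd pick up the new unit file
  systemctl enable --now "$unit"   # start it now and on every boot
}
```

Because the unit is Type=oneshot, the enable --now step waits for the script to finish, so it can take up to the 10 minute limit if vxlan.calico does not exist yet.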

couloum on 26 Nov 2020

FWIW, the upstream kernel patch that fixed the issues we root-caused for OpenShift and some Kubernetes use cases is https://github.com/torvalds/linux/commit/ea64d8d6c675c0bb712689b13810301de9d8f77a. It is present in 5.7 and later kernels, and in RHEL 8.2's kernel-4.18.0-193.13.2.el8_2 and later as of 2020-Jul-21. I presume CentOS 8.2 has this fix already.

Other distros (Ubuntu 20.04.1) may not yet have it, if they haven't updated their kernel or backported the patch.
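A rough way to check a node against the 5.7 threshold mentioned above is sketched here. The function name kernel_at_least is hypothetical and it relies on GNU sort -V. It only compares the upstream version number, so it will report "too old" for distro kernels that carry a backport (such as the RHEL 4.18.0-193.13.2 build above); treat a negative result as "check the distro changelog", not as proof the fix is missing.

```shell
# Hypothetical helper: succeed if the kernel version is >= the wanted one.
# Assumption: comparing the part before the first "-" is good enough;
# backported fixes in distro kernels are NOT detected.
kernel_at_least() {
  want=$1
  have=${2:-$(uname -r)}
  have=${have%%-*}   # 5.4.0-42-generic -> 5.4.0
  lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
  [ "$lowest" = "$want" ]
}
```

For example: kernel_at_least 5.7 && echo "upstream fix should be present" || echo "check for a backport".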