I am experiencing a 63 second delay in VXLAN communications node->node:nodeport or node->clusterip:port. After inspecting pcaps on both sending and receiving nodes it appears related to TCP offloading on the vxlan.calico interface. Disabling this through ethtool appears to 'resolve' the issue, but I'm entirely unsure whether this is a good idea or not, or if there's a better fix?
From a node, the following should work in all cases:
curl localhost:[nodeport]
curl [nodeip]:[nodeport]
curl [clusterip]:[port]
Consider the following:
A: node running pod [web], a simple web service (containous/whoami)
B: node running pod [alpine], a base container; exec sh
C: node not doing anything in particular.
With service defined:
$ kubectl describe svc whoami-cluster-nodeport
...
Type: NodePort
NodePort: web 32081/TCP
External Traffic Policy: Cluster
I get the following results when trying to access the web service:
| from/to | [web_ip]:80 | [cluster_ip]:80 | localhost:32081 | [a_ip]:32081 | [b_ip]:32081 | [c_ip]:32081 |
|----------------|:-----------:|:---------------:|:---------------:|:------------:|:------------:|:------------:|
| node A | ok | ok | ok | ok | ok | ok |
| node B | ok | 63 seconds | 63 seconds | ok | 63 seconds | ok |
| node C | ok | 63 seconds | 63 seconds | ok | ok | 63 seconds |
| pod alpine | ok | ok | - | ok | 63 seconds | ok |
| external host | - | - | - | ok | ok | ok |
Further, if I change replicas for the pods so that pods are on A & C (rr load balancing b/t 2 nodes), the 63 second delay will occur half of the time from those hosting nodes:
The problem seems to stem from routing traffic sourced from one node to another node. I did a trace and tcpdump from C -> A (via localhost:32081 on C; see below). On both nodes, tcpdump shows repeated SYN packets attempting to establish the connection. They all show "bad udp cksum 0xffff -> 0x76dc!" in the results. After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.
I had to disable TCP Offloading... after issuing this command, curl localhost:32081 worked consistently on all nodes.
# ethtool --offload vxlan.calico rx off tx off
Actual changes:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ip-generic: off
tcp-segmentation-offload: off
tx-tcp-segmentation: off [requested on]
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
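To see which checksum offloads are enabled on the interface before or after toggling them, the output of `ethtool --show-offload` can be filtered. A minimal sketch follows; the function name `checksum_offloads_on` is made up for illustration, and it only reads text on stdin so it can be tried against saved output:

```shell
#!/bin/bash
# Print the names of checksum-related offload features reported as "on".
# Reads `ethtool --show-offload <iface>` style output on stdin.
checksum_offloads_on() {
    awk -F': *' '/checksum/ && $2 ~ /^on/ { sub(/^[ \t]+/, "", $1); print $1 }'
}

# Illustrative sample (not taken from a real node); on a node you would pipe in
#   ethtool --show-offload vxlan.calico
sample='rx-checksumming: on
tx-checksumming: on
    tx-checksum-ip-generic: on
tcp-segmentation-offload: on
generic-receive-offload: on'
printf '%s\n' "$sample" | checksum_offloads_on
```

If this prints nothing for vxlan.calico after the ethtool command above, the checksum offloads are off.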
So.... I'm entirely unsure whether this is a good idea or not? Or whether there's a way to fix this through iptables? Or whether this needs fixed in the OS/hosting (VSphere)?
# sudo perf trace --no-syscalls --event 'net:*' wget -q -O /dev/null localhost:32081
0.000 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c6ed93d00f8 len=66
0.033 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c6ed93d00f8 len=116
0.084 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c6ed93d00f8 len=116 rc=0
0.088 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c6ed93d00f8 len=66 rc=0
63122.053 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c607b6be0f8 len=165
63122.070 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c607b6be0f8 len=215
63122.078 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c607b6be0f8 len=215 rc=0
63122.080 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c607b6be0f8 len=165 rc=0
63123.135 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c607b6b90f8 len=54
63123.154 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c607b6b90f8 len=104
63123.162 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c607b6b90f8 len=104 rc=0
63123.165 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c607b6b90f8 len=54 rc=0
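The delay shows up as the jump in the first column of the trace (timestamps in milliseconds). A small sketch that reports the largest gap between consecutive events; the function name `largest_gap_ms` is hypothetical and the sample lines are abbreviated from the trace above:

```shell
#!/bin/bash
# Report the largest gap (in ms) between consecutive events in
# `perf trace` style output, where the first field is a timestamp in ms.
largest_gap_ms() {
    awk '$1 ~ /^[0-9]+(\.[0-9]+)?$/ {
             if (seen && $1 - prev > max) max = $1 - prev
             prev = $1; seen = 1
         }
         END { printf "%.3f\n", max }'
}

# Timestamps taken from the trace above (other fields shortened).
printf '%s\n' \
    '0.000 net:net_dev_queue:dev=vxlan.calico len=66' \
    '0.088 net:net_dev_xmit:dev=vxlan.calico len=66 rc=0' \
    '63122.053 net:net_dev_queue:dev=vxlan.calico len=165' \
    '63123.165 net:net_dev_xmit:dev=vxlan.calico len=54 rc=0' |
    largest_gap_ms
```

For this trace the gap is a little over 63 seconds, between the first SYN leaving and the next packet being queued.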
# tcpdump -vv host [node_a_ip]
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.656624 IP (tos 0x0, ttl 64, id 59791, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8153, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:32.657185 IP (tos 0x0, ttl 64, id 60020, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8154, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:34.661166 IP (tos 0x0, ttl 64, id 61933, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8155, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:36.669116 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 28
11:51:36.669385 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 46
11:51:38.669155 IP (tos 0x0, ttl 64, id 65370, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8156, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:46.685204 IP (tos 0x0, ttl 64, id 3142, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8157, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:02.733178 IP (tos 0x0, ttl 64, id 5028, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8158, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797203 IP (tos 0x0, ttl 64, id 30608, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8159, offset 0, flags [DF], proto TCP (6), length 52)
nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797731 IP (tos 0x0, ttl 64, id 33394, offset 0, flags [none], proto UDP (17), length 102)
nodeA.local.36276 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.244.143.193.http > nodeC.39274: Flags [S.], cksum 0x4348 (correct), seq 4013223269, ack 2295512835, win 28000, options [mss 1400,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797854 IP (tos 0x0, ttl 64, id 30609, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8160, offset 0, flags [DF], proto TCP (6), length 40)
nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xefe8 (correct), seq 1, ack 1, win 342, length 0
11:52:34.797936 IP (tos 0x0, ttl 64, id 30610, offset 0, flags [none], proto UDP (17), length 203)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8161, offset 0, flags [DF], proto TCP (6), length 153)
nodeC.39274 > 10.244.143.193.http: Flags [P.], cksum 0x7ffc (correct), seq 1:114, ack 1, win 342, length 113: HTTP, length: 113
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: localhost:32081
Connection: Keep-Alive
11:52:34.798117 IP (tos 0x0, ttl 64, id 33395, offset 0, flags [none], proto UDP (17), length 90)
nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26251, offset 0, flags [DF], proto TCP (6), length 40)
10.244.143.193.http > nodeC.39274: Flags [.], cksum 0xeff2 (correct), seq 1, ack 114, win 219, length 0
11:52:34.798547 IP (tos 0x0, ttl 64, id 33396, offset 0, flags [none], proto UDP (17), length 458)
nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26252, offset 0, flags [DF], proto TCP (6), length 408)
10.244.143.193.http > nodeC.39274: Flags [P.], cksum 0xe168 (correct), seq 1:369, ack 114, win 219, length 368: HTTP, length: 368
HTTP/1.1 200 OK
Date: Mon, 20 Jan 2020 16:52:34 GMT
Content-Length: 250
Content-Type: text/plain; charset=utf-8
Hostname: whoami-66686d967d-mzk8p
IP: 127.0.0.1
IP: ::1
IP: 10.244.143.193
IP: fe80::d830:89ff:fe7f:2703
RemoteAddr: 10.244.90.192:39274
GET / HTTP/1.1
Host: localhost:32081
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Connection: Keep-Alive
11:52:34.798602 IP (tos 0x0, ttl 64, id 30611, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8162, offset 0, flags [DF], proto TCP (6), length 40)
nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xedff (correct), seq 114, ack 369, win 350, length 0
11:52:34.799551 IP (tos 0x0, ttl 64, id 30612, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8163, offset 0, flags [DF], proto TCP (6), length 40)
nodeC.39274 > 10.244.143.193.http: Flags [F.], cksum 0xedfe (correct), seq 114, ack 369, win 350, length 0
11:52:34.799731 IP (tos 0x0, ttl 64, id 33397, offset 0, flags [none], proto UDP (17), length 90)
nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26253, offset 0, flags [DF], proto TCP (6), length 40)
10.244.143.193.http > nodeC.39274: Flags [F.], cksum 0xee80 (correct), seq 369, ack 115, win 219, length 0
11:52:34.799779 IP (tos 0x0, ttl 64, id 30613, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8164, offset 0, flags [DF], proto TCP (6), length 40)
nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xedfd (correct), seq 115, ack 370, win 350, length 0
11:52:39.805144 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 28
11:52:39.805399 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 46
# tcpdump -vv host [node_c_ip]
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.656705 IP (tos 0x0, ttl 64, id 59791, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8153, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:32.657297 IP (tos 0x0, ttl 64, id 60020, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8154, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:34.661301 IP (tos 0x0, ttl 64, id 61933, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8155, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:36.669200 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 46
11:51:36.669224 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 28
11:51:38.669297 IP (tos 0x0, ttl 64, id 65370, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8156, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:46.685305 IP (tos 0x0, ttl 64, id 3142, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8157, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:02.733268 IP (tos 0x0, ttl 64, id 5028, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8158, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797359 IP (tos 0x0, ttl 64, id 30608, offset 0, flags [none], proto UDP (17), length 102)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8159, offset 0, flags [DF], proto TCP (6), length 52)
10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797533 IP (tos 0x0, ttl 64, id 33394, offset 0, flags [none], proto UDP (17), length 102)
nodeA.local.36276 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.244.143.193.http > 10.244.90.192.39274: Flags [S.], cksum 0x4348 (correct), seq 4013223269, ack 2295512835, win 28000, options [mss 1400,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797895 IP (tos 0x0, ttl 64, id 30609, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8160, offset 0, flags [DF], proto TCP (6), length 40)
10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xefe8 (correct), seq 1, ack 1, win 342, length 0
11:52:34.797960 IP (tos 0x0, ttl 64, id 30610, offset 0, flags [none], proto UDP (17), length 203)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8161, offset 0, flags [DF], proto TCP (6), length 153)
10.244.90.192.39274 > 10.244.143.193.http: Flags [P.], cksum 0x7ffc (correct), seq 1:114, ack 1, win 342, length 113: HTTP, length: 113
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: localhost:32081
Connection: Keep-Alive
11:52:34.797995 IP (tos 0x0, ttl 64, id 33395, offset 0, flags [none], proto UDP (17), length 90)
nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26251, offset 0, flags [DF], proto TCP (6), length 40)
10.244.143.193.http > 10.244.90.192.39274: Flags [.], cksum 0xeff2 (correct), seq 1, ack 114, win 219, length 0
11:52:34.798460 IP (tos 0x0, ttl 64, id 33396, offset 0, flags [none], proto UDP (17), length 458)
nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26252, offset 0, flags [DF], proto TCP (6), length 408)
10.244.143.193.http > 10.244.90.192.39274: Flags [P.], cksum 0xe168 (correct), seq 1:369, ack 114, win 219, length 368: HTTP, length: 368
HTTP/1.1 200 OK
Date: Mon, 20 Jan 2020 16:52:34 GMT
Content-Length: 250
Content-Type: text/plain; charset=utf-8
Hostname: whoami-66686d967d-mzk8p
IP: 127.0.0.1
IP: ::1
IP: 10.244.143.193
IP: fe80::d830:89ff:fe7f:2703
RemoteAddr: 10.244.90.192:39274
GET / HTTP/1.1
Host: localhost:32081
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Connection: Keep-Alive
11:52:34.798635 IP (tos 0x0, ttl 64, id 30611, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8162, offset 0, flags [DF], proto TCP (6), length 40)
10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xedff (correct), seq 114, ack 369, win 350, length 0
11:52:34.799589 IP (tos 0x0, ttl 64, id 30612, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8163, offset 0, flags [DF], proto TCP (6), length 40)
10.244.90.192.39274 > 10.244.143.193.http: Flags [F.], cksum 0xedfe (correct), seq 114, ack 369, win 350, length 0
11:52:34.799655 IP (tos 0x0, ttl 64, id 33397, offset 0, flags [none], proto UDP (17), length 90)
nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26253, offset 0, flags [DF], proto TCP (6), length 40)
10.244.143.193.http > 10.244.90.192.39274: Flags [F.], cksum 0xee80 (correct), seq 369, ack 115, win 219, length 0
11:52:34.799814 IP (tos 0x0, ttl 64, id 30613, offset 0, flags [none], proto UDP (17), length 90)
nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8164, offset 0, flags [DF], proto TCP (6), length 40)
10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xedfd (correct), seq 115, ack 370, win 350, length 0
11:52:39.805231 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 46
11:52:39.805256 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 28
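To summarize a capture like the two above, the outer VXLAN/UDP headers can be counted by checksum state. This is only a sketch: the function name `count_vxlan_cksum` is made up, and it assumes the `tcpdump -vv` text format shown here. Keep in mind that a bad checksum seen only on the sending host can be an artifact of offload itself; in this report it appears on both nodes.

```shell
#!/bin/bash
# Count outer VXLAN/UDP headers by checksum state in `tcpdump -vv` text.
# Reads tcpdump output on stdin.
count_vxlan_cksum() {
    awk '/VXLAN/ && /bad udp cksum/ { bad++ }
         /VXLAN/ && /no cksum/      { none++ }
         END { printf "bad=%d none=%d\n", bad, none }'
}

# Two lines copied from the capture above; on a node you might pipe in
#   tcpdump -vv -nn -i eth0 udp port 4789
printf '%s\n' \
    'nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096' \
    'nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096' |
    count_vxlan_cksum
```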
# ethtool --offload vxlan.calico rx off tx off
There are times when we need to be able to access a service from a node (eg, log shipping from the node to a hosted service, hosted app api access, k8s hosted registry) where this defect will interfere with normal communications
Hello,
We have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to reliably talk to the other third I stumbled upon this issue.
I can report that disabling offloading completely works around this issue. In our case, the command is: sudo ethtool --offload flannel.1 rx off tx off (because we are running flannel).
Same issue here: we see a 63-second delay when connecting to ClusterIP services in a CentOS 7 / k8s 1.7.2 / Calico 3.12.0 cluster running on Hetzner Cloud. Disabling Ethernet offloading resolves the connection issues.
ethtool --offload interface rx off tx off
thx it works. flannel+k8s1.17.2 centos7.
but how do you know disable tcp offload works?
@jelaryma I am having a similar problem. https://github.com/kubernetes/kubernetes/issues/88986
I also measured the 63 second delay and I am using flannel.
I came to think it was a flannel issue, but you are also seeing this with calico..
https://github.com/coreos/flannel/issues/1268
Is it your thought that this is a k8s bug?
Also, would I run the following command on every node of the cluster?
sudo ethtool --offload flannel.1 rx off tx off
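Since the VXLAN interface is local to each node, the setting would presumably have to be applied on every node. A dry-run sketch that only prints the commands; the function name, the node names, and ssh access to the nodes are all assumptions for illustration:

```shell
#!/bin/bash
# Print (not run) one ssh command per node that would disable offload on
# the given interface. Node names and ssh access are assumptions.
print_offload_cmds() {
    local iface=$1 node
    shift
    for node in "$@"; do
        printf 'ssh %s sudo ethtool --offload %s rx off tx off\n' "$node" "$iface"
    done
}

# Hypothetical node names; on a real cluster they could come from
#   kubectl get nodes -o jsonpath='{.items[*].metadata.name}'
print_offload_cmds flannel.1 node-a node-b node-c
```

Reviewing the printed commands before running them keeps the change explicit, and the interface name (flannel.1 or vxlan.calico) depends on the network plugin in use.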
There's a thread in SIG network about this: https://groups.google.com/forum/#!topic/kubernetes-sig-network/JxkTLd4M8WM
Summary so far seems to be that this is a kernel bug related to VXLAN offload where the checksum calculation is not properly offloaded.
Same issue with k8s 1.18 + Ubuntu 16.04 with Calico 3.11, though it only causes 3-second delays.
ethtool --offload interface rx off tx off works around it perfectly!
Flannel now has an open PR addressing this. https://github.com/coreos/flannel/pull/1282
see this https://github.com/coreos/flannel/pull/1282#issuecomment-635145841
I have been hit by this bug as well. Does anyone know where to add the ethtool command to make it persistent after a reboot on CentOS 7? I tried adding it to rc.local, but it looks like the device is created after the script runs, because I get a "Cannot get device feature names: No such device" error.
@jelaryma
After 63 seconds, a SYN packet is set with 'no cksum' and the connection is established.
63 seconds may refer to 5 retransmissions. But I'm still confused about the cause of this issue. Do you know of any articles or blog posts about the 'no cksum' flag?
Thanks.
https://zhangguanzhang.github.io/2020/05/23/k8s-vxlan-63-timeout/
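On the 63-second figure: the SYN timestamps in the capture (11:51:31, :32, :34, :38, :46, 11:52:02, 11:52:34) are consistent with an initial 1-second timeout that doubles on every retry, with the sixth retransmission finally getting through. A small sketch of that arithmetic, assuming a 1 s initial timeout and six retransmissions (the function name is made up):

```shell
#!/bin/bash
# Cumulative delay before each SYN retransmission, assuming an initial
# timeout of 1 second that doubles on every retry.
syn_backoff_total() {
    local retries=$1 timeout=1 total=0 i
    for ((i = 1; i <= retries; i++)); do
        total=$((total + timeout))
        echo "retransmission $i at ${total}s"
        timeout=$((timeout * 2))
    done
}

syn_backoff_total 6
```

The last line printed is the 63-second mark (1 + 2 + 4 + 8 + 16 + 32), which matches the point in the capture where the SYN goes out with 'no cksum'.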
Did you find a solution for a persistent fix after reboot?
@balleon Nope, thankfully the servers don't get rebooted very often...
any better solution?
On my Kubernetes 1.18.5 and CentOS 7 cluster I use a custom kube-proxy image:
FROM k8s.gcr.io/kube-proxy:v1.18.5
RUN rm -f /usr/sbin/iptables && clean-install iptables
Hello,
We have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to reliably talk to the other third I stumbled upon this issue.
I can report that disabling offloading completely works around this issue. In our case, the command is:
sudo ethtool --offload flannel.1 rx off tx off (because we are running flannel).
Your Environment
- Calico version: v3.11
- Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17.1
- Operating System and version: CentOS 7 (3.10.0-1062.9.1.el7.x86_64)
THIS WORKS!!!
The latest Kubernetes release should fix this issue.
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changes-by-kind
Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. (#92035, @danwinship) [SIG Network and Node]
I tried it in the following environment:
- CentOS 7.8 (3.10.0-1127)
- Kubernetes 1.18.6
- Calico 1.15.1 (VXLAN)
The problem is still there; I have to run sudo ethtool --offload vxlan.calico rx off tx off on all hosts to have a working cluster.
Did you find a fix that doesn't require disabling offload on the vxlan.calico interface?
This workaround isn't persistent after reboot so it can't be applied in a production environment.
@danwinship PTAL
Is it possible that Calico creates iptables rules of its own which run into the same problems Kubernetes was running into? (The initial comment on the kubernetes PR and this later comment give a pretty detailed explanation of the problem.)
To my knowledge Calico doesn't NAT packets that would then be sent over the VXLAN device (which IIUC is the scenario that triggers this checksum bug). We should only be performing NAT on packets which are destined outside of the cluster (but I could be misremembering something).
I don't know _exactly_ what cases were triggering that upstream. There are weird edge cases, like the rule that says if a pod connects to a service that it is an endpoint of, it has to be SNAT'ed. So given pods A and B in Service X on different nodes, a connection from pod A to Service X's Cluster IP to pod B would end up both SNAT'ed and tunneled over the VXLAN.
But in general, if you have any rules that do "if marked then masquerade", it is best to rewrite them to do "if marked then unmark and then masquerade".
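The "if marked then unmark and then masquerade" pattern can be sketched as three rules in iptables-save syntax. This is an illustration only: the function and chain names are hypothetical, and 0x4000 is simply the mark bit that appears in the KUBE-POSTROUTING listing further down this thread. The script prints the rules rather than installing them.

```shell
#!/bin/bash
# Print iptables-save style rules for "if marked then unmark and then
# masquerade". The chain name is hypothetical; the mark value is passed in.
emit_unmark_then_masq() {
    local chain=$1 mark=$2
    cat <<EOF
-A $chain -m mark ! --mark $mark/$mark -j RETURN
-A $chain -j MARK --xor-mark $mark
-A $chain -j MASQUERADE
EOF
}

emit_unmark_then_masq EXAMPLE-POSTROUTING 0x4000
```

The first rule returns early for unmarked packets, the second clears the mark bit on the remaining (marked) packets, and only then does the third rule masquerade them.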
@balleon hm, did you update from a bad kubernetes release to a fixed kubernetes release without rebooting? I just realized that the new code doesn't make any effort to clean up the old broken iptables rules, so if you just installed the new release and restart kubelet, you probably still have the bad rules in KUBE-POSTROUTING. (The new rules would get appended after the old ones, and thus would have no effect.)
Still using a custom kube-proxy Dockerfile.
Kubernetes 1.18.6 fresh install doesn't fix the problem.
Maybe Calico 1.16 with the following option can fix the problem
https://github.com/projectcalico/libcalico-go/pull/1264
I have the same issue here, tested on Kubernetes v1.18.12 with Calico v3.17.
I tried the new option FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false", but there's still a random-fully generated by kube-proxy on a MASQUERADE rule.
With FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false":
# iptables -t nat -L -n | grep fully
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ random-fully
Without FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false":
# iptables -t nat -L -n | grep fully
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ random-fully
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* cali:e9dnSgSVNmIcpVhP */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully
I was able to solve it by manually adding a MASQUERADE rule without random-fully option:
# dig +timeout=2 @10.96.0.10 google.com
; <<>> DiG 9.16.1-Ubuntu <<>> +timeout @10.96.0.10 google.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
# iptables -t nat -L KUBE-POSTROUTING -n -v
Chain KUBE-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
1049 59949 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000/0x4000
10 807 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK xor 0x4000
10 807 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ random-fully
# iptables -t nat -I KUBE-POSTROUTING 3 -j MASQUERADE
# iptables -t nat -L KUBE-POSTROUTING -n -v
Chain KUBE-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
1064 60774 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000/0x4000
10 807 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK xor 0x4000
0 0 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0
10 807 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ random-fully
# dig +timeout=2 @10.96.0.10 google.com
; <<>> DiG 9.16.1-Ubuntu <<>> +timeout @10.96.0.10 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7434
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 30 IN A 142.250.74.238
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Thu Nov 26 11:05:49 UTC 2020
;; MSG SIZE rcvd: 65
But that's not reboot proof :( We should have an option in kube-proxy to disable the random-fully option.
If anyone is interested in a reboot-proof way to apply the workaround (disabling offloading on vxlan.calico), you can use these two files:
/etc/systemd/system/disable-offloading-on-vxlan.service
[Unit]
Description=Disable offloading on vxlan.calico network interface
Requires=kubelet.service
Documentation=https://github.com/projectcalico/calico/issues/3145
[Service]
Type=oneshot
ExecStart=/usr/local/bin/disable-offloading-on-vxlan
RemainAfterExit=yes
Restart=no
TimeoutStartSec=660
[Install]
WantedBy=multi-user.target
/usr/local/bin/disable-offloading-on-vxlan
#!/bin/bash
# This script applies a workaround for a bug encountered on Kubernetes with vxlan
# and iptables >= 1.6.2.
# You can find more details on this bug here:
# https://github.com/kubernetes/kubernetes/issues/96868
# https://github.com/projectcalico/calico/issues/3145
#
# The workaround is to disable offloading on the vxlan interface.
# Wait until an interface named vxlan.calico appears.
# Wait a maximum of 10 minutes (= 60 checks every 10 seconds).
sleep_interval=10
max_retries=60
nb_tries=0
nic_name="vxlan.calico"
is_nic_available() {
    ip a show dev "$nic_name" > /dev/null 2>&1
}
deactivate_offloading() {
    ethtool --offload "$nic_name" rx off tx off
}
check_offloading() {
    # Return an error if at least one checksum offload (rx or tx) is still enabled
    if ethtool --show-offload "$nic_name" | grep -E '^.x-checksumming:' | grep -q ': on'; then
        return 1
    else
        return 0
    fi
}
echo "Starting $(basename "$0")"
echo "This will disable RX and TX offloading on network interface $nic_name"
while [[ $nb_tries -lt $max_retries ]]; do
    if is_nic_available; then
        echo "Network interface $nic_name found! Disabling offloading on it..."
        deactivate_offloading
        sleep 2
        if check_offloading; then
            echo "Offloading successfully disabled on interface $nic_name"
            exit 0
        else
            echo "Offloading has not been disabled correctly on interface $nic_name. Please check what happened"
            exit 2
        fi
    fi
    nb_tries=$((nb_tries + 1))
    echo "Network interface $nic_name does not exist yet. Waiting ${sleep_interval}s for it to appear (attempt $nb_tries/$max_retries)"
    sleep "$sleep_interval"
done
# If we are here, then we have timed out
echo "Exiting after $nb_tries attempts to detect interface $nic_name"
exit 1
FWIW, the upstream kernel patch that fixed the issues we had root-caused for OpenShift and some Kubernetes use-cases is https://github.com/torvalds/linux/commit/ea64d8d6c675c0bb712689b13810301de9d8f77a and is present in the 5.7 and later kernels, and the RHEL 8.2's kernel-4.18.0-193.13.2.el8_2 and later as of 2020-Jul-21. I presume CentOS 8.2 has this fix already.
Other distros (Ubuntu 20.04.1) may not yet have it, if they haven't updated their kernel or backported the patch.
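A rough way to check whether a node's kernel is at least 5.7 is to compare the major and minor numbers from `uname -r`. This is a sketch only (the function name is made up), and it cannot detect distro backports such as the RHEL 8.2 kernel mentioned above, so a negative result is not conclusive:

```shell
#!/bin/bash
# Succeeds if a kernel release string is 5.7 or newer (upstream fix).
# Note: this cannot detect distro backports, so a "no" is not conclusive.
kernel_at_least_5_7() {
    local release=$1 major minor
    major=${release%%.*}        # text before the first dot
    minor=${release#*.}         # text after the first dot...
    minor=${minor%%[!0-9]*}     # ...cut at the first non-digit
    (( major > 5 || (major == 5 && minor >= 7) ))
}

for r in "$(uname -r)" 3.10.0-1062.9.1.el7.x86_64; do
    if kernel_at_least_5_7 "$r"; then
        echo "$r: 5.7 or newer"
    else
        echo "$r: older than 5.7 (check for a backport)"
    fi
done
```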