When following the EKS GSG instructions to validate Cilium 1.8.0-rc3 for #11903 (including the fix for #12078), the connectivity check for pod-to-b-multi-node-nodeport is failing:
% kubectl get po
NAME READY STATUS RESTARTS AGE
echo-a-58dd59998d-fcfsc 1/1 Running 0 133m
echo-b-865969889d-7qgfg 1/1 Running 0 133m
echo-b-host-659c674bb6-tvzxm 1/1 Running 0 133m
host-to-b-multi-node-clusterip-6fb94d9df6-v25v4 1/1 Running 0 133m
host-to-b-multi-node-headless-7c4ff79cd-2dgct 1/1 Running 0 133m
pod-to-a-5c8dcf69f7-zq2zj 1/1 Running 0 133m
pod-to-a-allowed-cnp-75684d58cc-tb5jm 1/1 Running 0 133m
pod-to-a-external-1111-669ccfb85f-8l2p2 1/1 Running 0 133m
pod-to-a-l3-denied-cnp-7b8bfcb66c-qg2wc 1/1 Running 0 133m
pod-to-b-intra-node-74997967f8-c88x9 1/1 Running 0 133m
pod-to-b-intra-node-nodeport-775f967f47-t426f 1/1 Running 0 133m
pod-to-b-multi-node-clusterip-587678cbc4-xskt6 1/1 Running 0 133m
pod-to-b-multi-node-headless-574d9f5894-xd2jq 1/1 Running 0 133m
pod-to-b-multi-node-nodeport-7944d9f9fc-qpv5r 0/1 Running 0 133m
pod-to-external-fqdn-allow-google-cnp-6dd57bc859-bqhhq 1/1 Running 0 133m
% kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
echo-a ClusterIP 10.100.133.146 <none> 80/TCP 136m
echo-b NodePort 10.100.21.112 <none> 80:31313/TCP 136m
echo-b-headless ClusterIP None <none> 80/TCP 136m
echo-b-host-headless ClusterIP None <none> <none> 136m
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 147m
% kubectl get ep
NAME ENDPOINTS AGE
echo-a 192.168.108.20:80 136m
echo-b 192.168.98.169:80 136m
echo-b-headless 192.168.98.169:80 136m
echo-b-host-headless 192.168.16.155 136m
kubernetes 192.168.148.188:443,192.168.97.61:443 147m
Follow-up for #12078
/cc @brb
@tklauser does it pass with Cilium 1.7? I can see this particular check was added after 1.7 (https://github.com/cilium/cilium/commit/3cc04e57aee6ea86b676cbef3e419556cae20999), but I'm guessing it's still meant to work with 1.7, right?
I'm seeing an RST:
-> stack flow 0x1ced2b76 identity 38656->1 state new ifindex 0 orig-ip 0.0.0.0: 192.168.13.11:50938 -> 192.168.21.197:31313 tcp SYN
-> endpoint 436 flow 0x0 identity 1->38656 state reply ifindex 0 orig-ip 192.168.21.197: 192.168.21.197:31313 -> 192.168.13.11:50938 tcp ACK, RST
In my cluster, the cause is the following Service:
apiVersion: v1
kind: Service
metadata:
creationTimestamp: "2020-04-21T12:36:08Z"
name: echo-b
namespace: default
resourceVersion: "1915"
selfLink: /api/v1/namespaces/default/services/echo-b
uid: acb29a3b-83cc-11ea-80a1-025016633d8a
spec:
clusterIP: 10.100.173.105
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
name: echo-b
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
It should be of type NodePort, but it is ClusterIP.
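To check the Service type quickly, the `kubectl get svc` text output can be parsed. This is only a minimal sketch, assuming the default whitespace-separated columns; `service_types` is a hypothetical helper and the sample rows are copied from the output at the top of this issue.

```python
def service_types(kubectl_output: str) -> dict:
    """Map service name -> TYPE column from `kubectl get svc` text output."""
    types = {}
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        types[fields[0]] = fields[1]
    return types


sample = """\
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
echo-a ClusterIP 10.100.133.146 <none> 80/TCP 136m
echo-b NodePort 10.100.21.112 <none> 80:31313/TCP 136m
"""

# The nodeport checks only make sense if echo-b is a NodePort service.
print(service_types(sample)["echo-b"])
```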
I'm able to reproduce this. It happens with both kube-proxy and BPF NodePort.
The packet gets into the remote pod and the reply makes it out of the pod:
20:39:49.193719 IP 192.168.21.197.44966 > 192.168.126.96.80: Flags [S], seq 1973508466, win 26883, options [mss 8961,sackOK,TS val 509552408 ecr 0,nop,wscale 7], length 0
20:39:49.193750 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193777408 ecr 509552408,nop,wscale 7], length 0
20:39:50.196403 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193778410 ecr 509552408,nop,wscale 7], length 0
20:39:50.197370 IP 192.168.21.197.44966 > 192.168.126.96.80: Flags [S], seq 1973508466, win 26883, options [mss 8961,sackOK,TS val 509553411 ecr 0,nop,wscale 7], length 0
20:39:50.197391 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193778411 ecr 509552408,nop,wscale 7], length 0
20:39:52.212402 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193780426 ecr 509552408,nop,wscale 7], length 0
20:39:52.213415 IP 192.168.21.197.44966 > 192.168.126.96.80: Flags [S], seq 1973508466, win 26883, options [mss 8961,sackOK,TS val 509555427 ecr 0,nop,wscale 7], length 0
20:39:52.213435 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193780427 ecr 509552408,nop,wscale 7], length 0
The reply doesn't make it back out of eth0, so it is dropped in the stack of the remote node:
20:41:27.919172 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509651132 ecr 0,nop,wscale 7], length 0
20:41:28.949256 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509652162 ecr 0,nop,wscale 7], length 0
20:41:30.965266 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509654178 ecr 0,nop,wscale 7], length 0
20:41:35.221258 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509658434 ecr 0,nop,wscale 7], length 0
20:41:43.413259 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509666626 ecr 0,nop,wscale 7], length 0
@errordeveloper checked with v1.7.5:
NAME READY STATUS RESTARTS AGE
echo-a-58dd59998d-nd6x4 1/1 Running 0 91s
echo-b-865969889d-dvkcz 1/1 Running 0 90s
echo-b-host-659c674bb6-9dj58 1/1 Running 0 89s
host-to-b-multi-node-clusterip-6fb94d9df6-9fxfx 1/1 Running 0 88s
host-to-b-multi-node-headless-7c4ff79cd-xrrw8 1/1 Running 0 88s
pod-to-a-5c8dcf69f7-2xn2w 1/1 Running 0 85s
pod-to-a-allowed-cnp-75684d58cc-bfjc8 1/1 Running 0 88s
pod-to-a-external-1111-669ccfb85f-jlwsw 1/1 Running 0 83s
pod-to-a-l3-denied-cnp-7b8bfcb66c-nfcd2 1/1 Running 0 86s
pod-to-b-intra-node-74997967f8-4r4r8 1/1 Running 0 84s
pod-to-b-intra-node-nodeport-775f967f47-gcznq 0/1 Running 1 85s
pod-to-b-multi-node-clusterip-587678cbc4-c5xdn 1/1 Running 0 84s
pod-to-b-multi-node-headless-574d9f5894-5h8zd 1/1 Running 0 84s
pod-to-b-multi-node-nodeport-7944d9f9fc-dcrs5 0/1 Running 1 83s
pod-to-external-fqdn-allow-google-cnp-6dd57bc859-d7t47 1/1 Running 0 82s
It looks like this also occurs with v1.7.5, and in addition pod-to-b-intra-node-nodeport fails.
% kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
echo-a ClusterIP 10.100.171.245 <none> 80/TCP 2m34s
echo-b ClusterIP 10.100.59.26 <none> 80/TCP 2m33s
echo-b-headless ClusterIP None <none> 80/TCP 2m32s
echo-b-host-headless ClusterIP None <none> <none> 2m31s
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 3h17m
% kubectl get ep
NAME ENDPOINTS AGE
echo-a 192.168.34.45:80 3m7s
echo-b 192.168.39.74:80 3m6s
echo-b-headless 192.168.39.74:80 3m5s
echo-b-host-headless 192.168.49.116 3m4s
kubernetes 192.168.148.197:443,192.168.175.208:443 3h18m
% kubectl describe pod pod-to-b-multi-node-nodeport-7944d9f9fc-dcrs5
[...]
Warning Unhealthy 113s (x6 over 3m13s) kubelet, ip-192-168-14-188.us-west-2.compute.internal Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused
Normal Killing 113s (x2 over 2m53s) kubelet, ip-192-168-14-188.us-west-2.compute.internal Container pod-to-b-multi-node-nodeport failed liveness probe, will be restarted
Warning Unhealthy 95s (x11 over 3m15s) kubelet, ip-192-168-14-188.us-west-2.compute.internal Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused
% kubectl describe pod pod-to-b-intra-node-nodeport-775f967f47-gcznq
[...]
Warning Unhealthy 2m38s (x6 over 3m58s) kubelet, ip-192-168-49-116.us-west-2.compute.internal Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused
Normal Killing 2m38s (x2 over 3m38s) kubelet, ip-192-168-49-116.us-west-2.compute.internal Container pod-to-b-intra-node-hostport failed liveness probe, will be restarted
Warning Unhealthy 2m35s (x10 over 4m5s) kubelet, ip-192-168-49-116.us-west-2.compute.internal Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused
I found the cause: the routing is not symmetric.
The SYN is received on eth0:
21:44:37.544583 IP 192.168.34.149.41456 > 192.168.21.197.31313: Flags [S], seq 1836691167, win 26883, options [mss 8961,sackOK,TS val 513440720 ecr 0,nop,wscale 7], length 0
The SYN-ACK is sent out over eth2:
21:44:37.544706 IP 192.168.21.197.31313 > 192.168.34.149.41456: Flags [S.], seq 520985781, ack 1836691168, win 26847, options [mss 8961,sackOK,TS val 197665741 ecr 513440720,nop,wscale 7], length 0
@tklauser I think it would make sense to double-check without Cilium, i.e. with just the built-in CNI. It would be good to eliminate the possibility of issues in the VPC configuration that eksctl implements.
The conntrack entry looks like this:
ipv4 2 tcp 6 59 SYN_RECV src=192.168.34.149 dst=192.168.21.197 sport=47458 dport=31313 src=192.168.126.96 dst=192.168.21.197 sport=80 dport=47458 mark=128 zone=0 use=2
The mark (128 = 0x80) indicates that 0x80 has been set via this rule:
-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
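To confirm that reading of the mark, here is a small sketch that extracts the decimal `mark=` field from the conntrack entry above and tests the 0x80 bit. `conntrack_mark` and the constant name are hypothetical, made up for illustration.

```python
import re

# Bit set by the "AWS, primary ENI" CONNMARK rule (--set-xmark 0x80/0x80).
PRIMARY_ENI_MARK = 0x80


def conntrack_mark(entry: str) -> int:
    """Return the decimal mark= field of a conntrack entry as an int."""
    match = re.search(r"\bmark=(\d+)\b", entry)
    if match is None:
        raise ValueError("no mark field in conntrack entry")
    return int(match.group(1))


entry = (
    "ipv4 2 tcp 6 59 SYN_RECV src=192.168.34.149 dst=192.168.21.197 "
    "sport=47458 dport=31313 src=192.168.126.96 dst=192.168.21.197 "
    "sport=80 dport=47458 mark=128 zone=0 use=2"
)

mark = conntrack_mark(entry)
# True when the connection was marked on ingress via the primary ENI.
print(hex(mark), bool(mark & PRIMARY_ENI_MARK))
```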
For an as yet unknown reason, the following ip rule matches:
110: from 192.168.126.96 to 192.168.0.0/16 lookup 27
It is meant to direct all packets from the pod, which uses an ENI IP of eth2, out of eth2 based on the source address.
It's not clear why this rule matches despite the reverse SNAT leading to a source IP of 192.168.21.197.
Removing the above rule (matching the IP of the pod behind the NodePort) fixes the problem.
This does not seem to be a regression. The logic was already the same for 1.7. I'm removing the release blocker flag.
Fixing this will require marking NodePort traffic and routing the replies out of the same interface the traffic came in on.
@tgraf Have you checked whether BPF NodePort is affected when running without kube-proxy?
Both are affected. The issue is in the routing of the reply. It doesn't matter who translates it back and forth.
Other ENI CNIs are using a rule like this:
-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
and then use an ip rule to force replies out of that interface.
OK, I see. The problem happens on a _remote_ node in which BPF NodePort cannot bypass the stack.
I could reproduce the issue by following the GSG for AWS-EKS. I could also apply the manual fix from Thomas, with an additional step for the return-path filter on eth0.
The ENI CNI in the aws-node daemonset sets the following rules and configuration on the node:
$ ip rule | grep -w 0x80
1024: from all fwmark 0x80/0x80 lookup main
$ iptables-save | grep -w 0x80
-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
-A PREROUTING -i eni+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
$ sysctl net.ipv4.conf.eth0.rp_filter
2
Since the GSG instructs users to remove the daemonset before deploying Cilium and creating the nodes, this configuration is not applied, and we have instead:
$ ip rule | grep -w 0x80
[null]
$ iptables-save | grep -w 0x80
[null]
$ ip rule | grep 'from 192.168.126.96'
110: from 192.168.126.96 to 192.168.0.0/16 lookup 3
$ ip route show table 3
default via 192.168.64.1 dev eth1
192.168.64.1 dev eth1 scope link
$ sysctl net.ipv4.conf.eth0.rp_filter
1
Cilium sets net.ipv4.conf.all.rp_filter to 0, but the maximum value in conf/{all,interface}/rp_filter is used when doing source validation on an {interface}, so in our case rp_filter is in strict mode on eth0. This prevents the packets received from the first node on eth0 from being (SNAT-ed and) forwarded to the pod. Instead they are dropped by the host and no SYN/ACK is emitted back. Disabling rp_filter or setting it to _loose mode_ fixes that, but the SYN/ACKs are still not sent to the correct destination.
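The "maximum value wins" behaviour can be illustrated with a tiny sketch. `effective_rp_filter` is a hypothetical helper, and the mode names follow the kernel's documented meaning of the values (0 = no source validation, 1 = strict, 2 = loose).

```python
RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}


def effective_rp_filter(conf_all: int, conf_iface: int) -> int:
    """Value actually used for source validation on an interface:
    the maximum of conf/all/rp_filter and conf/<iface>/rp_filter."""
    return max(conf_all, conf_iface)


# Values observed in this issue: all=0 (set by Cilium), eth0=1.
before = effective_rp_filter(conf_all=0, conf_iface=1)
# After `sysctl -w net.ipv4.conf.eth0.rp_filter=2`.
after = effective_rp_filter(conf_all=0, conf_iface=2)

print(RP_FILTER_MODES[before], RP_FILTER_MODES[after])
```

This is why setting only conf.all.rp_filter to 0 does not turn off strict mode on eth0.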
This is due to the ip rule that is matched for those packets: it tells the host to do a FIB lookup in table 3 (associated with the interface at index 3, eth1 in my case) and not in the main table, as should be the case. This is where we need to mark the packets and look up the main table when the mark is found.
I used the following commands to restore the rules and get pod-to-b-multi-node-nodeport to become ready:
# sysctl -w net.ipv4.conf.eth0.rp_filter=2
# iptables -t mangle -A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
# iptables -t mangle -A PREROUTING -i lxc+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# ip rule add fwmark 0x80/0x80 lookup main
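To show why the fwmark rule helps, here is a toy model of policy-routing rule selection. It is a sketch, not the kernel's implementation, and it rests on several assumptions: rules are tried in ascending priority order and the first match wins; the reply is routed while it still carries the pod source address (as the matching rule 110 suggests); and the fwmark rule ends up ahead of rule 110. The priority 109 below is made up, since the command above does not specify one. All names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    priority: int
    table: str
    src: Optional[str] = None   # "from" prefix, None matches any source
    dst: Optional[str] = None   # "to" prefix, None matches any destination
    fwmark: Optional[int] = None
    fwmask: int = 0xFFFFFFFF

    def matches(self, src_ip: str, dst_ip: str, mark: int) -> bool:
        if self.src and ipaddress.ip_address(src_ip) not in ipaddress.ip_network(self.src):
            return False
        if self.dst and ipaddress.ip_address(dst_ip) not in ipaddress.ip_network(self.dst):
            return False
        # Only the bits covered by the mask are compared.
        if self.fwmark is not None and ((mark ^ self.fwmark) & self.fwmask) != 0:
            return False
        return True


def select_table(rules: List[Rule], src_ip: str, dst_ip: str, mark: int = 0) -> Optional[str]:
    """Return the table of the first matching rule, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(src_ip, dst_ip, mark):
            return rule.table
    return None


# Rules seen on the node: per-pod ENI rule plus the default main lookup.
broken = [
    Rule(priority=110, src="192.168.126.96", dst="192.168.0.0/16", table="3"),
    Rule(priority=32766, table="main"),
]
# Same rules plus `ip rule add fwmark 0x80/0x80 lookup main` (priority assumed).
fixed = broken + [Rule(priority=109, fwmark=0x80, fwmask=0x80, table="main")]

pod, client = "192.168.126.96", "192.168.34.149"
print(select_table(broken, pod, client, mark=0x80))  # reply leaves via the ENI table
print(select_table(fixed, pod, client, mark=0x80))   # marked reply uses main
print(select_table(fixed, pod, client, mark=0))      # unmarked pod traffic is unchanged
```

In this model only connections marked on ingress via eth0 are steered to the main table; ordinary pod egress still follows the per-ENI table.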
I'm working on a fix to have Cilium reproduce this configuration on AWS.
We likely missed this when validating the GSG for 1.7 because pod-to-b-multi-node-nodeport (or pod-to-b-intra-node-nodeport, which fails with v1.7.5) did not exist at the time.
This is now fixed, but longer term some follow-up would be desirable. Please also refer to this comment on the PR.