Cilium: pod-to-b-multi-node-nodeport connectivity test failing on EKS with 1.8.0-rc3

Created on 16 Jun 2020 · 19 Comments · Source: cilium/cilium

When following the EKS GSG instructions to validate Cilium 1.8.0-rc3 for #11903 (including the fix for #12078), the connectivity check for pod-to-b-multi-node-nodeport is failing:

% kubectl get po
NAME                                                     READY   STATUS    RESTARTS   AGE
echo-a-58dd59998d-fcfsc                                  1/1     Running   0          133m
echo-b-865969889d-7qgfg                                  1/1     Running   0          133m
echo-b-host-659c674bb6-tvzxm                             1/1     Running   0          133m
host-to-b-multi-node-clusterip-6fb94d9df6-v25v4          1/1     Running   0          133m
host-to-b-multi-node-headless-7c4ff79cd-2dgct            1/1     Running   0          133m
pod-to-a-5c8dcf69f7-zq2zj                                1/1     Running   0          133m
pod-to-a-allowed-cnp-75684d58cc-tb5jm                    1/1     Running   0          133m
pod-to-a-external-1111-669ccfb85f-8l2p2                  1/1     Running   0          133m
pod-to-a-l3-denied-cnp-7b8bfcb66c-qg2wc                  1/1     Running   0          133m
pod-to-b-intra-node-74997967f8-c88x9                     1/1     Running   0          133m
pod-to-b-intra-node-nodeport-775f967f47-t426f            1/1     Running   0          133m
pod-to-b-multi-node-clusterip-587678cbc4-xskt6           1/1     Running   0          133m
pod-to-b-multi-node-headless-574d9f5894-xd2jq            1/1     Running   0          133m
pod-to-b-multi-node-nodeport-7944d9f9fc-qpv5r            0/1     Running   0          133m
pod-to-external-fqdn-allow-google-cnp-6dd57bc859-bqhhq   1/1     Running   0          133m

% kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
echo-a                 ClusterIP   10.100.133.146   <none>        80/TCP         136m
echo-b                 NodePort    10.100.21.112    <none>        80:31313/TCP   136m
echo-b-headless        ClusterIP   None             <none>        80/TCP         136m
echo-b-host-headless   ClusterIP   None             <none>        <none>         136m
kubernetes             ClusterIP   10.100.0.1       <none>        443/TCP        147m

% kubectl get ep 
NAME                   ENDPOINTS                               AGE
echo-a                 192.168.108.20:80                       136m
echo-b                 192.168.98.169:80                       136m
echo-b-headless        192.168.98.169:80                       136m
echo-b-host-headless   192.168.16.155                          136m
kubernetes             192.168.148.188:443,192.168.97.61:443   147m

Follow-up for #12078

/cc @brb

kind/bug priority/high

All 19 comments

@tklauser does it pass with Cilium 1.7? I can see this particular check was added after 1.7 (https://github.com/cilium/cilium/commit/3cc04e57aee6ea86b676cbef3e419556cae20999), but I'm guessing it's still meant to work with 1.7, right?

I'm seeing an RST:

-> stack flow 0x1ced2b76 identity 38656->1 state new ifindex 0 orig-ip 0.0.0.0: 192.168.13.11:50938 -> 192.168.21.197:31313 tcp SYN
-> endpoint 436 flow 0x0 identity 1->38656 state reply ifindex 0 orig-ip 192.168.21.197: 192.168.21.197:31313 -> 192.168.13.11:50938 tcp ACK, RST

For my cluster, the reason is the following Service:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2020-04-21T12:36:08Z"
  name: echo-b
  namespace: default
  resourceVersion: "1915"
  selfLink: /api/v1/namespaces/default/services/echo-b
  uid: acb29a3b-83cc-11ea-80a1-025016633d8a
spec:
  clusterIP: 10.100.173.105
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    name: echo-b
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

It should be of type NodePort, but it is not.

I'm able to reproduce this. It happens with both kube-proxy and BPF NodePort.

The packet gets into the remote pod and the reply makes it out of the pod:

20:39:49.193719 IP 192.168.21.197.44966 > 192.168.126.96.80: Flags [S], seq 1973508466, win 26883, options [mss 8961,sackOK,TS val 509552408 ecr 0,nop,wscale 7], length 0
20:39:49.193750 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193777408 ecr 509552408,nop,wscale 7], length 0
20:39:50.196403 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193778410 ecr 509552408,nop,wscale 7], length 0
20:39:50.197370 IP 192.168.21.197.44966 > 192.168.126.96.80: Flags [S], seq 1973508466, win 26883, options [mss 8961,sackOK,TS val 509553411 ecr 0,nop,wscale 7], length 0
20:39:50.197391 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193778411 ecr 509552408,nop,wscale 7], length 0
20:39:52.212402 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193780426 ecr 509552408,nop,wscale 7], length 0
20:39:52.213415 IP 192.168.21.197.44966 > 192.168.126.96.80: Flags [S], seq 1973508466, win 26883, options [mss 8961,sackOK,TS val 509555427 ecr 0,nop,wscale 7], length 0
20:39:52.213435 IP 192.168.126.96.80 > 192.168.21.197.44966: Flags [S.], seq 4134654191, ack 1973508467, win 26847, options [mss 8961,sackOK,TS val 193780427 ecr 509552408,nop,wscale 7], length 0

The packet doesn't make it back out of eth0 so it is dropped in the stack of the remote node.

20:41:27.919172 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509651132 ecr 0,nop,wscale 7], length 0
20:41:28.949256 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509652162 ecr 0,nop,wscale 7], length 0
20:41:30.965266 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509654178 ecr 0,nop,wscale 7], length 0
20:41:35.221258 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509658434 ecr 0,nop,wscale 7], length 0
20:41:43.413259 IP 192.168.34.149.45600 > 192.168.21.197.31313: Flags [S], seq 2095917542, win 26883, options [mss 8961,sackOK,TS val 509666626 ecr 0,nop,wscale 7], length 0

@errordeveloper checked with v1.7.5:

NAME                                                     READY   STATUS    RESTARTS   AGE
echo-a-58dd59998d-nd6x4                                  1/1     Running   0          91s
echo-b-865969889d-dvkcz                                  1/1     Running   0          90s
echo-b-host-659c674bb6-9dj58                             1/1     Running   0          89s
host-to-b-multi-node-clusterip-6fb94d9df6-9fxfx          1/1     Running   0          88s
host-to-b-multi-node-headless-7c4ff79cd-xrrw8            1/1     Running   0          88s
pod-to-a-5c8dcf69f7-2xn2w                                1/1     Running   0          85s
pod-to-a-allowed-cnp-75684d58cc-bfjc8                    1/1     Running   0          88s
pod-to-a-external-1111-669ccfb85f-jlwsw                  1/1     Running   0          83s
pod-to-a-l3-denied-cnp-7b8bfcb66c-nfcd2                  1/1     Running   0          86s
pod-to-b-intra-node-74997967f8-4r4r8                     1/1     Running   0          84s
pod-to-b-intra-node-nodeport-775f967f47-gcznq            0/1     Running   1          85s
pod-to-b-multi-node-clusterip-587678cbc4-c5xdn           1/1     Running   0          84s
pod-to-b-multi-node-headless-574d9f5894-5h8zd            1/1     Running   0          84s
pod-to-b-multi-node-nodeport-7944d9f9fc-dcrs5            0/1     Running   1          83s
pod-to-external-fqdn-allow-google-cnp-6dd57bc859-d7t47   1/1     Running   0          82s

It looks like the failure also occurs with v1.7.5, and in addition pod-to-b-intra-node-nodeport fails.

% kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
echo-a                 ClusterIP   10.100.171.245   <none>        80/TCP    2m34s
echo-b                 ClusterIP   10.100.59.26     <none>        80/TCP    2m33s
echo-b-headless        ClusterIP   None             <none>        80/TCP    2m32s
echo-b-host-headless   ClusterIP   None             <none>        <none>    2m31s
kubernetes             ClusterIP   10.100.0.1       <none>        443/TCP   3h17m

% kubectl get ep 
NAME                   ENDPOINTS                                 AGE
echo-a                 192.168.34.45:80                          3m7s
echo-b                 192.168.39.74:80                          3m6s
echo-b-headless        192.168.39.74:80                          3m5s
echo-b-host-headless   192.168.49.116                            3m4s
kubernetes             192.168.148.197:443,192.168.175.208:443   3h18m

% kubectl describe pod pod-to-b-multi-node-nodeport-7944d9f9fc-dcrs5
[...]
  Warning  Unhealthy  113s (x6 over 3m13s)   kubelet, ip-192-168-14-188.us-west-2.compute.internal  Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused
  Normal   Killing    113s (x2 over 2m53s)   kubelet, ip-192-168-14-188.us-west-2.compute.internal  Container pod-to-b-multi-node-nodeport failed liveness probe, will be restarted
  Warning  Unhealthy  95s (x11 over 3m15s)   kubelet, ip-192-168-14-188.us-west-2.compute.internal  Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused

% kubectl describe pod pod-to-b-intra-node-nodeport-775f967f47-gcznq
[...]
  Warning  Unhealthy  2m38s (x6 over 3m58s)  kubelet, ip-192-168-49-116.us-west-2.compute.internal  Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused
  Normal   Killing    2m38s (x2 over 3m38s)  kubelet, ip-192-168-49-116.us-west-2.compute.internal  Container pod-to-b-intra-node-hostport failed liveness probe, will be restarted
  Warning  Unhealthy  2m35s (x10 over 4m5s)  kubelet, ip-192-168-49-116.us-west-2.compute.internal  Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 31313: Connection refused

I found the cause: the routing is not symmetric.
The SYN is received on eth0:

21:44:37.544583 IP 192.168.34.149.41456 > 192.168.21.197.31313: Flags [S], seq 1836691167, win 26883, options [mss 8961,sackOK,TS val 513440720 ecr 0,nop,wscale 7], length 0

The SYN-ACK is sent out over eth2:

21:44:37.544706 IP 192.168.21.197.31313 > 192.168.34.149.41456: Flags [S.], seq 520985781, ack 1836691168, win 26847, options [mss 8961,sackOK,TS val 197665741 ecr 513440720,nop,wscale 7], length 0

@tklauser I think it would make sense to double-check without Cilium, i.e. with just the built-in CNI. It would be good to eliminate the possibility of issues in the VPC configuration that eksctl implements.

The conntrack entry looks like this:

ipv4     2 tcp      6 59 SYN_RECV src=192.168.34.149 dst=192.168.21.197 sport=47458 dport=31313 src=192.168.126.96 dst=192.168.21.197 sport=80 dport=47458 mark=128 zone=0 use=2

The mark indicates that the 0x80 has been set via this rule:

-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
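
To make the mark arithmetic concrete, here is a minimal Python sketch of how the two CONNMARK actions and the fwmark selector operate on bits, following the formulas in the iptables-extensions man page. The function names are made up for illustration and are not part of any tool; it only shows that mark=128 in the conntrack entry is the same bit as 0x80.

```python
MASK32 = 0xFFFFFFFF  # marks are 32-bit values


def set_xmark(ctmark: int, value: int, mask: int) -> int:
    """CONNMARK --set-xmark value/mask: clear the mask bits, then XOR in value."""
    return ((ctmark & ~mask) ^ value) & MASK32


def restore_mark(nfmark: int, ctmark: int, nfmask: int, ctmask: int) -> int:
    """CONNMARK --restore-mark: nfmark = (nfmark & ~nfmask) ^ (ctmark & ctmask)."""
    return ((nfmark & ~nfmask) ^ (ctmark & ctmask)) & MASK32


def fwmark_matches(nfmark: int, value: int, mask: int) -> bool:
    """ip rule 'fwmark value/mask': compare only the bits selected by mask."""
    return (nfmark & mask) == (value & mask)


# First packet of the connection arrives on eth0: the connection gets marked.
ctmark = set_xmark(0, 0x80, 0x80)
# Reply packet arrives from the pod interface: the mark is copied to the packet.
nfmark = restore_mark(0, ctmark, 0x80, 0x80)

print(ctmark, hex(ctmark))                 # 128 0x80
print(fwmark_matches(nfmark, 0x80, 0x80))  # True
```
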

For an as yet unknown reason, the following ip rule matches:

110:    from 192.168.126.96 to 192.168.0.0/16 lookup 27

It is meant to steer all packets from the pod, which uses an ENI IP of eth2, out of eth2 based on the source address.

It's not clear why this rule matches despite the reverse SNAT leading to a source IP of 192.168.21.197.

Removing the above rule (matching the IP of the pod behind the NodePort) fixes the problem.

This does not seem to be a regression. The logic was already the same for 1.7. I'm removing the release blocker flag.

Fixing this will require identifying NodePort traffic, marking it, and routing it out of the same interface it came in on.

@tgraf Have you checked whether BPF NodePort is affected when running without kube-proxy?

Both are affected. The issue is in the routing of the reply. It doesn't matter who translates it back and forth.

Other ENI CNIs are using a rule like this:

-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80

and then use an ip rule to force replies out of that interface.

OK, I see. The problem happens on a _remote_ node in which BPF NodePort cannot bypass the stack.

I could reproduce the issue by following the GSG for AWS-EKS. I could also apply the manual fix from Thomas, with an additional step for the return-path filter on eth0.

The ENI CNI in the aws-node daemonset sets the following rules and configuration on the node:

$ ip rule | grep -w 0x80
1024:   from all fwmark 0x80/0x80 lookup main 

$ iptables-save | grep -w 0x80
-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
-A PREROUTING -i eni+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80

$ sysctl net.ipv4.conf.eth0.rp_filter
2

Since the GSG instructs to remove the daemonset before deploying Cilium and creating the node, this configuration is not used, and we have instead:

$ ip rule | grep -w 0x80
[null]

$ iptables-save | grep -w 0x80
[null]

$ ip rule | grep 'from 192.168.126.96'
110:    from 192.168.126.96 to 192.168.0.0/16 lookup 3
$ ip route show table 3
default via 192.168.64.1 dev eth1 
192.168.64.1 dev eth1 scope link

$ sysctl net.ipv4.conf.eth0.rp_filter
1

Cilium sets net.ipv4.conf.all.rp_filter to 0, but the maximum value in conf/{all,interface}/rp_filter is used when doing source validation on an {interface}, so in our case rp_filter is in strict mode on eth0. This prevents the packets received from the first node on eth0 from being (SNAT-ed and) forwarded to the pod. Instead, they are dropped by the host and no SYN/ACK is emitted back. Disabling rp_filter or setting it to _loose mode_ fixes that, but the SYN/ACKs are then not sent to the correct destination.
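
The "maximum value wins" behaviour can be sketched in a few lines of Python. This is only an illustration of the rule stated above, using the sysctl values shown in this comment; the function names are hypothetical and nothing here reads the live system.

```python
# Meaning of the rp_filter values, per the kernel's ip-sysctl documentation.
RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}


def effective_rp_filter(conf_all: int, conf_iface: int) -> int:
    """The kernel uses max(conf/all, conf/<iface>) for source validation."""
    return max(conf_all, conf_iface)


def describe(conf_all: int, conf_iface: int) -> str:
    """Human-readable mode that applies on the interface."""
    return RP_FILTER_MODES[effective_rp_filter(conf_all, conf_iface)]


# Values observed with Cilium: all=0, eth0=1.
print(describe(0, 1))  # strict
# After `sysctl -w net.ipv4.conf.eth0.rp_filter=2`, loose mode applies,
# whatever the value of conf/all is.
print(describe(0, 2))  # loose
```
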

This is due to the ip rule that matches those packets: it tells the host to do a FIB lookup in table 3 (associated with the interface at index 3, eth1 in my case) instead of in the main table, as it should. This is why we need to mark the packets and look up the main table when the mark is found.
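
The effect of the mark-based rule on table selection can be modelled with a simplified Python sketch. It is a toy model, not the kernel's algorithm: it returns the table of the first matching rule in priority order and ignores fall-through to later rules when a table has no route. The priority of 100 for the fwmark rule is a hypothetical value chosen so that it is evaluated before rule 110 (the thread does not show which priority the rule ends up with), and the reply addresses are illustrative, taken from captures earlier in the thread under the assumption that the pod IP is still the source address at routing time.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int
    table: str
    src: Optional[str] = None     # "from" selector; None means "all"
    dst: Optional[str] = None     # "to" selector; None means "all"
    fwmark: Optional[int] = None  # mark value; None means "any mark"
    fwmask: int = 0xFFFFFFFF

    def matches(self, src: str, dst: str, mark: int) -> bool:
        if self.src is not None and ip_address(src) not in ip_network(self.src):
            return False
        if self.dst is not None and ip_address(dst) not in ip_network(self.dst):
            return False
        if self.fwmark is not None and (mark & self.fwmask) != (self.fwmark & self.fwmask):
            return False
        return True


def select_table(rules, src: str, dst: str, mark: int = 0) -> str:
    """Return the table of the first matching rule, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(src, dst, mark):
            return rule.table
    raise LookupError("no rule matched")


# Rules as seen on the node without the aws-node configuration.
cilium_rules = [
    Rule(110, "3", src="192.168.126.96", dst="192.168.0.0/16"),
    Rule(32766, "main"),
]
# Same rules plus the fwmark rule (priority 100 is an assumption).
with_mark_rule = cilium_rules + [Rule(100, "main", fwmark=0x80, fwmask=0x80)]

reply = {"src": "192.168.126.96", "dst": "192.168.34.149"}
print(select_table(cilium_rules, mark=0x80, **reply))    # 3
print(select_table(with_mark_rule, mark=0x80, **reply))  # main
print(select_table(with_mark_rule, mark=0, **reply))     # 3
```

Unmarked pod traffic keeps using the per-ENI table, while replies belonging to connections that entered through eth0 are looked up in the main table.
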

I used the following commands to restore the rules and get pod-to-b-multi-node-nodeport to become ready:

# sysctl -w net.ipv4.conf.eth0.rp_filter=2
# iptables -t mangle -A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
# iptables -t mangle -A PREROUTING -i lxc+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# ip rule add fwmark 0x80/0x80 lookup main

I'm working on a fix to have Cilium reproduce this configuration on AWS.

We likely missed this when validating the GSG for 1.7 because pod-to-b-multi-node-nodeport (and pod-to-b-intra-node-nodeport, which fails with v1.7.5) did not exist at the time.

This is now fixed, but longer term some follow-up would be desirable. Please also refer to this comment on the PR.
