Cilium: kube-proxy replacement: LoadBalancer traffic fails from host back to same host

Created on 27 May 2020 · 38 Comments · Source: cilium/cilium

Bug report

General Information

  • Cilium version (run cilium version)
Client: 1.7.4 c7ee6d62b 2020-05-15T16:07:35+02:00 go version go1.13.10 linux/amd64
Daemon: 1.7.4 c7ee6d62b 2020-05-15T16:07:35+02:00 go version go1.13.10 linux/amd64
  • Kernel version (run uname -a)
Linux test03.lan 5.6.2-1.el7.elrepo.x86_64 #1 SMP Thu Apr 2 10:55:54 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, Mesos, ...)
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
  • Link to relevant artifacts (policies, deployments scripts, ...)
# automatically restarts pods to ensure controlled by new CNI driver
operator:
  enabled: true

global:
  k8sServiceHost: "127.0.0.1"
  k8sServicePort: "6443"
  enableXTSocketFallback: false

  prometheus:
    enabled: true

  bpf:
    preallocateMaps: true

  # disabling not ready for primetime yet
  # https://github.com/cilium/cilium/projects/93#column-7748410
  installIptablesRules: true

  # https://docs.cilium.io/en/latest/architecture/#arch-guide
  # https://cilium.io/blog/2019/02/12/cilium-14/#sockmap-bpf-based-sidecar-acceleration-alpha
  # https://www.youtube.com/watch?v=ER9eIXL2_14
  sockops:
    enabled: true

  k8s:
    # cilium pods will not start on node until pod CIDR has been assigned
    requireIPv4PodCIDR: true

  # eliminates need for any kind of BGP stuff
  # automatically adds routes to each node
  autoDirectNodeRoutes: true

  tunnel: disabled
  kubeProxyReplacement: strict
  hostServices:
    enabled: true
  nodePort:
    enabled: true
    # dsr or snat
    #mode: dsr
    mode: snat
  externalIPs:
    enabled: true

  # dev purposes only
  cleanState: false
  cleanBpfState: true
  • Upload a system dump (run curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip && python cilium-sysdump.zip and then attach the generated zip file)

How to reproduce the issue

I'm using Cilium with MetalLB and the kube-proxy replacement. I've got a pretty big matrix of scenarios I'm testing and most of them work, but we've found a situation where certain traffic fails to be handled. I believe I can distill the issue down to this: when traffic leaves N1 (Node 1) and comes back to N1P (a pod running on Node 1) without SNAT involved, it fails.

I've tried this with both dsr mode (my intended target) and snat mode (less interested, but wanted to try it out). Both fail under the above circumstances. Here's a pretty crude representation of what I think the traffic flows are, and what works and what fails:

GW = gateway
R = router
NX = node X
NXP = pod running on node X

dsr mode

# service with Cluster externalTrafficPolicy
N1 -> GW -> R -> N2 -> N1P: fail
N1 -> GW -> R -> N1 -> N1P: fail
N2 -> GW -> R -> N1 -> N1P: success

# service with Local externalTrafficPolicy
N2 -> GW -> R -> N1 -> N1P: success
N1 -> GW -> R -> N1 -> N1P: fail

snat mode

# service with Cluster externalTrafficPolicy
N1 -> GW -> R -> N2 -> N1P: success
N1 -> GW -> R -> N1 -> N1P: fail

# service with Local externalTrafficPolicy
N2 -> GW -> R -> N1 -> N1P: success
N1 -> GW -> R -> N1 -> N1P: fail
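The hypothesis stated above can be checked mechanically against the reported matrix. The following is only a minimal sketch, not Cilium code: `Flow`, `predicted_to_fail` and `mismatches` are hypothetical names, and the model assumes that SNAT is applied only in snat mode when the node receiving the traffic has to forward it to a pod on a different node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    mode: str      # "dsr" or "snat"
    source: str    # node that opens the connection
    ingress: str   # node the router hands the packet to
    backend: str   # node running the pod
    observed: str  # "success" or "fail", copied from the matrix above


def predicted_to_fail(flow: Flow) -> bool:
    """Hypothesis: traffic that leaves a node and returns to a pod on
    that same node fails unless SNAT was applied on the way."""
    # Assumption (not confirmed by the thread): SNAT only happens in snat
    # mode when the ingress node forwards to a backend on another node.
    snat_applied = flow.mode == "snat" and flow.ingress != flow.backend
    return flow.source == flow.backend and not snat_applied


# Distinct rows of the matrix; Cluster and Local policy rows that share
# the same path reported the same result, so they are listed once.
REPORTED = [
    Flow("dsr", "N1", "N2", "N1", "fail"),
    Flow("dsr", "N1", "N1", "N1", "fail"),
    Flow("dsr", "N2", "N1", "N1", "success"),
    Flow("snat", "N1", "N2", "N1", "success"),
    Flow("snat", "N1", "N1", "N1", "fail"),
    Flow("snat", "N2", "N1", "N1", "success"),
]


def mismatches(flows):
    """Return the flows whose observed result contradicts the hypothesis."""
    return [f for f in flows if predicted_to_fail(f) != (f.observed == "fail")]


if __name__ == "__main__":
    # An empty list means the hypothesis is consistent with every row.
    print(mismatches(REPORTED))
```

Under these assumptions every reported row agrees with the hypothesis, which supports it but does not prove it.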
kind/bug kind/community-report

All 38 comments

  • What IP + port do you use to access a LB svc?
  • Is that IP + port in cilium bpf lb list on N1?
  • Can you paste cilium status output?

@brb this is literally the exact same cluster I previously gave you access to with the exact same mysql service. I think you still have access..if not I can reopen the port forward.

kubectl -n kube-system exec cilium-qqstk -- cilium bpf lb list
...
192.168.58.0:3306    10.42.0.215:3306 (2)           
                     0.0.0.0:0 (2) [ExternalIPs]    
...
kubectl -n kube-system exec cilium-qqstk -- cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.17 (v1.17.4) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Strict   [NodePort (DSR, 30000-32767), ExternalIPs, HostReachableServices (TCP, UDP)]
Cilium:                 Ok       OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok   
IPAM:                   IPv4: 8/255 allocated from 10.42.0.0/24, 
Controller Status:      31/31 healthy
Proxy Status:           OK, ip 10.42.0.40, 0 redirects active on ports 10000-20000
Cluster health:   5/5 reachable   (2020-05-28T07:22:37Z)
kubectl get pods -A -o wide | grep test01
default          mysql-768bcb9b84-x6znx                   1/1     Running     0          15h     10.42.0.215   test01.lan   <none>           <none>
kube-system      cilium-qqstk                             1/1     Running     0          14h     172.29.2.21   test01.lan   <none>           <none>
kubectl get svc | grep mysql
mysql-lb-bgp   LoadBalancer   10.43.252.65    192.168.58.0   3306:32005/TCP   49d
# on test01.lan
mysql -uroot -proot -h 192.168.58.0

# on test01.lan: tcpdump -i any host 192.168.58.0
07:25:53.893685 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
07:25:53.893844 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
07:25:54.947566 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
07:25:54.947758 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
07:25:56.995544 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
07:25:56.995718 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
07:26:01.027551 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
07:26:01.027819 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
07:26:09.347586 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
07:26:09.347784 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
07:26:25.731621 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
07:26:25.732171 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
07:26:57.987563 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0
07:26:57.987731 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0

# on my GW/R: tcpdump -i bridge0 host 192.168.58.0
01:25:53.891936 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
01:25:53.891985 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
01:25:54.945847 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
01:25:54.945883 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
01:25:56.993787 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
01:25:56.993824 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
01:26:01.025864 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
01:26:01.025902 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
01:26:09.345843 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
01:26:09.345880 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
01:26:25.730259 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
01:26:25.730296 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
01:26:57.985817 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0
01:26:57.985855 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0

It's entirely possible it's something to do with the environment. I do have 2 very different environments however that are exhibiting the same behavior.

Thanks. Is the cilium bpf lb list output from test01.lan? If not, does test01.lan run cilium-agent?

It is from that node. It is running the agent.

Can you paste bpftool cgroup tree output from that node, and also cilium bpf lb list?

kubectl -n kube-system exec cilium-qqstk -- bpftool cgroup tree
CgroupPath
ID       AttachType      AttachFlags     Name           
/run/cilium/cgroupv2
11786    connect4                                       
11778    connect6                                       
11788    post_bind4                                     
11780    post_bind6                                     
11790    sendmsg4                                       
11782    sendmsg6                                       
11792    recvmsg4                                       
11784    recvmsg6                 
kubectl -n kube-system exec cilium-qqstk -- cilium bpf lb list
SERVICE ADDRESS      BACKEND ADDRESS                
10.43.13.198:44134   0.0.0.0:0 (26) [ClusterIP]     
                     10.42.2.234:44134 (26)         
10.43.212.93:8443    10.42.3.158:8443 (7)           
                     0.0.0.0:0 (7) [ClusterIP]      
172.29.4.1:8000      10.42.1.23:80 (19)             
                     10.42.3.51:80 (19)             
                     0.0.0.0:0 (19) [ExternalIPs]   
                     10.42.2.227:80 (19)            
10.43.252.65:3306    10.42.0.215:3306 (1)           
                     0.0.0.0:0 (1) [ClusterIP]      
10.42.0.40:32005     10.42.0.215:3306 (5)           
                     0.0.0.0:0 (5) [NodePort]       
10.43.0.10:53        0.0.0.0:0 (24) [ClusterIP]     
                     10.42.1.38:53 (24)             
                     10.42.0.96:53 (24)             
10.42.0.40:32101     10.42.3.51:80 (22)             
                     0.0.0.0:0 (22) [NodePort]      
                     10.42.2.227:80 (22)            
                     10.42.1.23:80 (22)             
0.0.0.0:31298        0.0.0.0:0 (9) [NodePort]       
                     10.42.3.158:8443 (9)           
0.0.0.0:32005        10.42.0.215:3306 (3)           
                     0.0.0.0:0 (3) [NodePort]       
172.29.2.21:32005    0.0.0.0:0 (4) [NodePort]       
                     10.42.0.215:3306 (4)           
10.43.113.163:3300   10.42.1.145:3300 (29)          
                     0.0.0.0:0 (29) [ClusterIP]     
172.29.2.21:31298    0.0.0.0:0 (10) [NodePort]      
                     10.42.3.158:8443 (10)          
10.43.0.10:9153      0.0.0.0:0 (23) [ClusterIP]     
                     10.42.0.96:9153 (23)           
                     10.42.1.38:9153 (23)           
10.43.104.150:80     0.0.0.0:0 (12) [ClusterIP]     
10.43.239.70:6789    0.0.0.0:0 (13) [ClusterIP]     
                     10.42.0.21:6789 (13)           
10.43.121.94:9283    10.42.3.158:9283 (17)          
                     0.0.0.0:0 (17) [ClusterIP]     
10.43.0.1:443        0.0.0.0:0 (27) [ClusterIP]     
                     172.29.2.21:6443 (27)          
10.43.123.162:443    0.0.0.0:0 (25) [ClusterIP]     
                     10.42.2.54:443 (25)            
10.43.215.34:80      10.42.2.247:80 (6)             
                     0.0.0.0:0 (6) [ClusterIP]      
10.43.246.6:3300     0.0.0.0:0 (16) [ClusterIP]     
                     10.42.4.101:3300 (16)          
10.43.246.6:6789     0.0.0.0:0 (15) [ClusterIP]     
                     10.42.4.101:6789 (15)          
10.43.8.169:8443     0.0.0.0:0 (8) [ClusterIP]      
                     10.42.3.158:8443 (8)           
10.42.0.40:31298     10.42.3.158:8443 (11)          
                     0.0.0.0:0 (11) [NodePort]      
192.168.58.0:3306    10.42.0.215:3306 (2)           
                     0.0.0.0:0 (2) [ExternalIPs]    
10.43.239.70:3300    10.42.0.21:3300 (14)           
                     0.0.0.0:0 (14) [ClusterIP]     
10.43.113.163:6789   10.42.1.145:6789 (28)          
                     0.0.0.0:0 (28) [ClusterIP]     
10.43.120.219:8000   0.0.0.0:0 (18) [ClusterIP]     
                     10.42.2.227:80 (18)            
                     10.42.3.51:80 (18)             
                     10.42.1.23:80 (18)             
172.29.2.21:32101    0.0.0.0:0 (21) [NodePort]      
                     10.42.1.23:80 (21)             
                     10.42.2.227:80 (21)            
                     10.42.3.51:80 (21)             
0.0.0.0:32101        10.42.1.23:80 (20)             
                     10.42.3.51:80 (20)             
                     10.42.2.227:80 (20)            
                     0.0.0.0:0 (20) [NodePort]      
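For anyone who wants to check a listing like this programmatically, here is a minimal parsing sketch. The layout it expects (frontend in the first column, backends on the same or on indented continuation lines, and a `0.0.0.0:0` entry carrying the bracketed type tag) is inferred only from the output pasted above and is an assumption, not a documented format; `parse_lb_list` is a hypothetical helper name.

```python
import re

# One backend-column entry, e.g. "10.42.0.215:3306 (2)" or
# "0.0.0.0:0 (2) [ExternalIPs]". The bracketed tag is optional.
ENTRY = re.compile(r"^(?P<addr>\S+) \((?P<id>\d+)\)(?: \[(?P<type>[^\]]+)\])?$")

SAMPLE = """\
SERVICE ADDRESS      BACKEND ADDRESS
10.43.252.65:3306    10.42.0.215:3306 (1)
                     0.0.0.0:0 (1) [ClusterIP]
192.168.58.0:3306    10.42.0.215:3306 (2)
                     0.0.0.0:0 (2) [ExternalIPs]
"""


def parse_lb_list(text):
    """Map each frontend to its type tag and its list of backends."""
    services = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("SERVICE ADDRESS"):
            continue
        if line[0].isspace():
            rest = line  # continuation line for the current frontend
        else:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # e.g. a "..." elision marker
            current = services.setdefault(parts[0], {"type": None, "backends": []})
            rest = parts[1]
        match = ENTRY.match(rest.strip())
        if current is None or match is None:
            continue
        if match.group("type"):
            current["type"] = match.group("type")
        else:
            current["backends"].append(match.group("addr"))
    return services


if __name__ == "__main__":
    for frontend, info in parse_lb_list(SAMPLE).items():
        print(frontend, info["type"], info["backends"])
```

Run against the listing above, the entry for 192.168.58.0:3306 comes back tagged ExternalIPs rather than as a separate load-balancer type.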

This explains why load balancing is not performed by bpf_sock: 192.168.58.0:3306 is an externalIP svc. Because that address does not belong to any of the nodes' IP addresses, bpf_sock does not perform the translation, in order to prevent man-in-the-middle attacks.

Check with kube-proxy, but you are not supposed to access an externalIP svc from inside a cluster.
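The rule described in this comment can be written down as a small decision function. This is a simplified model of the explanation given here, not the actual bpf_sock code; `socket_lb_translates` is a hypothetical name, and the node addresses in the usage lines are taken from the output earlier in the thread on the assumption that they are the addresses owned by test01.lan.

```python
from ipaddress import ip_address


def socket_lb_translates(service_type, frontend_ip, node_ips):
    """Model: should the socket-level load balancer rewrite a connect()
    to this frontend when it is issued from the host namespace?"""
    if service_type == "ExternalIPs":
        # Only translate when the external IP is one of the node's own
        # addresses. Otherwise any user-chosen externalIP could hijack
        # traffic meant for an arbitrary destination.
        owned = {ip_address(ip) for ip in node_ips}
        return ip_address(frontend_ip) in owned
    # Assumption: the other types seen in this thread (ClusterIP,
    # NodePort) are always translated.
    return True


if __name__ == "__main__":
    node_ips = {"172.29.2.21", "10.42.0.40"}
    print(socket_lb_translates("ExternalIPs", "192.168.58.0", node_ips))
    print(socket_lb_translates("NodePort", "172.29.2.21", node_ips))
```

In this model the MetalLB address 192.168.58.0 is not translated, so the connection leaves the host unmodified, which matches the unanswered SYNs in the captures above.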

@brb I'm not sure I follow the logic there while reconciling with the actual behavior. I'll test on my canal setup but observe the following:

  • not allowing the nodes (or better said, the host networking namespace) to access LB services seems, on the surface, a quite limiting and bad approach
  • if the request originates at N1 and is bound to terminate on a pod on N2 access is fine (so the overall idea of not allowing nodes to access LB services is not seen in this behavior)
  • given the above, the behavior is completely sporadic and depends entirely on A. the router's ECMP decision making and B. cilium's load balancing decision(s) for any given request
  • pods can access those services even from the same node (ie: if I exec into the mysql pod and connect to mysql using the LB IP it works). Watching tcpdump in this case seems to indicate that cilium is intercepting this traffic and redirecting without ever going to the gw/r etc.

I'm not sure if the ideal here is that the host networking namespace behaves the same as the pod namespaces (ie: traffic is intercepted and never leaves the cluster...or at least doesn't go to the default route etc). I would assume so for consistency's sake.

I may have missed a subtle comment there about externalIP. To be clear, here is the service definition; its type is LoadBalancer:

apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"metallb.universe.tf/address-pool":"dedicated-bgp"},"name":"mysql-lb-bgp","namespace":"default"},"spec":{"ports":[{"name":"mysql","port":3306,"protocol":"TCP"}],"selector":{"app":"mysql"},"type":"LoadBalancer"}}
    metallb.universe.tf/address-pool: dedicated-bgp
  creationTimestamp: "2020-04-08T16:28:16Z"
  name: mysql-lb-bgp
  namespace: default
  resourceVersion: "88927656"
  selfLink: /api/v1/namespaces/default/services/mysql-lb-bgp
  uid: 65e2c239-9e2d-4c07-9271-956d82ccc41a
spec:
  clusterIP: 10.43.252.65
  externalTrafficPolicy: Cluster
  ports:
  - name: mysql
    nodePort: 32005
    port: 3306
    protocol: TCP
    targetPort: 3306
  selector:
    app: mysql
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 192.168.58.0

So a quick/crude test on a canal based cluster shows the service (type LoadBalancer, policy Cluster) available to all nodes in the cluster. In addition it appears to be behaving exactly as I described the hypothetical behavior should be...traffic never goes to the gw/router and stays local to the cluster from what I can tell.

However, the same setup with policy Local appears to result in failure from nodes which don't have a running pod :( I'm not entirely sure of the culprit of this but I'm guessing kube-proxy as it appears no traffic goes to the fw nor to the destination node. The only thing I see on the node sending the request is:

14:04:11.353705 IP 192.168.57.0 > 192.168.57.0: ICMP 192.168.57.0 tcp port mysql unreachable, length 68
14:04:12.374192 IP 192.168.57.0 > 192.168.57.0: ICMP 192.168.57.0 tcp port mysql unreachable, length 68
14:08:10.864621 IP 192.168.57.0 > 192.168.57.0: ICMP 192.168.57.0 tcp port mysql unreachable, length 68

Not sure why ICMP is getting invoked here, I'm trying to telnet 192.168.57.0 3306.

My expectation for kube-proxy or cilium is that the request would be successful for any combination but there may be things I'm not fully grasping for sure :)

OK, then we need to distinguish in the datapath between LoadBalancer and externalIP service types, and allow the translation of the former.

@brb there is no such thing as externalIP type.

https://kubernetes.io/docs/concepts/services-networking/service/#external-ips

What is the difference in logic you intend to implement regardless? IMO if traffic is originating from a node toward a known external IP (using that term very loosely), it should just work, no?

I think I may understand the security implications/concerns. The idea being that IPs provisioned via LoadBalancer can be more trusted as cluster administrators would generally have more control over that vs users could put any arbitrary IP as an externalIP?

I understand now how that could be potentially malicious as I hypothetically could hijack traffic destined to any arbitrary endpoint in any public/private network and get it redirected to the service. Perhaps a config option(s) might be necessary to let cluster admins determine desired behavior?

there is no such thing as externalIP type

In the datapath we treat it as a separate service type, as it has some security implications.

The idea being that IPs provisioned via LoadBalancer can be more trusted as cluster administrators would generally have more control over that vs users could put any arbitrary IP as an externalIP?

Yes, see: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L1092

Perhaps a config option(s) might be necessary to let cluster admins determine desired behavior?

We want to conform to k8s' kube-proxy, which doesn't have such a flag for reducing a cluster's security. Also, there are other ways to access the same service from inside a cluster.

Not sure I follow the implications of the last comment. Does that mean this won’t be implemented?

As it stands in both kube-proxy and cilium the ‘security’ of the situation is useless as traffic blocks in some situations but not all.

As for following kube-proxy, there are already lots of things cilium does that kube-proxy doesn’t (or I wouldn’t bother using the replacement). The advanced features are precisely what make cilium appealing.

Could you explain why you want to access a svc via externalIP instead of LoadBalancer IP, ClusterIP or NodeIP + NodePort from inside a cluster?

Maybe I wasn’t clear. I’m only concerned about LB...and that is the original issue. I understand the security implications of external IP for sure.

I want services to be accessible from their globally recognizable (ie external) ip/dns entries by the cluster itself including nodes and pods running in the host networking namespace.

There are many examples. Right now I’m running an ingress controller using LB. Behind that ingress, one of the things I’m running is the rancher platform. Usually the cluster running rancher is imported into rancher (inception?), and in so doing agent pods in the host networking namespace are spun up and attempt to connect to rancher on the globally recognized dns (ie dns entry pointed to the LB IP). As it stands, depending on circumstances (sporadic), the agent may or may not connect. That’s just 1 example.

In general when the clusters start getting used more heavily I don’t want people to care what cluster a service is running in or even know. I also don’t want to explain to them why it doesn’t work if they happen to try accessing a service that should be recognizable anywhere in the environment but isn’t from X cluster because...reasons.

I appreciate the patience in understanding the scenario!

To be clear, the IP I’m using is the LB IP and the service does not define any externalIPs.

This is going to be fixed once this issue is resolved.

Re accessing externalIPs from inside a cluster, please create a separate issue. Thanks.

Awesome! I have no desire for externalIP traffic personally.

I’ve got a test cluster ready to go when a test image is ready. I have a pretty extensive set of tests to run through to ensure a bunch of different scenarios work as expected so I can run through all of those if it’s helpful.

Cool, we will ping you once we have a fix.

We've hit the same issue. Currently Cilium collapses LB type into ExternalIP here: https://github.com/cilium/cilium/blob/master/pkg/service/service.go#L597

As discussed, ExternalIP is not supposed to be accessible from rootns. LB does not have this restriction, yet it's applied due to collapse above.
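To illustrate why the collapse matters, here is a small sketch under stated assumptions. The enum and both functions are hypothetical and are not taken from pkg/service; the rule that an ExternalIPs service is unreachable from the root namespace is a simplification of the discussion above.

```python
from enum import Enum


class ServiceType(Enum):
    CLUSTER_IP = "ClusterIP"
    NODE_PORT = "NodePort"
    EXTERNAL_IPS = "ExternalIPs"
    LOAD_BALANCER = "LoadBalancer"


def datapath_type(k8s_type, collapse_lb):
    """Type the datapath ends up seeing for a given Kubernetes frontend."""
    if collapse_lb and k8s_type is ServiceType.LOAD_BALANCER:
        # Behaviour described above: LB frontends are stored as ExternalIPs.
        return ServiceType.EXTERNAL_IPS
    return k8s_type


def reachable_from_rootns(k8s_type, collapse_lb):
    """Simplified rule: only ExternalIPs is blocked from the root namespace."""
    return datapath_type(k8s_type, collapse_lb) is not ServiceType.EXTERNAL_IPS


if __name__ == "__main__":
    for collapse in (True, False):
        print(collapse, reachable_from_rootns(ServiceType.LOAD_BALANCER, collapse))
```

With the collapse in place, a LoadBalancer frontend inherits the ExternalIPs restriction. Keeping the two types distinct lifts it for LoadBalancer while leaving ExternalIPs restricted.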

@travisghansen The issue is now fixed in docker.io/cilium/cilium:latest and will be part of the v1.8.1 tag soon; tested with MetalLB. If you have a chance to also run through your set of tests as mentioned above, that would be great. Thanks for reporting!

This makes me so happy! I’m tied up most of the day today but I’ll definitely have some feedback in the next day or so. This is super high priority for me so I really appreciate the help.

@borkmann either I've got something misconfigured or it's still off.

kubectl exec -it -n kube-system cilium-cq8rs -- cilium version
Client: 1.8.90 c00b4b70a 2020-06-24T13:50:41-07:00 go version go1.14.4 linux/amd64
Daemon: 1.8.90 c00b4b70a 2020-06-24T13:50:41-07:00 go version go1.14.4 linux/amd64
kubectl get pods -o wide | grep mysql
mysql-6d45dbbb4-gq4f4          1/1     Running   0          49m   10.42.0.234   test01.lan   <none>           <none>
kubectl get svc | grep mysql
mysql-lb-bgp   LoadBalancer   10.43.252.65    192.168.58.0   3306:32005/TCP   78d

(from cilium on test01.lan)

kubectl exec -it -n kube-system cilium-vm84s -- cilium service list
ID   Frontend             Service Type   Backend                  
1    10.43.0.10:53        ClusterIP      1 => 10.42.0.222:53      
                                         2 => 10.42.4.45:53       
2    10.43.0.10:9153      ClusterIP      1 => 10.42.0.222:9153    
                                         2 => 10.42.4.45:9153     
3    10.43.121.94:9283    ClusterIP      1 => 10.42.3.158:9283    
4    10.43.239.70:6789    ClusterIP      1 => 10.42.0.208:6789    
5    10.43.239.70:3300    ClusterIP      1 => 10.42.0.208:3300    
6    10.43.0.1:443        ClusterIP      1 => 172.29.2.21:6443    
7    10.43.215.34:80      ClusterIP      1 => 10.42.2.247:80      
8    10.43.100.168:6789   ClusterIP      1 => 10.42.3.177:6789    
9    10.43.100.168:3300   ClusterIP      1 => 10.42.3.177:3300    
10   10.43.123.162:443    ClusterIP      1 => 10.42.2.54:443      
11   10.43.13.198:44134   ClusterIP      1 => 10.42.2.234:44134   
12   10.43.8.169:8443     ClusterIP      1 => 10.42.3.158:8443    
13   0.0.0.0:31298        NodePort       1 => 10.42.3.158:8443    
14   172.29.2.21:31298    NodePort       1 => 10.42.3.158:8443    
15   10.43.113.163:3300   ClusterIP      1 => 10.42.1.145:3300    
16   10.43.113.163:6789   ClusterIP      1 => 10.42.1.145:6789    
17   10.43.104.150:80     ClusterIP                               
18   10.43.212.93:8443    ClusterIP      1 => 10.42.3.158:8443    
19   10.43.120.219:8000   ClusterIP      1 => 10.42.1.23:80       
                                         2 => 10.42.2.227:80      
                                         3 => 10.42.3.51:80       
20   172.29.4.1:8000      LoadBalancer   1 => 10.42.1.23:80       
                                         2 => 10.42.2.227:80      
                                         3 => 10.42.3.51:80       
21   172.29.2.21:32101    NodePort       1 => 10.42.1.23:80       
                                         2 => 10.42.2.227:80      
                                         3 => 10.42.3.51:80       
22   0.0.0.0:32101        NodePort       1 => 10.42.1.23:80       
                                         2 => 10.42.2.227:80      
                                         3 => 10.42.3.51:80       
23   10.43.252.65:3306    ClusterIP      1 => 10.42.0.234:3306    
24   192.168.58.0:3306    LoadBalancer   1 => 10.42.0.234:3306    
25   172.29.2.21:32005    NodePort       1 => 10.42.0.234:3306    
26   0.0.0.0:32005        NodePort       1 => 10.42.0.234:3306    
# fails from test01.lan
# works from other nodes in/out of the cluster
mysql -uroot -proot -h 192.168.58.0

# works from test01.lan
mysql -uroot -proot -h 127.0.0.1 -P 32005

I've tried with both dsr and snat.

Actually, I haven't run through the full matrix, but the issue appears to be with Local traffic policy on the svc. When I changed it to Cluster I was able to access from the same node running the Pod. I'll keep digging now that I got that to work..

That is correct. I haven't changed the semantics of the Local property for any of the types on in-cluster access; let me revisit that. The current semantics are that, for any of the types, we skip translation if the svc address is not HOST_ID in ipcache, i.e. it must be a local address. Let me run some comparisons with the expected in-cluster behavior next. For the time being, please run with Cluster.
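The Local semantics described in this comment can be sketched as follows. This is a toy model, not datapath code: the dictionary stands in for ipcache, the string "host" stands in for HOST_ID, `translates_in_cluster` is a hypothetical name, and the example addresses are assumed to be node-local because they appear as NodePort frontends on test01.lan.

```python
def translates_in_cluster(policy, frontend_ip, ipcache):
    """Model of the pre-fix rule for in-cluster access.

    ipcache maps an IP address to an identity label; only the label
    "host" (standing in for HOST_ID) marks a node-local address.
    """
    if policy == "Local":
        # Translation is skipped unless the frontend is a local address.
        return ipcache.get(frontend_ip) == "host"
    return True


if __name__ == "__main__":
    ipcache = {"172.29.2.21": "host", "10.42.0.40": "host"}
    print(translates_in_cluster("Local", "192.168.58.0", ipcache))
    print(translates_in_cluster("Local", "172.29.2.21", ipcache))
    print(translates_in_cluster("Cluster", "192.168.58.0", ipcache))
```

In this model the LB IP is not translated under Local while the node address is, which lines up with the report that the LB IP fails from test01.lan while the NodePort still works.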

Yeah, I have a bigger matrix to go through still (non-related Pods running on N1 and N2 etc hitting LB IP, Cluster IP, etc) when Local is fully up...but here's what I've tested so far (sorry for the crude syntax, let me know if something doesn't make sense).

dsr + cluster (1 Pod backing service running on N1)
(all work)

N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P
N1 -> SVC Cluster IP -> N1P

N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P

EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)

dsr + local (1 Pod backing service running on N1)

N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P (fail)
N1 -> SVC Cluster IP -> N1P

N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P

EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)

snat + cluster (1 Pod backing service running on N1)
(all work)

N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P
N1 -> SVC Cluster IP -> N1P

N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P

EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)

snat + local (1 Pod backing service running on N1)

N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P (fail)
N1 -> SVC Cluster IP -> N1P

N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P

EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)

Everything behaves as expected (at least as I expect) with the exception of:
dsr or snat + Local + N1 -> LB IP -> N1P

Nice work! Looking forward to the final Local piece!

I'm currently running in dsr mode and mostly bring this up as a talking point. When the connection goes to a node-local pod, the source IP is the IP of the cilium_host device. When the connection goes to a non-node-local pod, it's the IP of the NIC with the default route (for me anyway). This is using native routing with Cluster policy ATM.

I'm not entirely sure what the behavior should be, mostly just making note in case folks run into policy issues that could stem from the behavior.

Just testing now and it looks really good so far. Note, however, that the service NodePort works on ALL nodes (even those not running pods) with externalTrafficPolicy=Local. I doubt this is intended behavior, but perhaps it is.

Note however that external traffic appears to only work to nodes that do have pods running.

Not a big deal to me, but seems a little odd.

Thanks a lot for testing. This is the intended behavior, meaning:

For requests from outside the cluster, services with externalTrafficPolicy=Local will only select backends that are local to the node, in order to allow for client source IP preservation. This means that if no backend is on that node, the request is dropped. If there are local backends, the request is load-balanced among them.

For requests from inside the cluster, services with externalTrafficPolicy=Local will be able to select all service backends, including those that are remote to the node. Given that for in-cluster requests we always select backends directly at the socket layer, there is no intermediate hop that would have to do SNAT for NodePort. In-cluster communication to all endpoints also works with kube-proxy (#11746).
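
As a rough illustration of the two cases above, the selection rule can be modelled as a small shell function. This is a toy sketch, not Cilium's datapath code: `select_backends`, the `external`/`internal` labels, the `node/pod-ip` backend notation and the pod IPs are all made up for this example.

```shell
# Toy model of backend selection for a service with externalTrafficPolicy=Local.
# Usage: select_backends <external|internal> <receiving-node> <node/pod-ip>...
# Prints the pod IPs that are eligible backends, one per line.
select_backends() {
  local origin="$1"
  local node="$2"
  shift 2
  local b
  for b in "$@"; do
    # In-cluster requests may use any backend; external requests
    # may only use backends located on the node that received them.
    if [ "$origin" = "internal" ] || [ "${b%%/*}" = "$node" ]; then
      echo "${b#*/}"
    fi
  done
}

# External request hitting N2 while the only pod runs on N1:
# nothing is printed, which corresponds to the request being dropped.
select_backends external N2 N1/10.0.1.5

# In-cluster request from N2: the remote backend is still selectable.
select_backends internal N2 N1/10.0.1.5
```

An empty result in the external case corresponds to the "request is dropped" behavior described above.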

@borkmann OK sounds good. I've got another issue (not sure if it's always been doing this or not) but with Cluster policy + dsr I've got a situation where client IP is not being preserved. Going to dig a little bit more but current scenario is: EXT -> N2 NodePort -> N1P

IP is preserved with:

EXT -> N1 NodePort -> N1P

Issue may be broader than NodePort however, I'll do some testing with LB IP and forcing traffic in different paths...

Sample output, run from an external node:

for ip in 192.168.58.0 172.29.2.21:32005 172.29.2.22:32005;do   echo $ip;   port=${ip#*:};   if [ "${ip}" == "${port}" ];then     port="3306";   fi;   ip=$(echo "${ip}" | cut -d ":" -f1);   mysql -uroot -proot -h "${ip}" -P "${port}" -e "SHOW PROCESSLIST"; done
192.168.58.0
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+--------------------+------+---------+------+-------+------------------+
| Id    | User | Host               | db   | Command | Time | State | Info             |
+-------+------+--------------------+------+---------+------+-------+------------------+
| 25867 | root | 172.29.10.12:40874 | NULL | Query   |    0 | init  | SHOW PROCESSLIST |
+-------+------+--------------------+------+---------+------+-------+------------------+
172.29.2.21:32005
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+--------------------+------+---------+------+-------+------------------+
| Id    | User | Host               | db   | Command | Time | State | Info             |
+-------+------+--------------------+------+---------+------+-------+------------------+
| 25868 | root | 172.29.10.12:58400 | NULL | Query   |    0 | init  | SHOW PROCESSLIST |
+-------+------+--------------------+------+---------+------+-------+------------------+
172.29.2.22:32005
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+-------------------+------+---------+------+-------+------------------+
| Id    | User | Host              | db   | Command | Time | State | Info             |
+-------+------+-------------------+------+---------+------+-------+------------------+
| 25869 | root | 172.29.2.22:41960 | NULL | Query   |    0 | init  | SHOW PROCESSLIST |
+-------+------+-------------------+------+---------+------+-------+------------------+

LB IP is 192.168.58.0

Client IP is 172.29.10.12

172.29.2.21:32005 and 172.29.2.22:32005 are the svc NodePort on 2 different nodes; the single backing pod is running only on the former node.
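
For anyone wanting to repeat the test, the one-liner above may be easier to follow when split up. The following is a sketch of the same host/port handling; `split_target` is a helper name introduced here, and the `mysql` call is left commented out so the snippet runs without a database.

```shell
# Split "host[:port]" into "host port", falling back to a default port
# (3306, as in the original one-liner) when no port is given.
split_target() {
  local target="$1"
  local default_port="${2:-3306}"
  local host="${target%%:*}"
  local port="${target#*:}"
  # Without a colon the expansion leaves the string unchanged.
  if [ "$port" = "$target" ]; then
    port="$default_port"
  fi
  printf '%s %s\n' "$host" "$port"
}

for target in 192.168.58.0 172.29.2.21:32005 172.29.2.22:32005; do
  hp=$(split_target "$target")
  host=${hp% *}
  port=${hp#* }
  echo "$target -> host=$host port=$port"
  # mysql -uroot -proot -h "$host" -P "$port" -e "SHOW PROCESSLIST"
done
```

The `Host` column of `SHOW PROCESSLIST` then shows which source address the pod saw for each path.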

I have my suspicions that if the LB/BGP traffic landed on a node without a backing pod I'd have the same behavior...will test after sleepy sleepy though.

Hmm, I currently cannot reproduce this. I have 3 nodes:

  • node A (apoc): 192.168.178.29
  • node B (tank): 192.168.178.28
  • node C: 192.168.178.30

Node C is an external client, meaning it is not managed by Cilium.

# kubectl get pods -n default -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE   NOMINATED NODE   READINESS GATES
nginx6-7d4b5d6bdf-sgqpm   1/1     Running   0          90m   10.217.1.138   tank   <none>           <none>

And I have:

# kubectl get svc nginx6
NAME     TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
nginx6   NodePort   10.97.128.143   <none>        80:31190/TCP   89m

# kubectl get ep nginx6
NAME     ENDPOINTS         AGE
nginx6   10.217.1.138:80   89m

So the backend is on node tank. The service has externalTrafficPolicy: Cluster.

I'm running the agent on apoc/tank with DSR. Full options I've used:

# ./daemon/cilium-agent --identity-allocation-mode=crd --enable-ipv6=true --enable-ipv4=true --disable-envoy-version-check=true --tunnel=disabled --k8s-kubeconfig-path=$HOME/.kube/config --kube-proxy-replacement=strict --node-port-mode=dsr --enable-l7-proxy=false --auto-direct-node-routes=true --native-routing-cidr=10.217.0.0/16

Node C does a curl to apoc, which finds that the backend is remote and thus needs DSR:

# curl 192.168.178.29:31190
<!DOCTYPE html>
<html>
[...]

Running tcpdump on node tank I get:

# tcpdump -i any port 80 or port 31190 -n
[...]
13:38:49.643764 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], seq 1174432341, win 64240, options [mss 1460,sackOK,TS val 2665013319 ecr 0,nop,wscale 7], length 0
13:38:49.643800 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], seq 1174432341, win 64240, options [mss 1460,sackOK,TS val 2665013319 ecr 0,nop,wscale 7], length 0
13:38:49.643826 IP 10.217.1.138.80 > 192.168.178.30.51300: Flags [S.], seq 4260698455, ack 1174432342, win 65160, options [mss 1460,sackOK,TS val 3340810749 ecr 2665013319,nop,wscale 7], length 0
13:38:49.643836 IP 192.168.178.29.31190 > 192.168.178.30.51300: Flags [S.], seq 4260698455, ack 1174432342, win 65160, options [mss 1460,sackOK,TS val 3340810749 ecr 2665013319,nop,wscale 7], length 0
13:38:49.646042 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 2665013322 ecr 3340810749], length 0
13:38:49.646050 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 2665013322 ecr 3340810749], length 0
[...]

So client source IP is preserved (192.168.178.30.51300) and tank replies on behalf of apoc (192.168.178.29.31190) in the SYN/ACK.
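
To make the same check less manual, the capture can be filtered for the source addresses of the initial SYNs that reach the backend. This is only a sketch: `syn_sources` is a name made up here, and it assumes `tcpdump -n` lines in exactly the format shown above (no interface column, `IP` as the second field).

```shell
# Read tcpdump -n output on stdin and print the unique source IPs of
# plain SYN packets ("Flags [S],") addressed to the given backend IP.
syn_sources() {
  awk -v backend="$1" '
    $2 == "IP" && $4 == ">" && $6 == "Flags" && $7 == "[S]," {
      src = $3; sub(/\.[0-9]+$/, "", src)    # strip the source port
      dst = $5; sub(/\.[0-9]+:$/, "", dst)   # strip the port and trailing colon
      if (dst == backend) print src
    }' | sort -u
}

# Example with two lines in the format of the capture above:
printf '%s\n' \
  '13:38:49.643764 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], seq 1174432341, win 64240, length 0' \
  '13:38:49.643826 IP 10.217.1.138.80 > 192.168.178.30.51300: Flags [S.], seq 4260698455, ack 1174432342, win 65160, length 0' \
  | syn_sources 10.217.1.138
```

If the only address printed is the external client's, the source IP was preserved; a node address in the list would point to SNAT on that path.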

Which agent options do you have?

172.29.2.22:32005
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+-------------------+------+---------+------+-------+------------------+
| Id    | User | Host              | db   | Command | Time | State | Info             |
+-------+------+-------------------+------+---------+------+-------+------------------+
| 25869 | root | 172.29.2.22:41960 | NULL | Query   |    0 | init  | SHOW PROCESSLIST |
+-------+------+-------------------+------+---------+------+-------+------------------+

Hmm, you're saying that host should have been 172.29.10.12 on the last entry. That node is running the agent with DSR for sure, right?

Do you have some steps to reproduce?

Yeah that’s correct. It’s deployed as daemon set so unless something is off it should be with dsr yes.

I’ll kick the agents pods in a bit just to confirm it still is in a bad state.

Sorry above comments about source IP are complete noise...I thought I was running in dsr but was not :(

I'll go through a deeper round of testing with snat + Local now and report observations.
