General Information
Cilium version (cilium version):
Client: 1.7.4 c7ee6d62b 2020-05-15T16:07:35+02:00 go version go1.13.10 linux/amd64
Daemon: 1.7.4 c7ee6d62b 2020-05-15T16:07:35+02:00 go version go1.13.10 linux/amd64
Kernel version (uname -a):
Linux test03.lan 5.6.2-1.el7.elrepo.x86_64 #1 SMP Thu Apr 2 10:55:54 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Orchestration system version (kubectl version, Mesos, ...):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
# automatically restarts pods to ensure controlled by new CNI driver
operator:
  enabled: true
global:
  k8sServiceHost: "127.0.0.1"
  k8sServicePort: "6443"
  enableXTSocketFallback: false
  prometheus:
    enabled: true
  bpf:
    preallocateMaps: true
  # disabling not ready for primetime yet
  # https://github.com/cilium/cilium/projects/93#column-7748410
  installIptablesRules: true
  # https://docs.cilium.io/en/latest/architecture/#arch-guide
  # https://cilium.io/blog/2019/02/12/cilium-14/#sockmap-bpf-based-sidecar-acceleration-alpha
  # https://www.youtube.com/watch?v=ER9eIXL2_14
  sockops:
    enabled: true
  k8s:
    # cilium pods will not start on node until pod CIDR has been assigned
    requireIPv4PodCIDR: true
  # eliminates need for any kind of BGP stuff
  # automatically adds routes to each node
  autoDirectNodeRoutes: true
  tunnel: disabled
  kubeProxyReplacement: strict
  hostServices:
    enabled: true
  nodePort:
    enabled: true
    # dsr or snat
    #mode: dsr
    mode: snat
  externalIPs:
    enabled: true
  # dev purposes only
  cleanState: false
  cleanBpfState: true
How to reproduce the issue
I'm using cilium with metallb and the kube-proxy replacement. I've got a pretty big matrix of scenarios I'm testing and most of them work, but we've found a situation where certain traffic fails to be handled. I believe I can distill the issue down to: when traffic leaves N1 (Node 1) and comes back to N1P (Pod running on Node 1) without an SNAT involved, it fails.
I've tried this with both dsr mode (my intended target) and snat mode (less interested, but wanted to try it out). Both fail under the above circumstances. Here's a pretty crude representation of what I think the traffic flows are and what works and what fails:
GW = gateway
R = router
NX = node X
NXP = pod running on node X
# service with Cluster externalTrafficPolicy
N1 -> GW -> R -> N2 -> N1P: fail
N1 -> GW -> R -> N1 -> N1P: fail
N2 -> GW -> R -> N1 -> N1P: success
# service with Local externalTrafficPolicy
N2 -> GW -> R -> N1 -> N1P: success
N1 -> GW -> R -> N1 -> N1P: fail
# service with Cluster externalTrafficPolicy
N1 -> GW -> R -> N2 -> N1P: success
N1 -> GW -> R -> N1 -> N1P: fail
# service with Local externalTrafficPolicy
N2 -> GW -> R -> N1 -> N1P: success
N1 -> GW -> R -> N1 -> N1P: fail
cilium bpf lb list on N1? cilium status output?
@brb this is literally the exact same cluster I previously gave you access to, with the exact same mysql service. I think you still have access; if not, I can reopen the port forward.
kubectl -n kube-system exec cilium-qqstk -- cilium bpf lb list
...
192.168.58.0:3306 10.42.0.215:3306 (2)
0.0.0.0:0 (2) [ExternalIPs]
...
kubectl -n kube-system exec cilium-qqstk -- cilium status
KVStore: Ok Disabled
Kubernetes: Ok 1.17 (v1.17.4) [linux/amd64]
Kubernetes APIs: ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [NodePort (DSR, 30000-32767), ExternalIPs, HostReachableServices (TCP, UDP)]
Cilium: Ok OK
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 8/255 allocated from 10.42.0.0/24,
Controller Status: 31/31 healthy
Proxy Status: OK, ip 10.42.0.40, 0 redirects active on ports 10000-20000
Cluster health: 5/5 reachable (2020-05-28T07:22:37Z)
kubectl get pods -A -o wide | grep test01
default mysql-768bcb9b84-x6znx 1/1 Running 0 15h 10.42.0.215 test01.lan <none> <none>
kube-system cilium-qqstk 1/1 Running 0 14h 172.29.2.21 test01.lan <none> <none>
kubectl get svc | grep mysql
mysql-lb-bgp LoadBalancer 10.43.252.65 192.168.58.0 3306:32005/TCP 49d
# on test01.lan
mysql -uroot -proot -h 192.168.58.0
# on test01.lan: tcpdump -i any host 192.168.58.0
07:25:53.893685 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
07:25:53.893844 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
07:25:54.947566 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
07:25:54.947758 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
07:25:56.995544 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
07:25:56.995718 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
07:26:01.027551 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
07:26:01.027819 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
07:26:09.347586 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
07:26:09.347784 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
07:26:25.731621 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
07:26:25.732171 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
07:26:57.987563 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0
07:26:57.987731 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0
# on my GW/R: tcpdump -i bridge0 host 192.168.58.0
01:25:53.891936 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
01:25:53.891985 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341430069 ecr 0,nop,wscale 7], length 0
01:25:54.945847 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
01:25:54.945883 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341431123 ecr 0,nop,wscale 7], length 0
01:25:56.993787 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
01:25:56.993824 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341433171 ecr 0,nop,wscale 7], length 0
01:26:01.025864 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
01:26:01.025902 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341437203 ecr 0,nop,wscale 7], length 0
01:26:09.345843 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
01:26:09.345880 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341445523 ecr 0,nop,wscale 7], length 0
01:26:25.730259 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
01:26:25.730296 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341461907 ecr 0,nop,wscale 7], length 0
01:26:57.985817 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0
01:26:57.985855 IP test01.lan.58998 > 192.168.58.0.mysql: Flags [S], seq 627751695, win 64240, options [mss 1460,sackOK,TS val 2341494163 ecr 0,nop,wscale 7], length 0
It's entirely possible it's something to do with the environment. I do have 2 very different environments however that are exhibiting the same behavior.
Thanks. Is the cilium bpf lb list output from test01.lan? If not, does test01.lan run cilium-agent?
It is from that node. It is running the agent.
Can you paste bpftool cgroup tree output from that node, and also cilium bpf lb list?
kubectl -n kube-system exec cilium-qqstk -- bpftool cgroup tree
CgroupPath
ID AttachType AttachFlags Name
/run/cilium/cgroupv2
11786 connect4
11778 connect6
11788 post_bind4
11780 post_bind6
11790 sendmsg4
11782 sendmsg6
11792 recvmsg4
11784 recvmsg6
kubectl -n kube-system exec cilium-qqstk -- cilium bpf lb list
SERVICE ADDRESS BACKEND ADDRESS
10.43.13.198:44134 0.0.0.0:0 (26) [ClusterIP]
10.42.2.234:44134 (26)
10.43.212.93:8443 10.42.3.158:8443 (7)
0.0.0.0:0 (7) [ClusterIP]
172.29.4.1:8000 10.42.1.23:80 (19)
10.42.3.51:80 (19)
0.0.0.0:0 (19) [ExternalIPs]
10.42.2.227:80 (19)
10.43.252.65:3306 10.42.0.215:3306 (1)
0.0.0.0:0 (1) [ClusterIP]
10.42.0.40:32005 10.42.0.215:3306 (5)
0.0.0.0:0 (5) [NodePort]
10.43.0.10:53 0.0.0.0:0 (24) [ClusterIP]
10.42.1.38:53 (24)
10.42.0.96:53 (24)
10.42.0.40:32101 10.42.3.51:80 (22)
0.0.0.0:0 (22) [NodePort]
10.42.2.227:80 (22)
10.42.1.23:80 (22)
0.0.0.0:31298 0.0.0.0:0 (9) [NodePort]
10.42.3.158:8443 (9)
0.0.0.0:32005 10.42.0.215:3306 (3)
0.0.0.0:0 (3) [NodePort]
172.29.2.21:32005 0.0.0.0:0 (4) [NodePort]
10.42.0.215:3306 (4)
10.43.113.163:3300 10.42.1.145:3300 (29)
0.0.0.0:0 (29) [ClusterIP]
172.29.2.21:31298 0.0.0.0:0 (10) [NodePort]
10.42.3.158:8443 (10)
10.43.0.10:9153 0.0.0.0:0 (23) [ClusterIP]
10.42.0.96:9153 (23)
10.42.1.38:9153 (23)
10.43.104.150:80 0.0.0.0:0 (12) [ClusterIP]
10.43.239.70:6789 0.0.0.0:0 (13) [ClusterIP]
10.42.0.21:6789 (13)
10.43.121.94:9283 10.42.3.158:9283 (17)
0.0.0.0:0 (17) [ClusterIP]
10.43.0.1:443 0.0.0.0:0 (27) [ClusterIP]
172.29.2.21:6443 (27)
10.43.123.162:443 0.0.0.0:0 (25) [ClusterIP]
10.42.2.54:443 (25)
10.43.215.34:80 10.42.2.247:80 (6)
0.0.0.0:0 (6) [ClusterIP]
10.43.246.6:3300 0.0.0.0:0 (16) [ClusterIP]
10.42.4.101:3300 (16)
10.43.246.6:6789 0.0.0.0:0 (15) [ClusterIP]
10.42.4.101:6789 (15)
10.43.8.169:8443 0.0.0.0:0 (8) [ClusterIP]
10.42.3.158:8443 (8)
10.42.0.40:31298 10.42.3.158:8443 (11)
0.0.0.0:0 (11) [NodePort]
192.168.58.0:3306 10.42.0.215:3306 (2)
0.0.0.0:0 (2) [ExternalIPs]
10.43.239.70:3300 10.42.0.21:3300 (14)
0.0.0.0:0 (14) [ClusterIP]
10.43.113.163:6789 10.42.1.145:6789 (28)
0.0.0.0:0 (28) [ClusterIP]
10.43.120.219:8000 0.0.0.0:0 (18) [ClusterIP]
10.42.2.227:80 (18)
10.42.3.51:80 (18)
10.42.1.23:80 (18)
172.29.2.21:32101 0.0.0.0:0 (21) [NodePort]
10.42.1.23:80 (21)
10.42.2.227:80 (21)
10.42.3.51:80 (21)
0.0.0.0:32101 10.42.1.23:80 (20)
10.42.3.51:80 (20)
10.42.2.227:80 (20)
0.0.0.0:0 (20) [NodePort]
This explains why load balancing is not performed by bpf_sock: 192.168.58.0:3306 is an externalIP svc. Because the address does not belong to any of the node's IP addresses, bpf_sock does not perform the translation, in order to prevent man-in-the-middle attacks.
Check with kube-proxy, but you are not supposed to access an externalIP svc from inside a cluster.
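A rough model of the rule described above, as a sketch only. It is not Cilium's datapath code: the names Frontend and should_translate are hypothetical, and the node address set is taken from the NodePort entries in the lb list above as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frontend:
    """A service frontend as it appears in `cilium bpf lb list` (simplified)."""
    ip: str
    port: int
    svc_type: str  # e.g. "ClusterIP", "NodePort", "ExternalIPs"


def should_translate(frontend: Frontend, node_ips: set) -> bool:
    """Model of the socket-level decision described in the comment above.

    Assumption: an ExternalIPs frontend is only rewritten to a backend when
    its address is one of the node's own addresses; other types are always
    rewritten. This is an illustration, not the real implementation.
    """
    if frontend.svc_type == "ExternalIPs":
        return frontend.ip in node_ips
    return True


# Addresses of test01.lan as seen in the NodePort entries (assumed to be the
# complete set of local addresses for this illustration).
node_ips = {"172.29.2.21", "10.42.0.40"}

lb_frontend = Frontend("192.168.58.0", 3306, "ExternalIPs")
cluster_ip = Frontend("10.43.252.65", 3306, "ClusterIP")

print(should_translate(lb_frontend, node_ips))  # not a node address
print(should_translate(cluster_ip, node_ips))
```

Under this model the connection from test01.lan to 192.168.58.0:3306 is left untouched and leaves the node toward the gateway, which would match the repeated SYNs in the tcpdump output.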
@brb I'm not sure I follow the logic there while reconciling with the actual behavior. I'll test on my canal setup but observe the following:
I'm not sure if the ideal here is that the host networking namespace behaves the same as the pod namespaces (ie: traffic is intercepted and never leaves the cluster...or at least doesn't go to the default route etc). I would assume so for consistency's sake.
I may have missed a subtle comment there about externalIP. To be clear, here is the service definition; its type is LoadBalancer:
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"metallb.universe.tf/address-pool":"dedicated-bgp"},"name":"mysql-lb-bgp","namespace":"default"},"spec":{"ports":[{"name":"mysql","port":3306,"protocol":"TCP"}],"selector":{"app":"mysql"},"type":"LoadBalancer"}}
    metallb.universe.tf/address-pool: dedicated-bgp
  creationTimestamp: "2020-04-08T16:28:16Z"
  name: mysql-lb-bgp
  namespace: default
  resourceVersion: "88927656"
  selfLink: /api/v1/namespaces/default/services/mysql-lb-bgp
  uid: 65e2c239-9e2d-4c07-9271-956d82ccc41a
spec:
  clusterIP: 10.43.252.65
  externalTrafficPolicy: Cluster
  ports:
  - name: mysql
    nodePort: 32005
    port: 3306
    protocol: TCP
    targetPort: 3306
  selector:
    app: mysql
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 192.168.58.0
So a quick/crude test on a canal based cluster shows the service (type LoadBalancer, policy Cluster) available to all nodes in the cluster. In addition it appears to be behaving exactly as I described the hypothetical behavior should be...traffic never goes to the gw/router and stays local to the cluster from what I can tell.
However, the same setup with policy Local appears to result in failure from nodes which don't have a running pod :( I'm not entirely sure of the culprit of this but I'm guessing kube-proxy as it appears no traffic goes to the fw nor to the destination node. The only thing I see on the node sending the request is:
14:04:11.353705 IP 192.168.57.0 > 192.168.57.0: ICMP 192.168.57.0 tcp port mysql unreachable, length 68
14:04:12.374192 IP 192.168.57.0 > 192.168.57.0: ICMP 192.168.57.0 tcp port mysql unreachable, length 68
14:08:10.864621 IP 192.168.57.0 > 192.168.57.0: ICMP 192.168.57.0 tcp port mysql unreachable, length 68
Not sure why ICMP is getting invoked here, I'm trying to telnet 192.168.57.0 3306.
My expectation for kube-proxy or cilium is that the request would be successful for any combination but there may be things I'm not fully grasping for sure :)
OK, then we need to distinguish in the datapath between LoadBalancer and externalIP service types, and allow the translation of the former.
@brb there is no such thing as externalIP type.
https://kubernetes.io/docs/concepts/services-networking/service/#external-ips
What is the difference in logic you intend to implement regardless? IMO if traffic is originating from a node toward a known external IP (using that term very loosely) it should just work, no?
I think I may understand the security implications/concerns. The idea being that IPs provisioned via LoadBalancer can be more trusted as cluster administrators would generally have more control over that vs users could put any arbitrary IP as an externalIP?
I understand now how that could be potentially malicious as I hypothetically could hijack traffic destined to any arbitrary endpoint in any public/private network and get it redirected to the service. Perhaps a config option(s) might be necessary to let cluster admins determine desired behavior?
> there is no such thing as externalIP type
In the datapath we treat it as a separate service type, as it has some security implications.
> The idea being that IPs provisioned via LoadBalancer can be more trusted as cluster administrators would generally have more control over that vs users could put any arbitrary IP as an externalIP?
Yes, see: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L1092
> Perhaps a config option(s) might be necessary to let cluster admins determine desired behavior?
We want to conform to Kubernetes' kube-proxy, which has no such flag for reducing a cluster's security. Also, there are more ways to access the same service from inside a cluster.
Not sure I follow the implications of the last comment. Does that mean this won’t be implemented?
As it stands in both kube-proxy and cilium the ‘security’ of the situation is useless as traffic blocks in some situations but not all.
As for following kube-proxy, there are already lots of things cilium does that kube-proxy doesn't (or I wouldn't bother using the replacement). The advanced features are precisely what make cilium appealing.
Could you explain why you want to access a svc via externalIP instead of LoadBalancer IP, ClusterIP or NodeIP + NodePort from inside a cluster?
Maybe I wasn’t clear. I’m only concerned about LB...and that is the original issue. I understand the security implications of external IP for sure.
I want services to be accessible from their globally recognizable (ie external) ip/dns entries by the cluster itself including nodes and pods running in the host networking namespace.
There are many examples. Right now I'm running an ingress controller using LB. Behind that ingress, one of the things I'm running is the rancher platform. Usually the cluster running rancher is imported into rancher (inception?), and in so doing agent pods in the host networking namespace are spun up and attempt to connect to rancher on the globally recognized dns (ie dns entry pointed to the LB IP). As it stands, the agent may or may not connect depending on circumstances (it's sporadic). That's just 1 example.
In general when the clusters start getting used more heavily I don’t want people to care what cluster a service is running in or even know. I also don’t want to explain to them why it doesn’t work if they happen to try accessing a service that should be recognizable anywhere in the environment but isn’t from X cluster because...reasons.
I appreciate the patience in understanding the scenario!
To be clear, the IP I’m using is the LB IP and the service does not define any externalIPs.
> To be clear, the IP I'm using is the LB IP and the service does not define any externalIPs.
This is going to be fixed once this issue is resolved.
Re accessing externalIPs from inside a cluster, please create a separate issue. Thanks.
Awesome! I have no desire for externalIP traffic personally.
I've got a test cluster ready to go when a test image is ready. I have a pretty extensive set of tests to run through to ensure a bunch of different scenarios work as expected, so I can run through all of those if it's helpful.
Cool, we will ping you once we have a fix.
We've hit the same issue. Currently Cilium collapses LB type into ExternalIP here: https://github.com/cilium/cilium/blob/master/pkg/service/service.go#L597
As discussed, ExternalIP is not supposed to be accessible from rootns. LB does not have this restriction, yet it's applied due to collapse above.
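To make the "collapse" concrete, here is a small sketch of how frontends could be derived from a Service object. It is an illustration under assumptions, not the code behind the link above: the function name frontends and the collapse_lb switch are hypothetical, while the Service fields used (spec.clusterIP, spec.ports[].nodePort, spec.externalIPs, status.loadBalancer.ingress[].ip) are standard Kubernetes fields.

```python
def frontends(service: dict, collapse_lb: bool = False) -> list:
    """Return (ip, port, type) tuples for each frontend of a Service.

    collapse_lb=True mimics the behavior described above, where a
    LoadBalancer ingress IP ends up typed as ExternalIPs in the datapath.
    """
    spec = service.get("spec", {})
    ingress_list = service.get("status", {}).get("loadBalancer", {}).get("ingress", [])
    result = []
    for port in spec.get("ports", []):
        if spec.get("clusterIP"):
            result.append((spec["clusterIP"], port["port"], "ClusterIP"))
        if "nodePort" in port:
            result.append(("0.0.0.0", port["nodePort"], "NodePort"))
        for ip in spec.get("externalIPs", []):
            result.append((ip, port["port"], "ExternalIPs"))
        for ingress in ingress_list:
            if "ip" in ingress:
                lb_type = "ExternalIPs" if collapse_lb else "LoadBalancer"
                result.append((ingress["ip"], port["port"], lb_type))
    return result


# Trimmed-down version of the mysql-lb-bgp Service from this issue.
mysql_svc = {
    "spec": {
        "clusterIP": "10.43.252.65",
        "ports": [{"name": "mysql", "port": 3306, "nodePort": 32005}],
    },
    "status": {"loadBalancer": {"ingress": [{"ip": "192.168.58.0"}]}},
}

for entry in frontends(mysql_svc, collapse_lb=True):
    print(entry)
```

With collapse_lb=True the 192.168.58.0:3306 frontend is typed ExternalIPs, as in the lb list output earlier in this thread, even though the Service defines no externalIPs.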
> I've got a test cluster ready to go when a test image is ready. I have a pretty extensive set of tests to run through to ensure a bunch of different scenarios work as expected, so I can run through all of those if it's helpful.
@travisghansen Issue is fixed now in docker.io/cilium/cilium:latest and will be part of v1.8.1 tag soon; tested with MetalLB. If you have a chance to run also through your set of tests as mentioned above that would be great. Thanks for reporting!
This makes me so happy! I’m tied up most of the day today but I’ll definitely have some feedback in the next day or so. This is super high priority for me so I really appreciate the help.
@borkmann either I've got something misconfigured or it's still off.
kubectl exec -it -n kube-system cilium-cq8rs -- cilium version
Client: 1.8.90 c00b4b70a 2020-06-24T13:50:41-07:00 go version go1.14.4 linux/amd64
Daemon: 1.8.90 c00b4b70a 2020-06-24T13:50:41-07:00 go version go1.14.4 linux/amd64
kubectl get pods -o wide | grep mysql
mysql-6d45dbbb4-gq4f4 1/1 Running 0 49m 10.42.0.234 test01.lan <none> <none>
kubectl get svc | grep mysql
mysql-lb-bgp LoadBalancer 10.43.252.65 192.168.58.0 3306:32005/TCP 78d
(from cilium on test01.lan)
kubectl exec -it -n kube-system cilium-vm84s -- cilium service list
ID Frontend Service Type Backend
1 10.43.0.10:53 ClusterIP 1 => 10.42.0.222:53
2 => 10.42.4.45:53
2 10.43.0.10:9153 ClusterIP 1 => 10.42.0.222:9153
2 => 10.42.4.45:9153
3 10.43.121.94:9283 ClusterIP 1 => 10.42.3.158:9283
4 10.43.239.70:6789 ClusterIP 1 => 10.42.0.208:6789
5 10.43.239.70:3300 ClusterIP 1 => 10.42.0.208:3300
6 10.43.0.1:443 ClusterIP 1 => 172.29.2.21:6443
7 10.43.215.34:80 ClusterIP 1 => 10.42.2.247:80
8 10.43.100.168:6789 ClusterIP 1 => 10.42.3.177:6789
9 10.43.100.168:3300 ClusterIP 1 => 10.42.3.177:3300
10 10.43.123.162:443 ClusterIP 1 => 10.42.2.54:443
11 10.43.13.198:44134 ClusterIP 1 => 10.42.2.234:44134
12 10.43.8.169:8443 ClusterIP 1 => 10.42.3.158:8443
13 0.0.0.0:31298 NodePort 1 => 10.42.3.158:8443
14 172.29.2.21:31298 NodePort 1 => 10.42.3.158:8443
15 10.43.113.163:3300 ClusterIP 1 => 10.42.1.145:3300
16 10.43.113.163:6789 ClusterIP 1 => 10.42.1.145:6789
17 10.43.104.150:80 ClusterIP
18 10.43.212.93:8443 ClusterIP 1 => 10.42.3.158:8443
19 10.43.120.219:8000 ClusterIP 1 => 10.42.1.23:80
2 => 10.42.2.227:80
3 => 10.42.3.51:80
20 172.29.4.1:8000 LoadBalancer 1 => 10.42.1.23:80
2 => 10.42.2.227:80
3 => 10.42.3.51:80
21 172.29.2.21:32101 NodePort 1 => 10.42.1.23:80
2 => 10.42.2.227:80
3 => 10.42.3.51:80
22 0.0.0.0:32101 NodePort 1 => 10.42.1.23:80
2 => 10.42.2.227:80
3 => 10.42.3.51:80
23 10.43.252.65:3306 ClusterIP 1 => 10.42.0.234:3306
24 192.168.58.0:3306 LoadBalancer 1 => 10.42.0.234:3306
25 172.29.2.21:32005 NodePort 1 => 10.42.0.234:3306
26 0.0.0.0:32005 NodePort 1 => 10.42.0.234:3306
# fails from test01.lan
# works from other nodes in/out of the cluster
mysql -uroot -proot -h 192.168.58.0
# works from test01.lan
mysql -uroot -proot -h 127.0.0.1 -P 32005
I've tried with both dsr and snat.
Actually, I haven't run through the full matrix, but the issue appears to be with Local traffic policy on the svc. When I changed it to Cluster I was able to access from the same node running the Pod. I'll keep digging now that I got that to work..
> Actually, I haven't run through the full matrix, but the issue appears to be with Local traffic policy on the svc. When I changed it to Cluster I was able to access from the same node running the Pod. I'll keep digging now that I got that to work..
That is correct, I haven't changed the semantics of the Local property for in-cluster access for any of the types. Let me revisit that: the current semantics is that, for any of the types, we skip translation if the svc address is not marked as HOST_ID in the ipcache, i.e. it must be a local address. I'll run some comparisons with the expected in-cluster behavior next. (For the time being, please run with Cluster.)
Yeah, I have a bigger matrix to go through still (non-related Pods running on N1 and N2 etc hitting LB IP, Cluster IP, etc) when Local is fully up...but here's what I've tested so far (sorry for the crude syntax, let me know if something doesn't make sense).
dsr + cluster (1 Pod backing service running on N1)
(all work)
N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P
N1 -> SVC Cluster IP -> N1P
N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P
EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)
dsr + local (1 Pod backing service running on N1)
N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P (fail)
N1 -> SVC Cluster IP -> N1P
N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P
EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)
snat + cluster (1 Pod backing service running on N1)
(all work)
N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P
N1 -> SVC Cluster IP -> N1P
N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P
EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)
snat + local (1 Pod backing service running on N1)
N1 -> N1P
N1 -> 127.0.0.1 nodeport
N1 -> LB IP -> N1P (fail)
N1 -> SVC Cluster IP -> N1P
N2 -> N1P
N2 -> 127.0.0.1 nodeport -> N1P (should fail in Local policy but not cluster)
N2 -> N1 nodeport -> N1P
N2 -> LB IP -> N1P
N2 -> SVC Cluster IP -> N1P
EXT -> LB IP -> N1P
EXT -> N1 nodeport -> N1P
EXT -> N2 nodeport -> N1P (should fail in Local policy but not cluster)
Everything behaves as expected (at least as I expect) with the exception of:
dsr or snat + Local + N1 -> LB IP -> N1P
Nice work! Looking forward to the final Local piece!
I'm currently running in dsr mode and mostly bring this up as a talking point. When the connection happens to a node-local pod the source IP is the ip from cilium_host device. When connection occurs to non-node-local pod it's the IP of the nic with the default route (for me anyway). This is using native routing with Cluster policy ATM.
I'm not entirely sure what the behavior should be, mostly just making note in case folks run into policy issues that could stem from the behavior.
Just testing now and looks really good so far. Make note however that (I doubt this is intended behavior but perhaps) the service NodePort works on ALL nodes (even those not running pods) with externalTrafficPolicy=Local.
Note however that external traffic appears to only work to nodes that do have pods running.
Not a big deal to me, but seems a little odd.
> Just testing now and looks really good so far. Make note however that (I doubt this is intended behavior but perhaps) the service NodePort works on ALL nodes (even those not running pods) with externalTrafficPolicy=Local.
> Note however that external traffic appears to only work to nodes that do have pods running.
> Not a big deal to me, but seems a little odd.
Thanks a lot for testing. Intended behavior, meaning:
For requests from outside the cluster, services with externalTrafficPolicy=Local will only select backends that are local to the node in order to allow for client source IP preservation. Meaning, if no backend is on that node, the request is dropped. If there are local backends, then it's load balanced among the local backends.
For requests from inside the cluster, services with externalTrafficPolicy=Local will be able to select all service backends, also those that are remote to the node. Given for in-cluster we always select backends directly from the socket layer, there is no intermediate hop which would have to do SNAT for nodeport. The in-cluster communication for all endpoints also works for kube-proxy (#11746).
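The two cases above can be summarized in a short sketch. This only models the behavior as described in this comment; select_backend and its parameters are hypothetical names, and random choice stands in for whatever load-balancing the datapath actually performs.

```python
import random


def select_backend(backends, local_node, from_outside, policy="Local", rng=random):
    """Pick a backend for a request arriving at `local_node`.

    backends: list of (pod_ip, node_name) tuples.
    Returns the chosen backend, or None if the request would be dropped.
    Assumption: with policy "Local", only external requests are restricted
    to node-local backends; in-cluster requests may use any backend.
    """
    candidates = backends
    if policy == "Local" and from_outside:
        candidates = [b for b in backends if b[1] == local_node]
    if not candidates:
        return None  # no eligible backend: request is dropped
    return rng.choice(candidates)


# Single mysql pod running on N1, as in the test matrix above.
backends = [("10.42.0.234", "N1")]

print(select_backend(backends, "N2", from_outside=True))   # external request to N2
print(select_backend(backends, "N2", from_outside=False))  # in-cluster request on N2
print(select_backend(backends, "N1", from_outside=True))   # external request to N1
```

This is why the NodePort appears to work from every node inside the cluster while external clients only succeed against nodes that run a backing pod.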
@borkmann OK sounds good. I've got another issue (not sure if it's always been doing this or not) but with Cluster policy + dsr I've got a situation where client IP is not being preserved. Going to dig a little bit more but current scenario is: EXT -> N2 NodePort -> N1P
IP is preserved with:
EXT -> N1 NodePort -> N1P
Issue may be broader than NodePort however, I'll do some testing with LB IP and forcing traffic in different paths...
Sample output, run from an external node:
for ip in 192.168.58.0 172.29.2.21:32005 172.29.2.22:32005;do echo $ip; port=${ip#*:}; if [ "${ip}" == "${port}" ];then port="3306"; fi; ip=$(echo "${ip}" | cut -d ":" -f1); mysql -uroot -proot -h "${ip}" -P "${port}" -e "SHOW PROCESSLIST"; done
192.168.58.0
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+--------------------+------+---------+------+-------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+------+--------------------+------+---------+------+-------+------------------+
| 25867 | root | 172.29.10.12:40874 | NULL | Query | 0 | init | SHOW PROCESSLIST |
+-------+------+--------------------+------+---------+------+-------+------------------+
172.29.2.21:32005
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+--------------------+------+---------+------+-------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+------+--------------------+------+---------+------+-------+------------------+
| 25868 | root | 172.29.10.12:58400 | NULL | Query | 0 | init | SHOW PROCESSLIST |
+-------+------+--------------------+------+---------+------+-------+------------------+
172.29.2.22:32005
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------+------+-------------------+------+---------+------+-------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+------+-------------------+------+---------+------+-------+------------------+
| 25869 | root | 172.29.2.22:41960 | NULL | Query | 0 | init | SHOW PROCESSLIST |
+-------+------+-------------------+------+---------+------+-------+------------------+
LB IP is 192.168.58.0
Client IP is 172.29.10.12
172.29.2.21:32005 172.29.2.22:32005 is svc NodePort on 2 different nodes, single backing pod is running only on the former node.
I have my suspicions that if the LB/BGP traffic landed on a node without a backing pod I'd have the same behavior...will test after sleepy sleepy though.
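For anyone adapting the shell loop above, the host/port splitting it does (fall back to 3306 when the target has no ":port" suffix) could look like this in Python. This is just a sketch; split_target is a hypothetical helper name.

```python
def split_target(target: str, default_port: int = 3306):
    """Split "host[:port]" into (host, port), defaulting the port.

    Mirrors the shell loop above: a bare IP such as the LB IP gets the
    mysql port, while "ip:nodeport" targets keep their explicit port.
    Only intended for IPv4 addresses or hostnames, as used in this thread.
    """
    host, sep, port = target.partition(":")
    return host, int(port) if sep else default_port


for target in ("192.168.58.0", "172.29.2.21:32005", "172.29.2.22:32005"):
    host, port = split_target(target)
    # The original loop then runs:
    #   mysql -uroot -proot -h "$host" -P "$port" -e "SHOW PROCESSLIST"
    print(host, port)
```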
Hmm, I currently cannot reproduce this. I have 3 nodes:
apoc: 192.168.178.29
tank: 192.168.178.28
C: 192.168.178.30
The node C is an external client, meaning not managed by Cilium.
# kubectl get pods -n default -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx6-7d4b5d6bdf-sgqpm 1/1 Running 0 90m 10.217.1.138 tank <none> <none>
And I have:
# kubectl get svc nginx6
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx6 NodePort 10.97.128.143 <none> 80:31190/TCP 89m
# kubectl get ep nginx6
NAME ENDPOINTS AGE
nginx6 10.217.1.138:80 89m
So backend is on node tank. Service has externalTrafficPolicy: Cluster.
I'm running the agent on apoc/tank with DSR. Full options I've used:
# ./daemon/cilium-agent --identity-allocation-mode=crd --enable-ipv6=true --enable-ipv4=true --disable-envoy-version-check=true --tunnel=disabled --k8s-kubeconfig-path=$HOME/.kube/config --kube-proxy-replacement=strict --node-port-mode=dsr --enable-l7-proxy=false --auto-direct-node-routes=true --native-routing-cidr=10.217.0.0/16
The node C is doing curl to apoc where it finds that the backend is remote, thus needs DSR:
# curl 192.168.178.29:31190
<!DOCTYPE html>
<html>
[...]
Running tcpdump on node tank I get:
# tcpdump -i any port 80 or port 31190 -n
[...]
13:38:49.643764 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], seq 1174432341, win 64240, options [mss 1460,sackOK,TS val 2665013319 ecr 0,nop,wscale 7], length 0
13:38:49.643800 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], seq 1174432341, win 64240, options [mss 1460,sackOK,TS val 2665013319 ecr 0,nop,wscale 7], length 0
13:38:49.643826 IP 10.217.1.138.80 > 192.168.178.30.51300: Flags [S.], seq 4260698455, ack 1174432342, win 65160, options [mss 1460,sackOK,TS val 3340810749 ecr 2665013319,nop,wscale 7], length 0
13:38:49.643836 IP 192.168.178.29.31190 > 192.168.178.30.51300: Flags [S.], seq 4260698455, ack 1174432342, win 65160, options [mss 1460,sackOK,TS val 3340810749 ecr 2665013319,nop,wscale 7], length 0
13:38:49.646042 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 2665013322 ecr 3340810749], length 0
13:38:49.646050 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 2665013322 ecr 3340810749], length 0
[...]
So client source IP is preserved (192.168.178.30.51300) and tank replies on behalf of apoc (192.168.178.29.31190) in the SYN/ACK.
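The same check can be scripted against `tcpdump -n` output. The sketch below is an assumption-laden helper, not part of any Cilium tooling: parse_tcpdump_line is a hypothetical name, and the regex only handles numeric IPv4 TCP lines in the format shown above.

```python
import re
from typing import NamedTuple, Optional


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    flags: str


# Matches e.g. "IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], ..."
_PATTERN = re.compile(
    r"IP (\d+(?:\.\d+){3})\.(\d+) > (\d+(?:\.\d+){3})\.(\d+): Flags \[([^\]]+)\]"
)


def parse_tcpdump_line(line: str) -> Optional[Packet]:
    """Parse one numeric (-n) tcpdump TCP line; return None if it does not match."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    return Packet(m.group(1), int(m.group(2)), m.group(3), int(m.group(4)), m.group(5))


syn = parse_tcpdump_line(
    "13:38:49.643764 IP 192.168.178.30.51300 > 10.217.1.138.80: Flags [S], seq 1174432341, win 64240, length 0"
)
syn_ack = parse_tcpdump_line(
    "13:38:49.643836 IP 192.168.178.29.31190 > 192.168.178.30.51300: Flags [S.], seq 4260698455, ack 1174432342, win 65160, length 0"
)

# Client source IP preserved on the way in, reply sent on behalf of the
# node that was originally contacted.
print(syn.src_ip, "->", syn.dst_ip)
print(syn_ack.src_ip, syn_ack.src_port, "->", syn_ack.dst_ip)
```

If the SYN seen on the backend node carried the address of the first node instead of the client's, the traffic would have been SNATed rather than handled with DSR.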
Which agent options do you have?
> 172.29.2.22:32005
> mysql: [Warning] Using a password on the command line interface can be insecure.
> +-------+------+-------------------+------+---------+------+-------+------------------+
> | Id | User | Host | db | Command | Time | State | Info |
> +-------+------+-------------------+------+---------+------+-------+------------------+
> | 25869 | root | 172.29.2.22:41960 | NULL | Query | 0 | init | SHOW PROCESSLIST |
> +-------+------+-------------------+------+---------+------+-------+------------------+
Hmm, you're saying that host should have been 172.29.10.12 on the last entry. That node is running the agent with DSR for sure, right?
Do you have some steps to reproduce?
Yeah that’s correct. It’s deployed as daemon set so unless something is off it should be with dsr yes.
I'll kick the agent pods in a bit just to confirm it still is in a bad state.
Sorry above comments about source IP are complete noise...I thought I was running in dsr but was not :(
I'll go through a deeper round of testing with snat + Local now and report observations.