What happened:
When deploying calico against kind I used the following kind config:
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
networking:
  disableDefaultCNI: true
nodes:
- role: control-plane
- role: worker
- role: worker
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  metadata:
    name: config
  networking:
    serviceSubnet: "10.96.0.0/12"
    podSubnet: "192.168.0.0/16"
I then applied the latest Calico manifest from: https://docs.projectcalico.org/latest/getting-started/kubernetes/installation/calico
This results in a crashlooping calico-node pod on each host with the following presented in the log:
2019-09-30 18:38:28.452 [FATAL][42] int_dataplane.go 1037: Kernel's RPF check is set to 'loose'. This would allow endpoints to spoof their IP address. Calico requires net.ipv4.conf.all.rp_filter to be set to 0 or 1. If you require loose RPF and you are not concerned about spoofing, this check can be disabled by setting the IgnoreLooseRPF configuration parameter to 'true'.
This can be worked around by running the following:
kind get nodes --name=kind | xargs -n1 -I {} docker exec {} sysctl -w net.ipv4.conf.all.rp_filter=0
Adjust the --name argument to the name of your cluster, or leave it off for the default cluster name ("kind").
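To verify the change took effect on every node, a read-only variant of the same pipeline should do (again assuming the default cluster name):
kind get nodes --name=kind | xargs -n1 -I {} docker exec {} sysctl net.ipv4.conf.all.rp_filter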
I then looked into when this value was being set.
In the standard bring up this is the configured value:
docker exec -ti kind-control-plane sysctl -a | grep all.rp_filter
net.ipv4.conf.all.rp_filter = 2
which appears to be set by:
/etc/sysctl.d/10-network-security.conf:net.ipv4.conf.default.rp_filter=2
/etc/sysctl.d/10-network-security.conf:net.ipv4.conf.all.rp_filter=2
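For reference, output in that file:match format can be produced with a recursive grep on the node, e.g. something along the lines of:
docker exec kind-control-plane grep -r rp_filter /etc/sysctl.d/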
The Ubuntu default was in turn changed to 2 as a result of this issue:
https://bugs.launchpad.net/ubuntu/+source/procps/+bug/1814262
Among my other findings: the ubuntu:19.04 base image we build from has it set to 1:
11:27 $ docker run -it ubuntu:19.04 sysctl -a | grep all.rp_filter
net.ipv4.conf.all.rp_filter = 1
and a freshly built base image does too:
11:32 $ docker run -it --tmpfs /tmp --tmpfs /run --privileged --entrypoint /bin/bash mauilion/base
root@914dc973cf59:/# sysctl -a | grep rp_filter
net.ipv4.conf.all.rp_filter = 1
root@914dc973cf59:/# exit
although the security file is already present at this point:
root@2f587629b72c:/# cat /etc/sysctl.d/10-network-security.conf
# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks.
net.ipv4.conf.default.rp_filter=2
net.ipv4.conf.all.rp_filter=2
I think what's happening is that the sysctl file is honored when we start up the "real" node image, and that is what causes the problem for Calico.
What you expected to happen:
That rp_filter would be set to 0 or 1, as it is by default in ubuntu:19.04.
How to reproduce it (as minimally and precisely as possible):
This reproduces with most of the recent base images.
Anything else we need to know?:
In my opinion it's safe to set all.rp_filter to a value of 1 explicitly, e.g. with a drop-in sysctl file as sketched below.
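A minimal sketch of that workaround, assuming the default cluster name "kind" (the 99-rp-filter.conf file name is just an illustrative choice that sorts after 10-network-security.conf):
for node in $(kind get nodes --name=kind); do
  docker exec "$node" sh -c 'echo "net.ipv4.conf.all.rp_filter=1" > /etc/sysctl.d/99-rp-filter.conf && sysctl -p /etc/sysctl.d/99-rp-filter.conf'
done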
Environment:
kind version: 0.5.1
kubectl version:
docker info:
/etc/os-release:

Nice catch!
There's a write-up on this by Alex as well; I like his solution too!
https://twitter.com/alexbrand/status/1178768251024760833?s=20
kubectl -n kube-system set env daemonset/calico-node FELIX_IGNORELOOSERPF=true
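For anyone applying the manifest from scratch, the equivalent is to add the variable to the calico-node container's env in calico.yaml (a sketch; the rest of the DaemonSet spec is elided):
env:
  - name: FELIX_IGNORELOOSERPF
    value: "true"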
/assign
Thanks @mauilion!
@mauilion
kubectl -n kube-system set env daemonset/calico-node FELIX_IGNORELOOSERPF=true
It seems to be messing up DNS (I was following your TGIK 075).
While everything works fine with the kind default CNI, Calico gives me issues once I run "kubectl -n kube-system set env daemonset/calico-node FELIX_IGNORELOOSERPF=true":
$ k exec -it nginxd-667bdf4c99-qsbrv -- bash
root@nginxd-667bdf4c99-qsbrv:/# curl google.com
curl: (6) Could not resolve host: google.com
root@nginxd-667bdf4c99-qsbrv:/# nslookup google.com
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find google.com: SERVFAIL
root@nginxd-667bdf4c99-qsbrv:/# exit
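In that situation it's worth checking whether CoreDNS itself is healthy and what calico-node is reporting, e.g. with the standard label selectors used by the stock manifests:
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=calico-node --tail=20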
New images set rp_filter correctly, so you won't have to do this anymore.
@BenTheElder New images created since this bug was fixed include v1.16.1 and v1.16.2. Is it worth patching the v1.15.3 (or older) images, or is that out of scope for kind?
I'll push new images with https://github.com/kubernetes-sigs/kind/milestone/8 which is primarily blocked on rounding out some stability fixes. I'm back on that now.
@BenTheElder Thank you! I wasn't sure if older images would get fixes like this.
There's a write-up on this by Alex as well; I like his solution too!
https://twitter.com/alexbrand/status/1178768251024760833?s=20
kubectl -n kube-system set env daemonset/calico-node FELIX_IGNORELOOSERPF=true
Thanks a lot, man!
This solved my issue; I'd been going crazy for quite a few hours.
After I ran:
kubectl -n kube-system set env daemonset/calico-node FELIX_IGNORELOOSERPF=true
all my calico-node pods started working.
Thanks again!
If you update to kind v0.6.1 and use one of the images from the v0.6 release notes, the rp_filter settings should already be correct.
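For example (the image tag and config file name here are illustrative; take the exact tag and digest from the release notes):
kind create cluster --image kindest/node:v1.16.3 --config kind-config.yaml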