We currently set the aws-node pod to be privileged, but that might not be needed.
Check if CAP_NET_ADMIN and CAP_DAC_OVERRIDE are enough to set the RPF check to loose and to copy the binary and config file:
Unfortunately, having the capabilities limited to ["NET_ADMIN", "DAC_OVERRIDE"] is not enough to write to /proc/sys/net/*. From the logs:
[DEBUG] Setting RPF for primary interface: /proc/sys/net/ipv4/conf/eth0/rp_filter
[ERROR] Failed to set up host networkfailed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system
[ERROR] Initialization failure: ipamd init: failed to set up host network: failed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system
It looks like /proc is mounted with write access only for privileged pods.
There are a few ways to remove the privileged permission from aws-node:
1) Set the rp_filter through an init container.
initContainers:
- command:
  - sh
  - -c
  - sysctl net.ipv4.conf.eth0.rp_filter=2
  image: golang:1.13-stretch
  name: rp_filter_setting
  securityContext:
    privileged: true
2) Set the rp_filter when the EC2 instance starts up. The problem with this approach is that anyone who builds a custom AMI has to make sure this change is added to their AMI build scripts.
3) Enable unsafe sysctls - https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#enabling-unsafe-sysctls but this doesn't help us, because the kubelet rejects the pod when a net.* sysctl is used along with host network: https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/kubelet/sysctl/whitelist.go#L89
This leaves us with option 1 as the most suitable for our use case.
Proposed solution:
1) Add a flag that instructs aws-node pod whether to perform or skip setting rp_filter value for eth0.
2) Add init container as described above and remove "privileged" mode for aws-node.
3) At a later time, deprecate the flag and remove it completely from aws-node.
I feel like we should use src_valid_mark instead: https://github.com/torvalds/linux/commit/28f6aeea3f12d37bd258b2c0d5ba891bff4ec479
BTW, it's also possible to do this without any sysctl setting: mangle traffic from eth0 with a custom TOS value, and add a routing policy rule that makes packets with that TOS value use the main table. It's a bit tricky, though.
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1879
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1660
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1679
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/fib_frontend.c#L315
PR #130, where the code was added, has some good comments.
Nice deep dive @M00nF1sh. I tested both of your suggestions and both seem to work fine.
# with TOS
sudo iptables -A PREROUTING -t mangle -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j TOS --set-tos 0x08
sudo ip rule add pref 1025 tos 0x08 table main
# with MARK
sudo iptables -A PREROUTING -t mangle -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j MARK --set-xmark 0x80/0x80
sudo sysctl net.ipv4.conf.eth0.src_valid_mark=1
I removed the 'if' block at https://github.com/aws/amazon-vpc-cni-k8s/blob/1ee59a06c597c62aa93e85fc1d2eedef553707a3/pkg/networkutils/network.go#L228 and was able to get aws-node running with the following security context.
securityContext:
  capabilities:
    add:
    - NET_ADMIN
We should continue the above investigation, but for context I note we read-write mount the host's /var/log (can rewrite logs) and /var/run/docker{,shim}.sock (host root equiv) into the aws-node container. We will also need to drop/reduce these hostPath volumes somehow before privileged=false has any meaning.
For now, I'm thinking of checking whether we have write access to the net.ipv4.conf.eth0.rp_filter file: if we do, update rp_filter; otherwise, leave it alone. With this we don't have to introduce another env variable and make users decide whether to perform this operation. It would also simplify the update experience for both kinds of updates that users perform: editing only the aws-node DaemonSet version number, and applying the complete manifest.
Resolved by adding the init container in #955
This was closed prematurely, so just reopening so we don't lose the remaining action item raised earlier.
Remaining action item: we moved the privileged=true Kubernetes option to an earlier init container, but we still expose the CRI socket to the persistent aws-node container. This allows the aws-node container to trivially (e.g.) start a new privileged container, so the aws-node pod remains "privileged" in every practical sense.
(For tracking: Write access to all of /var/log was removed in #987, and the docker socket was removed in #1075)
Per https://github.com/aws/containers-roadmap/issues/1048#issuecomment-739128296, the aws-node Pod needs NET_RAW as well as NET_ADMIN. It's currently undeclared in the Daemonset, but because it's one of the default capabilities added by the Docker runtime, its absence is not noticed until a PodSecurityPolicy tries to take it away.
The symptom observed is:
daemonset pods get scheduled but fail. On top of that the aws-k8s-agent fails out silently.