Amazon-vpc-cni-k8s: Run aws-node as unprivileged pod

Created on 17 Jan 2020  ·  10 Comments  ·  Source: aws/amazon-vpc-cni-k8s

We currently set the aws-node pod to be privileged, but that might not be needed.

Check whether CAP_NET_ADMIN and CAP_DAC_OVERRIDE are enough to set the RPF check to loose and to copy the binary and config file:

https://github.com/aws/amazon-vpc-cni-k8s/blob/1ee59a06c597c62aa93e85fc1d2eedef553707a3/pkg/networkutils/network.go#L233-L246

2.x CNI plugin enhancement

All 10 comments

Unfortunately, having the capabilities limited to ["NET_ADMIN", "DAC_OVERRIDE"] is not enough to write to /proc/sys/net/*. From the logs:

[DEBUG] Setting RPF for primary interface: /proc/sys/net/ipv4/conf/eth0/rp_filter
[ERROR] Failed to set up host networkfailed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system
[ERROR] Initialization failure: ipamd init: failed to set up host network: failed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system

It looks like /proc is mounted with write access only for privileged pods.
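One way to confirm this from inside a container is to look at the mount options for /proc/sys. The sketch below is only an illustration: the helper name `proc_sys_mount_mode` is made up, and it assumes the container runtime adds a separate read-only mount entry for /proc/sys in unprivileged containers (Docker's default behavior, as far as I know). It reads a table in /proc/mounts format and prints "ro", "rw", or "unknown".

```shell
# Hypothetical helper: report whether /proc/sys is mounted "ro" or "rw".
# Reads a mounts table in /proc/mounts format (default: /proc/self/mounts).
proc_sys_mount_mode() {
    mounts_file="${1:-/proc/self/mounts}"
    awk '
        # Field 2 is the mount point, field 4 the comma-separated options.
        $2 == "/proc" || $2 == "/proc/sys" {
            # Prefer the /proc/sys entry; otherwise fall back to /proc.
            if ($2 == "/proc/sys" || best != "/proc/sys") { best = $2; opts = $4 }
        }
        END {
            if (best == "") { print "unknown"; exit }
            n = split(opts, o, ",")
            mode = "rw"
            for (i = 1; i <= n; i++) if (o[i] == "ro") mode = "ro"
            print mode
        }
    ' "$mounts_file"
}
```

Running `proc_sys_mount_mode` with no argument inside the aws-node container should print "ro" when the RPF write would fail with the "read-only file system" error shown in the logs.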

There are a few ways to remove the privileged permission from aws-node:

1) Set rp_filter through an init container.

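    initContainers:
      # Privileged init container: runs once before aws-node starts, so the
      # long-running aws-node container no longer needs privileged mode.
      - command:
        - sh
        - -c
        # rp_filter=2 selects loose reverse-path filtering on eth0.
        - sysctl net.ipv4.conf.eth0.rp_filter=2
        image: golang:1.13-stretch
        # Container names must be DNS labels, so no underscores.
        name: rp-filter-setting
        securityContext:
          privileged: true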

2) Set rp_filter when the EC2 instance starts up. The problem with this approach is that anyone who builds a custom AMI has to make sure this change is added to their AMI build scripts.
3) Enable unsafe sysctls (https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#enabling-unsafe-sysctls). This doesn't help us, because the kubelet rejects the pod when a net.* sysctl is used together with host networking: https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/kubelet/sysctl/whitelist.go#L89

This leaves option 1 as the most suitable for our use case.

Proposed solution:
1) Add a flag that tells the aws-node pod whether to set the rp_filter value for eth0 or skip it.
2) Add the init container described above and remove "privileged" mode from aws-node.
3) At a later time, deprecate the flag and remove it completely from aws-node.
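The flag in step 1 could look roughly like the following. This is a sketch only: the variable name `CONFIGURE_RPFILTER` is invented for illustration, and aws-node itself is written in Go, so shell is used here purely for brevity. The default is "true" so that deployments that never set the flag keep today's behavior.

```shell
# Hypothetical flag: CONFIGURE_RPFILTER (not a name taken from this thread).
# Returns 0 when aws-node should set rp_filter itself, 1 when it should skip
# the step because the init container has already done it.
should_configure_rpfilter() {
    case "${CONFIGURE_RPFILTER:-true}" in
        true|TRUE|True|1) return 0 ;;
        false|FALSE|False|0) return 1 ;;
        *)
            # Unrecognized values fall back to the existing behavior.
            echo "invalid CONFIGURE_RPFILTER value, defaulting to true" >&2
            return 0
            ;;
    esac
}
```

A manifest that adds the init container would set the flag to "false" on the aws-node container; once every supported manifest ships the init container, the flag can be deprecated as described in step 3.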

I feel like we should use src_valid_mark instead: https://github.com/torvalds/linux/commit/28f6aeea3f12d37bd258b2c0d5ba891bff4ec479

BTW, it's indeed possible to work around this without any sysctl setting: mangle traffic from eth0 with a custom TOS value, and add a routing policy rule that sends that TOS value to the main table. It's a bit tricky, though.

https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1879
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1660
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1679
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/fib_frontend.c#L315

PR #130, where the code was added, has some good comments.

Nice deep dive, @M00nF1sh. I tested both of your suggestions and both seem to work fine.

    # with TOS
    sudo iptables -A PREROUTING -t mangle -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j TOS --set-tos 0x08
    sudo ip rule add pref 1025 tos 0x08 table main

    # with MARK
    sudo iptables -A PREROUTING -t mangle -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j MARK --set-xmark 0x80/0x80
    sudo sysctl net.ipv4.conf.eth0.src_valid_mark=1

I removed the 'if' block at https://github.com/aws/amazon-vpc-cni-k8s/blob/1ee59a06c597c62aa93e85fc1d2eedef553707a3/pkg/networkutils/network.go#L228 and was able to get aws-node running with the following security context.

    securityContext:
      capabilities:
        add:
        - NET_ADMIN

We should continue the above investigation, but for context, note that we mount the host's /var/log (can rewrite logs) and /var/run/docker{,shim}.sock (equivalent to host root) read-write into the aws-node container. We will also need to drop or reduce these hostPath volumes somehow before privileged=false has any meaning.

For now, I'm thinking of checking whether we have write access to the net.ipv4.conf.eth0.rp_filter file, updating rp_filter if we do and skipping the update otherwise. With this we don't have to introduce another environment variable for users to decide whether to perform this operation. This would simplify the user experience for updates, and it covers both ways users update: just editing the aws-node DaemonSet version number, or applying the complete manifest.
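The idea can be sketched as below. The helper name `set_sysctl_if_writable` is hypothetical and the real change would live in the Go code in network.go; the sketch also assumes that the shell's `[ -w ]` test reports a read-only /proc/sys mount as not writable, which is the usual behavior on Linux.

```shell
# Hypothetical helper: write a sysctl value only when its file is writable.
# Returns 0 if the value was written or the write was skipped, 1 if the file
# does not exist.
set_sysctl_if_writable() {
    path="$1"
    value="$2"
    if [ ! -e "$path" ]; then
        echo "sysctl file not found: $path" >&2
        return 1
    fi
    if [ ! -w "$path" ]; then
        # Typically an unprivileged container with read-only /proc/sys:
        # leave the value as the init container or the AMI configured it.
        echo "skipping $path: not writable" >&2
        return 0
    fi
    printf '%s\n' "$value" > "$path"
}
```

For the case in this issue the call would be `set_sysctl_if_writable /proc/sys/net/ipv4/conf/eth0/rp_filter 2`. One trade-off of skipping silently is that a node where nothing else set rp_filter would keep its default value, so logging the skip matters.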

Resolved by adding the init container in #955

This was closed prematurely, so just reopening so we don't lose the remaining action item raised earlier.

#955 moved the literal privileged=true Kubernetes option to an earlier init container, but we still expose the CRI socket to the aws-node persistent container. This allows the aws-node container to trivially (e.g.) start a new privileged container, so the aws-node pod remains "privileged" in every practical sense.

Remaining action item:

  • [ ] Remove CRI socket from aws-node container (or equivalent docker/containerd socket)

(For tracking: Write access to all of /var/log was removed in #987, and the docker socket was removed in #1075)

Per https://github.com/aws/containers-roadmap/issues/1048#issuecomment-739128296, the aws-node Pod needs NET_RAW as well as NET_ADMIN. It's currently undeclared in the DaemonSet, but because it's one of the default capabilities added by the Docker runtime, its absence is not noticed until a PodSecurityPolicy tries to take it away.

The symptom observed is:

daemonset pods get scheduled but fail. On top of that the aws-k8s-agent fails out silently.
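Because the failure is silent, it can help to check the effective capabilities of the running process directly. The following is a minimal sketch, not part of the CNI: the helper name `has_effective_cap` is made up, and it parses the `CapEff` line of a /proc/<pid>/status-format file. The capability numbers are the ones defined in linux/capability.h.

```shell
# Hypothetical helper: succeed if capability number $2 is set in the CapEff
# mask of a /proc/<pid>/status-format file ($1). Returns 2 if no CapEff line
# is found.
has_effective_cap() {
    status_file="$1"
    cap_number="$2"
    mask=$(awk '$1 == "CapEff:" { print $2 }' "$status_file")
    [ -n "$mask" ] || return 2
    # CapEff is a hexadecimal bitmask; bit N corresponds to capability N.
    [ $(( (0x$mask >> cap_number) & 1 )) -eq 1 ]
}

# Capability numbers from linux/capability.h.
CAP_NET_ADMIN=12
CAP_NET_RAW=13
```

Inside the aws-node container, `has_effective_cap /proc/self/status "$CAP_NET_RAW"` failing would point to a policy having dropped NET_RAW.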
