K3s: [Question] Which Kernel Capabilities does k3s need?

Created on 13 Aug 2020 · 11 Comments · Source: k3s-io/k3s

Hi there,

While developing k3d, I keep hitting the same obstacle: we have to run the k3s containers in privileged mode.
This blocks us from, e.g., making use of resource limits set at the node level (privileged mode renders cgroup limits useless).

Has anyone done an analysis at some point of which capabilities k3s actually needs to run?
I guess most of them would be requested by containerd, right?

I'm pretty sure we won't get to the point where we can use memory limits in k3d, but it would still be very interesting to know whether we could limit the set of capabilities given to the k3s containers when running in security-sensitive environments.

Done

All 11 comments

Rootless support is still experimental but it might be worth trying to see if it works. If rootless does work I think it could sidestep the underlying issue that you are trying to solve.

Hello @dweomer,
I ran some tests: to use rootless mode with k3s, you still need to run a privileged Docker container.
Without it you get this error: failed to start the child: fork/exec /proc/self/exe: operation not permitted.

I'm starting to add some capabilities, but I'm afraid it's going to end up like working with --privileged.

If you want to test it I have a docker image for k3s v1.17.0 with uidmap and a user. I will put the Dockerfile at the end of this message.

Tests:
OK (with privileged)

docker run --rm -it --privileged -p 6443:6443 -p 10080:10080  louiznk/k3s:rootless server --rootless 
... normal trace and no crash ...

KO (without privileged)

docker run --rm -it -p 6443:6443 -p 10080:10080  louiznk/k3s:rootless server --rootless 
... failed to start the child: fork/exec /proc/self/exe: operation not permitted

KO (with security options unconfined)

docker run --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined  louiznk/k3s:rootless server --rootless
... failed to setup UID/GID map: newuidmap 23 [0 1001 1 1 200000 65536] failed: newuidmap: write to uid_map failed: Operation not permitted

KO (with SYS_ADMIN added)

docker run --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined  --cap-add SYS_ADMIN  louiznk/k3s:rootless server --rootless
open: No such file or directory
... failed to setup network &{binary:slirp4netns mtu:65520 ipnet:0xc000e370b0 disableHostLoopback:true apiSocketPath:}: setting up tap tap0: executing [[nsenter -t 24 -n -m -U --preserve-credentials ip tuntap add name tap0 mode tap] [nsenter -t 24 -n -m -U --preserve-credentials ip link set tap0 up]]: exit status 1 

The Dockerfile for building the test image louiznk/k3s:rootless

FROM alpine:3.12 AS uidmap
RUN apk -u --no-cache add shadow-uidmap

## k3s with uidmap binaries
FROM rancher/k3s:v1.17.0-k3s.1 AS assembly

COPY --from=uidmap /etc/passwd /etc/group /etc/shadow /etc/subgid /etc/subuid /etc/
COPY --from=uidmap /usr/bin/newgidmap /usr/bin/newuidmap /usr/bin/
COPY --from=uidmap /lib/ld-musl-x86_64.so.1 /lib/
RUN mkdir -p /var/lib/rancher/k3s 
RUN mkdir -p /output

## dest with k3s user
FROM scratch

COPY --from=assembly / /

RUN adduser -h /var/lib/rancher/k3s -g k3s -s /bin/false -D -u 1001 -G root k3s \
        && chown k3s:root /var/lib/rancher -Rv \
        && chown k3s:root /output -Rv \
        && echo k3s:200000:65536 >> /etc/subuid \
        && echo k3s:200000:65536 >> /etc/subgid



USER k3s:root

VOLUME /var/lib/kubelet
VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
VOLUME /var/log

ENV PATH="$PATH:/bin/aux"

ENTRYPOINT ["/bin/k3s"]
CMD ["agent"]

Hello @iwilltry42,
It finally starts with these capabilities and these device access rights (rw and mknod):

docker run --device=/dev/net/tun --device=/dev/kmsg --rm -it -p 6443:6443 -p 10080:10080  --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined  --cap-add SYS_ADMIN  louiznk/k3s:rootless server --rootless

Capabilities and security options:
--security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined --cap-add SYS_ADMIN
Devices (not sure this is portable to every system):
--device=/dev/net/tun --device=/dev/kmsg

If I understand correctly:

  • /dev/net/tun is needed for tap0, which is used by slirp4netns
  • /dev/kmsg is used by the OOM watcher

I will keep investigating to see whether k3s actually works (curl against the API server and Traefik are OK).

> --security-opt systempaths=unconfined --cap-add SYS_ADMIN

These are equivalent to --privileged. I'd rather suggest just setting --privileged for simplicity.

Thanks a lot for your help @AkihiroSuda. Unfortunately, I would prefer to run without --privileged (apparently, when you run with --privileged, the memory constraints you apply to the Docker container are ignored by k3s).

So I need a more restrictive set of capabilities. I tried replacing the SYS_ADMIN capability with SETUID + SETGID, but with that I still get the uidmap error:
failed to setup UID/GID map: newuidmap 22 [0 1001 1 1 200000 65536] failed: newuidmap: write to uid_map failed: Operation not permitted
Do you have an idea or a suggestion?
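
When swapping capabilities in and out like this, it can help to check which ones the process inside the container actually holds. The following is only a minimal sketch, not something from this thread: it decodes the CapEff bitmask from /proc/<pid>/status, using bit numbers from the Linux capability header. The function names and the sample mask are illustrative assumptions; the sample mask is the value commonly seen for Docker's default capability set.

```python
# Subset of Linux capability bit numbers (from <linux/capability.h>).
# Only the capabilities discussed in this thread are listed.
CAP_BITS = {
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_ADMIN": 21,
    "CAP_MKNOD": 27,
}


def decode_caps(hex_mask):
    """Return the names in CAP_BITS whose bit is set in a hex capability mask."""
    mask = int(hex_mask, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def effective_caps(status_path="/proc/self/status"):
    """Decode the CapEff line of a /proc/<pid>/status file (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    raise ValueError("no CapEff line in " + status_path)


# Illustrative mask (assumption: a default, non-privileged Docker container).
sample = decode_caps("00000000a80425fb")
```

With that sample mask, SETUID and SETGID are already present while SYS_ADMIN is not, so adding SETUID and SETGID explicitly may not change anything. Calling effective_caps() inside the container shows the real set.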

The newuidmap error can probably be avoided by compiling newuidmap with libcap: https://github.com/moby/buildkit/blob/7f42dbf9b41c0de89c744823054ab8e7c4020c68/Dockerfile#L26

But anyway, I don't see much benefit in choosing systempaths=unconfined instead of --privileged.
A process running as root in a container with systempaths=unconfined can easily break the container via procfs and sysfs.

> The newuidmap error can probably be avoided by compiling newuidmap with libcap https://github.com/moby/buildkit/blob/7f42dbf9b41c0de89c744823054ab8e7c4020c68/Dockerfile#L26

Thanks, I will try to build shadow-uidmap with libcap.

> But anyway, I don't see much benefit in choosing systempaths=unconfined instead of --privileged.
> A process running as root in a container with systempaths=unconfined can easily break the container via procfs and sysfs.

It may sound strange, but my aim is not to make the container more secure. I want to limit k3s's "view of the available resources" to the container's resources. Perhaps I'm going about this the wrong way.

Let me try to explain what I mean by "limit the view of the available resources":
I want the limits (CPU and memory) set on the container running k3s (with the flag --memory ...) to be the reference for k3s, but the reference k3s actually uses is the host system, not the container. So if I run a k3s cluster in Docker (with k3d), I get a wrong view of the available resources: every container reports the host's resources, so a cluster with 1 server and 2 agents thinks it has three times the CPU and memory that the host really has.
Maybe an example is clearer than my explanation: a cluster with 1 server and 2 agents, running in 3 containers each limited to 512MiB:

$ docker stats --no-stream
CONTAINER ID        NAME                       CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
8b8ccc4e7383        k3d-memlimit-agent-1       4.07%               155MiB / 512MiB       30.27%              29.2MB / 453kB      299kB / 14.1MB      158
60be6633c344        k3d-memlimit-agent-0       9.33%               119.5MiB / 512MiB     23.34%              16.1MB / 424kB      528kB / 21.1MB      113
8979c47c1293        k3d-memlimit-server-0      21.92%              510.1MiB / 512MiB     99.63%              75.1MB / 2.09MB     16MB / 54.1MB       130
...

But for Kubernetes, the total memory available is 3 times the system memory:

$ kubectl top node                                                                                                     
NAME                    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
k3d-memlimit-agent-0    39m          0%     124Mi           0%        
k3d-memlimit-agent-1    43m          0%     150Mi           0%        
k3d-memlimit-server-0   148m         1%     474Mi           1%     

$ kubectl describe node k3d-memlimit-server-0
....
Capacity:
...
  memory:             32493528Ki
...
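
To put numbers on the mismatch, here is a small worked calculation. It is only a sketch: it uses the 512MiB limit and the Capacity value shown above, and assumes that both agents report the same capacity as the server (only the server's value is shown in the output).

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

nodes = 3                            # 1 server + 2 agents
limit_per_node = 512 * MIB           # limit shown by `docker stats`
reported_per_node = 32493528 * KIB   # Capacity.memory of k3d-memlimit-server-0
                                     # (assumed identical on both agents)

actual_total = nodes * limit_per_node
reported_total = nodes * reported_per_node

print(f"enforced by Docker: {actual_total / GIB:.1f} GiB")
print(f"seen by Kubernetes: {reported_total / GIB:.1f} GiB")
print(f"overestimate:       {reported_total / actual_total:.0f}x")
```

Under that assumption, the scheduler believes it has roughly 62 times more memory than Docker will actually let the three containers use.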

This is with a standard k3s. I then tried running it rootless, with the same result (run directly with Docker, without k3d, and for this test only 1 server limited to 1024MiB):
```
$ docker run --device=/dev/net/tun --device=/dev/kmsg --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined --cap-add SYS_ADMIN -m 1024m louiznk/k3s:rootless server --rootless
....
$ docker stats --no-stream
CONTAINER ID        NAME                       CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
3ba626186bf2        xenodochial_khayyam        65.03%              387.3MiB / 1GiB       37.82%              23.2kB / 4.2kB      0B / 1.57MB         114

# inside the container (docker exec ...)

$ kubectl get node ...
....
Capacity:
....
  memory: 32486464Ki
....
...
```

Perhaps this is the wrong approach for my need:
It looks like the memory constraint isn't known to k3s because k3s doesn't know it is running in a container, and it uses something like /proc/meminfo instead of /sys/fs/cgroup/memory/memory.limit_in_bytes (which, as I just read and tested, is what Docker uses with cgroups to limit memory).

I hope you didn't waste your time on this. Thanks for your help.

PS: Just for information, I tried running the container with shadow-uidmap built with libcap, with the same result (error failed to setup UID/GID map:...)

> Perhaps this is the wrong approach for my need:
> It looks like the memory constraint isn't known to k3s because k3s doesn't know it is running in a container, and it uses something like /proc/meminfo instead of /sys/fs/cgroup/memory/memory.limit_in_bytes (which, as I just read and tested, is what Docker uses with cgroups to limit memory).

This appears to be the cause: https://github.com/rancher/k3s/blob/master/vendor/github.com/google/cadvisor/machine/machine.go#L128
cAdvisor is used by the kubelet to get resource information (for the node and the containers), see https://github.com/rancher/k3s/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L917 (and many other places in the code).
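
The idea of making the memory detection container-aware can be sketched as follows. This is not cAdvisor's code and not the actual patch, just a minimal illustration of taking the smaller of MemTotal from /proc/meminfo and the cgroup limit. The function names are hypothetical. The cgroup v1 path is the one mentioned above; the handling of "max", which cgroup v2 writes in memory.max when no limit is set, is an addition not discussed in this thread.

```python
def parse_mem_total(meminfo_text):
    """Return MemTotal in bytes from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:       32493528 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def parse_cgroup_limit(limit_text):
    """Return the cgroup memory limit in bytes, or None if unlimited."""
    value = limit_text.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    return int(value)


def effective_memory(meminfo_text, limit_text):
    """Return the smaller of host memory and the cgroup memory limit."""
    total = parse_mem_total(meminfo_text)
    limit = parse_cgroup_limit(limit_text)
    # cgroup v1 reports "no limit" as a huge number, which also ends up here.
    if limit is None or limit > total:
        return total
    return limit


def node_memory(meminfo="/proc/meminfo",
                limit="/sys/fs/cgroup/memory/memory.limit_in_bytes"):
    """Read both files and return the effective memory in bytes (cgroup v1)."""
    with open(meminfo) as m, open(limit) as l:
        return effective_memory(m.read(), l.read())
```

Inside a container started with -m 512m, node_memory() should return the 512MiB limit rather than the host's MemTotal, provided the cgroup v1 file is visible in the container.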

Hi @louiznk & @AkihiroSuda, thanks for looking into this and going down this rabbit hole! :)
However, it seems that even though rootless mode in k3s is maturing, we won't be able to get rid of privileged mode in k3d...
@louiznk I like your approach of making cAdvisor aware of its containerized environment, though, and I see that even if they don't accept your PR, you can make it work with a k3s fork that includes the customized cAdvisor. Wish you the best of luck and success with this :+1:
Unfortunately, it seems that we have to close this issue as unsolvable, though :confused:
Thanks again!

Thanks @iwilltry42 and @AkihiroSuda for your time and explanations, I learned a lot 🙏
