Hi there,
While developing k3d, I keep running into the same obstacle: the k3s containers have to run in privileged mode.
This blocks us from e.g. making use of resource limits set at the node level (privileged mode renders cgroup limits useless).
Has anyone ever analyzed which capabilities k3s actually needs to run?
I guess most of them would be required by containerd, right?
I'm pretty sure that we won't get to the point where we can use memory limits in k3d, but it would still be very interesting to know whether we could limit the set of capabilities given to the k3s containers when running in security-sensitive environments.
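As a starting point for that kind of analysis, the effective capability set of a process can be read from the CapEff line of /proc/<pid>/status and decoded into names. The following is a minimal Python sketch, not something k3s or k3d ships: the helper names `decode_caps`, `effective_caps_from_status` and `current_process_caps` are made up for illustration, and the bit numbering is assumed to follow the Linux capability constants (CAP_CHOWN = 0 through CAP_CHECKPOINT_RESTORE = 40).

```python
from pathlib import Path
from typing import List

# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask: int) -> List[str]:
    """Return the names of all capability bits set in the mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def effective_caps_from_status(status_text: str) -> List[str]:
    """Decode the CapEff line from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    raise ValueError("CapEff line not found")


def current_process_caps() -> List[str]:
    """Convenience wrapper for the calling process (Linux only)."""
    return effective_caps_from_status(Path("/proc/self/status").read_text())
```

Running this inside a privileged and an unprivileged container and comparing the two lists would show what --privileged adds. Finding the minimal set k3s needs would still mean testing capabilities one by one.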
Rootless support is still experimental but it might be worth trying to see if it works. If rootless does work I think it could sidestep the underlying issue that you are trying to solve.
Hello @dweomer ,
I ran some tests, and to use rootless mode with k3s you still need to run a privileged Docker container.
Without it you get this error: failed to start the child: fork/exec /proc/self/exe: operation not permitted.
I'm starting to add capabilities one by one, but I'm afraid the result will be much the same as running with --privileged.
If you want to test it, I have a Docker image for k3s v1.17.0 with uidmap and a user. The Dockerfile is at the end of this message.
Tests:
OK (with --privileged)
docker run --rm -it --privileged -p 6443:6443 -p 10080:10080 louiznk/k3s:rootless server --rootless
... normal trace and no crash ...
KO (without --privileged)
docker run --rm -it -p 6443:6443 -p 10080:10080 louiznk/k3s:rootless server --rootless
... failed to start the child: fork/exec /proc/self/exe: operation not permitted
KO (with unconfined security options)
docker run --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined louiznk/k3s:rootless server --rootless
... failed to setup UID/GID map: newuidmap 23 [0 1001 1 1 200000 65536] failed: newuidmap: write to uid_map failed: Operation not permitted
KO (with unconfined security options and SYS_ADMIN)
docker run --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined --cap-add SYS_ADMIN louiznk/k3s:rootless server --rootless
open: No such file or directory
... failed to setup network &{binary:slirp4netns mtu:65520 ipnet:0xc000e370b0 disableHostLoopback:true apiSocketPath:}: setting up tap tap0: executing [[nsenter -t 24 -n -m -U --preserve-credentials ip tuntap add name tap0 mode tap] [nsenter -t 24 -n -m -U --preserve-credentials ip link set tap0 up]]: exit status 1
The Dockerfile for building the test image louiznk/k3s:rootless:
FROM alpine:3.12 AS uidmap
RUN apk -u --no-cache add shadow-uidmap
## k3s with uidmap binaries
FROM rancher/k3s:v1.17.0-k3s.1 AS assembly
COPY --from=uidmap /etc/passwd /etc/group /etc/shadow /etc/subgid /etc/subuid /etc/
COPY --from=uidmap /usr/bin/newgidmap /usr/bin/newuidmap /usr/bin/
COPY --from=uidmap /lib/ld-musl-x86_64.so.1 /lib/
RUN mkdir -p /var/lib/rancher/k3s
RUN mkdir -p /output
## dest with k3s user
FROM scratch
COPY --from=assembly / /
RUN adduser -h /var/lib/rancher/k3s -g k3s -s /bin/false -D -u 1001 -G root k3s \
&& chown k3s:root /var/lib/rancher -Rv \
&& chown k3s:root /output -Rv \
&& echo k3s:200000:65536 >> /etc/subuid \
&& echo k3s:200000:65536 >> /etc/subgid
USER k3s:root
VOLUME /var/lib/kubelet
VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
VOLUME /var/log
ENV PATH="$PATH:/bin/aux"
ENTRYPOINT ["/bin/k3s"]
CMD ["agent"]
Hello @iwilltry42
It finally starts with the following capabilities and device access rights (rw and mknod):
docker run --device=/dev/net/tun --device=/dev/kmsg --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined --cap-add SYS_ADMIN louiznk/k3s:rootless server --rootless
Capabilities :
--security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined --cap-add SYS_ADMIN
Devices (not sure this is portable to every system):
--device=/dev/net/tun --device=/dev/kmsg
If I well understand
I will keep investigating to check whether k3s really works this way (curl against the API server and Traefik are OK so far).
--security-opt systempaths=unconfined --cap-add SYS_ADMIN
These are equivalent to --privileged. I'd rather suggest just setting --privileged for simplicity.
Thanks a lot for your help @AkihiroSuda. Unfortunately, I would prefer to run without --privileged (apparently, when you run privileged, the memory constraints applied to the Docker container are ignored by k3s).
So I need a more restrictive set of capabilities. I tried replacing the SYS_ADMIN capability with SETUID + SETGID, but with that I still get the uidmap error:
failed to setup UID/GID map: newuidmap 22 [0 1001 1 1 200000 65536] failed: newuidmap: write to uid_map failed: Operation not permitted
Do you have any idea or suggestion?
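For readers unfamiliar with the numbers in that error: newuidmap takes the target PID followed by triples of (ID inside the namespace, ID outside the namespace, count). The small Python sketch below decodes the triples from the message above. The function names `parse_id_map` and `to_host_id` are hypothetical helpers for illustration only and are not part of shadow-uidmap or k3s.

```python
from typing import List, Optional, Tuple

IdRange = Tuple[int, int, int]  # (inner_start, outer_start, count)


def parse_id_map(args: List[int]) -> List[IdRange]:
    """Group the flat newuidmap argument list into (inner, outer, count) triples."""
    if len(args) % 3 != 0:
        raise ValueError("expected a multiple of three values")
    return [(args[i], args[i + 1], args[i + 2]) for i in range(0, len(args), 3)]


def to_host_id(mapping: List[IdRange], inner_id: int) -> Optional[int]:
    """Translate an ID inside the user namespace to the ID outside, if mapped."""
    for inner_start, outer_start, count in mapping:
        if inner_start <= inner_id < inner_start + count:
            return outer_start + (inner_id - inner_start)
    return None  # the ID is not covered by any range


# Values taken from the error message: [0 1001 1 1 200000 65536]
mapping = parse_id_map([0, 1001, 1, 1, 200000, 65536])
print(to_host_id(mapping, 0))  # root inside the namespace is the k3s user (1001)
print(to_host_id(mapping, 1))  # first subordinate UID from /etc/subuid (200000)
```

So the requested mapping matches the k3s:200000:65536 entries written to /etc/subuid and /etc/subgid in the Dockerfile. It is the write to uid_map itself that is refused.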
The newuidmap error can probably be avoided by compiling newuidmap with libcap https://github.com/moby/buildkit/blob/7f42dbf9b41c0de89c744823054ab8e7c4020c68/Dockerfile#L26
But anyway I don't see much benefit in choosing systempaths=unconfined instead of --privileged.
A process running as the root in a container with systempaths=unconfined can easily break the container via procfs and sysfs.
Thanks, I will try building shadow-uidmap with libcap.
It may sound strange, but my aim is not to make the container more secure. I want to limit the "view of the resources available" to k3s to the container's resources. Perhaps I am going about this the wrong way.
Let me try to explain "limit the view of the resources available":
I want the limits (CPU and memory) set on the container running k3s (for example with the --memory flag) to be what k3s treats as its available resources, but k3s takes the host system as its reference, not the container. So if I run a k3s cluster in Docker (with k3d), Kubernetes gets a wrong view of the available resources: every node container reports the full system resources, so a cluster with 1 server and 2 agents thinks it has 3 times the CPU and memory that the host actually has.
Maybe an example is clearer than my explanation: a cluster with 1 server and 2 agents, running in 3 containers each limited to 512MiB
$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
8b8ccc4e7383 k3d-memlimit-agent-1 4.07% 155MiB / 512MiB 30.27% 29.2MB / 453kB 299kB / 14.1MB 158
60be6633c344 k3d-memlimit-agent-0 9.33% 119.5MiB / 512MiB 23.34% 16.1MB / 424kB 528kB / 21.1MB 113
8979c47c1293 k3d-memlimit-server-0 21.92% 510.1MiB / 512MiB 99.63% 75.1MB / 2.09MB 16MB / 54.1MB 130
...
But for Kubernetes, the total memory available is 3 times the system memory:
$ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
k3d-memlimit-agent-0 39m 0% 124Mi 0%
k3d-memlimit-agent-1 43m 0% 150Mi 0%
k3d-memlimit-server-0 148m 1% 474Mi 1%
$ kubectl describe node k3d-memlimit-server-0
....
Capacity:
...
memory: 32493528Ki
...
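To put numbers on the mismatch, here is a small Python calculation using only the figures from the outputs above (the 32493528Ki capacity reported for the server node, assumed to be the same on all three nodes since they share one host, and the 512MiB limit per container). It is only illustrative arithmetic.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# "memory: 32493528Ki" from kubectl describe node (host memory, seen by each node)
reported_per_node = 32493528 * KIB
# Limit actually enforced on each container according to docker stats
limit_per_node = 512 * MIB
nodes = 3  # 1 server + 2 agents

advertised_total = reported_per_node * nodes
enforced_total = limit_per_node * nodes
factor = advertised_total / enforced_total

print(f"advertised to the scheduler: {advertised_total / GIB:.1f} GiB")
print(f"actually enforced by Docker: {enforced_total / GIB:.1f} GiB")
print(f"over-reporting factor: {factor:.0f}x")
```

In other words, the scheduler believes it has roughly 93 GiB to place pods on, while the three containers together are capped at 1.5 GiB.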
The results above are with a standard k3s. So I tried running it rootless, with the same result (run directly with Docker, without k3d, and for this test only 1 server limited to 1024 MiB):
```
$ docker run --device=/dev/net/tun --device=/dev/kmsg --rm -it -p 6443:6443 -p 10080:10080 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --security-opt systempaths=unconfined --cap-add SYS_ADMIN -m 1024m louiznk/k3s:rootless server --rootless
....
$ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
3ba626186bf2 xenodochial_khayyam 65.03% 387.3MiB / 1GiB 37.82% 23.2kB / 4.2kB 0B / 1.57MB 114
$ kubectl get node ...
....
Capacity:
....
memory: 32486464Ki
....
...
```
Perhaps this is the wrong approach for my needs:
It looks like the memory constraint isn't known to k3s because k3s doesn't know it is running in a container and uses something like /proc/meminfo instead of /sys/fs/cgroup/memory/memory.limit_in_bytes (which, as I just read and tested, is what Docker uses with cgroups to limit memory).
I hope you didn't lose your time on this, thanks for your help.
PS: Just for information, I tried running the container with shadow-uidmap built with libcap, with the same result (error: failed to setup UID/GID map:...)
The /proc/meminfo behavior seems to come from this code: https://github.com/rancher/k3s/blob/master/vendor/github.com/google/cadvisor/machine/machine.go#L128
cAdvisor is used by the kubelet to get resource information (for the node and the containers), see https://github.com/rancher/k3s/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L917 (and many other places in the code).
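To illustrate what a container-aware capacity check could look like, here is a minimal Python sketch that compares MemTotal from /proc/meminfo with the cgroup memory limit and reports the smaller value. This is not cAdvisor's code and not a proposed patch: the function names are hypothetical, and the handling of "unlimited" (the literal max in cgroup v2, a very large number in cgroup v1) is an assumption about typical setups.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v1 reports "no limit" as a huge number
# (typically 9223372036854771712), so anything this large counts as unlimited.
UNLIMITED_THRESHOLD = 1 << 60


def parse_meminfo_total(text: str) -> int:
    """Return MemTotal in bytes from the contents of /proc/meminfo."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024  # the value is given in kB
    raise ValueError("MemTotal not found")


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None when unlimited."""
    value = text.strip()
    if value == "max":  # cgroup v2 (memory.max) spells "no limit" this way
        return None
    limit = int(value)
    return None if limit >= UNLIMITED_THRESHOLD else limit


def effective_capacity(mem_total: int, limit: Optional[int]) -> int:
    """Pick the smaller of host memory and the cgroup limit, if there is one."""
    return mem_total if limit is None else min(mem_total, limit)


def detect_capacity() -> int:
    """Read the real files; the paths are the usual cgroup v1 and v2 locations."""
    mem_total = parse_meminfo_total(Path("/proc/meminfo").read_text())
    for candidate in (
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
        "/sys/fs/cgroup/memory.max",                    # cgroup v2
    ):
        path = Path(candidate)
        if path.exists():
            limit = parse_cgroup_limit(path.read_text())
            return effective_capacity(mem_total, limit)
    return mem_total
```

With the values from this thread, a host MemTotal of 32493528 kB and a 512MiB limit would yield 536870912 bytes as the node capacity instead of the host total.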
Hi @louiznk & @AkihiroSuda thanks for looking into this and going down this rabbit hole! :)
However, it seems that even though rootless mode in k3s is maturing, we won't be able to get rid of privileged mode in k3d...
@louiznk I like your approach of making cAdvisor aware of its containerized environment, and I see that even if they don't accept your PR, you can make it work with a k3s fork that includes the customized cAdvisor. Wish you the best of luck and success with this :+1:
Unfortunately, it seems that we have to close this issue as unsolvable though :confused:
Thanks again!
Thanks @iwilltry42 and @AkihiroSuda for your time and explanations, I learned a lot 🙏