Running a setuid-root binary inside a Nix sandbox does not actually set the uid to 0. For example, whoami will report nixbld, and security.wrappers programs will crash with an assertion failure when they detect that the effective uid does not match the file uid.
I can reproduce this either by calling /run/wrappers/bin/sudo in the sandbox (it has to be made visible) or by running /run/as/root $(which id) with my little asroot NixOS module.
Seems to me like the allow-new-privileges option was added to make this work (https://github.com/NixOS/nix/issues/1429), but it is not working with Nix from NixOS 18.09.
Notes:
- nosetuid flag.
- allow-new-privileges = true.
- Wild guess: Could this be due to userns sandboxing not providing a usable root user?
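For reference, a rough reproduction that does not need the asroot module might look like this. This is only a sketch: it assumes a NixOS host where /run/wrappers/bin/sudo exists and a caller (root or a trusted user) that is allowed to set extra-sandbox-paths.
$ nix-build --option extra-sandbox-paths /run/wrappers -E \
    'with import <nixpkgs> {}; runCommand "setuid-test" {} "
      whoami > $out                                 # reports nixbld*, not root
      /run/wrappers/bin/sudo true 2>> $out || true  # wrapper aborts: euid does not match file uid
    "'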
Wild guess: Could this be due to userns sandboxing not providing a usable root user?
Yes, that sounds likely. From the user_namespaces manpage:
However, if either the user or the group ID of the file has no mapping inside the namespace, the set-user-ID (set-group-ID) bit is silently ignored: the new program is executed, but the process's effective user (group) ID is left unchanged.
So for this to work, there would have to be a uid mapping to the root user in the parent namespace.
So for this to work, there would have to be a uid mapping to the root user in the parent namespace.
This would only be set when allow-new-privileges is enabled, I suppose?
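One way to check what mapping the sandbox actually has is to dump /proc/self/uid_map from inside a build. A hedged sketch; the concrete numbers depend on the Nix version and configuration:
$ nix-build -E 'with import <nixpkgs> {};
    runCommand "uidmap" {} "cat /proc/self/uid_map /proc/self/gid_map > $out"'
$ cat result
      1000      30001          1
Each line reads <id inside the namespace> <id outside> <count> (the 1000/30001 values above are illustrative only). If no line maps uid 0, the setuid bit is ignored exactly as the manpage describes.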
Curious what the use case is here, can you elaborate? It feels slightly wrong to be doing setuid things inside a build sandbox but I'm probably missing something.
Curious what the use case is here, can you elaborate?
I should write a FAQ on this :-)
We have a Hydra cluster running CI benchmarks on an open source telecom networking stack called Snabb. The Nix job builds a benchmark report with Rmarkdown, and that involves running many (thousands of) benchmark jobs. The benchmark jobs need root privileges due to the nature of the software, e.g. running userspace 10G/100G Ethernet device drivers, resolving physical memory addresses, performing DMA, etc. This cannot be wrapped in virtualization due to the performance-sensitive nature, e.g. enabling the IOMMU could impact performance.
Nix has been working amazingly well for us based on the setup that @domenkozar cooked up. He has even had it running whole OpenStack clusters within the test environment. Have to do a NixCon talk about it one day. Meanwhile: got to port the %!#@ thing to Nix 2.0 ;)
Nix is also used to test pull requests including checking for breakage in our bespoke device drivers etc. So being able to run applications with superuser privileges inside Nix is really crucial to our testing strategy.
This would only be set when allow-new-privileges is enabled, I suppose?
Yes, having the mapping in general shouldn't hurt but it's best to be on the safe side.
Related problem: Once I acquire uid/gid 0 inside the sandbox I still can't do other things like mount a hugetlbfs filesystem. Is this another kernel namespace causing hassle?
(Can I somehow be rid of these restrictions once and for all, or is it better to take them one at a time?)
Specifically the mount() system call is failing with EPERM even though I am root:
mount("none", "/hugetlbfs", "hugetlbfs", MS_MGC_VAL, NULL) = -1 EPERM (Operation not permitted)
geteuid() = 0
This is a limitation the kernel puts in place. In a user namespace, processes, even if root, are only allowed to mount filesystems that have been whitelisted. For that reason, for example, Ubuntu had a patch that allowed overlayfs. Bind mounts should be fine though.
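The same restriction can be demonstrated outside Nix with util-linux's unshare. A sketch, assuming unprivileged user namespaces are enabled on the host:
$ unshare --user --map-root-user --mount sh -c '
    mkdir -p /tmp/t /tmp/h
    mount -t tmpfs     none /tmp/t && echo "tmpfs: allowed"
    mount -t hugetlbfs none /tmp/h || echo "hugetlbfs: EPERM, not whitelisted"
  '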
@Mic92 Can I relax this restriction somehow? How does white-listing work?
I am running a program that expects to be able to mount filesystems within the sandbox. These are not mounted on the host and I don't want to mount them there either.
It is set at compile time; for example, procfs sets the FS_USERNS_MOUNT flag:
https://github.com/torvalds/linux/blob/b00d209241ff7ac336543f92ea764b9d6b03032a/fs/proc/root.c#L121
The easiest way would be to ask nix to mount those from somewhere into the sandbox.
Also, in the Nix case this is still not ideal because, in particular, hugetlbfs would be shared across builds.
Tricky. On the one hand I don't really mind sharing the directory between sandboxes. On the other hand I want the application to have superuser privileges and not have to make special-case workarounds for each and every thing that it wants to do.
Just now I lean towards patching the kernel to whitelist hugetlbfs but then I'll never get binary cache hits on the kernels :)
On further reflection it's probably not safe to share one hugetlbfs mount between all sandboxes. I suspect that PIDs will not be unique across sandboxes and this will lead to collisions on the filesystem if two processes with the same PID try to create their own directories in the same bind-mounted folder.
So it might have to be the kernel patching route.
You have to be aware that this flag is only set if it is safe to do so.
It is allowed for tmpfs because memory cgroups also account for those memory allocations.
Hugetlbfs might not handle this.
There is a sandbox-paths option in nix that allows bind mounting of filesystems into a build.
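For example (a sketch, assuming /hugetlbfs is already mounted on the host and that the caller is allowed to set this option; the extra- variant appends to the default paths instead of replacing them, and benchmark.nix is just a placeholder):
$ nix-build --option extra-sandbox-paths /hugetlbfs benchmark.nix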
For what definition of "safe" though? I am not concerned about security on the local machine here, e.g. privilege escalation within the sandbox. I am concerned about machine crashes though :).
If I bind-mount a shared directory from the host then I believe I'll also need to either prevent sandboxes from using the same PIDs (I'm assuming that they do now) or else patch the application I'm working with to protect against collisions with other processes having the same pid. This seems messy compared with a one-line kernel patch if that would work.
... Could also be that updating the application to support running multiple instances inside user namespaces with a shared hugetlbfs folder [pause for breath] is a legitimate enhancement that could be useful in other deployment contexts in the future. Today I can bypass userns if I really want to, but maybe tomorrow I can't, because a user wants to deploy in a more restricted environment. Have to ponder...
By safe I mean actually secure.
Your change would be applied here:
https://github.com/torvalds/linux/blob/b00d209241ff7ac336543f92ea764b9d6b03032a/fs/hugetlbfs/inode.c#L1287
Thank you very much for the information. I'll consider my options until the next time I'm back at the keys.
It might even be possible to compile a hugetlbfs2 module out-of-tree as a single module that does not have this restriction, to avoid compiling the whole kernel.
Reflection: This all seems to "just work" with docker --privileged. I have not seen any problems doing superuser things like mounting hugetlbfs in that context. How come it is different with Nix?
Docker does not have usernamespaces enabled by default.
Could be that what I need to do is disable usernamespaces?
Have to take a step back and think about the goals/requirements for benchmarking system software with Nix where running on full bare metal takes priority over local privilege management on the build host. Otherwise it is just whack-a-mole.
Well, you would need to rewrite some of Nix's sandbox logic for that; at the moment it depends on user namespaces to map the builder's uid/gid on the host to uid 1000/gid 1000 in the builder sandbox. Otherwise you could disable the sandbox as a whole.
Otherwise you could disable the sandbox as a whole
This may be the right solution. The main downside I see is that untracked dependencies ("impurities") can creep into the builds. However, compared with fighting kernel features like userns that starts to seem pretty minor.
Are there other pros/cons that I'm overlooking?
Yeah, disabling user namespaces sounds like the way to go. It could be set per derivation using requiredSystemFeatures, e.g.
requiredSystemFeatures = [ "no-userns" ];
Alternatively, you could just disable sandboxing, since you might not need it for benchmarking anyway?
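Both knobs are ordinary Nix settings; a sketch (benchmark.nix is a placeholder, and the options have to be set as root/a trusted user or in nix.conf):
$ nix-build --option sandbox false benchmark.nix   # sandbox off for this invocation
# or set sandbox = relaxed in nix.conf and mark only the benchmark derivation with
# __noChroot = true, so everything else still builds sandboxed; the "no-userns"
# feature from the previous suggestion would have to be advertised by the builder
# via its system-features setting.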
Can I have the chroot of the sandbox only? Is that compatible with the overall approach that Nix takes?
You can always just run your daemon in a chroot that only sees /nix/store (and /bin/sh and /usr/bin/env). Enumerating the store is slow enough to notice in the benchmarking (and hard to write by accident), and if the build references a path directly, it would be in the sandbox anyway.
Not sure how your setup works, but you could also disable the sandbox just for the benchmark itself but not the build. Well you could even build a docker container with nix that contains the benchmark to have some test isolation: https://nixos.org/nixpkgs/manual/#sec-pkgs-dockerTools or a nixos-container based one: https://nixos.org/nixos/manual/index.html#ch-containers
At the moment we delegate most of our privilege-requiring tests to VM-based tests. There was once a plan to also extend this with container-based tests, but there we would probably also want user-namespace-based, unprivileged containers for the nixpkgs use case.
This is much trickier than I had anticipated :-) thanks for all the feedback everybody!
Suppose we follow the route of running the build without a Nix sandbox but then having the build/test script construct a sandbox of its own. For example to create a chroot that shadows /var with a scratch folder so that each test can do whatever it likes there without interfering with other tests.
Is that a viable approach, and is it what some of these suggestions are hinting at? This way we would still have a modest level of test isolation, avoid advanced kernel features like user namespaces, and avoid depending on the intricacies of build.cc internals with SECCOMP and userns that are evolving in a different direction than our use case anyway. Or?
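As a concrete illustration of that idea, here is a minimal sketch, using a plain mount namespace rather than a full chroot, assuming the build/test script already runs as root with the Nix sandbox disabled (run-benchmark is a placeholder for the real test entry point):
scratch=$(mktemp -d)
unshare --mount sh -c "
  mount --make-rprivate /          # keep the bind mount out of the host's namespace
  mount --bind $scratch /var       # shadow /var with a per-test scratch directory
  run-benchmark                    # placeholder for the actual test
"
Each test then gets its own view of /var without touching the host's, and no user namespace is involved because the script is already root.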
disabling user namespaces sounds like the way to go
@edolstra How would that be implemented? Just conditionalize some parts of build.cc to suppress some of the sandbox setup?
Yes, basically make passing CLONE_NEWUSER and associated setup conditional, in the same way that CLONE_NEWNET is optional.
@edolstra I made a naive attempt at conditionalizing the userns, but nothing works with this attempt:
$ nix-build -E 'with import <nixpkgs> {}; runCommand "x" {} "echo ok | tee $out"'
these derivations will be built:
/nix/store/39x3acks4hq3sp5lgwnw4lcdzw5xi72d-x.drv
error: writing to file: Operation not permitted
Maybe I disabled the user namespace but other parts of build.cc are assuming the uids have been remapped or something. Hints would be warmly appreciated if anybody has time to peek at the patch and daemon strace at https://gist.github.com/992d195be54c68cfe501b20727c37c16