Nomad: exec driver mounts /dev as read-only

Created on 30 Mar 2017 · 19 Comments · Source: hashicorp/nomad

Nomad version

v0.5.5

Operating system and Environment details

Centos 7

Issue

2017/03/28 19:04:30 E! failed to open file /dev/stderr for alert logging: open /dev/stderr: permission denied

I'm wondering whether it's related to syscall.MS_RDONLY used in:

syscall.Mount("none", dev, "devtmpfs", syscall.MS_RDONLY, "")

https://github.com/hashicorp/nomad/blob/master/client/allocdir/task_dir_linux.go#L26

theme/driver/exec type/bug

All 19 comments

Let me know whether you would accept a PR for this.

@Ashald Is this a blocker for you? I am hesitant because the real solution would be not to mount the folder as read/write, but rather to limit which devices are mounted.

An interim solution could be adding write permission to just those devices?

Well, this essentially means that nothing that we run with the exec driver can write to /dev, including /dev/stderr and /dev/stdout. As we're using a forked version of Nomad with bind-mount support (as in #2333), I thought about applying a hotfix in our fork for the time being. But it'd be nice if it could be resolved sooner rather than later.

$ nomad fs 03612e4f-332f-ffbc-e144-8eaec5c2849b web/dev/ | grep std
Lrwxrwxrwx   15 B     03/25/17 12:31:42 -0400  stderr
Lrwxrwxrwx   15 B     03/25/17 12:31:42 -0400  stdin
Lrwxrwxrwx   15 B     03/25/17 12:31:42 -0400  stdout

That's what it looks like now. I think the mode is irrelevant if the whole filesystem is read-only.

/dev/std{err,out} on Linux is just a symlink to /proc/self/fd/{2,1} which is _also_ mounted readonly, so both mount points need fixing.

Out of curiosity, is there a reason you're using these pseudo-files instead of writing to file descriptor 2 directly in your app? This needs fixing either way, but it seems people rarely use these pseudo-files.

From my personal experience, I see a lot of usage for /dev/stdout and /dev/stderr all over the place.

But regardless of that, we actually got that error from a number of applications that we were trying to use (such as kapacitor, nginx and so on). So as much as we can, we write directly to FDs, but sometimes you just don't have that option.

Just for the record, we found a way to fix it. I'm not sure it's the best way, so I'm not submitting a PR, but the issue is pretty trivial on its own. Apart from the proc filesystem needing to be mounted as RW, there is also an issue with pipes. Nomad's executor starts a child process with its stdout and stderr connected to logger writers (https://github.com/hashicorp/nomad/blob/master/client/driver/executor/executor.go#L247). Internally, cmd.Start (https://github.com/hashicorp/nomad/blob/master/client/driver/executor/executor.go#L277) creates a pipe to get a file descriptor that it can connect to the child process. This pipe is owned by root and is not writable by the child process.
We solved this issue by applying the same technique as cmd.Start: creating a pipe beforehand, adjusting its ownership, and starting a goroutine that copies input from it into the logger writer.

I may be an outlier here, but, I'm seeing this same behavior (exec driver mounting /dev as read-only) cause a failure of mounting EBS volumes on my ec2 instances.

Once any job with an exec driver fires up on a host, I can attach a drive to the instance, but the /dev/xxxx device cannot be created, so the mount won't work. Naturally, a manual mknod can't create the device either.

Seeing the same behavior with GCP instances. Once an exec driver job starts, /dev is remounted as read-only and no nodes can be created. This causes attaching volumes to fail as @jcomeaux noted.

This is a show stopper for us since we are trying to use external volumes with rexray as a persistent volume solution for docker jobs.

Having same issues as @jcomeaux and @athak - did you guys find any fix?

I can confirm that /dev is re-mounted as readonly once an exec job starts. Tested with nomad 0.7.1 and 0.8.3

nomad-test / # mount |grep devtmpfs
devtmpfs on /dev type devtmpfs (ro,nosuid,seclabel,size=3807272k,nr_inodes=951818,mode=755)

After that we're not able to mount disks attached at runtime and have to re-mount /dev as rw.

We have the same issue, and we think that Nomad should prepare /dev in the chroot the way Docker does. I think the following would be enough:

# setup device
mknod $CHROOT/dev/null c 1 3
mknod $CHROOT/dev/zero c 1 5
mknod $CHROOT/dev/tty  c 5 0
mknod $CHROOT/dev/random c 1 8
mknod $CHROOT/dev/urandom c 1 9
chmod 0666 $CHROOT/dev/{null,tty,zero}
chown root:tty $CHROOT/dev/tty
ln -s /proc/self/fd/2 $CHROOT/dev/stderr
ln -s /proc/self/fd/0 $CHROOT/dev/stdin
ln -s /proc/self/fd/1 $CHROOT/dev/stdout

or do this like in this script:
https://github.com/funtoo/realdev/blob/master/realdev

It's not normal for Nomad to change the host's root system.

We wrote a small patch that solves our issue (it must be applied to version 0.8.6):

diff --git a/client/allocdir/task_dir_linux.go b/client/allocdir/task_dir_linux.go
index d8e62f49c..7c17c8106 100644
--- a/client/allocdir/task_dir_linux.go
+++ b/client/allocdir/task_dir_linux.go
@@ -5,12 +5,23 @@ import (
    "os"
    "path/filepath"
    "syscall"
+   "golang.org/x/sys/unix"

    "github.com/hashicorp/go-multierror"
 )

 // mountSpecialDirs mounts the dev and proc file system from the host to the
 // chroot
+func copyDevice(dst string, fi os.FileInfo) error {
+   st, ok := fi.Sys().(*syscall.Stat_t)
+   if !ok {
+       return fmt.Errorf("unsupported stat type")
+   }
+   return unix.Mknod(dst, uint32(st.Mode), int(st.Rdev))
+}
+
+var udevEntries = []string{"core", "null", "full", "zero", "tty", "random", "urandom", "stderr", "stdin", "stdout"}
+
 func (t *TaskDir) mountSpecialDirs() error {
    // Mount dev
    dev := filepath.Join(t.Dir, "dev")
@@ -24,9 +35,39 @@ func (t *TaskDir) mountSpecialDirs() error {
        return fmt.Errorf("error listing %q: %v", dev, err)
    }
    if devEmpty {
-       if err := syscall.Mount("none", dev, "devtmpfs", syscall.MS_RDONLY, ""); err != nil {
+       if err := syscall.Mount("none", dev, "tmpfs", syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_STRICTATIME, "mode=755,size=65536k"); err != nil {
            return fmt.Errorf("Couldn't mount /dev to %v: %v", dev, err)
        }
+
+       for _, udevitem := range udevEntries {
+           systemDev := filepath.Join("/", "dev", udevitem)
+           targetDev := filepath.Join(dev, udevitem)
+
+           fi, err := os.Lstat(systemDev)
+           if err != nil {
+               return fmt.Errorf("Couldn't get stat from udev file %v: %v", systemDev, err)
+           }
+
+           switch mode := fi.Mode(); {
+           case (mode & os.ModeSymlink) == os.ModeSymlink:
+               link, err := os.Readlink(systemDev)
+               if err != nil {
+                   return fmt.Errorf("failed to read link: %s", systemDev)
+               }
+               if err := os.Symlink(link, targetDev); err != nil {
+                   return fmt.Errorf("failed to create symlink: %s", targetDev)
+               }
+
+           case (mode & os.ModeDevice) == os.ModeDevice: 
+               if err := copyDevice(targetDev, fi); err != nil {
+                   return fmt.Errorf("failed to create device %s", targetDev)
+               }
+
+               if err = unix.Chmod(targetDev, 0666); err != nil {
+                   return fmt.Errorf("failed to chmod device %s", targetDev)
+               }
+           }
+       }
    }

    // Mount proc

Same problem on both 0.8.4 and 0.8.6.
It took me 2 days to debug why attached EBS storage was not being mounted.
Maybe it's time to release a fix?

@preetapan @dadgar can we give a little attention to this one? /dev being remounted ro on the host leads to really serious consequences.
After checking the records, we lost nearly a week debugging this, because one of our system jobs was redeployed with exec, which left our whole environment unable to use rexray for dynamically attaching EBS volumes.

@dadgar @schmichael

I would love to get some priority on this issue. As @burdandrei described above, we ended up with 1-3 infra engineers spending 2-3 days trying to debug why we could not mount EBS volumes on our docker clients. The symptom we saw was that /dev/xvdX did not exist, and udev failed to stat the device for this reason.

We ended up escalating this to AWS support as well, who were equally puzzled and (obviously) couldn't reproduce the issue.

We also spent a good amount of time debugging different kernel versions to see if it was a recent kernel upgrade issue.

After a couple of days, @burdandrei found that all of /dev on the host was remounted read-only, so any mount operations done by the kernel when we attach EBS volumes via API / CLI failed to create the device node.

The root cause was that a Nomad job using the exec driver (we normally only use docker for everything) had _remounted_ /dev on the host machine.

This was a very surprising behavior, as our understanding was that exec should make a chroot and not modify the host system.

At best, this behavior is a p0 / high priority bug as the behavior is destructive to the host instance, undocumented and frankly quite a surprise.

I hope you find some time to fix this in 0.8.x or make it a breaking-change improvement in 0.9.x.

Thank you

This was a very surprising behavior, as our understanding was that exec should make a chroot and not modify the host system.

Indeed! Thanks for the detailed post @jippi and hard work from everyone else in this thread.

The /dev handling and chroot'ing have changed in 0.9, and I'm not able to reproduce this problem off master with a simple exec job. If anyone else has time to test, here is a Linux amd64 binary built off of master (f10c6259c).

Sorry for the lack of communication. 0.9 is getting very close to RC, and we've been trying to avoid adding extra work. However, we definitely want to make sure this bug is fixed!

nomad-f10c6259c.gz

Thanks @schmichael, I just ran the binary you provided and /dev on the host is untouched!

Thanks for testing @burdandrei! I'm not going to dig through the piles of PRs to figure out which one fixed this, but it will be fixed in the upcoming 0.9 release!

As always please feel free to reopen if anyone runs into this again after f10c625 or v0.9.
