minikube 🚀 - Migrate away from rootfs / DOCKER_RAMDISK

The security issues should be addressed now, right ? With https://github.com/opencontainers/runc/commit/28a697cce3e4f905dca700eda81d681a30eef9cd

afbjorklund on 22 Feb 2019

Mitigated, but chroot is still not recommended

cc @cyphar

AkihiroSuda on 22 Feb 2019

Yes, I absolutely recommend against using chroot. It's simply not secure, and continuing to use it is a bad idea -- the security you get from chroot is incredibly minimal compared to the security you get from pivot_root. I'm actually of half a mind to remove chroot support from runc entirely, that's how bad of an idea it is to use it.

cyphar on 23 Feb 2019

OK. Moving away from Buildroot (already from Boot2Docker) probably means some work, though... From what we have seen so far, it will also make the footprint (in terms of the various disk images) larger. So it depends on how important it is to make the local VM secure - we don't use no_pivot_root for remote.

i.e. the default user has access to sudo (through %wheel) and the default user has a known password...

afbjorklund on 23 Feb 2019

In theory the fix could be as simple as just having a tmpfs for Docker. I don't know how the images are stored (and if they're baked into images that might be difficult) but since initramfs is already in-memory there isn't a difference with tmpfs other than the fact you need to load it on-boot.

cyphar on 23 Feb 2019

Interesting! Today we have "everything" on the rootfs (besides the usual runtime suspects), and then have the docker/containers on a mounted disk (for persistence), using the overlay2/overlay storage drivers.

Something like this: (with the cgroup/bind-mounts/overlay/shm excluded)

Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
devtmpfs        906M     0  906M   0% /dev
sysfs              0     0     0    - /sys
proc               0     0     0    - /proc
tmpfs           996M     0  996M   0% /dev/shm
devpts             0     0     0    - /dev/pts
tmpfs           996M  102M  895M  11% /run
tmpfs           996M     0  996M   0% /sys/fs/cgroup
hugetlbfs          0     0     0    - /dev/hugepages
nfsd               0     0     0    - /proc/fs/nfsd
mqueue             0     0     0    - /dev/mqueue
fusectl            0     0     0    - /sys/fs/fuse/connections
debugfs            0     0     0    - /sys/kernel/debug
tmpfs           996M   28K  996M   1% /tmp
/dev/sda1        17G  1.5G   14G  10% /mnt/sda1

i.e. docker lives in /var/lib/docker and crio lives in /var/lib/containers

TARGET              SOURCE                         FSTYPE OPTIONS
/var/lib/docker     /dev/sda1[/var/lib/docker]     ext4   rw,relatime,data=ordered
/var/lib/containers /dev/sda1[/var/lib/containers] ext4   rw,relatime,data=ordered

What new file system would be required, for it to be able to use pivot_root ?

container create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"pivot_root invalid argument\\\"\""

afbjorklund on 23 Feb 2019

The issue is that / is of type rootfs. pivot_root would work if you added a tmpfs mount for /var/lib/docker, or if you switched / to be a real root filesystem (which is the case where the image size would increase). To quote the comment from the pivot_root source:

 * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
 * See Documentation/filesystems/ramfs-rootfs-initramfs.txt for alternatives
 * in this situation.

In fact (if you read Documentation/filesystems/ramfs-rootfs-initramfs.txt), rootfs is precisely identical to tmpfs -- it's just a special case which cannot be unmounted. This is actually the reason you can't pivot_root with rootfs -- because you cannot move the mount by design (it would be like killing pid1).
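
As a rough illustration of the point above, a script can check whether / is still the initial rootfs by reading the mount table. This is only a sketch and not part of minikube: the helper names `root_fstype` and `root_is_rootfs` are made up for this example. It takes the last entry for /, since a filesystem mounted over the initial rootfs appears later in the table.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from minikube or runc.

# Print the filesystem type of the effective root mount.
# $1: mount table in /proc/mounts format (default: /proc/mounts).
root_fstype() {
    awk '$2 == "/" { fstype = $3 } END { if (fstype != "") print fstype }' \
        "${1:-/proc/mounts}"
}

# Succeed when / is still the initial rootfs, i.e. the case in which
# pivot_root fails with "invalid argument".
root_is_rootfs() {
    [ "$(root_fstype "$1")" = "rootfs" ]
}
```

For example, `root_is_rootfs && echo "pivot_root will not work here"` could be run inside the VM.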

cyphar on 23 Feb 2019

But if we try to keep all the images on tmpfs, we would have <2G available (RAM) instead of >20G ?
And it would die on reboot, and all the other fun stuff. So I am not sure that approach is doable.

afbjorklund on 23 Feb 2019

I'm not sure I understand -- rootfs is a tmpfs. It's all in-memory in either case and /var is on rootfs so there isn't a difference -- or am I misunderstanding something?

cyphar on 23 Feb 2019

Probably not, but it seems you are saying that we need to keep everything in memory or we need to keep everything on disk - the current mix with booting from memory and storing images on disk won't work ?

afbjorklund on 23 Feb 2019

Sorry, you don't need to keep everything in memory -- I misunderstood and thought you were doing that already. The only thing you need is for / to not be rootfs. This can be done by just mounting tmpfs on top of rootfs in very early boot, or any of the other ideas mentioned in Documentation/filesystems/ramfs-rootfs-initramfs.txt.

cyphar on 23 Feb 2019

@cyphar : I'm not sure if Buildroot has such an option, or if anyone in minikube has the skills to do it... So the most likely approach is switching to a more common distro such as Debian or CentOS (like minishift).

However, doing so makes minikube even more similar to using some other approach to provision a VM... And like I mentioned earlier, the current attempts to do it have so far also increased the footprint (by 2-3x) ?

If someone can make tmpfs work, then _please_ post it here.

afbjorklund on 23 Feb 2019

Here is the Fedora/CentOS method, if that helps: https://fedoraproject.org/wiki/LiveOS_image

It uses device-mapper snapshots:

live-base: 0 20971520 linear 
live-osimg-min: 0 20971520 snapshot 8608/8608 48
live-rw: 0 20971520 snapshot 383648/67108864 1504

So that the root file system is "normal":

TARGET              SOURCE                         FSTYPE OPTIONS
/                   /dev/mapper/live-rw            ext4   rw,noatime,seclabel
/var/lib/containers /dev/sda1[/var/lib/containers] xfs    rw,relatime,seclabel,attr2,inode64,noquota

So there is no need to run with no_pivot_root, but the ISO images are bigger than with rootfs.

afbjorklund on 23 Feb 2019

@afbjorklund The way you would have to do this is that you mount a tmpfs which you then fill with your rootfs (rather than doing that with /) and then doing an MS_MOVE over / so that you can then use it. There is a helper program called switch_root which is installed on most systems that does this for you (it's also a library function in a few things).

The "easiest" fix would be to do a cp -R of the important things on / into a new tmpfs and then switch_root to it (switch_root recursively deletes everything on the old filesystem). But you have to do this before anything else is mounted -- so you'd need to adjust your init system to do this (if you're using systemd I think it has some way of specifying that you want it to do this and it'll do it for you).

I can take a look at what changes would need to be done with Buildroot, but that would be the first thing I'd check.

As I said above, you don't need to use a loopback filesystem or anything like that -- you just need to mount tmpfs over the rootfs before anything happens on the system and it will work, because tmpfs and rootfs are completely identical except that rootfs doesn't have a parent mount (which makes pivot_root deny you from switching).

cyphar on 24 Feb 2019

@cyphar : thanks for the help, we are currently using buildroot version 2018.05 (with systemd):

https://git.busybox.net/buildroot/tree/?h=2018.05.x
deploy/iso/minikube-iso/configs/minikube_defconfig

afbjorklund on 24 Feb 2019

I finally noticed which "bootcode" Boot2Docker uses on Tiny Core Linux to get a tmpfs instead:

# noembed: put / on a tmpfs instead of the kernel "rootfs" (ramdisk);

10.32. noembed - use a separate tmpfs

This is an advanced option that changes where in RAM Core is run
from. By default, Core uses the tmpfs setup by the kernel; with this
bootcode, Core will setup a new tmpfs file system, and use that
instead.

Using this bootcode temporarily doubles the RAM use, as both
copies are kept in RAM at once during boot. As an extra copy is
made, it also slows the boot time. It allows GNU df to detect the
free space in /, used by some proprietary software installers.

Code: https://github.com/tinycorelinux/Core-scripts/blob/3013492508569a36fbb05a8a00cd90f38619f414/init#L13:L19

if mount -t tmpfs -o size=90% tmpfs /mnt; then
  if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
    mkdir /mnt/mnt
    exec /sbin/switch_root mnt /sbin/init
  fi
fi
exec /sbin/init
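
Since the documentation quoted above says the copy temporarily keeps two copies in RAM, an init script could first check that enough memory is available. The following is only a sketch under assumptions: the helper names are hypothetical, the required size in kB has to be supplied by the caller, and it relies on the `MemAvailable` field in /proc/meminfo.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Tiny Core init script.

# Print MemAvailable in kB.
# $1: file in /proc/meminfo format (default: /proc/meminfo).
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed when at least $1 kB are available for the second copy.
# $2: optional meminfo file, passed through to mem_available_kb.
have_ram_for_copy() {
    need_kb=$1
    avail_kb=$(mem_available_kb "$2")
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$need_kb" ]
}
```

If the check fails, the script could fall through to the plain `exec /sbin/init` path instead of attempting the copy.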

afbjorklund on 24 Apr 2019

If someone could help out with converting this from init to systemd, that would be appreciated.

afbjorklund on 24 Apr 2019

To check if I understand this properly: putting the equivalent of the code snippet above in a systemd unit file in deploy/iso/minikube-iso/board/coreos/minikube/rootfs-overlay/etc/systemd which needs to be executed "before anything else" will fix this?

And it should also fix #4143 (as a side effect)?

massimiliano-mantione on 26 Apr 2019

@massimiliano-mantione I believe the correct way of doing this under systemd is with some initrd configuration magic (at least that's my understanding of this page). There is also already a systemctl switch-root command which you could use without needing to write the mount code yourself.

Every distribution I know of already does this, so we just need to copy how they do it. In particular, on my system there is a /usr/lib/systemd/system/initrd-switch-root.service (which is included as part of a systemd install) which does this:

#  SPDX-License-Identifier: LGPL-2.1+
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Switch Root
DefaultDependencies=no
ConditionPathExists=/etc/initrd-release
OnFailure=emergency.target
OnFailureJobMode=replace-irreversibly
AllowIsolate=yes

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --no-block switch-root /sysroot

So we just need to use initrd-switch-root.service (there's also initrd-switch-root.target but I'm not sure how to configure the bootup target transitions).

All of this work is done by dracut on systemd-based distributions, so maybe we should look at using that for building our initramfs?

cyphar on 26 Apr 2019

Bonus points are awarded for integrating this option into the Buildroot distribution, similar to how the "noembed" boot code (above) works for Tiny Core Linux...

afbjorklund on 26 Apr 2019

Apparently there is a magic "sysroot.mount" unit that does it:
https://www.freedesktop.org/software/systemd/man/bootup.html

                                               :
                                               v
                                         basic.target
                                               |
                        ______________________/|
                       /                       |
                       |            initrd-root-device.target
                       |                       |
                       |                       v
                       |                  sysroot.mount
                       |                       |
                       |                       v
                       |             initrd-root-fs.target
                       |                       |
                       |                       v
                       v            initrd-parse-etc.service
                (custom initrd                 |
                 services...)                  v
                       |            (sysroot-usr.mount and
                       |             various mounts marked
                       |               with fstab option
                       |              x-initrd.mount...)
                       |                       |
                       |                       v
                       |                initrd-fs.target
                       \______________________ |
                                              \|
                                               v
                                          initrd.target
                                               |
                                               v
                                     initrd-cleanup.service
                                          isolates to
                                    initrd-switch-root.target
                                               |
                                               v
                        ______________________/|
                       /                       v
                       |        initrd-udevadm-cleanup-db.service
                       v                       |
                (custom initrd                 |
                 services...)                  |
                       \______________________ |
                                              \|
                                               v
                                   initrd-switch-root.target
                                               |
                                               v
                                   initrd-switch-root.service
                                               |
                                               v
                                     Transition to Host OS

No idea how you implement that, though. It should copy from / to /sysroot.
It seems that this file is "normally" being generated by dracut at boot time:

https://github.com/dracutdevs/dracut/blob/bca1967c90967d5453d8b215ff28552776e4fcb3/modules.d/98dracut-systemd/rootfs-generator.sh

afbjorklund on 27 Apr 2019

Tried to make my own implementation of a sysroot-mount.service, but now it doesn't boot anymore...

_/etc/systemd/system/initrd-root-fs.target.requires/sysroot.mount_

[Unit]
Before=initrd-root-fs.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/mksysroot

_/usr/sbin/mksysroot_

#!/bin/sh
set -e
mkdir /sysroot
mount -t tmpfs -o size=90% tmpfs /sysroot
tar -C / --exclude=sysroot -cf - . | tar -C /sysroot/ -xf -

So still need some help with all the systemd incantations that are needed to make this work properly.

afbjorklund on 25 May 2019

Found some things in the journal, once it actually booted:

Jun 11 20:13:26 minikube systemd[1]: /etc/systemd/system/sysroot.mount:4: Unknown section 'Service'. Ignoring.
Jun 11 20:13:26 minikube systemd[1]: sysroot.mount: What= setting is missing. Refusing.

Guess my syntax was wrong. Figures.

[Mount]
What=tmpfs
Where=/sysroot
Type=tmpfs
Options=size=90%%

afbjorklund on 11 Jun 2019

The new mount unit works fine when I start it myself, but fails to load when booting for some reason...

$ sudo systemctl status sysroot.mount
● sysroot.mount - /sysroot
   Loaded: loaded (/etc/systemd/system/sysroot.mount; enabled; vendor preset: enabled)
   Active: inactive (dead)
    Where: /sysroot
     What: tmpfs

$ sudo systemctl start sysroot.mount
$ sudo systemctl status sysroot.mount
● sysroot.mount - /sysroot
   Loaded: loaded (/etc/systemd/system/sysroot.mount; enabled; vendor preset: enabled)
   Active: active (mounted) since Tue 2019-06-11 21:26:21 UTC; 4s ago
    Where: /sysroot
     What: tmpfs
  Process: 3282 ExecMount=/usr/bin/mount tmpfs /sysroot -t tmpfs -o size=90% (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 2241)
   CGroup: /system.slice/sysroot.mount

Jun 11 21:26:21 minikube systemd[1]: Mounting /sysroot...
Jun 11 21:26:21 minikube systemd[1]: Mounted /sysroot.

Same story with the service unit, that is supposed to copy everything over from / to /sysroot

afbjorklund on 11 Jun 2019

This also seems to be blocking upgrading the version of runsc used by the gVisor add-on since we use pivot_root now.

ianlewis on 13 Jun 2019

This also seems to be related to csi drivers being broken: https://github.com/kubernetes/minikube/issues/4072

kfox1111 on 18 Jul 2019

This one seems to block all CSI drivers on minikube, is there anything we can do to help debug?

entropitor on 12 Aug 2019

I would appreciate help with porting the (small) shell script from init to the equivalent (long) one for systemd...

afbjorklund on 12 Aug 2019

Most systems using runc can probably be convinced to pass --no-pivot one way or the other.

   --no-pivot                do not use pivot root to jail process inside rootfs.
                             This should be used whenever the rootfs is on top of a ramdisk

This still has all the security issues indicated above, so we don't want to keep on using that forever.

The _other_ problem, that storage has, is that anything based on rootfs doesn't work with df etc:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
$ df -h /mnt/sda1 
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        17G  1.3G   15G   8% /mnt/sda1

So all in all, there are several reasons for moving from rootfs to tmpfs, as soon as possible.

   ( '>')
  /) TC (\   Core is distributed with ABSOLUTELY NO WARRANTY.
 (/-_--_-\)           www.tinycorelinux.net

docker@default:~$ df -h /
Filesystem                Size      Used Available Use% Mounted on
tmpfs                   890.5M    281.2M    609.3M  32% /

It would have to be compatible with the new version of Buildroot, that we are soon upgrading to.

It was _almost_ working before, so there is some final piece left of the puzzle that is systemd(1)...

afbjorklund on 12 Aug 2019

Some more debugging info, after adding the magic /etc/initrd-release file:

initrd-root-fs.target.requires/sysroot.mount -> ../sysroot.mount

# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/proc/cmdline
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=initrd-root-fs.target
Requires=systemd-fsck-root.service
After=systemd-fsck-root.service

[Mount]
Where=/sysroot
What=/dev/sr0
Options=ro

local-fs.target.requires/-.mount -> ../-.mount

# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=local-fs.target

[Mount]
Where=/
What=/dev/root
Type=ext2
Options=rw,noauto

So maybe there is a way to add our mksysroot to initrd-root-fs.target.wants ?

afbjorklund on 12 Aug 2019

Hmm, it seems like systemd uses the flags from root also for creating the sysroot:

root=
Takes the root filesystem to mount in the initrd. root= is honored by the initrd.

rootfstype=
Takes the root filesystem type that will be passed to the mount command. rootfstype= is honored by the initrd.

rootflags=
Takes the root filesystem mount options to use. rootflags= is honored by the initrd.

Worth a shot adding that to the kernel cmdline, and see if the new service works with it ?
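
To see what the initrd would actually pick up, the three parameters can be read back from the kernel command line. This is a minimal sketch with a hypothetical helper name (`cmdline_value`); it assumes plain space-separated `key=value` words and does not handle quoted values containing spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key=value kernel parameter.
# $1: key, e.g. root, rootfstype or rootflags.
# $2: file in /proc/cmdline format (default: /proc/cmdline).
# If the key is given more than once, the last occurrence is printed.
cmdline_value() {
    key=$1
    tr ' ' '\n' < "${2:-/proc/cmdline}" | sed -n "s/^${key}=//p" | tail -n 1
}
```

For example, `cmdline_value rootfstype` prints nothing when the parameter is not set.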

afbjorklund on 12 Aug 2019

Here was the command to debug the generator, from systemd.generator(7):

           dir=$(mktemp -d)
           SYSTEMD_LOG_LEVEL=debug /lib/systemd/system-generators/systemd-fstab-generator \
                   "$dir" "$dir" "$dir"
           find $dir

Note that the sysroot is only generated when in_initrd(), i.e. /etc/initrd-release
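
The condition mentioned in that note can be approximated in shell. This sketch mirrors only the flag-file check described here; the real in_initrd() in systemd may apply further conditions, and the optional root-directory argument is an addition for testing, not something systemd offers.

```shell
#!/bin/sh
# Rough approximation of the flag-file check: succeed when
# /etc/initrd-release exists.
# $1: alternative root directory to inspect (hypothetical, for testing);
#     when omitted, the real / is checked.
in_initrd() {
    [ -e "${1:-}/etc/initrd-release" ]
}
```

Running `in_initrd && echo initrd` inside the VM shows which of the two boot flows the generators will assume.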

afbjorklund on 12 Aug 2019

From your post above dated May 25, did you try naming /etc/systemd/system/initrd-root-fs.target.requires/sysroot.service instead of /etc/systemd/system/initrd-root-fs.target.requires/sysroot.mount?

kfox1111 on 13 Aug 2019

@kfox1111 : I had both of them eventually (there were some follow-up posts), but it never took effect.

If I understood correctly, it wanted one [Mount] to do the mount and one [Service] to do the copy ?

Any hints to an approach that might work are appreciated! We know that it works fine with /sbin/init.

Now we just need to convince /lib/systemd/systemd to do the same trick, i.e. switch_root to a tmpfs

afbjorklund on 13 Aug 2019

I'm not sure if trying to add dracut to buildroot will make things better or worse ? (assuming more bad)

afbjorklund on 13 Aug 2019

How are you tweaking the initrd? I tried to extract the iso, then the initrd, then rebuild the initrd and rebuild the iso and it's failing to boot. No changes were made. I did get just the step of extracting the iso and then rebuilding the iso to boot.

kfox1111 on 13 Aug 2019

Never mind. I found the step i was missing.

kfox1111 on 13 Aug 2019

Ok. I think I may have gotten a little further.

I was looking at https://www.freedesktop.org/software/systemd/man/bootup.html
and your notes from above. I was trying to get the sysroot.service file working as above but it kept on maybe not doing anything. Felt like it wasn't getting started. In the diagram you had tried to use initrd-root-fs.target.
As far as I can tell though, even though an initrd is being used, the initrd flow isn't. That flow is used to hand off from a systemd in the initrd to a systemd inside the main rootfs.

So, it seems to be starting at local-fs-pre.target.

Linking the service to local-fs-pre.target seems to have done at least something. The vm kernel panicked with Out of memory and no killable processes during boot.

kfox1111 on 14 Aug 2019

ok, now I get the significance of the /etc/initrd-release file. It's to trigger the initrd workflow.

kfox1111 on 14 Aug 2019

Hmm... ok. So, both ways, I think we're fighting against the way systemd is designed. We have an initrd with our content in it. But we're using it as a root drive, not as an initrd. So the non /etc/initrd-release seems like the way to go. I can almost get there, but things are kind of too far along to pivot, even early on with only the tools available. The other option is to go the /etc/initrd-release route. But that has config files still intended for the runtime, not the initrd time stuff all mixed in. So hard to separate.

What if we're overthinking this? Why not move everything in the initrd to sysroot-copy, and use a busybox/init script as the first phase of init that copies sysroot-copy to a tmpfs at /sysroot and then switches root to /sysroot?

kfox1111 on 14 Aug 2019

Ok. So, some good news and bad news...

The good news. I was able to get it to mount as type tmpfs.
The trick is to overwrite '/init' in the initrd. It's just a shell script. Content should be:

#!/bin/sh
mkdir -p sysroot
mount -t tmpfs -o size=90% tmpfs /sysroot
tar -C / --exclude=sysroot -cf - . | tar -C /sysroot/ -xf -

/bin/mount -t devtmpfs devtmpfs /sysroot/dev
exec 0</sysroot/dev/console
exec 1>/sysroot/dev/console
exec 2>/sysroot/dev/console
exec switch_root /sysroot /sbin/init "$@"

It comes up ok and then:

# mount | head
tmpfs on / type tmpfs (rw,relatime,size=1748160k)

YAY! :)
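
The copy step of that script can be tried out on ordinary directories without root. This is a minimal sketch with a hypothetical helper name (`copy_tree`), using the same tar pipe as above. Two caveats, stated as assumptions about the tools: with GNU tar, `--exclude=NAME` is unanchored and skips entries called NAME at any depth, and in POSIX sh the pipeline's exit status is that of the extracting tar only, so a failure of the archiving side goes unnoticed.

```shell
#!/bin/sh
# Hypothetical helper: copy the tree under $1 into $2, skipping entries
# named $3 (in the init script: the mount point of the new root).
copy_tree() {
    src=$1
    dst=$2
    skip=$3
    # Stream an archive of $src to stdout and unpack it inside $dst.
    tar -C "$src" --exclude="$skip" -cf - . | tar -C "$dst" -xf -
}
```

In the init script this corresponds to `copy_tree / /sysroot sysroot`, run before anything else is mounted.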

Bad news, I launched my pod with bidirectional mount, and still has the same issue.

kfox1111 on 14 Aug 2019

Ok. I can confirm the bidirectional mount does not seem to be related to the rootfs/tmpfs thing. Switching the init script so it pivots root onto a temp space on /dev/vda1 still shows the issue:

#!/bin/sh

mkdir -p /dev
/bin/mount -t devtmpfs devtmpfs /dev
mkdir -p sysroot-tmp
mount /dev/vda1 /sysroot-tmp
rm -rf /sysroot-tmp/tmp-root || true
mkdir -p /sysroot-tmp/tmp-root

mkdir -p sysroot
#mount -t tmpfs -o size=90% tmpfs /sysroot
mount --bind /sysroot-tmp/tmp-root /sysroot
mkdir -p /sysroot/dev
tar -C / --exclude=sysroot --exclude=sysroot-tmp -cf - . | tar -C /sysroot/ -xf -

/bin/mount -t devtmpfs devtmpfs /sysroot/dev
exec 0</sysroot/dev/console
exec 1>/sysroot/dev/console
exec 2>/sysroot/dev/console
exec switch_root /sysroot /sbin/init "$@"

# mount | head    
/dev/vda1 on / type ext4 (rw,relatime,data=ordered)

kfox1111 on 14 Aug 2019

Hacking /init sounds like an excellent idea! Very pragmatic

I know how to build switch_root

What was with /dev/console ? Didn’t see that (devtmpfs) elsewhere

Thanks for the help, and the sanity check

afbjorklund on 14 Aug 2019

@kfox1111 : never mind, found the original source of the /dev/console code.

#!/bin/sh
# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"

Thank you for the suggestion, this will work out just fine.

Before:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1942         425          48          16        1469        1382
Swap:             0           0           0

After:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.8G  567M  1.2G  33% /
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1942         516          69         584        1357         830
Swap:             0           0           0

Now need to change the rest of the configuration etc, but this should be doable.

afbjorklund on 14 Aug 2019

And it worked fine with /sbin/switch_root, no need to build util-linux switch_root.

afbjorklund on 14 Aug 2019

If you can think of any reason why shared mounts break in minikube when they are first used, I'd really appreciate it. I'm struggling a bit trying to figure it out in https://github.com/kubernetes/minikube/issues/4072. I really thought it was this issue but seems unrelated. Thanks.

kfox1111 on 14 Aug 2019

No real ideas, sorry. Sounds unrelated?

afbjorklund on 14 Aug 2019

I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs?

kfox1111 on 14 Aug 2019

I can confirm DOCKER_RAMDISK is there so that rootfs works. With the init pivot to tmpfs from above, it is no longer required and allows shared mounts to work. We should remove it as part of this fix.

kfox1111 on 14 Aug 2019

> I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs?

Yes, that is related to --no-pivot (same as no_pivot_root = true in podman)

NoPivotRoot: os.Getenv("DOCKER_RAMDISK") != ""

https://github.com/moby/moby/blob/master/libcontainerd/remote/client.go#L205
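
Read literally, the Go expression quoted above tests only whether the variable is non-empty, so any value (even "no" or "false") turns pivot_root off. A shell rendering of the same test, as a sketch with a made-up function name:

```shell
#!/bin/sh
# Hypothetical shell equivalent of: os.Getenv("DOCKER_RAMDISK") != ""
# Succeeds when DOCKER_RAMDISK is set to any non-empty value.
docker_no_pivot_enabled() {
    [ -n "${DOCKER_RAMDISK:-}" ]
}
```

Following that logic, disabling the behaviour means removing the Environment= line from docker.service rather than setting the variable to a "false" value.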

afbjorklund on 14 Aug 2019

Here is the same setting in crio.conf:

# If true, the runtime will not use pivot_root, but instead use MS_MOVE.
no_pivot = true

containerd:

      no_pivot = true

buildah:

export BUILDAH_NOPIVOT=true

afbjorklund on 14 Aug 2019

At least in the docker case, it looks like minikube may be injecting the docker.service file?

It only seems to show up after minikube start gets to a certain point.

kfox1111 on 14 Aug 2019

> At least in the docker case, it looks like minikube may be injecting the docker.service file?

All of them, actually. The default is false. (i.e. use pivot_root)

https://github.com/kubernetes/minikube/blob/v1.3.1/pkg/provision/buildroot.go#L98_L99

afbjorklund on 14 Aug 2019

ok. so it's a fix to the iso and to the minikube program.

kfox1111 on 14 Aug 2019

Yeah, theoretically we could have minikube look at the mounted file system and adjust appropriately...

That might be appreciated by people who are using older or forked version of the ISO for some reason.

afbjorklund on 14 Aug 2019

Does minikube do any templating on the files or just copy them right in?

If it's a straight copy, maybe we put the files inside the iso. If they exist, then copy them from the disabled dir to the final destination. If not, inject them. This would allow users to more easily customize them too.

kfox1111 on 14 Aug 2019

It's templated, unfortunately. This also has the side effect that you can't reboot the VM yourself.

See ~#1851~ (it has been there since day one: e8a60b9cdf2323d242a5acb223cebbbb964aae4c)

afbjorklund on 14 Aug 2019

hmm... is it gotl? maybe the raw templates could be copied from the iso, templated out to the final version, then injected back to the final location?

kfox1111 on 14 Aug 2019

Here is the code for the dynamic runtime configuration: 5afa5a21a951a711766a5dd043c087a53a314611

It will detect a non-rootfs partition, and avoid DOCKER_RAMDISK

OOPS: we cannot use this code, since it needs to run over ssh

Anyway, same thing as the go code - but in shell instead :-)

afbjorklund on 14 Aug 2019

Easiest is using df (from GNU coreutils), and filter out the header (as preferred):

$ minikube ssh "df --output=fstype / | sed 1d"
rootfs

So run that from go, instead of gopsutil, and adjust the go template accordingly.

afbjorklund on 14 Aug 2019

Is there a pr for the tmpfs init?

kfox1111 on 14 Aug 2019

> Is there a pr for the tmpfs init?

There will be, eventually.

Basically same as above, just tweaked it a bit. We could make it dynamic, but I think that is overkill.
That is, honor the: grep -qw noembed /proc/cmdline (we already have "noembed" - but ignore it)

Used /sbin/switch_root.
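
If the init script were made dynamic after all, the grep above could be wrapped as follows. This is a sketch with a hypothetical helper name (`has_bootcode`); note that `grep -w` also matches forms such as `noembed=1`, because `=` is not a word character.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the boot code $1 appears as a whole
# word on the kernel command line.
# $2: file in /proc/cmdline format (default: /proc/cmdline).
has_bootcode() {
    grep -qw -- "$1" "${2:-/proc/cmdline}"
}
```

The init script could then guard the tmpfs copy with `if has_bootcode noembed; then ...; fi`.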

afbjorklund on 14 Aug 2019

I just verified my csi driver is working with the fixes in place. So excited to get a release with this in place so everyone can csi. :)

kfox1111 on 14 Aug 2019

These two, merged together.

buildroot: https://github.com/buildroot/buildroot/blob/master/fs/cpio/init

# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"

tinycore: https://github.com/tinycorelinux/Core-scripts/blob/master/init

if mount -t tmpfs -o size=90% tmpfs /mnt; then
  if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
    mkdir /mnt/mnt
    exec /sbin/switch_root mnt /sbin/init
  fi
fi
exec /sbin/init

Probably /sysroot, not /mnt.

afbjorklund on 14 Aug 2019

@afbjorklund

> Easiest is using df (from GNU coreutils), and filter out the header (as preferred):
> $ minikube ssh "df --output=fstype / | sed 1d"
> rootfs

Why not use statfs(2) directly (or stat --file-system --format '%T' /)?

cyphar on 22 Aug 2019

> Why not use statfs(2) directly (or stat --file-system --format '%T' /)?

That works too, thanks for the tip! Now looks like:

$ minikube ssh -- stat --file-system --format '%T' /
tmpfs

afbjorklund on 22 Aug 2019

Unfortunately I forgot to check that it still worked for the old ISO (it didn't):

$ minikube ssh -- stat --file-system --format '%T' /
ramfs

afbjorklund on 24 Aug 2019

Huh. It looks like "ramfs" is what stat calls "rootfs" (or rather, initramfs). Fundamentally both the df and stat solution are using the same syscall (statfs(2)) and checking what the filesystem magic number is. Arguably "ramfs" is the correct name, given the filesystem magic number is called RAMFS_MAGIC.
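
Given that df and stat report different names for the same filesystem, a check for the old ISO would have to accept both. A minimal sketch, with a made-up function name; it assumes that a root reported as "ramfs" is always the initial rootfs, which holds for the minikube ISO discussed here but not necessarily elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the given root filesystem type means
# that pivot_root cannot be used, so that --no-pivot / DOCKER_RAMDISK is
# still needed.
# $1: type name as printed by `df --output=fstype /` or
#     `stat --file-system --format '%T' /`.
needs_no_pivot() {
    case "$1" in
        rootfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `needs_no_pivot "$(stat --file-system --format '%T' /)" && echo "keep DOCKER_RAMDISK"`.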

cyphar on 25 Aug 2019

This appears to be fixed at head. Please re-open if I am mistaken:

$ stat --file-system --format '%T' /
tmpfs

tstromberg on 3 Sep 2019
