ZFS: Dracut mounting wrong

Created on 2 Apr 2017 · 30 comments · Source: openzfs/zfs

If I have a layout with /var and/or /root in their own datasets, then the mounting scheme in dracut (zpool import -N followed by zfs mount) fails, because between the import and the mounting, /var and /root are created and filled with files and/or directories. When it comes to mounting the datasets in the pool, /var and /root fail to mount because their mountpoints aren't empty.

I consider overlaying a poor solution to this, and avoiding putting /var and /root in separate datasets is no solution either.

The question is: why does the importing and mounting have to work this way? Why not just zpool import rpool without -N? If I have a monolithic file system (only one dataset with everything under it), then there is no point in importing and mounting separately. So why zpool import -N?
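For illustration, the failure described here looks roughly like this once the real root is running (a sketch; rpool/var is an example dataset name and the exact message may vary by version):

# In early boot, before zfs-mount.service has run, some service
# creates files under /var, which at that point still lives on the
# root dataset. When zfs-mount.service later tries to mount the
# real /var dataset:
zfs mount -a
# ZFS refuses to mount over the now non-empty directory:
# cannot mount '/var': directory is not empty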

All 30 comments

Dracut cannot run zpool import without the -N, because then the file systems would all be mounted onto the initramfs mountpoints, which is exactly what you don't want.

Dracut is, in fact, not in charge of mounting the file systems. All it does is mount your root file system. Nothing else.

I think what you believe Dracut does, and what Dracut actually does, are not at all the same.
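To sketch the distinction (rpool/ROOT is an example root dataset name; the actual dracut module's commands differ in detail):

# Without -N, the import would mount every dataset at its configured
# mountpoint inside the initramfs environment (/, /var, /home, ...):
zpool import rpool
# With -N, nothing is mounted; dracut then mounts only the root
# dataset where the real root is being assembled:
zpool import -N rpool
mount -t zfs -o zfsutil rpool/ROOT /sysroot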

Thanks for the explanation.

But it doesn't solve the problem. I can't have a separate /var or /root dataset because they are populated somewhere between import and intended mount, preventing the mount.

True. That would need to be diagnosed. What are they populated with? It also means that mounting the file systems should probably be ordered before whatever those tasks are.


/root is the simple case. It's left over from a running bash; I believe it's a .bash_history, probably solvable. /var is more complicated, because it seems to be files and directories systemd deems necessary for various services: cache, log and all the stuff that lives in /var.

I have /var working on a separate dataset with dracut and systemd. To do this, let the zfs-mount service mount the var dataset (after dracut) like normal (zfs set mountpoint=/var rpool/var). However, systemd has no idea that zfs-mount is mounting /var, so it is unable to order dependencies properly unless we give it some additional information. The easiest way I found to do this is to add the following line to /etc/fstab:

none    /var    none    fake,x-systemd.requires=zfs-mount.service   0 0

That will create a fake var.mount unit that depends on zfs-mount.service, so systemd will know to order services that depend on /var after zfs-mount. That will prevent services from cluttering up places like /var/cache and /var/log on the root dataset like you are seeing.

/var is the only filesystem I've found that I need to do this for, because it is the only one that system services typically depend on.
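If you use this trick, you can confirm that systemd actually picked up the ordering; for example (var.mount here is the unit systemd-fstab-generator creates from the fstab line above):

systemctl cat var.mount                                   # the unit generated from fstab
systemctl show -p Requires,After var.mount                # should list zfs-mount.service
systemctl list-dependencies --reverse zfs-mount.service   # units depending on zfs-mount (should include var.mount)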

This really should be fixed with a generator that runs after pivot_root and automatically creates mount units for all datasets. I used to have something like that, but it got lost in the great dracut rewrite when upstream and I synced our differences.


To be clear: the generator would:

  • start
  • run zfs list to find all the mountpoints it intends to mount
  • create proper mount units for each, with the proper dependencies, and
    with an extra dependency Before=zfs-mount.service
  • voilà, file systems mounted at the right time

This doesn't let us dispense with the zfs-mount.service unit, though, because the generator would only generate the units that are related to the root pool.

In the meantime I have devised a much simpler solution: in zfs-mount.service, run a script which does the zfs mount -a with an rm -rf /root /var beforehand. The whole dracut-ZFS-systemd business is pretty brittle and fragile, and I expect the non-atomic import/mount to fail at some other point even with a "proper" solution.

I see no fragility at all with zfs-dracut. It's been more than a year since I have seen any problems.


@jwezel, that may work, but do so at your own risk. Because systemd activates units in parallel, without being explicit about dependencies something could start and write to /var, which you then proceed to remove. systemd already accounts for /var being a separate filesystem; for example, basic.target specifies:

RequiresMountsFor=/var

But that expects there to be a var.mount unit if /var is a separate filesystem. So my workaround creates it; or you can try attacking it the way Rudd-O suggested (and I agree, his way is probably the correct solution).

In any case, none of this is related to Dracut, whose job, as Rudd-O pointed out, is simply to get / mounted and is doing that just fine.
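For comparison, an individual service can also be ordered after /var explicitly with a drop-in like the one below (example.service is a placeholder name). Note that this, like basic.target's RequiresMountsFor=/var, only helps once something, fstab or a generator, actually provides a var.mount unit, which is exactly the crux of this issue:

# /etc/systemd/system/example.service.d/require-var.conf
[Unit]
RequiresMountsFor=/var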

I just ran into issues like this in Fedora 26: /var would constantly fail to mount due to garbage being created before it is mounted.

The best solution I could find was to set mountpoint=legacy for /var and all its child datasets and mount them via /etc/fstab. But I found that the root dataset itself cannot have mountpoint=legacy, as the systemd generator explicitly adds the zfsutil option when mounting it, causing it to fail.
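For reference, that legacy setup looks roughly like this (dataset names are examples):

zfs set mountpoint=legacy rpool/var
zfs set mountpoint=legacy rpool/var/log

# /etc/fstab
rpool/var      /var      zfs  defaults  0 0
rpool/var/log  /var/log  zfs  defaults  0 0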

I can see a few different solutions that could help the situation. I've run a Btrfs root for quite a while and had no such problems, because it mounts subvolumes recursively and mounts over non-empty directories. To ensure reliable use of ZFS as root, at least one of those behaviors, if not both, is likely necessary.

I can see a few solutions, but given I am not a ZoL expert, I'm not sure which ones would work better, if at all.

  1. Don't mount only the root dataset explicitly at /sysroot, but the whole pool with the altroot set to /sysroot. I imagine there are reasons this is not done already, but I do not know them.
  2. Keep mounting the root dataset explicitly, but also mount child datasets manually, or at least ones known to be boot-essential such as /var, /etc and /usr. This should have the nice side effect of allowing mounting over non-empty directories.
  3. Document that having boot-essential directories as separate datasets requires manual fstab setup. Fix the behavior such that having the root with mountpoint=legacy does not break the boot (as it seems to do now, at least with dracut).

@danielkza, just curious if you tried my workaround of creating a fake /var mountpoint in /etc/fstab. Every service that writes to /var either explicitly expresses that it should come after /var is mounted with RequiresMountsFor=/var, or does so implicitly with Requires=local-fs.target. Using the fake /var mountpoint that requires zfs-mount.service causes all of those services to transitively depend on zfs-mount.service. Then you don't have to deal with legacy mounting or worrying about child datasets; everything will be taken care of by zfs-mount.service. I ask because this workaround has been working for me for two years across dozens of systems, including web servers, docker hosts, KVM hosts, laptops, and workstations.

I would just caution that, if you try this workaround and you have anything in /var on the root dataset, you will need to remove those files once from a live CD or dracut rescue environment first, so that zfs-mount is able to mount /var cleanly.

@iamjamestl I don't want to run zfs-mount.service early because I have a second non-root pool on spinning disks that takes about 1 minute to mount.

It would be really helpful if zfs mount could receive a pool as an argument, such that it would be possible to mount just the root pool and not all of them at once.

@danielkza We have that. zpool set cachefile=none on the pool you don't want mounted at boot. Then you can zpool import it manually (or in a custom service) later on in the boot process, or whenever you want.
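A custom service for the late-imported pool could look roughly like the following sketch (the pool name tank and the unit name are examples, and the zpool path may differ on your distribution):

# /etc/systemd/system/zpool-import-tank.service
[Unit]
Description=Import the slow data pool after boot
# Deliberately not ordered before local-fs.target, so a slow import
# does not hold up the rest of the boot.
After=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Add -d /dev/disk/by-id if the default device scan does not find the pool.
ExecStart=/sbin/zpool import tank

[Install]
WantedBy=multi-user.target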

@iamjamestl I currently have both pools with cachefile=none, so I had not considered using it to "gate" a pool from being mounted. But I just now realized that I can also set canmount=noauto on the slow pool and use a different service to mount it manually.

I'll try your approach and see how it goes.

The robust and correct solution to all of this is to generate mount units for all known root pool datasets that must be mounted, and later on, let zfs-mount.service finish the job for non-root pools. That way everything works.

This requires a bit more legwork in the generator I wrote, which I think I abandoned. But it's doable. Please, give me a hand, I'd be happy to guide you and send commits to fix bugs.

@Rudd-O I agree, every other way seems a bit too hacky.

I looked into the current generator, and unfortunately I don't see how to easily extend it to do what we want. It runs way too early, before the pool is even imported or udev has settled. Did you create a separate service to do the mounting? Or can we cheat a bit and manually start a "pseudo-generator" and do a systemctl reload once the pool is imported?

It seems I've found some success by simply creating a generator that does not reside on the initramfs. It just mounts all children of the root dataset after switching to the real root, which should work exactly like the mounts came from the fstab.

@iamjamestl @Rudd-O can you guys test it out and see if it works in your systems? Place the generator in /etc/systemd/system-generators/zfs-root-fs-generator.sh, creating any missing directories and ensuring it is executable. Then remove existing workarounds and reboot. Be sure to keep some backups of your current configs to restore a working state, of course :)

Bingo!

Will try it later today!

@kpande Sure, I'll take it to the mailing list. But the whole point of the generator is mounting child datasets of the root. The bootfs property only identifies what the root is; it is still mounted manually, non-recursively, by the dracut scripts.

I think it should be supported and I believe the generator approach is the correct one because it prevents junk from appearing in /var prior to the mounting of /var.

Agree, it should be supported and a systemd generator is the right solution. /var as a separate dataset is a supported configuration on Solaris 10 and 11 (an installation option, even), and it's a supported configuration with normal filesystems under systemd; RHEL 7 notably supports it with LVM and XFS. It is desirable on production systems to prevent the constantly changing and growing data in /var from filling up and affecting other filesystems, and vice versa; and speaking personally at least, I want a different snapshot policy on /var than on anything else due to its highly volatile nature.

Here's what I would think is the pie-in-the-sky ideal solution:

  • A generator that generates mount units for all mountable datasets that have a mountpoint property and are set to canmount. That is to say, all datasets that zfs mount would mount. As luck would have it, mount units get the necessary automatic dependencies to be ordered properly, thus things like /usr and /var can be mounted in parallel, but /var/lib/ gets mounted after /var. Potentially faster boots make everyone happy.
  • Each generated unit would start before zfs-mount.service, to give a chance to late-imported pools to be mounted.
  • zfs-mount.service is to be ordered prior to local-fs.target.

I used to have something like this, but I no longer carry that code in my branch, because the complexity of my systems decreased so I couldn't justify the maintenance. But, really, the generator is the correct solution.

This generator unit should never run in the initramfs — it should run exclusively after pivot_root, by systemd, the instant systemd starts. Technically, we may still depend on /dev/zfs so that the zfs command can query the already-imported datasets, as well as on some executables in /usr, so we may still need to mount a separate /usr by hand within the initial RAM disk prior to pivot_root. But other than that, it should be pretty trivial and straightforward to write a shell script that does this trick.

Caveat: we may also want to support stuff specified in fstab, such that we skip generating mount units for datasets being requested by fstab, and therefore the systemd-fstab-generator trumps us in that case.

That would be how I'd specify the solution.

A generator that generates mount units for all mountable datasets that have a mountpoint property and are set to canmount. That is to say, all datasets that zfs mount would mount. As luck would have it, mount units get the necessary automatic dependencies to be ordered properly, thus things like /usr and /var can be mounted in parallel, but /var/lib/ gets mounted after /var. Potentially faster boots make everyone happy.

That's what I attempted to achieve with the generator I posted earlier; did you have a chance to test it?

Each generated unit would start before zfs-mount.service, to give a chance to late-imported pools to be mounted.
zfs-mount.service is to be ordered prior to local-fs.target.

That indeed is the most robust approach IMO.

This generator unit should never run in the initramfs — it should run exclusively after pivot_root, by systemd, the instant systemd starts.

That is, fortunately, how generators are always set up. They are executed very early when systemd starts or re-execs (such as after pivoting root), while the units actually generated can get started much later.

Technically, we may still depend on /dev/zfs so that the zfs command can query the already-imported datasets, as well as on some executables in /usr, so we may still need to mount a separate /usr by hand within the initial RAM disk prior to pivot_root.

I believe systemd itself requires that a split /usr be mounted by the initramfs. It might be necessary to have a separate initramfs-included generator to achieve that.

Caveat: we may also want to support stuff specified in fstab, such that we skip generating mount units for datasets being requested by fstab, and therefore the systemd-fstab-generator trumps us in that case.

That can be achieved by generating the units in the "late" generator directory, which should not override units generated by any other means.
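Putting the pieces discussed above together, such a generator might look roughly like the following. This is a minimal sketch, not the project's shipped code: it assumes the root pool is already imported (and /dev/zfs present) by the time systemd re-executes after pivot_root, it writes to the late generator directory so fstab entries win, and the unit contents are illustrative only.

#!/bin/sh
# Hypothetical /etc/systemd/system-generators/zfs-root-fs-generator.sh
# Generators receive three output directories; writing to the "late"
# directory (argv[3]) lets units from systemd-fstab-generator in the
# normal directory take precedence for anything listed in fstab.
GENDIR="$3"
[ -n "$GENDIR" ] || exit 0
command -v zfs >/dev/null 2>&1 || exit 0
[ -e /dev/zfs ] || exit 0

mkdir -p "$GENDIR/local-fs.target.wants"

zfs list -H -t filesystem -o name,mountpoint,canmount |
while IFS="$(printf '\t')" read -r name mountpoint canmount; do
    # Only datasets that "zfs mount -a" would mount itself.
    [ "$canmount" = "on" ] || continue
    case "$mountpoint" in
        none|legacy|-|/) continue ;;   # skip unmountable datasets and /, which dracut mounts
    esac
    unit="$(systemd-escape --path --suffix=mount "$mountpoint")"
    cat > "$GENDIR/$unit" <<EOF
[Unit]
Documentation=man:zfs(8)
Before=zfs-mount.service local-fs.target

[Mount]
What=$name
Where=$mountpoint
Type=zfs
Options=zfsutil
EOF
    # Pull the mount in as part of local-fs.target.
    ln -sf "../$unit" "$GENDIR/local-fs.target.wants/$unit"
done
exit 0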

It seems I've found some success by simply creating a generator that does not reside on the initramfs. It just mounts all children of the root dataset after switching to the real root, which should work exactly like the mounts came from the fstab.
@iamjamestl @Rudd-O can you guys test it out and see if it works in your systems? Place the generator in /etc/systemd/system-generators/zfs-root-fs-generator.sh, creating any missing directories and ensuring it is executable. Then remove existing workarounds and reboot. Be sure to keep some backups of your current configs to restore a working state, of course :)

Doesn't work for me. Still junk in /var. Although I must state that my FS structure is a bit more elaborate:

NAME                               USED  AVAIL  REFER  MOUNTPOINT
vol                               27.3G   864G    96K  none
vol/data                          1.82G   864G    96K  none
vol/data/home                     1.55G   864G  1.55G  /home
vol/data/root                     1.02M   864G  1.02M  /root
vol/data/tmp                       278M   864G    96K  none
vol/data/tmp/var                   278M   864G   278M  /var/tmp
vol/system                        25.4G   864G    96K  none
vol/system/static                 20.0G   864G    96K  none
vol/system/static/root            20.0G   864G  20.0G  /
vol/system/var                    5.43G   864G    96K  none
vol/system/var/pkg                 100M   864G   100M  /var/db/pkg
vol/system/var/portage            5.22G   864G    96K  none
vol/system/var/portage/distfiles  4.44G   864G  4.44G  /usr/portage/distfiles
vol/system/var/portage/etc         188K   864G   188K  /etc/portage
vol/system/var/portage/usr         807M   864G   807M  /usr/portage
vol/system/var/var                 105M   864G   105M  /var

You might see its merits only after studying it for a bit. I think such a structure should be possible.

@jwezel As I originally posted it, the generator only mounts datasets that are children of the root dataset. I updated the gist with a version without that limitation, if you want to try it again. Can you show me the output of the generator in the kernel log after doing systemctl daemon-reload?

Yesss, _much_ better.

However, now there is a problem with unmounting /var, which apparently prevents exporting the pool and results in a failure to import it (requiring -f) on the next boot. I can't find a script or service dealing with unmounting, so I assume systemd is taking care of it itself.

This probably means there are wrong dependencies in the autogenerated mount units (systemctl show <unit> after boot will show the dependencies as they were loaded). If the dependencies are correct (DefaultDependencies=no, Before=local-fs.target, Conflicts=shutdown.target, et cetera), then /var should be unmounted on shutdown in the exact opposite order it was mounted on boot, which means it should be unmounted after all file systems below /var have been unmounted, and after all applications have stopped writing to /var. If the systemd mount unit, because of a dependency error, tries to unmount /var before those conditions have been met, then the unmount phase will fail, and the shutdown is guaranteed to be unclean.

I would need to see the output of systemctl show var.mount to see what's going on. Clean shutdowns are extremely important. Processes should not be holding on to stuff in /var when shutdown has reached the point of unmounting datasets.
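Something along these lines should show enough to diagnose it (standard systemctl/journalctl invocations; -b -1 selects the previous boot):

systemctl show var.mount -p Before,After,Conflicts,Requires,RequiredBy,WantedBy,DefaultDependencies
journalctl -b -1 -u var.mount    # what happened to /var during the previous shutdown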

@jwezel where's the latest Gist?

What version of dracut contains this fix? Also, if this isn't fixed, why not just use zpool import -R /sysroot to import the pool? That should mount everything in the pool relative to /sysroot.
