Type | Version/Name
--- | ---
Distribution Name | NixOS
Distribution Version | 18.03.132915.d6c6c7fcec6
Linux Kernel | 4.14.56
Architecture | x86_64
ZFS Version | 0.7.0-1
SPL Version | 0.7.0-1
Exact commit | ba863d0be4cbfbea938b10e49fb6ff459ac9ec20
Declaration of complete build process: https://github.com/NixOS/nixpkgs/blob/d6c6c7fcec6dbd2b8ab14f0b35d56c7733872baa/pkgs/os-specific/linux/zfs/default.nix
On shutdown, the system hangs indefinitely after trying to unmount zfs shares.
According to logs, the kernel panics.
Simply shut down - happens every time.
Jul 26 17:00:39 nixos systemd[1]: Unmounting /var/lib...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /run/keys...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /boot...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /tmp...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /home...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /run/wrappers...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /run/user/1000...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /zroot...
Jul 26 17:00:39 nixos systemd[1]: Stopped Update UTMP about System Boot/Shutdown.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /run/keys.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /run/user/1000.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /run/wrappers.
Jul 26 17:00:39 nixos systemd[1]: Stopped target Swap.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /tmp.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /tmp.
Jul 26 17:00:39 nixos kernel: VERIFY3(0 == zap_increment(os, (-1ULL), uqn->uqn_id, uqn->uqn_delta, tx)) failed (0 == 13)
Jul 26 17:00:39 nixos kernel: PANIC at dmu_objset.c:1844:do_userquota_cacheflush()
Jul 26 17:00:39 nixos kernel: Showing stack for process 274
Jul 26 17:00:39 nixos kernel: CPU: 1 PID: 274 Comm: dp_sync_taskq Tainted: P O 4.14.56 #1-NixOS
Jul 26 17:00:39 nixos kernel: Hardware name: MSI MS-7821/Z87-G55 (MS-7821), BIOS V10.7 07/19/2014
Jul 26 17:00:39 nixos kernel: Call Trace:
Jul 26 17:00:39 nixos kernel: dump_stack+0x5c/0x85
Jul 26 17:00:39 nixos kernel: spl_panic+0xc8/0x110 [spl]
Jul 26 17:00:39 nixos kernel: ? zap_lockdir+0x37/0x90 [zfs]
Jul 26 17:00:39 nixos kernel: ? zap_lookup_norm+0x40/0xa0 [zfs]
Jul 26 17:00:39 nixos kernel: ? zap_lookup+0x12/0x20 [zfs]
Jul 26 17:00:39 nixos kernel: userquota_updates_task+0x2df/0x5b0 [zfs]
Jul 26 17:00:39 nixos kernel: ? put_prev_entity+0x2c/0x4a0
Jul 26 17:00:39 nixos kernel: ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:00:39 nixos kernel: ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:00:39 nixos kernel: ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:00:39 nixos kernel: taskq_thread+0x267/0x470 [spl]
Jul 26 17:00:39 nixos kernel: ? wake_up_q+0x70/0x70
Jul 26 17:00:39 nixos kernel: ? taskq_thread_should_stop+0x70/0x70 [spl]
Jul 26 17:00:39 nixos kernel: kthread+0x11a/0x130
Jul 26 17:00:39 nixos kernel: ? kthread_create_on_node+0x40/0x40
Jul 26 17:00:39 nixos kernel: ret_from_fork+0x35/0x40
Jul 26 17:00:39 nixos systemd[1]: Unmounted /boot.
Jul 26 17:02:09 nixos systemd[1]: var-lib.mount: Unmounting timed out. Terminating.
Jul 26 17:02:09 nixos systemd[1]: zroot.mount: Unmounting timed out. Terminating.
Jul 26 17:02:09 nixos systemd[1]: home.mount: Unmounting timed out. Terminating.
Jul 26 17:03:40 nixos systemd[1]: var-lib.mount: Mount process timed out. Killing.
Jul 26 17:03:40 nixos systemd[1]: var-lib.mount: Killing process 19017 (umount) with signal SIGKILL.
Jul 26 17:03:40 nixos systemd[1]: zroot.mount: Mount process timed out. Killing.
Jul 26 17:03:40 nixos systemd[1]: zroot.mount: Killing process 19025 (umount) with signal SIGKILL.
Jul 26 17:03:40 nixos systemd[1]: home.mount: Mount process timed out. Killing.
Jul 26 17:03:40 nixos systemd[1]: home.mount: Killing process 19022 (umount) with signal SIGKILL.
Jul 26 17:03:44 nixos kernel: INFO: task dp_sync_taskq:273 blocked for more than 120 seconds.
Jul 26 17:03:44 nixos kernel: Tainted: P O 4.14.56 #1-NixOS
Jul 26 17:03:44 nixos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 17:03:44 nixos kernel: dp_sync_taskq D 0 273 2 0x80000000
Jul 26 17:03:44 nixos kernel: Call Trace:
Jul 26 17:03:44 nixos kernel: ? __schedule+0x1a2/0x6c0
Jul 26 17:03:44 nixos kernel: schedule+0x28/0x80
Jul 26 17:03:44 nixos kernel: schedule_preempt_disabled+0xa/0x10
Jul 26 17:03:44 nixos kernel: __mutex_lock.isra.2+0x17d/0x4b0
Jul 26 17:03:44 nixos kernel: ? do_userobjquota_update.part.15+0xd5/0xe0 [zfs]
Jul 26 17:03:44 nixos kernel: ? userquota_updates_task+0x288/0x5b0 [zfs]
Jul 26 17:03:44 nixos kernel: userquota_updates_task+0x288/0x5b0 [zfs]
Jul 26 17:03:44 nixos kernel: ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:03:44 nixos kernel: ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:03:44 nixos kernel: ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:03:44 nixos kernel: taskq_thread+0x267/0x470 [spl]
Jul 26 17:03:44 nixos kernel: ? wake_up_q+0x70/0x70
% zfs get all zroot/crypt/tmp
NAME PROPERTY VALUE SOURCE
zroot/crypt/tmp type filesystem -
zroot/crypt/tmp creation Mon May 21 17:55 2018 -
zroot/crypt/tmp used 1.37G -
zroot/crypt/tmp available 370G -
zroot/crypt/tmp referenced 1.37G -
zroot/crypt/tmp compressratio 1.00x -
zroot/crypt/tmp mounted yes -
zroot/crypt/tmp quota none default
zroot/crypt/tmp reservation none default
zroot/crypt/tmp recordsize 128K default
zroot/crypt/tmp mountpoint legacy local
zroot/crypt/tmp sharenfs off default
zroot/crypt/tmp checksum on default
zroot/crypt/tmp compression off default
zroot/crypt/tmp atime on default
zroot/crypt/tmp devices on default
zroot/crypt/tmp exec on default
zroot/crypt/tmp setuid on default
zroot/crypt/tmp readonly off default
zroot/crypt/tmp zoned off default
zroot/crypt/tmp snapdir hidden default
zroot/crypt/tmp aclinherit restricted default
zroot/crypt/tmp createtxg 16 -
zroot/crypt/tmp canmount on default
zroot/crypt/tmp xattr on default
zroot/crypt/tmp copies 1 default
zroot/crypt/tmp version 5 -
zroot/crypt/tmp utf8only off -
zroot/crypt/tmp normalization none -
zroot/crypt/tmp casesensitivity sensitive -
zroot/crypt/tmp vscan off default
zroot/crypt/tmp nbmand off default
zroot/crypt/tmp sharesmb off default
zroot/crypt/tmp refquota none default
zroot/crypt/tmp refreservation none default
zroot/crypt/tmp guid 7632425401047322315 -
zroot/crypt/tmp primarycache all default
zroot/crypt/tmp secondarycache all default
zroot/crypt/tmp usedbysnapshots 0B -
zroot/crypt/tmp usedbydataset 1.37G -
zroot/crypt/tmp usedbychildren 0B -
zroot/crypt/tmp usedbyrefreservation 0B -
zroot/crypt/tmp logbias latency default
zroot/crypt/tmp dedup off default
zroot/crypt/tmp mlslabel none default
zroot/crypt/tmp sync disabled local
zroot/crypt/tmp dnodesize legacy default
zroot/crypt/tmp refcompressratio 1.00x -
zroot/crypt/tmp written 1.37G -
zroot/crypt/tmp logicalused 1.23G -
zroot/crypt/tmp logicalreferenced 1.23G -
zroot/crypt/tmp volmode default default
zroot/crypt/tmp filesystem_limit none default
zroot/crypt/tmp snapshot_limit none default
zroot/crypt/tmp filesystem_count none default
zroot/crypt/tmp snapshot_count none default
zroot/crypt/tmp snapdev hidden default
zroot/crypt/tmp acltype off default
zroot/crypt/tmp context none default
zroot/crypt/tmp fscontext none default
zroot/crypt/tmp defcontext none default
zroot/crypt/tmp rootcontext none default
zroot/crypt/tmp relatime on temporary
zroot/crypt/tmp redundant_metadata all default
zroot/crypt/tmp overlay off default
zroot/crypt/tmp encryption aes-256-gcm -
zroot/crypt/tmp keylocation none default
zroot/crypt/tmp keyformat passphrase -
zroot/crypt/tmp pbkdf2iters 342K -
zroot/crypt/tmp encryptionroot zroot/crypt -
zroot/crypt/tmp keystatus available -
zroot/crypt/tmp nixos:shutdown-time Sat Jul 28 11:58:50 CEST 2018 inherited from zroot
I have not enabled quota on any volume:
% zfs get quota
NAME PROPERTY VALUE SOURCE
zroot quota none default
zroot/crypt quota none default
zroot/crypt/home quota none default
zroot/crypt/home@--head-- quota - -
zroot/crypt/root quota none default
zroot/crypt/tmp quota none default
zroot/crypt/var-lib quota none default
If I can provide any additional data, please tell me.
Thanks for reporting this. Just FYI quota accounting in ZFS is always enabled, it's only the enforcement which can be enabled/disabled.
What's a little surprising about this failure is the error reported by the VERIFY is 13 = EACCESS. By this point in the code most permission checks will already have been done. One exception is that EACCESS may be returned from the zio layer if the encryption keys aren't available (see commit cd32e5db8bb4411533dc21cfc12a5af5f3b56701). I see you're running a version of ZFS with encryption but without some of the follow up fixes which handle some corner cases like this. If possible could you upgrade to the latest ZFS master and see if you can still reproduce the issue.
Let's get @tcaputi's thoughts on this as well since I suspect it is encryption related.
I agree with Brian. We've made a few fixes since the version that you have came out. I would suggest installing the newest version, rebooting / reinserting the kernel module, and seeing if the problem persists. If it does, reply here and we'll take a closer look.
I will do that in the next few days and report back :)
Just for the record, to be sure: upgrading & possibly loading it with my old kernel later on will currently work without any corruption?
None of this code is currently in a stable release, and we make no guarantees about code that is just on master. That being said, I am fairly certain that we have done nothing to break the on-disk format of encryption since ba863d0 was released and I dont think any patches we have made could cause permanent errors that couldn't be undone by reverting to an older zs version.
If you are feeling concerned I would recommend making a backup of any data you can't afford to lose.
This may go without saying. But I'd just add to @tcaputi's comment that if you plan to rollback to the older version do not run zpool upgrade. This will enable new features which won't be understood by the older code.
Okay, understood :) Will do some testing & report back!
I've switched to 3d503a76e890d7711d5e906e025e092d0e244211
and now I got
Aug 03 23:02:04 nixos systemd[1]: Unmounting /zroot...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /boot...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /var/lib...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /home...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /run/user/1000...
Aug 03 23:02:05 nixos systemd[1]: Unmounting /run/keys...
Aug 03 23:02:05 nixos systemd[1]: Unmounting /run/wrappers...
Aug 03 23:02:05 nixos systemd[1]: Unmounting /tmp...
Aug 03 23:02:05 nixos systemd[1]: Stopped Update UTMP about System Boot/Shutdown.
Aug 03 23:02:05 nixos systemd[1]: Unmounted /run/keys.
Aug 03 23:02:05 nixos systemd[1]: Unmounted /boot.
Aug 03 23:02:05 nixos kernel: VERIFY3(0 == spa_do_crypt_objset_mac_abd(B_TRUE, spa, dsobj, zio->io_abd, psize, (((((bp)->blk_

prop) >> (63)) & ((1ULL << (1)) - 1)) != (1ULL)))) failed (0 == 2)
at shutdown.
Funny thing is, there were more logs that didn't make it to my log file, I'm adding a screen photo.
(Photo was taken one shutdown later and actually shows a different kernel panic?)
I'll take a look on Monday, but we haven't seen anything like this in a long time.... Were the filesystems created by zfs receive? If so what flags did you use?
No, they were created from scratch.
I was loosely following this wiki entry:
@phryneas We are having a hard time getting NixOS up and running to do any tests. Do you have a nix configuration file that we can use to get started?
@tcaputi https://gist.github.com/phryneas/7ef5eb97e35d973eb1e7c2a8c9590153
I can also give you access to a completely set up hetzner cloud VM (including virtual console, snapshots, rollback etc.) - you'd have to contact me via mail for that. See my contact info at https://phryneas.de
PS: to clarify this: the hetzner cloud vm is set up quite similar to my system, but I haven't been able to reproduce that behaviour there yet :/
I think I have located the problem after spending some time with the affected machine. The problem only seems to effect one dataset, which is has the sync property set to disabled. I will see if I can reproduce the problem locally and get a fix implemented.
I am now able to reproduce the problem. It only happens with sync=disabled and when doing a readonly remount on the filesystem before umounting. I should be able to get a fix sometime today.
Pull request has been opened with a fix. @phryneas Would you mind checking if this solves your issue?
Will do this evening - thanks for all the work you've put into this so far!
I posted this on the PR, but it seems like the work isn't completely done.... Although my reproducer is fixed, your VM is still having problems. I will continue investigating tomorrow.
@phryneas I just pushed a commit to the PR that I think will fix the issue. I will have time to test it tomorrow, which should hopefully be the last time I need your VM's. Thank you for your patience in resolving this issue.
Cool! I'll give this a test on my bare metal machine (that initially had the problem) tomorrow :)
@phryneas I tested the patch against your VM 10 times without issue. Let me know if you have more problems on your bare metal machine.
I've given it a few reboots and so far it shut down nicely every time. Seems to work :+1:
I'll take down the VMs, now that this is merged. Again, thanks for your time and hard work @tcaputi :1st_place_medal:
Thanks for letting me use the VM's to debug the issue.