Zfs: kernel: PANIC at dmu_objset.c:1844:do_userquota_cacheflush()

Created on 28 Jul 2018  路  22Comments  路  Source: openzfs/zfs

System information

Type | Version/Name
--- | ---
Distribution Name | NixOS
Distribution Version | 18.03.132915.d6c6c7fcec6
Linux Kernel | 4.14.56
Architecture | x86_64
ZFS Version | 0.7.0-1
SPL Version | 0.7.0-1
Exact commit | ba863d0be4cbfbea938b10e49fb6ff459ac9ec20

Declaration of complete build process: https://github.com/NixOS/nixpkgs/blob/d6c6c7fcec6dbd2b8ab14f0b35d56c7733872baa/pkgs/os-specific/linux/zfs/default.nix

Describe the problem you're observing

On shutdown, the system hangs indefinitely after trying to unmount zfs shares.

According to logs, the kernel panics.

Describe how to reproduce the problem

Simply shut down - happens every time.

Include any warning/errors/backtraces from the system logs

Jul 26 17:00:39 nixos systemd[1]: Unmounting /var/lib...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /run/keys...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /boot...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /tmp...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /home...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /run/wrappers...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /run/user/1000...
Jul 26 17:00:39 nixos systemd[1]: Unmounting /zroot...
Jul 26 17:00:39 nixos systemd[1]: Stopped Update UTMP about System Boot/Shutdown.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /run/keys.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /run/user/1000.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /run/wrappers.
Jul 26 17:00:39 nixos systemd[1]: Stopped target Swap.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /tmp.
Jul 26 17:00:39 nixos systemd[1]: Unmounted /tmp.
Jul 26 17:00:39 nixos kernel: VERIFY3(0 == zap_increment(os, (-1ULL), uqn->uqn_id, uqn->uqn_delta, tx)) failed (0 == 13)
Jul 26 17:00:39 nixos kernel: PANIC at dmu_objset.c:1844:do_userquota_cacheflush()
Jul 26 17:00:39 nixos kernel: Showing stack for process 274
Jul 26 17:00:39 nixos kernel: CPU: 1 PID: 274 Comm: dp_sync_taskq Tainted: P           O    4.14.56 #1-NixOS
Jul 26 17:00:39 nixos kernel: Hardware name: MSI MS-7821/Z87-G55 (MS-7821), BIOS V10.7 07/19/2014
Jul 26 17:00:39 nixos kernel: Call Trace:
Jul 26 17:00:39 nixos kernel:  dump_stack+0x5c/0x85
Jul 26 17:00:39 nixos kernel:  spl_panic+0xc8/0x110 [spl]
Jul 26 17:00:39 nixos kernel:  ? zap_lockdir+0x37/0x90 [zfs]
Jul 26 17:00:39 nixos kernel:  ? zap_lookup_norm+0x40/0xa0 [zfs]
Jul 26 17:00:39 nixos kernel:  ? zap_lookup+0x12/0x20 [zfs]
Jul 26 17:00:39 nixos kernel:  userquota_updates_task+0x2df/0x5b0 [zfs]
Jul 26 17:00:39 nixos kernel:  ? put_prev_entity+0x2c/0x4a0
Jul 26 17:00:39 nixos kernel:  ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:00:39 nixos kernel:  ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:00:39 nixos kernel:  ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:00:39 nixos kernel:  taskq_thread+0x267/0x470 [spl]
Jul 26 17:00:39 nixos kernel:  ? wake_up_q+0x70/0x70
Jul 26 17:00:39 nixos kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Jul 26 17:00:39 nixos kernel:  kthread+0x11a/0x130
Jul 26 17:00:39 nixos kernel:  ? kthread_create_on_node+0x40/0x40
Jul 26 17:00:39 nixos kernel:  ret_from_fork+0x35/0x40
Jul 26 17:00:39 nixos systemd[1]: Unmounted /boot.
Jul 26 17:02:09 nixos systemd[1]: var-lib.mount: Unmounting timed out. Terminating.
Jul 26 17:02:09 nixos systemd[1]: zroot.mount: Unmounting timed out. Terminating.
Jul 26 17:02:09 nixos systemd[1]: home.mount: Unmounting timed out. Terminating.
Jul 26 17:03:40 nixos systemd[1]: var-lib.mount: Mount process timed out. Killing.
Jul 26 17:03:40 nixos systemd[1]: var-lib.mount: Killing process 19017 (umount) with signal SIGKILL.
Jul 26 17:03:40 nixos systemd[1]: zroot.mount: Mount process timed out. Killing.
Jul 26 17:03:40 nixos systemd[1]: zroot.mount: Killing process 19025 (umount) with signal SIGKILL.
Jul 26 17:03:40 nixos systemd[1]: home.mount: Mount process timed out. Killing.
Jul 26 17:03:40 nixos systemd[1]: home.mount: Killing process 19022 (umount) with signal SIGKILL.
Jul 26 17:03:44 nixos kernel: INFO: task dp_sync_taskq:273 blocked for more than 120 seconds.
Jul 26 17:03:44 nixos kernel:       Tainted: P           O    4.14.56 #1-NixOS
Jul 26 17:03:44 nixos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 17:03:44 nixos kernel: dp_sync_taskq   D    0   273      2 0x80000000
Jul 26 17:03:44 nixos kernel: Call Trace:
Jul 26 17:03:44 nixos kernel:  ? __schedule+0x1a2/0x6c0
Jul 26 17:03:44 nixos kernel:  schedule+0x28/0x80
Jul 26 17:03:44 nixos kernel:  schedule_preempt_disabled+0xa/0x10
Jul 26 17:03:44 nixos kernel:  __mutex_lock.isra.2+0x17d/0x4b0
Jul 26 17:03:44 nixos kernel:  ? do_userobjquota_update.part.15+0xd5/0xe0 [zfs]
Jul 26 17:03:44 nixos kernel:  ? userquota_updates_task+0x288/0x5b0 [zfs]
Jul 26 17:03:44 nixos kernel:  userquota_updates_task+0x288/0x5b0 [zfs]
Jul 26 17:03:44 nixos kernel:  ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:03:44 nixos kernel:  ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:03:44 nixos kernel:  ? dmu_objset_is_dirty+0x20/0x20 [zfs]
Jul 26 17:03:44 nixos kernel:  taskq_thread+0x267/0x470 [spl]
Jul 26 17:03:44 nixos kernel:  ? wake_up_q+0x70/0x70
% zfs get all zroot/crypt/tmp 
NAME             PROPERTY              VALUE                          SOURCE
zroot/crypt/tmp  type                  filesystem                     -
zroot/crypt/tmp  creation              Mon May 21 17:55 2018          -
zroot/crypt/tmp  used                  1.37G                          -
zroot/crypt/tmp  available             370G                           -
zroot/crypt/tmp  referenced            1.37G                          -
zroot/crypt/tmp  compressratio         1.00x                          -
zroot/crypt/tmp  mounted               yes                            -
zroot/crypt/tmp  quota                 none                           default
zroot/crypt/tmp  reservation           none                           default
zroot/crypt/tmp  recordsize            128K                           default
zroot/crypt/tmp  mountpoint            legacy                         local
zroot/crypt/tmp  sharenfs              off                            default
zroot/crypt/tmp  checksum              on                             default
zroot/crypt/tmp  compression           off                            default
zroot/crypt/tmp  atime                 on                             default
zroot/crypt/tmp  devices               on                             default
zroot/crypt/tmp  exec                  on                             default
zroot/crypt/tmp  setuid                on                             default
zroot/crypt/tmp  readonly              off                            default
zroot/crypt/tmp  zoned                 off                            default
zroot/crypt/tmp  snapdir               hidden                         default
zroot/crypt/tmp  aclinherit            restricted                     default
zroot/crypt/tmp  createtxg             16                             -
zroot/crypt/tmp  canmount              on                             default
zroot/crypt/tmp  xattr                 on                             default
zroot/crypt/tmp  copies                1                              default
zroot/crypt/tmp  version               5                              -
zroot/crypt/tmp  utf8only              off                            -
zroot/crypt/tmp  normalization         none                           -
zroot/crypt/tmp  casesensitivity       sensitive                      -
zroot/crypt/tmp  vscan                 off                            default
zroot/crypt/tmp  nbmand                off                            default
zroot/crypt/tmp  sharesmb              off                            default
zroot/crypt/tmp  refquota              none                           default
zroot/crypt/tmp  refreservation        none                           default
zroot/crypt/tmp  guid                  7632425401047322315            -
zroot/crypt/tmp  primarycache          all                            default
zroot/crypt/tmp  secondarycache        all                            default
zroot/crypt/tmp  usedbysnapshots       0B                             -
zroot/crypt/tmp  usedbydataset         1.37G                          -
zroot/crypt/tmp  usedbychildren        0B                             -
zroot/crypt/tmp  usedbyrefreservation  0B                             -
zroot/crypt/tmp  logbias               latency                        default
zroot/crypt/tmp  dedup                 off                            default
zroot/crypt/tmp  mlslabel              none                           default
zroot/crypt/tmp  sync                  disabled                       local
zroot/crypt/tmp  dnodesize             legacy                         default
zroot/crypt/tmp  refcompressratio      1.00x                          -
zroot/crypt/tmp  written               1.37G                          -
zroot/crypt/tmp  logicalused           1.23G                          -
zroot/crypt/tmp  logicalreferenced     1.23G                          -
zroot/crypt/tmp  volmode               default                        default
zroot/crypt/tmp  filesystem_limit      none                           default
zroot/crypt/tmp  snapshot_limit        none                           default
zroot/crypt/tmp  filesystem_count      none                           default
zroot/crypt/tmp  snapshot_count        none                           default
zroot/crypt/tmp  snapdev               hidden                         default
zroot/crypt/tmp  acltype               off                            default
zroot/crypt/tmp  context               none                           default
zroot/crypt/tmp  fscontext             none                           default
zroot/crypt/tmp  defcontext            none                           default
zroot/crypt/tmp  rootcontext           none                           default
zroot/crypt/tmp  relatime              on                             temporary
zroot/crypt/tmp  redundant_metadata    all                            default
zroot/crypt/tmp  overlay               off                            default
zroot/crypt/tmp  encryption            aes-256-gcm                    -
zroot/crypt/tmp  keylocation           none                           default
zroot/crypt/tmp  keyformat             passphrase                     -
zroot/crypt/tmp  pbkdf2iters           342K                           -
zroot/crypt/tmp  encryptionroot        zroot/crypt                    -
zroot/crypt/tmp  keystatus             available                      -
zroot/crypt/tmp  nixos:shutdown-time   Sat Jul 28 11:58:50 CEST 2018  inherited from zroot

I have not enabled quota on any volume:

% zfs get quota     
NAME                        PROPERTY  VALUE  SOURCE
zroot                       quota     none   default
zroot/crypt                 quota     none   default
zroot/crypt/home            quota     none   default
zroot/crypt/home@--head--   quota     -      -
zroot/crypt/root            quota     none   default
zroot/crypt/tmp             quota     none   default
zroot/crypt/var-lib         quota     none   default

If I can provide any additional data, please tell me.

All 22 comments

Thanks for reporting this. Just FYI quota accounting in ZFS is always enabled, it's only the enforcement which can be enabled/disabled.

What's a little surprising about this failure is the error reported by the VERIFY is 13 = EACCESS. By this point in the code most permission checks will already have been done. One exception is that EACCESS may be returned from the zio layer if the encryption keys aren't available (see commit cd32e5db8bb4411533dc21cfc12a5af5f3b56701). I see you're running a version of ZFS with encryption but without some of the follow up fixes which handle some corner cases like this. If possible could you upgrade to the latest ZFS master and see if you can still reproduce the issue.

Let's get @tcaputi's thoughts on this as well since I suspect it is encryption related.

I agree with Brian. We've made a few fixes since the version that you have came out. I would suggest installing the newest version, rebooting / reinserting the kernel module, and seeing if the problem persists. If it does, reply here and we'll take a closer look.

I will do that in the next few days and report back :)

Just for the record, to be sure: upgrading & possibly loading it with my old kernel later on will currently work without any corruption?

None of this code is currently in a stable release, and we make no guarantees about code that is just on master. That being said, I am fairly certain that we have done nothing to break the on-disk format of encryption since ba863d0 was released and I dont think any patches we have made could cause permanent errors that couldn't be undone by reverting to an older zs version.

If you are feeling concerned I would recommend making a backup of any data you can't afford to lose.

This may go without saying. But I'd just add to @tcaputi's comment that if you plan to rollback to the older version do not run zpool upgrade. This will enable new features which won't be understood by the older code.

Okay, understood :) Will do some testing & report back!

I've switched to 3d503a76e890d7711d5e906e025e092d0e244211

and now I got

Aug 03 23:02:04 nixos systemd[1]: Unmounting /zroot...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /boot...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /var/lib...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /home...
Aug 03 23:02:04 nixos systemd[1]: Unmounting /run/user/1000...
Aug 03 23:02:05 nixos systemd[1]: Unmounting /run/keys...
Aug 03 23:02:05 nixos systemd[1]: Unmounting /run/wrappers...
Aug 03 23:02:05 nixos systemd[1]: Unmounting /tmp...
Aug 03 23:02:05 nixos systemd[1]: Stopped Update UTMP about System Boot/Shutdown.
Aug 03 23:02:05 nixos systemd[1]: Unmounted /run/keys.
Aug 03 23:02:05 nixos systemd[1]: Unmounted /boot.
Aug 03 23:02:05 nixos kernel: VERIFY3(0 == spa_do_crypt_objset_mac_abd(B_TRUE, spa, dsobj, zio->io_abd, psize, (((((bp)->blk_
![img_20180803_230521](https://user-images.githubusercontent.com/4282439/43666151-ea461b9c-9772-11e8-95eb-131ff894a9d4.jpg)
prop) >> (63)) & ((1ULL << (1)) - 1)) != (1ULL)))) failed (0 == 2)

at shutdown.

Funny thing is, there were more logs that didn't make it to my log file, I'm adding a screen photo.
(Photo was taken one shutdown later and actually shows a different kernel panic?)

https://imgur.com/a/MyH4e2N

I'll take a look on Monday, but we haven't seen anything like this in a long time.... Were the filesystems created by zfs receive? If so what flags did you use?

No, they were created from scratch.

I was loosely following this wiki entry:

https://nixos.wiki/wiki/NixOS_on_ZFS#Encrypted_ZFS

@phryneas We are having a hard time getting NixOS up and running to do any tests. Do you have a nix configuration file that we can use to get started?

@tcaputi https://gist.github.com/phryneas/7ef5eb97e35d973eb1e7c2a8c9590153

I can also give you access to a completely set up hetzner cloud VM (including virtual console, snapshots, rollback etc.) - you'd have to contact me via mail for that. See my contact info at https://phryneas.de

PS: to clarify this: the hetzner cloud vm is set up quite similar to my system, but I haven't been able to reproduce that behaviour there yet :/

I think I have located the problem after spending some time with the affected machine. The problem only seems to effect one dataset, which is has the sync property set to disabled. I will see if I can reproduce the problem locally and get a fix implemented.

I am now able to reproduce the problem. It only happens with sync=disabled and when doing a readonly remount on the filesystem before umounting. I should be able to get a fix sometime today.

Pull request has been opened with a fix. @phryneas Would you mind checking if this solves your issue?

Will do this evening - thanks for all the work you've put into this so far!

I posted this on the PR, but it seems like the work isn't completely done.... Although my reproducer is fixed, your VM is still having problems. I will continue investigating tomorrow.

@phryneas I just pushed a commit to the PR that I think will fix the issue. I will have time to test it tomorrow, which should hopefully be the last time I need your VM's. Thank you for your patience in resolving this issue.

Cool! I'll give this a test on my bare metal machine (that initially had the problem) tomorrow :)

@phryneas I tested the patch against your VM 10 times without issue. Let me know if you have more problems on your bare metal machine.

I've given it a few reboots and so far it shut down nicely every time. Seems to work :+1:

I'll take down the VMs, now that this is merged. Again, thanks for your time and hard work @tcaputi :1st_place_medal:

Thanks for letting me use the VM's to debug the issue.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

user318 picture user318  路  51Comments

chrisrd picture chrisrd  路  65Comments

vbrik picture vbrik  路  108Comments

pwolny picture pwolny  路  102Comments

Lalufu picture Lalufu  路  70Comments