ZFS: list_del corruption. next->prev should be XXX, but was dead000000000200 and crash in userquota_updates_task

Created on 7 Oct 2018 · 17 comments · Source: openzfs/zfs

This is similar to #7933 on the latest RHEL 7.5 kernel with 0.7.11 checked out from the git tree, but I am hitting it with Lustre testing, so I guess it is materially different.

Typically this hits during recovery testing, shortly after the filesystem is brought back up after recovery: first the list corruption warning, then a crash in userquota_updates_task stepping on a bad pointer:

[ 6366.191628] Lustre: DEBUG MARKER: centos-52.localnet: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[ 6366.457319] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
[ 6375.281988] ------------[ cut here ]------------
[ 6375.294283] WARNING: CPU: 4 PID: 1098 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
[ 6375.301869] list_del corruption. next->prev should be ffff8800af0f7cf8, but was dead000000000200
[ 6375.304407] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic drm_kms_helper pata_acpi drm ata_piix i2c_piix4 libata serio_raw pcspkr virtio_console i2c_core virtio_balloon virtio_blk floppy ip_tables [last unloaded: libcfs]
[ 6375.317889] CPU: 4 PID: 1098 Comm: dp_sync_taskq Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.5-debug2 #1
[ 6375.320240] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6375.321550] Call Trace:
[ 6375.326211]  [<ffffffff817744da>] dump_stack+0x19/0x1b
[ 6375.331553]  [<ffffffff81085798>] __warn+0xd8/0x100
[ 6375.332977]  [<ffffffff8108581f>] warn_slowpath_fmt+0x5f/0x80
[ 6375.334897]  [<ffffffff81779e4c>] ? __mutex_unlock_slowpath+0xdc/0x190
[ 6375.336225]  [<ffffffff813d42b2>] __list_del_entry+0x82/0xd0
[ 6375.338071]  [<ffffffff813d430d>] list_del+0xd/0x30
[ 6375.340917]  [<ffffffffa0b03945>] multilist_sublist_remove+0x15/0x20 [zfs]
[ 6375.343412]  [<ffffffffa0acc472>] userquota_updates_task+0x1d2/0x570 [zfs]
[ 6375.346302]  [<ffffffff810b15cb>] ? autoremove_wake_function+0x2b/0x40
[ 6375.348659]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.351962]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.355529]  [<ffffffffa04ffda7>] taskq_thread+0x2a7/0x4f0 [spl]
[ 6375.357697]  [<ffffffff810c5740>] ? wake_up_state+0x20/0x20
[ 6375.362061]  [<ffffffffa04ffb00>] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 6375.366040]  [<ffffffff810b0504>] kthread+0xe4/0xf0
[ 6375.368557]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.401413]  [<ffffffff817886b7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6375.429346]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.430864] ---[ end trace 1a66eb165735d7b7 ]---
[ 6375.443974] BUG: unable to handle kernel paging request at ffff8800af0f7f88
[ 6375.444109] IP: [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]
[ 6375.447010] PGD 23e4067 PUD 33fa01067 PMD 33f888067 PTE 80000000af0f7060
[ 6375.447244] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 6375.447244] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic drm_kms_helper pata_acpi drm ata_piix i2c_piix4 libata serio_raw pcspkr virtio_console i2c_core virtio_balloon virtio_blk floppy ip_tables [last unloaded: libcfs]
[ 6375.447244] CPU: 4 PID: 1098 Comm: dp_sync_taskq Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.5-debug2 #1
[ 6375.447244] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6375.447244] task: ffff88029df2e840 ti: ffff880298538000 task.ti: ffff880298538000
[ 6375.447244] RIP: 0010:[<ffffffffa0acc495>]  [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]
[ 6375.447244] RSP: 0018:ffff88029853bc58  EFLAGS: 00010286
[ 6375.447244] RAX: ffff8800af0f7c00 RBX: ffff8802f8c23800 RCX: 0000000000000000
[ 6375.447244] RDX: ffff8800947bd738 RSI: ffff8802d4be7e68 RDI: ffff8800947bd6c0
[ 6375.481303] RBP: ffff88029853bd28 R08: 0000000000000000 R09: 0000000000000067
[ 6375.481303] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800af0f7d08
[ 6375.481303] R13: ffff8800af0f7d50 R14: ffff8800947bd6c0 R15: ffff8800af0f7c00
[ 6375.481303] FS:  0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000
[ 6375.481303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6375.481303] CR2: ffff8800af0f7f88 CR3: 00000000ba024000 CR4: 00000000000006e0
[ 6375.481303] Call Trace:
[ 6375.481303]  [<ffffffff810b15cb>] ? autoremove_wake_function+0x2b/0x40
[ 6375.481303]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.481303]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.481303]  [<ffffffffa04ffda7>] taskq_thread+0x2a7/0x4f0 [spl]
[ 6375.481303]  [<ffffffff810c5740>] ? wake_up_state+0x20/0x20
[ 6375.481303]  [<ffffffffa04ffb00>] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 6375.481303]  [<ffffffff810b0504>] kthread+0xe4/0xf0
[ 6375.481303]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.481303]  [<ffffffff817886b7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6375.481303]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.481303] Code: 4c 89 f7 e8 be 74 03 00 48 8b b3 f8 03 00 00 4c 89 ff e8 7f ec 00 00 4c 89 f7 e8 c7 74 03 00 48 85 c0 49 89 c7 0f 84 5b 01 00 00 <45> 8b af 88 03 00 00 41 f6 c5 04 0f 84 d8 fe ff ff 49 8b 87 60 
[ 6375.481303] RIP  [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]

I have crash dumps and can reproduce this reasonably easily if there are any ideas for fixes.

All 17 comments

@verygreen to help narrow this down, can you confirm you weren't observing this issue with, say, 0.7.9 or an earlier 0.7.x tag?

Well, due to some oversight the previous release I was on was some checkout reporting itself as 0.7.0, though after some double-checking the last commit was from the end of May 2018, git hash 1a5b96b. That checkout did not exhibit any problems. I can go back to an actual 0.7.9 to make sure, if you think that would be helpful.

Just to make it clearer, hopefully: I hit this issue on the 0.7.11 tag; I did not hit it on the checkout of 1a5b96b. We also don't see any signs of this in our testing on 0.7.9, but that does not mean too much, since apparently our test run for 0.7.11 in the standard test environment did not hit this either.

@verygreen could you give us some more detail on how to reproduce this? I'd like to see if I can hit it on our Lustre test cluster.

I have a bunch of VMs running various Lustre tests in a loop. The ones that hit these failures are:

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT=101 SLOW=yes REFORMAT=yes bash recovery-small.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; SLOW=yes REFORMAT=yes bash replay-single.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-ost-single.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-dual.sh ; bash llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT="32 36 67 76 78 102" SLOW=yes REFORMAT=yes bash conf-sanity.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; for i inseq 0 7; do losetup -d /dev/loop$i ; done ; done

For best results, have your kernel built with DEBUG_PAGEALLOC so that accesses to freed memory lead to an immediate crash (and you don't have to just monitor for the error messages).
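If you are not sure whether a given kernel was built that way, you can check its config directly; this is just a quick sketch, and the config file path varies by distribution:

# Check whether the running kernel was built with page-allocation debugging.
grep CONFIG_DEBUG_PAGEALLOC /boot/config-$(uname -r)
# A kernel built as suggested above prints: CONFIG_DEBUG_PAGEALLOC=y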

Ah, by the way, a bit of the setup is missing; the other variables I set are as below.
Obviously you need to force ZFS. I cannot readily tell you whether DNE and other things play a bigger role here or not.

FSTYPE=ldiskfs
MDSSIZE=400000
MDSCOUNT=1
OSTCOUNT=4
# 50% probability - ZFS
test $((RANDOM % 2)) -eq 0 && FSTYPE=zfs MDSSIZE=600000

# 33% probability - DNE
test $((RANDOM % 3)) -eq 0 && MDSCOUNT=3

export FSTYPE
export MDSSIZE
export MDSCOUNT
export OSTCOUNT

You do not really need a "Lustre cluster" for these tests to run. Every VM is a stand-alone "MDS + OSS + client" kind of config. The more of them you run, the better the chances that something will break. I have 120 instances in my test set.

@verygreen can you apply https://github.com/zfsonlinux/zfs/pull/8005 to the ZFS version you're testing? It should resolve the list corruption you're seeing. The issue was accidentally introduced by the dnode send/recv fixes listed below; it was fixed in master by edc1e713c294d116702b034c59eed7b9a03fbc64, which I've backported (one way to apply the PR is sketched after the commit list).

45f0437912f4 Fix 'zfs recv' of non large_dnode send streams
dc3eea871a53 Fix object reclaim when using large dnodes
d2c8103a68e1 Fix problems receiving reallocated dnodes
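
For reference, one way to pull that PR onto a 0.7.11 checkout for testing is via GitHub's pull-request refs. This is only a sketch; it assumes an existing clone of the zfsonlinux/zfs repository with the default origin remote, and the local branch names are arbitrary:

# Fetch the branch behind pull request #8005 into a local branch and
# merge it on top of the zfs-0.7.11 tag.
git fetch origin pull/8005/head:pr-8005
git checkout -b zfs-0.7.11-pr8005 zfs-0.7.11
git merge pr-8005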

OK, perfect timing; I was just preparing to update my test setup for a new testing round. I will likely have some results in the morning.

Thanks!

@behlendorf in which ZoL releases was the race introduced? Is the 0.7.9 release affected?
Thanks.

@shodanshok those 3 commits first landed in the ill-fated 0.7.10. So, assuming the bug comes solely from those, 0.7.9 would not be affected, and the eventual 0.7.12 would have the fix.

Saw this for the first time with ZFS 0.7.11 on Linux 4.18.6 in Debian. No problems with 0.7.9.

So far so good: 0 crashes in the past 9.5 hours. Without the patch I had at least one crash per hour on my setup.

@rincebrain Interesting. In https://github.com/zfsonlinux/zfs/issues/7933#issuecomment-424237752 I tried to replicate it with a very heavy workload based on concurrent instances of fio and fs_mark, to no avail. So, what does trigger the bug? Should I revert to 0.7.9 even though I did not hit the bug? Thanks.

@shodanshok in order to hit the bug you need to write to a file and then unlink it while it's still being actively written out to disk. Once the data has been synced out you won't be able to hit it. You might be able to reproduce it with something like dbench.
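
For illustration, a rough sketch of that write-then-unlink pattern (the /tank/fs mountpoint, file count, and sizes are placeholders, not part of the original report):

# Write a batch of files and unlink them again before the data has been
# synced out, so dirty dnodes are reclaimed while a txg sync is in flight.
while :; do
    for i in $(seq 1 32); do
        dd if=/dev/zero of=/tank/fs/f$i bs=1M count=8 2>/dev/null &
    done
    wait
    rm -f /tank/fs/f*
done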

~24 hours and still no crashes, so from my perspective this seems to be fixed now. Thanks!

The latest comment suggests this was indeed fixed by https://github.com/zfsonlinux/zfs/commit/b32f1279d4a09a932dcbfa79e125ce60cd8d75bf (zfs-0.7.12), closing.

For the record, I still saw list_del corruption a few times on 0.7.13, as reported in #9068. The message was slightly different, however.
