zfs sync hang

Created on 12 Jan 2018  ·  82 Comments  ·  Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | CentOS 7.2
Distribution Version |
Linux Kernel | 3.10.0-327.13.1.el7.x86_64
Architecture |
ZFS Version | v0.7.5-1
SPL Version | v0.7.5-1

Describe the problem you're observing

sync hangs, mysqld hangs, and a kworker thread spins at 100% CPU.

Describe how to reproduce the problem

High-frequency snapshot create/destroy/clone operations.
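For anyone trying to turn that description into a repeatable test, a minimal sketch might look like the following. The pool name, dataset names, and iteration count are assumptions, not taken from this report; the fsync-ing dd stands in for the mysqld write load.

```shell
#!/bin/sh
# Sketch of a high-frequency snapshot/clone/destroy loop with fsync
# pressure. Pool name, dataset names, and counts are assumptions.

snapname()  { printf '%s@s%d' "$1" "$2"; }   # tank/repro 7 -> tank/repro@s7
clonename() { printf '%s_c%d' "$1" "$2"; }   # tank/repro 7 -> tank/repro_c7

run_repro() {
    base="$1/repro"
    iters=${2:-1000}
    zfs create "$base" || return 1
    i=0
    while [ "$i" -lt "$iters" ]; do
        s=$(snapname "$base" "$i")
        c=$(clonename "$base" "$i")
        zfs snapshot "$s" && zfs clone "$s" "$c" || return 1
        # a small synchronous write, mimicking the fsync-heavy mysqld load
        dd if=/dev/zero of="/$c/f" bs=4k count=1 conv=fsync 2>/dev/null
        zfs destroy "$c"
        zfs destroy "$s"
        i=$((i + 1))
    done
}

# usage (as root):  ./repro.sh tank 5000
if [ -n "${1:-}" ]; then run_repro "$1" "${2:-}"; fi
```

Whether this actually triggers the hang will depend on the pool layout and write load, so treat it as a starting point rather than a confirmed reproducer.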

Include any warning/errors/backtraces from the system logs

dmesg:
[ 189.990968] Adjusting tsc more than 11% (8039035 vs 7759471)
[ 2522.644734] INFO: task mysqld:6608 blocked for more than 120 seconds.
[ 2522.644790] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2522.644854] mysqld D 0000000000000000 0 6608 5602 0x00000000
[ 2522.644860] ffff880feb913da0 0000000000000086 ffff881006eac500 ffff880feb913fd8
[ 2522.644865] ffff880feb913fd8 ffff880feb913fd8 ffff881006eac500 ffff880fffc8a180
[ 2522.644869] ffff880fffc8a000 ffff880fffc8a188 ffff880fffc8a028 0000000000000000
[ 2522.644873] Call Trace:
[ 2522.644882] [] schedule+0x29/0x70
[ 2522.644906] [] cv_wait_common+0x125/0x150 [spl]
[ 2522.644911] [] ? wake_up_atomic_t+0x30/0x30
[ 2522.644922] [] __cv_wait+0x15/0x20 [spl]
[ 2522.644995] [] zil_commit.part.12+0x8b/0x830 [zfs]
[ 2522.645006] [] ? spl_kmem_alloc+0xc7/0x170 [spl]
[ 2522.645017] [] ? tsd_set+0x324/0x500 [spl]
[ 2522.645022] [] ? mutex_lock+0x12/0x2f
[ 2522.645075] [] zil_commit+0x17/0x20 [zfs]
[ 2522.645128] [] zfs_fsync+0x77/0xf0 [zfs]
[ 2522.645179] [] zpl_fsync+0x65/0x90 [zfs]
[ 2522.645186] [] do_fsync+0x65/0xa0
[ 2522.645191] [] SyS_fsync+0x10/0x20
[ 2522.645196] [] system_call_fastpath+0x16/0x1b
[ 2522.645202] INFO: task mysqld:7514 blocked for more than 120 seconds.
[ 2522.645245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2522.645296] mysqld D ffff880fb554bda0 0 7514 5602 0x00000000
[ 2522.645299] ffff880fb554bd30 0000000000000086 ffff880ff4c8f300 ffff880fb554bfd8
[ 2522.645303] ffff880fb554bfd8 ffff880fb554bfd8 ffff880ff4c8f300 ffff880fb554bdb0
[ 2522.645307] ffff88107ff8dec0 0000000000000002 ffffffff811f9420 ffff880fb554bda0
[ 2522.645312] Call Trace:
[ 2522.645316] [] ? unlock_two_nondirectories+0x60/0x60
[ 2522.645321] [] schedule+0x29/0x70
[ 2522.645324] [] inode_wait+0xe/0x20
[ 2522.645328] [] __wait_on_bit+0x60/0x90
[ 2522.645335] [] __inode_wait_for_writeback+0xaf/0xf0
[ 2522.645339] [] ? wake_atomic_t_function+0x40/0x40
[ 2522.645345] [] inode_wait_for_writeback+0x26/0x40
[ 2522.645348] [] evict+0x95/0x170
[ 2522.645351] [] iput+0xf5/0x180
[ 2522.645357] [] do_unlinkat+0x1ae/0x2b0
[ 2522.645362] [] ? SyS_readlinkat+0xd1/0x140
[ 2522.645367] [] SyS_unlink+0x16/0x20
[ 2522.645371] [] system_call_fastpath+0x16/0x1b
[ 2522.645402] INFO: task sync:2653 blocked for more than 120 seconds.
[ 2522.645443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2522.645494] sync D ffffffff8120f8c0 0 2653 1455 0x00000000
[ 2522.645498] ffff880ffedd7d50 0000000000000082 ffff880f01449700 ffff880ffedd7fd8
[ 2522.645502] ffff880ffedd7fd8 ffff880ffedd7fd8 ffff880f01449700 ffff880ffedd7e80
[ 2522.645506] ffff880ffedd7e88 7fffffffffffffff ffff880f01449700 ffffffff8120f8c0
[ 2522.645509] Call Trace:
[ 2522.645515] [] ? generic_write_sync+0x60/0x60
[ 2522.645519] [] schedule+0x29/0x70
[ 2522.645523] [] schedule_timeout+0x209/0x2d0
[ 2522.645527] [] ? __queue_work+0x136/0x320
[ 2522.645530] [] ? __queue_delayed_work+0xaa/0x1a0
[ 2522.645534] [] ? try_to_grab_pending+0xb1/0x160
[ 2522.645538] [] ? generic_write_sync+0x60/0x60
[ 2522.645543] [] wait_for_completion+0x116/0x170
[ 2522.645548] [] ? wake_up_state+0x20/0x20
[ 2522.645552] [] sync_inodes_sb+0xb7/0x1e0
[ 2522.645557] [] ? generic_write_sync+0x60/0x60
[ 2522.645561] [] sync_inodes_one_sb+0x19/0x20
[ 2522.645565] [] iterate_supers+0xb2/0x110
[ 2522.645570] [] sys_sync+0x44/0xb0
[ 2522.645575] [] system_call_fastpath+0x16/0x1b
[the hung-task reports above for mysqld:6608, mysqld:7514, and sync:2653 repeat with identical stacks at 2642 s and 2762 s, and the mysqld:6608 report once more at 2882 s]
[ 4802.367346] perf interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

sync process stack:
[] sync_inodes_sb+0xb7/0x1e0
[] sync_inodes_one_sb+0x19/0x20
[] iterate_supers+0xb2/0x110
[] sys_sync+0x44/0xb0
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

kworker process stack:
cat /proc/3445/stack
[] dmu_zfetch+0x455/0x4f0 [zfs]
[] dbuf_read+0x8ea/0x9f0 [zfs]
[] dnode_hold_impl+0xc6/0xc30 [zfs]
[] 0xffffffffffffffff

cat /proc/3445/stack
[] 0xffffffffffffffff
[root@ifcos ~]# cat /proc/3445/stack
[] zfs_zget+0xfc/0x250 [zfs]
[] zfs_get_data+0x57/0x2d0 [zfs]
[] zil_commit.part.12+0x41c/0x830 [zfs]
[] zil_commit+0x17/0x20 [zfs]
[] zpl_writepages+0xd6/0x170 [zfs]
[] do_writepages+0x1e/0x40
[] __writeback_single_inode+0x40/0x220
[] writeback_sb_inodes+0x25e/0x420
[] wb_writeback+0xff/0x2f0
[] bdi_writeback_workfn+0x115/0x460
[] process_one_work+0x17b/0x470
[] worker_thread+0x11b/0x400
[] kthread+0xcf/0xe0
[] ret_from_fork+0x58/0x90
[] 0xffffffffffffffff

cat /proc/3445/stack
[] dmu_zfetch+0x455/0x4f0 [zfs]
[] __dbuf_hold_impl+0x135/0x5a0 [zfs]
[] 0xffffffffffffffff

cat /proc/3445/stack
[] dbuf_find+0x1c9/0x1d0 [zfs]
[] __dbuf_hold_impl+0x42/0x5a0 [zfs]
[] 0xffffffffffffffff

cat /proc/3445/stack
[] dmu_zfetch+0x455/0x4f0 [zfs]
[] dbuf_read+0x756/0x9f0 [zfs]
[] dnode_hold_impl+0xc6/0xc30 [zfs]
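The one-shot cat /proc/PID/stack reads above each catch a single instant; a small sampling loop makes the hot path (here dmu_zfetch/dbuf_read) show up by frequency. A sketch, with the PID, sample count, and interval as parameters:

```shell
#!/bin/sh
# Periodically sample a kernel thread's stack so the dominant code path
# shows up by how often it appears across samples.

sample_stack() {
    pid=$1
    n=${2:-20}
    interval=${3:-1}
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '=== sample %d (%s) ===\n' "$i" "$(date +%T)"
        cat "/proc/$pid/stack" 2>/dev/null || {
            echo "cannot read /proc/$pid/stack"
            return 1
        }
        sleep "$interval"
        i=$((i + 1))
    done
}

# usage (as root):  sample_stack 3445 30 1 > kworker-stacks.log
if [ -n "${1:-}" ]; then sample_stack "$@"; fi
```

Reading /proc/PID/stack requires root; the output can then be grepped for the most frequent top frames.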

Defect

Most helpful comment

Just chiming in to say 'me too'. Definitely seems to be a bug, disks are healthy.

All 82 comments

The same situation here too.

Same issue. No reproduction steps identified, seems "random".

It most commonly appears during long-running disk-intensive tasks (in my case triggered by the BURP backup software): the same blocked stack traces at generic_write_sync, and a kworker at 100%.

Here is output from 'echo l > /proc/sysrq-trigger'

[5490997.296870] sending NMI to all CPUs:
[5490997.297924] NMI backtrace for cpu 0
[5490997.297926] CPU: 0 PID: 27539 Comm: kworker/u8:1 Tainted: P OE ------------ 3.10.0-693.el7.x86_64 #1
[5490997.297928] Hardware name: Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R, BIOS F4 06/29/2007
[5490997.297929] Workqueue: writeback bdi_writeback_workfn (flush-zfs-1)
[5490997.297931] task: ffff8800c7db3f40 ti: ffff880120aac000 task.ti: ffff880120aac000
[5490997.297932] RIP: 0010:[] [] read_tsc+0xa/0x20
[5490997.297933] RSP: 0018:ffff880120aaf7c8 EFLAGS: 00000246
[5490997.297935] RAX: ffff8800a5cdf0d8 RBX: ffffffff81a0e3c0 RCX: 0000000000000000
[5490997.297936] RDX: 0000000000000058 RSI: ffff8800a5cdf080 RDI: ffffffff81a0e3c0
[5490997.297937] RBP: ffff880120aaf7c8 R08: 000000000053c8f3 R09: 0000000000000000
[5490997.297939] R10: 00000000006db6e0 R11: ffffea0001935000 R12: ffff880120aaf828
[5490997.297940] R13: 000000008aaf12f6 R14: ffff8800a5cdf500 R15: ffff880127d6f3e8
[5490997.297942] FS: 0000000000000000(0000) GS:ffff88012fc00000(0000) knlGS:0000000000000000
[5490997.297943] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[5490997.297944] CR2: 0000000000dde6d0 CR3: 000000005a22f000 CR4: 00000000000007f0
[5490997.297946] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5490997.297947] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[5490997.297948] Stack:
[5490997.297949] ffff880120aaf7f0 ffffffff810ea6ae 0000000000000005 ffff8800a5cdf080
[5490997.297951] ffff880127d6f3b8 ffff880120aaf868 ffffffffc0b73a7c ffff8800c7db3f40
[5490997.297952] ffff8800c7db3f40 000000000000b048 000000000000b049 ffff880120aaf828
[5490997.297954] Call Trace:
[5490997.297955] [] getrawmonotonic64+0x2e/0xc0
[5490997.297956] [] dmu_zfetch+0x39c/0x4c0 [zfs]
[5490997.297957] [] dbuf_read+0x8e2/0x9e0 [zfs]
[5490997.297959] [] ? spl_kmem_free+0x35/0x40 [spl]
[5490997.297960] [] ? dbuf_hold_impl+0xb9/0xd0 [zfs]
[5490997.297961] [] dnode_hold_impl+0xc6/0xc30 [zfs]
[5490997.297963] [] ? avl_add+0x4a/0x80 [zavl]
[5490997.297964] [] ? mutex_lock+0x12/0x2f
[5490997.297965] [] dnode_hold+0x1b/0x20 [zfs]
[5490997.297967] [] dmu_bonus_hold+0x32/0x1d0 [zfs]
[5490997.297968] [] sa_buf_hold+0xe/0x10 [zfs]
[5490997.297969] [] zfs_zget+0x123/0x250 [zfs]
[5490997.297971] [] zfs_get_data+0x57/0x2d0 [zfs]
[5490997.297972] [] zil_commit.part.12+0x41c/0x830 [zfs]
[5490997.297973] [] zil_commit+0x17/0x20 [zfs]
[5490997.297975] [] zpl_writepages+0xd6/0x170 [zfs]
[5490997.297976] [] do_writepages+0x1e/0x40
[5490997.297977] [] __writeback_single_inode+0x40/0x220
[5490997.297979] [] writeback_sb_inodes+0x1c4/0x490
[5490997.297980] [] wb_writeback+0xff/0x2f0
[5490997.297981] [] ? set_worker_desc+0x86/0xb0
[5490997.297983] [] bdi_writeback_workfn+0x115/0x460
[5490997.297984] [] process_one_work+0x17a/0x440
[5490997.297985] [] worker_thread+0x126/0x3c0
[5490997.297987] [] ? manage_workers.isra.24+0x2a0/0x2a0
[5490997.297988] [] kthread+0xcf/0xe0
[5490997.297989] [] ? insert_kthread_work+0x40/0x40
[5490997.297991] [] ret_from_fork+0x58/0x90
[5490997.297992] [] ? insert_kthread_work+0x40/0x40
[5490997.297994] Code: 00 55 8b 05 39 cc ad 00 48 89 e5 5d c3 0f 1f 40 00 55 b8 ed ff ff ff 48 89 e5 5d c3 0f 1f 44 00 00 55 48 89 e5 66 66 90 0f ae e8 <0f> 31 89 c0 48 c1 e2 20 48 09 c2 48 89 d0 5d c3 66 0f 1f 44 00

and another shortly after

[5490939.586576] SysRq : Show backtrace of all active CPUs
[5490939.587701] sending NMI to all CPUs:
[5490939.588755] NMI backtrace for cpu 0
[5490939.588757] CPU: 0 PID: 27539 Comm: kworker/u8:1 Tainted: P OE ------------ 3.10.0-693.el7.x86_64 #1
[5490939.588759] Hardware name: Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R, BIOS F4 06/29/2007
[5490939.588760] Workqueue: writeback bdi_writeback_workfn (flush-zfs-1)
[5490939.588762] task: ffff8800c7db3f40 ti: ffff880120aac000 task.ti: ffff880120aac000
[5490939.588763] RIP: 0010:[] [] getrawmonotonic64+0x7a/0xc0
[5490939.588764] RSP: 0018:ffff880120aaf7d8 EFLAGS: 00000207
[5490939.588766] RAX: 00000000006db6e0 RBX: ffffffff81a0e3c0 RCX: 0000000000000018
[5490939.588767] RDX: 0000000021f8aba3 RSI: 00000000000cc2e8 RDI: 00000000001dc6c8
[5490939.588768] RBP: ffff880120aaf7f0 R08: 000000000053c8b9 R09: 0000000000000000
[5490939.588770] R10: 00000000006db6e0 R11: ffffea0001935000 R12: ffff880120aaf828
[5490939.588771] R13: 000000008aad501e R14: ffff8800a5cdf500 R15: ffff880127d6f3e8
[5490939.588772] FS: 0000000000000000(0000) GS:ffff88012fc00000(0000) knlGS:0000000000000000
[5490939.588774] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[5490939.588775] CR2: 00007f4379aae8e0 CR3: 00000000b9a26000 CR4: 00000000000007f0
[5490939.588777] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5490939.588778] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[5490939.588779] Stack:
[5490939.588781] 0000000000000007 0000000000000000 ffff880127d6f3b8 ffff880120aaf868
[5490939.588782] ffffffffc0b73a7c ffff8800c7db3f40 ffff8800c7db3f40 000000000000b048
[5490939.588784] 000000000000b049 ffff880120aaf828 000000000053c8b9 0000000022056e52
[5490939.588785] Call Trace:
[5490939.588786] [] dmu_zfetch+0x39c/0x4c0 [zfs]
[5490939.588787] [] dbuf_read+0x8e2/0x9e0 [zfs]
[5490939.588789] [] ? spl_kmem_free+0x35/0x40 [spl]
[5490939.588790] [] ? dbuf_hold_impl+0xb9/0xd0 [zfs]
[5490939.588791] [] dnode_hold_impl+0xc6/0xc30 [zfs]
[5490939.588793] [] ? avl_add+0x4a/0x80 [zavl]
[5490939.588794] [] ? mutex_lock+0x12/0x2f
[5490939.588795] [] dnode_hold+0x1b/0x20 [zfs]
[5490939.588797] [] dmu_bonus_hold+0x32/0x1d0 [zfs]
[5490939.588798] [] sa_buf_hold+0xe/0x10 [zfs]
[5490939.588799] [] zfs_zget+0x123/0x250 [zfs]
[5490939.588801] [] zfs_get_data+0x57/0x2d0 [zfs]
[5490939.588802] [] zil_commit.part.12+0x41c/0x830 [zfs]
[5490939.588803] [] zil_commit+0x17/0x20 [zfs]
[5490939.588804] [] zpl_writepages+0xd6/0x170 [zfs]
[5490939.588806] [] do_writepages+0x1e/0x40
[5490939.588807] [] __writeback_single_inode+0x40/0x220
[5490939.588809] [] writeback_sb_inodes+0x1c4/0x490
[5490939.588810] [] wb_writeback+0xff/0x2f0
[5490939.588811] [] ? set_worker_desc+0x86/0xb0
[5490939.588813] [] bdi_writeback_workfn+0x115/0x460
[5490939.588814] [] process_one_work+0x17a/0x440
[5490939.588815] [] worker_thread+0x126/0x3c0
[5490939.588817] [] ? manage_workers.isra.24+0x2a0/0x2a0
[5490939.588818] [] kthread+0xcf/0xe0
[5490939.588819] [] ? insert_kthread_work+0x40/0x40
[5490939.588820] [] ret_from_fork+0x58/0x90
[5490939.588822] [] ? insert_kthread_work+0x40/0x40
[5490939.588824] Code: da 00 48 8b 15 58 b1 da 00 44 3b 2d b9 b0 da 00 75 b0 48 29 f0 31 f6 48 21 c7 44 89 d0 48 0f 49 f7 48 0f af f0 4c 01 ce 48 d3 fe <48> 8d 04 16 48 3d ff c9 9a 3b 76 2e 31 d2 0f 1f 84 00 00 00 00

Same problem here (HP Data Protector backup software; vbda process). While a full backup runs against a snapshot, zfs receive makes no progress. Attached: the output of perf record -a and perf sched. I am using the deadline scheduler on the block devices to try to work around this; the same problem occurs with the noop scheduler.
ZFS version: 0.6.5.6-0ubuntu16
Linux kernel: 4.4.0-101-generic #124-Ubuntu (Ubuntu 16.04.3 LTS)
perf_record-a2.data.gz
perf_shed_record2.data.gz

 iostat -xd 20
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdf               0,00     0,00    4,25   15,30   406,50  1061,58   150,19     0,03    1,73    1,08    1,91   0,40   0,78
sde               0,00     0,00    4,35   16,60   419,43   752,77   111,90     0,03    1,30    0,83    1,42   0,29   0,60
sdg               0,00     0,00  319,35    0,00  9188,00     0,00    57,54     1,01    3,16    3,16    0,00   1,71  54,56
sdh               0,00     0,00  423,45    0,00 12698,40     0,00    59,98     0,93    2,20    2,20    0,00   1,29  54,44
sdj               0,00     0,00  371,45    0,00 10857,80     0,00    58,46     1,22    3,28    3,28    0,00   1,69  62,82
sdi               0,00     0,00  515,95    0,00 15066,20     0,00    58,40     1,08    2,10    2,10    0,00   1,16  59,78
sdk               0,00     0,00  295,80    0,00  8717,00     0,00    58,94     0,60    2,05    2,05    0,00   1,39  41,16
sdl               0,00     0,00  338,10    0,00  9420,80     0,00    55,73     0,70    2,06    2,06    0,00   1,35  45,60
sdq               0,00     0,00  226,60    0,00  6327,40     0,00    55,85     0,90    3,98    3,98    0,00   2,39  54,10
sdo               0,00     0,00  482,35    0,00 14504,60     0,00    60,14     1,36    2,81    2,81    0,00   1,30  62,94
sdn               0,00     0,00  430,60    0,00 12728,60     0,00    59,12     2,08    4,82    4,82    0,00   1,85  79,58
sdm               0,00     0,00  414,70    0,00 12191,80     0,00    58,80     1,00    2,42    2,42    0,00   1,28  53,06
sdr               0,00     0,00  256,05    0,00  7346,40     0,00    57,38     0,52    2,04    2,04    0,00   1,48  37,96
sdp               0,00     0,00  438,55    0,00 12945,60     0,00    59,04     0,99    2,25    2,25    0,00   1,26  55,08
zpool status
  pool: pool_z2_samba
 state: ONLINE
  scan: scrub canceled on Fri Dec  1 09:06:55 2017
config:

    NAME        STATE     READ WRITE CKSUM
    pool_z2_samba  ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
      raidz2-1  ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
        sdp     ONLINE       0     0     0
        sdq     ONLINE       0     0     0
        sdr     ONLINE       0     0     0
    logs
      mirror-2  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
    cache
      sde       ONLINE       0     0     0
      sdf       ONLINE       0     0     0

errors: No known data errors

The problem occurs very often. I attach further data from:
echo w > /proc/sysrq-trigger
and
perf record -ag
echo_w_sysrq-trigger.gz
perf.data.gz

Hi all,

I experienced this problem as well (multiple times). Then I found this thread and tried switching from noop scheduler to deadline scheduler and still experienced the problem.

I am running a 4.16.8 Gentoo kernel, with musl libc, and using fairly recent ZFS from master branch (zfs commit 670d74b9cee4dd185a620510bffd4797a2e1066a , spl commit 1149b62d20b7ed9d8ae25d5da7a06213d79b7602).

This tends to happen after a lot of compilation. I merged about 1500 packages with the deadline scheduler before hitting it; the noop scheduler would get through about 500-1500 packages before hanging. My data set is probably too small for any comparison, and I didn't write down specifics (the workflow can be complicated to begin with...).

Here's my dmesg just after the hang:
https://pastebin.com/MayYqGaV

Note that the system has 56 threads to play with and would spend some fraction of its time with a load average around 60; perhaps that is plenty of opportunity for resource contention and deadlocks?

HTH.

I am also experiencing this issue. I don't have as many helpful logs or diagnostics, but I can also say I am similarly doing a lot of heavy compilation and small IO. I'm also using it as a Docker storage driver, but the lockups will happen with or without Docker. I'm on an older 0.6.5.6-0ubuntu18. A reboot temporarily fixes the issue.

It happened again (4.4.0-116-generic #140-Ubuntu, zfs 0.6.5.6-0ubuntu16) while running a full backup.
zfsBlock.log.gz

Hit a similar issue on Gentoo during emerge --emptytree -j4 @world,
which recompiles every package with lots of parallel access and file locking;
kworker at 100% CPU usage.

kernel 4.14.39
ZFS 0.7.8
SPL 0.7.8

NMI backtrace
https://pastebin.com/rYAhR1ri

As a troubleshooting step, I recommend pulling the drives and doing some read/write sector tests to make sure there are no controller or bad-sector issues. I've hit this bug on a system where the drive seemed fine apart from the odd ZFS hang, until the drive became completely unresponsive a few months later and failed.
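For reference, non-destructive checks along those lines might look like this (the device name is a placeholder, and smartctl/badblocks must be installed; note that badblocks without -w is read-only):

```shell
#!/bin/sh
# Non-destructive health checks for a single drive. The device name is a
# placeholder -- substitute the real disk.

check_drive() {
    dev=$1
    [ -b "$dev" ] || { echo "$dev is not a block device" >&2; return 1; }
    smartctl -H -A "$dev"       # SMART health verdict + raw attributes
    smartctl -t long "$dev"     # queue an extended in-drive self-test
    badblocks -sv "$dev"        # read-only surface scan (never use -w here)
}

# usage (as root):  check_drive /dev/sdg
if [ -n "${1:-}" ]; then check_drive "$1"; fi
```

The extended self-test runs inside the drive in the background; check its result later with smartctl -l selftest on the same device.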

Just chiming in to say 'me too'. Definitely seems to be a bug, disks are healthy.

Again, sync and other processes lock up during zfs receive and a full backup (4.4.0-135-generic #161-Ubuntu).
perf.data.gz
kernelStateD.log

Have this issue too.

Not very heavy I/O. If I use Docker and compile something inside a container, the system hangs within ~12 hours (X11 freezes, the mouse cursor stops moving). The HDD LED blinks a few times immediately afterwards and then stops.

A hardware reset brings the system back, but sometimes a subsequent scrub finds errors, and three times I have lost my pool entirely (import fails with various kernel stack traces).

Also, maybe it is important: my root filesystem is on this ZFS pool.

I tried switching hardware to an AMD desktop and to an old Intel server without any change, so I think this is a software issue.

I am on Debian:

Linux localhost 4.14.0-3-amd64 #1 SMP Debian 4.14.17-1 (2018-02-14) x86_64 GNU/Linux
zfs-dkms 0.7.9-3

Again, sync and other processes lock up during zfs receive and a full backup. Attached: the output of
echo w > /proc/sysrq-trigger
echo l > /proc/sysrq-trigger
kern.log


Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 16.04.5 LTS
Linux Kernel | 4.4.0-141-generic
Architecture | x86_64
ZFS Version | 0.6.5.6-0ubuntu26
SPL Version | 0.6.5.6-0ubuntu4
kernel.log.gz

It happened again. Please help!

@bud4 Most developers currently have their hands full, so if you want them to prioritize this problem, you need to provide a test script that reproduces it 100% of the time.

That way they can easily reproduce it on their end in an automated test environment.

It would also be preferable to upgrade a spare machine to 0.7.12 and confirm the issue still reproduces there.

@AndCycle
This problem is a real pain. If any of us could have provided a reproduction case, I believe this issue would have been closed long ago.

The issue may not appear for two months and then start appearing twice a day.

Perhaps we need to turn on some additional logging and read it over a serial console (RS-232)? Please explain how, in simple terms: we are not all developers here and do not know how to hunt this bug.

@AndCycle
It is very difficult for me to create scripts that reproduce the event, not least because I only have a master and a slave, both in production.
The bug often occurs on the slave while the full backup is running and a zfs receive operation is in progress at the same time. The HP Data Protector agent reads data from the mounted snapshots of the various filesystems, with a maximum of 6 simultaneous reader processes.
The bug also occurs on the master while there are about 800 smbd connections (many read operations and few writes, though some of the write operations are very intensive). The zpool holds about 150 filesystems with dedup and compression enabled; the total data is about 60 TB in raidz2 across two vdevs of 6 disks each.
Please at least suggest what information to collect the next time the problem occurs. What would be useful? Do you have any scripts I can run?
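Not an official answer, but a generic collection script along these lines (the output directory and filenames are arbitrary choices) gathers most of what gets requested in this thread. Run it as root while the hang is in progress:

```shell
#!/bin/sh
# Gather hang diagnostics into one directory while the hang is happening.
# Run as root; tools that are not installed are simply skipped.

have() { command -v "$1" >/dev/null 2>&1; }

collect() {
    out=$1
    mkdir -p "$out" || return 1
    dmesg > "$out/dmesg.txt" 2>&1
    echo w > /proc/sysrq-trigger          # log stacks of blocked (D-state) tasks
    echo l > /proc/sysrq-trigger          # log a backtrace for every CPU
    dmesg > "$out/dmesg-after-sysrq.txt" 2>&1
    have zpool  && zpool status -v > "$out/zpool-status.txt"
    have iostat && iostat -xd 5 3  > "$out/iostat.txt"
    cp /proc/spl/kstat/zfs/arcstats "$out/arcstats.txt" 2>/dev/null
    echo "collected in $out"
}

# usage (as root):  collect /var/tmp/zfs-hang-$(date +%Y%m%d-%H%M%S)
if [ "$(id -u)" -eq 0 ] && [ -n "${1:-}" ]; then collect "$1"; fi
```

The sysrq writes dump into the kernel log, which is why dmesg is captured a second time afterwards; attach the whole directory to the issue.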

hopeless
kernel.log.gz

I also have this issue, and I am running zfs on root. I also run several dockers.

When the issue occurs, I can always trace it back to a period of high I/O. The type of disk activity doesn't seem to matter because sometimes it will happen during a routine scrub, before which the disks had been fairly idle for hours. Not all periods of high I/O cause the problem though.

Seemingly random problems like this, to me, point to a hardware issue (likely disk or storage controller), but badblocks and SMART tests have been coming up clean, so I'm not sure what to think at this point. I'll keep an eye on this and try to help out with logs or test data where I can.

Same issue for us on multiple servers with high frequency create/destroy/clone.

Using Debian 9.7 and ZOL 0.7.12

Here are some logs (an strace of update-initramfs -u):

execve("/usr/sbin/update-initramfs", ["update-initramfs", "-u"], [/* 16 vars */]) = 0
brk(NULL)                               = 0x56112fbcd000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f17c42f7000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=65446, ...}) = 0
mmap(NULL, 65446, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f17c42e7000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\4\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1689360, ...}) = 0
mmap(NULL, 3795296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f17c3d38000
mprotect(0x7f17c3ecd000, 2097152, PROT_NONE) = 0
mmap(0x7f17c40cd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f17c40cd000
mmap(0x7f17c40d3000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f17c40d3000
close(3)                                = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f17c42e5000
arch_prctl(ARCH_SET_FS, 0x7f17c42e5700) = 0
mprotect(0x7f17c40cd000, 16384, PROT_READ) = 0
mprotect(0x56112f1dd000, 8192, PROT_READ) = 0
mprotect(0x7f17c42fa000, 4096, PROT_READ) = 0
munmap(0x7f17c42e7000, 65446)           = 0
getpid()                                = 8030
rt_sigaction(SIGCHLD, {sa_handler=0x56112efd3ef0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
geteuid()                               = 0
brk(NULL)                               = 0x56112fbcd000
brk(0x56112fbee000)                     = 0x56112fbee000
getppid()                               = 8028
stat("/root", {st_mode=S_IFDIR|0700, st_size=19, ...}) = 0
stat(".", {st_mode=S_IFDIR|0700, st_size=19, ...}) = 0
open("/usr/sbin/update-initramfs", O_RDONLY) = 3
fcntl(3, F_DUPFD, 10)                   = 10
close(3)                                = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
rt_sigaction(SIGINT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x56112efd3ef0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
read(10, "#!/bin/sh\n\nSTATEDIR=/var/lib/ini"..., 8192) = 8192
faccessat(AT_FDCWD, "/etc/initramfs-tools/update-initramfs.conf", R_OK) = 0
open("/etc/initramfs-tools/update-initramfs.conf", O_RDONLY) = 3
fcntl(3, F_DUPFD, 10)                   = 11
close(3)                                = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
read(11, "#\n# Configuration file for updat"..., 8192) = 378
read(11, "", 8192)                      = 0
close(11)                               = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8031
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 8031
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8031, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 8031
read(10, ";\n\tu)\n\t\tupdate\n\t\t;;\nesac\n", 8192) = 25
pipe([3, 4])                            = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8032
close(4)                                = 0
read(3, "4.15.18-10-pve\n4.15.18-9-pve\n4.1"..., 128) = 128
read(3, ".13.16-3-pve\n4.13.16-2-pve\n4.13."..., 128) = 128
read(3, "3.8-3-pve\n4.13.8-2-pve\n4.13.4-1-"..., 128) = 36
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8032, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 36
read(3, "", 128)                        = 0
close(3)                                = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8032
faccessat(AT_FDCWD, "/proc/mounts", R_OK) = 0
stat("/usr/local/sbin/ischroot", 0x7ffe69edd4e0) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/ischroot", 0x7ffe69edd4e0) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/ischroot", 0x7ffe69edd4e0) = -1 ENOENT (No such file or directory)
stat("/usr/bin/ischroot", {st_mode=S_IFREG|0755, st_size=10552, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8035
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 8035
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8035, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 8035
pipe([3, 4])                            = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8036
close(4)                                = 0
read(3, "", 128)                        = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8036, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 0
close(3)                                = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8036
stat("/boot/initrd.img-4.15.18-10-pve", {st_mode=S_IFREG|0644, st_size=36184842, ...}) = 0
stat("/usr/local/sbin/sha1sum", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/sha1sum", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/sha1sum", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/bin/sha1sum", {st_mode=S_IFREG|0755, st_size=47976, ...}) = 0
pipe([3, 4])                            = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8037
close(4)                                = 0
stat("/usr/local/sbin/diff", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/diff", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/diff", 0x7ffe69edd340)  = -1 ENOENT (No such file or directory)
stat("/usr/bin/diff", {st_mode=S_IFREG|0755, st_size=146824, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8038
close(3)                                = 0
close(-1)                               = -1 EBADF (Bad file descriptor)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8037
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8037, si_uid=0, si_status=0, si_utime=9, si_stime=1} ---
rt_sigreturn({mask=[]})                 = 8037
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8038
faccessat(AT_FDCWD, "/boot/initrd.img-4.15.18-10-pve", R_OK) = 0
faccessat(AT_FDCWD, "/boot/initrd.img-4.15.18-10-pve.dpkg-bak", R_OK) = 0
stat("/usr/local/sbin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/rm", 0x7ffe69edd520)    = -1 ENOENT (No such file or directory)
stat("/usr/bin/rm", 0x7ffe69edd520)     = -1 ENOENT (No such file or directory)
stat("/sbin/rm", 0x7ffe69edd520)        = -1 ENOENT (No such file or directory)
stat("/bin/rm", {st_mode=S_IFREG|0755, st_size=64424, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8039
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8039
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8039, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 8039
stat("/usr/local/sbin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/ln", 0x7ffe69edd540)    = -1 ENOENT (No such file or directory)
stat("/usr/bin/ln", 0x7ffe69edd540)     = -1 ENOENT (No such file or directory)
stat("/sbin/ln", 0x7ffe69edd540)        = -1 ENOENT (No such file or directory)
stat("/bin/ln", {st_mode=S_IFREG|0755, st_size=56240, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8040
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8040
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8040, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 8040
write(1, "update-initramfs: Generating /bo"..., 61update-initramfs: Generating /boot/initrd.img-4.15.18-10-pve
) = 61
stat("/usr/local/sbin/mkinitramfs", 0x7ffe69edd580) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/mkinitramfs", 0x7ffe69edd580) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/mkinitramfs", {st_mode=S_IFREG|0755, st_size=10255, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8041
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8041
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8041, si_uid=0, si_status=0, si_utime=1, si_stime=3} ---
rt_sigreturn({mask=[]})                 = 8041
stat("/usr/local/sbin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/mv", 0x7ffe69edd540)    = -1 ENOENT (No such file or directory)
stat("/usr/bin/mv", 0x7ffe69edd540)     = -1 ENOENT (No such file or directory)
stat("/sbin/mv", 0x7ffe69edd540)        = -1 ENOENT (No such file or directory)
stat("/bin/mv", {st_mode=S_IFREG|0755, st_size=126416, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 14543
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 14543
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14543, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 14543
open("/var/lib/initramfs-tools/4.15.18-10-pve", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_DUPFD, 10)                   = 11
close(1)                                = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
dup2(3, 1)                              = 1
close(3)                                = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 14544
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 14544
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14544, si_uid=0, si_status=0, si_utime=8, si_stime=1} ---
rt_sigreturn({mask=[]})                 = 14544
dup2(11, 1)                             = 1
close(11)                               = 0
stat("/usr/local/sbin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/sync", 0x7ffe69edd560)  = -1 ENOENT (No such file or directory)
stat("/usr/bin/sync", 0x7ffe69edd560)   = -1 ENOENT (No such file or directory)
stat("/sbin/sync", 0x7ffe69edd560)      = -1 ENOENT (No such file or directory)
stat("/bin/sync", {st_mode=S_IFREG|0755, st_size=31496, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 14545
wait4(-1,

The strace ends here: the child running `/bin/sync` (pid 14545) never exits, so the `wait4` blocks forever. Output of `mount` at the time:

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=74244120k,nr_inodes=18561030,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=14852988k,mode=755)
rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,noacl)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=36,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=4157)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
rpool on /rpool type zfs (rw,noatime,xattr,noacl)
rpool/ROOT on /rpool/ROOT type zfs (rw,noatime,xattr,noacl)
rpool/data on /rpool/data type zfs (rw,noatime,xattr,noacl)
rpool/data/subvol-102-disk-0 on /rpool/data/subvol-102-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-110-disk-0 on /rpool/data/subvol-110-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-111-disk-1 on /rpool/data/subvol-111-disk-1 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-120-disk-0 on /rpool/data/subvol-120-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-134-disk-1 on /rpool/data/subvol-134-disk-1 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-137-disk-1 on /rpool/data/subvol-137-disk-1 type zfs (rw,noatime,xattr,posixacl)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
dev-nas-1:/mnt/Pool-NAS/Backups on /mnt/dev-nas-1 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.104,mountvers=3,mountport=881,mountproto=udp,local_lock=none,addr=192.168.1.104)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
dev-nas-1:/mnt/Pool-NAS/Proxmox on /mnt/pve/dev-nas-1 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.104,mountvers=3,mountport=881,mountproto=udp,local_lock=none,addr=192.168.1.104)
rpool/data/subvol-104-disk-0 on /rpool/data/subvol-104-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-101-disk-0 on /rpool/data/subvol-101-disk-0 type zfs (rw,noatime,xattr,posixacl)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=14852984k,mode=700)
rpool/data/subvol-109-disk-0 on /rpool/data/subvol-109-disk-0 type zfs (rw,noatime,xattr,posixacl)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tmpfs on /run/user/1006 type tmpfs (rw,nosuid,nodev,relatime,size=14852984k,mode=700,uid=1006,gid=1006)
tmpfs on /run/user/1001 type tmpfs (rw,nosuid,nodev,relatime,size=14852984k,mode=700,uid=1001,gid=1001)
The hung-task traces from dmesg:

[Mon Feb  4 11:08:25 2019] INFO: task kworker/u49:1:30643 blocked for more than 120 seconds.
[Mon Feb  4 11:08:25 2019]       Tainted: P        W IO     4.15.18-8-pve #1
[Mon Feb  4 11:08:25 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Feb  4 11:08:25 2019] kworker/u49:1   D    0 30643      2 0x80000000
[Mon Feb  4 11:08:25 2019] Workqueue: writeback wb_workfn (flush-zfs-9)
[Mon Feb  4 11:08:25 2019] Call Trace:
[Mon Feb  4 11:08:25 2019]  __schedule+0x3e0/0x870
[Mon Feb  4 11:08:25 2019]  schedule+0x36/0x80
[Mon Feb  4 11:08:25 2019]  io_schedule+0x16/0x40
[Mon Feb  4 11:08:25 2019]  __lock_page+0xff/0x140
[Mon Feb  4 11:08:25 2019]  ? page_cache_tree_insert+0xe0/0xe0
[Mon Feb  4 11:08:25 2019]  write_cache_pages+0x2e9/0x4b0
[Mon Feb  4 11:08:25 2019]  ? zpl_write_common_iovec+0xe0/0xe0 [zfs]
[Mon Feb  4 11:08:25 2019]  ? _cond_resched+0x1a/0x50
[Mon Feb  4 11:08:25 2019]  ? mutex_lock+0x12/0x40
[Mon Feb  4 11:08:25 2019]  ? rrw_exit+0x5a/0x150 [zfs]
[Mon Feb  4 11:08:25 2019]  zpl_writepages+0x91/0x170 [zfs]
[Mon Feb  4 11:08:25 2019]  do_writepages+0x1f/0x70
[Mon Feb  4 11:08:25 2019]  __writeback_single_inode+0x45/0x330
[Mon Feb  4 11:08:25 2019]  writeback_sb_inodes+0x266/0x590
[Mon Feb  4 11:08:25 2019]  __writeback_inodes_wb+0x92/0xc0
[Mon Feb  4 11:08:25 2019]  wb_writeback+0x288/0x320
[Mon Feb  4 11:08:25 2019]  wb_workfn+0x1a3/0x440
[Mon Feb  4 11:08:25 2019]  ? wb_workfn+0x1a3/0x440
[Mon Feb  4 11:08:25 2019]  process_one_work+0x1e0/0x400
[Mon Feb  4 11:08:25 2019]  worker_thread+0x4b/0x420
[Mon Feb  4 11:08:25 2019]  kthread+0x105/0x140
[Mon Feb  4 11:08:25 2019]  ? process_one_work+0x400/0x400
[Mon Feb  4 11:08:25 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Mon Feb  4 11:08:25 2019]  ? do_syscall_64+0x73/0x130
[Mon Feb  4 11:08:25 2019]  ? SyS_exit_group+0x14/0x20
[Mon Feb  4 11:08:25 2019]  ret_from_fork+0x35/0x40



Output of `zfs get all rpool/ROOT/pve-1`:



NAME              PROPERTY              VALUE                  SOURCE
rpool/ROOT/pve-1  type                  filesystem             -
rpool/ROOT/pve-1  creation              Fri Oct 27 11:46 2017  -
rpool/ROOT/pve-1  used                  7.23G                  -
rpool/ROOT/pve-1  available             1.08T                  -
rpool/ROOT/pve-1  referenced            7.23G                  -
rpool/ROOT/pve-1  compressratio         1.60x                  -
rpool/ROOT/pve-1  mounted               yes                    -
rpool/ROOT/pve-1  quota                 none                   default
rpool/ROOT/pve-1  reservation           none                   default
rpool/ROOT/pve-1  recordsize            128K                   default
rpool/ROOT/pve-1  mountpoint            /                      local
rpool/ROOT/pve-1  sharenfs              off                    default
rpool/ROOT/pve-1  checksum              on                     default
rpool/ROOT/pve-1  compression           on                     inherited from rpool
rpool/ROOT/pve-1  atime                 off                    inherited from rpool
rpool/ROOT/pve-1  devices               on                     default
rpool/ROOT/pve-1  exec                  on                     default
rpool/ROOT/pve-1  setuid                on                     default
rpool/ROOT/pve-1  readonly              off                    default
rpool/ROOT/pve-1  zoned                 off                    default
rpool/ROOT/pve-1  snapdir               hidden                 default
rpool/ROOT/pve-1  aclinherit            restricted             default
rpool/ROOT/pve-1  createtxg             10                     -
rpool/ROOT/pve-1  canmount              on                     default
rpool/ROOT/pve-1  xattr                 on                     default
rpool/ROOT/pve-1  copies                1                      default
rpool/ROOT/pve-1  version               5                      -
rpool/ROOT/pve-1  utf8only              off                    -
rpool/ROOT/pve-1  normalization         none                   -
rpool/ROOT/pve-1  casesensitivity       sensitive              -
rpool/ROOT/pve-1  vscan                 off                    default
rpool/ROOT/pve-1  nbmand                off                    default
rpool/ROOT/pve-1  sharesmb              off                    default
rpool/ROOT/pve-1  refquota              none                   default
rpool/ROOT/pve-1  refreservation        none                   default
rpool/ROOT/pve-1  guid                  15527408840965499781   -
rpool/ROOT/pve-1  primarycache          all                    default
rpool/ROOT/pve-1  secondarycache        all                    default
rpool/ROOT/pve-1  usedbysnapshots       0B                     -
rpool/ROOT/pve-1  usedbydataset         7.23G                  -
rpool/ROOT/pve-1  usedbychildren        0B                     -
rpool/ROOT/pve-1  usedbyrefreservation  0B                     -
rpool/ROOT/pve-1  logbias               latency                default
rpool/ROOT/pve-1  dedup                 off                    default
rpool/ROOT/pve-1  mlslabel              none                   default
rpool/ROOT/pve-1  sync                  standard               inherited from rpool
rpool/ROOT/pve-1  dnodesize             legacy                 default
rpool/ROOT/pve-1  refcompressratio      1.60x                  -
rpool/ROOT/pve-1  written               7.23G                  -
rpool/ROOT/pve-1  logicalused           11.3G                  -
rpool/ROOT/pve-1  logicalreferenced     11.3G                  -
rpool/ROOT/pve-1  volmode               default                default
rpool/ROOT/pve-1  filesystem_limit      none                   default
rpool/ROOT/pve-1  snapshot_limit        none                   default
rpool/ROOT/pve-1  filesystem_count      none                   default
rpool/ROOT/pve-1  snapshot_count        none                   default
rpool/ROOT/pve-1  snapdev               hidden                 default
rpool/ROOT/pve-1  acltype               off                    default
rpool/ROOT/pve-1  context               none                   default
rpool/ROOT/pve-1  fscontext             none                   default
rpool/ROOT/pve-1  defcontext            none                   default
rpool/ROOT/pve-1  rootcontext           none                   default
rpool/ROOT/pve-1  relatime              on                     temporary
rpool/ROOT/pve-1  redundant_metadata    all                    default
rpool/ROOT/pve-1  overlay               off                    default

Let me know if you need anything else.
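For reference, the churn that triggers it for us looks roughly like the sketch below. This is a dry run by default, not our actual tooling: `rpool/stress` and the `clone-$i` names are placeholders, and `RUN=echo` only prints the commands.

```shell
# Dry-run sketch of the snapshot/clone/destroy churn that triggers the hang.
# POOL and DS are placeholder dataset names. Leave RUN=echo to just print the
# commands; set RUN= (empty) to actually run them against a throwaway pool.
POOL=${POOL:-rpool}
DS=${DS:-$POOL/stress}
RUN=${RUN:-echo}

stress_cycle() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        $RUN zfs snapshot "$DS@s$i"
        $RUN zfs clone "$DS@s$i" "$POOL/clone-$i"
        $RUN zfs destroy -R "$DS@s$i"   # -R also destroys the dependent clone
        i=$((i + 1))
    done
}

stress_cycle 3
```

In production this runs continuously from several workers at once; the hang usually shows up within hours.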

Happens to me too on my Manjaro 4.19 desktop system with ZoL 0.7.12-11.
Really annoying: it hits once every few hours, and since my home directories (not root) are on the pool, a lot of things start hanging.
Worst of all, I can't even reboot cleanly, because shutdown never manages to unmount everything, so the reboot hangs as well.
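For anyone who wants to collect the stuck tasks themselves instead of waiting for the hung-task watchdog, a minimal sketch (assumes Linux procps `ps` and a readable `/proc`; reading `/proc/<pid>/stack` generally needs root):

```shell
# List tasks in uninterruptible sleep (state D) and, where readable, their
# kernel stacks: the same information the hung-task watchdog prints to dmesg.
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/ { print }'
for pid in $(ps -eo pid,stat | awk 'NR > 1 && $2 ~ /^D/ { print $1 }'); do
    echo "== pid $pid =="
    cat "/proc/$pid/stack" 2>/dev/null || echo "(need root to read /proc/$pid/stack)"
done
```

With root, `echo w > /proc/sysrq-trigger` dumps the same blocked-task stacks to dmesg in one shot.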

[87611.236429] INFO: task txg_sync:1006 blocked for more than 120 seconds.
[87611.236434]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87611.236435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87611.236437] txg_sync        D    0  1006      2 0x80000080
[87611.236440] Call Trace:
[87611.236448]  ? __schedule+0x29b/0x8b0
[87611.236482]  ? zio_taskq_dispatch+0x6f/0x90 [zfs]
[87611.236484]  schedule+0x32/0x90
[87611.236487]  io_schedule+0x12/0x40
[87611.236493]  cv_wait_common+0xaa/0x130 [spl]
[87611.236497]  ? wait_woken+0x80/0x80
[87611.236524]  zio_wait+0x113/0x1b0 [zfs]
[87611.236548]  dsl_pool_sync+0xb9/0x400 [zfs]
[87611.236576]  spa_sync+0x48b/0xd30 [zfs]
[87611.236604]  txg_sync_thread+0x2c6/0x4a0 [zfs]
[87611.236607]  ? _raw_spin_unlock_irq+0x1d/0x30
[87611.236633]  ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[87611.236636]  ? __thread_exit+0x20/0x20 [spl]
[87611.236639]  thread_generic_wrapper+0x6f/0x80 [spl]
[87611.236642]  kthread+0x112/0x130
[87611.236644]  ? kthread_park+0x80/0x80
[87611.236646]  ret_from_fork+0x22/0x40
[87611.236779] INFO: task TaskSchedulerFo:9759 blocked for more than 120 seconds.
[87611.236781]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87611.236782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87611.236783] TaskSchedulerFo D    0  9759   9346 0x00000080
[87611.236785] Call Trace:
[87611.236787]  ? __schedule+0x29b/0x8b0
[87611.236789]  schedule+0x32/0x90
[87611.236790]  io_schedule+0x12/0x40
[87611.236794]  cv_wait_common+0xaa/0x130 [spl]
[87611.236796]  ? wait_woken+0x80/0x80
[87611.236823]  zio_wait+0x113/0x1b0 [zfs]
[87611.236849]  zil_commit.part.8+0x85c/0x940 [zfs]
[87611.236875]  zfs_fsync+0x70/0xd0 [zfs]
[87611.236900]  zpl_fsync+0x63/0x90 [zfs]
[87611.236903]  ? __fget+0x6e/0xa0
[87611.236906]  do_fsync+0x38/0x70
[87611.236908]  __x64_sys_fdatasync+0x13/0x20
[87611.236910]  do_syscall_64+0x65/0x180
[87611.236912]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87611.236914] RIP: 0033:0x7f28924bf73f
[87611.236919] Code: Bad RIP value.
[87611.236920] RSP: 002b:00007f28810473e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87611.236922] RAX: ffffffffffffffda RBX: 0000000000000073 RCX: 00007f28924bf73f
[87611.236923] RDX: 0000000000000000 RSI: 000037ea5e9df880 RDI: 0000000000000073
[87611.236924] RBP: 00007f2881047430 R08: 0000000000000000 R09: 00000000f1b86835
[87611.236925] R10: 00000000f1b86835 R11: 0000000000000293 R12: 00007f28924bf700
[87611.236925] R13: 00007f2896b3a3f0 R14: 0000000000000000 R15: 000037ea5e9df8a0
[87734.114347] INFO: task txg_sync:1006 blocked for more than 120 seconds.
[87734.114351]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87734.114352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114353] txg_sync        D    0  1006      2 0x80000080
[87734.114355] Call Trace:
[87734.114363]  ? __schedule+0x29b/0x8b0
[87734.114392]  ? zio_taskq_dispatch+0x6f/0x90 [zfs]
[87734.114394]  schedule+0x32/0x90
[87734.114396]  io_schedule+0x12/0x40
[87734.114402]  cv_wait_common+0xaa/0x130 [spl]
[87734.114404]  ? wait_woken+0x80/0x80
[87734.114426]  zio_wait+0x113/0x1b0 [zfs]
[87734.114446]  dsl_pool_sync+0xb9/0x400 [zfs]
[87734.114468]  spa_sync+0x48b/0xd30 [zfs]
[87734.114491]  txg_sync_thread+0x2c6/0x4a0 [zfs]
[87734.114493]  ? _raw_spin_unlock_irq+0x1d/0x30
[87734.114514]  ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[87734.114517]  ? __thread_exit+0x20/0x20 [spl]
[87734.114519]  thread_generic_wrapper+0x6f/0x80 [spl]
[87734.114522]  kthread+0x112/0x130
[87734.114523]  ? kthread_park+0x80/0x80
[87734.114525]  ret_from_fork+0x22/0x40
[87734.114652] INFO: task TaskSchedulerFo:9759 blocked for more than 120 seconds.
[87734.114653]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87734.114654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114655] TaskSchedulerFo D    0  9759   9346 0x00000080
[87734.114656] Call Trace:
[87734.114658]  ? __schedule+0x29b/0x8b0
[87734.114660]  schedule+0x32/0x90
[87734.114661]  io_schedule+0x12/0x40
[87734.114664]  cv_wait_common+0xaa/0x130 [spl]
[87734.114666]  ? wait_woken+0x80/0x80
[87734.114687]  zio_wait+0x113/0x1b0 [zfs]
[87734.114708]  zil_commit.part.8+0x85c/0x940 [zfs]
[87734.114729]  zfs_fsync+0x70/0xd0 [zfs]
[87734.114750]  zpl_fsync+0x63/0x90 [zfs]
[87734.114753]  ? __fget+0x6e/0xa0
[87734.114755]  do_fsync+0x38/0x70
[87734.114756]  __x64_sys_fdatasync+0x13/0x20
[87734.114758]  do_syscall_64+0x65/0x180
[87734.114760]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87734.114762] RIP: 0033:0x7f28924bf73f
[87734.114766] Code: Bad RIP value.
[87734.114767] RSP: 002b:00007f28810473e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87734.114769] RAX: ffffffffffffffda RBX: 0000000000000073 RCX: 00007f28924bf73f
[87734.114769] RDX: 0000000000000000 RSI: 000037ea5e9df880 RDI: 0000000000000073
[87734.114770] RBP: 00007f2881047430 R08: 0000000000000000 R09: 00000000f1b86835
[87734.114771] R10: 00000000f1b86835 R11: 0000000000000293 R12: 00007f28924bf700
[87734.114771] R13: 00007f2896b3a3f0 R14: 0000000000000000 R15: 000037ea5e9df8a0
[87734.114773] INFO: task TaskSchedulerFo:24195 blocked for more than 120 seconds.
[87734.114774]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87734.114775] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114776] TaskSchedulerFo D    0 24195   9346 0x00000080
[87734.114777] Call Trace:
[87734.114779]  ? __schedule+0x29b/0x8b0
[87734.114781]  ? preempt_count_add+0x79/0xb0
[87734.114783]  schedule+0x32/0x90
[87734.114786]  cv_wait_common+0xf6/0x130 [spl]
[87734.114787]  ? wait_woken+0x80/0x80
[87734.114807]  zil_commit.part.8+0x80/0x940 [zfs]
[87734.114809]  ? _raw_spin_lock+0x13/0x40
[87734.114809]  ? _raw_spin_unlock+0x16/0x30
[87734.114812]  ? tsd_set+0x203/0x4f0 [spl]
[87734.114813]  ? preempt_count_add+0x79/0xb0
[87734.114833]  zfs_fsync+0x70/0xd0 [zfs]
[87734.114853]  zpl_fsync+0x63/0x90 [zfs]
[87734.114854]  ? __fget+0x6e/0xa0
[87734.114855]  do_fsync+0x38/0x70
[87734.114857]  __x64_sys_fdatasync+0x13/0x20
[87734.114858]  do_syscall_64+0x65/0x180
[87734.114859]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87734.114860] RIP: 0033:0x7f28924bf73f
[87734.114861] Code: Bad RIP value.
[87734.114862] RSP: 002b:00007f287f043cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87734.114863] RAX: ffffffffffffffda RBX: 000000000000005d RCX: 00007f28924bf73f
[87734.114864] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000005d
[87734.114864] RBP: 00007f287f043d40 R08: 0000000000000000 R09: 0000000000000080
[87734.114865] R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000000
[87734.114865] R13: d763a120f905d5d9 R14: 0000000000000000 R15: 0000000000001000
[87734.114878] INFO: task CIPCServer::Thr:13852 blocked for more than 120 seconds.
[87734.114879]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87734.114880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114881] CIPCServer::Thr D    0 13852  13418 0x20020080
[87734.114882] Call Trace:
[87734.114884]  ? __schedule+0x29b/0x8b0
[87734.114885]  ? __switch_to_asm+0x40/0x70
[87734.114886]  ? preempt_count_add+0x79/0xb0
[87734.114887]  schedule+0x32/0x90
[87734.114890]  cv_wait_common+0xf6/0x130 [spl]
[87734.114892]  ? wait_woken+0x80/0x80
[87734.114911]  zil_commit.part.8+0x80/0x940 [zfs]
[87734.114913]  ? _raw_spin_lock+0x13/0x40
[87734.114913]  ? _raw_spin_unlock+0x16/0x30
[87734.114916]  ? tsd_set+0x203/0x4f0 [spl]
[87734.114918]  ? preempt_count_add+0x79/0xb0
[87734.114937]  zfs_fsync+0x70/0xd0 [zfs]
[87734.114957]  zpl_fsync+0x63/0x90 [zfs]
[87734.114958]  ? __fget+0x6e/0xa0
[87734.114959]  do_fsync+0x38/0x70
[87734.114960]  __ia32_sys_fdatasync+0x13/0x20
[87734.114962]  do_fast_syscall_32+0xa7/0x2a0
[87734.114963]  entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[87856.992274] INFO: task txg_sync:1006 blocked for more than 120 seconds.
[87856.992279]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87856.992281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.992284] txg_sync        D    0  1006      2 0x80000080
[87856.992287] Call Trace:
[87856.992297]  ? __schedule+0x29b/0x8b0
[87856.992340]  ? zio_taskq_dispatch+0x6f/0x90 [zfs]
[87856.992343]  schedule+0x32/0x90
[87856.992347]  io_schedule+0x12/0x40
[87856.992354]  cv_wait_common+0xaa/0x130 [spl]
[87856.992358]  ? wait_woken+0x80/0x80
[87856.992396]  zio_wait+0x113/0x1b0 [zfs]
[87856.992431]  dsl_pool_sync+0xb9/0x400 [zfs]
[87856.992473]  spa_sync+0x48b/0xd30 [zfs]
[87856.992513]  txg_sync_thread+0x2c6/0x4a0 [zfs]
[87856.992516]  ? _raw_spin_unlock_irq+0x1d/0x30
[87856.992553]  ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[87856.992558]  ? __thread_exit+0x20/0x20 [spl]
[87856.992562]  thread_generic_wrapper+0x6f/0x80 [spl]
[87856.992565]  kthread+0x112/0x130
[87856.992567]  ? kthread_park+0x80/0x80
[87856.992570]  ret_from_fork+0x22/0x40
[87856.992714] INFO: task TaskSchedulerFo:9759 blocked for more than 120 seconds.
[87856.992717]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87856.992718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.992720] TaskSchedulerFo D    0  9759   9346 0x00000080
[87856.992722] Call Trace:
[87856.992725]  ? __schedule+0x29b/0x8b0
[87856.992728]  schedule+0x32/0x90
[87856.992730]  io_schedule+0x12/0x40
[87856.992735]  cv_wait_common+0xaa/0x130 [spl]
[87856.992738]  ? wait_woken+0x80/0x80
[87856.992775]  zio_wait+0x113/0x1b0 [zfs]
[87856.992811]  zil_commit.part.8+0x85c/0x940 [zfs]
[87856.992848]  zfs_fsync+0x70/0xd0 [zfs]
[87856.992884]  zpl_fsync+0x63/0x90 [zfs]
[87856.992888]  ? __fget+0x6e/0xa0
[87856.992891]  do_fsync+0x38/0x70
[87856.992893]  __x64_sys_fdatasync+0x13/0x20
[87856.992896]  do_syscall_64+0x65/0x180
[87856.992899]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87856.992901] RIP: 0033:0x7f28924bf73f
[87856.992907] Code: Bad RIP value.
[87856.992908] RSP: 002b:00007f28810473e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87856.992910] RAX: ffffffffffffffda RBX: 0000000000000073 RCX: 00007f28924bf73f
[87856.992911] RDX: 0000000000000000 RSI: 000037ea5e9df880 RDI: 0000000000000073
[87856.992912] RBP: 00007f2881047430 R08: 0000000000000000 R09: 00000000f1b86835
[87856.992913] R10: 00000000f1b86835 R11: 0000000000000293 R12: 00007f28924bf700
[87856.992914] R13: 00007f2896b3a3f0 R14: 0000000000000000 R15: 000037ea5e9df8a0
[87856.992918] INFO: task TaskSchedulerFo:24195 blocked for more than 120 seconds.
[87856.992920]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87856.992921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.992922] TaskSchedulerFo D    0 24195   9346 0x00000080
[87856.992924] Call Trace:
[87856.992927]  ? __schedule+0x29b/0x8b0
[87856.992930]  ? preempt_count_add+0x79/0xb0
[87856.992933]  schedule+0x32/0x90
[87856.992938]  cv_wait_common+0xf6/0x130 [spl]
[87856.992940]  ? wait_woken+0x80/0x80
[87856.992975]  zil_commit.part.8+0x80/0x940 [zfs]
[87856.992978]  ? _raw_spin_lock+0x13/0x40
[87856.992979]  ? _raw_spin_unlock+0x16/0x30
[87856.992984]  ? tsd_set+0x203/0x4f0 [spl]
[87856.992986]  ? preempt_count_add+0x79/0xb0
[87856.993021]  zfs_fsync+0x70/0xd0 [zfs]
[87856.993056]  zpl_fsync+0x63/0x90 [zfs]
[87856.993058]  ? __fget+0x6e/0xa0
[87856.993060]  do_fsync+0x38/0x70
[87856.993062]  __x64_sys_fdatasync+0x13/0x20
[87856.993064]  do_syscall_64+0x65/0x180
[87856.993067]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87856.993068] RIP: 0033:0x7f28924bf73f
[87856.993071] Code: Bad RIP value.
[87856.993071] RSP: 002b:00007f287f043cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87856.993073] RAX: ffffffffffffffda RBX: 000000000000005d RCX: 00007f28924bf73f
[87856.993074] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000005d
[87856.993075] RBP: 00007f287f043d40 R08: 0000000000000000 R09: 0000000000000080
[87856.993076] R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000000
[87856.993077] R13: d763a120f905d5d9 R14: 0000000000000000 R15: 0000000000001000
[87856.993092] INFO: task CIPCServer::Thr:13852 blocked for more than 120 seconds.
[87856.993094]       Tainted: P           OE     4.19.23-1-MANJARO #1
[87856.993096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.993097] CIPCServer::Thr D    0 13852  13418 0x20020084
[87856.993099] Call Trace:
[87856.993102]  ? __schedule+0x29b/0x8b0
[87856.993104]  ? __switch_to_asm+0x40/0x70
[87856.993106]  ? preempt_count_add+0x79/0xb0
[87856.993109]  schedule+0x32/0x90
[87856.993113]  cv_wait_common+0xf6/0x130 [spl]
[87856.993116]  ? wait_woken+0x80/0x80
[87856.993150]  zil_commit.part.8+0x80/0x940 [zfs]
[87856.993153]  ? _raw_spin_lock+0x13/0x40
[87856.993154]  ? _raw_spin_unlock+0x16/0x30
[87856.993159]  ? tsd_set+0x203/0x4f0 [spl]
[87856.993161]  ? preempt_count_add+0x79/0xb0
[87856.993196]  zfs_fsync+0x70/0xd0 [zfs]
[87856.993230]  zpl_fsync+0x63/0x90 [zfs]
[87856.993232]  ? __fget+0x6e/0xa0
[87856.993234]  do_fsync+0x38/0x70
[87856.993236]  __ia32_sys_fdatasync+0x13/0x20
[87856.993238]  do_fast_syscall_32+0xa7/0x2a0
[87856.993241]  entry_SYSCALL_compat_after_hwframe+0x45/0x4d 

The process is also still stuck in the D state, even after waiting far longer:

$ ps aux | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
root      1006  0.0  0.0      0     0 ?        D    Feb23   0:19 [txg_sync] 

My 32 GB of memory is still only 40% used.

The S.M.A.R.T. data of all disks is fine, and the unmount seems to get stuck on a different disk each time.

My pool also looks fine:

  pool: tank
 state: ONLINE
  scan: resilvered 11,8T in 13h51m with 0 errors on Wed Jan  2 02:00:28 2019
config:

    NAME                                              STATE READ WRITE CKSUM
    tank                                              ONLINE 0     0     0
      raidz2-0                                        ONLINE 0     0     0
        sdx_crypt10                                   ONLINE 0     0     0
        sdx_crypt11                                   ONLINE 0     0     0
        sdx_crypt7                                    ONLINE 0     0     0
        sdx_crypt6                                    ONLINE 0     0     0
    cache
      nvme-Samsung_SSD_960_EVO_500GB_S3EUNX0J215828E  ONLINE 0     0     0

errors: No known data errors

My pool consists of four 10 TB LUKS-encrypted disks, plus one M.2 NVMe drive as cache.

Here is my iostat -mx output:

iostat -mx
Linux 4.19.23-1-MANJARO (I-KNOW-YOU.torgato.de)     25.02.2019 _x86_64_    (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5,86    0,00    1,90    4,02    0,00   88,22

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
nvme1n1          0,66    3,77      0,02      0,10     0,01 2,92   0,91  43,67    0,33    1,19   0,61    24,97    26,17 136,18  60,30
nvme0n1          0,03    0,99      0,00      0,08     0,00 0,00   0,00   0,00    0,44    2,01   0,60    36,25    81,47 592,53  60,21
sda              0,22    4,78      0,01      0,10     0,00 0,03   0,00   0,71    4,66    2,96   0,00    47,71    21,38 0,43   0,22
sdb              0,22    4,80      0,01      0,10     0,00 0,03   0,00   0,72    4,58    3,01   0,00    47,46    21,32 0,43   0,22
sdc              0,22    4,78      0,01      0,10     0,00 0,03   0,00   0,72    4,63    2,96   0,00    47,53    21,37 0,43   0,22
sdd              0,00    0,00      0,00      0,00     0,00 0,00   0,00   0,00    1,69    0,00   0,00    19,63     0,00 1,65   0,00
sde              0,22    4,80      0,01      0,10     0,00 0,03   0,00   0,71    4,53    2,84   0,00    47,66    21,26 0,43   0,22
dm-0             0,51    4,54      0,01      0,10     0,00 0,00   0,00   0,00   23,68    0,73   0,02    20,92    22,54 2,59   1,31
dm-1             0,51    4,55      0,01      0,10     0,00 0,00   0,00   0,00   22,65    0,72   0,01    20,97    22,43 2,48   1,26
dm-2             0,51    4,55      0,01      0,10     0,00 0,00   0,00   0,00   24,22    0,73   0,02    20,96    22,48 2,65   1,34
dm-3             0,51    4,53      0,01      0,10     0,00 0,00   0,00   0,00   23,64    0,73   0,28    21,04    22,56 20,12  10,14 

Shutdown log:


Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: Stopped Cryptography Setup for sdx_crypt10.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt10 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:12:39 I-KNOW-YOU.torgato.de kernel: audit: type=1131 audit(1550880759.194:135): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt10 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd-cryptsetup[28103]: device-mapper: remove ioctl on sdx_crypt6  failed: Device or resource busy
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd-cryptsetup[28103]: Device sdx_crypt6 is still in use.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd-cryptsetup[28103]: Failed to deactivate: Device or resource busy
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: systemd-cryptsetup@sdx_crypt6.service: Control process exited, code=exited, status=1/FAILURE
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: systemd-cryptsetup@sdx_crypt6.service: Failed with result 'exit-code'.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: Stopped Cryptography Setup for sdx_crypt6.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt6 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:12:39 I-KNOW-YOU.torgato.de kernel: audit: type=1131 audit(1550880759.217:136): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt6 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28167]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28168]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28170]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28171]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28174]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28175]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28177]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28178]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28181]: umount: /tmp: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28182]: umount: /tank/Torge: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28184]: umount: /tank/Torge: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28185]: umount: /tmp: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28187]: umount: /tmp: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28188]: umount: /tank/Torge: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Forcibly rebooting: Ctrl-Alt-Del was pressed more than 7 times within 2s
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Shutting down.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de kernel: systemd-shutdow: 49 output lines suppressed due to ratelimiting
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd-shutdown[1]: Syncing filesystems and block devices. 
   HERE IT HANGS

Smart data sdb/sdx_crypt6

[I-KNOW-YOU log]# smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.23-1-MANJARO] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST10000VN0004-1ZD101
Serial Number:    ZA20SV9S
LU WWN Device Id: 5 000c50 09396796e
Firmware Version: SC60
User Capacity:    10.000.831.348.736 bytes [10,0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 25 00:35:55 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  584) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 881) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail Always       -       190984434
  3 Spin_Up_Time            0x0003   088   086   000    Pre-fail Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age Always       -       1689
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail Always       -       1383480407
  9 Power_On_Hours          0x0032   088   088   000    Old_age Always       -       10784 (194 247 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age Always       -       1028
184 End-to-End_Error        0x0032   100   100   099    Old_age Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age Always       -       0
188 Command_Timeout         0x0032   100   096   000    Old_age Always       -       2851901801115
189 High_Fly_Writes         0x003a   085   085   000    Old_age Always       -       15
190 Airflow_Temperature_Cel 0x0022   067   045   040    Old_age Always       -       33 (Min/Max 18/39)
191 G-Sense_Error_Rate      0x0032   099   099   000    Old_age Always       -       3678
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age Always       -       76
193 Load_Cycle_Count        0x0032   082   082   000    Old_age Always       -       37441
194 Temperature_Celsius     0x0022   033   055   000    Old_age Always       -       33 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   011   001   000    Old_age Always       -       190984434
197 Current_Pending_Sector  0x0012   100   100   000    Old_age Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age Offline      -       5122 (129 181 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age Offline      -       56499736146
242 Total_LBAs_Read         0x0000   100   253   000    Old_age Offline      -       254319596354

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay. 

Please, this is so damn annoying. Is there any data I can collect to help get to the bottom of this? Why is the kernel not even able to reboot?

Jonathon from the Manjaro team had me check the I/O scheduler.

I did, and it turned out that while ZFS had set the I/O scheduler of the dm-crypt devices to none (I'm using LUKS-encrypted disks), it did not manage to change the scheduler of the underlying physical sdX disks as well. They were set to mq-deadline instead.

I have now changed all of them to none as well, and so far the problem has not reappeared. It is far from long enough to be sure, but a notable amount of time has already passed and my hopes are up.

So if you have this issue and you use LUKS devices, check that the I/O scheduler is set to none on both layers, and if not, change it.

Update: the scheduler can also be called "noop" on your system.
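
For reference, a udev rule is one way to make this setting survive reboots. This is only a sketch: the file name, the `sd[a-z]` device match, and the `dm-*` match are my assumptions and should be adapted to your own disk layout.

```
# /etc/udev/rules.d/99-io-scheduler.rules  (hypothetical file name)
# Set the scheduler to "none" ("noop" on older kernels) for both the
# underlying physical disks and the dm-crypt mappings on top of them.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="dm-[0-9]*", ATTR{queue/scheduler}="none"
```

After adding the rule, `udevadm trigger` applies it without a reboot.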

(I'm using luks encrypted disks)

Hmmm, me too! Is there anybody here who does NOT use LUKS but still ran into this issue?

@CySlider do you use a ZFS pool on top of LUKS, or LUKS vdevs on top of ZFS?

check that the io scheduler is set to none

The scheduler for HDDs is called "noop", not "none". Maybe this is important.

(So I changed it to "noop" and am keeping my fingers crossed.)

On my system it is called "none", but you are right, it might also be called "noop". You can check the available modes with:
cat /sys/block/THE_DEVICE/queue/scheduler
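
A small sketch to list the active scheduler (the bracketed entry) for every block device at once, so you can check both the dm-crypt layer and the underlying disks in one go. The sysfs paths are the standard layout; `active_sched` is a helper name I made up.

```shell
#!/bin/sh
# Print "<sysfs path>: <active scheduler>" for every block device.
# sysfs shows the active scheduler in brackets, e.g. "[mq-deadline] none".

active_sched() {
    # Extract the bracketed (active) scheduler name from stdin.
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(active_sched < "$f")"
done
```

This covers both the "none" and "noop" naming, since it just reports whatever is bracketed.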

@denizzzka I am NOT using LUKS, but I am using ZFS native encryption.

The scheduler for all (12) devices is set to mq-deadline. I will set these to none and see what happens!

If you followed by guide to persist changes after reboot. I had an error in there and corrected the post. Make sure you did not copy the mistake.

(I'm using luks encrypted disks)

Hmmm, me too! Is there anybody here who does NOT use LUKS but still ran into this issue?

@CySlider do you use a ZFS pool on top of LUKS, or LUKS vdevs on top of ZFS?

I nearly missed your edit.

I am not sure which of your options it is.
I use LUKS transparently to ZFS: I set up each whole drive individually with LUKS, mounted them decrypted, and only after that added them as devices to ZFS.
I hope this answers your question.

I started adding a new partition into the rpool mirror.

While it was resilvering, I added a new pool and started transferring gigabyte-sized files to it from rpool. That caused this:
https://gist.github.com/denizzzka/9fb23dbeb6db3ba6074eed04d6da47bb

Probably everything would have hung if I had been using rpool, but rpool is only used for reads and was not affected.

Be aware that the message itself is not an indicator that something is going wrong. It could also just mean your disks were really busy for two minutes. It is only a problem if things get totally stuck, as they did in my case.

Be aware that the message itself is not an indicator that something is going wrong.

After that, the copying process just stalled...

Still hangs with noop queue scheduler for /dev/sd[a-z]

Still hangs with noop queue scheduler for /dev/sd[a-z]

Same result here. I switched over to bfq to see what happens.

The hangs reported here are very similar to ones we've seen reported in the past. They're caused by an I/O which is submitted to the kernel but, for some as-yet-unknown reason, neither completes, fails, nor times out. This results in the observed hang. Since the issue can't be reproduced easily, we have not been able to determine which component in the stack is responsible (ZFS, LUKS, the SCSI layer/driver, or the hardware).

To better understand how this is possible, we've added diagnostic and optional recovery code. These changes are not in the 0.7.13 release, but they are in the current 0.8.0-rc releases.

@bglod since you're running ZFS's native encryption you should also have this debugging. By default it's largely disabled, but if a hung I/O is detected a new deadman event will be posted. If you encounter this issue again, it would help our analysis if you could check zpool events -v for such an event.

If you'd like to enable the recovery logic, which attempts to retry the hung I/O, you'll need to set the zfs_deadman_failmode module parameter to continue. This should allow the system to continue operating once the zfs_deadman_ziotime_ms time (300 s by default) is reached.

       zfs_deadman_failmode (charp)
                   Controls the failure  behavior  when  the  deadman  detects  a
                   "hung" I/O.  Valid values are wait, continue, and panic.

                   wait - Wait for a "hung" I/O to complete.  For each "hung" I/O
                   a "deadman" event will be posted describing that I/O.

                   continue - Attempt to recover from a  "hung"  I/O  by  re-dis‐
                   patching it to the I/O pipeline if possible.

                   panic  -  Panic the system.  This can be used to facilitate an
                   automatic fail-over to a properly configured  fail-over  partner.

                   Default value: wait.

Furthermore, we want to identify the root cause of this issue, so if anyone running an 0.8.0-rc release is willing to provide additional debug logs, please let me know. There are a few more debugging options we can enable to get more detailed information about the outstanding I/Os.
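
One way to make the deadman settings from the excerpt above persistent across reboots is a modprobe configuration fragment. This is a sketch; the file name is conventional, not mandated, and the 300000 ms value simply restates the default ziotime mentioned above.

```
# /etc/modprobe.d/zfs.conf
# Retry hung I/Os instead of waiting on them forever, and post a
# deadman event once an I/O has been outstanding for 300000 ms (300 s).
options zfs zfs_deadman_failmode=continue zfs_deadman_ziotime_ms=300000
```

The same parameters can be changed at runtime through /sys/module/zfs/parameters/, as shown later in this thread.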

I did not see a hang while the scheduler was set to bfq for the last 10 days or so. This doesn't mean bfq is the way to go, of course. zpool events did not show any deadman events, but it seems the event log only goes back as far as the last reboot. Since I hadn't experienced another hang, it's understandable that there were no deadman entries.

I upgraded my ZFS from becdcec7b9cd6b8beaa2273495b97c06de24e9e5 to 060f0226e6396a3c7104fedc8d2af7063a27c1f9 (the latest for Arch's zfs-linux-git), set the zfs_deadman_failmode=continue module option, and changed the scheduler back to the default of mq-deadline.

I set up a script to periodically check for deadman events and e-mail me if any are found. I will report back with any findings.
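
Such a watcher can be sketched roughly like this. The event class string `ereport.fs.zfs.deadman`, the use of `mail(1)`, and the recipient are my assumptions, not taken from this thread; adapt them to your setup.

```shell
#!/bin/sh
# Sketch: e-mail the admin if any ZFS deadman events have been posted.
# Assumes deadman events show up as "ereport.fs.zfs.deadman" in the
# output of `zpool events`; MAILTO is a placeholder recipient.
MAILTO="root"

count_deadman() {
    # Count deadman event lines read from stdin.
    grep -c 'ereport.fs.zfs.deadman'
}

if command -v zpool >/dev/null 2>&1; then
    n=$(zpool events 2>/dev/null | count_deadman)
    if [ "${n:-0}" -gt 0 ]; then
        zpool events -v | mail -s "ZFS deadman events: $n" "$MAILTO"
    fi
fi
```

Run it from cron or a systemd timer at whatever interval you can tolerate.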

Update: still locks up. Running zpool events -v just hangs forever.

Hi, after upgrading to 0.8 I'm seeing a similar ZFS hang that I can reproduce with mdtest over NFS, as detailed in https://github.com/zfsonlinux/zfs/issues/7473

I provided further logs in the issue above, but let me know if there's anything else I can provide.

Once the system got stuck, I ran:

# echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode
But it remained stuck and no events were recorded. In this case txg_sync is in the D state, but there is no process consuming CPU on the node. Perhaps I had to set zfs_deadman_failmode before ZFS got stuck? Should I set zio_deadman_log_all to 1?


My current module parameters:

arc_summary


ZFS Subsystem Report Fri Mar 29 12:09:35 2019
ARC Summary: (HEALTHY)
Memory Throttle Count: 0

ARC Misc:
Deleted: 2.77M
Mutex Misses: 1.95k
Evict Skips: 583

ARC Size: 52.46% 237.05 GiB
Target Size: (Adaptive) 52.59% 237.63 GiB
Min Size (Hard Limit): 0.22% 1.00 GiB
Max Size (High Water): 451:1 451.88 GiB

ARC Size Breakdown:
Recently Used Cache Size: 16.31% 33.56 GiB
Frequently Used Cache Size: 83.69% 172.25 GiB

ARC Hash Breakdown:
Elements Max: 3.62M
Elements Current: 91.42% 3.31M
Collisions: 425.56k
Chain Max: 3
Chains: 79.41k

ARC Total accesses: 1.74G
Cache Hit Ratio: 99.93% 1.74G
Cache Miss Ratio: 0.07% 1.26M
Actual Hit Ratio: 99.92% 1.74G

Data Demand Efficiency:     99.46%  49.98M

CACHE HITS BY CACHE LIST:
  Most Recently Used:       5.55%   96.45M
  Most Frequently Used:     94.44%  1.64G
  Most Recently Used Ghost: 0.06%   1.04M
  Most Frequently Used Ghost:   0.00%   35.40k

CACHE HITS BY DATA TYPE:
  Demand Data:          2.86%   49.71M
  Prefetch Data:        0.00%   0
  Demand Metadata:      97.11%  1.69G
  Prefetch Metadata:        0.02%   425.38k

CACHE MISSES BY DATA TYPE:
  Demand Data:          21.32%  268.55k
  Prefetch Data:        0.00%   0
  Demand Metadata:      77.29%  973.76k
  Prefetch Metadata:        1.39%   17.57k

ZFS Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 8423723520
dbuf_cache_shift 5
dbuf_metadata_cache_max_bytes 4211861760
dbuf_metadata_cache_shift 6
dmu_object_alloc_chunk_shift 7
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_force_ganging 16777217
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_print_vdev_tree 0
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
spa_slop_shift 5
vdev_removal_max_span 32768
vdev_validate_skip 0
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_abd_scatter_min_size 1536
zfs_admin_snapshot 0
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 10
zfs_arc_dnode_reduce_percent 10
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 485206474752
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 75
zfs_arc_meta_min 0
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 1073741824
zfs_arc_min_prefetch_ms 0
zfs_arc_min_prescient_prefetch_ms 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_shrink_shift 0
zfs_arc_sys_free 0
zfs_async_block_max_blocks 100000
zfs_autoimport_disable 1
zfs_checksum_events_per_second 20
zfs_commit_timeout_pct 5
zfs_compressed_arc_enabled 1
zfs_condense_indirect_commit_entry_delay_ms 0
zfs_condense_indirect_vdevs_enable 1
zfs_condense_max_obsolete_bytes 1073741824
zfs_condense_min_mapping_bytes 131072
zfs_dbgmsg_enable 1
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_ddt_data_is_special 1
zfs_deadman_checktime_ms 60000
zfs_deadman_enabled 1
zfs_deadman_failmode continue
zfs_deadman_synctime_ms 600000
zfs_deadman_ziotime_ms 300000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delete_blocks 20480
zfs_dirty_data_max 4294967296
zfs_dirty_data_max_max 4294967296
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync_percent 20
zfs_disable_ivset_guid_check 0
zfs_dmu_offset_next_sync 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_initialize_value 16045690984833335022
zfs_key_max_salt_uses 400000000
zfs_lua_max_instrlimit 100000000
zfs_lua_max_memlimit 104857600
zfs_max_missing_tvds 0
zfs_max_recordsize 1048576
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_multihost_fail_intervals 10
zfs_multihost_history 0
zfs_multihost_import_intervals 20
zfs_multihost_interval 2000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_obsolete_min_time_ms 500
zfs_override_estimate_recordsize 0
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 5
zfs_prefetch_disable 1
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_reconstruct_indirect_combinations_max 4096
zfs_recover 0
zfs_recv_queue_length 16777216
zfs_removal_ignore_errors 0
zfs_removal_suspend_progress 0
zfs_remove_max_segment 16777216
zfs_resilver_disable_defer 0
zfs_resilver_min_time_ms 3000
zfs_scan_checkpoint_intval 7200
zfs_scan_fill_weight 3
zfs_scan_ignore_errors 0
zfs_scan_issue_strategy 0
zfs_scan_legacy 0
zfs_scan_max_ext_gap 2097152
zfs_scan_mem_lim_fact 20
zfs_scan_mem_lim_soft_fact 20
zfs_scan_strict_mem_lim 0
zfs_scan_suspend_progress 0
zfs_scan_vdev_limit 4194304
zfs_scrub_min_time_ms 1000
zfs_send_corrupt_data 0
zfs_send_queue_length 16777216
zfs_slow_io_events_per_second 20
zfs_spa_discard_memory_limit 16777216
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_txg_history 100
zfs_txg_timeout 5
zfs_unlink_suspend_progress 0
zfs_user_indirect_is_special 1
zfs_vdev_aggregation_limit 16777216
zfs_vdev_aggregation_limit_non_rotating 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_default_ms_count 200
zfs_vdev_initializing_max_active 1
zfs_vdev_initializing_min_active 1
zfs_vdev_max_active 1000
zfs_vdev_min_ms_count 16
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_ms_count_limit 131072
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_impl [fastest] original scalar sse2 ssse3 avx2 avx512f avx512bw
zfs_vdev_read_gap_limit 32768
zfs_vdev_removal_max_active 2
zfs_vdev_removal_min_active 1
zfs_vdev_scheduler deadline
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 1152
zfs_zil_clean_taskq_maxalloc 1048576
zfs_zil_clean_taskq_minalloc 1024
zfs_zil_clean_taskq_nthr_pct 100
zil_nocacheflush 0
zil_replay_disable 0
zil_slog_bulk 786432
zio_deadman_log_all 0
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_slow_io_ms 30000
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1

Can this issue be related with latest kernel issue "Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption"?
https://bugzilla.kernel.org/show_bug.cgi?id=201685

> Can this issue be related with latest kernel issue "Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption"?
> https://bugzilla.kernel.org/show_bug.cgi?id=201685

I don't think so, I'm using 'none' for all the (nvme) devices in this zpool:

# cat /sys/block/nvme*/queue/scheduler
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber

I managed to reproduce the hang on 0.8.0-rc3_126_g5dbf8b4ed, now with zpool events. Details in https://github.com/zfsonlinux/zfs/issues/7473

If reproducing this issue is still relevant, here is my way to reproduce it:

  1. Clone/download and build this nice compressor program https://github.com/byronknoll/cmix
  2. cd to a directory inside the pool and run cmix with some 10-100 MB file as the argument:
    ~/Dev/cmix/cmix -c 100Mb_file.bin result.cmix
  3. After a few minutes the system will hang

Cmix consumes a lot of RAM; maybe this is important.
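
For anyone without cmix handy, a rough fsync-heavy stand-in can be sketched in a few lines of Python. This is a hypothetical stress loop, not the cmix workload; the file name and sizes are arbitrary, and you would point it at a directory on the affected pool:

```python
import os
import tempfile

def fsync_stress(path, chunk=1 << 16, iterations=100):
    """Append `chunk` bytes and fsync() on every iteration."""
    data = b"\0" * chunk
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for _ in range(iterations):
            os.write(fd, data)
            os.fsync(fd)  # on ZFS each fsync() forces a ZIL commit
    finally:
        os.close(fd)
    return os.path.getsize(path)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(fsync_stress(os.path.join(d, "stress.bin"), iterations=10))
```

Running several of these loops in parallel approximates the "many fsync() callers" pattern that shows up in the zil_commit stacks above.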

Any updates? My issue is suddenly back after being gone for 6 weeks. Really annoying

New "ZFS sync hang" event, #2-2019 in five months:

kernel 4.14.105-gentoo-r1 #1 SMP (elevator=deadline)

SPL: Loaded module v0.7.12-r0-gentoo
ZFS: Loaded module v0.7.12-r0-gentoo, ZFS pool version 5000, ZFS filesystem version 5

3631 root 20 0 0 0 0 R 100.0 0.0 793:51.76 kworker/u32:4

Hang AGAIN, hogging 100% CPU.

Again while bacula was updating its postgres database, a 12 GB db on a ZFS filesystem.

Any sync command was hanging, and:

cat /proc/3631/stack
[] 0xffffffffffffffff

Same incident happened on this system:

24/11/2017 #1-2017
06/05/2018 #1-2018
06/05/2018 #2-2018
12/01/2019 #1-2019
17/05/2019 #2-2019

under different kernels and different ZFS/SPL versions, from v0.7.7-r0-gentoo to v0.7.12-r0-gentoo.

Just hoping it will help to debug this really annoying problem.

Only way to recover was a "reboot -f -n".

> Only way to recover was a "reboot -f -n".

All my hang events caused hangs of all GUIs and consoles too.

Reading other related "ZFS sync hanging" issues, a sync hang may be related to a hardware problem, for example a flaky cable.

Not my case! I had the same events on two different systems; I'm pretty sure it is not a hardware problem.

Apparently my systems had no other problem besides the "sync hanging".

The only relation I can see? They have, on the same disks, a mix of ext4 and ZFS pools.

Could this be the "smoking gun"?

Issue reproduced on my system after upgrading to Linux kernel 4.19.0-5-amd64 and ZFS sources version 0.8.0.

From my statistics, the issue shows up (when it shows up; it is really a rare event) ONLY on systems where there is a mix, on the same disks, of ZFS pools and ext3/4 filesystems.

I'm beginning to think the problem is not really a ZFS-only issue.

Maybe it is a kernel bug which shows up just when ZFS and ext3/4 I/O requests mix on the same system for the same disks?

In other words, when on the VERY SAME I/O queue there is a mix of I/O requests, for the same kernel worker, directed to the same disk, coming from an ext3/4 filesystem and from ZFS?

> ONLY on systems where there is a mix, on the same disks, of ZFS pools and ext3/4 filesystems.

My case also matches this: I use a ZFS rpool and a zvol which contains a LUKS ext4 /home partition.

I understand your layout, denizzzka. Mine is a little bit different: I have MBR-partitioned disks which contain native ZFS pools and native ext3/ext4 filesystems, no zvols at all.

But I guess the effect is the same: a mix of ext3/4 and ZFS I/O requests may be handled by the same kworker, which, from what we have seen, in some cases hangs for unknown reasons.

May I ask the others posting to this issue to report whether they have the same disk/filesystem layout?

Another piece of info: I had never seen this event on Linux kernels with version < 4.xx. I still have a couple of Linux 3.16.5 kernels, ZFS v0.6.3-r0-gentoo, with the very same layout, and I've never seen this hang happen.

Same for me, I have an ext4 root, everything else is ZFS.

No ext3/4 partitions here. Just vfat on /boot (because EFI) and ZFS for everything else.

I have only ZFS filesystems (no ext3/4); I use ZFS filesystems for samba shares. I suspect that the hang is related in some way to the reading/writing of extended attributes. It often happens even during backup activities (HP Data Protector). I have a RAID-Z2 pool with deduplication and compression, all identical SAS disks.

Alberto M. Fiaschi

Amusingly, I experienced the hang when I ran df -Th to make sure I hadn't forgotten any filesystems. (The command was fine after a reboot). Linux 5.1.7 with zfs 1ec7eda16.

How are you all getting these nice stack traces? My system completely locks up and I have to reboot it.

> My system completely locks up and I have to reboot it.

My system does exactly that too.

I would like to limit this question to people who have been able to trace the hang with a command like:

cat /proc/3631/stack

where "3631" is the pid of the kworker hogging the CPU, and who obtained a plain empty stack:

[] 0xffffffffffffffff
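
As a general answer to the "how do you get these stack traces" question: the blocked (D-state) candidates can be enumerated by scanning /proc, and then each one's /proc/<pid>/stack read as root. A minimal sketch, assuming Linux's /proc/<pid>/stat layout:

```python
import glob

def d_state_tasks():
    """Return (pid, comm) for every task in uninterruptible sleep ('D')."""
    hung = []
    for stat in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat) as f:
                raw = f.read()
        except OSError:
            continue  # task exited while we were scanning
        # comm may itself contain spaces, so split on the *last* ')'
        head, sep, tail = raw.rpartition(")")
        if not sep:
            continue
        comm = head.partition("(")[2]
        fields = tail.split()
        if fields and fields[0] == "D":
            hung.append((int(stat.split("/")[2]), comm))
    return hung

if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)  # follow up with: cat /proc/<pid>/stack (needs root)
```

Note that a kworker spinning at 100% CPU is usually in R state, not D, which is why its /proc/<pid>/stack often reads as just 0xffffffffffffffff; sampling it in a loop (as shown later in this thread) or using perf is more informative for that case.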

Otherwise we may mix problems here; this issue is beginning to look like a place where all the reasons for a ZFS hang get mixed together.

This specific "hang" doesn't actually hang anything other than a sync command; at least I've never seen anything different, even after a full night with a 100% CPU kworker running.

The system, apparently, runs correctly for everything else; only a specific sync command hangs, or any process calling sync() or fsync().

Nothing else.

And, on my systems (two of them), it is a really rare event.

It happens a couple of times per year on one system, and even less, once per year, on the other.

The difference between the two systems? The first one has a postgresql database on a ZFS filesystem (well, had; I moved it to ext4), the other one is just a plain ZFS file server.

The postgres daemon, I guess, calls fsync() each time it needs to be sure buffers are flushed to disk, so each time it has to update the database tables.

I never verified it, but I bet the number of syncs/day is definitely greater on the first than on the second.

And I bet the kworker begins to hog the CPU just after a system call to sync.

We have a backup server that has the same ZFS hang every two weeks. We're using BURP as our backup software. When we were running Ubuntu 16.04, the machine had no issues. Ever since migrating to Ubuntu 18.04, it hangs every two weeks. Pool is 4x 1TB SSD (EVO 850) in a RAIDZ-1. Installed ZFS version is Ubuntu package zfsutils-linux 0.7.5-1ubuntu16.4 amd64.

Symptoms: several backup processes (usually 5) end up in the state 'running' for several days, unable to do any file IO. Load will be at 6.0. A 'sync' command will not return. The server can only be rebooted by power cycling it. Kworker is at 100%, but iotop is showing no actual disk IO happening. This likely has to do with whatever housekeeping BURP is doing once a week.

If there's any output that might be useful for debugging this, let me know and I should be able to provide it in two weeks.

@PE1NUT Given how old this issue is, and the fact that you're seeing this issue with a 0.7.X release, you should probably post a new issue report once the problem happens again. If there are any blocked processes, please post their stacks (dmesg and/or sysrq-w) and also any interesting output from "zpool events" and/or "zpool events -v" (if "zpool events" shows anything out of the ordinary). It might also be useful to run "perf top" (and/or "perf top -ag") and see what the system is doing when it's spinning. Finally, I'd suggest that you might try an 0.8.X release if possible to see whether the problem persists.
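
For the next occurrence, the data collection suggested above can be scripted so nothing is forgotten while the hang is live. A minimal sketch (hypothetical helper; the log file name, the 300-line dmesg tail, and the 10-second perf sample are arbitrary choices; run as root, and steps are skipped quietly when a tool is unavailable):

```shell
#!/bin/sh
# One-shot collector for ZFS-hang diagnostics: sysrq-w, zpool events, perf top.
out="zfs-hang-$(date +%s).log"
{
  echo "== blocked-task stacks (sysrq-w) =="
  if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger   # dump blocked tasks into the kernel log
    dmesg | tail -n 300
  fi
  echo "== zpool events =="
  command -v zpool >/dev/null 2>&1 && zpool events -v
  echo "== perf top, 10 s sample =="
  command -v perf >/dev/null 2>&1 && timeout 10 perf top -ag --stdio
} > "$out" 2>&1 || true
echo "wrote $out"
```

Attaching the resulting log to a fresh issue report gives the maintainers the blocked-task stacks and the spinning-kworker profile in one place.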

Very similar issue in 0.8.1

root@pve2:/proc/21616# ps axu | grep " D "
root 13397 0.0 0.0 11548 4224 ? D 14:50 0:00 zfs rollback rpool/data/subvol-118-disk-0@AD_connect
root 21616 0.0 0.0 0 0 ? D Jul26 0:00 [kworker/u82:3+flush-zfs-20]
root 23085 0.0 0.0 6072 884 pts/28 S+ 15:28 0:00 grep D
root 31414 0.0 0.0 0 0 ? D Jul27 0:00 [arc_prune]
root@pve2:~# cat /proc/13397/stack
[<0>] cv_wait_common+0x104/0x130 [spl]
[<0>] __cv_wait+0x15/0x20 [spl]
[<0>] rrw_enter_write+0x3c/0xa0 [zfs]
[<0>] rrm_enter+0x2a/0x80 [zfs]
[<0>] zfsvfs_teardown+0x8b/0x2a0 [zfs]
[<0>] zfs_suspend_fs+0x10/0x20 [zfs]
[<0>] zfs_ioc_rollback+0xf8/0x180 [zfs]
[<0>] zfsdev_ioctl+0x1e0/0x8f0 [zfs]
[<0>] do_vfs_ioctl+0xa9/0x640
[<0>] ksys_ioctl+0x67/0x90
[<0>] __x64_sys_ioctl+0x1a/0x20
[<0>] do_syscall_64+0x5a/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
root@pve2:~# cat /proc/21616/stack
[<0>] io_schedule+0x16/0x40
[<0>] __lock_page+0x122/0x220
[<0>] write_cache_pages+0x23e/0x4d0
[<0>] zpl_writepages+0x94/0x170 [zfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x40/0x350
[<0>] writeback_sb_inodes+0x211/0x500
[<0>] __writeback_inodes_wb+0x67/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fb/0x3f0
[<0>] process_one_work+0x20f/0x410
[<0>] worker_thread+0x34/0x400
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
root@pve2:~# cat /proc/31414/stack
[<0>] __inode_wait_for_writeback+0xb9/0xf0
[<0>] inode_wait_for_writeback+0x26/0x40
[<0>] evict+0xb5/0x1a0
[<0>] iput+0x148/0x210
[<0>] zfs_prune+0x33c/0x380 [zfs]
[<0>] zpl_prune_sb+0x35/0x50 [zfs]
[<0>] arc_prune_task+0x22/0x40 [zfs]
[<0>] taskq_thread+0x2ec/0x4d0 [spl]
[<0>] kthread+0x120/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

dmesg
[794123.204085] INFO: task kworker/u82:3:21616 blocked for more than 120 seconds.
[794123.204122] Tainted: P O 5.0.15-1-pve #1
[794123.204138] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[794123.204160] kworker/u82:3 D 0 21616 2 0x80000000
[794123.204169] Workqueue: writeback wb_workfn (flush-zfs-20)
[794123.204170] Call Trace:
[794123.204178] __schedule+0x2d4/0x870
[794123.204180] schedule+0x2c/0x70
[794123.204184] io_schedule+0x16/0x40
[794123.204187] __lock_page+0x122/0x220
[794123.204189] ? file_check_and_advance_wb_err+0xe0/0xe0
[794123.204192] write_cache_pages+0x23e/0x4d0
[794123.204261] ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[794123.204263] ? _cond_resched+0x19/0x30
[794123.204273] ? mutex_lock+0x12/0x30
[794123.204302] ? rrw_exit+0x5e/0x150 [zfs]
[794123.204332] zpl_writepages+0x94/0x170 [zfs]
[794123.204334] do_writepages+0x41/0xd0
[794123.204335] ? __wb_calc_thresh+0x3e/0x130
[794123.204337] __writeback_single_inode+0x40/0x350
[794123.204338] writeback_sb_inodes+0x211/0x500
[794123.204340] __writeback_inodes_wb+0x67/0xb0
[794123.204342] wb_writeback+0x25f/0x2f0
[794123.204343] wb_workfn+0x2fb/0x3f0
[794123.204347] process_one_work+0x20f/0x410
[794123.204348] worker_thread+0x34/0x400
[794123.204351] kthread+0x120/0x140
[794123.204352] ? process_one_work+0x410/0x410
[794123.204354] ? __kthread_parkme+0x70/0x70
[794123.204355] ret_from_fork+0x35/0x40

root@pve2:~# cat /proc/spl/taskq
taskq act nthr spwn maxt pri mina maxa cura flags
arc_prune/0 1 40 0 40 120 40 2147483647 40 80000005
active: [31414]arc_prune_task zfs

Lockout happened 4 days ago, so no events or debug messages from SPL: they've rolled over.

(I wanted to write to #7473, but the backtraces here are more similar.)

This issue started appearing last week for us. Multiple of our servers have had similar issues since. It's very unclear what triggers this. We're running Ubuntu 16.04 with kernel 4.4.0-154 and zfs 0.7.13 (versions might differ a bit on other servers).

Hanging sync process:

170490  0.0  0.0   4360   604 ?        D    20:01   0:00  \_ sync
cat /proc/170490/stack
[<ffffffff812471c8>] wb_wait_for_completion+0x58/0xa0
[<ffffffff8124ac01>] sync_inodes_sb+0xa1/0x1e0
[<ffffffff81251345>] sync_inodes_one_sb+0x15/0x20
[<ffffffff8121ffe1>] iterate_supers+0xb1/0x100
[<ffffffff812516b4>] sys_sync+0x44/0xb0
[<ffffffff81863b9b>] entry_SYSCALL_64_fastpath+0x22/0xcb
[<ffffffffffffffff>] 0xffffffffffffffff

dmesg:

[Fri Aug  2 20:05:01 2019] INFO: task z_iput:976 blocked for more than 120 seconds.
[Fri Aug  2 20:05:01 2019]       Tainted: P           OE   4.4.0-154-generic #181-Ubuntu
[Fri Aug  2 20:05:01 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug  2 20:05:01 2019] z_iput          D ffff8810280f7c00     0   976      2 0x00000000
[Fri Aug  2 20:05:01 2019]  ffff8810280f7c00 0000000000017300 ffff88102a112d00 ffff8810262ae900
[Fri Aug  2 20:05:01 2019]  ffff8810280f8000 ffff8810717b1600 ffff8810280f7c98 ffffffff8185fa50
[Fri Aug  2 20:05:01 2019]  ffff8810280f7c80 ffff8810280f7c18 ffffffff8185f205 0000000000000002
[Fri Aug  2 20:05:01 2019] Call Trace:
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185fa50>] ? out_of_line_wait_on_atomic_t+0xf0/0xf0
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185f205>] schedule+0x35/0x80
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185fa61>] bit_wait+0x11/0x60
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185f65f>] __wait_on_bit+0x5f/0x90
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185eccd>] ? __schedule+0x30d/0x810
[Fri Aug  2 20:05:01 2019]  [<ffffffff81246788>] __inode_wait_for_writeback+0xa8/0xe0
[Fri Aug  2 20:05:01 2019]  [<ffffffff810ca520>] ? autoremove_wake_function+0x40/0x40
[Fri Aug  2 20:05:01 2019]  [<ffffffff8124d166>] inode_wait_for_writeback+0x26/0x40
[Fri Aug  2 20:05:01 2019]  [<ffffffff8123843a>] evict+0xaa/0x190
[Fri Aug  2 20:05:01 2019]  [<ffffffff81238731>] iput+0x1c1/0x240
[Fri Aug  2 20:05:01 2019]  [<ffffffffc05b0734>] taskq_thread+0x2b4/0x4b0 [spl]
[Fri Aug  2 20:05:01 2019]  [<ffffffff810b2330>] ? wake_up_q+0x70/0x70
[Fri Aug  2 20:05:01 2019]  [<ffffffffc05b0480>] ? task_done+0xb0/0xb0 [spl]
[Fri Aug  2 20:05:01 2019]  [<ffffffff810a65c7>] kthread+0xe7/0x100
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185ecc1>] ? __schedule+0x301/0x810
[Fri Aug  2 20:05:01 2019]  [<ffffffff810a64e0>] ? kthread_create_on_node+0x1e0/0x1e0
[Fri Aug  2 20:05:01 2019]  [<ffffffff81864025>] ret_from_fork+0x55/0x80
[Fri Aug  2 20:05:01 2019]  [<ffffffff810a64e0>] ? kthread_create_on_node+0x1e0/0x1e0
[Fri Aug  2 20:05:01 2019] INFO: task mysqld:27253 blocked for more than 120 seconds.
[Fri Aug  2 20:05:01 2019]       Tainted: P           OE   4.4.0-154-generic #181-Ubuntu
[Fri Aug  2 20:05:01 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug  2 20:05:01 2019] mysqld          D ffff880fbdd5bd38     0 27253  26238 0x00000120
[Fri Aug  2 20:05:01 2019]  ffff880fbdd5bd38 ffff880e000b0008 ffffffff81e13500 ffff880af9984b00
[Fri Aug  2 20:05:01 2019]  ffff880fbdd5c000 ffff880cb31df800 ffff880cb31df9d0 ffff880cb31df828
[Fri Aug  2 20:05:01 2019]  0000000000000000 ffff880fbdd5bd50 ffffffff8185f205 ffff880cb31df9c8
[Fri Aug  2 20:05:01 2019] Call Trace:
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185f205>] schedule+0x35/0x80
[Fri Aug  2 20:05:01 2019]  [<ffffffffc05b44e9>] cv_wait_common+0x109/0x140 [spl]
[Fri Aug  2 20:05:01 2019]  [<ffffffff810ca4e0>] ? wake_atomic_t_function+0x60/0x60
[Fri Aug  2 20:05:01 2019]  [<ffffffffc05b4535>] __cv_wait+0x15/0x20 [spl]
[Fri Aug  2 20:05:01 2019]  [<ffffffffc0908313>] zil_commit.part.12+0x83/0x870 [zfs]
[Fri Aug  2 20:05:01 2019]  [<ffffffffc05b5b1d>] ? tsd_set+0x30d/0x4d0 [spl]
[Fri Aug  2 20:05:01 2019]  [<ffffffff81861482>] ? mutex_lock+0x12/0x30
[Fri Aug  2 20:05:01 2019]  [<ffffffffc0908b17>] zil_commit+0x17/0x20 [zfs]
[Fri Aug  2 20:05:01 2019]  [<ffffffffc08fe897>] zfs_fsync+0x77/0xf0 [zfs]
[Fri Aug  2 20:05:01 2019]  [<ffffffffc09159c6>] zpl_fsync+0x66/0x90 [zfs]
[Fri Aug  2 20:05:01 2019]  [<ffffffff8125149e>] vfs_fsync_range+0x4e/0xb0
[Fri Aug  2 20:05:01 2019]  [<ffffffff81003938>] ? syscall_trace_enter_phase1+0xc8/0x140
[Fri Aug  2 20:05:01 2019]  [<ffffffff8125155d>] do_fsync+0x3d/0x70
[Fri Aug  2 20:05:01 2019]  [<ffffffff81251800>] SyS_fsync+0x10/0x20
[Fri Aug  2 20:05:01 2019]  [<ffffffff81863b9b>] entry_SYSCALL_64_fastpath+0x22/0xcb
[Fri Aug  2 20:05:01 2019] INFO: task sync:170490 blocked for more than 120 seconds.
[Fri Aug  2 20:05:01 2019]       Tainted: P           OE   4.4.0-154-generic #181-Ubuntu
[Fri Aug  2 20:05:01 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug  2 20:05:01 2019] sync            D ffff880f6e23fde0     0 170490  28296 0x00000124
[Fri Aug  2 20:05:01 2019]  ffff880f6e23fde0 ffff8809e8af5b20 ffff88102a114b00 ffff88003e7d5a00
[Fri Aug  2 20:05:01 2019]  ffff880f6e240000 ffff8809e8af5b48 ffff8809e8af5800 ffff880f6e23fe6c
[Fri Aug  2 20:05:01 2019]  ffff880cb31db870 ffff880f6e23fdf8 ffffffff8185f205 ffff880f6e23fe6c
[Fri Aug  2 20:05:01 2019] Call Trace:
[Fri Aug  2 20:05:01 2019]  [<ffffffff8185f205>] schedule+0x35/0x80
[Fri Aug  2 20:05:01 2019]  [<ffffffff812471c8>] wb_wait_for_completion+0x58/0xa0
[Fri Aug  2 20:05:01 2019]  [<ffffffff810ca4e0>] ? wake_atomic_t_function+0x60/0x60
[Fri Aug  2 20:05:01 2019]  [<ffffffff8124ac01>] sync_inodes_sb+0xa1/0x1e0
[Fri Aug  2 20:05:01 2019]  [<ffffffff81251330>] ? SyS_tee+0x3e0/0x3e0
[Fri Aug  2 20:05:01 2019]  [<ffffffff81251345>] sync_inodes_one_sb+0x15/0x20
[Fri Aug  2 20:05:01 2019]  [<ffffffff8121ffe1>] iterate_supers+0xb1/0x100
[Fri Aug  2 20:05:01 2019]  [<ffffffff812516b4>] sys_sync+0x44/0xb0
[Fri Aug  2 20:05:01 2019]  [<ffffffff81863b9b>] entry_SYSCALL_64_fastpath+0x22/0xcb

kworker thread: most of the time its stack is just 0xffffffffffffffff, but continuously printing the stack shows this:

# while true; do cat /proc/170492/stack; echo ; done

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc0872eee>] dnode_hold_impl+0x15e/0xba0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc0871844>] dmu_zfetch+0x364/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc0872e56>] dnode_hold_impl+0xc6/0xba0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc0872e25>] dnode_hold_impl+0x95/0xba0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087159f>] dmu_zfetch+0xbf/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc0872e25>] dnode_hold_impl+0x95/0xba0 [zfs]
[<ffffffffc087394b>] dnode_hold+0x1b/0x20 [zfs]
[<ffffffffc085cec2>] dmu_bonus_hold+0x32/0x1a0 [zfs]
[<ffffffffc089d70e>] sa_buf_hold+0xe/0x10 [zfs]
[<ffffffffc0904e0f>] zfs_zget+0x11f/0x240 [zfs]
[<ffffffffc0900607>] zfs_get_data+0x57/0x2c0 [zfs]
[<ffffffffc0908683>] zil_commit.part.12+0x3f3/0x870 [zfs]
[<ffffffffc0908b17>] zil_commit+0x17/0x20 [zfs]
[<ffffffffc0915816>] zpl_writepages+0xd6/0x170 [zfs]
[<ffffffff811a7661>] do_writepages+0x21/0x30
[<ffffffff8124bcc5>] __writeback_single_inode+0x45/0x350
[<ffffffff8124c54e>] writeback_sb_inodes+0x2ae/0x5a0
[<ffffffff8124caae>] wb_writeback+0xee/0x310
[<ffffffff8124d242>] wb_workfn+0xc2/0x3a0
[<ffffffff8109fb2b>] process_one_work+0x16b/0x4e0
[<ffffffff8109feee>] worker_thread+0x4e/0x590
[<ffffffff810a65c7>] kthread+0xe7/0x100
[<ffffffff81864025>] ret_from_fork+0x55/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc0872e25>] dnode_hold_impl+0x95/0xba0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc0872e56>] dnode_hold_impl+0xc6/0xba0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087159f>] dmu_zfetch+0xbf/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffc08730db>] dnode_hold_impl+0x34b/0xba0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffc0853bd9>] dbuf_read+0x6b9/0x940 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc087163c>] dmu_zfetch+0x15c/0x520 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

...

[<ffffffffc085279f>] dbuf_find+0x1bf/0x1d0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc085279f>] dbuf_find+0x1bf/0x1d0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

...

[<ffffffffc087303b>] dnode_hold_impl+0x2ab/0xba0 [zfs]
[<ffffffffc087394b>] dnode_hold+0x1b/0x20 [zfs]
[<ffffffffc085cec2>] dmu_bonus_hold+0x32/0x1a0 [zfs]
[<ffffffffc089d70e>] sa_buf_hold+0xe/0x10 [zfs]
[<ffffffffc0904e0f>] zfs_zget+0x11f/0x240 [zfs]
[<ffffffffc0900607>] zfs_get_data+0x57/0x2c0 [zfs]
[<ffffffffc0908683>] zil_commit.part.12+0x3f3/0x870 [zfs]
[<ffffffffc0908b17>] zil_commit+0x17/0x20 [zfs]
[<ffffffffc0915816>] zpl_writepages+0xd6/0x170 [zfs]
[<ffffffff811a7661>] do_writepages+0x21/0x30
[<ffffffff8124bcc5>] __writeback_single_inode+0x45/0x350
[<ffffffff8124c54e>] writeback_sb_inodes+0x2ae/0x5a0
[<ffffffff8124caae>] wb_writeback+0xee/0x310
[<ffffffff8124d242>] wb_workfn+0xc2/0x3a0
[<ffffffff8109fb2b>] process_one_work+0x16b/0x4e0
[<ffffffff8109feee>] worker_thread+0x4e/0x590
[<ffffffff810a65c7>] kthread+0xe7/0x100
[<ffffffff81864025>] ret_from_fork+0x55/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc084e42f>] arc_buf_access+0xbf/0x290 [zfs]
[<ffffffffc0855aa4>] __dbuf_hold_impl+0xd4/0x570 [zfs]
[<ffffffffc0855fe1>] dbuf_hold_impl+0xa1/0xd0 [zfs]
[<ffffffffc08560a6>] dbuf_hold+0x36/0x60 [zfs]
[<ffffffffc0872e25>] dnode_hold_impl+0x95/0xba0 [zfs]
[<ffffffffc087394b>] dnode_hold+0x1b/0x20 [zfs]
[<ffffffffc085cec2>] dmu_bonus_hold+0x32/0x1a0 [zfs]
[<ffffffffc089d70e>] sa_buf_hold+0xe/0x10 [zfs]
[<ffffffffc0904e0f>] zfs_zget+0x11f/0x240 [zfs]
[<ffffffffc0900607>] zfs_get_data+0x57/0x2c0 [zfs]
[<ffffffffc0908683>] zil_commit.part.12+0x3f3/0x870 [zfs]
[<ffffffffc0908b17>] zil_commit+0x17/0x20 [zfs]
[<ffffffffc0915816>] zpl_writepages+0xd6/0x170 [zfs]
[<ffffffff811a7661>] do_writepages+0x21/0x30
[<ffffffff8124bcc5>] __writeback_single_inode+0x45/0x350
[<ffffffff8124c54e>] writeback_sb_inodes+0x2ae/0x5a0
[<ffffffff8124caae>] wb_writeback+0xee/0x310
[<ffffffff8124d242>] wb_workfn+0xc2/0x3a0
[<ffffffff8109fb2b>] process_one_work+0x16b/0x4e0
[<ffffffff8109feee>] worker_thread+0x4e/0x590
[<ffffffff810a65c7>] kthread+0xe7/0x100
[<ffffffff81864025>] ret_from_fork+0x55/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

...

[<ffffffffc09011ee>] zfs_znode_hold_exit+0x6e/0x130 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

...

[<ffffffffc0855fe1>] dbuf_hold_impl+0xa1/0xd0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc0855ff1>] dbuf_hold_impl+0xb1/0xd0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffc0855fe1>] dbuf_hold_impl+0xa1/0xd0 [zfs]
[<ffffffffffffffff>] 0xffffffffffffffff

z_iput process in D state:

cat /proc/976/stack
[<ffffffff81246788>] __inode_wait_for_writeback+0xa8/0xe0
[<ffffffff8124d166>] inode_wait_for_writeback+0x26/0x40
[<ffffffff8123843a>] evict+0xaa/0x190
[<ffffffff81238731>] iput+0x1c1/0x240
[<ffffffffc05b0734>] taskq_thread+0x2b4/0x4b0 [spl]
[<ffffffff810a65c7>] kthread+0xe7/0x100
[<ffffffff81864025>] ret_from_fork+0x55/0x80
[<ffffffffffffffff>] 0xffffffffffffffff

Hello, same issue over here. That's way too troublesome, since this issue appears on LXC container guests too.

Same sync stack here:

# cat /proc/2091977/stack 
[<0>] wb_wait_for_completion+0x64/0x90
[<0>] sync_inodes_sb+0xa6/0x280
[<0>] sync_inodes_one_sb+0x15/0x20
[<0>] iterate_supers+0xbf/0x110
[<0>] sys_sync+0x42/0xb0
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

Nothing appears in dmesg, should I look somewhere specifically? My /proc/sys/kernel/hung_task_timeout_secs is set to 120.

Since dpkg uses the sync command, a dirty workaround is this one:

# mv /bin/sync /bin/sync.back
# echo -e "#!/bin/bash\n\ntrue" > /bin/sync
# chmod +x /bin/sync

Here are some more logs from sysrq

[1143540.087357] Workqueue: writeback wb_workfn (flush-zfs-128)
[1143540.088092] Call Trace:
[1143540.088860]  __schedule+0x2d4/0x870
[1143540.089525]  schedule+0x2c/0x70
[1143540.090187]  io_schedule+0x16/0x40
[1143540.090850]  __lock_page+0x122/0x220
[1143540.091499]  ? file_check_and_advance_wb_err+0xe0/0xe0
[1143540.092204]  write_cache_pages+0x23e/0x4d0
[1143540.092952]  ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[1143540.093607]  zpl_writepages+0x94/0x170 [zfs]
[1143540.094213]  do_writepages+0x41/0xd0
[1143540.094812]  ? __wb_calc_thresh+0x3e/0x130
[1143540.095404]  __writeback_single_inode+0x40/0x350
[1143540.096050]  writeback_sb_inodes+0x211/0x500
[1143540.096683]  __writeback_inodes_wb+0x67/0xb0
[1143540.097255]  wb_writeback+0x25f/0x2f0
[1143540.097841]  wb_workfn+0x2fb/0x3f0
[1143540.098393]  ? __switch_to+0x96/0x4e0
[1143540.098939]  process_one_work+0x20f/0x410
[1143540.099473]  worker_thread+0x34/0x400
[1143540.100056]  kthread+0x120/0x140
[1143540.100631]  ? process_one_work+0x410/0x410
[1143540.101141]  ? __kthread_parkme+0x70/0x70
[1143540.101644]  ret_from_fork+0x35/0x40
[1143540.102203] umount          D    0  6501      1 0x00000000
[1143540.102693] Call Trace:
[1143540.103171]  __schedule+0x2d4/0x870
[1143540.103642]  ? __schedule+0x2dc/0x870
[1143540.104158]  schedule+0x2c/0x70
[1143540.104666]  rwsem_down_write_failed+0x160/0x340
[1143540.105116]  ? mnt_get_count+0x3d/0x50
[1143540.105555]  call_rwsem_down_write_failed+0x17/0x30
[1143540.105991]  down_write+0x2d/0x40
[1143540.106419]  deactivate_super+0x49/0x60
[1143540.106835]  cleanup_mnt+0x3f/0x80
[1143540.107242]  __cleanup_mnt+0x12/0x20
[1143540.107642]  task_work_run+0x9d/0xc0
[1143540.108085]  exit_to_usermode_loop+0xf2/0x100
[1143540.108523]  do_syscall_64+0xf0/0x110
[1143540.108898]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1143540.109268] RIP: 0033:0x7f873d0bb507
[1143540.109644] Code: Bad RIP value.
[1143540.110012] RSP: 002b:00007ffc5937b558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[1143540.110400] RAX: 0000000000000000 RBX: 0000561b10853970 RCX: 00007f873d0bb507
[1143540.110788] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000561b108651f0
[1143540.111177] RBP: 0000000000000000 R08: 0000561b10865220 R09: 00007f873d13ce80
[1143540.111563] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561b108651f0
[1143540.112000] R13: 00007f873d1e11c4 R14: 0000561b10853a68 R15: 0000561b10853b80
[1143601.803905] sysrq: SysRq : Show Blocked State
[1143601.803938]   task                        PC stack   pid father
[1143601.805241] kworker/u96:1   D    0 30396      2 0x80000000
[1143601.805264] Workqueue: writeback wb_workfn (flush-zfs-128)
[1143601.805282] Call Trace:
[1143601.805300]  __schedule+0x2d4/0x870
[1143601.805314]  schedule+0x2c/0x70
[1143601.805328]  io_schedule+0x16/0x40
[1143601.805342]  __lock_page+0x122/0x220
[1143601.805354]  ? file_check_and_advance_wb_err+0xe0/0xe0
[1143601.805372]  write_cache_pages+0x23e/0x4d0
[1143601.805449]  ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[1143601.805500]  zpl_writepages+0x94/0x170 [zfs]
[1143601.805515]  do_writepages+0x41/0xd0
[1143601.805528]  ? __wb_calc_thresh+0x3e/0x130
[1143601.805543]  __writeback_single_inode+0x40/0x350
[1143601.805558]  writeback_sb_inodes+0x211/0x500
[1143601.805573]  __writeback_inodes_wb+0x67/0xb0
[1143601.805588]  wb_writeback+0x25f/0x2f0
[1143601.805601]  wb_workfn+0x2fb/0x3f0
[1143601.805614]  ? __switch_to+0x96/0x4e0
[1143601.805628]  process_one_work+0x20f/0x410
[1143601.805641]  worker_thread+0x34/0x400
[1143601.806111]  kthread+0x120/0x140
[1143601.806651]  ? process_one_work+0x410/0x410
[1143601.807148]  ? __kthread_parkme+0x70/0x70
[1143601.807585]  ret_from_fork+0x35/0x40
[1143601.808074] umount          D    0  6501      1 0x00000000
[1143601.808501] Call Trace:
[1143601.808921]  __schedule+0x2d4/0x870
[1143601.809343]  ? __schedule+0x2dc/0x870
[1143601.809765]  schedule+0x2c/0x70
[1143601.810176]  rwsem_down_write_failed+0x160/0x340
[1143601.810622]  ? mnt_get_count+0x3d/0x50
[1143601.811087]  call_rwsem_down_write_failed+0x17/0x30
[1143601.811506]  down_write+0x2d/0x40
[1143601.811925]  deactivate_super+0x49/0x60
[1143601.812344]  cleanup_mnt+0x3f/0x80
[1143601.812759]  __cleanup_mnt+0x12/0x20
[1143601.813170]  task_work_run+0x9d/0xc0
[1143601.813577]  exit_to_usermode_loop+0xf2/0x100
[1143601.813975]  do_syscall_64+0xf0/0x110
[1143601.814403]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1143601.814841] RIP: 0033:0x7f873d0bb507
[1143601.815222] Code: Bad RIP value.
[1143601.815597] RSP: 002b:00007ffc5937b558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[1143601.815987] RAX: 0000000000000000 RBX: 0000561b10853970 RCX: 00007f873d0bb507
[1143601.816379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000561b108651f0
[1143601.816763] RBP: 0000000000000000 R08: 0000561b10865220 R09: 00007f873d13ce80
[1143601.817142] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561b108651f0
[1143601.817518] R13: 00007f873d1e11c4 R14: 0000561b10853a68 R15: 0000561b10853b80
[1143733.095150] sysrq: SysRq : Show Blocked State
[1143733.095724]   task                        PC stack   pid father
[1143733.097469] kworker/u96:1   D    0 30396      2 0x80000000
[1143733.097887] Workqueue: writeback wb_workfn (flush-zfs-128)
[1143733.098294] Call Trace:
[1143733.098703]  __schedule+0x2d4/0x870
[1143733.099106]  schedule+0x2c/0x70
[1143733.099568]  io_schedule+0x16/0x40
[1143733.100052]  __lock_page+0x122/0x220
[1143733.100447]  ? file_check_and_advance_wb_err+0xe0/0xe0
[1143733.100849]  write_cache_pages+0x23e/0x4d0
[1143733.101314]  ? zpl_write_common_iovec+0xf0/0xf0 [zfs]
[1143733.101752]  zpl_writepages+0x94/0x170 [zfs]
[1143733.102155]  do_writepages+0x41/0xd0
[1143733.102558]  ? __wb_calc_thresh+0x3e/0x130
[1143733.102966]  __writeback_single_inode+0x40/0x350
[1143733.103408]  writeback_sb_inodes+0x211/0x500
[1143733.103849]  __writeback_inodes_wb+0x67/0xb0
[1143733.104259]  wb_writeback+0x25f/0x2f0
[1143733.104668]  wb_workfn+0x2fb/0x3f0
[1143733.105076]  ? __switch_to+0x96/0x4e0
[1143733.105485]  process_one_work+0x20f/0x410
[1143733.105889]  worker_thread+0x34/0x400
[1143733.106293]  kthread+0x120/0x140
[1143733.106694]  ? process_one_work+0x410/0x410
[1143733.107099]  ? __kthread_parkme+0x70/0x70
[1143733.107533]  ret_from_fork+0x35/0x40
[1143733.108038] umount          D    0  6501      1 0x00000000
[1143733.108446] Call Trace:
[1143733.108849]  __schedule+0x2d4/0x870
[1143733.109251]  ? __schedule+0x2dc/0x870
[1143733.109652]  schedule+0x2c/0x70
[1143733.110051]  rwsem_down_write_failed+0x160/0x340
[1143733.110457]  ? mnt_get_count+0x3d/0x50
[1143733.110863]  call_rwsem_down_write_failed+0x17/0x30
[1143733.111273]  down_write+0x2d/0x40
[1143733.111721]  deactivate_super+0x49/0x60
[1143733.112191]  cleanup_mnt+0x3f/0x80
[1143733.112594]  __cleanup_mnt+0x12/0x20
[1143733.112996]  task_work_run+0x9d/0xc0
[1143733.113390]  exit_to_usermode_loop+0xf2/0x100
[1143733.113778]  do_syscall_64+0xf0/0x110
[1143733.114163]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1143733.114554] RIP: 0033:0x7f873d0bb507
[1143733.114945] Code: Bad RIP value.
[1143733.115331] RSP: 002b:00007ffc5937b558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[1143733.115743] RAX: 0000000000000000 RBX: 0000561b10853970 RCX: 00007f873d0bb507
[1143733.116197] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000561b108651f0
[1143733.116589] RBP: 0000000000000000 R08: 0000561b10865220 R09: 00007f873d13ce80
[1143733.116972] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561b108651f0
[1143733.117354] R13: 00007f873d1e11c4 R14: 0000561b10853a68 R15: 0000561b10853b80

Looks like at least some of these hangs are fixed by https://github.com/zfsonlinux/zfs/pull/9203

One way to tell whether you've been bitten by the above deadlock is to check the output of /proc/2/stack
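For instance, checking a task's kernel stack can be done like this (pid 2 comes from the comment above; substitute the pid of whatever task is stuck in D state, and note that reading /proc/&lt;pid&gt;/stack requires root):

```shell
# Print the kernel stack of a suspect task; requires root.
# Pid 2 (kthreadd) is taken from the comment above; substitute the pid
# of the task stuck in D state on your system.
cat /proc/2/stack 2>/dev/null || echo "need root (or pid is gone)"
```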

We think we've been hitting this bug for about a year and a half. We've noticed that running 'zfs list' unblocks ZFS, so running 'while true; do zfs list; sleep 5m; done' in a screen/tmux session keeps the system usable.
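As a small sketch, the workaround described above might look like this (the single demonstration pass is illustrative; the real loop runs forever with a 5-minute sleep inside screen/tmux):

```shell
#!/bin/sh
# Workaround reported above: periodically running `zfs list` appears to
# unblock a hung pool for some users. Errors are tolerated so the loop
# survives even while ZFS is wedged (or `zfs` is absent entirely).
nudge_pool() {
    zfs list >/dev/null 2>&1 || true
}

# One pass here for illustration; the reporter effectively ran
#   while true; do zfs list; sleep 5m; done
# inside a screen/tmux session.
nudge_pool
```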

Doing a 'zfs list' certainly doesn't unlock a hung system for us.

As usual, our backup/burp server is again stuck in zfs. Load is 6.00, but no actual disk IO is happening. OS is Ubuntu Bionic (18.04.2 LTS), which comes with ZFS 0.7.5-1ubuntu16.4.

ps axu | grep " D " lists 5 burp-server instances.
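A slightly more precise way to list D-state tasks than grepping `ps axu` output, matching on the state column instead (the wchan column, showing the kernel symbol each task is blocked in, is often the most useful part of a hang report):

```shell
# List uninterruptible (D-state) tasks with pid, wait channel and command.
# Matching on the state field avoids false positives from grepping " D "
# out of the full `ps axu` output.
ps -eo state,pid,wchan:32,comm | awk 'NR == 1 || $1 ~ /^D/'
```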

dmesg shows each of these burp processes as a hung task. As they all show a different stack trace, I've included them all.

[Tue Sep 17 00:14:52 2019] INFO: task burp:11641 blocked for more than 120 seconds.
[Tue Sep 17 00:14:52 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:14:52 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:14:52 2019] burp D 0 11641 6260 0x00000000
[Tue Sep 17 00:14:52 2019] Call Trace:
[Tue Sep 17 00:14:52 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:14:52 2019] ? atomic_t_wait+0x60/0x60
[Tue Sep 17 00:14:52 2019] schedule+0x2c/0x80
[Tue Sep 17 00:14:52 2019] bit_wait+0x11/0x60
[Tue Sep 17 00:14:52 2019] __wait_on_bit+0x4c/0x90
[Tue Sep 17 00:14:52 2019] ? atomic_t_wait+0x60/0x60
[Tue Sep 17 00:14:52 2019] __inode_wait_for_writeback+0xb9/0xf0
[Tue Sep 17 00:14:52 2019] ? bit_waitqueue+0x40/0x40
[Tue Sep 17 00:14:52 2019] inode_wait_for_writeback+0x26/0x40
[Tue Sep 17 00:14:52 2019] evict+0xb5/0x1a0
[Tue Sep 17 00:14:52 2019] iput+0x156/0x220
[Tue Sep 17 00:14:52 2019] do_unlinkat+0x19c/0x320
[Tue Sep 17 00:14:52 2019] SyS_unlink+0x1f/0x30
[Tue Sep 17 00:14:52 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:14:52 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:14:52 2019] RIP: 0033:0x7f8cee52fd47
[Tue Sep 17 00:14:52 2019] RSP: 002b:00007ffdbed6f698 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[Tue Sep 17 00:14:52 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee52fd47
[Tue Sep 17 00:14:52 2019] RDX: 0000000000000000 RSI: 000055a273edc170 RDI: 000055a273efee00
[Tue Sep 17 00:14:52 2019] RBP: 00007ffdbed6f8e0 R08: 00007f8ceebbe110 R09: 0000000000000040
[Tue Sep 17 00:14:52 2019] R10: 00007f8cee809ca0 R11: 0000000000000206 R12: 000055a273eeeeb0
[Tue Sep 17 00:14:52 2019] R13: 000055a273efc070 R14: 000055a273eee4a0 R15: 0000000000000001
[Tue Sep 17 00:14:52 2019] INFO: task burp:31256 blocked for more than 120 seconds.
[Tue Sep 17 00:14:52 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:14:52 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:14:52 2019] burp D 0 31256 6260 0x00000000
[Tue Sep 17 00:14:52 2019] Call Trace:
[Tue Sep 17 00:14:52 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:14:52 2019] schedule+0x2c/0x80
[Tue Sep 17 00:14:52 2019] wb_wait_for_completion+0x64/0x90
[Tue Sep 17 00:14:52 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:14:52 2019] sync_inodes_sb+0xc7/0x2b0
[Tue Sep 17 00:14:52 2019] sync_inodes_one_sb+0x16/0x20
[Tue Sep 17 00:14:52 2019] iterate_supers+0xa2/0x110
[Tue Sep 17 00:14:52 2019] ? SyS_tee+0x3d0/0x3d0
[Tue Sep 17 00:14:52 2019] sys_sync+0x42/0xb0
[Tue Sep 17 00:14:52 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:14:52 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:14:52 2019] RIP: 0033:0x7f8cee535287
[Tue Sep 17 00:14:52 2019] RSP: 002b:00007ffdbed6fae8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[Tue Sep 17 00:14:52 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee535287
[Tue Sep 17 00:14:52 2019] RDX: 000055a273eaa010 RSI: 000055a273eaa010 RDI: 000055a273ef6060
[Tue Sep 17 00:14:52 2019] RBP: 00007ffdbed6fb48 R08: 00007f8cf015ac00 R09: 000055a271edf774
[Tue Sep 17 00:14:52 2019] R10: 000055a273ef12a0 R11: 0000000000000246 R12: 0000000000000000
[Tue Sep 17 00:14:52 2019] R13: 000055a273edab60 R14: 000055a273efc0a0 R15: 0000000000000000
[Tue Sep 17 00:16:53 2019] INFO: task burp:11641 blocked for more than 120 seconds.
[Tue Sep 17 00:16:53 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:16:53 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:16:53 2019] burp D 0 11641 6260 0x00000000
[Tue Sep 17 00:16:53 2019] Call Trace:
[Tue Sep 17 00:16:53 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:16:53 2019] ? atomic_t_wait+0x60/0x60
[Tue Sep 17 00:16:53 2019] schedule+0x2c/0x80
[Tue Sep 17 00:16:53 2019] bit_wait+0x11/0x60
[Tue Sep 17 00:16:53 2019] __wait_on_bit+0x4c/0x90
[Tue Sep 17 00:16:53 2019] ? atomic_t_wait+0x60/0x60
[Tue Sep 17 00:16:53 2019] __inode_wait_for_writeback+0xb9/0xf0
[Tue Sep 17 00:16:53 2019] ? bit_waitqueue+0x40/0x40
[Tue Sep 17 00:16:53 2019] inode_wait_for_writeback+0x26/0x40
[Tue Sep 17 00:16:53 2019] evict+0xb5/0x1a0
[Tue Sep 17 00:16:53 2019] iput+0x156/0x220
[Tue Sep 17 00:16:53 2019] do_unlinkat+0x19c/0x320
[Tue Sep 17 00:16:53 2019] SyS_unlink+0x1f/0x30
[Tue Sep 17 00:16:53 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:16:53 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:16:53 2019] RIP: 0033:0x7f8cee52fd47
[Tue Sep 17 00:16:53 2019] RSP: 002b:00007ffdbed6f698 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[Tue Sep 17 00:16:53 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee52fd47
[Tue Sep 17 00:16:53 2019] RDX: 0000000000000000 RSI: 000055a273edc170 RDI: 000055a273efee00
[Tue Sep 17 00:16:53 2019] RBP: 00007ffdbed6f8e0 R08: 00007f8ceebbe110 R09: 0000000000000040
[Tue Sep 17 00:16:53 2019] R10: 00007f8cee809ca0 R11: 0000000000000206 R12: 000055a273eeeeb0
[Tue Sep 17 00:16:53 2019] R13: 000055a273efc070 R14: 000055a273eee4a0 R15: 0000000000000001
[Tue Sep 17 00:16:53 2019] INFO: task burp:31242 blocked for more than 120 seconds.
[Tue Sep 17 00:16:53 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:16:53 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:16:53 2019] burp D 0 31242 6260 0x00000000
[Tue Sep 17 00:16:53 2019] Call Trace:
[Tue Sep 17 00:16:53 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:16:53 2019] schedule+0x2c/0x80
[Tue Sep 17 00:16:53 2019] wb_wait_for_completion+0x64/0x90
[Tue Sep 17 00:16:53 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:16:53 2019] sync_inodes_sb+0xc7/0x2b0
[Tue Sep 17 00:16:53 2019] sync_inodes_one_sb+0x16/0x20
[Tue Sep 17 00:16:53 2019] iterate_supers+0xa2/0x110
[Tue Sep 17 00:16:53 2019] ? SyS_tee+0x3d0/0x3d0
[Tue Sep 17 00:16:53 2019] sys_sync+0x42/0xb0
[Tue Sep 17 00:16:53 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:16:53 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:16:53 2019] RIP: 0033:0x7f8cee535287
[Tue Sep 17 00:16:53 2019] RSP: 002b:00007ffdbed6f9c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[Tue Sep 17 00:16:53 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee535287
[Tue Sep 17 00:16:53 2019] RDX: 000055a273ef2720 RSI: 0000000000000000 RDI: 000055a273ef5e50
[Tue Sep 17 00:16:53 2019] RBP: 00007ffdbed6fa50 R08: 000055a271edf910 R09: 0000000000000000
[Tue Sep 17 00:16:53 2019] R10: 00007f8cee5bccc0 R11: 0000000000000246 R12: 0000000000000000
[Tue Sep 17 00:16:53 2019] R13: 00007ffdbed6fa50 R14: 00007ffdbed6faa8 R15: 000055a273ed8220
[Tue Sep 17 00:16:53 2019] INFO: task burp:31256 blocked for more than 120 seconds.
[Tue Sep 17 00:16:53 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:16:53 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:16:53 2019] burp D 0 31256 6260 0x00000000
[Tue Sep 17 00:16:53 2019] Call Trace:
[Tue Sep 17 00:16:53 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:16:53 2019] schedule+0x2c/0x80
[Tue Sep 17 00:16:53 2019] wb_wait_for_completion+0x64/0x90
[Tue Sep 17 00:16:53 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:16:53 2019] sync_inodes_sb+0xc7/0x2b0
[Tue Sep 17 00:16:53 2019] sync_inodes_one_sb+0x16/0x20
[Tue Sep 17 00:16:53 2019] iterate_supers+0xa2/0x110
[Tue Sep 17 00:16:53 2019] ? SyS_tee+0x3d0/0x3d0
[Tue Sep 17 00:16:53 2019] sys_sync+0x42/0xb0
[Tue Sep 17 00:16:53 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:16:53 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:16:53 2019] RIP: 0033:0x7f8cee535287
[Tue Sep 17 00:16:53 2019] RSP: 002b:00007ffdbed6fae8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[Tue Sep 17 00:16:53 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee535287
[Tue Sep 17 00:16:53 2019] RDX: 000055a273eaa010 RSI: 000055a273eaa010 RDI: 000055a273ef6060
[Tue Sep 17 00:16:53 2019] RBP: 00007ffdbed6fb48 R08: 00007f8cf015ac00 R09: 000055a271edf774
[Tue Sep 17 00:16:53 2019] R10: 000055a273ef12a0 R11: 0000000000000246 R12: 0000000000000000
[Tue Sep 17 00:16:53 2019] R13: 000055a273edab60 R14: 000055a273efc0a0 R15: 0000000000000000
[Tue Sep 17 00:16:53 2019] INFO: task burp:3039 blocked for more than 120 seconds.
[Tue Sep 17 00:16:53 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:16:53 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:16:53 2019] burp D 0 3039 6260 0x00000000
[Tue Sep 17 00:16:53 2019] Call Trace:
[Tue Sep 17 00:16:53 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:16:53 2019] ? _cond_resched+0x19/0x40
[Tue Sep 17 00:16:53 2019] schedule+0x2c/0x80
[Tue Sep 17 00:16:53 2019] cv_wait_common+0x11e/0x140 [spl]
[Tue Sep 17 00:16:53 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:16:53 2019] __cv_wait+0x15/0x20 [spl]
[Tue Sep 17 00:16:53 2019] zil_commit.part.14+0x61/0x100 [zfs]
[Tue Sep 17 00:16:53 2019] zil_commit+0x17/0x20 [zfs]
[Tue Sep 17 00:16:53 2019] zfs_fsync+0x77/0xe0 [zfs]
[Tue Sep 17 00:16:53 2019] zpl_fsync+0x68/0xa0 [zfs]
[Tue Sep 17 00:16:53 2019] vfs_fsync_range+0x51/0xb0
[Tue Sep 17 00:16:53 2019] do_fsync+0x3d/0x70
[Tue Sep 17 00:16:53 2019] SyS_fsync+0x10/0x20
[Tue Sep 17 00:16:53 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:16:53 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:16:53 2019] RIP: 0033:0x7f8cee535214
[Tue Sep 17 00:16:53 2019] RSP: 002b:00007ffdbed6fc98 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[Tue Sep 17 00:16:53 2019] RAX: ffffffffffffffda RBX: 000055a273eeea10 RCX: 00007f8cee535214
[Tue Sep 17 00:16:53 2019] RDX: 00007ffdbed71f36 RSI: 000055a271c8bc8d RDI: 0000000000000003
[Tue Sep 17 00:16:53 2019] RBP: 000055a273edb790 R08: 00007ffdbed71f36 R09: 0000000000000004
[Tue Sep 17 00:16:53 2019] R10: 00000000fffffffc R11: 0000000000000246 R12: 000055a273efc0a0
[Tue Sep 17 00:16:53 2019] R13: 000055a273edc8f0 R14: 00007ffdbed7004c R15: 000055a273ed8e80
[Tue Sep 17 00:18:54 2019] INFO: task burp:11641 blocked for more than 120 seconds.
[Tue Sep 17 00:18:54 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:18:54 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:18:54 2019] burp D 0 11641 6260 0x00000000
[Tue Sep 17 00:18:54 2019] Call Trace:
[Tue Sep 17 00:18:54 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:18:54 2019] ? atomic_t_wait+0x60/0x60
[Tue Sep 17 00:18:54 2019] schedule+0x2c/0x80
[Tue Sep 17 00:18:54 2019] bit_wait+0x11/0x60
[Tue Sep 17 00:18:54 2019] __wait_on_bit+0x4c/0x90
[Tue Sep 17 00:18:54 2019] ? atomic_t_wait+0x60/0x60
[Tue Sep 17 00:18:54 2019] __inode_wait_for_writeback+0xb9/0xf0
[Tue Sep 17 00:18:54 2019] ? bit_waitqueue+0x40/0x40
[Tue Sep 17 00:18:54 2019] inode_wait_for_writeback+0x26/0x40
[Tue Sep 17 00:18:54 2019] evict+0xb5/0x1a0
[Tue Sep 17 00:18:54 2019] iput+0x156/0x220
[Tue Sep 17 00:18:54 2019] do_unlinkat+0x19c/0x320
[Tue Sep 17 00:18:54 2019] SyS_unlink+0x1f/0x30
[Tue Sep 17 00:18:54 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:18:54 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:18:54 2019] RIP: 0033:0x7f8cee52fd47
[Tue Sep 17 00:18:54 2019] RSP: 002b:00007ffdbed6f698 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[Tue Sep 17 00:18:54 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee52fd47
[Tue Sep 17 00:18:54 2019] RDX: 0000000000000000 RSI: 000055a273edc170 RDI: 000055a273efee00
[Tue Sep 17 00:18:54 2019] RBP: 00007ffdbed6f8e0 R08: 00007f8ceebbe110 R09: 0000000000000040
[Tue Sep 17 00:18:54 2019] R10: 00007f8cee809ca0 R11: 0000000000000206 R12: 000055a273eeeeb0
[Tue Sep 17 00:18:54 2019] R13: 000055a273efc070 R14: 000055a273eee4a0 R15: 0000000000000001
[Tue Sep 17 00:18:54 2019] INFO: task burp:31242 blocked for more than 120 seconds.
[Tue Sep 17 00:18:54 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:18:54 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:18:54 2019] burp D 0 31242 6260 0x00000000
[Tue Sep 17 00:18:54 2019] Call Trace:
[Tue Sep 17 00:18:54 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:18:54 2019] schedule+0x2c/0x80
[Tue Sep 17 00:18:54 2019] wb_wait_for_completion+0x64/0x90
[Tue Sep 17 00:18:54 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:18:54 2019] sync_inodes_sb+0xc7/0x2b0
[Tue Sep 17 00:18:54 2019] sync_inodes_one_sb+0x16/0x20
[Tue Sep 17 00:18:54 2019] iterate_supers+0xa2/0x110
[Tue Sep 17 00:18:54 2019] ? SyS_tee+0x3d0/0x3d0
[Tue Sep 17 00:18:54 2019] sys_sync+0x42/0xb0
[Tue Sep 17 00:18:54 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:18:54 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:18:54 2019] RIP: 0033:0x7f8cee535287
[Tue Sep 17 00:18:54 2019] RSP: 002b:00007ffdbed6f9c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[Tue Sep 17 00:18:54 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee535287
[Tue Sep 17 00:18:54 2019] RDX: 000055a273ef2720 RSI: 0000000000000000 RDI: 000055a273ef5e50
[Tue Sep 17 00:18:54 2019] RBP: 00007ffdbed6fa50 R08: 000055a271edf910 R09: 0000000000000000
[Tue Sep 17 00:18:54 2019] R10: 00007f8cee5bccc0 R11: 0000000000000246 R12: 0000000000000000
[Tue Sep 17 00:18:54 2019] R13: 00007ffdbed6fa50 R14: 00007ffdbed6faa8 R15: 000055a273ed8220
[Tue Sep 17 00:18:54 2019] INFO: task burp:31256 blocked for more than 120 seconds.
[Tue Sep 17 00:18:54 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:18:54 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:18:54 2019] burp D 0 31256 6260 0x00000000
[Tue Sep 17 00:18:54 2019] Call Trace:
[Tue Sep 17 00:18:54 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:18:54 2019] schedule+0x2c/0x80
[Tue Sep 17 00:18:54 2019] wb_wait_for_completion+0x64/0x90
[Tue Sep 17 00:18:54 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:18:54 2019] sync_inodes_sb+0xc7/0x2b0
[Tue Sep 17 00:18:54 2019] sync_inodes_one_sb+0x16/0x20
[Tue Sep 17 00:18:54 2019] iterate_supers+0xa2/0x110
[Tue Sep 17 00:18:54 2019] ? SyS_tee+0x3d0/0x3d0
[Tue Sep 17 00:18:54 2019] sys_sync+0x42/0xb0
[Tue Sep 17 00:18:54 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:18:54 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:18:54 2019] RIP: 0033:0x7f8cee535287
[Tue Sep 17 00:18:54 2019] RSP: 002b:00007ffdbed6fae8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[Tue Sep 17 00:18:54 2019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8cee535287
[Tue Sep 17 00:18:54 2019] RDX: 000055a273eaa010 RSI: 000055a273eaa010 RDI: 000055a273ef6060
[Tue Sep 17 00:18:54 2019] RBP: 00007ffdbed6fb48 R08: 00007f8cf015ac00 R09: 000055a271edf774
[Tue Sep 17 00:18:54 2019] R10: 000055a273ef12a0 R11: 0000000000000246 R12: 0000000000000000
[Tue Sep 17 00:18:54 2019] R13: 000055a273edab60 R14: 000055a273efc0a0 R15: 0000000000000000
[Tue Sep 17 00:18:54 2019] INFO: task burp:3039 blocked for more than 120 seconds.
[Tue Sep 17 00:18:54 2019] Tainted: P O 4.15.0-48-generic #51-Ubuntu
[Tue Sep 17 00:18:54 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Sep 17 00:18:54 2019] burp D 0 3039 6260 0x00000000
[Tue Sep 17 00:18:54 2019] Call Trace:
[Tue Sep 17 00:18:54 2019] __schedule+0x291/0x8a0
[Tue Sep 17 00:18:54 2019] ? _cond_resched+0x19/0x40
[Tue Sep 17 00:18:54 2019] schedule+0x2c/0x80
[Tue Sep 17 00:18:54 2019] cv_wait_common+0x11e/0x140 [spl]
[Tue Sep 17 00:18:54 2019] ? wait_woken+0x80/0x80
[Tue Sep 17 00:18:54 2019] __cv_wait+0x15/0x20 [spl]
[Tue Sep 17 00:18:54 2019] zil_commit.part.14+0x61/0x100 [zfs]
[Tue Sep 17 00:18:54 2019] zil_commit+0x17/0x20 [zfs]
[Tue Sep 17 00:18:54 2019] zfs_fsync+0x77/0xe0 [zfs]
[Tue Sep 17 00:18:54 2019] zpl_fsync+0x68/0xa0 [zfs]
[Tue Sep 17 00:18:54 2019] vfs_fsync_range+0x51/0xb0
[Tue Sep 17 00:18:54 2019] do_fsync+0x3d/0x70
[Tue Sep 17 00:18:54 2019] SyS_fsync+0x10/0x20
[Tue Sep 17 00:18:54 2019] do_syscall_64+0x73/0x130
[Tue Sep 17 00:18:54 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Sep 17 00:18:54 2019] RIP: 0033:0x7f8cee535214
[Tue Sep 17 00:18:54 2019] RSP: 002b:00007ffdbed6fc98 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[Tue Sep 17 00:18:54 2019] RAX: ffffffffffffffda RBX: 000055a273eeea10 RCX: 00007f8cee535214
[Tue Sep 17 00:18:54 2019] RDX: 00007ffdbed71f36 RSI: 000055a271c8bc8d RDI: 0000000000000003
[Tue Sep 17 00:18:54 2019] RBP: 000055a273edb790 R08: 00007ffdbed71f36 R09: 0000000000000004
[Tue Sep 17 00:18:54 2019] R10: 00000000fffffffc R11: 0000000000000246 R12: 000055a273efc0a0
[Tue Sep 17 00:18:54 2019] R13: 000055a273edc8f0 R14: 00007ffdbed7004c R15: 000055a273ed8e80

Knock on wood, we have not had a hang or corruption issue since moving our
backup servers to Ubuntu Eoan.

I've just had this with 0.8.2 on a 5.2.16 kernel. I was running a world update on Gentoo when the process stopped doing anything and kworker/u256:3+flush-zfs-1 sat using 100% of a core. ranlib and emerge were the two deadlocked processes being reported in dmesg. It's an all-ZFS system except for the FAT32 UEFI partition: rpool (where the hang happened) is a mirror on NVMe, and misc is a raidz of 3 spinning-rust disks with an SSD as L2ARC/ZIL.

It seems to me that this is somehow related to swap. It happens when swap becomes heavily used but not all disks have the same IO speed.

I don't have swap enabled.

I would suggest that you provide scripts to use in these cases, so that it is possible to gather all the information you need.
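A minimal sketch of such a script, collecting the pieces most often requested in this thread (all file and directory names here are illustrative, and reading /proc/&lt;pid&gt;/stack requires root):

```shell
#!/bin/sh
# Gather basic hang diagnostics into one directory. Commands that need
# root or an installed ZFS are allowed to fail without aborting.
OUT="zfs-hang-report.$$"
mkdir -p "$OUT"

uname -a > "$OUT/system.txt"
cat /sys/module/zfs/version >> "$OUT/system.txt" 2>/dev/null

# Kernel stacks of all D-state tasks (root required to read them).
ps -eo state,pid,comm | awk '$1 ~ /^D/ {print $2}' |
while read -r pid; do
    echo "== pid $pid ==" >> "$OUT/stacks.txt"
    cat "/proc/$pid/stack" >> "$OUT/stacks.txt" 2>/dev/null
done

zpool status > "$OUT/zpool-status.txt" 2>&1 || true
echo "collected in $OUT"
```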

Does anyone suffering this issue make use of FUSE or AppImages mounted via FUSE? In #9430 we've found two of us have that in common.

Additionally, what processors are you all using? In #9430 we also found that we were both using AMD processors.

I'm not using FUSE or AppImages (unless Ubuntu server uses FUSE by default for something). I've seen this on EPYC 7601 FWIW.

Does anyone suffering this issue make use of FUSE or AppImages mounted via FUSE? In #9430 we've found two of us have that in common.

No to both.

Additionally, what processors are you all using? In #9430 we also found that we were both using AMD processors.

System has two processors, both are Intel Xeon E5-2660 @ 2.20GHz.

@Ninpo

I don't have swap enabled.

OK, but maybe you (like me) use a pool with mixed HDDs where not all disks have the same IO speed? I use one slow "green" WD drive and two other, faster HDDs.

If I switch the slow partition to "offline", these hangs become less frequent.

At least my HDDs are all the same, and some people have experienced this with all-NVMe setups as well.

On one of my pools the SSDs used for ZIL/metadata caching are different drives, but the sync problem happens on my main rpool zpool, which has identical NVMe drives in it.

We are using a Dell R720 with 2x Xeon E5-2670 and 512GB RAM. The HBA is an LSI 9207 with 4 KINGSTON SEDC500M960G drives in a striped mirror configuration. No L2ARC, no SLOG, and no swap either. The problem happens when rolling back LXC containers hosting Postgres DBs, which use their own dataset for their disk.

zfs send is also used often on the ZFS dataset being rolled back. Could it be a race condition?

ZFS 0.8.2

@kpande It's the same issue as in this thread. I posted my specs to maybe help pinpoint some similarities in the hardware between everyone experiencing this issue.

I had the same issue, but after long debugging, when I disabled the cache and log devices and set sync to standard, the problem went away. (Cache and log were both SSD drives.)

I moved swap to another SSD drive and the problem went away.
(I know it is not recommended to store swap on ZFS for now.)

Once again in production; the situation is truly critical. ZFS version 0.8.1-1ubuntu14.4
Linux zfs-sch 5.3.0-62-generic #56~18.04.1-Ubuntu SMP

sync stack:
[<0>] dmu_tx_delay.isra.8+0x138/0x140 [zfs]
[<0>] dmu_tx_wait+0xb1/0x210 [zfs]
[<0>] dmu_tx_assign+0x49/0x70 [zfs]
[<0>] dmu_sync_late_arrival+0x53/0x160 [zfs]
[<0>] dmu_sync+0x1de/0x4e0 [zfs]
[<0>] zfs_get_data+0x1a8/0x350 [zfs]
[<0>] zil_lwb_commit+0x1cd/0x360 [zfs]
[<0>] zil_process_commit_list+0xea/0x1f0 [zfs]
[<0>] zil_commit_writer.isra.14+0xa1/0xb0 [zfs]
[<0>] zil_commit_impl+0x57/0x90 [zfs]
[<0>] zil_commit+0x3d/0x60 [zfs]
[<0>] zfs_sync+0x6d/0xd0 [zfs]
[<0>] zpl_sync_fs+0x46/0x70 [zfs]
[<0>] sync_fs_one_sb+0x23/0x30
[<0>] iterate_supers+0xa2/0x110
[<0>] ksys_sync+0x62/0xb0
[<0>] __ia32_sys_sync+0xe/0x20
[<0>] do_syscall_64+0x5a/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Processes in state D:
29378 931 nobody D 4.8 smbd dmu_tx_delay.isra.8
2 3051 root D 1.0 txg_sync cv_timedwait_common
29378 4574 s.paoli+ D 1.3 smbd dmu_tx_delay.isra.8
29378 5462 nobody D 4.1 smbd cv_timedwait_common
29378 6407 nobody D 1.2 smbd dmu_tx_delay.isra.8
29378 9449 nobody D 2.6 smbd dmu_tx_delay.isra.8
29378 11204 philips D 3.7 smbd dmu_tx_delay.isra.8
29378 12025 nobody D 2.2 smbd dmu_tx_delay.isra.8
1 17148 nobody D 4.6 smbd dmu_tx_delay.isra.8
17769 23245 root D+ 3.3 sync dmu_tx_delay.isra.8

dmesg:

[ 243.613937] INFO: task zpool:6002 blocked for more than 120 seconds.
[ 243.613980] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 243.614028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.614079] zpool D 0 6002 1 0x00004002
[ 243.614081] Call Trace:
[ 243.614085] __schedule+0x2a8/0x670
[ 243.614089] ? autoremove_wake_function+0x12/0x40
[ 243.614092] schedule+0x33/0xa0
[ 243.614094] io_schedule+0x16/0x40
[ 243.614102] cv_wait_common+0xdb/0x160 [spl]
[ 243.614105] ? wait_woken+0x80/0x80
[ 243.614112] __cv_wait_io+0x18/0x20 [spl]
[ 243.614223] txg_wait_synced+0x88/0xd0 [zfs]
[ 243.614332] spa_config_update+0x117/0x160 [zfs]
[ 243.614440] spa_import+0x645/0x7b0 [zfs]
[ 243.614553] zfs_ioc_pool_import+0x13c/0x150 [zfs]
[ 243.614667] zfsdev_ioctl+0x584/0x620 [zfs]
[ 243.614671] do_vfs_ioctl+0xa9/0x640
[ 243.614677] ? handle_mm_fault+0xcb/0x210
[ 243.614680] ksys_ioctl+0x75/0x80
[ 243.614683] __x64_sys_ioctl+0x1a/0x20
[ 243.614688] do_syscall_64+0x5a/0x130
[ 243.614692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 243.614694] RIP: 0033:0x7f19f61826d7
[ 243.614702] Code: Bad RIP value.
[ 243.614704] RSP: 002b:00007fff9cf6bec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 243.614707] RAX: ffffffffffffffda RBX: 00007fff9cf6f500 RCX: 00007f19f61826d7
[ 243.614708] RDX: 00007fff9cf6f500 RSI: 0000000000005a02 RDI: 0000000000000003
[ 243.614709] RBP: 0000000000005a02 R08: 0000000000005a02 R09: 0000000000000000
[ 243.614711] R10: 0000000000000022 R11: 0000000000000246 R12: 00005616bdb13660
[ 243.614712] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000008
[ 364.445890] INFO: task l2arc_feed:2110 blocked for more than 241 seconds.
[ 364.445947] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 364.445996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 364.446048] l2arc_feed D 0 2110 2 0x80004000
[ 364.446051] Call Trace:
[ 364.446061] __schedule+0x2a8/0x670
[ 364.446067] schedule+0x33/0xa0
[ 364.446069] schedule_preempt_disabled+0xe/0x10
[ 364.446073] __mutex_lock.isra.9+0x26d/0x4e0
[ 364.446187] ? l2arc_evict+0x290/0x290 [zfs]
[ 364.446191] __mutex_lock_slowpath+0x13/0x20
[ 364.446193] ? __mutex_lock_slowpath+0x13/0x20
[ 364.446196] mutex_lock+0x2f/0x40
[ 364.446258] l2arc_dev_get_next+0x1a/0x1e0 [zfs]
[ 364.446320] ? l2arc_evict+0x290/0x290 [zfs]
[ 364.446382] l2arc_feed_thread+0xd6/0x250 [zfs]
[ 364.446443] ? l2arc_evict+0x290/0x290 [zfs]
[ 364.446458] thread_generic_wrapper+0x74/0x90 [spl]
[ 364.446464] kthread+0x121/0x140
[ 364.446474] ? __thread_exit+0x20/0x20 [spl]
[ 364.446476] ? kthread_park+0xb0/0xb0
[ 364.446481] ret_from_fork+0x35/0x40
[ 606.109902] INFO: task spl_system_task:1487 blocked for more than 120 seconds.
[ 606.109959] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 606.110008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 606.110059] spl_system_task D 0 1487 2 0x80004000
[ 606.110063] Call Trace:
[ 606.110072] __schedule+0x2a8/0x670
[ 606.110077] schedule+0x33/0xa0
[ 606.110080] schedule_preempt_disabled+0xe/0x10
[ 606.110083] __mutex_lock.isra.9+0x26d/0x4e0
[ 606.110090] ? update_load_avg+0x17d/0x680
[ 606.110092] ? update_load_avg+0x17d/0x680
[ 606.110095] __mutex_lock_slowpath+0x13/0x20
[ 606.110098] ? __mutex_lock_slowpath+0x13/0x20
[ 606.110100] mutex_lock+0x2f/0x40
[ 606.110256] spa_open_common+0x68/0x500 [zfs]
[ 606.110261] ? __switch_to_asm+0x40/0x70
[ 606.110264] ? __switch_to_asm+0x34/0x70
[ 606.110267] ? __switch_to_asm+0x40/0x70
[ 606.110269] ? __switch_to_asm+0x34/0x70
[ 606.110382] spa_open+0x13/0x20 [zfs]
[ 606.110481] dsl_pool_hold+0x33/0x80 [zfs]
[ 606.110562] dmu_objset_own+0x46/0x190 [zfs]
[ 606.110678] zvol_prefetch_minors_impl+0x4b/0xa0 [zfs]
[ 606.110694] taskq_thread+0x210/0x400 [spl]
[ 606.110698] ? __switch_to_asm+0x40/0x70
[ 606.110704] ? wake_up_q+0x80/0x80
[ 606.110710] kthread+0x121/0x140
[ 606.110720] ? task_done+0x90/0x90 [spl]
[ 606.110722] ? kthread_park+0xb0/0xb0
[ 606.110726] ret_from_fork+0x35/0x40
[ 606.110733] INFO: task l2arc_feed:2110 blocked for more than 120 seconds.
[ 606.110779] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 606.110828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 606.110878] l2arc_feed D 0 2110 2 0x80004000
[ 606.110881] Call Trace:
[ 606.110884] __schedule+0x2a8/0x670
[ 606.110888] schedule+0x33/0xa0
[ 606.110890] schedule_preempt_disabled+0xe/0x10
[ 606.110893] __mutex_lock.isra.9+0x26d/0x4e0
[ 606.110896] __mutex_lock_slowpath+0x13/0x20
[ 606.110899] ? __mutex_lock_slowpath+0x13/0x20
[ 606.110901] mutex_lock+0x2f/0x40
[ 606.110964] l2arc_dev_get_next+0x1a/0x1e0 [zfs]
[ 606.111026] l2arc_feed_thread+0xd6/0x250 [zfs]
[ 606.111088] ? l2arc_evict+0x290/0x290 [zfs]
[ 606.111100] thread_generic_wrapper+0x74/0x90 [spl]
[ 606.111103] kthread+0x121/0x140
[ 606.111114] ? __thread_exit+0x20/0x20 [spl]
[ 606.111117] ? kthread_park+0xb0/0xb0
[ 606.111121] ret_from_fork+0x35/0x40
[ 606.111129] INFO: task zpool:6002 blocked for more than 120 seconds.
[ 606.111174] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 606.111223] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 606.111275] zpool D 0 6002 1 0x00000000
[ 606.111278] Call Trace:
[ 606.111281] __schedule+0x2a8/0x670
[ 606.111285] schedule+0x33/0xa0
[ 606.111287] schedule_preempt_disabled+0xe/0x10
[ 606.111290] __mutex_lock.isra.9+0x26d/0x4e0
[ 606.111294] __mutex_lock_slowpath+0x13/0x20
[ 606.111296] ? __mutex_lock_slowpath+0x13/0x20
[ 606.111298] mutex_lock+0x2f/0x40
[ 606.111413] spa_open_common+0x68/0x500 [zfs]
[ 606.111527] spa_get_stats+0x4d/0x300 [zfs]
[ 606.111530] ? _cond_resched+0x19/0x40
[ 606.111651] zfs_ioc_pool_stats+0x39/0x90 [zfs]
[ 606.111773] zfsdev_ioctl+0x584/0x620 [zfs]
[ 606.111777] do_vfs_ioctl+0xa9/0x640
[ 606.111784] ? handle_mm_fault+0xcb/0x210
[ 606.111786] ksys_ioctl+0x75/0x80
[ 606.111789] __x64_sys_ioctl+0x1a/0x20
[ 606.111795] do_syscall_64+0x5a/0x130
[ 606.111799] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 606.111802] RIP: 0033:0x7f19f61826d7
[ 606.111809] Code: Bad RIP value.
[ 606.111810] RSP: 002b:00007fff9cf6be48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 606.111813] RAX: ffffffffffffffda RBX: 00007fff9cf6be70 RCX: 00007f19f61826d7
[ 606.111815] RDX: 00007fff9cf6be70 RSI: 0000000000005a05 RDI: 0000000000000003
[ 606.111816] RBP: 00005616bdbb95b0 R08: 00005616bdbb9700 R09: 0000000000000000
[ 606.111817] R10: 00005616bdb12010 R11: 0000000000000246 R12: 00005616bdb13660
[ 606.111819] R13: 0000000000000000 R14: 00007fff9cf6f464 R15: 0000000000000008
[ 606.111832] INFO: task z_zvol:7381 blocked for more than 120 seconds.
[ 606.111876] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 606.111924] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 606.111975] z_zvol D 0 7381 2 0x80004000
[ 606.111977] Call Trace:
[ 606.111981] __schedule+0x2a8/0x670
[ 606.111984] schedule+0x33/0xa0
[ 606.111986] schedule_preempt_disabled+0xe/0x10
[ 606.111989] __mutex_lock.isra.9+0x26d/0x4e0
[ 606.112057] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[ 606.112124] ? dbuf_rele_and_unlock+0x17d/0x4f0 [zfs]
[ 606.112132] ? cv_destroy_wakeup+0x25/0x40 [spl]
[ 606.112136] __mutex_lock_slowpath+0x13/0x20
[ 606.112138] ? __mutex_lock_slowpath+0x13/0x20
[ 606.112140] mutex_lock+0x2f/0x40
[ 606.112252] spa_open_common+0x68/0x500 [zfs]
[ 606.112364] spa_open+0x13/0x20 [zfs]
[ 606.112459] dsl_pool_hold+0x33/0x80 [zfs]
[ 606.112538] dmu_objset_hold_flags+0x3b/0xc0 [zfs]
[ 606.112616] dmu_objset_hold+0x16/0x20 [zfs]
[ 606.112713] dsl_prop_get+0x44/0xb0 [zfs]
[ 606.112831] ? list_insert_tail+0x20/0x20 [zfs]
[ 606.112927] dsl_prop_get_integer+0x1e/0x20 [zfs]
[ 606.113044] zvol_create_minors_cb+0x3b/0x100 [zfs]
[ 606.113152] ? rrw_exit+0x61/0xb0 [zfs]
[ 606.113229] dmu_objset_find_impl+0x107/0x410 [zfs]
[ 606.113349] ? list_insert_tail+0x20/0x20 [zfs]
[ 606.113426] dmu_objset_find_impl+0x211/0x410 [zfs]
[ 606.113545] ? list_insert_tail+0x20/0x20 [zfs]
[ 606.113622] dmu_objset_find_impl+0x211/0x410 [zfs]
[ 606.113740] ? list_insert_tail+0x20/0x20 [zfs]
[ 606.113818] dmu_objset_find+0x58/0x90 [zfs]
[ 606.113955] zvol_create_minors_impl+0x193/0x1b0 [zfs]
[ 606.114079] zvol_task_cb+0xc1/0xf0 [zfs]
[ 606.114099] taskq_thread+0x210/0x400 [spl]
[ 606.114108] ? __switch_to_asm+0x40/0x70
[ 606.114117] ? wake_up_q+0x80/0x80
[ 606.114125] kthread+0x121/0x140
[ 606.114141] ? task_done+0x90/0x90 [spl]
[ 606.114147] ? kthread_park+0xb0/0xb0
[ 606.114155] ret_from_fork+0x35/0x40
[ 606.114214] INFO: task spa_async:12502 blocked for more than 120 seconds.
[ 606.114263] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 606.114313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 606.114365] spa_async D 0 12502 2 0x80004000
[ 606.114368] Call Trace:
[ 606.114372] __schedule+0x2a8/0x670
[ 606.114375] schedule+0x33/0xa0
[ 606.114384] cv_wait_common+0x14c/0x160 [spl]
[ 606.114388] ? wait_woken+0x80/0x80
[ 606.114397] __cv_wait+0x15/0x20 [spl]
[ 606.114515] spa_config_enter+0x7c/0x110 [zfs]
[ 606.114633] spa_config_update+0x34/0x160 [zfs]
[ 606.114752] spa_async_thread+0x453/0x530 [zfs]
[ 606.114873] ? spa_async_autoexpand+0x90/0x90 [zfs]
[ 606.114890] thread_generic_wrapper+0x74/0x90 [spl]
[ 606.114898] kthread+0x121/0x140
[ 606.114914] ? __thread_exit+0x20/0x20 [spl]
[ 606.114917] ? kthread_park+0xb0/0xb0
[ 606.114921] ret_from_fork+0x35/0x40
[ 726.942290] INFO: task spl_system_task:1487 blocked for more than 241 seconds.
[ 726.942349] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 726.942398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.942450] spl_system_task D 0 1487 2 0x80004000
[ 726.942454] Call Trace:
[ 726.942464] __schedule+0x2a8/0x670
[ 726.942469] schedule+0x33/0xa0
[ 726.942472] schedule_preempt_disabled+0xe/0x10
[ 726.942475] __mutex_lock.isra.9+0x26d/0x4e0
[ 726.942481] ? update_load_avg+0x17d/0x680
[ 726.942484] ? update_load_avg+0x17d/0x680
[ 726.942487] __mutex_lock_slowpath+0x13/0x20
[ 726.942489] ? __mutex_lock_slowpath+0x13/0x20
[ 726.942491] mutex_lock+0x2f/0x40
[ 726.942647] spa_open_common+0x68/0x500 [zfs]
[ 726.942652] ? __switch_to_asm+0x40/0x70
[ 726.942655] ? __switch_to_asm+0x34/0x70
[ 726.942657] ? __switch_to_asm+0x40/0x70
[ 726.942660] ? __switch_to_asm+0x34/0x70
[ 726.942778] spa_open+0x13/0x20 [zfs]
[ 726.942904] dsl_pool_hold+0x33/0x80 [zfs]
[ 726.942994] dmu_objset_own+0x46/0x190 [zfs]
[ 726.943104] zvol_prefetch_minors_impl+0x4b/0xa0 [zfs]
[ 726.943122] taskq_thread+0x210/0x400 [spl]
[ 726.943126] ? __switch_to_asm+0x40/0x70
[ 726.943131] ? wake_up_q+0x80/0x80
[ 726.943136] kthread+0x121/0x140
[ 726.943147] ? task_done+0x90/0x90 [spl]
[ 726.943150] ? kthread_park+0xb0/0xb0
[ 726.943153] ret_from_fork+0x35/0x40
[ 726.943160] INFO: task l2arc_feed:2110 blocked for more than 241 seconds.
[ 726.943209] Tainted: P O 5.3.0-62-generic #56~18.04.1-Ubuntu
[ 726.943259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.943311] l2arc_feed D 0 2110 2 0x80004000
[ 726.943314] Call Trace:
[ 726.943317] __schedule+0x2a8/0x670
[ 726.943321] schedule+0x33/0xa0
[ 726.943323] schedule_preempt_disabled+0xe/0x10
[ 726.943326] __mutex_lock.isra.9+0x26d/0x4e0
[ 726.943330] __mutex_lock_slowpath+0x13/0x20
[ 726.943332] ? __mutex_lock_slowpath+0x13/0x20
[ 726.943335] mutex_lock+0x2f/0x40
[ 726.943391] l2arc_dev_get_next+0x1a/0x1e0 [zfs]
[ 726.943449] l2arc_feed_thread+0xd6/0x250 [zfs]
[ 726.943506] ? l2arc_evict+0x290/0x290 [zfs]
[ 726.943519] thread_generic_wrapper+0x74/0x90 [spl]
[ 726.943522] kthread+0x121/0x140
[ 726.943534] ? __thread_exit+0x20/0x20 [spl]
[ 726.943536] ? kthread_park+0xb0/0xb0
[ 726.943540] ret_from_fork+0x35/0x40

update :+1:
[ 6928.956809] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
