ZFS deadlocks on disk failure

Created on 24 Mar 2013 · 14 comments · Source: openzfs/zfs

While I think this may be a bug in mpt_sas rather than in ZFSOnLinux/ZFS, I'll post it here since it's at least a documentation issue on how to deal with this type of situation. Has anyone else experienced similar issues?

I experienced a disk failure that caused a ZFS pool to deadlock completely. The pool is configured as 2-disk mirrors. I attempted to take the failing disk out of service with zpool offline, but to no avail: the zpool offline command itself appeared to deadlock (or at least sat in D state for 15+ minutes).
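For reference, the offline attempt was roughly the following (pool and device names here are illustrative, not the exact ones from my system):

zpool offline tank sdo    # take the failing mirror member out of service
zpool status tank         # run from another shell; this also ended up stuck in D state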

Excerpt from the kernel log:

[2153265.654808] Read(10): 28 00 5c 2d 8e 6e 00 01 00 00
[2153265.654813] scsi target1:0:13: handle(0x0017), sas_address(0x50030480017c6559), phy(25)
[2153265.654815] scsi target1:0:13: enclosure_logical_id(0x50030480017c657f), slot(13)
[2153265.654819] sd 1:0:13:0: task abort: SUCCESS scmd(ffff881002f05b00)
[2153265.654821] sd 1:0:13:0: attempting task abort! scmd(ffff8806c7048100)
[2153265.654822] sd 1:0:13:0: [sdo] CDB:
[2153265.654823] Read(10): 28 00 2d 93 25 ae 00 00 10 00
[2153265.654828] scsi target1:0:13: handle(0x0017), sas_address(0x50030480017c6559), phy(25)
[2153265.654829] scsi target1:0:13: enclosure_logical_id(0x50030480017c657f), slot(13)
[2153265.654834] sd 1:0:13:0: task abort: SUCCESS scmd(ffff8806c7048100)
[2153265.654835] sd 1:0:13:0: attempting task abort! scmd(ffff88070901ea00)
[2153265.654837] sd 1:0:13:0: [sdo] CDB:
[2153265.654837] Read(10): 28 00 12 ab ee 46 00 00 03 00
[2153265.654842] scsi target1:0:13: handle(0x0017), sas_address(0x50030480017c6559), phy(25)
[2153265.654843] scsi target1:0:13: enclosure_logical_id(0x50030480017c657f), slot(13)
[2153265.654848] sd 1:0:13:0: task abort: SUCCESS scmd(ffff88070901ea00)
[2153265.654850] sd 1:0:13:0: attempting task abort! scmd(ffff8809ed2d5f00)
[2153265.654851] sd 1:0:13:0: [sdo] CDB:
[2153265.654852] Read(10): 28 00 12 af f0 33 00 00 09 00
[2153265.654856] scsi target1:0:13: handle(0x0017), sas_address(0x50030480017c6559), phy(25)
[2153265.654858] scsi target1:0:13: enclosure_logical_id(0x50030480017c657f), slot(13)
[2153265.654863] sd 1:0:13:0: task abort: SUCCESS scmd(ffff8809ed2d5f00)
[2153265.654864] sd 1:0:13:0: attempting task abort! scmd(ffff880e7a624000)
[2153265.654865] sd 1:0:13:0: [sdo] CDB:

And some of the blocked task stacks:

[2154394.730691]   task                        PC stack   pid father
[2154394.730697] spl_system_task D ffff88107fc91500     0   157      2 0x00000000
[2154394.730700]  ffff88103720f080 0000000000000046 0000000000000004 ffff881038d55140
[2154394.730702]  0000000000011500 ffff8810372c5fd8 0000000000011500 ffff8810372c4010
[2154394.730704]  ffff8810372c5fd8 0000000000011500 ffff88103720f080 0000000000011500
[2154394.730706] Call Trace:
[2154394.730714]  [<ffffffff8112dd75>] ? cv_wait_common+0x115/0x1d0
[2154394.730717]  [<ffffffff810850d0>] ? wake_up_bit+0x40/0x40
[2154394.730722]  [<ffffffff811ec87d>] ? spa_config_enter+0x17d/0x1a0
[2154394.730725]  [<ffffffff81233e0e>] ? zio_vdev_io_start+0x22e/0x300
[2154394.730727]  [<ffffffff812338cd>] ? zio_nowait+0x9d/0x130
[2154394.730731]  [<ffffffff8119908f>] ? arc_read_nolock+0x4df/0x7f0
[2154394.730733]  [<ffffffff811996bc>] ? arc_read+0xbc/0x1c0
[2154394.730737]  [<ffffffff811af8d6>] ? traverse_prefetcher+0x106/0x170
[2154394.730739]  [<ffffffff811afc7e>] ? traverse_visitbp+0x33e/0x710
[2154394.730742]  [<ffffffff811af8d6>] ? traverse_prefetcher+0x106/0x170
[2154394.730745]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730747]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730750]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730752]  [<ffffffff811b044b>] ? traverse_dnode+0x7b/0x100
[2154394.730755]  [<ffffffff811afeb2>] ? traverse_visitbp+0x572/0x710
[2154394.730758]  [<ffffffff811af8d6>] ? traverse_prefetcher+0x106/0x170
[2154394.730760]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730763]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730765]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730768]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730771]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730773]  [<ffffffff811afd9e>] ? traverse_visitbp+0x45e/0x710
[2154394.730776]  [<ffffffff811b044b>] ? traverse_dnode+0x7b/0x100
[2154394.730778]  [<ffffffff811aff5a>] ? traverse_visitbp+0x61a/0x710
[2154394.730781]  [<ffffffff81096b90>] ? idle_balance+0x100/0x120
[2154394.730784]  [<ffffffff811b0558>] ? traverse_prefetch_thread+0x88/0xc0
[2154394.730786]  [<ffffffff811af7d0>] ? traverse_zil_block+0xa0/0xa0
[2154394.730789]  [<ffffffff81127696>] ? taskq_thread+0x216/0x4c0
[2154394.730792]  [<ffffffff810923f0>] ? try_to_wake_up+0x2a0/0x2a0
[2154394.730794]  [<ffffffff81127480>] ? task_expire+0x110/0x110
[2154394.730796]  [<ffffffff81127480>] ? task_expire+0x110/0x110
[2154394.730798]  [<ffffffff81084aee>] ? kthread+0xce/0xe0
[2154394.730800]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30
[2154394.730804]  [<ffffffff81a4c0ec>] ? ret_from_fork+0x7c/0xb0
[2154394.730806]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30

[2154394.730861] z_rd_iss/3      D ffff88107fc71500     0  4739      2 0x00000000
[2154394.730863]  ffff88101f2ea580 0000000000000046 0000000000000003 ffff881038d54b00
[2154394.730864]  0000000000011500 ffff8810079d9fd8 0000000000011500 ffff8810079d8010
[2154394.730866]  ffff8810079d9fd8 0000000000011500 ffff88101f2ea580 0000000000011500
[2154394.730868] Call Trace:
[2154394.730871]  [<ffffffff8112dd75>] ? cv_wait_common+0x115/0x1d0
[2154394.730873]  [<ffffffff810850d0>] ? wake_up_bit+0x40/0x40
[2154394.730875]  [<ffffffff811ec87d>] ? spa_config_enter+0x17d/0x1a0
[2154394.730877]  [<ffffffff81233e0e>] ? zio_vdev_io_start+0x22e/0x300
[2154394.730879]  [<ffffffff81233485>] ? zio_execute+0x95/0x100
[2154394.730882]  [<ffffffff81127696>] ? taskq_thread+0x216/0x4c0
[2154394.730884]  [<ffffffff810923f0>] ? try_to_wake_up+0x2a0/0x2a0
[2154394.730886]  [<ffffffff81127480>] ? task_expire+0x110/0x110
[2154394.730888]  [<ffffffff81127480>] ? task_expire+0x110/0x110
[2154394.730890]  [<ffffffff81084aee>] ? kthread+0xce/0xe0
[2154394.730892]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30
[2154394.730895]  [<ffffffff81a4c0ec>] ? ret_from_fork+0x7c/0xb0
[2154394.730897]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30

[2154394.730904] txg_sync        D ffff88107fc91500     0  6782      2 0x00000000
[2154394.730906]  ffff88100cc61900 0000000000000046 0000000000000000 ffff881038d55140
[2154394.730908]  0000000000011500 ffff880f534b5fd8 0000000000011500 ffff880f534b4010
[2154394.730910]  ffff880f534b5fd8 0000000000011500 ffff88100cc61900 0000000000011500
[2154394.730911] Call Trace:
[2154394.730914]  [<ffffffff81a4aca4>] ? io_schedule+0x84/0xd0
[2154394.730916]  [<ffffffff8112dd0e>] ? cv_wait_common+0xae/0x1d0
[2154394.730918]  [<ffffffff8108df33>] ? __wake_up+0x43/0x70
[2154394.730920]  [<ffffffff810850d0>] ? wake_up_bit+0x40/0x40
[2154394.730923]  [<ffffffff81127d8d>] ? taskq_dispatch_ent+0x6d/0x1c0
[2154394.730925]  [<ffffffff812335db>] ? zio_wait+0xeb/0x180
[2154394.730927]  [<ffffffff8119d674>] ? dbuf_read+0x444/0x750
[2154394.730930]  [<ffffffff811a4cea>] ? dmu_buf_hold+0x10a/0x1c0
[2154394.730934]  [<ffffffff81202d7d>] ? zap_lockdir+0x6d/0x770
[2154394.730937]  [<ffffffff8111350f>] ? kfree+0xf/0xb0
[2154394.730939]  [<ffffffff81122f41>] ? kmem_free_debug+0x41/0x140
[2154394.730942]  [<ffffffff81204bfc>] ? zap_lookup_norm+0x4c/0x1c0
[2154394.730944]  [<ffffffff81204dea>] ? zap_lookup+0x2a/0x30
[2154394.730947]  [<ffffffff811fd991>] ? zap_increment+0x81/0xf0
[2154394.730949]  [<ffffffff811fda64>] ? zap_increment_int+0x64/0xa0
[2154394.730952]  [<ffffffff811a7c94>] ? do_userquota_update+0x74/0x140
[2154394.730954]  [<ffffffff811a7f6e>] ? dmu_objset_do_userquota_updates+0x20e/0x280
[2154394.730957]  [<ffffffff8109a3ad>] ? ktime_get_ts+0x3d/0xe0
[2154394.730960]  [<ffffffff811c9154>] ? dsl_pool_sync+0x134/0x580
[2154394.730964]  [<ffffffff811dcfc6>] ? spa_sync+0x396/0x9e0
[2154394.730967]  [<ffffffff8112b61c>] ? __gethrtime+0xc/0x20
[2154394.730969]  [<ffffffff8109a3ad>] ? ktime_get_ts+0x3d/0xe0
[2154394.730971]  [<ffffffff811ef2be>] ? txg_sync_thread+0x30e/0x570
[2154394.730974]  [<ffffffff8108f7a8>] ? set_user_nice+0xe8/0x180
[2154394.730976]  [<ffffffff811eefb0>] ? txg_do_callbacks+0x50/0x50
[2154394.730978]  [<ffffffff811263a0>] ? __thread_create+0x360/0x360
[2154394.730981]  [<ffffffff81126415>] ? thread_generic_wrapper+0x75/0x90
[2154394.730983]  [<ffffffff81084aee>] ? kthread+0xce/0xe0
[2154394.730985]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30
[2154394.730987]  [<ffffffff81a4c0ec>] ? ret_from_fork+0x7c/0xb0
[2154394.730989]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30

[2154394.731204] nfsd            D ffff88107fcf1500     0 13599      2 0x00000000
[2154394.731205]  ffff88101f3ab200 0000000000000046 0000000000000000 ffff88067d07abc0
[2154394.731207]  0000000000011500 ffff8806b34d7fd8 0000000000011500 ffff8806b34d6010
[2154394.731209]  ffff8806b34d7fd8 0000000000011500 ffff88101f3ab200 0000000000011500
[2154394.731211] Call Trace:
[2154394.731213]  [<ffffffff8112dd75>] ? cv_wait_common+0x115/0x1d0
[2154394.731216]  [<ffffffff8119de74>] ? dbuf_hold_impl+0x94/0xc0
[2154394.731218]  [<ffffffff810850d0>] ? wake_up_bit+0x40/0x40
[2154394.731220]  [<ffffffff811a57bb>] ? dmu_buf_hold_array_by_dnode+0x25b/0x560
[2154394.731223]  [<ffffffff811a5d44>] ? dmu_buf_hold_array+0x64/0xa0
[2154394.731225]  [<ffffffff811a5db9>] ? dmu_read_uio+0x39/0xd0
[2154394.731227]  [<ffffffff8122b907>] ? zfs_inode_update+0x117/0x190
[2154394.731229]  [<ffffffff81227e6a>] ? zfs_read+0x16a/0x4a0
[2154394.731231]  [<ffffffff813486b0>] ? fh_verify+0x550/0x550
[2154394.731233]  [<ffffffff81344a9f>] ? find_acceptable_alias+0x1f/0x110
[2154394.731235]  [<ffffffff8123a6f1>] ? zpl_read_common+0x51/0x70
[2154394.731237]  [<ffffffff8123a775>] ? zpl_read+0x65/0xa0
[2154394.731240]  [<ffffffff8112efcc>] ? tsd_hash_search+0x8c/0x1f0
[2154394.731242]  [<ffffffff8123a710>] ? zpl_read_common+0x70/0x70
[2154394.731244]  [<ffffffff8114e14f>] ? do_loop_readv_writev+0x5f/0xb0
[2154394.731247]  [<ffffffff8114f506>] ? do_readv_writev+0x206/0x210
[2154394.731249]  [<ffffffff812211ea>] ? zfs_open+0x9a/0x140
[2154394.731251]  [<ffffffff8123a212>] ? zpl_open+0x52/0xb0
[2154394.731253]  [<ffffffff8123a1c0>] ? zpl_release+0x70/0x70
[2154394.731255]  [<ffffffff81349902>] ? nfsd_vfs_read+0x72/0x160
[2154394.731257]  [<ffffffff81349c33>] ? nfsd_open+0xf3/0x190
[2154394.731259]  [<ffffffff8134a2ad>] ? nfsd_read+0x1bd/0x2c0
[2154394.731262]  [<ffffffff819de2c9>] ? cache_check+0x69/0x390
[2154394.731264]  [<ffffffff81352a98>] ? nfsd3_proc_read+0xa8/0xf0
[2154394.731266]  [<ffffffff8134561d>] ? nfsd_dispatch+0xbd/0x1c0
[2154394.731268]  [<ffffffff819d4c4e>] ? svc_process_common+0x30e/0x580
[2154394.731270]  [<ffffffff81345b00>] ? nfsd_svc+0x1f0/0x1f0
[2154394.731271]  [<ffffffff819d518c>] ? svc_process+0xfc/0x160
[2154394.731273]  [<ffffffff81345b00>] ? nfsd_svc+0x1f0/0x1f0
[2154394.731275]  [<ffffffff81345b9f>] ? nfsd+0x9f/0x140
[2154394.731277]  [<ffffffff81084aee>] ? kthread+0xce/0xe0
[2154394.731279]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30
[2154394.731281]  [<ffffffff81a4c0ec>] ? ret_from_fork+0x7c/0xb0
[2154394.731283]  [<ffffffff81084a20>] ? kthread_parkme+0x30/0x30

[2154394.741861] zfs             D ffff88107fc31500     0 24681      1 0x00000004
[2154394.741863]  ffff88066a0f44c0 0000000000000086 0000002000000000 ffff881038d53e80
[2154394.741864]  0000000000011500 ffff880a40fbdfd8 0000000000011500 ffff880a40fbc010
[2154394.741866]  ffff880a40fbdfd8 0000000000011500 ffff88066a0f44c0 0000000000011500
[2154394.741868] Call Trace:
[2154394.741870]  [<ffffffff8110e1ee>] ? alloc_pages_vma+0x5e/0x220
[2154394.741872]  [<ffffffff81a499e0>] ? __mutex_lock_slowpath+0xe0/0x150
[2154394.741874]  [<ffffffff81186650>] ? nv_mem_zalloc+0x30/0x50
[2154394.741876]  [<ffffffff81a494ca>] ? mutex_lock+0x1a/0x40
[2154394.741878]  [<ffffffff811e8329>] ? spa_all_configs+0x49/0x120
[2154394.741881]  [<ffffffff81214607>] ? zfs_ioc_pool_configs+0x27/0x60
[2154394.741883]  [<ffffffff81214bee>] ? zfsdev_ioctl+0xfe/0x1d0
[2154394.741885]  [<ffffffff8115f253>] ? do_vfs_ioctl+0x93/0x500
[2154394.741887]  [<ffffffff810fc725>] ? remove_vma+0x55/0x60
[2154394.741889]  [<ffffffff810fe716>] ? do_munmap+0x2d6/0x380
[2154394.741890]  [<ffffffff8115f768>] ? sys_ioctl+0xa8/0xb0
[2154394.741893]  [<ffffffff81a4c192>] ? system_call_fastpath+0x16/0x1b


[2154394.742394] zpool           D ffff88107fc31500     0 27021      1 0x00000004
[2154394.742395]  ffff88101f322bc0 0000000000000082 0000002000000000 ffff881038d53e80
[2154394.742397]  0000000000011500 ffff880e597cbfd8 0000000000011500 ffff880e597ca010
[2154394.742399]  ffff880e597cbfd8 0000000000011500 ffff88101f322bc0 0000000000011500
[2154394.742401] Call Trace:
[2154394.742403]  [<ffffffff8110e1ee>] ? alloc_pages_vma+0x5e/0x220
[2154394.742405]  [<ffffffff81a499e0>] ? __mutex_lock_slowpath+0xe0/0x150
[2154394.742407]  [<ffffffff81186650>] ? nv_mem_zalloc+0x30/0x50
[2154394.742409]  [<ffffffff81a494ca>] ? mutex_lock+0x1a/0x40
[2154394.742411]  [<ffffffff811e8329>] ? spa_all_configs+0x49/0x120
[2154394.742413]  [<ffffffff81214607>] ? zfs_ioc_pool_configs+0x27/0x60
[2154394.742416]  [<ffffffff81214bee>] ? zfsdev_ioctl+0xfe/0x1d0
[2154394.742417]  [<ffffffff8115f253>] ? do_vfs_ioctl+0x93/0x500
[2154394.742419]  [<ffffffff810fc725>] ? remove_vma+0x55/0x60
[2154394.742421]  [<ffffffff810fe716>] ? do_munmap+0x2d6/0x380
[2154394.742423]  [<ffffffff8115f768>] ? sys_ioctl+0xa8/0xb0
[2154394.742425]  [<ffffffff81a4c192>] ? system_call_fastpath+0x16/0x1b

All 14 comments

@duidalus This is something of a known issue. ZFS must block waiting for all outstanding I/O to a device to either complete, fail, or time out. For certain types of disk failures it can take the SCSI mid layer and driver a very long time (tens of minutes) to exhaust all of the internal retries and finally time out.

Unfortunately, there's nothing we can do about this at the ZFS layer. We already set the FASTFAIL flags, but if the lower layers don't honor them, or are just buggy, ZFS must wait. This is something which needs to be improved in the Linux kernel drivers. The good news is that BTRFS and the MD drivers suffer from this too, so I imagine the upstream kernel community would be receptive to any improvements in this area.
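As an illustration of where that time goes, the per-command timeout the SCSI midlayer waits for before starting its abort/retry cycle can be inspected, and optionally lowered, through sysfs (the device name and value below are examples, not a recommendation):

cat /sys/block/sdo/device/timeout        # current command timeout in seconds
echo 30 > /sys/block/sdo/device/timeout  # give up on failed commands sooner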

I have a similar issue, but for me the devices have come back online... Issuing a zpool clear (the exact form is shown after the log below) can sometimes resolve this, but not always:

Nov 13 13:36:40 storage1 kernel: SPLError: 2995:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'avp1' has encountered an uncorrectable I/O failure and has been suspended.
Nov 13 13:36:40 storage1 kernel:
Nov 13 13:37:39 storage1 kernel: SPLError: 3296:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'avp2' has encountered an uncorrectable I/O failure and has been suspended.
Nov 13 13:37:39 storage1 kernel:
Nov 13 13:37:42 storage1 kernel: SPLError: 3298:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'avp2' has encountered an uncorrectable I/O failure and has been suspended.
Nov 13 13:37:42 storage1 kernel:
Nov 13 13:41:33 storage1 kernel: INFO: task txg_sync:3421 blocked for more than 120 seconds.
Nov 13 13:41:33 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 13 13:41:33 storage1 kernel: txg_sync D 0000000000000007 0 3421 2 0x00000080
Nov 13 13:41:33 storage1 kernel: ffff8807d33618d0 0000000000000046 ffff8807d3361898 ffff8807d3361894
Nov 13 13:41:33 storage1 kernel: 0000000000000000 ffff88087fc24f80 ffff880028236700 000000000000ad64
Nov 13 13:41:33 storage1 kernel: ffff880804133ab8 ffff8807d3361fd8 000000000000fb88 ffff880804133ab8
Nov 13 13:41:33 storage1 kernel: Call Trace:
Nov 13 13:41:33 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 13 13:41:33 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 13 13:41:33 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 13 13:41:33 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 13 13:41:33 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] dbuf_read+0x3fd/0x740 [zfs]
Nov 13 13:41:33 storage1 kernel: [] dmu_buf_hold+0x108/0x1d0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] zap_lockdir+0x57/0x730 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? kmem_free_debug+0x4b/0x150 [spl]
Nov 13 13:41:33 storage1 kernel: [] zap_update+0x53/0x240 [zfs]
Nov 13 13:41:33 storage1 kernel: [] spa_errlog_sync+0x168/0x220 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? spa_deadman+0x0/0x120 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? spa_error_entry_compare+0x0/0x40 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? bpobj_space+0x9f/0xb0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? spa_error_entry_compare+0x0/0x40 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dsl_scan_active+0x9b/0xa0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] spa_sync+0x402/0xad0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] txg_sync_thread+0x33f/0x5d0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 13 13:41:33 storage1 kernel: [] ? txg_sync_thread+0x0/0x5d0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 13 13:41:33 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 13 13:41:33 storage1 kernel: [] kthread+0x96/0xa0
Nov 13 13:41:33 storage1 kernel: [] child_rip+0xa/0x20
Nov 13 13:41:33 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 13 13:41:33 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 13 13:41:33 storage1 kernel: INFO: task zpool:4171 blocked for more than 120 seconds.
Nov 13 13:41:33 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 13 13:41:33 storage1 kernel: zpool D 0000000000000002 0 4171 5336 0x00000080
Nov 13 13:41:33 storage1 kernel: ffff880823c23478 0000000000000082 ffff880800000000 ffffffffa03e7320
Nov 13 13:41:33 storage1 kernel: ffff880285ae1710 00000002539d7c18 0000000000000004 00000000a0422023
Nov 13 13:41:33 storage1 kernel: ffff88081ff65af8 ffff880823c23fd8 000000000000fb88 ffff88081ff65af8
Nov 13 13:41:33 storage1 kernel: Call Trace:
Nov 13 13:41:33 storage1 kernel: [] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 13 13:41:33 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 13 13:41:33 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 13 13:41:33 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 13 13:41:33 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 13 13:41:33 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] dbuf_read+0x3fd/0x740 [zfs]
Nov 13 13:41:33 storage1 kernel: [] dmu_buf_hold+0x108/0x1d0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] zap_lockdir+0x57/0x730 [zfs]
Nov 13 13:41:33 storage1 kernel: [] zap_cursor_retrieve+0x1e4/0x2f0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? __alloc_pages_nodemask+0x113/0x8d0
Nov 13 13:41:33 storage1 kernel: [] ? __kmalloc+0x20c/0x220
Nov 13 13:41:33 storage1 kernel: [] dsl_prop_get_all_impl+0xe3/0x520 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? down_read+0x16/0x30
Nov 13 13:41:33 storage1 kernel: [] ? down_read+0x16/0x30
Nov 13 13:41:33 storage1 kernel: [] ? dmu_zfetch+0x356/0xe40 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? kmem_alloc_debug+0x13a/0x4c0 [spl]
Nov 13 13:41:33 storage1 kernel: [] ? kmem_alloc_debug+0x13a/0x4c0 [spl]
Nov 13 13:41:33 storage1 kernel: [] ? nv_alloc_sleep_spl+0x28/0x30 [znvpair]
Nov 13 13:41:33 storage1 kernel: [] ? nv_mem_zalloc+0x38/0x50 [znvpair]
Nov 13 13:41:33 storage1 kernel: [] ? mutex_lock+0x1e/0x50
Nov 13 13:41:33 storage1 kernel: [] dsl_prop_get_all_ds+0xc2/0x180 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dmu_object_info_from_db+0x3e/0x50 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dsl_dataset_hold_obj+0x99/0x640 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dbuf_rele_and_unlock+0x169/0x210 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dmu_buf_rele+0x30/0x40 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dsl_dir_rele+0x35/0x40 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? dsl_dataset_hold+0x65/0x1f0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? mutex_lock+0x1e/0x50
Nov 13 13:41:33 storage1 kernel: [] dsl_prop_get_all+0x13/0x20 [zfs]
Nov 13 13:41:33 storage1 kernel: [] zfs_ioc_objset_stats_impl+0x5c/0xf0 [zfs]
Nov 13 13:41:33 storage1 kernel: [] zfs_ioc_objset_stats+0x31/0x50 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? strlcpy+0x4a/0x60
Nov 13 13:41:33 storage1 kernel: [] zfsdev_ioctl+0x48e/0x500 [zfs]
Nov 13 13:41:33 storage1 kernel: [] ? do_sync_read+0xfa/0x140
Nov 13 13:41:33 storage1 kernel: [] vfs_ioctl+0x22/0xa0
Nov 13 13:41:33 storage1 kernel: [] do_vfs_ioctl+0x84/0x580
Nov 13 13:41:33 storage1 kernel: [] ? vfs_read+0x12f/0x1a0
Nov 13 13:41:33 storage1 kernel: [] sys_ioctl+0x81/0xa0
Nov 13 13:41:33 storage1 kernel: [] ? __audit_syscall_exit+0x265/0x290
Nov 13 13:41:33 storage1 kernel: [] system_call_fastpath+0x16/0x1b

Oddly enough this only locks up my zpool commands; zfs commands still seem to work just fine.
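The clear I've been issuing is just the plain form (pool names as in the log above, device placeholder illustrative):

zpool clear avp1             # ask ZFS to resume I/O on the suspended pool
zpool clear avp1 <device>    # optionally limit the clear to a specific vdev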

ZoL definitely doesn't seem to be handling disk failures gracefully:

(The SCSI timeout is set to 0, so there shouldn't be any stalled I/O.)

Nov 14 21:44:21 storage1 kernel: SPLError: 25224:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:44:21 storage1 kernel:
Nov 14 21:44:36 storage1 kernel: SPLError: 25213:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:44:36 storage1 kernel:
Nov 14 21:44:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:44:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:44:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:44:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:44:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:44:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:44:36 storage1 kernel: Call Trace:
Nov 14 21:44:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:44:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:44:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:44:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:44:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:44:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:44:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:44:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:44:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:44:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:44:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:44:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:44:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:44:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:44:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:44:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:44:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:44:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:44:18 storage1 kernel: sd 40:0:0:61: [sdamt] 2621440 4096-byte logical blocks: (10.7 GB/10.0 GiB)
Nov 14 21:44:18 storage1 kernel: sd 32:0:0:30: [sdapc] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 35:0:0:31: [sdaqi] 2621440 4096-byte logical blocks: (10.7 GB/10.0 GiB)
Nov 14 21:44:18 storage1 kernel: unknown partition table
Nov 14 21:44:18 storage1 kernel: sd 35:0:0:63: [sdaqj] 2621440 4096-byte logical blocks: (10.7 GB/10.0 GiB)
Nov 14 21:44:18 storage1 kernel: sd 35:0:0:62: [sdaqh] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 40:0:0:60: [sdalr] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 35:0:0:29: [sdaqb] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 35:0:0:63: [sdaqj] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 35:0:0:31: [sdaqi] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 32:0:0:29: [sdaor] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 32:0:0:31: [sdapi] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: sd 40:0:0:61: [sdamt] Attached SCSI disk
Nov 14 21:44:18 storage1 kernel: unknown partition table
Nov 14 21:44:18 storage1 kernel: sd 41:0:0:31: [sdaps] 2621440 4096-byte logical blocks: (10.7 GB/10.0 GiB)
Nov 14 21:44:18 storage1 kernel: sd 41:0:0:31: [sdaps] Attached SCSI disk
Nov 14 21:44:21 storage1 kernel: SPLError: 25224:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:44:21 storage1 kernel:
Nov 14 21:44:36 storage1 kernel: SPLError: 25213:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:44:36 storage1 kernel:
Nov 14 21:44:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:44:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:44:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:44:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:44:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:44:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:44:36 storage1 kernel: Call Trace:
Nov 14 21:44:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:44:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:44:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:44:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:44:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:44:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:44:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:44:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:44:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:44:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:44:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:44:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:44:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:44:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:44:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:44:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:44:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:44:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:44:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:44:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:45:52 storage1 kernel: SPLError: 25215:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:45:52 storage1 kernel:
Nov 14 21:45:59 storage1 kernel: SPLError: 25221:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:45:59 storage1 kernel:
Nov 14 21:46:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:46:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:46:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:46:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:46:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:46:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:46:36 storage1 kernel: Call Trace:
Nov 14 21:46:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:46:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:46:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:46:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:46:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:46:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:46:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:46:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:46:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:46:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:46:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:46:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:46:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:46:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:46:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:46:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:46:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:46:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:46:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:46:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:48:35 storage1 kernel: SPLError: 25215:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:48:35 storage1 kernel:
Nov 14 21:48:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:48:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:48:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:48:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:48:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:48:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:48:36 storage1 kernel: Call Trace:
Nov 14 21:48:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:48:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:48:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:48:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:48:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:48:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:48:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:48:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:48:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:48:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:48:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:48:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:48:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:48:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:48:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:48:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:48:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:48:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:48:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:48:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:49:32 storage1 kernel: SPLError: 25221:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:49:32 storage1 kernel:
Nov 14 21:50:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:50:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:50:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:50:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:50:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:50:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:50:36 storage1 kernel: Call Trace:
Nov 14 21:50:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:50:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:50:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:50:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:50:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:50:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:50:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:50:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:50:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:50:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:50:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:50:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:50:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:50:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:50:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:50:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:50:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:50:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:50:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:50:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:52:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:52:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:52:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:52:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:52:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:52:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:52:36 storage1 kernel: Call Trace:
Nov 14 21:52:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:52:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:52:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:52:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:52:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:52:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:52:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:52:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:52:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:52:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:52:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:52:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:52:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:52:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:52:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:52:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:52:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:52:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:52:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:52:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:54:03 storage1 kernel: SPLError: 25219:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:54:03 storage1 kernel:
Nov 14 21:54:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:54:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:54:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:54:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:54:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:54:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:54:36 storage1 kernel: Call Trace:
Nov 14 21:54:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:54:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:54:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:54:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:54:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:54:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:54:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:54:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:54:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:54:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:54:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:54:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:54:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:54:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:54:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:54:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:54:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:54:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:54:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:54:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:54:36 storage1 kernel: INFO: task umount:13912 blocked for more than 120 seconds.
Nov 14 21:54:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:54:36 storage1 kernel: umount D 000000000000000a 0 13912 13911 0x00000080
Nov 14 21:54:36 storage1 kernel: ffff8804edb95b28 0000000000000086 0000000000000000 ffff88059ccb6340
Nov 14 21:54:36 storage1 kernel: 0000000000000046 0000000000000286 ffff8804edb95bc8 ffffffffa01c99e7
Nov 14 21:54:36 storage1 kernel: ffff88085b6dd098 ffff8804edb95fd8 000000000000fb88 ffff88085b6dd098
Nov 14 21:54:36 storage1 kernel: Call Trace:
Nov 14 21:54:36 storage1 kernel: [] ? spl_debug_msg+0x427/0x9d0 [spl]
Nov 14 21:54:36 storage1 kernel: [] cv_wait_common+0x105/0x1c0 [spl]
Nov 14 21:54:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:54:36 storage1 kernel: [] __cv_wait+0x15/0x20 [spl]
Nov 14 21:54:36 storage1 kernel: [] txg_wait_synced+0xb3/0x190 [zfs]
Nov 14 21:54:36 storage1 kernel: [] dmu_tx_wait+0xc5/0xf0 [zfs]
Nov 14 21:54:36 storage1 kernel: [] dmu_tx_assign+0x8e/0x4e0 [zfs]
Nov 14 21:54:36 storage1 kernel: [] zfs_inactive+0x186/0x220 [zfs]
Nov 14 21:54:36 storage1 kernel: [] zpl_clear_inode+0xe/0x10 [zfs]
Nov 14 21:54:36 storage1 kernel: [] clear_inode+0xac/0x140
Nov 14 21:54:36 storage1 kernel: [] dispose_list+0x40/0x120
Nov 14 21:54:36 storage1 kernel: [] invalidate_inodes+0xea/0x190
Nov 14 21:54:36 storage1 kernel: [] generic_shutdown_super+0x4c/0xe0
Nov 14 21:54:36 storage1 kernel: [] kill_anon_super+0x16/0x60
Nov 14 21:54:36 storage1 kernel: [] zpl_kill_sb+0x1e/0x30 [zfs]
Nov 14 21:54:36 storage1 kernel: [] deactivate_super+0x57/0x80
Nov 14 21:54:36 storage1 kernel: [] mntput_no_expire+0xbf/0x110
Nov 14 21:54:36 storage1 kernel: [] sys_umount+0x7b/0x3a0
Nov 14 21:54:36 storage1 kernel: [] system_call_fastpath+0x16/0x1b
Nov 14 21:55:38 storage1 kernel: SPLError: 25215:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test' has encountered an uncorrectable I/O failure and has been suspended.
Nov 14 21:55:38 storage1 kernel:
Nov 14 21:55:38 storage1 kernel: SPLError: 25215:0:(spl-err.c:67:vcmn_err()) Skipped 3 previous similar messages
Nov 14 21:56:36 storage1 kernel: INFO: task txg_sync:25292 blocked for more than 120 seconds.
Nov 14 21:56:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:56:36 storage1 kernel: txg_sync D 0000000000000004 0 25292 2 0x00000080
Nov 14 21:56:36 storage1 kernel: ffff88059fc95b60 0000000000000046 0000000000000001 ffff88086e696570
Nov 14 21:56:36 storage1 kernel: 0000000000000000 0000000000000000 ffff88059fc95ae0 ffffffff810639a2
Nov 14 21:56:36 storage1 kernel: ffff880859528638 ffff88059fc95fd8 000000000000fb88 ffff880859528638
Nov 14 21:56:36 storage1 kernel: Call Trace:
Nov 14 21:56:36 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 14 21:56:36 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 14 21:56:36 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 14 21:56:36 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 14 21:56:36 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 14 21:56:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:56:36 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 14 21:56:36 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 14 21:56:36 storage1 kernel: [] dsl_pool_sync+0xf4/0x540 [zfs]
Nov 14 21:56:36 storage1 kernel: [] spa_sync+0x40e/0xa80 [zfs]
Nov 14 21:56:36 storage1 kernel: [] ? read_tsc+0x9/0x20
Nov 14 21:56:36 storage1 kernel: [] txg_sync_thread+0x307/0x590 [zfs]
Nov 14 21:56:36 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 14 21:56:36 storage1 kernel: [] ? txg_sync_thread+0x0/0x590 [zfs]
Nov 14 21:56:36 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 14 21:56:36 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 14 21:56:36 storage1 kernel: [] kthread+0x96/0xa0
Nov 14 21:56:36 storage1 kernel: [] child_rip+0xa/0x20
Nov 14 21:56:36 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 14 21:56:36 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 14 21:56:36 storage1 kernel: INFO: task umount:13912 blocked for more than 120 seconds.
Nov 14 21:56:36 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:56:36 storage1 kernel: umount D 000000000000000a 0 13912 1 0x00000080
Nov 14 21:56:36 storage1 kernel: ffff8804edb95b28 0000000000000086 0000000000000000 ffff88059ccb6340
Nov 14 21:56:36 storage1 kernel: 0000000000000046 0000000000000286 ffff8804edb95bc8 ffffffffa01c99e7
Nov 14 21:56:36 storage1 kernel: ffff88085b6dd098 ffff8804edb95fd8 000000000000fb88 ffff88085b6dd098
Nov 14 21:56:36 storage1 kernel: Call Trace:
Nov 14 21:56:36 storage1 kernel: [] ? spl_debug_msg+0x427/0x9d0 [spl]
Nov 14 21:56:36 storage1 kernel: [] cv_wait_common+0x105/0x1c0 [spl]
Nov 14 21:56:36 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 14 21:56:36 storage1 kernel: [] __cv_wait+0x15/0x20 [spl]
Nov 14 21:56:36 storage1 kernel: [] txg_wait_synced+0xb3/0x190 [zfs]
Nov 14 21:56:36 storage1 kernel: [] dmu_tx_wait+0xc5/0xf0 [zfs]
Nov 14 21:56:36 storage1 kernel: [] dmu_tx_assign+0x8e/0x4e0 [zfs]
Nov 14 21:56:36 storage1 kernel: [] zfs_inactive+0x186/0x220 [zfs]
Nov 14 21:56:36 storage1 kernel: [] zpl_clear_inode+0xe/0x10 [zfs]
Nov 14 21:56:36 storage1 kernel: [] clear_inode+0xac/0x140
Nov 14 21:56:36 storage1 kernel: [] dispose_list+0x40/0x120
Nov 14 21:56:36 storage1 kernel: [] invalidate_inodes+0xea/0x190
Nov 14 21:56:36 storage1 kernel: [] generic_shutdown_super+0x4c/0xe0
Nov 14 21:56:36 storage1 kernel: [] kill_anon_super+0x16/0x60
Nov 14 21:56:36 storage1 kernel: [] zpl_kill_sb+0x1e/0x30 [zfs]
Nov 14 21:56:36 storage1 kernel: [] deactivate_super+0x57/0x80
Nov 14 21:56:36 storage1 kernel: [] mntput_no_expire+0xbf/0x110
Nov 14 21:56:36 storage1 kernel: [] sys_umount+0x7b/0x3a0
Nov 14 21:56:36 storage1 kernel: [] system_call_fastpath+0x16/0x1b

This is marked as documentation... I'm not sure it should be: ZFS just hard-stops on failed devices.

(I saved a core dump from this issue if you're interested.)

@CrashHD I think what you are encountering is the actual loss of enough disks (bus reset?) that your pool enters wait mode.

In my case it was simply a single disk being stuck with redundancy still available.
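For what it's worth, the wait behaviour of a suspended pool is governed by the pool's failmode property; a quick, illustrative way to check or change it (pool name is an example):

zpool get failmode tank            # default is 'wait': all I/O blocks until the fault is cleared
zpool set failmode=continue tank   # 'continue' returns EIO to new write requests instead of blocking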

Bumping to 0.6.4. We need to address this, but at this point it shouldn't block getting 0.6.3 released.

I found another problem when there are zvols in the pool: any command that tries to do something with a zvol deadlocks.
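Examples of the kind of command that trigger it (the zvol device path is illustrative):

fdisk -l /dev/zd0     # anything that opens the zvol block device hangs in zvol_open
zfs list -t volume    # hangs once a zvol's stats are queried (zvol_get_stats)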

[ 2081.904195] fdisk D 0000000000000000 0 31124 24226 0x20020000
[ 2081.904259] ffff8800336c5368 0000000000000006 ffff8800336c5278 ffffffff810761cd
[ 2081.904266] ffff88002db660f0 ffff8800336c5fd8 ffff8800336c5fd8 ffff8800336c5fd8
[ 2081.904271] ffff88002f1bac10 ffff88002db660f0 0000000000000000 000000000000002e
[ 2081.904277] Call Trace:
[ 2081.904284] [] ? up+0x2d/0x50
[ 2081.904365] [] ? zio_taskq_dispatch+0xce/0x120 [zfs]
[ 2081.904373] [] schedule+0x24/0x70
[ 2081.904380] [] io_schedule+0x5a/0x80
[ 2081.904402] [] cv_wait_common+0xc0/0x1d0 [spl]
[ 2081.904474] [] ? zio_done+0x8e/0xcb0 [zfs]
[ 2081.904483] [] ? abort_exclusive_wait+0xb0/0xb0
[ 2081.904526] [] __cv_wait_io+0x13/0x20 [spl]
[ 2081.904599] [] zio_wait+0x103/0x1c0 [zfs]
[ 2081.904657] [] dbuf_read+0x33a/0x880 [zfs]
[ 2081.904715] [] __dbuf_hold_impl+0x364/0x480 [zfs]
[ 2081.904772] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 2081.904829] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 2081.904885] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 2081.904942] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 2081.904998] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 2081.905055] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 2081.905111] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 2081.905168] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 2081.905224] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 2081.905281] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 2081.905338] [] dbuf_hold_impl+0x81/0xb0 [zfs]
[ 2081.905356] [] ? kmem_free_debug+0x46/0x140 [spl]
[ 2081.905413] [] dbuf_hold+0x1b/0x30 [zfs]
[ 2081.905478] [] dnode_hold_impl+0x15c/0x5d0 [zfs]
[ 2081.905486] [] ? free_debug_processing+0x1b6/0x20d
[ 2081.905503] [] ? kmem_free_debug+0x46/0x140 [spl]
[ 2081.905567] [] dnode_hold+0x14/0x20 [zfs]
[ 2081.905627] [] dmu_buf_hold+0x48/0x1d0 [zfs]
[ 2081.905647] [] ? __cv_init+0x56/0x110 [spl]
[ 2081.905666] [] ? tsd_hash_search.isra.1+0x93/0x1a0 [spl]
[ 2081.905744] [] zap_lockdir+0x55/0x780 [zfs]
[ 2081.905751] [] ? kfree+0x13d/0x180
[ 2081.905829] [] zap_lookup_norm+0x45/0x1a0 [zfs]
[ 2081.905899] [] ? dsl_pool_rele+0x2d/0x40 [zfs]
[ 2081.905976] [] zap_lookup+0x2e/0x30 [zfs]
[ 2081.906049] [] zvol_open+0x19c/0x2f0 [zfs]
[ 2081.906059] [] __blkdev_get+0xad/0x410
[ 2081.906066] [] ? inode_init_always+0xf9/0x1c0
[ 2081.906075] [] blkdev_get+0x4e/0x300
[ 2081.906083] [] ? unlock_new_inode+0x42/0x70
[ 2081.906090] [] ? bdget+0x115/0x130
[ 2081.906097] [] ? blkdev_get+0x300/0x300
[ 2081.906105] [] blkdev_open+0x58/0x80
[ 2081.906113] [] do_dentry_open+0x21e/0x2a0
[ 2081.906120] [] finish_open+0x30/0x40
[ 2081.906127] [] do_last+0x7ee/0xea0
[ 2081.906133] [] ? inode_permission+0x13/0x50
[ 2081.906139] [] ? link_path_walk+0x235/0x910
[ 2081.906146] [] path_openat+0xb3/0x4e0
[ 2081.906153] [] ? getname_flags+0x55/0x190
[ 2081.906160] [] do_filp_open+0x3d/0xa0
[ 2081.906167] [] ? __alloc_fd+0x45/0x110
[ 2081.906173] [] do_sys_open+0xf9/0x1e0
[ 2081.906180] [] compat_SyS_open+0x59/0xe0
[ 2081.906188] [] ? syscall_trace_enter+0x24/0x260
[ 2081.906242] [] ia32_do_call+0x13/0x13

[ 645.412271] zfs D 0000000000000000 0 29938 24961 0x00000000
[ 645.412336] ffff8800283016d8 0000000000000006 ffff8800283015e8 ffffffff810761cd
[ 645.412343] ffff880037bf7290 ffff880028301fd8 ffff880028301fd8 ffff880028301fd8
[ 645.412348] ffff88002f3988d0 ffff880037bf7290 0000000000000000 0000000000000030
[ 645.412354] Call Trace:
[ 645.412361] [] ? up+0x2d/0x50
[ 645.412445] [] ? spa_config_enter+0xc9/0x110 [zfs]
[ 645.412453] [] schedule+0x24/0x70
[ 645.412460] [] io_schedule+0x5a/0x80
[ 645.412481] [] cv_wait_common+0xc0/0x1d0 [spl]
[ 645.412554] [] ? zio_done+0x8e/0xcb0 [zfs]
[ 645.412562] [] ? abort_exclusive_wait+0xb0/0xb0
[ 645.412583] [] __cv_wait_io+0x13/0x20 [spl]
[ 645.412655] [] zio_wait+0x103/0x1c0 [zfs]
[ 645.412712] [] dbuf_read+0x33a/0x880 [zfs]
[ 645.412769] [] ? dbuf_rele_and_unlock+0x159/0x210 [zfs]
[ 645.412827] [] __dbuf_hold_impl+0x364/0x480 [zfs]
[ 645.412883] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 645.412940] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 645.412996] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 645.413053] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 645.413109] [] ? dbuf_find+0xd8/0x100 [zfs]
[ 645.413166] [] __dbuf_hold_impl+0x175/0x480 [zfs]
[ 645.413224] [] dbuf_hold_impl+0x81/0xb0 [zfs]
[ 645.413281] [] dbuf_hold+0x1b/0x30 [zfs]
[ 645.413346] [] dnode_hold_impl+0x15c/0x5d0 [zfs]
[ 645.413354] [] ? save_stack_trace+0x2a/0x50
[ 645.413361] [] ? set_track+0x5d/0x1a0
[ 645.413426] [] dnode_hold+0x14/0x20 [zfs]
[ 645.413485] [] dmu_buf_hold+0x48/0x1d0 [zfs]
[ 645.413493] [] ? __slab_free+0x251/0x2fc
[ 645.413570] [] zap_lockdir+0x55/0x780 [zfs]
[ 645.413578] [] ? kfree+0x13d/0x180
[ 645.413655] [] zap_lookup_norm+0x45/0x1a0 [zfs]
[ 645.413732] [] zap_lookup+0x2e/0x30 [zfs]
[ 645.413806] [] zvol_get_stats+0x3a/0xe0 [zfs]
[ 645.413881] [] zfs_ioc_objset_stats_impl+0xb1/0xf0 [zfs]
[ 645.413956] [] zfs_ioc_objset_stats+0x2c/0x50 [zfs]
[ 645.414018] [] ? dmu_objset_rele+0x39/0x50 [zfs]
[ 645.414094] [] zfs_ioc_dataset_list_next+0x120/0x140 [zfs]
[ 645.414170] [] zfsdev_ioctl+0x4eb/0x560 [zfs]
[ 645.414178] [] ? sched_clock_local+0x25/0x90
[ 645.414185] [] ? sched_clock_cpu+0xa8/0x110
[ 645.414193] [] do_vfs_ioctl+0x8a/0x4e0
[ 645.414200] [] ? vtime_account_user+0x52/0x70
[ 645.414207] [] SyS_ioctl+0x91/0xb0
[ 645.414215] [] tracesys+0xdd/0xe2

I've already tried to resolve this by checking spa_suspended before the zap_lookup, but that only helps in the case where the pool was suspended while performing I/O operations on a dataset.

Closing. Since this issue was last visited, patches have been merged into ZoL to address both the zvol deadlocks and the zpool status hangs on a faulted pool.

Should I still be experiencing this on 0.6.5.3 then?

FWIW, I got:

WARNING: Pool '...' has encountered an uncorrectable I/O failure and has been suspended.

in my kernel log without a single disk error being logged by the kernel.

Still hangs on Ubuntu 18.04 x64 with custom kernel 4.17.0 and v0.7.9-1 (compiled by myself).

Should we re-open this issue?

Pool 'rescuetank' is on a USB disk; I unplugged the USB disk after importing the pool.
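Roughly what I did, with the pool name from the output below:

zpool import rescuetank    # import the pool from the USB disk
# ...physically unplug the USB disk while the pool is imported...
zpool export rescuetank    # this (and destroy/clear) then hangs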

Yet 'zpool status' still reports everything is OK for this pool:

  pool: rescuetank
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    rescuetank  ONLINE       0     0     0
      sdc2      ONLINE       0     0     0

errors: No known data errors

But 'zpool export/destroy/clear rescuetank' hangs with the following kernel messages (even after a system reboot with the USB disk removed):

[    6.833572] ZFS: Loaded module v0.7.9-1, ZFS pool version 5000, ZFS filesystem version 5

[   81.165963] WARNING: Pool 'rescuetank' has encountered an uncorrectable I/O failure and has been suspended.

[  243.010444] INFO: task txg_sync:4214 blocked for more than 120 seconds.
[  243.010835]       Tainted: P           OE     4.17.0-qemu-4.17-1+ #2
[  243.011185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.011614] txg_sync        D    0  4214      2 0x80000000
[  243.011619] Call Trace:
[  243.011631]  __schedule+0x291/0x870
[  243.011636]  schedule+0x2c/0x80
[  243.011643]  io_schedule+0x16/0x40
[  243.011659]  cv_wait_common+0xb2/0x140 [spl]
[  243.011664]  ? wait_woken+0x80/0x80
[  243.011675]  __cv_wait_io+0x18/0x20 [spl]
[  243.011762]  zio_wait+0xf8/0x1b0 [zfs]
[  243.011828]  dsl_pool_sync+0x3d5/0x420 [zfs]
[  243.011900]  spa_sync+0x43e/0xd30 [zfs]
[  243.011977]  txg_sync_thread+0x2cd/0x4a0 [zfs]
[  243.012046]  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
[  243.012055]  thread_generic_wrapper+0x74/0x90 [spl]
[  243.012061]  kthread+0x121/0x140
[  243.012069]  ? __thread_exit+0x20/0x20 [spl]
[  243.012073]  ? kthread_create_worker_on_cpu+0x70/0x70
[  243.012079]  ret_from_fork+0x35/0x40

A works-for-me workaround: sudo rm -f /etc/zfs/zpool.cache clears the stale state and the pool can be re-imported.
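In full, the workaround is roughly (pool name as above):

sudo rm -f /etc/zfs/zpool.cache   # drop the cached state of the suspended pool
sudo zpool import rescuetank      # the pool can then be re-imported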


Worked for me too. Thanks.

I am still experiencing the same problem on Ubuntu 19.04 – both with and without /etc/zfs/zpool.cache. Import > Kernel Trace > Zpool Crash… :crying_cat_face:
