ZFS: Improvement: zpool status currently does not count low-level disk errors

Created on 14 Jul 2016 · 8 comments · Source: openzfs/zfs

Today I received this mail:

From: [email protected] (root)
To: [email protected]
Subject: ZFS io error for tanka on xxx.yyy.com
Date: Thu, 14 Jul 2016 07:09:43 +0200

ZFS has detected an io error:

   eid: 12
 class: io
  host: xxx.yyy.com
  time: 2016-07-14 07:09:42+0200
 vtype: disk
 vpath: /dev/disk/by-id/ata-ST3000DM001-9YN166_S1F0F544-part1
 vguid: 0xA28F73CE25C89BFD
 cksum: 0
  read: 0
 write: 0
  pool: tanka

But the pool is clear:

# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tanka  24,5T  2,94T  21,6T         -     5%    12%  1.19x  ONLINE  -
# zpool status
  pool: tanka
 state: ONLINE
  scan: resilvered 498G in 52h55m with 0 errors on Sat Jul  9 02:13:26 2016
config:

    NAME                                          STATE     READ WRITE CKSUM
    tanka                                         ONLINE       0     0     0
      raidz2-0                                    ONLINE       0     0     0
        ata-TOSHIBA_DT01ACA300_Z37G820GS          ONLINE       0     0     0
        ata-WDC_WD30EZRZ-00Z5HB0_WD-WCC4N2TFU67F  ONLINE       0     0     0
        ata-ST3000DM001-9YN166_Z1F0NBDP           ONLINE       0     0     0
        ata-ST3000DM001-9YN166_S1F0F544           ONLINE       0     0     0
        ata-WDC_WD30EZRZ-00Z5HB0_WD-WCC4N1LET275  ONLINE       0     0     0
        ata-TOSHIBA_DT01ACA300_Z2O51H2GS          ONLINE       0     0     0
        ata-TOSHIBA_DT01ACA300_Z5QS77VKS          ONLINE       0     0     0
        ata-TOSHIBA_DT01ACA300_56U7SHTAS          ONLINE       0     0     0
        ata-TOSHIBA_DT01ACA300_Z2O51RZGS          ONLINE       0     0     0

errors: No known data errors
Jul 14 2016 07:09:42.874641339 ereport.fs.zfs.io
        class = "ereport.fs.zfs.io"
        ena = 0x9302b36c500001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xd2912d8866b767d
                vdev = 0xa28f73ce25c89bfd
        (end detector)
        pool = "tanka"
        pool_guid = 0xd2912d8866b767d
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0xa28f73ce25c89bfd
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/ata-ST3000DM001-9YN166_S1F0F544-part1"
        vdev_ashift = 0xc
        vdev_complete_ts = 0x7009302b256d
        vdev_delta_ts = 0x44489
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        parent_guid = 0x27b370226659d459
        parent_type = "raidz"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x5
        zio_flags = 0xc0880
        zio_stage = 0x200000
        zio_pipeline = 0x210000
        zio_delay = 0x9f
        zio_timestamp = 0x70090d226539
        zio_delta = 0x1f8a9a49
        zio_offset = 0x9d78c98000
        zio_size = 0x1800
        zio_objset = 0x1459
        zio_object = 0xa
        zio_level = 0x0
        zio_blkid = 0x5882
        time = 0x57871e96 0x3421f7bb 
        eid = 0xc

Jul 14 2016 07:09:42.874641339 ereport.fs.zfs.io
        class = "ereport.fs.zfs.io"
        ena = 0x9302baa9c00001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xd2912d8866b767d
                vdev = 0xa28f73ce25c89bfd
        (end detector)
        pool = "tanka"
        pool_guid = 0xd2912d8866b767d
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0xa28f73ce25c89bfd
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/ata-ST3000DM001-9YN166_S1F0F544-part1"
        vdev_ashift = 0xc
        vdev_complete_ts = 0x7009302b256d
        vdev_delta_ts = 0x44489
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        parent_guid = 0x27b370226659d459
        parent_type = "raidz"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x5
        zio_flags = 0xc0880
        zio_stage = 0x200000
        zio_pipeline = 0x210000
        zio_delay = 0x9f
        zio_timestamp = 0x70090d1ab417
        zio_delta = 0x1f9126a1
        zio_offset = 0x9b00fec000
        zio_size = 0x200
        zio_objset = 0x0
        zio_object = 0x71
        zio_level = 0x0
        zio_blkid = 0x5bb4
        time = 0x57871e96 0x3421f7bb 
        eid = 0xd

Currently the disks are configured to spin down, so ZFS sometimes has to wait a few seconds when there has been no read/write for minutes, but that should not be reported as an I/O error.

Defect

All 8 comments

That's my dmesg:

[62997.726191] INFO: task zfs:16808 blocked for more than 120 seconds.
[62997.726259]       Tainted: P           O    4.6.3-1-ARCH #1
[62997.726309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[62997.726379] zfs             D ffff88083ff439f0     0 16808   9252 0x00000000
[62997.726388]  ffff88083ff439f0 00ff880843fe7224 ffff88084ff54c40 ffff88083ebb8000
[62997.726394]  ffff88083ff44000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[62997.726400]  0000000000000000 ffff88083ff43a08 ffffffff815c372c ffff880843fe72d8
[62997.726406] Call Trace:
[62997.726418]  [<ffffffff815c372c>] schedule+0x3c/0x90
[62997.726430]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[62997.726437]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[62997.726445]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[62997.726493]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[62997.726537]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[62997.726569]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[62997.726599]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[62997.726627]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[62997.726654]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[62997.726682]  [<ffffffffa029c103>] dmu_recv_end+0x93/0x1b0 [zfs]
[62997.726719]  [<ffffffffa030d614>] zfs_ioc_recv+0x4b4/0xbf0 [zfs]
[62997.726728]  [<ffffffffa01370c9>] ? tsd_set+0x199/0x4f0 [spl]
[62997.726773]  [<ffffffffa02c8768>] ? rrw_exit+0x68/0x150 [zfs]
[62997.726816]  [<ffffffffa02df0ae>] ? spa_close+0xe/0x10 [zfs]
[62997.726856]  [<ffffffffa02b8399>] ? dsl_pool_rele+0x29/0x30 [zfs]
[62997.726863]  [<ffffffff811d419e>] ? __kmalloc+0x2e/0x230
[62997.726870]  [<ffffffffa012eb75>] ? strdup+0x45/0x70 [spl]
[62997.726907]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[62997.726914]  [<ffffffff811f634b>] ? __vfs_read+0xcb/0x100
[62997.726920]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[62997.726927]  [<ffffffff811f7176>] ? vfs_read+0x86/0x130
[62997.726932]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[62997.726939]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[62997.726945] INFO: task zfs:18831 blocked for more than 120 seconds.
[62997.727003]       Tainted: P           O    4.6.3-1-ARCH #1
[62997.727052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[62997.727120] zfs             D ffff8807aa32bbc0     0 18831  16007 0x00000000
[62997.727126]  ffff8807aa32bbc0 0000000000000000 ffffffff8180d500 ffff88083ed82dc0
[62997.727132]  ffff8807aa32c000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[62997.727138]  0000000000000000 ffff8807aa32bbd8 ffffffff815c372c ffff880843fe72d8
[62997.727143] Call Trace:
[62997.727149]  [<ffffffff815c372c>] schedule+0x3c/0x90
[62997.727157]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[62997.727162]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[62997.727170]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[62997.727213]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[62997.727254]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[62997.727287]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[62997.727320]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[62997.727353]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[62997.727386]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[62997.727418]  [<ffffffffa033d211>] dsl_destroy_snapshots_nvl+0x91/0x100 [zfs]
[62997.727451]  [<ffffffffa033d2bf>] dsl_destroy_snapshot+0x3f/0x60 [zfs]
[62997.727486]  [<ffffffffa030e547>] zfs_ioc_destroy+0x37/0x50 [zfs]
[62997.727519]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[62997.727525]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[62997.727530]  [<ffffffff815c2fb9>] ? __schedule+0x399/0xad0
[62997.727536]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[62997.727542]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[63837.687235] INFO: task txg_sync:1618 blocked for more than 120 seconds.
[63837.687306]       Tainted: P           O    4.6.3-1-ARCH #1
[63837.687356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63837.687425] txg_sync        D ffff880841e97aa8     0  1618      2 0x00000000
[63837.687434]  ffff880841e97aa8 00ffffff810a3c2d ffff88084ff52dc0 ffff8808420a1e80
[63837.687441]  ffff880841e98000 ffff880873c95940 7fffffffffffffff ffff88050ed05f30
[63837.687446]  0000000000000001 ffff880841e97ac0 ffffffff815c372c 0000000000000000
[63837.687452] Call Trace:
[63837.687464]  [<ffffffff815c372c>] schedule+0x3c/0x90
[63837.687471]  [<ffffffff815c6203>] schedule_timeout+0x1d3/0x260
[63837.687479]  [<ffffffff810a4ad2>] ? default_wake_function+0x12/0x20
[63837.687485]  [<ffffffff810bcddd>] ? __wake_up_common+0x4d/0x80
[63837.687490]  [<ffffffff810eea31>] ? ktime_get+0x41/0xb0
[63837.687495]  [<ffffffff815c2bb4>] io_schedule_timeout+0xa4/0x110
[63837.687506]  [<ffffffffa0135dda>] cv_wait_common+0xba/0x140 [spl]
[63837.687511]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[63837.687520]  [<ffffffffa0135eb8>] __cv_wait_io+0x18/0x20 [spl]
[63837.687555]  [<ffffffffa032dd5e>] zio_wait+0x10e/0x1f0 [zfs]
[63837.687596]  [<ffffffffa02b7148>] dsl_pool_sync+0xb8/0x480 [zfs]
[63837.687643]  [<ffffffffa02d2c2f>] spa_sync+0x37f/0xb40 [zfs]
[63837.687650]  [<ffffffff810a4ad2>] ? default_wake_function+0x12/0x20
[63837.687694]  [<ffffffffa02e4208>] txg_sync_thread+0x3b8/0x630 [zfs]
[63837.687731]  [<ffffffffa02e3e50>] ? txg_delay+0x180/0x180 [zfs]
[63837.687739]  [<ffffffffa0130ee1>] thread_generic_wrapper+0x71/0x80 [spl]
[63837.687746]  [<ffffffffa0130e70>] ? __thread_exit+0x20/0x20 [spl]
[63837.687753]  [<ffffffff81099998>] kthread+0xd8/0xf0
[63837.687760]  [<ffffffff815c73c2>] ret_from_fork+0x22/0x40
[63837.687766]  [<ffffffff810998c0>] ? kthread_worker_fn+0x170/0x170
[63837.687781] INFO: task zfs:16808 blocked for more than 120 seconds.
[63837.687838]       Tainted: P           O    4.6.3-1-ARCH #1
[63837.687887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63837.687955] zfs             D ffff88083ff439f0     0 16808   9252 0x00000000
[63837.687961]  ffff88083ff439f0 00ff88083ff439d0 ffff88084ff54c40 ffff88083ebb8000
[63837.687967]  ffff88083ff44000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[63837.687973]  0000000000000000 ffff88083ff43a08 ffffffff815c372c ffff880843fe72d8
[63837.687978] Call Trace:
[63837.687984]  [<ffffffff815c372c>] schedule+0x3c/0x90
[63837.687992]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[63837.687998]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[63837.688006]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[63837.688042]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[63837.688084]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[63837.688116]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[63837.688147]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[63837.688175]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[63837.688203]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[63837.688231]  [<ffffffffa029c103>] dmu_recv_end+0x93/0x1b0 [zfs]
[63837.688268]  [<ffffffffa030d614>] zfs_ioc_recv+0x4b4/0xbf0 [zfs]
[63837.688276]  [<ffffffffa01370c9>] ? tsd_set+0x199/0x4f0 [spl]
[63837.688321]  [<ffffffffa02c8768>] ? rrw_exit+0x68/0x150 [zfs]
[63837.688366]  [<ffffffffa02df0ae>] ? spa_close+0xe/0x10 [zfs]
[63837.688406]  [<ffffffffa02b8399>] ? dsl_pool_rele+0x29/0x30 [zfs]
[63837.688413]  [<ffffffff811d419e>] ? __kmalloc+0x2e/0x230
[63837.688419]  [<ffffffffa012eb75>] ? strdup+0x45/0x70 [spl]
[63837.688456]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[63837.688463]  [<ffffffff811f634b>] ? __vfs_read+0xcb/0x100
[63837.688469]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[63837.688476]  [<ffffffff811f7176>] ? vfs_read+0x86/0x130
[63837.688481]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[63837.688488]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[63837.688493] INFO: task zfs:18831 blocked for more than 120 seconds.
[63837.688549]       Tainted: P           O    4.6.3-1-ARCH #1
[63837.688598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63837.688665] zfs             D ffff8807aa32bbc0     0 18831  16007 0x00000000
[63837.688671]  ffff8807aa32bbc0 0000000000000000 ffff88084ff53d00 ffff88083ed82dc0
[63837.688677]  ffff8807aa32c000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[63837.688683]  0000000000000000 ffff8807aa32bbd8 ffffffff815c372c ffff880843fe72d8
[63837.688688] Call Trace:
[63837.688693]  [<ffffffff815c372c>] schedule+0x3c/0x90
[63837.688701]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[63837.688706]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[63837.688714]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[63837.688757]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[63837.688797]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[63837.688830]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[63837.688863]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[63837.688896]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[63837.688928]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[63837.688961]  [<ffffffffa033d211>] dsl_destroy_snapshots_nvl+0x91/0x100 [zfs]
[63837.688994]  [<ffffffffa033d2bf>] dsl_destroy_snapshot+0x3f/0x60 [zfs]
[63837.689030]  [<ffffffffa030e547>] zfs_ioc_destroy+0x37/0x50 [zfs]
[63837.689062]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[63837.689069]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[63837.689073]  [<ffffffff815c2fb9>] ? __schedule+0x399/0xad0
[63837.689079]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[63837.689086]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[64077.676215] INFO: task txg_sync:1618 blocked for more than 120 seconds.
[64077.676286]       Tainted: P           O    4.6.3-1-ARCH #1
[64077.676336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64077.676405] txg_sync        D ffff880841e97aa8     0  1618      2 0x00000000
[64077.676415]  ffff880841e97aa8 00ffffff810a3c2d ffff88084ff53d00 ffff8808420a1e80
[64077.676421]  ffff880841e98000 ffff880873d15940 7fffffffffffffff ffff8800763a87b0
[64077.676427]  0000000000000001 ffff880841e97ac0 ffffffff815c372c 0000000000000000
[64077.676433] Call Trace:
[64077.676445]  [<ffffffff815c372c>] schedule+0x3c/0x90
[64077.676451]  [<ffffffff815c6203>] schedule_timeout+0x1d3/0x260
[64077.676460]  [<ffffffff810a4ad2>] ? default_wake_function+0x12/0x20
[64077.676465]  [<ffffffff810bcddd>] ? __wake_up_common+0x4d/0x80
[64077.676470]  [<ffffffff810eea31>] ? ktime_get+0x41/0xb0
[64077.676476]  [<ffffffff815c2bb4>] io_schedule_timeout+0xa4/0x110
[64077.676487]  [<ffffffffa0135dda>] cv_wait_common+0xba/0x140 [spl]
[64077.676492]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[64077.676500]  [<ffffffffa0135eb8>] __cv_wait_io+0x18/0x20 [spl]
[64077.676535]  [<ffffffffa032dd5e>] zio_wait+0x10e/0x1f0 [zfs]
[64077.676575]  [<ffffffffa02b7148>] dsl_pool_sync+0xb8/0x480 [zfs]
[64077.676621]  [<ffffffffa02d2c2f>] spa_sync+0x37f/0xb40 [zfs]
[64077.676628]  [<ffffffff810a4ad2>] ? default_wake_function+0x12/0x20
[64077.676672]  [<ffffffffa02e4208>] txg_sync_thread+0x3b8/0x630 [zfs]
[64077.676709]  [<ffffffffa02e3e50>] ? txg_delay+0x180/0x180 [zfs]
[64077.676718]  [<ffffffffa0130ee1>] thread_generic_wrapper+0x71/0x80 [spl]
[64077.676725]  [<ffffffffa0130e70>] ? __thread_exit+0x20/0x20 [spl]
[64077.676732]  [<ffffffff81099998>] kthread+0xd8/0xf0
[64077.676739]  [<ffffffff815c73c2>] ret_from_fork+0x22/0x40
[64077.676745]  [<ffffffff810998c0>] ? kthread_worker_fn+0x170/0x170
[64077.676760] INFO: task zfs:16808 blocked for more than 120 seconds.
[64077.676816]       Tainted: P           O    4.6.3-1-ARCH #1
[64077.676865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64077.676933] zfs             D ffff88083ff439f0     0 16808   9252 0x00000000
[64077.676939]  ffff88083ff439f0 00ff88083ff439d0 ffffffff8180d500 ffff88083ebb8000
[64077.676945]  ffff88083ff44000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[64077.676951]  0000000000000000 ffff88083ff43a08 ffffffff815c372c ffff880843fe72d8
[64077.676956] Call Trace:
[64077.676962]  [<ffffffff815c372c>] schedule+0x3c/0x90
[64077.676970]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[64077.676975]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[64077.676983]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[64077.677020]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[64077.677062]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[64077.677095]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[64077.677125]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[64077.677154]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[64077.677181]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[64077.677210]  [<ffffffffa029c103>] dmu_recv_end+0x93/0x1b0 [zfs]
[64077.677247]  [<ffffffffa030d614>] zfs_ioc_recv+0x4b4/0xbf0 [zfs]
[64077.677255]  [<ffffffffa01370c9>] ? tsd_set+0x199/0x4f0 [spl]
[64077.677300]  [<ffffffffa02c8768>] ? rrw_exit+0x68/0x150 [zfs]
[64077.677343]  [<ffffffffa02df0ae>] ? spa_close+0xe/0x10 [zfs]
[64077.677382]  [<ffffffffa02b8399>] ? dsl_pool_rele+0x29/0x30 [zfs]
[64077.677390]  [<ffffffff811d419e>] ? __kmalloc+0x2e/0x230
[64077.677396]  [<ffffffffa012eb75>] ? strdup+0x45/0x70 [spl]
[64077.677433]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[64077.677440]  [<ffffffff811f634b>] ? __vfs_read+0xcb/0x100
[64077.677446]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[64077.677453]  [<ffffffff811f7176>] ? vfs_read+0x86/0x130
[64077.677458]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[64077.677465]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[64077.677471] INFO: task zfs:18831 blocked for more than 120 seconds.
[64077.677527]       Tainted: P           O    4.6.3-1-ARCH #1
[64077.677576] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64077.677643] zfs             D ffff8807aa32bbc0     0 18831  16007 0x00000000
[64077.677649]  ffff8807aa32bbc0 0000000000000000 ffff88084ff54c40 ffff88083ed82dc0
[64077.677655]  ffff8807aa32c000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[64077.677661]  0000000000000000 ffff8807aa32bbd8 ffffffff815c372c ffff880843fe72d8
[64077.677666] Call Trace:
[64077.677671]  [<ffffffff815c372c>] schedule+0x3c/0x90
[64077.677679]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[64077.677684]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[64077.677692]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[64077.677734]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[64077.677776]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[64077.677809]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[64077.677842]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[64077.677875]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[64077.677908]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[64077.677941]  [<ffffffffa033d211>] dsl_destroy_snapshots_nvl+0x91/0x100 [zfs]
[64077.677974]  [<ffffffffa033d2bf>] dsl_destroy_snapshot+0x3f/0x60 [zfs]
[64077.678009]  [<ffffffffa030e547>] zfs_ioc_destroy+0x37/0x50 [zfs]
[64077.678042]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[64077.678048]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[64077.678053]  [<ffffffff815c2fb9>] ? __schedule+0x399/0xad0
[64077.678059]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[64077.678065]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[64677.649034] INFO: task zfs:16808 blocked for more than 120 seconds.
[64677.649091]       Tainted: P           O    4.6.3-1-ARCH #1
[64677.649133] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64677.649194] zfs             D ffff88083ff439f0     0 16808   9252 0x00000000
[64677.649201]  ffff88083ff439f0 00ff88083ff439d0 ffff88084ff52dc0 ffff88083ebb8000
[64677.649205]  ffff88083ff44000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[64677.649208]  0000000000000000 ffff88083ff43a08 ffffffff815c372c ffff880843fe72d8
[64677.649212] Call Trace:
[64677.649220]  [<ffffffff815c372c>] schedule+0x3c/0x90
[64677.649228]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[64677.649232]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[64677.649238]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[64677.649268]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[64677.649294]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[64677.649314]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[64677.649333]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[64677.649350]  [<ffffffffa029a0b0>] ? dmu_recv_end_sync+0x3f0/0x3f0 [zfs]
[64677.649367]  [<ffffffffa0299cc0>] ? restore_write+0x1d0/0x1d0 [zfs]
[64677.649384]  [<ffffffffa029c103>] dmu_recv_end+0x93/0x1b0 [zfs]
[64677.649407]  [<ffffffffa030d614>] zfs_ioc_recv+0x4b4/0xbf0 [zfs]
[64677.649412]  [<ffffffffa01370c9>] ? tsd_set+0x199/0x4f0 [spl]
[64677.649440]  [<ffffffffa02c8768>] ? rrw_exit+0x68/0x150 [zfs]
[64677.649467]  [<ffffffffa02df0ae>] ? spa_close+0xe/0x10 [zfs]
[64677.649492]  [<ffffffffa02b8399>] ? dsl_pool_rele+0x29/0x30 [zfs]
[64677.649497]  [<ffffffff811d419e>] ? __kmalloc+0x2e/0x230
[64677.649502]  [<ffffffffa012eb75>] ? strdup+0x45/0x70 [spl]
[64677.649524]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[64677.649528]  [<ffffffff811f634b>] ? __vfs_read+0xcb/0x100
[64677.649532]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[64677.649536]  [<ffffffff811f7176>] ? vfs_read+0x86/0x130
[64677.649539]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[64677.649544]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[64677.649548] INFO: task zfs:18831 blocked for more than 120 seconds.
[64677.649597]       Tainted: P           O    4.6.3-1-ARCH #1
[64677.649639] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64677.649699] zfs             D ffff8807aa32bbc0     0 18831  16007 0x00000000
[64677.649703]  ffff8807aa32bbc0 0000000000000000 ffff88084ff54c40 ffff88083ed82dc0
[64677.649707]  ffff8807aa32c000 ffff880843fe7220 ffff880843fe72e0 ffff880843fe7248
[64677.649710]  0000000000000000 ffff8807aa32bbd8 ffffffff815c372c ffff880843fe72d8
[64677.649714] Call Trace:
[64677.649717]  [<ffffffff815c372c>] schedule+0x3c/0x90
[64677.649722]  [<ffffffffa0135e29>] cv_wait_common+0x109/0x140 [spl]
[64677.649726]  [<ffffffff810bd490>] ? wake_atomic_t_function+0x60/0x60
[64677.649731]  [<ffffffffa0135e75>] __cv_wait+0x15/0x20 [spl]
[64677.649757]  [<ffffffffa02e39b5>] txg_wait_synced+0xe5/0x130 [zfs]
[64677.649782]  [<ffffffffa02bf194>] dsl_sync_task+0x184/0x270 [zfs]
[64677.649803]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[64677.649823]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[64677.649843]  [<ffffffffa033d040>] ? dsl_destroy_snapshot_check_impl+0x90/0x90 [zfs]
[64677.649864]  [<ffffffffa033c1f0>] ? dsl_destroy_snapshot_sync_impl+0xb60/0xb60 [zfs]
[64677.649884]  [<ffffffffa033d211>] dsl_destroy_snapshots_nvl+0x91/0x100 [zfs]
[64677.649904]  [<ffffffffa033d2bf>] dsl_destroy_snapshot+0x3f/0x60 [zfs]
[64677.649926]  [<ffffffffa030e547>] zfs_ioc_destroy+0x37/0x50 [zfs]
[64677.649946]  [<ffffffffa030c290>] zfsdev_ioctl+0x460/0x4f0 [zfs]
[64677.649950]  [<ffffffff81209be3>] do_vfs_ioctl+0xa3/0x5d0
[64677.649953]  [<ffffffff815c2fb9>] ? __schedule+0x399/0xad0
[64677.649956]  [<ffffffff8120a189>] SyS_ioctl+0x79/0x90
[64677.649960]  [<ffffffff815c71b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[97401.943039] Process accounting resumed
[123180.950179] ata2.00: exception Emask 0x10 SAct 0xc0000 SErr 0x40000 action 0x6 frozen
[123180.950243] ata2.00: irq_stat 0x08000000, interface fatal error
[123180.950287] ata2: SError: { CommWake }
[123180.950316] ata2.00: failed command: READ FPDMA QUEUED
[123180.950355] ata2.00: cmd 60/01:90:60:87:80/00:00:4d:00:00/40 tag 18 ncq 512 in
                         res c0/00:01:f1:fa:e6/00:00:4e:00:00/40 Emask 0x12 (ATA bus error)
[123180.950468] ata2.00: status: { Busy }
[123180.950496] ata2.00: failed command: WRITE FPDMA QUEUED
[123180.950536] ata2.00: cmd 61/0c:98:c0:6c:bc/00:00:4e:00:00/40 tag 19 ncq 6144 out
                         res c0/00:01:f1:fa:e6/00:00:4e:00:00/40 Emask 0x12 (ATA bus error)
[123180.950650] ata2.00: status: { Busy }
[123180.950679] ata2: hard resetting link
[123181.259628] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[123181.461200] ata2.00: configured for UDMA/133
[123181.461292] sd 1:0:0:0: [sdb] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[123181.461294] sd 1:0:0:0: [sdb] tag#18 Sense Key : 0xb [current] [descriptor] 
[123181.461295] sd 1:0:0:0: [sdb] tag#18 ASC=0x47 ASCQ=0x0 
[123181.461297] sd 1:0:0:0: [sdb] tag#18 CDB: opcode=0x88 88 00 00 00 00 00 4d 80 87 60 00 00 00 01 00 00
[123181.461299] blk_update_request: I/O error, dev sdb, sector 1300268896
[123181.461372] sd 1:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[123181.461374] sd 1:0:0:0: [sdb] tag#19 Sense Key : 0xb [current] [descriptor] 
[123181.461375] sd 1:0:0:0: [sdb] tag#19 ASC=0x47 ASCQ=0x0 
[123181.461377] sd 1:0:0:0: [sdb] tag#19 CDB: opcode=0x8a 8a 00 00 00 00 00 4e bc 6c c0 00 00 00 0c 00 00
[123181.461378] blk_update_request: I/O error, dev sdb, sector 1320971456
[123181.461429] ata2: EH complete

Looks like there was an I/O error, but ZFS didn't count it as one... Strange!

This can be explained by the following comment: https://github.com/zfsonlinux/zfs/issues/1256#issuecomment-317913624

If you're reading a file and hit a read error, the data is reconstructed from the other drives and rewritten to the bad sectors, but the read error counter is not incremented since it "self-healed". However, unlike a checksum error, a self-healed read error will generate an I/O event in zpool events.

Your zpool events shows:

zio_err = 0x5 (EIO)
zio_flags = 0xc0880 (ZIO_FLAG_CANFAIL|ZIO_FLAG_DONT_CACHE|ZIO_FLAG_OPTIONAL|ZIO_FLAG_DONT_QUEUE)
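
For reference, zio_err is an ordinary errno value, so 0x5 is simply EIO. A minimal sketch to decode it, using nothing beyond the standard C library:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int zio_err = 0x5;  /* value taken from the event dump above */

        /* On Linux this prints: zio_err 5 (EIO): Input/output error */
        printf("zio_err %d (EIO): %s\n", zio_err, strerror(zio_err));
        return 0;
    }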

These ZIOs without ZIO_FLAG_IO_RETRY don't bump the vdev stats; this is the reason your zpool status is "clear".

    vdev_stat_update(zio_t *zio, uint64_t psize)
    ...
        /*
         * If this is an I/O error that is going to be retried, then ignore the
         * error.  Otherwise, the user may interpret B_FAILFAST I/O errors as
         * hard errors, when in reality they can happen for any number of
         * innocuous reasons (bus resets, MPxIO link failure, etc).
         */
        if (zio->io_error == EIO &&
            !(zio->io_flags & ZIO_FLAG_IO_RETRY)) {
                return;
        }
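
For context, further down in the same function is where the per-vdev counters that feed the READ/WRITE/CKSUM columns of zpool status get bumped, and the early return above skips all of it. An abridged sketch of that counter path, paraphrased from the vdev.c of that era (not verbatim; names may differ slightly):

    if (zio->io_error != 0) {
            mutex_enter(&vd->vdev_stat_lock);
            if (zio->io_type == ZIO_TYPE_READ && !vdev_is_dead(vd)) {
                    if (zio->io_error == ECKSUM)
                            vs->vs_checksum_errors++;   /* CKSUM column */
                    else
                            vs->vs_read_errors++;       /* READ column */
            }
            if (zio->io_type == ZIO_TYPE_WRITE && !vdev_is_dead(vd))
                    vs->vs_write_errors++;              /* WRITE column */
            mutex_exit(&vd->vdev_stat_lock);
    }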

Not an issue; feel free to join the discussion on the zfs-devel mailing list if you have some thoughts on the matter.

So the title is wrong; there was a reason zed sent a mail.

Since this isn't considered a bug anymore, I'd like to ask for a reopening:

A disk read error is a read error of a disk and should be displayed as such; for unrecoverable read errors there is already a line per raidz/pool to show them at that logical level.

I would never have thought that zpool status would not show such read errors, even corrected ones, at the device level. This clearly breaks users' expectations and should be changed.

@behlendorf what are your thoughts about that?

@RubenKelevra like I said, this is being debated on the zfs-devel mailing list, which is better suited for this kind of discussion.

You may find some interesting work already being done here: http://list.zfsonlinux.org/pipermail/zfs-devel/2017-August/000132.html

@RubenKelevra I'm actively working on a fix for this. I'll assign this bug to myself.

@tonyhutter thanks Tony, highly appreciate it!

Yes, I appreciate it too! Many thanks.

I've got a fix out for review (https://github.com/zfsonlinux/zfs/pull/7817).
