Zfs: Feature request: Detect and warn slow disk performance

Created on 19 Nov 2017 · 11 comments · Source: openzfs/zfs

Wasn't really sure how to bring this up. Short version: assuming it's not already possible, it would be quite helpful if ZFS detected when a disk or vdev dropped below some consistent performance metric, such as access time.

Slightly longer version, and why this came up: I have run into cases where individual disks silently fail, both soft and hard. Vendors of course have their "NAS" versions, but I've run into this WITH THOSE DISKS. I've also seen Samsung SSDs (the 840 and 850, for example) start to crawl due to well-known firmware bugs, especially on the 840s. Regardless of firmware or power-management bugs, though, the underlying issue is vendor neutral.

This really came to light last week when I had two 10TB Seagate IronWolf Pro drives in a three-way mirror go bad within 4 hours. There were no SMART errors and certainly no errors bubbling up to the kernel. Users didn't really notice anything, as most of their data is hot in either the cache (ARC) or L2ARC. Despite the number of systems that should have caught it, ZFS only did so through checksums (something I'll be forever thankful for). Yet testing those disks with a simple dd would have shown they were operating at under 10 MB/s.

Is any of this already in place outside individual implementations? I'm not really looking to replace vendors' analytics engines, for example, as they serve a different purpose.

Feature

All 11 comments

It seems tuning zfs_deadman_synctime_ms from zfs-module-parameters(5) and then checking "zfs.delay" events could be an option:

root@linux:~# cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms 
1000000
root@linux:~# echo 1000 > /sys/module/zfs/parameters/zfs_deadman_synctime_ms 
root@linux:~# truncate -s 64m /var/tmp/file-dev{1,2}
root@linux:~# zpool create testpool mirror /var/tmp/file-dev{1,2}
root@linux:~# systemctl start zfs-zed
root@linux:~# zpool events -c
cleared 21 events
root@linux:~# zinject -d /var/tmp/file-dev1 -D 2000:1 testpool
Added handler 1 with the following properties:
  pool: testpool
  vdev: 2f60a8ebf1df2a0b
root@linux:~# dd if=/dev/zero of=/testpool/zero &
[1] 964
root@linux:~# zpool events
TIME                           CLASS
root@linux:~# zpool events -f
TIME                           CLASS
Nov 19 2017 08:30:49.580000000 ereport.fs.zfs.delay
Nov 19 2017 08:30:59.596000000 ereport.fs.zfs.delay
Nov 19 2017 08:31:04.604000000 ereport.fs.zfs.delay
^C
root@linux:~# zpool events -v
TIME                           CLASS
Nov 19 2017 08:30:49.580000000 ereport.fs.zfs.delay
        class = "ereport.fs.zfs.delay"
        ena = 0x560b05ac1800001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xd697c679f6a9b106
                vdev = 0x2f60a8ebf1df2a0b
        (end detector)
        pool = "testpool"
        pool_guid = 0xd697c679f6a9b106
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0x2f60a8ebf1df2a0b
        vdev_type = "file"
        vdev_path = "/var/tmp/file-dev1"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x5593d01188
        vdev_delta_ts = 0xee97922d
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        parent_guid = 0xeb27ed7cbf2e8214
        parent_type = "mirror"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x0
        zio_flags = 0x40080c80
        zio_stage = 0x100000
        zio_pipeline = 0x1700000
        zio_delay = 0x551c9af394
        zio_timestamp = 0x54a5399e6b
        zio_delta = 0x0
        zio_offset = 0x45e000
        zio_size = 0x4000
        time = 0x5a113329 0x22921900 
        eid = 0x16

Interesting. It appears I have yet another problem: out-of-date manpages.

If I understand that correctly, there are two timers: zfs_deadman_checktime and zfs_deadman_synctime. I can understand not wanting to flood logs, but synctime defaults to 16 minutes (!??) before an I/O is treated as seriously hung. I'm not really sure what checktime does or how it differs; it defaults to 5 seconds? Guessing they apply to different types of requests.

AFAIK:

  • zfs_deadman_checktime_ms: how often the deadman timer checks for hung I/O.
  • zfs_deadman_synctime_ms: how long an I/O must be outstanding before it is considered "hung" (a quick way to inspect and lower these is shown after this list).
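
A minimal sketch of inspecting and lowering these, mirroring the transcript above (the 5000/1000000 values are the defaults shown there; the 60000, i.e. one minute, is just an example value):

root@linux:~# grep . /sys/module/zfs/parameters/zfs_deadman_*
/sys/module/zfs/parameters/zfs_deadman_checktime_ms:5000
/sys/module/zfs/parameters/zfs_deadman_enabled:1
/sys/module/zfs/parameters/zfs_deadman_synctime_ms:1000000
root@linux:~# echo 60000 > /sys/module/zfs/parameters/zfs_deadman_synctime_ms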

@h1z1 does this existing functionality address your use case?

Observer effect aside, one would think that if synctime kicked in, it would also check for some condition first :) "zfs_deadman_checktime_ms" is also a fairly recent addition.

It almost does; the cases I've run into aren't necessarily purely about response time, though. The power-management case in particular would likely never hit that threshold unless the drain rate to the disk hit zero. At least ioping showed no indication. Had there not been an actual data error, I don't know if a scrub would have picked it up. Does it have any concept of a performance baseline?

This is absolutely the kind of thing SMART monitoring should be alerting on, if vendors didn't lie. I can also see something like it being useful for virtual disks (iSCSI LUNs, LUKS-backed devices, etc.).

@h1z1 as of the 0.7 release you can use the zpool iostat -l latency option to see the current IO latency for your devices. Usually this will let you identify a particularly slow drive which is dragging down the pool performance.
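
For example (the pool name and sampling interval are just placeholders), the per-vdev latency columns can be watched every few seconds with:

root@linux:~# zpool iostat -v -l testpool 5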

There's also the zfs.delay event which @loli10K pointed out. This was added to provide a mechanism by which ZFS could automatically degrade a slow drive and potentially replace it with a hot spare. Today the default threshold for generating these events is 30s and is controlled by the zio_delay_max module option.
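
A minimal sketch of reading and lowering it (30000 ms is the 30-second default mentioned above; 1000 is just an example value, after which delay events show up in zpool events as in the earlier transcript):

root@linux:~# cat /sys/module/zfs/parameters/zio_delay_max
30000
root@linux:~# echo 1000 > /sys/module/zfs/parameters/zio_delay_max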

Most of the functionality needed to implement this automatic replacement feature is now in place, and it should be pretty straightforward to implement if someone has the time. This can be handled by the ZED in the same way as IO and checksum events. Specifically, when a leaf vdev generates a sufficient number of these events within a given time span, the ZED would degrade the leaf vdev if there is sufficient redundancy and conditionally kick in a hot spare. We might want to add a new property to control this behavior, since depending on how the pool is being used you may not want this action to be taken automatically.
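
Until something like that lands, here is a very rough ZEDLET-style sketch of the idea, not the built-in mechanism: the script name, threshold, state-file location, and offline action are all invented for illustration, and it relies on the ZEVENT_* environment variables ZED passes to zedlets.

#!/bin/sh
# Hypothetical /etc/zfs/zed.d/delay-degrade.sh (illustration only).
# Counts ereport.fs.zfs.delay events per vdev and, once a made-up threshold
# is crossed, takes the slow device offline so redundancy/hot spares take over.
THRESHOLD=10
STATE="/run/zed-delay-${ZEVENT_POOL}-${ZEVENT_VDEV_GUID}"

count=$(cat "$STATE" 2>/dev/null || echo 0)
count=$((count + 1))
echo "$count" > "$STATE"

if [ "$count" -ge "$THRESHOLD" ] && [ -n "$ZEVENT_VDEV_PATH" ]; then
        logger -t zed "delay threshold reached for ${ZEVENT_VDEV_PATH}; offlining it"
        zpool offline "$ZEVENT_POOL" "$ZEVENT_VDEV_PATH"
fi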

@behlendorf - Thank you! That little variable is awesome. I'm less concerned with automatically replacing volumes than with having no visibility at all.

I can kind of understand why the default time was chosen based on larger arrays with massive amounts of memory, but 30 seconds for IO completion is quite high. Dropping it down to 1 second has already yielded results.

I'm still not quite sure what the difference between those timers is, given that despite dropping zfs_deadman_synctime_ms to 2 seconds last week, it has not turned up any alerts.

# grep . zfs_deadman_* zio_delay_max 
zfs_deadman_checktime_ms:5000
zfs_deadman_enabled:1
zfs_deadman_synctime_ms:2000
zio_delay_max:600
# 

Just to clarify, even zio_delay_max appears to be strictly a latency-based metric. Still very handy, but it would not have caught the case above where the disks seemingly dropped into a low-RPM mode or something.
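
A crude throughput check along the lines of the "simple dd" mentioned at the top could catch that case; a sketch follows, with the device names as placeholders and reads done with the page cache bypassed:

for dev in /dev/sda /dev/sdb /dev/sdc; do
        echo "== $dev =="
        # dd prints its MB/s summary on stderr; a healthy disk should be well
        # above the ~10 MB/s the failing drives showed.
        dd if="$dev" of=/dev/null bs=1M count=256 iflag=direct 2>&1 | tail -n 1
done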

30 seconds for IO completion is quite high.

Right, it was originally chosen to be much, much longer than any reasonable IO time in order to avoid false positives.

zio_delay_max appears to be strictly a latency based metric

Correct.

@h1z1 also consider looking at zpool iostat -pw. It will give you a histogram of all your IO latencies so you can see where the outliers are. The disk_wait column shows the actual disk latency.

$ zpool iostat -pw

jet1           total_wait         disk_wait        sync_queue        async_queue  
latency        read    write     read    write     read    write     read    write    scrub
----------  -------  -------  -------  -------  -------  -------  -------  -------  -------
1ns               0        0        0        0        0        0        0        0        0
3ns               0        0        0        0        0        0        0        0        0
7ns               0        0        0        0        0        0        0        0        0
15ns              0        0        0        0        0        0        0        0        0
31ns              0        0        0        0        0        0        0        0        0
63ns              0        0        0        0        0        0        0        0        0
127ns             0        0        0        0        0        0        0        0        0
255ns             0        0        0        0        0        0        0        0        0
511ns             0        0        0        0        0        0        0        0        0
1us               0        0        0        0        0        0        0        0        0
2us               0        0        0        0       82     1131       66    16212    15418
4us               0        0        0        0      105  2044365      462   154170   372666
8us               0        0        0        0       34  4728933       23    75928   125820
16us              0        0        0        0       18   339649       20    11413   146705
32us              0        0        5        0        2    67113       14     8032   124741
65us         450688        0   665490        7        1     3371       16     8586   164738
131us        396358      289   311365     3469        0     1153       40    13161   113396
262us        169918   100345    91884   419619        1      416       65    33246    39920
524us         60301  2524730    76717  3277262        6      160       84    71004    15836
1ms           23728   751333   228529   919054        7       31      143    68035    10027
2ms           14268  1328330   299614  1356003        5        2      260    60599     9625
4ms           15922  2539168   361206  2528189        3        0      507    44904    14463
8ms           26414   515342   503031   521229       15        0      970    28908    23988
16ms          45502    45465   251066    48396       14        0     1841    38381    42122
33ms          79156    70427    31303    19728        2        0     3890    67843    71302
67ms         132420   122118     4063     2130        0        0     7697   120104   123761
134ms        214680   203823      190     1001        0        0    15854   202739   194322
268ms        327119   295805        4     2023        0        0    29768   293487   294565
536ms        293854   452775        4     2782        0        0    36660   449295   251472
1s           243575   150913        0      345        0        0    34413   148500   207848
2s           264295      362        0        0        0        0    28446      353   235027
4s            56140       12        0        0        0        0    15901       12    39176
8s             8765        0        0        0        0        0     8192        0      541
17s            1367        0        0        0        0        0     1365        0        0
34s               0        0        0        0        0        0        0        0        0
68s               0        0        0        0        0        0        0        0        0
137s              0        0        0        0        0        0        0        0        0
-------------------------------------------------------------------------------------------

@tonyhutter Not on the 0.6.x series...

zpool iostat -pw
invalid option 'p'

I've got a PR to print the IO delays in a separate zpool status column here: https://github.com/zfsonlinux/zfs/pull/7756

Something to keep in mind: #4713
