Wasn't really sure how to bring this up. Short version: assuming it's not already possible, it would be quite helpful if ZFS detected when a disk or vdev fell below a consistent performance metric, such as access time.
Slightly longer version, and why this came up: I have run into cases where individual disks silently fail, both soft and hard. Vendors of course have their "NAS" versions, yet I've run into this WITH THOSE DISKS. I've also seen Samsung SSDs (the 840 and 850, for example) start to crawl due to well-known firmware bugs, especially on the 840s. Regardless of firmware or power-management bugs, though, the underlying issue is vendor neutral.
This really came to light last week when I had two 10TB Seagate IronWolf Pro drives in a three-way mirror go bad within 4 hours. There were no SMART errors and certainly no errors bubbling up to the kernel. Users didn't really notice anything, as most of their data is hot in either the ARC or L2ARC. Despite the number of systems that should have caught it, ZFS only did so through checksums (something I'll be forever thankful for). Yet testing those disks with a simple dd would have shown they were operating at under 10 MB/s.
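For concreteness, the kind of spot check I mean, with /dev/sdX standing in for the suspect disk:

# direct read of 256 MiB, bypassing the page cache; dd reports the throughput
dd if=/dev/sdX of=/dev/null bs=1M count=256 iflag=direct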
Is any of this already in place, outside individual implementations? I'm not really looking to replace vendors' analytics engines, for example, as they serve a different purpose.
It seems tuning zfs_deadman_synctime_ms from zfs-module-parameters(5) and then checking "zfs.delay" events could be an option:
root@linux:~# cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms
1000000
root@linux:~# echo 1000 > /sys/module/zfs/parameters/zfs_deadman_synctime_ms
root@linux:~# truncate -s 64m /var/tmp/file-dev{1,2}
root@linux:~# zpool create testpool mirror /var/tmp/file-dev{1,2}
root@linux:~# systemctl start zfs-zed
root@linux:~# zpool events -c
cleared 21 events
root@linux:~# zinject -d /var/tmp/file-dev1 -D 2000:1 testpool
Added handler 1 with the following properties:
pool: testpool
vdev: 2f60a8ebf1df2a0b
root@linux:~# dd if=/dev/zero of=/testpool/zero &
[1] 964
root@linux:~# zpool events
TIME CLASS
root@linux:~# zpool events -f
TIME CLASS
Nov 19 2017 08:30:49.580000000 ereport.fs.zfs.delay
Nov 19 2017 08:30:59.596000000 ereport.fs.zfs.delay
Nov 19 2017 08:31:04.604000000 ereport.fs.zfs.delay
^C
root@linux:~# zpool events -v
TIME CLASS
Nov 19 2017 08:30:49.580000000 ereport.fs.zfs.delay
class = "ereport.fs.zfs.delay"
ena = 0x560b05ac1800001
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0xd697c679f6a9b106
vdev = 0x2f60a8ebf1df2a0b
(end detector)
pool = "testpool"
pool_guid = 0xd697c679f6a9b106
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
vdev_guid = 0x2f60a8ebf1df2a0b
vdev_type = "file"
vdev_path = "/var/tmp/file-dev1"
vdev_ashift = 0x9
vdev_complete_ts = 0x5593d01188
vdev_delta_ts = 0xee97922d
vdev_read_errors = 0x0
vdev_write_errors = 0x0
vdev_cksum_errors = 0x0
parent_guid = 0xeb27ed7cbf2e8214
parent_type = "mirror"
vdev_spare_paths =
vdev_spare_guids =
zio_err = 0x0
zio_flags = 0x40080c80
zio_stage = 0x100000
zio_pipeline = 0x1700000
zio_delay = 0x551c9af394
zio_timestamp = 0x54a5399e6b
zio_delta = 0x0
zio_offset = 0x45e000
zio_size = 0x4000
time = 0x5a113329 0x22921900
eid = 0x16
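For what it's worth, echoing into /sys only lasts until reboot; to persist the deadman tuning you would typically set module options instead, e.g. (illustrative values, assuming a modprobe.d-based setup):

# /etc/modprobe.d/zfs.conf -- illustrative values, tune to your workload
options zfs zfs_deadman_synctime_ms=60000 zfs_deadman_checktime_ms=5000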
Interesting. Appears I have yet another problem: out-of-date manpages.
If I understand that correctly, there are two timers: zfs_deadman_checktime and zfs_deadman_synctime. I can understand not wanting to flood logs, but synctime defaults to 1,000,000 ms (over 16 minutes!?) before an I/O enters the more serious "hung" handling. I'm not really sure what checktime does or how it differs; it defaults to 5 seconds? Guessing they apply to different types of requests.
AFAIK:
- zfs_deadman_checktime_ms: defines the frequency at which we check for hung I/O.
- zfs_deadman_synctime_ms: determines when an I/O is considered "hung".

@h1z1 does this existing functionality address your use case?
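Spelled out, the interaction is roughly this (my paraphrase of the docs, not the actual kernel code):

# every zfs_deadman_checktime_ms (5000 ms by default):
#     scan the outstanding I/Os
#     flag any I/O pending longer than zfs_deadman_synctime_ms
#     (1000000 ms by default) as hung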
Observer effect aside, one would think that if synctime kicked off, it would also check for some condition first :) "zfs_deadman_checktime_ms" is also a fairly recent addition.
It almost does, though the cases I've run into aren't necessarily purely response time. The power-management one in particular would likely never hit that threshold unless the drain rate to the disk dropped to zero; at least ioping gave no indication. Had there not been an actual data error, I don't know if a scrub would have picked it up. Does it have any concept of a performance baseline?
This is absolutely the kind of thing SMART monitoring should be alerting on, if vendors didn't lie. I can also see something like it being useful for virtual disks (iSCSI LUNs, LUKS-backed devices, etc.).
@h1z1 as of the 0.7 release you can use the zpool iostat -l latency option to see the current I/O latency for your devices. Usually this will let you identify a particularly slow drive which is dragging down the pool's performance.
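For example, to watch per-vdev latencies refreshed every five seconds (with "tank" standing in for your pool):

zpool iostat -v -l tank 5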
There's also the zfs.delay event which @loli10K pointed out. This was added to provide a mechanism by which ZFS could automatically degrade a slow drive and potentially replace it with a hot spare. Today the default threshold for generating these events is 30s and is controlled by the zio_delay_max module option.
Most of the functionality needed to implement this automatic-replacement feature is now in place, and it should be pretty straightforward to implement if someone has the time. It can be handled by the ZED in the same way as I/O and checksum events: when a leaf vdev generates a sufficient number of these events within a given time span, the ZED degrades the leaf vdev, provided there is sufficient redundancy, and conditionally kicks in a hot spare. We might want to add a new property to control this behavior, since depending on how the pool is being used you may not want this action taken automatically.
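As a rough illustration of the ZED side (a sketch, not shipped code; the zedlet name, threshold, counter handling, and the offline action are all assumptions):

#!/bin/sh
# /etc/zfs/zed.d/delay-degrade.sh -- hypothetical zedlet sketch.
# ZED exports ZEVENT_* variables for each event it dispatches; here we
# count delay events per vdev and take the device out once a threshold
# is crossed.
[ "${ZEVENT_CLASS}" = "ereport.fs.zfs.delay" ] || exit 0

THRESHOLD=10                                   # assumed: events before acting
STATE="/run/zed-delay-${ZEVENT_VDEV_GUID}"     # crude per-vdev counter

COUNT=$(( $(cat "${STATE}" 2>/dev/null || echo 0) + 1 ))
echo "${COUNT}" > "${STATE}"

if [ "${COUNT}" -ge "${THRESHOLD}" ]; then
    # Only sensible with enough redundancy; attaching a hot spare in its
    # place would be a separate step.
    zpool offline "${ZEVENT_POOL}" "${ZEVENT_VDEV_PATH}"
fi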
@behlendorf - Thank you! That little variable is awesome. I'm less concerned with automatically replacing volumes than with having no visibility at all.
I can kind of understand why the default was chosen with larger arrays and massive amounts of memory in mind, but 30 seconds for I/O completion is quite high. Dropping it down to 1 second already yielded results.
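For reference, that's just the following, since zio_delay_max is expressed in milliseconds:

echo 1000 > /sys/module/zfs/parameters/zio_delay_max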
Still not quite sure what the difference between those timers is, given that despite dropping zfs_deadman_synctime_ms to 2 seconds last week, it has not turned up any alerts:
# grep . zfs_deadman_* zio_delay_max
zfs_deadman_checktime_ms:5000
zfs_deadman_enabled:1
zfs_deadman_synctime_ms:2000
zio_delay_max:600
#
Just to clarify: even zio_delay_max appears to be a strictly latency-based metric. Still very handy, but it would not have caught the case above where the disks almost dropped into a low-RPM state or something.
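A crude way to cover the throughput side in the meantime is a periodic sequential-read check per member disk; everything below (the device glob, the floor, the alert) is a placeholder to adapt:

#!/bin/sh
# Hypothetical baseline check: read 64 MiB directly from each member disk
# (bypassing the page cache) and warn if throughput falls below a floor.
FLOOR_MBPS=20                                   # placeholder threshold
for dev in /dev/disk/by-id/ata-EXAMPLE-*; do    # placeholder device glob
    start=$(date +%s%N)
    dd if="$dev" of=/dev/null bs=1M count=64 iflag=direct 2>/dev/null
    end=$(date +%s%N)
    mbps=$(( 64 * 1000000000 / (end - start) )) # 64 MiB / elapsed ns -> MiB/s
    if [ "$mbps" -lt "$FLOOR_MBPS" ]; then
        logger -p daemon.warn "slow disk: ${dev} at ${mbps} MB/s"
    fi
done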
30 seconds for IO completion is quite high.
Right, it was originally chosen to be much, much longer than any reasonable I/O time in order to avoid false positives.
zio_delay_max appears to be strictly a latency based metric
Correct.
@h1z1 also consider looking at zpool iostat -pw. It will give you a histogram of all your IO latencies so you can see where the outliers are. The disk_wait column shows the actual disk latency.
$ zpool iostat -pw
jet1 total_wait disk_wait sync_queue async_queue
latency read write read write read write read write scrub
---------- ------- ------- ------- ------- ------- ------- ------- ------- -------
1ns 0 0 0 0 0 0 0 0 0
3ns 0 0 0 0 0 0 0 0 0
7ns 0 0 0 0 0 0 0 0 0
15ns 0 0 0 0 0 0 0 0 0
31ns 0 0 0 0 0 0 0 0 0
63ns 0 0 0 0 0 0 0 0 0
127ns 0 0 0 0 0 0 0 0 0
255ns 0 0 0 0 0 0 0 0 0
511ns 0 0 0 0 0 0 0 0 0
1us 0 0 0 0 0 0 0 0 0
2us 0 0 0 0 82 1131 66 16212 15418
4us 0 0 0 0 105 2044365 462 154170 372666
8us 0 0 0 0 34 4728933 23 75928 125820
16us 0 0 0 0 18 339649 20 11413 146705
32us 0 0 5 0 2 67113 14 8032 124741
65us 450688 0 665490 7 1 3371 16 8586 164738
131us 396358 289 311365 3469 0 1153 40 13161 113396
262us 169918 100345 91884 419619 1 416 65 33246 39920
524us 60301 2524730 76717 3277262 6 160 84 71004 15836
1ms 23728 751333 228529 919054 7 31 143 68035 10027
2ms 14268 1328330 299614 1356003 5 2 260 60599 9625
4ms 15922 2539168 361206 2528189 3 0 507 44904 14463
8ms 26414 515342 503031 521229 15 0 970 28908 23988
16ms 45502 45465 251066 48396 14 0 1841 38381 42122
33ms 79156 70427 31303 19728 2 0 3890 67843 71302
67ms 132420 122118 4063 2130 0 0 7697 120104 123761
134ms 214680 203823 190 1001 0 0 15854 202739 194322
268ms 327119 295805 4 2023 0 0 29768 293487 294565
536ms 293854 452775 4 2782 0 0 36660 449295 251472
1s 243575 150913 0 345 0 0 34413 148500 207848
2s 264295 362 0 0 0 0 28446 353 235027
4s 56140 12 0 0 0 0 15901 12 39176
8s 8765 0 0 0 0 0 8192 0 541
17s 1367 0 0 0 0 0 1365 0 0
34s 0 0 0 0 0 0 0 0 0
68s 0 0 0 0 0 0 0 0 0
137s 0 0 0 0 0 0 0 0 0
-------------------------------------------------------------------------------------------
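If you want something scriptable on top of that, a rough tally of disk_wait reads landing in the one-second-and-up buckets (column positions assumed from the header above, pool name as in the example):

zpool iostat -pw jet1 | awk '
    $1 ~ /^(1s|2s|4s|8s|17s|34s|68s|137s)$/ { slow += $4 }  # $4 = disk_wait read
    END { printf "%d reads took 1s or longer at the disk\n", slow }'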
@tonyhutter Not on the 0.6.x series...
zpool iostat -pw
invalid option 'p'
I've got a PR to print the IO delays in a separate zpool status column here: https://github.com/zfsonlinux/zfs/pull/7756
Something to keep in mind: #4713