Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 8.1
Linux Kernel | 4.18.0-147.5.1.el8_1.x86_64
Architecture | x86_64
ZFS Version | 0.8.3-1
SPL Version | 0.8.3-1
The new scrub code heavily impacts application IO performance when used with HDD-based pools. Application IOPs are reduced by up to a factor of 10.
On a test pool of 4x SAS 15k 300 GB disks, which can provide ~250 IOPs for 4K single-thread sync random reads (as measured by fio), starting a scrub degrades application 4K random reads to 20-60 IOPs (i.e. 4-10x lower random read speed).
The older ZFS 0.7.x release had a zfs_scrub_delay parameter which could be used to limit how much scrub "conflicts" with other read/write operations, but this parameter is gone in the new 0.8.x release. The rationale is that management of the different IO classes should be done exclusively via ZIO scheduler tuning, adjusting the relative weights via the *_max_active tunables, but I can't see any meaningful difference even when setting zfs_vdev_scrub_max_active=1 and zfs_vdev_sync_read_max_active=1000.
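For reference, these tunables can be adjusted at runtime through the ZFS module parameter interface; a minimal sketch (the values are simply the ones mentioned above, not recommendations):

```
# adjust the relative weight of the scrub vs. sync-read IO classes at runtime
echo 1    > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
echo 1000 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
```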
I think the problem is due to the new scrub code batching reads into very large blocks, leading to a long queue depth on the scrub queue and, finally, on the vdev queue. Indeed, the new scrub code is very fast (reading at 400-500 MB/s on that test array), but this leads to poor random IOPs delivered to the (test) application.
While a faster scrub is great, we need a method to limit its impact on production pools (even if this means a longer scrub time).
To reproduce:
1. Run `fio --name=test --filename=/tank/test.img --rw=randread --size=32G` and look at the current IOPs
2. Start `zpool scrub tank`
3. Run `fio` again and compare the IOPs

NOTE: using a 128k random read (matching the dataset recordsize) will not change the IOPs numbers (except that the raw throughput value is higher).
After some more investigation, I found that the very low scrub performance was not directly related to the new zfs scan mode, but due to the interaction of:
1. the mq-scheduler IO sched (rather than none);
2. zfs_scrub_delay and zfs_scan_idle (now removed).

If the first point was my fault (well, I set it to noop, but that is not valid anymore on CentOS 8; none must be used instead), the second one (the lack of scrub throttling) is a real concern: it generally means that, even when setting zfs_vdev_scrub_max_active=1, single-threaded / low queue depth applications running on HDD pools will face a ~50% reduction in random IO speed.
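For reference, a minimal sketch of checking and switching the scheduler on a blk-mq kernel (sda stands in for each pool member disk):

```
# the active scheduler is shown in brackets; blk-mq kernels offer "none" rather than "noop"
cat /sys/block/sda/queue/scheduler
echo none > /sys/block/sda/queue/scheduler
```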
Let's see the output of zpool iostat -q 1 during concurrent fio and zpool scrub runs:
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
tank 108G 972G 214 0 144M 0 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 222 0 137M 0 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 206 59 130M 855K 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 206 0 140M 0 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 206 0 140M 0 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 203 0 137M 0 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 195 0 123M 0 0 1 0 0 0 0 0 0 24 1 0 0
tank 108G 972G 213 51 133M 855K 0 1 0 0 0 0 0 0 23 1 0 0
tank 108G 972G 217 0 148M 0 0 1 0 0 0 0 0 0 24 1 0 0
scrubq_read always has 1 request active/issued, with no throttling. On rotational media this means the seek rate effectively doubles, halving application performance for random reads.
While I really like the new scrub/resilver performance, I think we need an "escape hatch" to throttle scrubbing when application IO should be affected as little as possible.
An update: I considered restoring some form of delay, taking it from 0.7.x branch. However, dsl_scan.c and the scrub approach as a whole are sufficiently different that I am not sure this would be reasonable, much less accepted.
I found that limiting zfs_scan_vdev_limit (in addition to zfs_vdev_scrub_max_active) can reduce scrub impact on low queue depth random reads. Moreover, and more importantly, multi-threaded random reads (i.e. higher queue depth reads) are much less impacted by scrub overhead (as zfs_vdev_scrub_max_active vs zfs_vdev_sync_read_max_active is, by default, 2 vs 10).
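For anyone wanting to make this kind of tuning persistent across reboots, a sketch using the standard modprobe.d mechanism (the values are examples only, not recommendations):

```
# /etc/modprobe.d/zfs.conf (hypothetical example)
options zfs zfs_vdev_scrub_max_active=1 zfs_scan_vdev_limit=131072
```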
Finally, a scrub can be stopped/paused during work hours.
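For example, a cron sketch that pauses a running scrub during work hours and resumes it in the evening (assuming a pool named tank; note that zpool scrub -p fails if no scrub is in progress, while a plain zpool scrub resumes a paused one):

```
# /etc/cron.d/zfs-scrub-pause (hypothetical)
0 8  * * 1-5  root  /sbin/zpool scrub -p tank
0 20 * * 1-5  root  /sbin/zpool scrub tank
```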
@behlendorf feel free to close the ticket. I am not closing it now only because I don't know if you (or other maintainers) want to track the problem described above. Thanks.
@behlendorf @ahrens (I do not remember who contributed the sequential scrub code, please feel free to add the right person)
I would like to add another datapoint. Short summary: scrub so heavily impacts performance that VMs sometimes see 0 (zero) read IOPs. This is a small pool with 4x 2TB HDD + 2x L2ARC SSD + 1x NVMe SLOG and a running scrub:
[root@localhost parameters]# zpool iostat -q 1 -v
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ
-------------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
tank 1.52T 2.11T 1.33K 16 515M 311K 0 1 0 0 16 3 0 0 242 8 0 0
mirror 776G 1.05T 681 0 253M 0 0 0 0 0 0 0 0 0 66 4 0 0
pci-0000:02:00.1-ata-1.0 - - 381 0 133M 0 0 0 0 0 0 0 0 0 0 2 0 0
pci-0000:02:00.1-ata-2.0 - - 300 0 120M 0 0 0 0 0 0 0 0 0 66 2 0 0
mirror 777G 1.05T 679 0 262M 0 0 1 0 0 16 3 0 0 176 4 0 0
pci-0000:02:00.1-ata-5.0 - - 342 0 131M 0 0 1 0 0 0 0 0 0 69 2 0 0
pci-0000:02:00.1-ata-6.0 - - 337 0 131M 0 0 0 0 0 16 3 0 0 107 2 0 0
logs - - - - - - - - - - - - - - - - - -
nvme0n1 94.5M 26.9G 0 16 0 311K 0 0 0 0 0 0 0 0 0 0 0 0
cache - - - - - - - - - - - - - - - - - -
pci-0000:02:00.1-ata-3.0-part6 11.1G 245G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
pci-0000:02:00.1-ata-4.0-part6 11.5G 245G 0 51 0 5.16M 0 0 0 0 0 0 0 0 0 0 0 0
-------------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
[root@localhost parameters]# iostat -x -k 1
avg-cpu: %user %nice %system %iowait %steal %idle
1.76 0.00 3.39 41.50 0.00 53.35
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 20.00 0.00 816.00 81.60 0.01 0.30 0.00 0.30 0.10 0.20
sda 0.00 0.00 376.00 0.00 72480.00 0.00 385.53 6.19 6.36 6.36 0.00 2.66 100.00
sdb 0.00 0.00 390.00 0.00 70576.00 0.00 361.93 4.95 16.39 16.39 0.00 2.56 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 56.00 0.00 5684.00 203.00 0.02 0.43 0.00 0.43 0.21 1.20
sdf 0.00 0.00 440.00 0.00 123236.00 0.00 560.16 6.76 18.85 18.85 0.00 2.27 100.00
sdg 0.00 0.00 437.00 0.00 125008.00 0.00 572.12 6.72 6.24 6.24 0.00 2.29 100.00
md127 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md126 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Please note how the HDDs are overwhelmed by pending ZFS scrub requests: while the scrub itself is very fast, it completely saturates the HDDs, with very bad resulting performance for the running VMs. Setting zfs_scan_vdev_limit to 128K and lowering zfs_vdev_scrub_max_active only slightly lessens the problem, while fiddling with zfs_no_scrub_prefetch and zfs_scrub_min_time_ms seems to have no effect at all.
Any idea on what can be done to further decrease scrub load?
Well, I made an interesting discovery: setting sd[abfg]/device/queue_depth=1 (effectively disabling NCQ) solved the VM stalling problem. I can confirm with fio --rw=randread that no 0 (or very low) IOPs are recorded anymore.
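For reference, the workaround amounts to something like the following (a sketch; the device names are simply my pool members):

```
# limit the per-device command queue to 1, effectively disabling NCQ
for d in sda sdb sdf sdg; do
    echo 1 > /sys/block/$d/device/queue_depth
done
```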
I got curious and tested a disk (WD Gold 2 TB) in isolation. I can replicate the issue by concurrently running the following two fio commands:
fio --name=test --filename=/dev/sda --direct=1 --rw=read #sequential read
fio --name=test --filename=/dev/sda --direct=1 --rw=randread #random read
While the first fio consumed almost all IOPs, the second one was mostly stalled. In short, it seems that the new scrub code, which is much more sequential than the old behavior, causes some disks (WD Gold in this case) to stall random read requests. I suppose this is due to over-aggressive read-ahead enabled by "seeing" multiple concurrent requests (setting sda/device/queue_depth=2, i.e. using a minimal amount of NCQ, gives the same stalling outcome), but the exact cause is probably not so important. The old scrub code, with its more random IO pattern, did not expose the problem.
As a side note, an older WD Green did not show any issue.
I am leaving this issue open for some days only because I don't know if someone wants to comment and/or share other relevant experiences. Anyway, feel free to close it.
Thanks.
That's really interesting. It definitely sounds like an issue with the WD Gold drives, and it's not something I would have expected from an enterprise branded drive. You might want to check if there's a firmware update available. Thanks for posting a solution for anyone else who may encounter this.
I was able to reproduce this on a different model of Western Digital hard drives: WD Red 10 TB (WD100EFAX). I am using 6 of these drives in a zpool made of 3 mirrors. See #10535.
My experience closely matches @shodanshok's: following the steps to reproduce in his original post, with default settings (scheduler=none, queue_depth=32), I get roughly 150-180 IOPs in fio, falling to a measly trickle of about 10 IOPs when a scrub is ongoing. But if I set queue_depth=1, then I get about 60-100 IOPs - a huge improvement. So thank you, @shodanshok, for getting to the bottom of this issue! Your workaround seems to work quite well. In fact, I get the impression that I'm getting better performance with queue_depth=1 during normal operation even when a scrub is not running (about 200-250 IOPs).
Now, if only Western Digital could fix their firmware… stalling all random reads when sequential reads are inflight sounds pretty bad. One can easily imagine such behaviour causing problems with production services becoming unresponsive just because some random user decided to scan the contents of a file.
Is it time for the ZFS wiki or related documentation to make a "known bad" list of drives/firmwares that have been definitively identified as interacting badly with ZFS?
@gdevenyi Rather than a list (which will become outdated pretty fast), I suggest inserting a note in the hardware/performance page stating that if excessive performance degradation is observed during scrub, disabling NCQ is a possible workaround (maybe even linking to this issue).
I was actually planning to add a "Queue depth" section on the Performance tuning OpenZFS wiki page to describe this problem, but that page doesn't seem to have open edit access.
…and for reference, I used the following udev rule to automatically apply the workaround to all my affected disks:
DRIVER=="sd", ATTR{model}=="WDC WD100EFAX-68", ATTR{queue_depth}="1"
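To apply the rule to a running system without rebooting, something like the following should work (the exact udevadm invocation may vary by distribution):

```
udevadm control --reload
udevadm trigger --subsystem-match=block
```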
This has long been a behaviour seen by HDDs, with some firmware better than others. You might find queue_depth=2 works better, but higher queue depths are worse. For some background, see http://blog.richardelling.com/2012/03/iops-and-latency-are-not-related-hdd.html
-- richard
@richardelling Unfortunately, for the specific case of WD Gold disks (and I suppose @dechamps' WD Red too), using any queue depth over 1 causes the read starvation issue described above.
With WD Gold disks, disabling the disk scheduler (using noop) resolves this issue; I don't need to set queue_depth=1. However, if I use any other disk scheduler than noop, zpool scrub will cause IO starvation.
To clarify, in my case, /sys/class/block/sd*/queue/scheduler was [none] from the very beginning, so clearly that didn't help with my WD Red WD100EFAX. Only setting queue_depth to 1 fixed the issue.
@misterbigstuff In my case, using noop or none made no difference to the IOPS recorded during a scrub. Limiting queue_depth to 1 was the only solution, matching @dechamps' experience.
Notably, I'm using the SATA revision of these devices, which has different firmware from the SAS counterpart.
@misterbigstuff Interesting: I also have multiple SATA WD Gold drives, but they show the described issue unless I set queue_depth=1, irrespective of the IO scheduler (which is consistent with the fio tests which, by using direct=1 and issuing a single IO at a time, should be unaffected by the scheduler). Maybe some newer firmware fixed it? For reference, here are my disk details:
Model Family: Western Digital Gold
Device Model: WDC WD2005FBYZ-01YCBB2
Serial Number: WD-XXX
LU WWN Device Id: 5 0014ee 0af18d58f
Firmware Version: RR07
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Can you share your disk model/firmware version? Did you try the reproducer involving concurrently running these two fio commands? Can you post the results of the tests below?
fio --name=test --filename=/dev/sda --direct=1 --rw=read #sequential read
fio --name=test --filename=/dev/sda --direct=1 --rw=randread #random read
@misterbigstuff
i'm using the SATA revision of these devices
I am also using SATA, so that shouldn't make a difference.
Here are the details of one of my drives:
Model Family: Western Digital Red
Device Model: WDC WD100EFAX-68LHPN0
Serial Number: JEJEGWKM
LU WWN Device Id: 5 000cca 267e24fb9
Firmware Version: 83.H0A83
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jul 14 10:12:02 2020 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
OS details: Debian Unstable/Sid, Linux 5.7.0-1, ZFS/SPL 0.8.4-1.
@misterbigstuff One thing that might be different in your case is that you might be running an older Linux Kernel - you keep mentioning the noop scheduler, but in modern kernels with mq, that scheduler is called none:
$ cat /sys/block/sda/queue/scheduler
[none] mq-deadline
I am also having this issue: zpool scrub seems to run without any throttle at all, thus impacting IO latency and overall system load.
With zfs 0.7.* I throttled scrubs with these parameters:
zfs_scrub_delay=60
zfs_top_maxinflight=6
zfs_scan_idle=150
This slowed down scrubs without impacting the application (much).
With zfs 0.8 these parameters do not exist anymore. I've been reading the zfs module parameters man page and began playing around with these parameters, but I am unable to slow down the scrub at all:
zfs_vdev_scrub_max_active
zfs_scan_strict_mem_lim
zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact
zfs_scan_vdev_limit
zfs_scrub_min_time_ms
zfs_no_scrub_prefetch
I also made sure that the system parameters for queue depth and IO scheduler are set as seen above:
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/device/queue_depth ; done | sort | uniq
1
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/queue/scheduler ; done | sort | uniq
[none] mq-deadline
System configuration:
dell md3060e enclosure
sas hba
12 raidz1 pools of 5 nl-sas hdds (4TB) (manufacturer toshiba, seagate, hgst)
os: debian buster
zfs version: 0.8.4-1~bpo10+1
kernel version: 4.19.0-9-amd64
Graphs from the Prometheus node exporter (I did stop the scrub after some time):
https://user-images.githubusercontent.com/29410350/89903397-e2f72f00-dbe7-11ea-9e79-312406462f24.png
https://user-images.githubusercontent.com/29410350/89903469-f86c5900-dbe7-11ea-8dd0-db06605b6759.png
I could use some help on how to proceed with this. Which other parameters might help in decreasing the scrub speed? What else can I try?
At Delphix, we have investigated reducing the impact of scrub by having it run at a reduced i/o rate. Several years back, one of our interns prototyped this. It would be wonderful if we took this discussion as motivation to complete that work with a goal of having scrub on by default in more deployments of ZFS! If anyone is interested in working on that, I can dig up the design documents and any code.
@wildente from the graphs you posted, it seems the pools had almost no load excluding the scrub itself. Did you scrub all your pools at the same time? Can you set zfs_vdev_scrub_max_active=1 and run the following fio command on both an idle and a scrubbing pool?
fio --name=test --filename=/yourpool/test.img --rw=randread --size=32G
@ahrens excluding bad interactions with hardware queues, setting zfs_vdev_scrub_max_active=1 should let scrub "only" eat 50% of the available IOPs. Do you think a simple rule such as "if any other queue has one or more active/pending IOs, skip scrubbing for some msec" could be useful (similar to how 0.7.x throttled scrubs)? Thanks.
NB, zpool wait time is the time I/Os are not issued to physical devices. So if you have a scrub ongoing and zfs_vdev_scrub_max_active is small (default=2), then it is expected to see high wait time at the zpool level. To make this info useful, you'll need to look at the wait time per queue. See zpool iostat -l (though I'm not convinced zpool iostat -l is as advertised, but that is another discussion).
-- richard
I'm wondering (I might be totally off), but some pools we've recently created were created with a bad ashift (=9, when the drives in fact had 4k sectors). These were SSDs in both cases, but accessing the drives with 512B sectors absolutely destroyed any hint of performance the devices might have had. Recreating the pool with -o ashift=12 fixed it.
Could you, just to be sure, check the ashift? 1.3k IOPS from a pool with NVMe drives sounds like exactly the situation I'm describing :)
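For the check itself, something like the following should show the ashift actually in use (a sketch; tank is a placeholder pool name):

```
# per-vdev ashift from the cached pool configuration
zdb -C tank | grep ashift
# recent OpenZFS also exposes ashift as a pool property
zpool get ashift tank
```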
@shodanshok
setting zfs_vdev_scrub_max_active=1 should let scrub "only" eat 50% of the available IOPs
Assuming that all i/os are equal, yes. But the per-byte costs can cause scrub i/os to eat more than 50% of the available performance. I think that scrub i/os can aggregate up to 1MB (and are likely to, now that we have "sorted scrub"), whereas typical i/os might be smaller.
Do you think a simple rule such as "if any other queue has one or more active/pending IOs, skip scrubbing for some msec" could be useful (similar to how 0.7.x throttled scrubs)? Thanks.
I think it could be, if we do it right. For example, we might want finer granularity than whole milliseconds. And we'd want to consider both "metadata scanning" and "issuing scrub i/os" phases. Although maybe we could ignore (not limit) the metadata scanning for this purpose? A deliberate "slow scrub" feature might work by automatically adjusting this kind of knob.
@shodanshok
setting zfs_vdev_scrub_max_active=1 should let scrub "only" eat 50% of the available IOPs
Assuming that all i/os are equal, yes. But the per-byte costs can cause scrub i/os to eat more than 50% of the available performance. I think that scrub i/os can aggregate up to 1MB (and are likely to, now that we have "sorted scrub"), whereas typical i/os might be smaller.
True.
I think it could be, if we do it right. For example, we might want finer granularity than whole milliseconds. And we'd want to consider both "metadata scanning" and "issuing scrub i/os" phases. Although maybe we could ignore (not limit) the metadata scanning for this purpose? A deliberate "slow scrub" feature might work by automatically adjusting this kind of knob.
I suppose the metadata scan does not need special treatment. On the other hand, the data scrub phase, being sequential in nature, can really consume vast amounts of bandwidth (and IOPs).
Thank you for the overwhelming number of messages. I'll try to answer them all.
@ahrens: yes, some more information would be useful. I was using the parameters from my post to slow down the scrub so that it would finish within one week (zfs 0.7) instead of ~15 hours (zfs 0.8). I also agree that the weight of each IO request is relevant for this case, since we are now scrubbing sequentially in large blocks.
@shodanshok: maybe I should have posted graphs of the read/write ops instead of the read data rate. These kinds of servers mainly handle small random reads and write-appends (similar to the mbox format). I will set zfs_vdev_scrub_max_active=1 and run your fio test command tomorrow morning. And yes, all pools did scrub at the same time.
@richardelling: thanks. I always thought that this was the actual wait IO from the underlying disks.
@snajpa: in my case those pools were created about a year ago, on HDDs, and they were created with ashift=12.
@shodanshok I've set zfs_vdev_scrub_max_active=1 and ran the fio command on one of the 12 zpools:
# zpool iostat zpool1-store1 -l 1
[snip]
zpool1-store1 11,5T 6,60T 313 0 6,94M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 312 0 7,13M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 301 0 6,94M 0 15ms - 15ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 335 0 7,87M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 324 0 7,24M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 300 0 6,62M 0 16ms - 16ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 330 0 7,44M 0 15ms - 15ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 307 0 6,88M 0 16ms - 16ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 314 0 6,92M 0 15ms - 15ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 325 0 7,50M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 315 0 6,85M 0 15ms - 15ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 310 0 6,85M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 142 275 4,29M 2,13M 20ms 143ms 20ms 26ms 2us - - 117ms 3us -
zpool1-store1 11,5T 6,60T 1,29K 21 91,1M 87,7K 5ms 68ms 3ms 38ms 2us 2us - 402ms 1ms -
zpool1-store1 11,5T 6,60T 1,65K 0 115M 0 4ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,46K 0 102M 0 5ms - 3ms - 2us - - - 1ms -
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
pool alloc free read write read write read write read write read write read write wait wait
------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
zpool1-store1 11,5T 6,60T 1,35K 0 97,5M 0 5ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,47K 0 110M 0 5ms - 3ms - 3us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,59K 0 113M 0 4ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,50K 0 108M 0 4ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,43K 0 104M 0 5ms - 3ms - 3us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,51K 0 105M 0 4ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,20K 0 80,7M 0 5ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,25K 0 85,1M 0 5ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,13K 0 77,7M 0 6ms - 4ms - 3us - - - 2ms -
zpool1-store1 11,5T 6,60T 1,10K 0 77,5M 0 6ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,30K 0 90,1M 0 5ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,08K 0 75,0M 0 6ms - 4ms - 2us - - - 2ms -
[snip]
zpool1-store1 11,5T 6,60T 1,14K 0 79,4M 0 6ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,13K 0 79,9M 0 5ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 957 0 66,6M 0 6ms - 4ms - 2us - - - 2ms -
zpool1-store1 11,5T 6,60T 1,20K 0 89,7M 0 5ms - 4ms - 3us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,03K 0 70,3M 0 6ms - 4ms - 2us - - - 2ms -
zpool1-store1 11,5T 6,60T 1,19K 0 80,0M 0 6ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,25K 0 87,8M 0 5ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,71K 0 121M 0 4ms - 3ms - 3us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,29K 13 91,5M 55,8K 5ms 66ms 3ms 57ms 2us - - 11ms 1ms -
zpool1-store1 11,5T 6,60T 155 327 4,64M 2,09M 19ms 167ms 19ms 25ms 2us 2us - 150ms 3us -
zpool1-store1 11,5T 6,60T 306 0 7,10M 0 15ms - 15ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 317 0 7,31M 0 14ms - 14ms - 2us - - - 2us -
zpool1-store1 11,5T 6,60T 304 31 6,89M 136K 15ms 45ms 15ms 28ms 2us - - 15ms 2us -
zpool1-store1 11,5T 6,60T 479 252 30,4M 2,01M 9ms 159ms 7ms 27ms 3us 2us - 141ms 2ms -
zpool1-store1 11,5T 6,60T 1,50K 0 107M 0 4ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,28K 0 90,7M 0 5ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,39K 0 99,0M 0 5ms - 3ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,20K 0 79,9M 0 5ms - 4ms - 2us - - - 1ms -
zpool1-store1 11,5T 6,60T 1,11K 0 79,7M 0 6ms - 4ms - 2us - - - 2ms -
Before the start of the scrub, we have ~300-330 read ops; after the start of the scrub, it jumps to 1k-1.7k read ops. I am guessing the write operations in between are checkpoints.
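If it helps to confirm that guess, the per-pool txg kstat exposed by ZFS on Linux should show when transaction groups sync and how much they write (a sketch; the path assumes the /proc/spl kstat interface and that zfs_txg_history is non-zero):

```
# most recent transaction groups for the pool, including dirty/written bytes per txg
tail -n 20 /proc/spl/kstat/zfs/zpool1-store1/txgs
```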
# zpool status zpool1-store1 | grep pool: -A4
pool: zpool1-store1
state: ONLINE
scan: scrub in progress since Thu Aug 13 09:09:05 2020
580G scanned at 642M/s, 59,5G issued at 65,9M/s, 11,5T total
0B repaired, 0,50% done, 2 days 02:41:31 to go
# fio --name=test --filename=/zpool1-store1/test.img --rw=randread --size=32G
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
^Cbs: 1 (f=1): [r(1)][1.2%][r=256KiB/s][r=64 IOPS][eta 01d:06h:10m:44s]
fio: terminating on signal 2
test: (groupid=0, jobs=1): err= 0: pid=5417: Thu Aug 13 09:28:25 2020
read: IOPS=76, BW=305KiB/s (312kB/s)(401MiB/1345644msec)
clat (usec): min=2, max=514086, avg=13104.19, stdev=13722.49
lat (usec): min=2, max=514086, avg=13104.61, stdev=13722.55
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 26], 10.00th=[ 38], 20.00th=[ 41],
| 30.00th=[ 52], 40.00th=[ 8586], 50.00th=[ 13698], 60.00th=[ 17171],
| 70.00th=[ 20055], 80.00th=[ 24249], 90.00th=[ 28967], 95.00th=[ 32637],
| 99.00th=[ 39584], 99.50th=[ 44827], 99.90th=[ 68682], 99.95th=[ 99091],
| 99.99th=[421528]
bw ( KiB/s): min= 8, max=38312, per=100.00%, avg=305.07, stdev=930.11, samples=2691
iops : min= 2, max= 9578, avg=76.21, stdev=232.53, samples=2691
lat (usec) : 4=0.07%, 10=0.88%, 20=1.72%, 50=22.95%, 100=11.64%
lat (usec) : 250=0.06%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.03%, 10=5.91%, 20=26.39%, 50=30.03%
lat (msec) : 100=0.25%, 250=0.02%, 500=0.03%, 750=0.01%
cpu : usr=0.05%, sys=0.62%, ctx=64785, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=102656,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=305KiB/s (312kB/s), 305KiB/s-305KiB/s (312kB/s-312kB/s), io=401MiB (420MB), run=1345644-1345644msec
Can I provide anything else to help with this issue?
@wildente so during the scrub, fio shows 76 IOPs. What about re-running fio without a background scrub? How many IOPs do you have?
Your latency numbers seem ok. Can you show, both with and without a scrub running, the output of "zpool iostat -q" (to get queue stats)?
@shodanshok yes, I will do that on Monday morning
@shodanshok sorry for the long delay. I can reproduce the IOPS from above, but I think that is expected of a raidz with 5 drives.
# zpool iostat zpool1-store1 -q -l 1
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write read write read write read write read write wait wait pend activ pend activ pend activ pend activ pend activ pend activ
zpool1-store1 11,4T 6,69T 61 5 2,96M 50,9K 1s 110ms 7ms 21ms 11ms 2us 7ms 97ms 1s - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
[snip]
zpool1-store1 11,4T 6,69T 308 0 9,63M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 274 0 8,57M 0 11ms - 11ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 287 0 8,97M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 318 0 9,94M 0 10ms - 10ms - 2us - - - - - 0 1 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 288 0 9,00M 0 11ms - 11ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 301 0 9,41M 0 10ms - 10ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 294 0 9,19M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 322 0 10,1M 0 9ms - 9ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 303 0 9,47M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 295 0 9,22M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 315 0 9,85M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 303 0 9,47M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 307 0 9,60M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 303 0 9,47M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 333 0 10,4M 0 9ms - 9ms - 2us - - - - - 0 1 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 304 0 9,50M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 307 0 9,60M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 273 0 8,54M 0 11ms - 11ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 304 0 9,52M 0 10ms - 10ms - 2us - - - - - 0 1 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 321 0 10,0M 0 9ms - 9ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 282 0 8,82M 0 11ms - 11ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 308 0 9,63M 0 9ms - 9ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 300 0 9,38M 0 10ms - 10ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 277 0 8,66M 0 11ms - 11ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 306 0 9,56M 0 10ms - 10ms - 2us - - - - - 0 1 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 322 0 10,1M 0 9ms - 9ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 294 0 9,19M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 279 0 8,72M 0 11ms - 11ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 316 0 9,88M 0 9ms - 9ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 302 0 9,44M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 302 0 9,44M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 305 0 9,53M 0 9ms - 9ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 289 0 9,04M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 312 0 9,75M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 311 0 9,72M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 302 0 9,44M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 285 0 8,91M 0 10ms - 10ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 286 0 8,94M 0 10ms - 10ms - 2us - - - - - 0 3 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 286 0 8,94M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 303 0 9,47M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 303 0 9,47M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 289 0 9,04M 0 11ms - 11ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 281 0 8,79M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 303 0 9,47M 0 10ms - 10ms - 2us - - - - - 0 4 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 305 0 9,53M 0 10ms - 10ms - 2us - - - - - 0 2 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 93 0 2,93M 0 10ms - 10ms - 2us - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
zpool1-store1 11,4T 6,69T 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 0 0
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
^Cbs: 1 (f=1): [r(1)][0.1%][r=300KiB/s][r=75 IOPS][eta 01d:07h:21m:08s]
fio: terminating on signal 2
test: (groupid=0, jobs=1): err= 0: pid=2313: Mon Aug 24 13:31:35 2020
read: IOPS=74, BW=297KiB/s (305kB/s)(44.8MiB/154330msec)
clat (usec): min=8, max=68477, avg=13441.23, stdev=5411.49
lat (usec): min=8, max=68478, avg=13441.77, stdev=5411.49
clat percentiles (usec):
| 1.00th=[ 19], 5.00th=[ 6783], 10.00th=[ 8029], 20.00th=[ 9241],
| 30.00th=[10159], 40.00th=[11600], 50.00th=[13042], 60.00th=[15139],
| 70.00th=[16319], 80.00th=[17171], 90.00th=[17957], 95.00th=[20317],
| 99.00th=[35390], 99.50th=[36963], 99.90th=[39584], 99.95th=[43779],
| 99.99th=[47449]
bw ( KiB/s): min= 112, max= 368, per=100.00%, avg=297.31, stdev=42.17, samples=308
iops : min= 28, max= 92, avg=74.26, stdev=10.54, samples=308
lat (usec) : 10=0.03%, 20=1.94%, 50=0.01%, 100=0.02%
lat (msec) : 2=0.01%, 4=0.14%, 10=26.45%, 20=66.32%, 50=5.08%
lat (msec) : 100=0.01%
cpu : usr=0.08%, sys=0.85%, ctx=11528, majf=0, minf=9
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=11477,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=297KiB/s (305kB/s), 297KiB/s-297KiB/s (305kB/s-305kB/s), io=44.8MiB (47.0MB), run=154330-154330msec
the layout consists of 12 zpools, each configured as raidz1 with 5 drives:
# zpool status zpool1-store1 -P | grep config -A10
config:
NAME STATE READ WRITE CKSUM
zpool1-store1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
/dev/MD3060e/D01-S01-E01p1 ONLINE 0 0 0
/dev/MD3060e/D02-S01-E13p1 ONLINE 0 0 0
/dev/MD3060e/D03-S01-E25p1 ONLINE 0 0 0
/dev/MD3060e/D04-S01-E37p1 ONLINE 0 0 0
/dev/MD3060e/D05-S01-E49p1 ONLINE 0 0 0