ZFS: Subpar performance of RAIDZ, reads are slower than writes.

Created on 29 Sep 2019 · 13 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | Gentoo Linux
Distribution Version | amd64 (stable) 17.1/no-multilib
Linux Kernel | 4.19.72-gentoo
Architecture | x86_64
ZFS Version | 0.8.2 (same behavior on 0.8.1)
SPL Version | N/A

Describe the problem you're observing

Performance is unsatisfactory: I cannot saturate the HDDs' full bandwidth, and reads are less than 50% of write performance.

I have 8 ST4000VN008 drives. Six of them perform to specification, reading ~180MiB/s at the start of the disk; two under-perform at ~160MiB/s (should I RMA them?). Write performance is lower, at ~160MiB/s (the slow disks are slower in all tests).
The benchmark gives the same results when run on all disks in parallel.

These disks are configured as RAIDZ-2 with the following command:

zpool create -n -m /mnt/storage -o ashift=12 -o autoexpand=on -o autotrim=on \
-O acltype=posixacl -O atime=off -O compression=lz4 -O dedup=off -O dnodesize=auto \
-O encryption=aes-256-gcm -O keyformat=raw -O keylocation=file:///root/storage.key \
-O logbias=latency -O xattr=sa -O casesensitivity=sensitive storage raidz2 \
ata-ST4000VN008-ZDR166_ZGY5C3W7 ata-ST4000VN008-ZDR166_ZGY5E06J \
ata-ST4000VN008-ZDR166_ZDH7EMPY ata-ST4000VN008-ZDR166_ZDH7F08Z \
ata-ST4000VN008-ZDR166_ZDH7ESM1 ata-ST4000VN008-ZDR166_ZDH7FA7S \
ata-ST4000VN008-ZDR166_ZDH7F9P5 ata-ST4000VN008-ZDR166_ZDH7F9BN \
log nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN

(but with compression disabled for benchmarks).

When the filesystem created above is benchmarked using

dd if=/dev/zero of=zero bs=10M

then

zpool iostat -vl 10

displays the following data:

                                                 capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                           alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
storage                                        24.0G  29.1T      1  10.3K  4.80K   951M   15ms    5ms   15ms    4ms    3us    2us      -  614us      -      -
  raidz2                                       24.0G  29.1T      1  10.3K  4.80K   951M   15ms    5ms   15ms    4ms    3us    2us      -  614us      -      -
    ata-ST4000VN008-2DR166_ZGY5C3W7                -      -      0  1.32K    818   119M   37ms    4ms   37ms    4ms    3us    4us      -  507us      -      -
    ata-ST4000VN008-2DR166_ZGY5E06J                -      -      0  1.31K    409   119M   50ms    4ms   50ms    4ms    3us    1us      -  548us      -      -
    ata-ST4000VN008-2DR166_ZDH7EMPY                -      -      0  1.27K    409   119M  196us    5ms  196us    4ms    3us    2us      -  665us      -      -
    ata-ST4000VN008-2DR166_ZDH7F08Z                -      -      0  1.34K  1.20K   119M    4ms    4ms    4ms    4ms    3us    1us      -  487us      -      -
    ata-ST4000VN008-2DR166_ZDH7ESM1                -      -      0  1.31K    409   119M   50ms    4ms   50ms    4ms    3us    1us      -  545us      -      -
    ata-ST4000VN008-2DR166_ZDH7FA7S                -      -      0  1.27K    818   119M  196us    5ms  196us    4ms    3us    1us      -  635us      -      -
    ata-ST4000VN008-2DR166_ZDH7F9P5                -      -      0  1.24K    818   119M  393us    5ms  393us    5ms    3us    1us      -  730us      -      -
    ata-ST4000VN008-2DR166_ZDH7F9BN                -      -      0  1.22K      0   119M      -    6ms      -    5ms      -    1us      -  818us      -      -
logs                                               -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
  nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN      0   260G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Note: Sequential write peaks at 120MiB/s per HDD.

When reading

dd if=zero of=/dev/null bs=10M

then

zpool iostat -vl 10

displays the following data:

                                                 capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                           alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
storage                                        33.9G  29.1T  7.28K      0   362M      0    1ms      -    1ms      -    9ms      -  368us      -      -      -
  raidz2                                       33.9G  29.1T  7.28K      0   362M      0    1ms      -    1ms      -    9ms      -  368us      -      -      -
    ata-ST4000VN008-2DR166_ZGY5C3W7                -      -    943      0  44.0M      0    1ms      -    1ms      -    6ms      -  310us      -      -      -
    ata-ST4000VN008-2DR166_ZGY5E06J                -      -    964      0  44.8M      0    1ms      -    1ms      -    4ms      -  316us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7EMPY                -      -    946      0  45.0M      0    1ms      -    1ms      -    6ms      -  332us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7F08Z                -      -    911      0  46.2M      0    2ms      -    1ms      -   17ms      -  423us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7ESM1                -      -    928      0  46.6M      0    2ms      -    1ms      -    9ms      -  453us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7FA7S                -      -    887      0  44.3M      0    2ms      -    1ms      -    8ms      -  433us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7F9P5                -      -    893      0  45.3M      0    1ms      -    1ms      -    9ms      -  346us      -      -      -
    ata-ST4000VN008-2DR166_ZDH7F9BN                -      -    984      0  45.8M      0    1ms      -    1ms      -   13ms      -  339us      -      -      -
logs                                               -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
  nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN      0   260G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Note: Sequential read peaks at ~45MiB/s per HDD.
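For scale, a rough back-of-envelope for an 8-wide RAIDZ2 doing large sequential I/O (assuming throughput splits cleanly into 6 data + 2 parity disks; the per-disk rate is the ~119MiB/s observed in the write test above):

```shell
# Back-of-envelope: expected large-sequential throughput for 8-wide RAIDZ2.
# per_disk is the ~119MiB/s per-disk rate from the write test above.
per_disk=119
width=8
parity=2
raw=$(( per_disk * width ))              # all 8 disks streaming
data=$(( per_disk * (width - parity) ))  # user data after parity overhead
echo "raw: ${raw} MiB/s, user data: ${data} MiB/s"
```

This gives 952 MiB/s raw, matching the ~951M pool write bandwidth in the iostat output above; by the same logic the read side should be able to sustain far more than the observed ~45MiB/s per disk.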

Further experimentation revealed the following information:

  • Changing the vdev type from RAIDZ-2 to RAIDZ-1 doesn't impact performance.
  • Reducing the vdev width from 8 disks to 6 doesn't impact performance.
  • zfs_vdev_raidz_impl is avx512bw (determined to be the fastest). Changing it to scalar doesn't impact performance.
  • Disabling encryption and checksums doesn't impact performance (though encryption does visibly impact CPU utilization).
  • Changing dnodesize to legacy doesn't impact performance.
  • Changing recordsize to 1M slightly increases performance (~130MiB/s write, 50-55MiB/s read).
  • Disabling hyper-threading decreases performance (<100MiB/s write).
  • When the disks are configured as a stripe (RAID0 equivalent), they perform better (~140MiB/s write, ~80MiB/s read).
  • When the disks are split into two sets (the fastest 4 and the slowest 4) and configured as two striped pools, they perform slightly better than one striped pool of 8 disks. There is no performance difference between the "fast" and "slow" pools, though.
  • A pool created on a single disk performs reasonably close to the underlying raw disk.

My build is roughly:

  • Xeon 4208 (8x 2.1GHz, AES-NI, AVX2, AVX512f, AVX512dq, AVX512cd, AVX512bw, AVX512vl)
  • X11SPL-F (Chipset C621, 8x SATA 6Gb/s onboard)
  • ST4000VN008 8x (4TB, 5900 RPM, 180MiB/s)

Describe how to reproduce the problem

Copy-paste the above commands. Note that they reference HDDs by specific serial numbers, so you will need to adjust the device names.

Include any warning/errors/backtraces from the system logs


None/Not aware of any.

Performance

Most helpful comment

I would like to chime in. I've been running benchmarks on my system after running into performance issues too.

The system is an HPE ML10 with an Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz and 32 GB DDR4 ECC RAM.
6 x 4TB spinning disks, Hitachi Deskstar and Ultrastar, 7200rpm.

Before running the benchmarks, I installed a completely fresh Arch Linux system (kernel 5.6.15, zfs 0.8.4). I changed nothing in the config; I simply installed with basic, minimal settings and just the packages I needed to run the benchmarks and monitor performance.

I ran a full memtest86+ cycle to make sure the RAM is OK (it's ECC, but still).

To establish a baseline, I benchmarked each disk individually with 2048-aligned ext4 partitions, using fio in a loop with the following parameters. The test file is deleted and caches are dropped between runs:

  • Test file size: 64 GB (double the RAM)
  • Modes: read, randread, write, randwrite
  • Block sizes: 4K, 8K, 64K, 128K, 1M
  • Queue depth: 8
  • Jobs: 8
  • end_fsync: 1
  • ioengine: libaio
  • direct: 1
  • group_reporting: 1
  • ramp_time: 120 seconds
  • runtime: 500
  • time_based: 1 (makes fio run for the full 500 seconds whether the workload finishes or not)
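The parameter list above maps onto a fio command line roughly like this. The exact invocation wasn't posted, so the target path and job names here are my assumptions; this sketch prints the 20 commands rather than running them:

```shell
# Sketch: regenerate one fio invocation per mode/blocksize combination.
# TARGET is a placeholder path, not from the original comment.
TARGET=/mnt/test/fio.dat

gen_fio_cmds() {
  for mode in read randread write randwrite; do
    for bs in 4k 8k 64k 128k 1m; do
      echo "fio --name=${mode}-${bs} --filename=${TARGET}" \
           "--rw=${mode} --bs=${bs} --size=64g --iodepth=8 --numjobs=8" \
           "--end_fsync=1 --ioengine=libaio --direct=1 --group_reporting" \
           "--ramp_time=120 --runtime=500 --time_based"
    done
  done
}

gen_fio_cmds   # prints 20 command lines (4 modes x 5 block sizes)
```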

I can post the results if you like, but believe me when I say these numbers are consistent across the board and completely within reasonable expectations. They also match online test results.

I created a zfs pool as follows:

  • ashift=12
  • relatime=on
  • canmount=off
  • compression=lz4
  • xattr=sa
  • dnodesize=auto
  • acltype=posixacl
  • normalization=formD
  • raidz2

I created a zfs dataset for each of the following recordsizes: 4K, 8K, 64K, 128K, 1M. I then ran the fio loop with each blocksize on each dataset. That amounts to 20 tests per dataset (4 modes x 5 block sizes) and 100 tests total; at 15 minutes per run, that is 5 hours per dataset and 25 hours to complete.
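The per-recordsize datasets can be generated in a loop; a minimal sketch, assuming a pool named "tank" (the comment doesn't name the pool), printing the commands instead of running them:

```shell
# Sketch: one dataset per recordsize under test.
# "tank" is a placeholder pool name; the commands are echoed, not executed.
gen_datasets() {
  for rs in 4K 8K 64K 128K 1M; do
    echo "zfs create -o recordsize=${rs} tank/rs-${rs}"
  done
}
gen_datasets
```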

The random read numbers on the pool:
Mode: RANDREAD |   | RS4K | RS8K | RS64K | RS128K | RS1M
-- | -- | -- | -- | -- | -- | --
4K Rand | IOPS | 591 | 432 | 240 | 222 | 232
8K Rand | IOPS | 345 | 416 | 278 | 246 | 187
64K Rand | IOPS | 140 | 331 | 272 | 243 | 230
128K Rand | IOPS | 241 | 155 | 235 | 245 | 165
1M Rand | IOPS | 88 | 80 | 109 | 116 | 168
Averages |   | 281 | 283 | 227 | 214 | 196
4K Rand | MiB/s | 2.3 | 1.7 | 0.9 | 0.9 | 0.9
8K Rand | MiB/s | 2.7 | 3.3 | 2.2 | 1.9 | 1.5
64K Rand | MiB/s | 8.8 | 20.7 | 17.1 | 15.2 | 14.4
128K Rand | MiB/s | 30.2 | 19.5 | 29.4 | 30.7 | 20.7
1M Rand | MiB/s | 89.0 | 80.4 | 110.0 | 117.0 | 169.0
Averages |   | 26.6 | 25.1 | 31.9 | 33.1 | 41.3

Compared to the single-disk speeds, only 4K was faster (about twice as fast) on my pool. From 8K up it's pretty much single-disk speed, give or take here and there.

Random writes are a different story. Look at this:
Mode: RANDWRITE |   | RS4K | RS8K | RS64K | RS128K | RS1M
-- | -- | -- | -- | -- | -- | --
4K Rand | IOPS | 8905 | 5068 | 4829 | 5873 | 2200
8K Rand | IOPS | 6555 | 15500 | 2744 | 2819 | 2064
64K Rand | IOPS | 921 | 1813 | 3705 | 579 | 407
128K Rand | IOPS | 389 | 851 | 1794 | 2674 | 297
1M Rand | IOPS | 44 | 54 | 257 | 336 | 505
Averages |   | 3363 | 4657 | 2666 | 2456 | 1095
4K Rand | MiB/s | 34.8 | 19.8 | 18.9 | 22.9 | 8.6
8K Rand | MiB/s | 51.2 | 121.0 | 21.4 | 22.0 | 16.1
64K Rand | MiB/s | 57.5 | 113.0 | 232.0 | 26.2 | 25.5
128K Rand | MiB/s | 48.7 | 106.0 | 224.0 | 334.0 | 37.2
1M Rand | MiB/s | 44.3 | 54.6 | 258.0 | 336.0 | 505.0
Averages |   | 47.3 | 82.9 | 150.9 | 148.2 | 118.5

I don't know what to make of this. The IOPS are through the roof (unreal; each disk is capable of maybe 250-300 max?). The 4K and 8K MiB/s figures are also unrealistically high, but the rest seems decent and consistent with triple to quadruple single-disk speeds.

Again, I don't know what to make of this, but I would really like to find out whether I can get those random read speeds "up to speed", so to speak. I'm running the same tests on a striped NVMe pool of 3 SSDs (which are also turning out abnormally slow, while their single-disk speeds are reasonable). After that is done I can experiment with tuning performance parameters (if I know which ones).

All 13 comments

You should look at IOPS too; please show iostat -x 1 on your disks during the tests, for example. If %util is near 100%, you've got all the IOPS your disks can give. ZFS is a CoW filesystem, so on (nearly) each uncached read it also has to read metadata, and those reads will usually be random, even if you try to read logically sequential data. So the larger the recordsize, the better your sequential read/write.

And one more thing: sometimes one thread can't extract full pool performance. You may want to tune parameters for that, for example the prefetch distance: https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfetch_max_distance . It depends on your load.

So it looks like this is not a bug.

Numbers vary greatly with 1-second intervals (the workload seems bursty). With the interval set to 10 seconds they are:

For write:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   59.59    2.08    0.00   38.33

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme1n1          1.20    0.00     92.40      0.00    21.90     0.00  94.81   0.00    0.83    0.00   0.00    77.00     0.00   0.75   0.09
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.80 1326.40      0.80 120472.80     0.00     3.30   0.00   0.25   56.75    4.15   5.51     1.00    90.83   0.47  62.77
sde              0.80 1398.30      0.80 120432.00     0.00     2.10   0.00   0.15   67.25    3.55   4.96     1.00    86.13   0.42  58.29
sdf              0.70 1312.40      0.40 120686.40     0.00     2.20   0.00   0.17   76.71    4.32   5.64     0.57    91.96   0.49  63.87
sdg              0.90 1353.80      1.20 120470.80     0.00     2.70   0.00   0.20   68.11    4.11   5.57     1.33    88.99   0.47  63.77
sdh              0.60 1268.90      0.00 120447.20     0.00     2.60   0.00   0.20  104.33    5.01   6.36     0.00    94.92   0.55  70.17
sdi              0.90 1307.10      1.20 120485.20     0.00     2.60   0.00   0.20   53.78    4.31   5.64     1.33    92.18   0.49  64.03
sdb              0.90 1393.80      1.20 120481.20     0.00     2.90   0.00   0.21   58.33    3.64   5.08     1.33    86.44   0.42  59.01
sdc              0.70 1360.80      0.40 120433.20     0.00     2.10   0.00   0.15   55.71    3.89   5.31     0.57    88.50   0.45  61.39
dm-0            23.10    0.00     92.40      0.00     0.00     0.00   0.00   0.00    0.93    0.00   0.02     4.00     0.00   0.04   0.09
dm-1            23.10    0.00     92.40      0.00     0.00     0.00   0.00   0.00    0.93    0.00   0.02     4.00     0.00   0.04   0.09

for read:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   27.94    3.30    0.00   68.76

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme1n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd            835.40    0.00  42555.60      0.00     0.00     0.00   0.00   0.00    1.33    0.00   1.11    50.94     0.00   0.58  48.41
sde            896.60    0.00  42510.80      0.00     0.00     0.00   0.00   0.00    1.22    0.00   1.10    47.41     0.00   0.52  46.53
sdf            816.60    0.00  43789.60      0.00     0.00     0.00   0.00   0.00    1.62    0.00   1.31    53.62     0.00   0.69  55.98
sdg            877.50    0.00  41629.60      0.00     0.00     0.00   0.00   0.00    1.28    0.00   1.12    47.44     0.00   0.54  47.49
sdh            873.50    0.00  41980.80      0.00     0.10     0.00   0.01   0.00    1.32    0.00   1.15    48.06     0.00   0.56  48.86
sdi            850.10    0.00  43307.60      0.00     0.10     0.00   0.01   0.00    1.42    0.00   1.18    50.94     0.00   0.59  49.90
sdb            896.30    0.00  41473.20      0.00     0.10     0.00   0.01   0.00    1.09    0.00   0.98    46.27     0.00   0.48  43.36
sdc            898.10    0.00  41845.60      0.00     0.20     0.00   0.02   0.00    1.25    0.00   1.12    46.59     0.00   0.52  46.63
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

Tuning zfetch_max_distance (default value is 8MiB):

  • 24MiB gives 55-60 MiB/s per HDD
  • 80MiB gives 70-75 peaking to 90 MiB/s per HDD
  • 800MiB gives 130-140 MiB/s per HDD
  • 2400MiB gives 150-155 MiB/s with %util 90-95%
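Note that zfetch_max_distance is specified in bytes, so the MiB values above have to be converted before writing them to the module parameter. A small sketch of the conversion; the sysfs path in the comment is the standard OpenZFS location:

```shell
# zfetch_max_distance takes bytes; convert the MiB values used above.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

mib_to_bytes 8     # default (8 MiB)  -> 8388608
mib_to_bytes 800   #                  -> 838860800

# Applying it requires root and a loaded zfs module, e.g.:
# echo "$(mib_to_bytes 800)" > /sys/module/zfs/parameters/zfetch_max_distance
```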

Hi,
I'm experiencing the same behavior with my raidz1 pool: large-file read speed is only about 140MB/s, which equals the performance of a single disk. The system is under no load, with more than 10GB of RAM available.

System: Proxmox 6.0
Kernel: 5.0
CPU: Xeon E3-1285
RAM: 32GB 1600MHZ ECC
Disk: 3* WD RED 3TB 5400RPM
HBA: SAS 9305-24i
ZFS: 0.8.1

pool: Workspace
 state: ONLINE
  scan: scrub repaired 0B in 0 days 09:59:32 with 0 errors on Sun Sep  8 04:14:16 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        Workspace                   ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x50014ee65a70ec05  ONLINE       0     0     0
            wwn-0x50014ee2b6deadfc  ONLINE       0     0     0
            wwn-0x50014ee264864e38  ONLINE       0     0     0

errors: No known data errors

During a large-file read, as you can see, the disks are not fully loaded:
```
avg-cpu: %user %nice %system %iowait %steal %idle
1.13 0.00 3.65 43.32 0.00 51.89

Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sde 89.00 0.00 45568.00 0.00 0.00 0.00 0.00 0.00 34.58 0.00 2.89 512.00 0.00 5.44 48.40
sdf 91.00 0.00 46592.00 0.00 0.00 0.00 0.00 0.00 29.37 0.00 2.48 512.00 0.00 5.27 48.00
sdg 96.00 0.00 49152.00 0.00 0.00 0.00 0.00 0.00 40.56 0.00 3.70 512.00 0.00 5.33 51.20
```

And CPU usage is very low:

```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2493 mengsk 20 0 4259668 28548 9996 S 1.1 0.1 0:19.07 /usr/sbin/smbd --foreground --no-process-group
6352 root 20 0 4903636 47884 4968 S 1.0 0.1 708:21.85 /usr/bin/kvm -id 101 -name vsrvl -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,+
1936 root 39 19 0 0 0 S 0.1 0.0 25:09.83 [kipmi0]
3072 root 0 -20 0 0 0 S 0.1 0.0 0:32.87 [z_rd_int]
3074 root 0 -20 0 0 0 S 0.1 0.0 0:32.83 [z_rd_int]
3075 root 0 -20 0 0 0 S 0.1 0.0 0:33.00 [z_rd_int]
3076 root 0 -20 0 0 0 S 0.1 0.0 0:32.86 [z_rd_int]
5825 www-data 20 0 357536 112404 9472 S 0.1 0.3 0:03.35 pveproxy worker
6467 root 20 0 0 0 0 S 0.1 0.0 23:59.06 [vhost-6352]
10 root 20 0 0 0 0 I 0.0 0.0 1:14.74 [rcu_sched]
557 root 1 -19 0 0 0 S 0.0 0.0 0:54.80 [z_wr_iss]
558 root 1 -19 0 0 0 S 0.0 0.0 0:54.79 [z_wr_iss]
563 root 0 -20 0 0 0 S 0.0 0.0 0:28.06 [z_wr_int]
3071 root 0 -20 0 0 0 S 0.0 0.0 0:32.80 [z_rd_int]
3073 root 0 -20 0 0 0 S 0.0 0.0 0:32.89 [z_rd_int]
3077 root 0 -20 0 0 0 S 0.0 0.0 0:32.76 [z_rd_int]
3078 root 0 -20 0 0 0 S 0.0 0.0 0:32.83 [z_rd_int]
6325 root 20 0 325804 68868 6456 S 0.0 0.2 0:14.49 pve-ha-crm
6464 root 20 0 0 0 0 S 0.0 0.0 35:39.83 [vhost-6352]
6466 root 20 0 0 0 0 S 0.0 0.0 27:37.46 [vhost-6352]
13163 root 20 0 11720 3492 2500 R 0.0 0.0 0:00.13 top
1 root 20 0 170592 8028 4724 S 0.0 0.0 0:15.98 /sbin/init
2 root 20 0 0 0 0 S 0.0 0.0 1:36.34 [kthreadd]
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [rcu_gp]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [rcu_par_gp]
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/0:0H-kblockd]
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [mm_percpu_wq]
9 root 20 0 0 0 0 S 0.0 0.0 0:04.41 [ksoftirqd/0]
```

I'll tune zfetch_max_distance to see if it helps.

Returning after some more tuning and benchmarking.

In general I found that increasing:

  • zfetch_array_rd_sz
  • zfetch_max_distance
  • zfs_pd_bytes_max

to values around 1G increases single-HDD throughput to 120-140MiB/s, while decreasing IOPS at the same time. Going beyond that range seems difficult.
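To keep such settings across reboots, they can go into a modprobe options file. A sketch using the ~1G value from above; the file is written to a temporary path here for illustration, while the real location would be /etc/modprobe.d/:

```shell
# Sketch: persist the three prefetch tunables at 1 GiB via module options.
# Written to a temp file for illustration; the real file would live in
# /etc/modprobe.d/ and take effect on the next module load.
ONE_GIB=$(( 1024 * 1024 * 1024 ))
CONF="${TMPDIR:-/tmp}/zfs-prefetch.conf"

cat > "$CONF" <<EOF
options zfs zfetch_array_rd_sz=${ONE_GIB}
options zfs zfetch_max_distance=${ONE_GIB}
options zfs zfs_pd_bytes_max=${ONE_GIB}
EOF

cat "$CONF"
```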
Before taking a look at zio_taskq_batch_pct (which is less convenient to adjust), I ran some tests of a sync workload:

dd if=/dev/zero of=zero bs=10M count=5000 oflag=sync

The result is roughly 90MiB/s, which is also the throughput the Optane 900P SSD (the log device) delivers. iostat -mx 5 reveals that:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    6.65    0.01    0.00   93.33

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00  666.80      0.00     83.35     0.00     0.00   0.00   0.00    0.00    0.07   1.00     0.00   128.00   1.50  99.98
nvme1n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdg              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdh              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdi              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

Note the low IOPS combined with 100% utilization on the log device. zpool iostat -vyr 5 shows:

storage                                          sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      6      0      0      0     46      0      0      0      0      0
8K                                                 0      0      0      0      0      0      3     20      0      0      0      0
16K                                                0      0      0      0      0      0    497     23      0      0      0      0
32K                                                0      0      0      0      0      0      0    116      0      0      0      0
64K                                                0      0      0      0      0      0      0    218      0      0      0      0
128K                                               0      0    695      0      0      0      0    149      0      0      0      0
256K                                               0      0      0      0      0      0      0     66      0      0      0      0
512K                                               0      0      0      0      0      0      0    113      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


raidz2                                           sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      6      0      0      0     46      0      0      0      0      0
8K                                                 0      0      0      0      0      0      3     20      0      0      0      0
16K                                                0      0      0      0      0      0    497     23      0      0      0      0
32K                                                0      0      0      0      0      0      0    116      0      0      0      0
64K                                                0      0      0      0      0      0      0    218      0      0      0      0
128K                                               0      0      0      0      0      0      0    149      0      0      0      0
256K                                               0      0      0      0      0      0      0     66      0      0      0      0
512K                                               0      0      0      0      0      0      0    113      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZGY5C3W7                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      6      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     64      2      0      0      0      0
32K                                                0      0      0      0      0      0      0     16      0      0      0      0
64K                                                0      0      0      0      0      0      0     24      0      0      0      0
128K                                               0      0      0      0      0      0      0     20      0      0      0      0
256K                                               0      0      0      0      0      0      0      9      0      0      0      0
512K                                               0      0      0      0      0      0      0     12      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZGY5E06J                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     60      4      0      0      0      0
32K                                                0      0      0      0      0      0      0     13      0      0      0      0
64K                                                0      0      0      0      0      0      0     25      0      0      0      0
128K                                               0      0      0      0      0      0      0     18      0      0      0      0
256K                                               0      0      0      0      0      0      0      8      0      0      0      0
512K                                               0      0      0      0      0      0      0     13      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7EMPY                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     68      1      0      0      0      0
32K                                                0      0      0      0      0      0      0     17      0      0      0      0
64K                                                0      0      0      0      0      0      0     26      0      0      0      0
128K                                               0      0      0      0      0      0      0     20      0      0      0      0
256K                                               0      0      0      0      0      0      0      6      0      0      0      0
512K                                               0      0      0      0      0      0      0     14      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7F08Z                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     68      2      0      0      0      0
32K                                                0      0      0      0      0      0      0     13      0      0      0      0
64K                                                0      0      0      0      0      0      0     30      0      0      0      0
128K                                               0      0      0      0      0      0      0     19      0      0      0      0
256K                                               0      0      0      0      0      0      0      7      0      0      0      0
512K                                               0      0      0      0      0      0      0     13      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7ESM1                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      5      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      3      0      0      0      0
16K                                                0      0      0      0      0      0     66      2      0      0      0      0
32K                                                0      0      0      0      0      0      0     15      0      0      0      0
64K                                                0      0      0      0      0      0      0     33      0      0      0      0
128K                                               0      0      0      0      0      0      0     13      0      0      0      0
256K                                               0      0      0      0      0      0      0      9      0      0      0      0
512K                                               0      0      0      0      0      0      0     13      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7FA7S                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      8      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      1      0      0      0      0
16K                                                0      0      0      0      0      0     68      1      0      0      0      0
32K                                                0      0      0      0      0      0      0     11      0      0      0      0
64K                                                0      0      0      0      0      0      0     25      0      0      0      0
128K                                               0      0      0      0      0      0      0     19      0      0      0      0
256K                                               0      0      0      0      0      0      0      9      0      0      0      0
512K                                               0      0      0      0      0      0      0     14      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7F9P5                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      4      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     47      3      0      0      0      0
32K                                                0      0      0      0      0      0      0     14      0      0      0      0
64K                                                0      0      0      0      0      0      0     26      0      0      0      0
128K                                               0      0      0      0      0      0      0     20      0      0      0      0
256K                                               0      0      0      0      0      0      0      7      0      0      0      0
512K                                               0      0      0      0      0      0      0     14      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


ata-ST4000VN008-2DR166_ZDH7F9BN                  sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      3      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      2      0      0      0      0
16K                                                0      0      0      0      0      0     52      3      0      0      0      0
32K                                                0      0      0      0      0      0      0     13      0      0      0      0
64K                                                0      0      0      0      0      0      0     25      0      0      0      0
128K                                               0      0      0      0      0      0      0     16      0      0      0      0
256K                                               0      0      0      0      0      0      0      6      0      0      0      0
512K                                               0      0      0      0      0      0      0     15      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------


nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN      sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0      0      0      0      0      0      0      0      0      0      0
8K                                                 0      0      0      0      0      0      0      0      0      0      0      0
16K                                                0      0      0      0      0      0      0      0      0      0      0      0
32K                                                0      0      0      0      0      0      0      0      0      0      0      0
64K                                                0      0      0      0      0      0      0      0      0      0      0      0
128K                                               0      0    696      0      0      0      0      0      0      0      0      0
256K                                               0      0      0      0      0      0      0      0      0      0      0      0
512K                                               0      0      0      0      0      0      0      0      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------

So ZFS writes to the SLOG in 128K blocks. Let's benchmark that directly with

dd if=/dev/zero of=/dev/disk/by-id/nvme-INTEL_SSDPED1D280GA_PHMB7443018J280CGN bs=128K oflag=sync

I get a throughput of 1.1 GB/s and the following output from iostat -mx 5:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    2.62    3.36    0.00   93.99

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00 8605.40      0.00   1075.67     0.00 266769.60   0.00  96.88    0.00    0.08   1.00     0.00   128.00   0.12 100.00
nvme1n1          0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdg              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdh              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdi              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

As we can see, this SSD can handle far more 128K IOPS than ZFS issues to it, sustaining ~1075 MB/s of write throughput.

What causes ZFS to consume all the bandwidth of the SSD with just ~700 IOPS, when dd can issue more than 10x as many at the same queue length and utilization?
I guess the answer is in the wrqm (merged write requests) column. But stalling the SSD at 666.8 writes per second seems wrong, and I'm wondering if something is broken in my system.
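As a quick sanity check, the two measurements are internally consistent: bandwidth is simply IOPS times request size. Using the figures reported above (8605 w/s from iostat for dd, ~696 sync writes/s from the SLOG histogram):

```shell
# Bandwidth = IOPS * request size (128 KiB requests in both cases)
echo "dd:  $(( 8605 * 128 / 1024 )) MB/s"   # ~1075 MB/s, matching iostat's wMB/s
echo "zfs: $((  696 * 128 / 1024 )) MB/s"   # ~87 MB/s, the rate ZFS gets from the SLOG
```

So the open question is not the arithmetic but why ZFS saturates the device at an order of magnitude fewer 128K requests.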

After discovering these nice histograms, I ran another iteration of the write benchmark:

dd if=/dev/zero of=zero bs=10M count=5000

To keep it short:

storage                                          sync_read    sync_write    async_read    async_write      scrub         trim
req_size                                         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
---------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                                0      0      0      0      0      0      0      0      0      0      0      0
1K                                                 0      0      0      0      0      0      0      0      0      0      0      0
2K                                                 0      0      0      0      0      0      0      0      0      0      0      0
4K                                                 0      0     12      0      0      0    156      0      0      0      0      0
8K                                                 0      0      0      0      0      0     41      6      0      0      0      0
16K                                                0      0      0      0      0      0  6.65K    155      0      0      0      0
32K                                                0      0      0      0      0      0      0    668      0      0      0      0
64K                                                0      0      0      0      0      0      0    813      0      0      0      0
128K                                               0      0      0      0      0      0      0    637      0      0      0      0
256K                                               0      0      0      0      0      0      0    522      0      0      0      0
512K                                               0      0      0      0      0      0      0    492      0      0      0      0
1M                                                 0      0      0      0      0      0      0      0      0      0      0      0
2M                                                 0      0      0      0      0      0      0      0      0      0      0      0
4M                                                 0      0      0      0      0      0      0      0      0      0      0      0
8M                                                 0      0      0      0      0      0      0      0      0      0      0      0
16M                                                0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------------------------

I interpret it as follows (please correct me if I'm wrong): the ind column is a "pre-aggregation" count and agg is a "post-aggregation" count (so only the agg column reflects IOs as actually issued). It strikes me that the dd command above, with a block size of 10M, results in IOs of 16K; with the recordsize in play they should be considerably larger. I think this may be related to the inability to reach higher async-write throughput during the tests.
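For scale, a dataset left at the default 128K recordsize (none was set in the zpool create above, so the default applies) should split each 10M dd block into a modest number of full-size records, which makes the dominant 16K IO size surprising:

```shell
# Expected records per 10 MiB dd block at the default 128 KiB recordsize
echo "$(( 10 * 1024 / 128 )) records per 10M dd block"   # 80
```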

With that I sort of have an explanation for "reads are slower than writes". I still don't have one for "subpar performance of RAIDZ"; I suspect it is related to IO size. The SLOG issue is also worrying: I actually get much better bandwidth (>300 MiB/s) when no SLOG device is present.
I used to suspect a scheduling issue or some form of lock contention (suggested by a noticeable performance difference when hyper-threading was switched off), but I think the behavior observed on the SLOG goes beyond that.
Reports of raidz read performance trailing write performance are common on the internet. While some of those benchmarks contain mistakes (such as a 512-byte block size), there is data suggesting the cause lies in software configuration (e.g. https://www.reddit.com/r/zfs/comments/8pm7i0/8x_seagate_12tb_in_raidz2_poor_readwrite/).

I'm facing the exact same issue. Random writes with 1M BS can sustain around 1GB/s on 8x 8TB Ultrastar 7200 HDDs. Sequential write gets me up to around 1200MB/s.
Random read with 1M BS hovers 80-100MB/s. Sequential read gets me up to around 200MB/s, still far shy of the random/sequential write speeds.

I have noticed this as well, 0.8.2 on 4.15.14

Random read with 1M BS hovers 80-100MB/s. Sequential read gets me up to around 200MB/s, still far shy of the random/sequential write speeds.

What is wrong with that? That is around what iops a spinning disk can do.

Random read with 1M BS hovers 80-100MB/s. Sequential read gets me up to around 200MB/s, still far shy of the random/sequential write speeds.

What is wrong with that? That is around what iops a spinning disk can do.

Not exactly. The IOPS on each disk are well under 100, far shy of the ~200-250 each can sustain (and does, individually). Even iostat reports each drive as only 30-40% utilized. The limitation seems to be somewhere in ZFS, not the hardware.

In all other benchmarks against ZFS (and in random reads against the disks directly), the disks max out at a sustained ~250 IOPS.

I was thinking about the 200 MB/s sequential figure. Wouldn't that be the max a single disk, and thus raidz, can do?

Doesn't raidz1 have a theoretical max sequential of (n-1) * STR of a single disk?
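As a rough sketch of that rule of thumb (ignoring padding, metadata, and compression; the per-disk streaming rate is taken from the drive measurements at the top of this issue), the 8-disk raidz2 from the original report would top out around:

```shell
# raidz sequential ceiling ~= (n - parity) * single-disk streaming rate
disks=8; parity=2; str_mib=160
echo "$(( (disks - parity) * str_mib )) MiB/s"   # 960 MiB/s
```

So the ~200 MB/s sequential reads reported above are well below even a conservative estimate.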

I have seen greater than 1GB /s sequential on my raidz3 arrays.

We had very similar results.

We had to drop ZFS from consideration after seeing a similar issue.

HP Apollo machines with P408i-p controllers, 256 GB RAM, 40 cores.
CentOS 7, ZFS versions 0.7 through 0.8/master from December 2019.
RAIDZ, 4 vdevs of 6 x 4TB disks.

With XFS + the hardware controller we get 2.5-3 GB/s sequential speeds (fio, dd), single-threaded.

With ZFS we got at most 300-400 MB/s; after heavy tuning there were peaks around 700 MB/s with 8+ threads. Write speeds were around 1.5 GB/s, acceptable in our scenario (a parallel DB).

We would be happy to sacrifice some speed for the features, but this was a deal-breaker for us.

I have faced exactly the same behaviour with my 2 x 6 RaidZ2 pool (ZFS version 0.8.3-pve1 on Kernel 5.3.18-2-pve).
I described the issue here: https://forums.servethehome.com/index.php?threads/disappointing-zfs-read-performance-on-2-x-6-raidz2-and-quest-for-bottleneck-s.27716

Thank you very much @Maciej-Poleski: setting zfetch_max_distance to the maximum value of 2147483648 also got me the read speed I expected from my system.
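For anyone wanting to try the same tunable: on Linux it can be changed at runtime through the ZFS module parameter (value in bytes; the change does not survive a reboot unless persisted via modprobe options):

```shell
# Raise the max prefetch distance to 2 GiB; runtime-only change, needs root
echo 2147483648 > /sys/module/zfs/parameters/zfetch_max_distance

# Confirm the new value
cat /sys/module/zfs/parameters/zfetch_max_distance
```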

I would like to chime in. I've been running benchmarks on my system after running into performance issues too.

System is HPe ML10 Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz, 32 GB DDR4 ECC.
6 x 4TB spinning disks, Hitachi Deskstar and Ultrastar 7200rpm.

Before running the benchmarks, I installed a completely fresh Arch Linux system with kernel 5.6.15 and zfs 0.8.4. I changed nothing in the config: a basic, minimal install with just the packages needed to run the benchmarks and monitor performance.

I ran a full memtest86+ cycle to make sure the RAM is OK (it's ECC, but still).

To establish a baseline, I benchmarked each disk individually on 2048-aligned ext4 partitions, using fio in a loop with the following parameters. The test file is deleted and caches are dropped between runs:

  • Test Filesize: 64 GB (double the RAM)
  • Modes: read, randread, write, randwrite
  • Blocksizes: 4K, 8K, 64K, 128K, 1M
  • queuedepth: 8
  • Jobs: 8
  • end_fsync: 1
  • ioengine: libaio
  • direct: 1
  • group reporting: 1
  • ramp_time: 120 seconds
  • runtime: 500
  • time_based: 1 (makes fio run for the full 500 seconds whether or not it finishes early)
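Assembled into a single invocation, the parameters above would look roughly like the following (one mode/blocksize combination shown; the job name and file path are hypothetical, the flags are standard fio options):

```shell
# One iteration of the benchmark loop (hypothetical name and path)
fio --name=baseline --filename=/mnt/test/fio.dat --size=64G \
    --rw=randread --bs=128k --iodepth=8 --numjobs=8 \
    --ioengine=libaio --direct=1 --end_fsync=1 --group_reporting \
    --ramp_time=120 --runtime=500 --time_based
```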

I can post the results if you like, but believe me when I say these numbers are consistent across the board and completely within reasonable expectations. They also match online test results.

I created a zfs pool as follows:

  • ashift=12
  • relatime=on
  • canmount=off
  • compression=lz4
  • xattr=sa
  • dnodesize=auto
  • acltype=posixacl
  • normalization=formD
  • raidz2

I created a zfs dataset for each of the following recordsizes: 4K, 8K, 64K, 128K, 1M. I then ran the fio loop with each blocksize on each dataset. That amounts to 20 tests per dataset (100 in total), at 15 minutes per run: 5 hours per dataset and 25 hours overall.

The random read numbers on the pool:
| Mode: RANDREAD |  | RS4K | RS8K | RS64K | RS128K | RS1M |
| -- | -- | -- | -- | -- | -- | -- |
| 4K Rand | IOPS | 591 | 432 | 240 | 222 | 232 |
| 8K Rand | IOPS | 345 | 416 | 278 | 246 | 187 |
| 64K Rand | IOPS | 140 | 331 | 272 | 243 | 230 |
| 128K Rand | IOPS | 241 | 155 | 235 | 245 | 165 |
| 1M Rand | IOPS | 88 | 80 | 109 | 116 | 168 |
| Averages |  | 281 | 283 | 227 | 214 | 196 |
| 4K Rand | MiB/s | 2.3 | 1.7 | 0.9 | 0.9 | 0.9 |
| 8K Rand | MiB/s | 2.7 | 3.3 | 2.2 | 1.9 | 1.5 |
| 64K Rand | MiB/s | 8.8 | 20.7 | 17.1 | 15.2 | 14.4 |
| 128K Rand | MiB/s | 30.2 | 19.5 | 29.4 | 30.7 | 20.7 |
| 1M Rand | MiB/s | 89.0 | 80.4 | 110.0 | 117.0 | 169.0 |
| Averages |  | 26.6 | 25.1 | 31.9 | 33.1 | 41.3 |

Compared to the single-disk speeds, only the 4K result was faster (about twice as fast) on my pool. From 8K and up it's pretty much single-disk speeds, give or take here and there.

Random writes are a different story. Look at this:
| Mode: RANDWRITE | | RS4K | RS8K | RS64K | RS128K | RS1M |
| --- | --- | --- | --- | --- | --- | --- |
| 4K Rand | IOPS | 8905 | 5068 | 4829 | 5873 | 2200 |
| 8K Rand | IOPS | 6555 | 15500 | 2744 | 2819 | 2064 |
| 64K Rand | IOPS | 921 | 1813 | 3705 | 579 | 407 |
| 128K Rand | IOPS | 389 | 851 | 1794 | 2674 | 297 |
| 1M Rand | IOPS | 44 | 54 | 257 | 336 | 505 |
| Averages | IOPS | 3363 | 4657 | 2666 | 2456 | 1095 |
| 4K Rand | MiB/s | 34.8 | 19.8 | 18.9 | 22.9 | 8.6 |
| 8K Rand | MiB/s | 51.2 | 121.0 | 21.4 | 22.0 | 16.1 |
| 64K Rand | MiB/s | 57.5 | 113.0 | 232.0 | 26.2 | 25.5 |
| 128K Rand | MiB/s | 48.7 | 106.0 | 224.0 | 334.0 | 37.2 |
| 1M Rand | MiB/s | 44.3 | 54.6 | 258.0 | 336.0 | 505.0 |
| Averages | MiB/s | 47.3 | 82.9 | 150.9 | 148.2 | 118.5 |

I don't know what to make of this. The IOPS are through the roof (unreal; each disk is capable of maybe 250-300 max?). The 4K and 8K MiB/s figures are also unrealistically high, but the rest seems decent and consistent with triple to quadruple single-disk speeds.

Again, I don't know what to make of this, but I would really like to find out whether I can get those random read speeds "up to speed", so to speak. I'm currently running the same tests on a striped pool of 3 NVMe SSDs (which so far look abnormally slow, while their single-disk speeds are reasonable). After that is done I can experiment with performance tuning parameters (if I know which ones to try).
