ZFS: Huge performance drop (30%~60%) after upgrading from 0.6.5.11 to 0.7.9

Created on 27 Aug 2018 · 60 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7.5
Linux Kernel | 3.10.0-862.9.1.el7.x86_64
Architecture | x86_64
ZFS Version | 0.6.5.11 => 0.7.9
SPL Version | 0.6.5.11 => 0.7.9

Describe the problem you're observing

I've found a huge performance drop between ZFS 0.6.5.11 and 0.7.9 with the following system/setup:

  • System Board: SuperMicro X8DTS
  • CPU: 2x Intel X5687 (8c) @3.60GHz
  • RAM: 32GB DDR3
  • Controller: LSI SAS2 2008 (IT firmware)
  • Disks: 12 x HGST HUSMR3280ASS201 (800GB, SAS/SSD)

On this system I've created the following RAID10 zpool:

  pool: DATA
 state: ONLINE
  scan: none requested
config:

        NAME                              STATE     READ WRITE CKSUM
        DATA                              ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x5000cca09f004a1c-part1  ONLINE       0     0     0
            wwn-0x5000cca09f004c14-part1  ONLINE       0     0     0
          mirror-1                        ONLINE       0     0     0
            wwn-0x5000cca09f00500c-part1  ONLINE       0     0     0
            wwn-0x5000cca09f005214-part1  ONLINE       0     0     0
          mirror-2                        ONLINE       0     0     0
            wwn-0x5000cca09f0052a8-part1  ONLINE       0     0     0
            wwn-0x5000cca09f005318-part1  ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            wwn-0x5000cca09f005700-part1  ONLINE       0     0     0
            wwn-0x5000cca09f005960-part1  ONLINE       0     0     0
          mirror-4                        ONLINE       0     0     0
            wwn-0x5000cca09f006178-part1  ONLINE       0     0     0
            wwn-0x5000cca09f00640c-part1  ONLINE       0     0     0
          mirror-5                        ONLINE       0     0     0
            wwn-0x5000cca09f00642c-part1  ONLINE       0     0     0
            wwn-0x5000cca09f006530-part1  ONLINE       0     0     0

And the following datasets:

NAME           USED  AVAIL  REFER  MOUNTPOINT
DATA          2.05G  4.22T    96K  none
DATA/db-data  1.01G  4.22T  1.01G  legacy

All of it created using the following commands:

zpool create -o ashift=12   DATA \
  mirror wwn-0x5000cca09f004a1c-part1 wwn-0x5000cca09f004c14-part1 \
  mirror wwn-0x5000cca09f00500c-part1 wwn-0x5000cca09f005214-part1 \
  mirror wwn-0x5000cca09f0052a8-part1 wwn-0x5000cca09f005318-part1 \
  mirror wwn-0x5000cca09f005700-part1 wwn-0x5000cca09f005960-part1 \
  mirror wwn-0x5000cca09f006178-part1 wwn-0x5000cca09f00640c-part1  \
  mirror wwn-0x5000cca09f00642c-part1 wwn-0x5000cca09f006530-part1

zfs set compression=lz4 DATA
zfs set mountpoint=none DATA
zfs create -o mountpoint=/mnt/db-data -o xattr=sa -o acltype=off -o atime=off -o relatime=off -o logbias=throughput -o recordsize=16K -o compression=lz4 DATA/db-data

While benchmarking (using fio, among other tools) against the DATA/db-data dataset, I've found quite a large performance difference between versions 0.6.5.11 and 0.7.9 of zfs/spl, as can be seen below.

  • Performance results using v0.6.5.11:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 6526 | 2775 | 25.5MB | 10.8MB
1G|8K|1|1|SYNC|6494|2783|50.7MB|21.7MB
1G|256K|1|1|SYNC|2267|974|567MB|244MB
1G|1M|1|1|SYNC|754|320|755MB|321MB
1G|4K|16|16|SYNC|21300|9151|83.2MB|35.7MB
1G|8K|16|16|SYNC|21500|9243|168MB|72.2MB
1G|256K|16|16|SYNC|3819|1638|955MB|410MB
1G|1M|16|16|SYNC|1303|560|1304MB|560MB
1G|16K|128|128|SYNC|126000|53900|1965MB|842MB
16G|4K|1|1|NOSYNC|17600|7566|68.9MB|29.6MB
16G|8K|1|1|NOSYNC|18300|7819|143MB|61.1MB
16G|256K|1|1|NOSYNC|3773|1623|936MB|403MB
16G|1M|1|1|NOSYNC|1178|502|1179MB|502MB
16G|4K|16|16|NOSYNC|125000|53400|487MB|209MB
16G|8K|16|16|NOSYNC|114000|48700|888MB|381MB
16G|256K|16|16|NOSYNC|8480|3637|2120MB|909MB
16G|1M|16|16|NOSYNC|2189|939|2189MB|940MB

  • Performance results using v0.7.9:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G|4K|1|1|SYNC|4236|1821|16.5MB|7.2MB
1G|8K|1|1|SYNC|4137|1764|32.3MB|16.8MB
1G|256K|1|1|SYNC|1413|578|353MB|145MB
1G|1M|1|1|SYNC|471|179|471MB|179MB
1G|4k|16|16|SYNC|18200|7791|70.9MB|30.4MB
1G|16k|16|16|SYNC|16000|7257|265MB|113MB
1G|256k|16|16|SYNC|3476|1498|869MB|375MB
1G|1M|16|16|SYNC|1095|467|1096MB|468MB
1G|16k|128|128|SYNC|43600|18600|681MB|291MB
16G|4K|1|1|SYNC|1842|801|7.4MB|3.2MB
16G|8K|1|1|SYNC|1838|791|14.5MB|6.4MB
16G|256K|1|1|SYNC|783|337|196MB|84.4MB
16G|1M|1|1|SYNC|255|108|255MB|109MB
1G|4K|1|1|NOSYNC|34200|14200|134MB|57.5MB
1G|8K|1|1|NOSYNC|33200|14200|259MB|111MB
1G|16K|1|1|NOSYNC|34300|14700|536MB|229MB
1G|32K|1|1|NOSYNC|17700|7612|552MB|238MB
1G|64K|1|1|NOSYNC|10100|4519|629MB|282MB
1G|128K|1|1|NOSYNC|5824|2383|728MB|298MB
1G|256K|1|1|NOSYNC|2847|1221|718MB|305MB
1G|1M|1|1|NOSYNC|756|322|756MB|322MB
1G|4k|16|16|NOSYNC|89800|38500|351MB|150MB
1G|8k|16|16|NOSYNC|88100|37800|688MB|295MB
1G|16k|16|16|NOSYNC|84300|36100|1317MB|565MB
1G|32k|16|16|NOSYNC|42900|18400|1342MB|576MB
1G|64k|16|16|NOSYNC|20900|8964|1305MB|560MB
16G|4K|1|1|NOSYNC|3033|1280|12.2MB|5.1MB
16G|8K|1|1|NOSYNC|2996|1292|23.8MB|10.3MB
16G|16K|1|1|NOSYNC|3976|1707|62.1MB|26.7MB
16G|32K|1|1|NOSYNC|3128|1342|97.8MB|41MB
16G|64K|1|1|NOSYNC|2462|1068|154MB|66.8MB
16G|128K|1|1|NOSYNC|1673|719|209MB|89.9MB
16G|256K|1|1|NOSYNC|750|322|186MB|80.1MB
16G|1M|1|1|NOSYNC|308|134|309MB|135MB
16G|4k|16|16|NOSYNC|24200|10400|94.4MB|40.5MB
16G|8k|16|16|NOSYNC|24400|10400|190MB|81.5MB
16G|256k|16|16|NOSYNC|3844|1654|889MB|404MB
16G|1M|16|16|NOSYNC|1017|433|1017MB|434MB

As can be seen, performance drops in both IOPs and BW for all use cases. Examples:

  • IOPs intensive workload (~30% difference):
    * 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    * 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB

  • BW intensive workload (~60% difference):
    * 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    * 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB

Tests were performed using the following commands, averaging 3 repetitions of each case.

rm -f kk
echo 3 > /proc/sys/vm/drop_caches
sleep 30
fio --filename=kk \
  -name=test --group_reporting --fallocate=none --ioengine=libaio \
  --rw=randrw --rwmixread=70 --refill_buffers --norandommap --randrepeat=0 --runtime=60 \
  --iodepth=$IODEPTH --numjobs=$THREADS \
  --direct=0 --sync=$WMODE --size=$FILESIZE --bs=$BS --time_based
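
The full sweep wrapper isn't shown in the thread; a hypothetical driver loop consistent with the command above might look like the following (the SYNC/NOSYNC column is assumed to map to fio's --sync=1/--sync=0, and the 3 runs are averaged afterwards from the saved logs):

```
#!/bin/bash
# Hypothetical sweep over the parameter matrix used in the tables above.
for FILESIZE in 1G 16G; do
  for BS in 4k 8k 16k 32k 64k 128k 256k 1M; do
    for CONC in "1 1" "16 16" "128 128"; do
      read -r IODEPTH THREADS <<< "$CONC"
      for WMODE in 1 0; do            # 1 = SYNC, 0 = NOSYNC (assumed mapping)
        for REP in 1 2 3; do          # 3 repetitions, averaged later
          rm -f kk
          echo 3 > /proc/sys/vm/drop_caches
          sleep 30
          fio --filename=kk \
            --name=test --group_reporting --fallocate=none --ioengine=libaio \
            --rw=randrw --rwmixread=70 --refill_buffers --norandommap --randrepeat=0 --runtime=60 \
            --iodepth="$IODEPTH" --numjobs="$THREADS" \
            --direct=0 --sync="$WMODE" --size="$FILESIZE" --bs="$BS" --time_based \
            --output="fio_${FILESIZE}_${BS}_${IODEPTH}x${THREADS}_${WMODE}_r${REP}.log"
        done
      done
    done
  done
done
```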

NOTEs:

  • The zpool and datasets have been re-created from scratch after swapping zfs versions.
  • When upgrading/downgrading zfs/spl kernel modules, userland utilities were upgraded/downgraded too.
  • I've tested using whole-disks instead of partitions, with same results.
  • Tuning zfs module parameters on v0.7.9 (like zfs_vdev_*, etc.) makes no observable difference.
  • I've tried 0.7.9 with kernel 4.4.152 (from elrepo), but results are even a little bit worse (~5% slower) than 0.7.9 with redhat's stock kernel.

All 60 comments

The use of scatter/gather lists for the ARC rather than chopping up vmalloc()'d blocks does incur a performance hit, but this seems a bit much...

@DeHackEd is there any module param or compile-time define I can set in order to disable s/g on 0.7.9 and redo benchmarks?

PS: I forgot to add that I've tried 0.7.9 with kernel 4.4.152 (from elrepo), but results are even a little bit worse (~5% slower) than 0.7.9 with redhat's stock kernel.

@pruiz you can set zfs_abd_scatter_enabled=0 to force ZFS to use the 0.6.5 allocation strategy and not use scatter/gather lists. You could also try setting zfs_compressed_arc_enabled=0 to disable keeping data compressed in the ARC. Both of these options will increase ZFS's memory footprint and CPU usage, but _may_ improve performance for your test workload. I'd be interested to see your results.

We've also done some work in the master branch to improve performance. If you're comfortable building ZFS from source it would be interesting to see how the master branch compares on your hardware.
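
For anyone wanting to try this, a minimal sketch of toggling the two parameters (the /sys paths are the standard module-parameter locations; the modprobe.d file name is arbitrary):

```
# Runtime toggle; newly allocated ARC buffers pick up the new settings,
# so re-importing the pool (or rebooting) gives the cleanest comparison.
echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled

# Persist across reboots/module reloads.
echo "options zfs zfs_abd_scatter_enabled=0 zfs_compressed_arc_enabled=0" \
  > /etc/modprobe.d/zfs-tuning.conf
```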

Hi @behlendorf,

Here are some preliminary results with 0.7.9 + zfs_abd_scatter_enabled=0 (same zpool & data set settings as previously):

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 5582 | 2399 | 21.8MB | 9.6MB
1G | 8K | 1 | 1 | SYNC | 5155 | 2209 | 40.3MB | 17.3MB
1G | 256K | 1 | 1 | SYNC | 2197 | 948 | 549MB | 237MB
1G | 1M | 1 | 1 | SYNC | 687 | 294 | 687MB | 295MB
1G | 4k | 16 | 16 | SYNC | 22300 | 9526 | 86.9MB | 37.2MB
1G | 8k | 16 | 16 | SYNC | 22200 | 9513 | 174MB | 74.3MB
1G | 256k | 16 | 16 | SYNC | 3842 | 1631 | 961MB | 408MB
1G | 1M | 16 | 16 | SYNC | 1248 | 528 | 1248MB | 528MB
16G | 4K | 1 | 1 | NOSYNC | 3349 | 1427 | 13.1MB | 5.7MB
16G | 8K | 1 | 1 | NOSYNC | 3360 | 1452 | 26.3MB | 11.3MB
16G | 256K | 1 | 1 | NOSYNC | 1674 | 726 | 419MB | 182MB
16G | 1M | 1 | 1 | NOSYNC | 499 | 214 | 499MB | 214MB
16G | 4K | 16 | 16 | NOSYNC | 25600 | 10900 | 99.8MB | 42.7MB
16G | 8K | 16 | 16 | NOSYNC | 25400 | 10900 | 199MB | 85.1MB
16G | 256K | 16 | 16 | NOSYNC | 4347 | 1859 | 1087MB | 465MB
16G | 1M | 16 | 16 | NOSYNC | 1183 | 505 | 1184MB | 506MB

  • Compared to other tests:
  • IOPs intensive workload:
    * 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    * 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB
    * 0.7.9+scatter=0 => 4k,1,1,SYNC => 5582/2399 - 21.8MB/9.6MB
    > Results: ~15% increase from plain 0.7.9, still lagging behind 0.6.5.11 (by another ~15%)
  • BW intensive workload:
    * 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    * 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB
    * 0.7.9+scatter=0 => 256k,16,16,NOSYNC => 4347/1859 - 1087MB/465MB
    > Results: ~20% increase from plain 0.7.9, still lagging behind 0.6.5.11 (by ~50%)

And here are some preliminary results with 0.7.9 + zfs_compressed_arc_enabled=0 (same zpool & data set settings as in my original testing):

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 4940 | 2122 | 19.3MB | 8.4MB
1G | 8K | 1 | 1 | SYNC | 4801 | 2059 | 37.5MB | 16.1MB
1G | 256K | 1 | 1 | SYNC | 1969 | 837 | 492MB | 209MB
1G | 1M | 1 | 1 | SYNC | 588 | 256 | 588MB | 257MB
1G | 4K | 16 | 16 | SYNC | 21700 | 9299 | 84.9MB | 36.3MB
1G | 8K | 16 | 16 | SYNC | 21500 | 9238 | 168MB | 72.2MB
1G | 256K | 16 | 16 | SYNC | 3592 | 1545 | 898MB | 386MB
1G | 1M | 16 | 16 | SYNC | 1086 | 465 | 1086MB | 466MB
16G | 4k | 1 | 1 | NOSYNC | 3222 | 1387 | 12.6MB | 5.5MB
16G | 8k | 1 | 1 | NOSYNC | 3233 | 1381 | 25.3MB | 10.8MB
16G | 256k | 1 | 1 | NOSYNC | 1524 | 653 | 381MB | 163MB
16G | 1M | 1 | 1 | NOSYNC | 442 | 192 | 443MB | 192MB
16G | 4k | 16 | 16 | NOSYNC | 23900 | 10200 | 93.4MB | 40MB
16G | 8k | 16 | 16 | NOSYNC | 23500 | 10100 | 184MB | 78.7MB
16G | 256k | 16 | 16 | NOSYNC | 3826 | 1637 | 957MB | 409MB
16G | 1M | 16 | 16 | NOSYNC | 1023 | 441 | 1023MB | 442MB

  • Compared to other tests:
  • IOPs intensive workload:
    * 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    * 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB
    * 0.7.9+scatter=0 => 4k,1,1,SYNC => 5582/2399 - 21.8MB/9.6MB
    * 0.7.9+comp_arc=0 => 4k,1,1,SYNC => 4940/2122 - 19.3MB/8.4MB
    > Results: ~10% increase from plain 0.7.9, still lagging behind 0.6.5.11
  • BW intensive workload:
    * 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    * 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB
    * 0.7.9+scatter=0 => 256k,16,16,NOSYNC => 4347/1859 - 1087MB/465MB
    * 0.7.9+comp_arc=0 => 256k,16,16,NOSYNC => 3826/1637 - 957MB/409MB
    > Results: ~10% decrease from plain 0.7.9..

And results from 0.7.9 with zfs_abd_scatter_enabled=0 + zfs_compressed_arc_enabled=0:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 5277 | 2271 | 20.6MB | 9MB
1G | 8K | 1 | 1 | SYNC | 5224 | 2226 | 40.8MB | 17.4MB
1G | 256K | 1 | 1 | SYNC | 2171 | 935 | 543MB | 234MB
1G | 1M | 1 | 1 | SYNC | 672 | 289 | 673MB | 290MB
1G | 4K | 16 | 16 | SYNC | 22500 | 9659 | 88MB | 37.7MB
1G | 8K | 16 | 16 | SYNC | 22400 | 9599 | 175MB | 74MB
1G | 256K | 16 | 16 | SYNC | 3813 | 1631 | 953MB | 408MB
1G | 1M | 16 | 16 | SYNC | 1232 | 530 | 1232MB | 531MB
16G | 4k | 1 | 1 | NOSYNC | 3366 | 1442 | 13.2MB | 5.7MB
16G | 8k | 1 | 1 | NOSYNC | 3346 | 1440 | 26.1MB | 11.2MB
16G | 256k | 1 | 1 | NOSYNC | 1748 | 751 | 437MB | 188MB
16G | 1M | 1 | 1 | NOSYNC | 507 | 217 | 508MB | 218MB
16G | 4k | 16 | 16 | NOSYNC | 25700 | 11000 | 101MB | 43.1MB
16G | 8k | 16 | 16 | NOSYNC | 25800 | 11000 | 202MB | 86.3MB
16G | 256k | 16 | 16 | NOSYNC | 4514 | 1930 | 1129MB | 483MB
16G | 1M | 16 | 16 | NOSYNC | 1205 | 513 | 1206MB | 514MB

  • Compared to other tests:
  • IOPs intensive workload:
    * 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    * 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB
    * 0.7.9+scatter=0 => 4k,1,1,SYNC => 5582/2399 - 21.8MB/9.6MB
    * 0.7.9+comp_arc=0 => 4k,1,1,SYNC => 4940/2122 - 19.3MB/8.4MB
    * 0.7.9+scatter=0+comp_arc=0 => 4k,1,1,SYNC => 5277/2271 - 20.6MB/9MB
    > Results: nearly the same performance as plain 0.7.9...
  • BW intensive workload:
    * 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    * 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB
    * 0.7.9+scatter=0 => 256k,16,16,NOSYNC => 4347/1859 - 1087MB/465MB
    * 0.7.9+comp_arc=0 => 256k,16,16,NOSYNC => 3826/1637 - 957MB/409MB
    * 0.7.9+scatter=0+comp_arc=0 => 256k,16,16,NOSYNC => 4514/1930 - 1129MB/483MB
    > Results: ~15% increase from plain 0.7.9..

I'll try master tomorrow and report here..
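
For reference, building from master is the usual autotools flow, roughly as sketched below (for the 0.7.x release branches the separate spl repository has to be built and installed first; paths and job count are illustrative):

```
git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git checkout master
sh autogen.sh
./configure
make -s -j"$(nproc)"
make install && depmod -a
# Unload the packaged modules and load the freshly built ones before re-testing.
modprobe zfs
```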

Well, I've built zfs from master (v0.7.0-1533_g47ab01a), and initial testing does not look promising :(

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 4426 | 1909 | 17.3MB | 7.6MB
1G | 8K | 1 | 1 | SYNC | 4348 | 1869 | 33MB | 14.6MB
1G | 256K | 1 | 1 | SYNC | 1840 | 794 | 460MB | 199MB
1G | 1M | 1 | 1 | SYNC | 597 | 259 | 597MB | 260MB
..

Possibly the slowdown with the 0.7x version is somewhere in the codepath taken because of logbias=throughput on the dataset? Asking as I'm running with logbias=latency and I vaguely remember to have benchmarked 0.7 to be faster than the 0.6 series I upgraded a certain system from a while ago...

I did some tests with logbias=latency with similar results. But I will repeat them and post here.


One thing I didn't originally notice from your first post is that recordsize=16k is set on the dataset. That's definitely a less common configuration and a potential reason why you're seeing a performance regression while others have reported an overall improvement. Regardless, we'll need to find the bottleneck. Thank you for bringing it to our attention and posting your performance results.

@behlendorf yeah, our intended use in this case is for a db server, so 8k or 16k should be the optimal recordsize.. probably not as common as bigger recordsizes, as you stated.

Anyway, I would be more than happy to test other configurations/options if you guys need it.

@pruiz, great job!
Would you mind performing your tests with recordsize=4k (as you have ashift=12, so the physical block size == 4k as well)? I've also noticed a performance drop after upgrading our NFS server (used as Proxmox shared storage) from 0.6.5 to 0.7.

Tests with recordsize=4k, logbias=throughput (fio, using randrw 70/30, as usual, with both 1G & 16G test files):

  • Using v0.6.5.11, with 1G test file:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 6498 | 2786 | 25.4MB | 10.9MB
1G | 8K | 1 | 1 | SYNC | 7243 | 3100 | 56.6MB | 24.2MB
1G | 256K | 1 | 1 | SYNC | 1349 | 581 | 337MB | 145MB
1G | 1M | 1 | 1 | SYNC | 372 | 160 | 372MB | 160MB
1G | 4k | 16 | 16 | SYNC | 25900 | 11100 | 101MB | 43.4MB
1G | 8k | 16 | 16 | SYNC | 22000 | 9827 | 179MB | 76.8MB
1G | 256k | 16 | 16 | SYNC | 1927 | 823 | 482MB | 206MB
1G | 1M | 16 | 16 | SYNC | 483 | 211 | 484MB | 211MB
1G | 4k | 1 | 1 | NOSYNC | 75300 | 32200 | 294MB | 126MB
1G | 8k | 1 | 1 | NOSYNC | 49200 | 21100 | 384MB | 165MB
1G | 256k | 1 | 1 | NOSYNC | 2444 | 1040 | 611MB | 260MB
1G | 1M | 1 | 1 | NOSYNC | 618 | 263 | 618MB | 264MB
1G | 4k | 16 | 16 | NOSYNC | 211000 | 90400 | 824MB | 353MB
1G | 8k | 16 | 16 | NOSYNC | 115000 | 49300 | 898MB | 385MB
1G | 256k | 16 | 16 | NOSYNC | 4193 | 1800 | 1048MB | 450MB
1G | 1M | 16 | 16 | NOSYNC | 1065 | 459 | 1066MB | 460MB

  • Using v0.6.5.11, with 16G test file:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
16G | 4K | 1 | 1 | SYNC | 4678 | 2003 | 18.3MB | 8MB
16G | 1M | 1 | 1 | SYNC | 287 | 125 | 288MB | 125MB
16G | 4K | 16 | 16 | SYNC | 23100 | 9881 | 90.1MB | 38.6MB
16G | 1M | 16 | 16 | SYNC | 462 | 200 | 463MB | 200MB
16G | 4K | 1 | 1 | NOSYNC | 11800 | 5083 | 46.3MB | 19.9MB
16G | 1M | 1 | 1 | NOSYNC | 359 | 154 | 359MB | 155MB
16G | 4K | 16 | 16 | NOSYNC | 99900 | 42800 | 390MB | 167MB
16G | 1M | 16 | 16 | NOSYNC | 874 | 373 | 874MB | 374MB

  • Using master (v0.7-1533_g47ab01a), with 1G test file:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1G | 4K | 1 | 1 | SYNC | 4942 | 2125 | 19.3MB | 8.5MB
1G | 8K | 1 | 1 | SYNC | 4841 | 2072 | 37.8MB | 16.2MB
1G | 256K | 1 | 1 | SYNC | 1240 | 533 | 310MB | 133MB
1G | 1M | 1 | 1 | SYNC | 377 | 162 | 378MB | 163MB
1G | 4K | 16 | 16 | SYNC | 42500 | 18200 | 166MB | 71.1MB
1G | 8K | 16 | 16 | SYNC | 32600 | 13000 | 255MB | 109MB
1G | 256K | 16 | 16 | SYNC | 1666 | 707 | 417MB | 177MB
1G | 1M | 16 | 16 | SYNC | 438 | 188 | 438MB | 188MB
1G | 4K | 1 | 1 | NOSYNC | 57500 | 24600 | 224MB | 96.2MB
1G | 8K | 1 | 1 | NOSYNC | 36600 | 15700 | 286MB | 122MB
1G | 256K | 1 | 1 | NOSYNC | 1754 | 754 | 499MB | 189MB
1G | 1M | 1 | 1 | NOSYNC | 460 | 201 | 461MB | 201MB
1G | 4K | 16 | 16 | NOSYNC | 130000 | 55800 | 508MB | 218MB
1G | 8K | 16 | 16 | NOSYNC | 64200 | 27500 | 501MB | 215MB
1G | 256K | 16 | 16 | NOSYNC | 2251 | 970 | 563MB | 243MB
1G | 1M | 16 | 16 | NOSYNC | 583 | 248 | 584MB | 249MB

  • Using master (v0.7-1533_g47ab01a), with 16G test file:

FILESIZE | BS | IODEPTH | THREADS | WMODE | IOP/s (R) | IOP/s (W) | BW (R) | BW (W)
--- | --- | --- | --- | --- | --- | --- | --- | ---
16G | 4K | 1 | 1 | SYNC | 3612 | 1535 | 14.1MB | 6.1MB
16G | 1M | 1 | 1 | SYNC | 288 | 122 | 288MB | 123MB
16G | 4K | 16 | 16 | SYNC | 19500 | 8364 | 76.1MB | 32.7MB
16G | 1M | 16 | 16 | SYNC | 389 | 168 | 390MB | 169MB
16G | 4K | 1 | 1 | NOSYNC | 12900 | 5547 | 50.6MB | 21.7MB
16G | 1M | 1 | 1 | NOSYNC | 339 | 145 | 340MB | 146MB
16G | 4K | 16 | 16 | NOSYNC | 36500 | 15600 | 143MB | 61.1MB
16G | 1M | 16 | 16 | NOSYNC | 510 | 219 | 511MB | 220MB

  • Results summary:

    1. Baseline 4k IOPs (SYNC)
      -> v0.6.5.11 - 1G/4k/1/1 => 6498 / 2786 (25.4MB / 10.9MB)
      -> v0.7-master - 1G/4k/1/1 => 4942 / 2125 (19.3MB / 8.5MB)
      => v0.6.5 wins this case by ~20%.

    2. Baseline 4k IOPs (NOSYNC)
      -> v0.6.5.11 - 1G/4k/1/1 => 75300 / 32200 (294MB / 126MB)
      -> v0.7-master - 1G/4k/1/1 => 57500 / 24600 (224MB / 96.2MB)
      => v0.6.5 wins again..

    3. Highest IOPs (SYNC)
      -> v0.6.5.11 - 1G/4k/16/16 => 25900 / 11100 (101MB / 43.4MB)
      -> v0.7-master - 1G/4k/16/16 => 42500 / 18200 (166MB / 71.1MB)
      => Winner is v0.7-master by an impressive 50%+

    4. Highest IOPs (NOSYNC)
      -> v0.6.5.11 - 1G/4k/16/16 => 211000 / 90400 (824MB / 353MB)
      -> v0.7-master - 1G/4k/16/16 => 130000 / 55800 (508MB / 218MB)
      => In this case v0.6.5 wins by far, nearly double.

    5. Highest Throughput (SYNC)
      -> v0.6.5.11 - 1G/1M/16/16 => 483 / 211 (484MB / 211MB)
      -> v0.7-master - 1G/1M/16/16 => 438 / 188 (438MB / 188MB)
      => I would call this a tie.

    6. Highest Throughput (NOSYNC)
      -> v0.6.5.11 - 1G/1M/16/16 => 1065 / 459 (1066MB / 460MB)
      -> v0.7-master - 1G/1M/16/16 => 583 / 248 (584MB / 249MB)
      => Another long win for v0.6.5.

NOTEs:

  • I've verified with zdb that ashift is 12 as intended (see the command sketch after these notes).
  • Testing against 0.7-master was done with abd_scatter_enabled=1 & compressed_arc_enabled=1.
  • The high concurrency/iodepth gains on 4k/sync with v0.7-master are impressive.. however it looks like we have a huge drop in the opposite use cases (no-sync tests).
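
The zdb check mentioned in the first note can be done against the cached pool config, e.g. (the output line is illustrative):

```
zdb -C DATA | grep ashift
#             ashift: 12    <- one line per top-level vdev
```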

I will try to add test results against v0.7.9 if I find some spare time tonight, as I would love to know whether those 4k/SYNC IOPs results of v0.7-master are reproducible with v0.7.9 too.

Using the zfs-tests performance regression suite, I'm seeing a similar regression for cached reads, random reads, and random writes. I'll start bisecting commits between the 0.6.5.11 and 0.7.0 tags.
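
A rough sketch of that bisection, assuming the repository's usual zfs-x.y.z release tag names:

```
git bisect start
git bisect bad  zfs-0.7.0      # first tag showing the regression
git bisect good zfs-0.6.5.11   # last known-good tag
# At each step: build, install, reload the modules, re-run the performance
# workload, then tell git the outcome:
git bisect good    # or: git bisect bad
```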

I would like to add some details here. I have used ZFS on FreeBSD for almost 10 years, and it has always had decent performance. But I have a newer build with only SSDs and an Optane 900p as SLOG, and the sync write performance is really bad. I've compared with different Linux distributions and other filesystems.

The tool I use to test sync write performance is pg_test_fsync

Here is the performance on my FreeBSD server with 3 raidz vdevs of 5 x 5400RPM spinners each (15 disks total) and a 32GB Optane.

$ pg_test_fsync -f /tank/rot/testfile
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                   n/a
        fdatasync                          7134.022 ops/sec     140 usecs/op
        fsync                              7138.345 ops/sec     140 usecs/op
        fsync_writethrough                              n/a
        open_sync                          7436.686 ops/sec     134 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                   n/a
        fdatasync                          5139.483 ops/sec     195 usecs/op
        fsync                              4403.700 ops/sec     227 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2606.494 ops/sec     384 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          5082.113 ops/sec     197 usecs/op
         2 *  8kB open_sync writes         3707.069 ops/sec     270 usecs/op
         4 *  4kB open_sync writes         2144.459 ops/sec     466 usecs/op
         8 *  2kB open_sync writes         1271.302 ops/sec     787 usecs/op
        16 *  1kB open_sync writes          636.725 ops/sec    1571 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                5989.971 ops/sec     167 usecs/op
        write, close, fsync                5913.696 ops/sec     169 usecs/op

Non-sync'ed 8kB writes:
        write                             72071.214 ops/sec      14 usecs/op

With 6 striped 800GB enterprise-class SSDs, an Optane 900p as SLOG, and ZFS on Linux 0.8.0-rc1 on Ubuntu 18.04:

$ sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /tank/rot/testfile  
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2574,871 ops/sec     388 usecs/op
        fdatasync                          2265,568 ops/sec     441 usecs/op
        fsync                              2242,302 ops/sec     446 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2510,196 ops/sec     398 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      1301,706 ops/sec     768 usecs/op
        fdatasync                          2101,979 ops/sec     476 usecs/op
        fsync                              2082,698 ops/sec     480 usecs/op
        fsync_writethrough                              n/a
        open_sync                          1441,130 ops/sec     694 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          2421,870 ops/sec     413 usecs/op
         2 *  8kB open_sync writes         1286,643 ops/sec     777 usecs/op
         4 *  4kB open_sync writes          674,385 ops/sec    1483 usecs/op
         8 *  2kB open_sync writes          352,586 ops/sec    2836 usecs/op
        16 *  1kB open_sync writes          179,682 ops/sec    5565 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                2469,133 ops/sec     405 usecs/op
        write, close, fsync                2522,016 ops/sec     397 usecs/op

Non-sync'ed 8kB writes:
        write                            113709,613 ops/sec       9 usecs/op

For comparison, exact same hardware, default settings ZoL 0.7.9, benchmarked with pg_test_fsync

Ubuntu 18.04 2200 iops
Debian 9 2000 iops
CentOS 7 8000 iops

FreeBSD 11.2 16000 iops

Ubuntu 18.04 + XFS 34000 iops
Ubuntu 18.04 + EXT4 32000 iops
Ubuntu 18.04 + BcacheFS 14000 iops 

If there is anything I can help with, please ask. I now know how to build from source 8-)

/tank/rot/testfile

What are the settings on the zfs filesystem (default of recordsize=128k would explain quite a lot)?

In my case it is the default, which is a dynamic record size, which means that when pg_test_fsync writes, ZFS will write 8kB blocks.

ZFS will maintain files in a filesystem in $recordsize sized blocks on-disk.
You tested performance of read-modify-write cycles with 128k on-disk blocks by partially rewriting 8k chunks, which naturally isn't great.

In case you want zfs to write 8k on-disk blocks:
Set recordsize of the filesystem to that value, recreate the file, test again.
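
A minimal sketch of that retest, assuming the dataset behind /tank/rot is named tank/rot:

```
zfs set recordsize=8K tank/rot
rm /tank/rot/testfile                 # recordsize only applies to newly written blocks
pg_test_fsync -f /tank/rot/testfile
```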

The recordsize is dynamic; it writes 8kB even if the recordsize is higher.

As this is easy to test, I can confirm that I get the exact same numbers. Also, iostat says the SLOG, which is an Optane 900p, is writing around 20-30MB/s.

IIRC recordsize is dynamic only for files smaller than the recordsize, or when compression is enabled.

Are you sure you're not being impacted by the write throttle? It can be tuned and
the default tuning is a bit of a guess.

https://github.com/zfsonlinux/zfs/wiki/ZFS-Transaction-Delay
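
For reference, the tunables that page discusses can be inspected like any other module parameter, for example:

```
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_dirty_data_max_percent \
       /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
       /sys/module/zfs/parameters/zfs_delay_scale
```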

Are you able to bisect this at all, even just using released versions? As a starting point, is 0.7.0 good like 0.6.5.11 or bad like 0.7.9?

Tuning zfs module parameters on v0.7.9 (like zfs_vdev_*, etc.) makes no observable difference.

Are you running stock settings or what was tweaked here?

grep . /sys/module/zfs/parameters/*

would help too if you can. From what I can tell above, 0.7 may have lower bandwidth and io/s, but it has quite a bit lower latency.

One commit that comes to mind for anyone able to bisect is 1ce23dcaff. It will [EDIT] increase [/EDIT] latency for single-threaded synchronous workloads such as pg_test_fsync but should help multi-threaded workloads as can be simulated with fio. See https://goo.gl/jBUiH5 for the author's performance testing on this commit.

@dweeezil That link didn't work for me. But here is the OpenZFS issue which has the links to perf testing etc.
https://www.illumos.org/issues/8585

I must be missing something: are those stats not severely biased given they are running in a VM? There's no mention of the infrastructure under it either. There's nothing to say those I/Os even made it to disk. It's one of the reasons I wanted to see what parameters were used.

https://nbviewer.jupyter.org/gist/prakashsurya/82724c167a4d183459135ff86d3155c6/EP-144-max-rate-submit-results.ipynb#HDD---%25-change-in-IOPs-as-reported-by-fio-vs.-number-of-fio-threads---project-vs.-baseline

Oops, 1ce23dc is _not_ in any 0.7.X tag. It's only in the 0.8.0-rc1 tag. In other words, the commit doesn't apply to this issue.

The disk I/O regression in 0.7.0 seems to be introduced by the Write Throttle commit, OpenZFS 7090 (3dfb57a35e8cbaa7c424611235d669f3c575ada1). It's puzzling how that would impact read performance.

For testing, I'm using the ZFS performance regression test suite, which creates datasets using an 8K record size.

Pre-Write Throttle results:

delphix@ZoL-ubuntu-4: grep iop *fio* | grep -v cached
random_reads.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  read : io=56320MB, bw=480576KB/s, iops=60071, runt=120006msec
random_reads.ksh.fio.sync.8k-ios.16-threads.1-filesystems:  read : io=42372MB, bw=361571KB/s, iops=45196, runt=120001msec
random_reads.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  read : io=7322.1MB, bw=62489KB/s, iops=7811, runt=120001msec
random_reads.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  read : io=54999MB, bw=469322KB/s, iops=58665, runt=120001msec
random_reads.ksh.fio.sync.8k-ios.64-threads.1-filesystems:  read : io=58846MB, bw=502136KB/s, iops=62767, runt=120003msec
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=23629MB, bw=201532KB/s, iops=25191, runt=120063msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=4324.1MB, bw=36906KB/s, iops=4613, runt=120001msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=17821MB, bw=152065KB/s, iops=19008, runt=120003msec
sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=112455MB, bw=958840KB/s, iops=7490, runt=120097msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  read : io=210926MB, bw=1757.7MB/s, iops=14061, runt=120005msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=62450MB, bw=532904KB/s, iops=4163, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=112632MB, bw=961070KB/s, iops=7508, runt=120007msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems:  read : io=265278MB, bw=2210.7MB/s, iops=17685, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=113139MB, bw=963085KB/s, iops=940, runt=120295msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  read : io=205522MB, bw=1712.7MB/s, iops=1712, runt=120006msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=59107MB, bw=504371KB/s, iops=492, runt=120002msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=113940MB, bw=971567KB/s, iops=948, runt=120089msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems:  read : io=259199MB, bw=2159.1MB/s, iops=2159, runt=120003msec
sequential_writes.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  write: io=52092MB, bw=444352KB/s, iops=3471, runt=120045msec
sequential_writes.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  write: io=45804MB, bw=390684KB/s, iops=3052, runt=120055msec
sequential_writes.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  write: io=18429MB, bw=157253KB/s, iops=1228, runt=120003msec
sequential_writes.ksh.fio.sync.128k-ios.32-threads.1-filesystems:  write: io=48983MB, bw=417852KB/s, iops=3264, runt=120040msec
sequential_writes.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  write: io=50869MB, bw=433917KB/s, iops=3389, runt=120045msec
sequential_writes.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  write: io=52512MB, bw=447439KB/s, iops=436, runt=120178msec
sequential_writes.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  write: io=50116MB, bw=427617KB/s, iops=417, runt=120011msec
sequential_writes.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  write: io=26384MB, bw=225142KB/s, iops=219, runt=120001msec
sequential_writes.ksh.fio.sync.1m-ios.32-threads.1-filesystems:  write: io=51830MB, bw=441701KB/s, iops=431, runt=120158msec
sequential_writes.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  write: io=52730MB, bw=449296KB/s, iops=438, runt=120178msec
sequential_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=39161MB, bw=334160KB/s, iops=41770, runt=120004msec
sequential_writes.ksh.fio.sync.8k-ios.16-threads.1-filesystems:  write: io=20610MB, bw=175850KB/s, iops=21981, runt=120013msec
sequential_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=5728.8MB, bw=48885KB/s, iops=6110, runt=120001msec
sequential_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=27228MB, bw=232246KB/s, iops=29030, runt=120049msec
sequential_writes.ksh.fio.sync.8k-ios.64-threads.1-filesystems:  write: io=34768MB, bw=296685KB/s, iops=37085, runt=120002msec
delphix@ZoL-ubuntu-4:

Post Write Throttle results:

delphix@ZoL-ubuntu-4: grep iop *fio* | grep -v cached
random_reads.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  read : io=45066MB, bw=384549KB/s, iops=48068, runt=120005msec
random_reads.ksh.fio.sync.8k-ios.16-threads.1-filesystems:  read : io=33579MB, bw=286536KB/s, iops=35817, runt=120002msec
random_reads.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  read : io=3347.1MB, bw=28569KB/s, iops=3571, runt=120001msec
random_reads.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  read : io=38711MB, bw=330326KB/s, iops=41290, runt=120002msec
random_reads.ksh.fio.sync.8k-ios.64-threads.1-filesystems:  read : io=42354MB, bw=361406KB/s, iops=45175, runt=120005msec
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=15518MB, bw=132409KB/s, iops=16551, runt=120010msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=4041.0MB, bw=34483KB/s, iops=4310, runt=120000msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=12471MB, bw=106414KB/s, iops=13301, runt=120006msec
sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=85841MB, bw=731726KB/s, iops=5716, runt=120129msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  read : io=109147MB, bw=931371KB/s, iops=7276, runt=120002msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=61193MB, bw=522176KB/s, iops=4079, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=81868MB, bw=698512KB/s, iops=5457, runt=120016msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems:  read : io=203022MB, bw=1691.9MB/s, iops=13534, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=86464MB, bw=736715KB/s, iops=719, runt=120181msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  read : io=121781MB, bw=1014.8MB/s, iops=1014, runt=120010msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=58418MB, bw=498492KB/s, iops=486, runt=120002msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=83923MB, bw=715565KB/s, iops=698, runt=120097msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems:  read : io=202765MB, bw=1689.7MB/s, iops=1689, runt=120004msec
sequential_writes.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  write: io=42299MB, bw=360887KB/s, iops=2819, runt=120020msec
sequential_writes.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  write: io=37867MB, bw=323130KB/s, iops=2524, runt=120002msec
sequential_writes.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  write: io=17015MB, bw=145189KB/s, iops=1134, runt=120001msec
sequential_writes.ksh.fio.sync.128k-ios.32-threads.1-filesystems:  write: io=40401MB, bw=344686KB/s, iops=2692, runt=120024msec
sequential_writes.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  write: io=42360MB, bw=361369KB/s, iops=2823, runt=120033msec
sequential_writes.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  write: io=44971MB, bw=382735KB/s, iops=373, runt=120319msec
sequential_writes.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  write: io=42142MB, bw=359420KB/s, iops=350, runt=120064msec
sequential_writes.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  write: io=23790MB, bw=203006KB/s, iops=198, runt=120001msec
sequential_writes.ksh.fio.sync.1m-ios.32-threads.1-filesystems:  write: io=44169MB, bw=376351KB/s, iops=367, runt=120178msec
sequential_writes.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  write: io=44628MB, bw=380664KB/s, iops=371, runt=120051msec
sequential_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=33858MB, bw=288915KB/s, iops=36114, runt=120004msec
sequential_writes.ksh.fio.sync.8k-ios.16-threads.1-filesystems:  write: io=18571MB, bw=158472KB/s, iops=19809, runt=120001msec
sequential_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=5397.4MB, bw=46057KB/s, iops=5757, runt=120001msec
sequential_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=24498MB, bw=209044KB/s, iops=26130, runt=120002msec
sequential_writes.ksh.fio.sync.8k-ios.64-threads.1-filesystems:  write: io=30910MB, bw=263736KB/s, iops=32966, runt=120012msec
delphix@ZoL-ubuntu-4:

The same question, is there any progress?

@woquflux:

is there any progress?

Within < 18 h? :man_shrugging:

@tonynguien thank you for isolating this commit. This should make it easier to identify the exact bottleneck. Could you try disabling the DVA throttle in the new code by setting zio_dva_throttle_enabled=0. This will disable the new stage in the pipeline, which should help us determine if it's the throttle itself or an implementation detail in this change.
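
For anyone reproducing this, the toggle is a regular runtime module parameter:

```
echo 0 > /sys/module/zfs/parameters/zio_dva_throttle_enabled
cat /sys/module/zfs/parameters/zio_dva_throttle_enabled   # confirm it now reads 0
```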

IANAD (I Am Not A [ZFS] Developer): however, I’ve seen massive performance issues throughout 0.7.x having to do with ZED running.

I know that there are clearly other issues here that are being addressed by the professionals, however...

Something to try is stopping zfs-zed and seeing if your performance issues disappear. There have been bugs reported due to issues with the udev system (https://github.com/zfsonlinux/zfs/issues/6667, https://github.com/zfsonlinux/zfs/issues/7366), and given that ZED, while a little better now, still causes constant drive access and kills our performance (on multiple machines, in every version from 0.7.4 to 0.7.9), I'm assuming the issues aren't totally addressed by 0.7.9.

@ericdaltman Just for the record, I did not have ZED running during my tests at all.

@tonynguien thank you for isolating this commit. This should make it easier to identify the exact bottleneck. Could you try disabling the DVA throttle in the new code by setting zio_dva_throttle_enabled=0. This will disable the new stage in the pipeline, which should help us determine if it's the throttle itself or an implementation detail in this change.

@behlendorf - I forgot to mention I have results with the throttle disabled, i.e. zio_dva_throttle_enabled=0. Numbers are slightly better but still far from the pre-throttle code.

delphix@ZoL-ubuntu-4:/var/tmp/test_results.7090.throttle_disabled/20180919T181507/perf_data$ cat /sys/module/zfs/parameters/zio_dva_throttle_enabled
0
delphix@ZoL-ubuntu-4:/var/tmp/test_results.7090.throttle_disabled/20180919T181507/perf_data$ grep iop *fio*
random_reads.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  read : io=45754MB, bw=390416KB/s, iops=48802, runt=120006msec
random_reads.ksh.fio.sync.8k-ios.16-threads.1-filesystems:  read : io=33682MB, bw=287413KB/s, iops=35926, runt=120001msec
random_reads.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  read : io=3272.9MB, bw=27922KB/s, iops=3490, runt=120001msec
random_reads.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  read : io=38607MB, bw=329442KB/s, iops=41180, runt=120002msec
random_reads.ksh.fio.sync.8k-ios.64-threads.1-filesystems:  read : io=42316MB, bw=361087KB/s, iops=45135, runt=120002msec
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=18109MB, bw=154515KB/s, iops=19314, runt=120009msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=4116.9MB, bw=35130KB/s, iops=4391, runt=120001msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=14056MB, bw=119943KB/s, iops=14992, runt=120001msec
sequential_reads_arc_cached.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=319479MB, bw=2660.9MB/s, iops=21280, runt=120101msec
sequential_reads_arc_cached.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=88224MB, bw=752834KB/s, iops=5881, runt=120001msec
sequential_reads_arc_cached.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=321499MB, bw=2678.2MB/s, iops=21431, runt=120008msec
sequential_reads_arc_cached.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=301952MB, bw=2512.6MB/s, iops=2512, runt=120177msec
sequential_reads_arc_cached.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=79046MB, bw=674520KB/s, iops=658, runt=120001msec
sequential_reads_arc_cached.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=303414MB, bw=2528.7MB/s, iops=2528, runt=120021msec
sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=86109MB, bw=733882KB/s, iops=5733, runt=120149msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  read : io=111820MB, bw=954187KB/s, iops=7454, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=61625MB, bw=525858KB/s, iops=4108, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=81968MB, bw=699381KB/s, iops=5463, runt=120013msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems:  read : io=202999MB, bw=1691.7MB/s, iops=13533, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=86863MB, bw=738946KB/s, iops=721, runt=120371msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  read : io=121171MB, bw=1009.8MB/s, iops=1009, runt=120006msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=58526MB, bw=499418KB/s, iops=487, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=85291MB, bw=727471KB/s, iops=710, runt=120057msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems:  read : io=203818MB, bw=1698.5MB/s, iops=1698, runt=120005msec
sequential_writes.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  write: io=43697MB, bw=372732KB/s, iops=2911, runt=120047msec
sequential_writes.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  write: io=38417MB, bw=327792KB/s, iops=2560, runt=120011msec
sequential_writes.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  write: io=16908MB, bw=144279KB/s, iops=1127, runt=120001msec
sequential_writes.ksh.fio.sync.128k-ios.32-threads.1-filesystems:  write: io=41039MB, bw=349946KB/s, iops=2733, runt=120088msec
sequential_writes.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  write: io=42654MB, bw=363959KB/s, iops=2843, runt=120006msec
sequential_writes.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  write: io=46163MB, bw=393017KB/s, iops=383, runt=120277msec
sequential_writes.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  write: io=42595MB, bw=363453KB/s, iops=354, runt=120008msec
sequential_writes.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  write: io=23767MB, bw=202771KB/s, iops=198, runt=120024msec
sequential_writes.ksh.fio.sync.1m-ios.32-threads.1-filesystems:  write: io=44624MB, bw=380557KB/s, iops=371, runt=120074msec
sequential_writes.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  write: io=45666MB, bw=389378KB/s, iops=380, runt=120094msec
sequential_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=34413MB, bw=293638KB/s, iops=36704, runt=120008msec
sequential_writes.ksh.fio.sync.8k-ios.16-threads.1-filesystems:  write: io=18190MB, bw=155217KB/s, iops=19402, runt=120001msec
sequential_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=5315.7MB, bw=45360KB/s, iops=5669, runt=120001msec
sequential_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=24537MB, bw=209381KB/s, iops=26172, runt=120001msec
sequential_writes.ksh.fio.sync.8k-ios.64-threads.1-filesystems:  write: io=30967MB, bw=264250KB/s, iops=33031, runt=120002msec

Hi,

I've made some new preliminary tests with v0.7.9 & master (g47ab01a) using zio_dva_throttle_enabled=0, compared to previous results:

  1. Baseline 4k IOPs (SYNC)
    -> v0.6.5.11 - 1G/4k/1/1 => 6498 / 2786 (25.4MB / 10.9MB)
    -> v0.7-master - 1G/4k/1/1 => 4942 / 2125 (19.3MB / 8.5MB)
    -> v0.7.9+dva_throttle=0 - 1G/4k/1/1 => 6072 / 2613 (23.7MB / 10.2MB)
    -> v0.7-master+dva_throttle=0 - 1G/4k/1/1 => 5251 / 2246 (20.5MB / 8.9MB)
    => v0.6.5 still wins, but v0.7.9+dva_throttle=0 has quite some improvement, which is again gone with master+dva_throttle=0.. :?

  2. Baseline 4k IOPs (NOSYNC)
    -> v0.6.5.11 - 1G/4k/1/1 => 75300 / 32200 (294MB / 126MB)
    -> v0.7-master - 1G/4k/1/1 => 57500 / 24600 (224MB / 96.2MB)
    -> v0.7.9+dva_throttle=0 - 1G/4k/1/1 => 54900 / 47905 (215MB / 92MB)
    -> v0.7-master+dva_throttle=0 - 1G/4k/1/1 => 58400 / 25000 (228MB / 97.8MB)
    => v0.6.5 wins again.. and dva_throttle=0 cases have little improvement if any over dva_throttle=1.

  3. Highest IOPs (SYNC)
    -> v0.6.5.11 - 1G/4k/16/16 => 25900 / 11100 (101MB / 43.4MB)
    -> v0.7-master - 1G/4k/16/16 => 42500 / 18200 (166MB / 71.1MB)
    -> v0.7.9+dva_throttle=0 - 1G/4k/16/16 => 27600 / 11800 (108MB / 46.1MB)
    -> v0.7-master+dva_throttle=0 - 1G/4k/16/16 => 42800 / 18300 (167MB / 71.6MB)
    => Winner is still v0.7-master, while v0.7-master+dva_throttle makes little difference.

  4. Highest IOPs (NOSYNC)
    -> v0.6.5.11 - 1G/4k/16/16 => 211000 / 90400 (824MB / 353MB)
    -> v0.7-master - 1G/4k/16/16 => 130000 / 55800 (508MB / 218MB)
    -> v0.7.9+dva_throttle=0 - 1G/4k/16/16 => 114000 / 48800 (445MB / 191MB)
    -> v0.7-master+dva_throttle=0 - 1G/4k/16/16 => 129000 / 55200 (503MB / 216MB)
    => In this case v0.6.5 still wins by far, while v0.7.9+dva_throttle=0 is worse than standard.

  5. Highest Throughput (SYNC)
    -> v0.6.5.11 - 1G/1M/16/16 => 483 / 211 (484MB / 211MB)
    -> v0.7-master - 1G/1M/16/16 => 438 / 188 (438MB / 188MB)
    -> v0.7.9+dva_throttle=0 - 1G/1M/16/16 => 430 / 182 (431MB / 182MB)
    -> v0.7-master+dva_throttle=0 - 1G/1M/16/16 => 443 / 188 (443MB / 188MB)
    => Still little difference between each test case..

  6. Highest Throughput (NOSYNC)
    -> v0.6.5.11 - 1G/1M/16/16 => 1065 / 459 (1066MB / 460MB)
    -> v0.7-master - 1G/1M/16/16 => 583 / 248 (584MB / 249MB)
    -> v0.7.9+dva_throttle=0 - 1G/1M/16/16 => 575 / 247 (576MB / 247MB)
    -> v0.7-master+dva_throttle=0 - 1G/1M/16/16 => 582 / 250 (582MB / 250MB)
    => Still another long win for v0.6.5.

So the change in zio_notify_parent() which replaced the zio_execute() with zio_taskq_dispatch() introduced the performance regression. I reverted that change and got similar performance to pre-throttle code.

In master, https://github.com/zfsonlinux/zfs/pull/7736/commits reduced taskq context switching and thus solved the above issue.

@pruiz Would you be able to test with master or 0.8 code to verify?

Additionally, I noticed two things:
1) Random write and sequential read numbers from the pre-write-throttle code are still higher than the numbers with the 7736 change.

Pre write throttle numbers:
```
delphix@ZoL-ubuntu-4: cd test_results.pre_throttle.5252/20181001T150923/perf_data/
delphix@ZoL-ubuntu-4: grep iop random_writes.ksh.fio*
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems: write: io=21403MB, bw=182629KB/s, iops=22828, runt=120005msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems: write: io=4094.5MB, bw=34939KB/s, iops=4367, runt=120001msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems: write: io=16115MB, bw=137498KB/s, iops=17187, runt=120011msec
delphix@ZoL-ubuntu-4:
delphix@ZoL-ubuntu-4: grep iop sequential_reads.ksh.fio

sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems: read : io=114173MB, bw=973311KB/s, iops=7603, runt=120119msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems: read : io=219190MB, bw=1826.6MB/s, iops=14612, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems: read : io=64290MB, bw=548606KB/s, iops=4285, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems: read : io=114187MB, bw=974304KB/s, iops=7611, runt=120011msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems: read : io=270624MB, bw=2255.2MB/s, iops=18041, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems: read : io=114034MB, bw=970752KB/s, iops=948, runt=120289msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems: read : io=213920MB, bw=1782.6MB/s, iops=1782, runt=120006msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems: read : io=61403MB, bw=523968KB/s, iops=511, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems: read : io=115057MB, bw=981288KB/s, iops=958, runt=120065msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems: read : io=263363MB, bw=2194.7MB/s, iops=2194, runt=120003msec
delphix@ZoL-ubuntu-4:
```

7736 numbers:

```
delphix@ZoL-ubuntu-4: cd /var/tmp/test_results.7736/20180928T121010/perf_data
delphix@ZoL-ubuntu-4: grep iop random_writes.ksh.fio**
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=17093MB, bw=145773KB/s, iops=18221, runt=120069msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=2680.9MB, bw=22876KB/s, iops=2859, runt=120001msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=12982MB, bw=110772KB/s, iops=13846, runt=120006msec
delphix@ZoL-ubuntu-4:
delphix@ZoL-ubuntu-4:
delphix@ZoL-ubuntu-4: grep iop sequential_reads.ksh.fio*
sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=97437MB, bw=831078KB/s, iops=6492, runt=120055msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  read : io=158444MB, bw=1320.4MB/s, iops=10562, runt=120003msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=55948MB, bw=477418KB/s, iops=3729, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=92176MB, bw=786437KB/s, iops=6144, runt=120020msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems:  read : io=228701MB, bw=1905.9MB/s, iops=15246, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=98438MB, bw=838174KB/s, iops=818, runt=120262msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  read : io=155480MB, bw=1295.7MB/s, iops=1295, runt=120005msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=53686MB, bw=458117KB/s, iops=447, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=93378MB, bw=796341KB/s, iops=777, runt=120073msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems:  read : io=219143MB, bw=1826.2MB/s, iops=1826, runt=120003msec
delphix@ZoL-ubuntu-4:
```

2) Cached read performance also dropped somewhere between 0.6.5 and the write throttle commit.

So we may still have some regressions. I'm looking at #2 now. Does it make sense to open new issue(s)?

I'd also like to mention that disabling dynamic taskqs (spl_taskq_thread_dynamic=0) can decrease latency and improve performance, especially in single-threaded benchmark scenarios.
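
A minimal sketch of setting that (since taskqs are created when the modules load and the pool is imported, setting it as a module option and reloading/rebooting is the safest approach; the file name is arbitrary):

```
echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl-tuning.conf
# After a module reload or reboot, confirm:
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic
```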

I'd also like to mention that disabling dynamic taskqs (spl_taskq_thread_dynamic=0) can decrease latency and improve performance, especially in single-threaded benchmark scenarios.

Thanks!

I used fio to test the performance of a zfs-0.7.11 zvol; the write amplification is more than 6x, which seriously affects zvol performance.

Type | Version/Name
-- | --
Distribution Name | redhat-7.4
Distribution Version | 7.4
Linux Kernel | 3.10.0-693.el7.x86_64
Architecture | x86_64
ZFS Version | 0.7.11
SPL Version | 0.7.11
Hardware | 3 x SSD(370G)

  • 8K zvol randwrite (results were attached as an image)

@kpande I don't quite understand what you mean. Can you elaborate more?

You are handling all ZIL writes via indirect sync (logbias=throughput). This will trash your ability to aggregate read I/O over time due to data/metadata fragmentation, and will even greatly reduce your ability to aggregate between one data block and another. Any outstanding async write in the same sync domain may suffer as well.

I understand the desire for throughput but here it is coming at the expense of the pool data at large. In the real world, you would seldom set up a dataset like this unless read performance was totally unimportant. If you will read from a block at least once, it's worth doing direct sync.

If you test with logbias=latency, you need to either add a SLOG or increase zfs_immediate_write_sz.
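
A sketch of those two options (the device path is a placeholder and the zfs_immediate_write_sz value is only illustrative, not a recommendation):

```
zfs set logbias=latency DATA/db-data
# Either dedicate a fast device as a SLOG...
zpool add DATA log /dev/disk/by-id/<fast-nvme-or-optane>
# ...or raise the cutoff (in bytes) below which sync writes are logged
# directly in the ZIL instead of as indirect writes.
echo 131072 > /sys/module/zfs/parameters/zfs_immediate_write_sz
```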

I'd recommend doing a ZFS send while you watch zpool iostat -r. With 16k indirect writes you should have some absolutely amazing unaggregatable fragmentation.

Another note - it looks like you are suffering reads even on full block writes. This should help greatly with that:

https://github.com/zfsonlinux/zfs/issues/8590

We have encountered a very similar issue, in the form of a significant performance drop between zfs 0.6.5.9 and 0.7.11. We are able to overcome the issue by setting zfs_abd_scatter_enabled=0 & zfs_compressed_arc_enabled=0.

We are using Debian Stretch (version 9.8) and linux kernel 4.9.0-8-amd64.

Our recordsize is 128K and I don't think we would be able to decrease it.

I too have seen cases where ABD scatter/gather isn't as performant. So I can believe it makes a difference for some workloads, but I don't have a generic guideline for when to use it and when not to. Experiment results appreciated.

I don't believe disabling compressed ARC will make much difference. Perhaps on small-memory machines? Can you toggle just that and report results? This will be more important soon, as there is a proposal to force compressed ARC on: https://github.com/zfsonlinux/zfs/issues/7896

I tried enabling zfs_compressed_arc_enabled and zfs_abd_scatter_enabled in separate tests.
Enabling compressed ARC causes the biggest performance drop of the two. Enabling ABD scatter/gather also decreases performance, but not as noticeably as compressed ARC.
Our machine has more than 250G of memory. Let me know if I can help by providing other information.

@pauful What compression algorithm are you using on your datasets? lz4 is very fast to decompress but gzip would certainly cause issues.

@jwittlincohen lz4 is the compression option used in our pools.

@pruiz Have you tried current master? Some performance oriented commits have been applied so far

@matveevandrey not yet, but I have it on TODO.


Did somebody do some performance tests on 0.8.2 and would like to share?
I didn't find anything interesting on Google.

@pruiz could You do the same test on 0.8.2 with the hardware You mentioned earlier?

Right now I have quite limited time; maybe in a week or two..


@pruiz could You do the same test on 0.8.3 with the hardware You mentioned earlier?
You've done good work with that earlier.

Just a lurker on this bug, as I saw a similar drop on systems here back in 2018 doing the same transition from 0.6.5 to 0.7.x: performance decreased to about 1/5 - 1/6th when running v0.7, so I had to fall back. I've been testing newer versions against the same array over the last two years, but still within the 0.7 line, with no luck. This last week I tried 0.8.3 and performance is back, comparable with 0.6.5. This is on one of my larger dev/qa systems and I will watch it closely for the next month before upgrading the other systems. So 0.8.3 looks promising. Just wanted to bump this and see if pruiz could validate that this also alleviates his original problem.

@stevecs try 0.8.5; this version delivers more I/Os.

@interduo I'll see if I can get another window, but it will probably be a couple of weeks. I did a quick look at the commit deltas between 0.8.3 and 0.8.5 but didn't see much to catch my eye regarding I/O improvements (though I did spot a couple of other commits that were interesting). Can you give me a hint as to what commits you think may be relevant?

I just jumped from 0.8.2 to 0.8.5 with a nice surprise on the I/O graphs. I didn't look at the commits.
