Type | Version/Name
--- | ---
Distribution Name | Manjaro
Distribution Version | Testing
Linux Kernel | 4.19.46-1-MANJARO
Architecture | x86_64
ZFS Version | 0.8.0-1
SPL Version | 0.8.0-1
I run frequent fio benchmarks on my pool "zstore" and just realized that write performance has dropped with ZFS version 0.8.
With ZFS version 0.7.13 I typically got around 230-250 write IOPS:
fio-output-zstore-32G-2019-05-15@06:52: read: IOPS=240, BW=240MiB/s (252MB/s)(32.0GiB/136347msec)
fio-output-zstore-32G-2019-05-15@06:52: write: IOPS=233, BW=234MiB/s (245MB/s)(32.0GiB/140079msec); 0 zone resets
fio-output-zstore-32G-2019-04-06@19:53: read: IOPS=280, BW=281MiB/s (294MB/s)(32.0GiB/116694msec)
fio-output-zstore-32G-2019-04-06@19:53: write: IOPS=254, BW=254MiB/s (267MB/s)(32.0GiB/128766msec); 0 zone resets
fio-output-zstore-32G-2019-03-13@15:12: read: IOPS=286, BW=286MiB/s (300MB/s)(32.0GiB/114442msec)
fio-output-zstore-32G-2019-03-13@15:12: write: IOPS=269, BW=270MiB/s (283MB/s)(32.0GiB/121379msec); 0 zone resets
fio-output-zstore-32G-2019-03-09@11:02: read: IOPS=296, BW=296MiB/s (311MB/s)(32.0GiB/110551msec)
fio-output-zstore-32G-2019-03-09@11:02: write: IOPS=249, BW=249MiB/s (262MB/s)(32.0GiB/131339msec); 0 zone resets
fio-output-zstore-32G-2019-03-08@14:28: read: IOPS=305, BW=305MiB/s (320MB/s)(32.0GiB/107366msec)
fio-output-zstore-32G-2019-03-08@14:28: write: IOPS=243, BW=243MiB/s (255MB/s)(32.0GiB/134811msec); 0 zone resets
With ZFS version 0.8 I only get 160-190 write IOPS:
fio-output-zstore-0.8-32G-2019-05-30@11:01: read: IOPS=265, BW=265MiB/s (278MB/s)(32.0GiB/123489msec)
fio-output-zstore-0.8-32G-2019-05-30@11:01: write: IOPS=191, BW=192MiB/s (201MB/s)(32.0GiB/170900msec); 0 zone resets
fio-output-zstore-0.8-32G-2019-05-30@10:45: read: IOPS=278, BW=278MiB/s (292MB/s)(32.0GiB/117837msec)
fio-output-zstore-0.8-32G-2019-05-30@10:45: write: IOPS=160, BW=161MiB/s (168MB/s)(32.0GiB/204095msec); 0 zone resets
fio-output-zstore-0.8-32G-2019-05-29@08:12: read: IOPS=270, BW=270MiB/s (283MB/s)(32.0GiB/121249msec)
fio-output-zstore-0.8-32G-2019-05-29@08:12: write: IOPS=181, BW=181MiB/s (190MB/s)(32.0GiB/180892msec); 0 zone resets
The read IOPS seem unchanged, in the range of 260-280. Where is this write performance difference coming from?
Here are the pool details:
zfs recordsize is 1M. No compression. No dedup.
30# zpool status
  pool: zstore
 state: ONLINE
  scan: scrub repaired 0B in 0 days 16:57:34 with 0 errors on Mon Apr 1 23:52:01 2019
config:

        NAME                     STATE     READ WRITE CKSUM
        zstore                   ONLINE       0     0     0
          mirror-0               ONLINE       0     0     0
            sdb-WD-WCC4E5HF3P4S  ONLINE       0     0     0
            sdc-WD-WCC4E1SSP28F  ONLINE       0     0     0
          mirror-1               ONLINE       0     0     0
            sdd-WD-WCC4E1SSP6NC  ONLINE       0     0     0
            sda-WD-WCC7K7EK9VC4  ONLINE       0     0     0

errors: No known data errors
43# zfs get all zstore
NAME PROPERTY VALUE SOURCE
zstore type filesystem -
zstore creation Di Jan 23 14:39 2018 -
zstore used 6,76T -
zstore available 268G -
zstore referenced 96K -
zstore compressratio 1.03x -
zstore mounted yes -
zstore quota none default
zstore reservation none default
zstore recordsize 1M local
zstore mountpoint /mnt/zstore local
zstore sharenfs off default
zstore checksum on default
zstore compression lz4 local
zstore atime on local
zstore devices on default
zstore exec on default
zstore setuid on default
zstore readonly off default
zstore zoned off default
zstore snapdir hidden default
zstore aclinherit restricted default
zstore createtxg 1 -
zstore canmount on default
zstore xattr sa local
zstore copies 1 default
zstore version 5 -
zstore utf8only off -
zstore normalization none -
zstore casesensitivity sensitive -
zstore vscan off default
zstore nbmand off default
zstore sharesmb off default
zstore refquota none default
zstore refreservation none default
zstore guid 10936391047855543944 -
zstore primarycache all default
zstore secondarycache all default
zstore usedbysnapshots 0B -
zstore usedbydataset 96K -
zstore usedbychildren 6,76T -
zstore usedbyrefreservation 0B -
zstore logbias latency default
zstore objsetid 51 -
zstore dedup off default
zstore mlslabel none default
zstore sync standard default
zstore dnodesize legacy default
zstore refcompressratio 1.00x -
zstore written 96K -
zstore logicalused 6,99T -
zstore logicalreferenced 42K -
zstore volmode default default
zstore filesystem_limit none default
zstore snapshot_limit none default
zstore filesystem_count none default
zstore snapshot_count none default
zstore snapdev hidden default
zstore acltype posixacl local
zstore context none default
zstore fscontext none default
zstore defcontext none default
zstore rootcontext none default
zstore relatime on local
zstore redundant_metadata all default
zstore overlay off default
zstore encryption off default
zstore keylocation none default
zstore keyformat none default
zstore pbkdf2iters 0 default
zstore special_small_blocks 0 default
41# cat fio-bench-generic-seq-read.options
[global]
bs=1m
ioengine=libaio
invalidate=1
refill_buffers
numjobs=1
fallocate=none
size=${SIZE}
[seq-read]
rw=read
stonewall
45# cat fio-bench-generic-seq-write.options
[global]
bs=1m
ioengine=libaio
invalidate=1
refill_buffers
numjobs=1
fallocate=none
size=${SIZE}
[seq-write]
rw=write
stonewall
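For anyone reproducing these numbers: fio substitutes ${SIZE} from the environment, so a run presumably looks something like the following sketch (the mountpoint and job-file paths are assumptions based on the outputs above):

```sh
# assumed invocation: fio expands ${SIZE} in the job file from the environment,
# and writes its test files into the current directory (the pool's mountpoint)
cd /mnt/zstore
SIZE=32g fio /path/to/fio-bench-generic-seq-read.options
SIZE=32g fio /path/to/fio-bench-generic-seq-write.options
```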
Did you test on the same kernel version? Looks like https://github.com/zfsonlinux/zfs/issues/8793
The values I am showing here are all from kernel 4.19. I have a few numbers for kernel 5.0 which basically confirm the kernel 4.19 numbers; there is no significant difference by kernel version.
But the ZFS version makes a big difference: write IOPS are down to 70% with ZFS 0.8, with an average of 249 write IOPS on version 0.7.13 versus 175 on version 0.8.
Since you're using 4.19.46, this is probably #8793 as mentioned above. The symbol export that allowed SIMD-accelerated checksums was removed from the 4.19 branch with 4.19.38. Maybe set checksum=off for the duration of the benchmark and see if that changes things?
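A minimal sketch of that experiment, using the dataset name from this issue (zfs inherit restores the default afterwards):

```sh
zfs set checksum=off zstore                        # disable checksumming for the test
SIZE=32g fio /path/to/fio-bench-generic-seq-write.options
zfs inherit checksum zstore                        # back to the inherited default (checksum=on)
```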
If this is caused by the lack of SIMD support then you should be able to see the same drop in performance using 0.7.13 and the 4.19.46 kernel. It would be good to know either way.
I did two runs with checksum=off and it does NOT make a difference. Write performance is still down to about 70%.
My benchmark numbers for version 0.7.13 are from kernels 4.19.42, 4.19.34, 4.19.28 and 4.19.26 (following the Manjaro Testing upgrades). The benchmark numbers for version 0.8 are only for kernel 4.19.46.
Are you suggesting that this is a kernel regression?
Since you achieved the expected performance using 0.7.13 and the 4.19.42 kernel that should rule out the kernel's SIMD changes as a cause. Further investigation is going to be needed to determine exactly why you're seeing a drop in write performance.
The zfs manpage suggests considering a change of the dnodesize property to auto. From the dnodesize section of the manpage:
Consider setting dnodesize to auto if the dataset uses the xattr=sa property setting and the workload makes heavy use of extended attributes. This may be applicable to SELinux-enabled systems, Lustre servers, and Samba servers, for example. Literal values are supported for cases where the optimal size is known in advance and for performance testing.
Also, the recordsize of the dataset is 1M, which I think can cause issues depending on what you are storing in that dataset, since ZFS is a copy-on-write file system. Changing the recordsize of a dataset requires removing all files and placing them back on the dataset to ensure that every file uses the new recordsize.
ZFS supports a recordsize of up to 16 MiB; to get this, change the zfs_max_recordsize module parameter.
To view the value, do cat /sys/module/zfs/parameters/zfs_max_recordsize.
To change it [I DO NOT RECOMMEND IT], do echo <your value> > /sys/module/zfs/parameters/zfs_max_recordsize.
For 16 MiB the value should be echo $((16 * 1024 * 1024)), which is 16777216.
Changing the default value of 1048576, i.e. echo $((1 * 1024 * 1024)), to a bigger value gives issues when deleting files.
Note: if changing to some other size in MiB, it would be echo $((<your value> * 1024 * 1024)).
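The steps above, collected into one root-shell sketch (again: raising this tunable is not recommended):

```sh
# view the current limit (default 1048576, i.e. 1 MiB)
cat /sys/module/zfs/parameters/zfs_max_recordsize
# raise it to 16 MiB -- NOT recommended, see the caveat about file deletion
echo $((16 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_max_recordsize
```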
The system is always idle when I do the tests. I have been doing this for a while now. Unfortunately I have only kept the logs since March of this year, but the results have always been comparable as far as I remember, even with recordsize 128k. Of course there is always some variance in the values, but a performance decrease of 30% is a significant change.
Looking at the pool's history with zpool history <your pool name> | less for a time before the performance decrease may help.
There is nothing in the history other than the regular import or snapshot commands.
I did some more tests, also with another pool. The other pool is a raidz2 with 6 drives in an external USB case. The interesting finding for me is that this pool (zf1) is NOT showing performance differences. But I certainly see write performance issues with the internal pool (zstore).
I compared the output of "zfs get all" for both zstore and zf1 and there is no important difference other than mountpoint and such. Basic parameters are all the same.
I also doublechecked that checksum=on/off does not make a difference.
Once again some results for zstore:
old (good) values with zfs 0.7.13:
1 write: IOPS=255, BW=256MiB/s (268MB/s)(32.0GiB/128135msec); 0 zone resets
2 write: IOPS=238, BW=239MiB/s (250MB/s)(32.0GiB/137293msec); 0 zone resets
3 write: IOPS=245, BW=245MiB/s (257MB/s)(32.0GiB/133739msec); 0 zone resets
4 write: IOPS=243, BW=243MiB/s (255MB/s)(32.0GiB/134811msec); 0 zone resets
5 write: IOPS=249, BW=249MiB/s (262MB/s)(32.0GiB/131339msec); 0 zone resets
6 write: IOPS=269, BW=270MiB/s (283MB/s)(32.0GiB/121379msec); 0 zone resets
7 write: IOPS=254, BW=254MiB/s (267MB/s)(32.0GiB/128766msec); 0 zone resets
8 write: IOPS=233, BW=234MiB/s (245MB/s)(32.0GiB/140079msec); 0 zone resets
new (bad) values with zfs 0.8.0:
1 write: IOPS=174, BW=175MiB/s (183MB/s)(32.0GiB/187521msec); 0 zone resets
2 write: IOPS=188, BW=188MiB/s (197MB/s)(32.0GiB/174175msec); 0 zone resets
3 write: IOPS=203, BW=204MiB/s (213MB/s)(32.0GiB/160953msec); 0 zone resets
4 write: IOPS=205, BW=206MiB/s (216MB/s)(32.0GiB/159290msec); 0 zone resets
5 write: IOPS=191, BW=192MiB/s (201MB/s)(32.0GiB/170795msec); 0 zone resets
6 write: IOPS=159, BW=160MiB/s (168MB/s)(32.0GiB/204952msec); 0 zone resets
7 write: IOPS=180, BW=181MiB/s (190MB/s)(32.0GiB/181212msec); 0 zone resets
8 write: IOPS=194, BW=194MiB/s (204MB/s)(32.0GiB/168825msec); 0 zone resets
9 write: IOPS=215, BW=216MiB/s (226MB/s)(32.0GiB/151945msec); 0 zone resets
10 write: IOPS=194, BW=195MiB/s (204MB/s)(32.0GiB/168349msec); 0 zone resets
11 write: IOPS=203, BW=204MiB/s (214MB/s)(32.0GiB/160770msec); 0 zone resets
12 write: IOPS=205, BW=206MiB/s (216MB/s)(32.0GiB/159360msec); 0 zone resets
Let ZFS report what is happening on the pool and on each vdev with zpool iostat -vl <your pool> .1. This auto-refreshes every .1 seconds (the interval can be changed to any value) and shows all I/O happening on each vdev together with latency info.
Also use zpool iostat -vq <your pool> .1, which shows disk queue info, i.e. I/O waiting to be written to the disks.
With zpool iostat -vr <your pool> .1, the -r option shows the request-size histograms for each leaf vdev's I/O, split into individual I/Os (ind) and aggregate I/Os (agg). These stats can be useful for observing how well I/O aggregation is working.
zpool iostat -c lists a number of checks that can be run; you can check SMART, ATA and NVMe status, for example zpool iostat -c nvme_err.
If you see the error "Can't run -c with root privileges unless ZPOOL_SCRIPTS_AS_ROOT is set.", run ZPOOL_SCRIPTS_AS_ROOT=1 zpool iostat -c nvme_err instead.
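Those variants collected into one sketch, using the pool name from this thread (a longer interval than .1 is an assumption on my part; it tends to give steadier numbers):

```sh
zpool iostat -vl zstore 5     # per-vdev I/O with latency columns
zpool iostat -vq zstore 5     # per-vdev queue (pending I/O) info
zpool iostat -vr zstore 5     # request-size histograms (ind vs agg)
ZPOOL_SCRIPTS_AS_ROOT=1 zpool iostat -c nvme_err zstore
```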
Also monitor ZFS while it is working, for cache info, memory status, etc.:
cat /proc/spl/kstat/zfs/arcstats
To make it auto-refresh:
watch -n .1 cat /proc/spl/kstat/zfs/arcstats
Also ensure that your ashift value is accurate. blockdev --getpbsz /dev/sdX shows the physical block (sector) size; check it for all your disks.
Ashift info below:
At pool creation, ashift=12 should always be used, except with SSDs that have 8k sectors where ashift=13 is correct. A vdev of 512 byte disks using 4k sectors will not experience performance issues, but a 4k disk using 512 byte sectors will. Since ashift cannot be changed after pool creation, even a pool with only 512 byte disks should use 4k because those disks may need to be replaced with 4k disks or the pool may be expanded by adding a vdev composed of 4k disks. Because correct detection of 4k disks is not reliable, -o ashift=12 should always be specified during pool creation. See the ZFS on Linux FAQ for more details.
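A quick way to cross-check both sides, as a sketch (device names taken from the zpool status above; zdb reading the pool config from the default cachefile is an assumption):

```sh
# physical sector size per disk
for d in /dev/sd[a-d]; do
    printf '%s: ' "$d"
    blockdev --getpbsz "$d"
done
# the pool's ashift per vdev; expect ashift: 12 for 4k-sector disks
zdb -C zstore | grep ashift
```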
NB: running zpool iostat with a short interval (e.g. less than zfs_txg_timeout) is almost always a waste of effort.
Also, the output of a bunch of CLI collectors is difficult to grok.
A better solution is to use one of the telemetry collectors, telegraf or node_exporter, to collect the data and forward it to a TSDB such as influxdb or prometheus, and then analyze it with tools like grafana.
@richardelling Could a telemetry collector, TSDB and analysis tool be implemented in ZFS itself, given that working with iostat is a waste of effort and difficult to grok? I would like to know that all tools and features in ZFS are useful, so that I can use them to gain meaningful information from ZFS.
I have installed telegraf, which just pulls information from /proc/spl/kstat/zfs. I believe a tool in ZFS could do that and display a graph-like representation of the information, including what is happening on all vdevs. That would also be useful in troubleshooting performance, without the full bloat of influxdb or prometheus and grafana.
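For reference, the relevant part of such a telegraf setup looks roughly like the sketch below (the option names are quoted from memory and should be treated as assumptions; check the inputs.zfs plugin documentation):

```toml
[[inputs.zfs]]
  # read the kstats the ZFS kernel module exposes
  kstatPath = "/proc/spl/kstat/zfs"
  kstatMetrics = ["arcstats", "zfetchstats", "vdev_cache_stats"]
  # also collect per-pool I/O metrics
  poolMetrics = true
```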
no, it is a really bad idea and goes counter to the UNIX philosophy. Today ZFS makes stats available, but reading them is not a free operation. So designing a monitoring system needs to meet very different business requirements. For this reason it is best to have integration to the best-in-class monitoring systems. I only mentioned a few of the open source tools that are popular. There are many more tools in the market.
For what it's worth, I've also seen huge performance decreases on my pool. Write speed has throttled down to 30MB/s from 600MB/s+.
0.8rc3 and kernel 4.9.16-gentoo.
If you've got a reasonable method for me to collect performance data I will also assist in this.
        NAME                                                  STATE     READ WRITE CKSUM
        zebras                                                ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-WDC_WD60EDAZ-11BMZB0_WD-WX61D88AZET6          ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX51D88NL080          ONLINE       0     0     0
          mirror-1                                            ONLINE       0     0     0
            ata-WDC_WD60EDAZ-11BMZB0_WD-WX61DB72TP5S          ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WXB1HB4JKAM6          ONLINE       0     0     0
          mirror-2                                            ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX71DB8KYUPY          ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX21D9421XU3          ONLINE       0     0     0
        special
          mirror-3                                            ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC4D8-part2  ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC5F3-part2  ONLINE       0     0     0
        logs
          mirror-4                                            ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC5F3-part4  ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC4D8-part4  ONLINE       0     0     0
All direct attached from a Dell Perc h310 controller in IT mode.
@Setsuna-Xero do I understand correctly that you see this performance drop for both 0.8.0-rc3 and the 0.8.0 tag?
@behlendorf
Sorry, I forgot to include the previous kernel:
4.12.13 on 0.8rc3
I will be moving this array to another server with a 4.19.41 kernel as soon as the drive cages arrive, however.
It is striking to me that @Setsuna-Xero is seeing the performance drop with a RAID10 setup as well. Can it be that the RAID level makes the difference? I have another pool as RAIDZ2 which is not showing a performance drop.
I also have a write performance problem after upgrading from 0.7.19 to 0.8.0. I tried with an older kernel to exclude the missing SIMD problem, and my system is completely idle. Rsyncing the same VM image from a dedicated disk to the ZFS pool:
0.7.19, performance as expected:

0.8.0, performance bad:

I'm getting 2-3MB/s with cp/cq and tar. rsync gets an order of magnitude higher, at right around 30MB/s.
Previously, on whatever 0.7.x pool I had from two of these disks, I would get over 100MB/s write speed on a single mirror. Then once I moved to this pool, I had approximately 600MB/s, which then fell off to 20-30MB/s sometime after a kernel bump and moving to 0.8rc3.
I benchmarked sequential writes on a 6-disk RAIDz2 (all HDD) using Proxmox 6 with ZFS 0.8.1 and Kernel 5.0. The array struggled to maintain the single-disk sequential speed, around 200MB/sec.
An older ZoL build (0.7.13 with older kernel) shows more than double the speed with the same configuration, around 450MB/sec.
The 0.6.x branch was spinning like a tornado.
The 0.7.x branch dropped performance by about 30%.
And now there is a further performance drop.
Has anybody compiled and tested the master branch with commit https://github.com/zfsonlinux/zfs/commit/e5db31349484e5e859c7a942eb15b98d68ce5b4d ?
Another "me too" over here. After upgrading to the newest Proxmox (with Zol 0.8.1) I can't sustain write speeds for more than a few seconds before they tank and I get lockups.

Documenting this in case it helps. It seems clear that this is related to the lack of SIMD: higher RAID-Z levels use a lot of CPU, and scalar performance isn't enough.
cat /proc/spl/kstat/zfs/vdev_raidz_bench (the "scalar" row) on a Xeon 4108 gives:
gen_p (RAID-Z) is 1.13GB/sec
gen_pq (RAID-Z2) is 290MB/sec
and gen_pqr (RAID-Z3) is 132MB/sec.
SIMD makes everything 5-7x faster, so restoring SIMD should help this problem.
@amissus Which version did you test and show results for? 0.7.19 does not exist.
I'm sorry; version 0.7.13 has the expected performance for me, and >= 0.8 has degraded and unstable performance.
What exactly am I reading here?
[root@hostname~]# cat /proc/spl/kstat/zfs/vdev_raidz_bench
18 0 0x01 -1 0 5551518943 1459035087503366
implementation   gen_p        gen_pq       gen_pqr
original         383443168    135674622    67712690
scalar           1682391699   530611710    228126033
fastest          scalar       scalar       scalar
@msLinuxNinja I am reading RAID-Z at 1.6GB/sec, RAID-Z2 at 530MB/sec, and RAID-Z3 at 228MB/sec. So this is a fast CPU - some slower ones will require SIMD to get these numbers.
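Since the columns are evidently bytes per second, a one-liner along these lines (a sketch) turns the scalar row into MB/s for easier comparison:

```sh
# print the scalar row of the RAID-Z parity benchmark in MB/s
awk '$1 == "scalar" { printf "gen_p %.0f MB/s, gen_pq %.0f MB/s, gen_pqr %.0f MB/s\n",
                      $2/1e6, $3/1e6, $4/1e6 }' /proc/spl/kstat/zfs/vdev_raidz_bench
```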
For comparison:
$ cat /proc/spl/kstat/zfs/vdev_raidz_bench | grep scalar | awk '{ print $1, $2, $3, $4 }'
scalar 487264127 177733274 7658480
on
$ cat /proc/cpuinfo | grep "model name" | head -n1
model name : Intel(R) Xeon(R) CPU E5-2650L v2 @ 1.70GHz
@behlendorf This issue was created on 30 May; the fix for it landed in the master branch on 12 Jul. This is a very important case for us users. When do you plan to do the next release of ZFS with this commit?
What is the project's release policy?
I didn't find any information about it on GitHub or the ZoL website.
It seems zfs 0.8.2 was released, but without the fix in e5db31349484e5e859c7a942eb15b98d68ce5b4d.
I don't know the reason for it not being included, but it seems there will be a few more months of crawling performance.
Does this issue concern the kernel 3.10.0-1062.1.1.el7.x86_64 as well?
@DannCos the 3.10.0-1062.1.1.el7.x86_64 kernel is not affected by this issue.
I decided to conduct some tests under CentOS 7 (with 3.10.0-1062.1.1.el7.x86_64). The reason was that I was replacing a storage server, the old one running 0.7 and the new one running 0.8, and I experienced slow read performance on the new system.
Old server:
New server:
Both servers use about 20TB of storage and store 280 million files.
The old server would restore a 1GB backup with 100k files in about 1.5 minutes, where the new one would do the same folder in 17 minutes.
Note: writes seem to be decent on both systems; reads are the main thing affected.
Both tests were performed on an idle system right after rebooting (to ensure that no cache got hit).
atime turned off, lz4 compression turned on, dedup off.
It made me search and I found this thread regarding performance issues, so I wanted to test out various versions of ZoL as well as ZFS on FreeBSD 12.
For this I set up another machine:
All tests below use the same zpool create parameters: atime=off, dedup=off, compression=lz4, ashift=12, and a reboot being performed between every test.
The test directory structure is 11294 megabytes and 311153 inodes.
It's also worth noting that the only data stored on the pool is the test directory structure, nothing else. Whether performance becomes worse as the dataset grows, I don't know (hopefully it doesn't).
Backup/restore is performed using rsync on a local network (1 gigabit/s) with no other communication happening:
ZFS 0.6 (Installed via Ubuntu 16.04):
zfs striped mirror backup: 2 min 1 sec
zfs striped mirror restore: 3 min 39 sec
zfs raidz2 backup: 2 min 22 sec
zfs raidz2 restore: 3 min 36 sec
zfs striped backup: 2 min 15 sec
zfs striped restore: 3 min 26 sec
ZFS 0.7 (Installed via CentOS 7.7 using zfs-release.el7_6):
zfs striped mirror backup: 2 min 8 sec
zfs striped mirror restore: 3 min 16 sec
zfs raidz2 backup: 2 min 10 sec
zfs raidz2 restore: 3 min 18 sec
zfs striped backup: 2 min 8 sec
zfs striped restore: 3 min 23 sec
ZFS 0.8 (Installed via CentOS 7.7 using zfs-release.el7_7):
zfs striped mirror backup: 2 min 9 sec
zfs striped mirror restore: 4 min 45 sec
zfs raidz2 backup: 2 min 9 sec
zfs raidz2 restore: 5 min 54 sec
zfs striped backup: 2 min 8 sec
zfs striped restore: 5 min 24 sec
Backup times (writing to ZFS) stay pretty consistent in my case, likely also being limited by the 1G link between machines; the average is about 2 minutes and 10 seconds, or about 700mbps.
What surprises me about the drop between 0.7 and 0.8 is the read performance, especially for raidz2: from 3 minutes and 18 seconds to 5 minutes and 54 seconds. That's a 78% increase in restoration time.
Just for fun, I tried to give FreeBSD 12 a try:
zfs striped mirror: 4 min 22 sec
zfs raidz2: 5 min 20 sec
zfs striped: 5 min 4 sec
Whether it performs better under FreeBSD 11.x I haven't had the time to test yet.
Now, I'd expect performance to be roughly the same on the same hardware.
My tests still do not explain the massive slowdown I experience between the two real systems with more powerful hardware; hopefully adding more memory to a system (64 vs 128GB) shouldn't make performance worse.
I know this issue is mainly about write performance; however, I find it important that read performance gets mentioned as well, especially under 3.10.0-1062.1.1.el7.x86_64, which should not be affected by the SIMD change.
It makes me believe that there may be some other regression between 0.7 and 0.8 affecting overall performance, beyond SIMD.
If people want me to test with some other settings, I'm more than happy to do so. Ideally, I want my backup server to remain snappy, so that if restores are needed they can actually be performed quickly.
@lucasRolff could you do a benchmark for the 0.8.3 version which was released a few days ago?
Is this issue fully resolved?
I just did a test with 0.8.3 and kernel 5.4.14. I do see better IOPS.
Average of 7 runs:
read: 301 IOPS (lowest out of seven: 273)
write: 209 IOPS (lowest out of seven: 192)
This is certainly better than what I had before (https://github.com/zfsonlinux/zfs/issues/8836#issuecomment-497673636).
The read speed is very good, at the same level as or better than 0.7.13, but the write speed is still behind 0.7.13.
@mabod out of curiosity, how did you run the benchmark? I just want to compare results.
I explained it in this thread: it is a fio benchmark, and the fio option files are in this thread too.
@interduo - I moved my backup servers to 100% SSD storage and (sadly) using a hardware raid 6 :)
Eventually, I'll give ZFS a try again on spinning disks and see how it performs.
It does not sound like SIMD is the only problem with this:
zfs striped mirror restore: 3 min 39 sec
zfs striped mirror restore: 4 min 45 sec
@FlorianHeigl
@mabod
Did you do your tests on the 0.8.4 release? Could you post results?
I cannot compare my test results anymore because I have replaced all 4 HDs in that RAID10 in the meantime. Sorry.
@interduo I was thinking that the SIMD issue would only really affect RaidZ/compression/encryption but not a mirror, and so it might be something else.
Re-reading this now, I don't think that is actually the case.
I'm not sure if I can quickly run a few tests; if yes, I'll update.
I'm a bit late to this party...
For those of us building our own kernel for private use, is it possible to avoid "the SIMD issue" by reintroducing the symbols that are no longer exported and, if so, how would you do that?
I've been running 0.8.4 on 4.14.23 for about a week now (with the impression that reads seem a bit faster compared to 0.7.12 and writes probably slower, judging from compiler job durations). I'm building kernel 4.19.133 as we speak, so now would be a good time to restore those SIMD exports...
Thanks!
The commit message surprises me a bit: I would have expected checksumming to use the crc32 intrinsic from SSE4 (4.2 IIRC), and that's not mentioned. Good thing even my slow beater (N3150) has AES and AVX!
Edit: the name surprises me too, suggesting the patch was already needed in the 4.14 kernel.
@RJVB it is, but not in 4.14.0; the change was made as a backport in some later version, I can't remember which one right now.
There is also a second patch for the newer kernels: https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/kernel/export_kernel_fpu_functions_5_3.patch
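For those building their own kernel, the implied workflow is roughly the following sketch (the paths, the patch file name, and the kernel/module versions are assumptions; adjust them to what is actually installed):

```sh
# hypothetical workflow: patch the kernel to re-export the FPU entry points,
# rebuild it, then rebuild the ZFS dkms module against the new tree
cd /usr/src/linux-4.19.133
patch -p1 < export_kernel_fpu_functions_4_14.patch   # re-export kernel_fpu_begin/end
make -j"$(nproc)" && make modules_install && make install
dkms build zfs/0.8.4 -k 4.19.133 && dkms install zfs/0.8.4 -k 4.19.133
```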
Saw that. I have been wondering if there's a compelling reason to migrate to a 5.x kernel, beyond "latest is always greatest" or features I didn't know I couldn't do without... Either way it seemed smart to live with the latest 4.x kernel for a while first.
Now just to be certain: can I assume that the re-exported functions will be picked up automagically during the ZFS 0.8.4 (dkms) kernel module build (I don't see any NixOS patches to ZFS)?
@RJVB yes, OpenZFS checks each kernel capability individually during the build process, regardless of the kernel version.
And then one of the kernel modules simply fails to build: https://github.com/openzfs/zfs/issues/10601 :-/
I take it this patch has been tested with ZFS?
After working around the build failure I could finally boot a VM into my new 4.19 kernel, with the ZFS 0.8.4 kmods ready to roll. The VM runs under VirtualBox, using "raw disk" access to actual external drives connected via USB3. When I imported a pool (created recently by splitting off a dedicated mirror vdev from my main Linux rig's root pool) I discovered it had a number of corrupted items.
I don't know if the corruption occurred during the previous time I'd used that pool, or during import. The identified items were all directories, curiously (in a dataset that has copies=1 because it has its own registry that doubles as an online backup), and the errors could be cleared by making an identical copy (cp -prd /path/to/foo{,.bak}) and then replacing the original with that clone. I don't have the impression I lost anything... The remaining items don't seem to correspond to existing files; some are of the type "metadata:".
Can I suppose that every single directory on (at least) every single dataset with copies=1 would have been affected if this were due to an issue with my kernel patches *) or the workarounds I applied to get the ZFS kmods to build?
*): I also use the ConKolivas patches (which I had to refactor for 4.19.133) and a patch to make zswap use B-Trees.