Type | Version/Name
--- | ---
Distribution Name | Manjaro
Distribution Version | Testing
Linux Kernel | 4.19.46-1-MANJARO
Architecture | x86_64
ZFS Version | 0.8.0-1
SPL Version | 0.8.0-1
I run frequent fio benchmarks on my pool "zstore" and just realized that write performance has dropped with ZFS version 0.8.
With ZFS version 0.7.13 I typically got around 230-250 write IOPS:
fio-output-zstore-32G-2019-05-15@06:52: read: IOPS=240, BW=240MiB/s (252MB/s)(32.0GiB/136347msec)
fio-output-zstore-32G-2019-05-15@06:52: write: IOPS=233, BW=234MiB/s (245MB/s)(32.0GiB/140079msec); 0 zone resets
fio-output-zstore-32G-2019-04-06@19:53: read: IOPS=280, BW=281MiB/s (294MB/s)(32.0GiB/116694msec)
fio-output-zstore-32G-2019-04-06@19:53: write: IOPS=254, BW=254MiB/s (267MB/s)(32.0GiB/128766msec); 0 zone resets
fio-output-zstore-32G-2019-03-13@15:12: read: IOPS=286, BW=286MiB/s (300MB/s)(32.0GiB/114442msec)
fio-output-zstore-32G-2019-03-13@15:12: write: IOPS=269, BW=270MiB/s (283MB/s)(32.0GiB/121379msec); 0 zone resets
fio-output-zstore-32G-2019-03-09@11:02: read: IOPS=296, BW=296MiB/s (311MB/s)(32.0GiB/110551msec)
fio-output-zstore-32G-2019-03-09@11:02: write: IOPS=249, BW=249MiB/s (262MB/s)(32.0GiB/131339msec); 0 zone resets
fio-output-zstore-32G-2019-03-08@14:28: read: IOPS=305, BW=305MiB/s (320MB/s)(32.0GiB/107366msec)
fio-output-zstore-32G-2019-03-08@14:28: write: IOPS=243, BW=243MiB/s (255MB/s)(32.0GiB/134811msec); 0 zone resets
With ZFS version 0.8 I only get 160-190 write IOPS:
fio-output-zstore-0.8-32G-2019-05-30@11:01: read: IOPS=265, BW=265MiB/s (278MB/s)(32.0GiB/123489msec)
fio-output-zstore-0.8-32G-2019-05-30@11:01: write: IOPS=191, BW=192MiB/s (201MB/s)(32.0GiB/170900msec); 0 zone resets
fio-output-zstore-0.8-32G-2019-05-30@10:45: read: IOPS=278, BW=278MiB/s (292MB/s)(32.0GiB/117837msec)
fio-output-zstore-0.8-32G-2019-05-30@10:45: write: IOPS=160, BW=161MiB/s (168MB/s)(32.0GiB/204095msec); 0 zone resets
fio-output-zstore-0.8-32G-2019-05-29@08:12: read: IOPS=270, BW=270MiB/s (283MB/s)(32.0GiB/121249msec)
fio-output-zstore-0.8-32G-2019-05-29@08:12: write: IOPS=181, BW=181MiB/s (190MB/s)(32.0GiB/180892msec); 0 zone resets
The read IOPS seem unchanged, in the range of 260-280. Where is this write performance difference coming from?
Here are the pool details:
zfs recordsize is 1M. No compression. No dedup.
30# zpool status
  pool: zstore
 state: ONLINE
  scan: scrub repaired 0B in 0 days 16:57:34 with 0 errors on Mon Apr 1 23:52:01 2019
config:

        NAME                     STATE     READ WRITE CKSUM
        zstore                   ONLINE       0     0     0
          mirror-0               ONLINE       0     0     0
            sdb-WD-WCC4E5HF3P4S  ONLINE       0     0     0
            sdc-WD-WCC4E1SSP28F  ONLINE       0     0     0
          mirror-1               ONLINE       0     0     0
            sdd-WD-WCC4E1SSP6NC  ONLINE       0     0     0
            sda-WD-WCC7K7EK9VC4  ONLINE       0     0     0

errors: No known data errors
43# zfs get all zstore
NAME PROPERTY VALUE SOURCE
zstore type filesystem -
zstore creation Di Jan 23 14:39 2018 -
zstore used 6,76T -
zstore available 268G -
zstore referenced 96K -
zstore compressratio 1.03x -
zstore mounted yes -
zstore quota none default
zstore reservation none default
zstore recordsize 1M local
zstore mountpoint /mnt/zstore local
zstore sharenfs off default
zstore checksum on default
zstore compression lz4 local
zstore atime on local
zstore devices on default
zstore exec on default
zstore setuid on default
zstore readonly off default
zstore zoned off default
zstore snapdir hidden default
zstore aclinherit restricted default
zstore createtxg 1 -
zstore canmount on default
zstore xattr sa local
zstore copies 1 default
zstore version 5 -
zstore utf8only off -
zstore normalization none -
zstore casesensitivity sensitive -
zstore vscan off default
zstore nbmand off default
zstore sharesmb off default
zstore refquota none default
zstore refreservation none default
zstore guid 10936391047855543944 -
zstore primarycache all default
zstore secondarycache all default
zstore usedbysnapshots 0B -
zstore usedbydataset 96K -
zstore usedbychildren 6,76T -
zstore usedbyrefreservation 0B -
zstore logbias latency default
zstore objsetid 51 -
zstore dedup off default
zstore mlslabel none default
zstore sync standard default
zstore dnodesize legacy default
zstore refcompressratio 1.00x -
zstore written 96K -
zstore logicalused 6,99T -
zstore logicalreferenced 42K -
zstore volmode default default
zstore filesystem_limit none default
zstore snapshot_limit none default
zstore filesystem_count none default
zstore snapshot_count none default
zstore snapdev hidden default
zstore acltype posixacl local
zstore context none default
zstore fscontext none default
zstore defcontext none default
zstore rootcontext none default
zstore relatime on local
zstore redundant_metadata all default
zstore overlay off default
zstore encryption off default
zstore keylocation none default
zstore keyformat none default
zstore pbkdf2iters 0 default
zstore special_small_blocks 0 default
41# cat fio-bench-generic-seq-read.options
[global]
bs=1m
ioengine=libaio
invalidate=1
refill_buffers
numjobs=1
fallocate=none
size=${SIZE}
[seq-read]
rw=read
stonewall
45# cat fio-bench-generic-seq-write.options
[global]
bs=1m
ioengine=libaio
invalidate=1
refill_buffers
numjobs=1
fallocate=none
size=${SIZE}
[seq-write]
rw=write
stonewall
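For anyone reproducing these numbers: fio substitutes ${SIZE} from the environment, so a run presumably looks something like the following sketch (the mountpoint and job-file paths are assumptions based on the outputs above):

```sh
# assumed invocation: fio expands ${SIZE} in the job file from the environment,
# and writes its test files into the current directory (the pool's mountpoint)
cd /mnt/zstore
SIZE=32g fio /path/to/fio-bench-generic-seq-read.options
SIZE=32g fio /path/to/fio-bench-generic-seq-write.options
```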
Did you test on the same kernel version? Looks like https://github.com/zfsonlinux/zfs/issues/8793
The values I am showing here are all from kernel 4.19. I have a few numbers for kernel 5.0 which basically confirm the kernel 4.19 numbers; there is no significant difference by kernel version.
But the ZFS version makes a big difference: write IOPS are down to 70% with ZFS 0.8, with an average of 249 write IOPS on version 0.7.13 versus 175 on version 0.8.
Since you're using 4.19.46, this is probably #8793 as mentioned above. The symbol export that allowed SIMD-accelerated checksums was removed from the 4.19 branch with 4.19.38. Maybe set checksum=off for the duration of the benchmark and see if that changes things?
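A minimal sketch of that experiment, using the dataset name from this issue (zfs inherit restores the default afterwards):

```sh
zfs set checksum=off zstore                        # disable checksumming for the test
SIZE=32g fio /path/to/fio-bench-generic-seq-write.options
zfs inherit checksum zstore                        # back to the inherited default (checksum=on)
```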
If this is caused by the lack of SIMD support then you should be able to see the same drop in performance using 0.7.13 and the 4.19.46 kernel. It would be good to know either way.
I did two runs with checksum=off and it does NOT make a difference. Write performance is still down to about 70%.
My benchmark numbers for version 0.7.13 are from kernels 4.19.42, 4.19.34, 4.19.28 and 4.19.26 (following the Manjaro Testing upgrades). The benchmark numbers for version 0.8 are only for kernel 4.19.46.
Are you suggesting that this is a kernel regression?
Since you achieved the expected performance using 0.7.13 and the 4.19.42 kernel that should rule out the kernel's SIMD changes as a cause. Further investigation is going to be needed to determine exactly why you're seeing a drop in write performance.
The zfs manpage suggests considering a change of the dnodesize property to auto. From the dnodesize section of the manpage:
Consider setting dnodesize to auto if the dataset uses the xattr=sa property setting and the workload makes heavy use of extended attributes. This may be applicable to SELinux-enabled systems, Lustre servers, and Samba servers, for example. Literal values are supported for cases where the optimal size is known in advance and for performance testing.
Also, the recordsize of the dataset is 1M, which I think can cause issues depending on what you are storing in that dataset, since ZFS is a copy-on-write file system. Changing the recordsize of a dataset requires removing all files and placing them back on the dataset to ensure that every file uses the new recordsize.
ZFS supports a recordsize of up to 16 MiB; to get this, change the zfs_max_recordsize module parameter.
To view the value, do cat /sys/module/zfs/parameters/zfs_max_recordsize.
To change it [I DO NOT RECOMMEND IT], do echo <your value> > /sys/module/zfs/parameters/zfs_max_recordsize.
For 16 MiB the value should be echo $((16 * 1024 * 1024)), which is 16777216.
Changing the default value of 1048576, i.e. echo $((1 * 1024 * 1024)), to a bigger value gives issues when deleting files.
Note: if changing to some other size in MiB, it would be echo $((<your value> * 1024 * 1024)).
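The steps above, collected into one root-shell sketch (again: raising this tunable is not recommended):

```sh
# view the current limit (default 1048576, i.e. 1 MiB)
cat /sys/module/zfs/parameters/zfs_max_recordsize
# raise it to 16 MiB -- NOT recommended, see the caveat about file deletion
echo $((16 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_max_recordsize
```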
The system is always idle when I do the tests. I have been doing this for a while now. Unfortunately I have only kept the logs since March of this year, but the results have always been comparable as far as I remember, even with recordsize 128k. Of course there is always some variance in the values, but a performance decrease of 30% is a significant change.
Looking at the pool's history with zpool history <your pool name> | less for a time before the performance decrease may help.
There is nothing in the history other than the regular import or snapshot commands.
I did some more tests, also with another pool. The other pool is a raidz2 with 6 drives in an external USB case. The interesting finding for me is that this pool (zf1) is NOT showing performance differences. But I certainly see write performance issues with the internal pool (zstore).
I compared the output of "zfs get all" for both zstore and zf1 and there is no important difference other than mountpoint and such. Basic parameters are all the same.
I also doublechecked that checksum=on/off does not make a difference.
Once again some results for zstore:
old (good) values with zfs 0.7.13:
1 write: IOPS=255, BW=256MiB/s (268MB/s)(32.0GiB/128135msec); 0 zone resets
2 write: IOPS=238, BW=239MiB/s (250MB/s)(32.0GiB/137293msec); 0 zone resets
3 write: IOPS=245, BW=245MiB/s (257MB/s)(32.0GiB/133739msec); 0 zone resets
4 write: IOPS=243, BW=243MiB/s (255MB/s)(32.0GiB/134811msec); 0 zone resets
5 write: IOPS=249, BW=249MiB/s (262MB/s)(32.0GiB/131339msec); 0 zone resets
6 write: IOPS=269, BW=270MiB/s (283MB/s)(32.0GiB/121379msec); 0 zone resets
7 write: IOPS=254, BW=254MiB/s (267MB/s)(32.0GiB/128766msec); 0 zone resets
8 write: IOPS=233, BW=234MiB/s (245MB/s)(32.0GiB/140079msec); 0 zone resets
new (bad) values with zfs 0.8.0:
1 write: IOPS=174, BW=175MiB/s (183MB/s)(32.0GiB/187521msec); 0 zone resets
2 write: IOPS=188, BW=188MiB/s (197MB/s)(32.0GiB/174175msec); 0 zone resets
3 write: IOPS=203, BW=204MiB/s (213MB/s)(32.0GiB/160953msec); 0 zone resets
4 write: IOPS=205, BW=206MiB/s (216MB/s)(32.0GiB/159290msec); 0 zone resets
5 write: IOPS=191, BW=192MiB/s (201MB/s)(32.0GiB/170795msec); 0 zone resets
6 write: IOPS=159, BW=160MiB/s (168MB/s)(32.0GiB/204952msec); 0 zone resets
7 write: IOPS=180, BW=181MiB/s (190MB/s)(32.0GiB/181212msec); 0 zone resets
8 write: IOPS=194, BW=194MiB/s (204MB/s)(32.0GiB/168825msec); 0 zone resets
9 write: IOPS=215, BW=216MiB/s (226MB/s)(32.0GiB/151945msec); 0 zone resets
10 write: IOPS=194, BW=195MiB/s (204MB/s)(32.0GiB/168349msec); 0 zone resets
11 write: IOPS=203, BW=204MiB/s (214MB/s)(32.0GiB/160770msec); 0 zone resets
12 write: IOPS=205, BW=206MiB/s (216MB/s)(32.0GiB/159360msec); 0 zone resets
Let ZFS report what is happening on the pool and on each vdev with zpool iostat -vl <your pool> .1. This auto-refreshes every .1 seconds (the interval can be changed to any value) and shows all I/O happening on each vdev together with latency info.
Also use zpool iostat -vq <your pool> .1, which shows disk queue info, i.e. I/O waiting to be written to the disks.
With zpool iostat -vr <your pool> .1, the -r option shows the request-size histograms for each leaf vdev's I/O, split into individual I/Os (ind) and aggregate I/Os (agg). These stats can be useful for observing how well I/O aggregation is working.
zpool iostat -c lists a number of checks that can be run; you can check SMART, ATA and NVMe status, for example zpool iostat -c nvme_err.
If you see the error "Can't run -c with root privileges unless ZPOOL_SCRIPTS_AS_ROOT is set.", run ZPOOL_SCRIPTS_AS_ROOT=1 zpool iostat -c nvme_err instead.
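Those variants collected into one sketch, using the pool name from this thread (a longer interval than .1 is an assumption on my part; it tends to give steadier numbers):

```sh
zpool iostat -vl zstore 5     # per-vdev I/O with latency columns
zpool iostat -vq zstore 5     # per-vdev queue (pending I/O) info
zpool iostat -vr zstore 5     # request-size histograms (ind vs agg)
ZPOOL_SCRIPTS_AS_ROOT=1 zpool iostat -c nvme_err zstore
```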
Also monitor ZFS while it is working, for cache info, memory status, etc.:
cat /proc/spl/kstat/zfs/arcstats
To make it auto-refresh:
watch -n .1 cat /proc/spl/kstat/zfs/arcstats
Also ensure that your ashift value is accurate. blockdev --getpbsz /dev/sdX shows the physical block (sector) size; check it for all your disks.
Ashift info below:
At pool creation, ashift=12 should always be used, except with SSDs that have 8k sectors where ashift=13 is correct. A vdev of 512 byte disks using 4k sectors will not experience performance issues, but a 4k disk using 512 byte sectors will. Since ashift cannot be changed after pool creation, even a pool with only 512 byte disks should use 4k because those disks may need to be replaced with 4k disks or the pool may be expanded by adding a vdev composed of 4k disks. Because correct detection of 4k disks is not reliable, -o ashift=12 should always be specified during pool creation. See the ZFS on Linux FAQ for more details.
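A quick way to cross-check both sides, as a sketch (device names taken from the zpool status above; zdb reading the pool config from the default cachefile is an assumption):

```sh
# physical sector size per disk
for d in /dev/sd[a-d]; do
    printf '%s: ' "$d"
    blockdev --getpbsz "$d"
done
# the pool's ashift per vdev; expect ashift: 12 for 4k-sector disks
zdb -C zstore | grep ashift
```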
NB: running zpool iostat with a short interval (e.g. less than zfs_txg_timeout) is almost always a waste of effort.
Also, the output of a bunch of CLI collectors is difficult to grok.
A better solution is to use one of the telemetry collectors, telegraf or node_exporter, to collect the data and forward it to a TSDB such as influxdb or prometheus, and then analyze it with tools like grafana.
@richardelling Could a telemetry collector, TSDB and analysis tool be implemented in ZFS itself, given that working with iostat is a waste of effort and difficult to grok? I would like to know that all tools and features in ZFS are useful, so that I can use them to gain meaningful information from ZFS.
I have installed telegraf, which just pulls information from /proc/spl/kstat/zfs. I believe a tool in ZFS could do that and display a graph-like representation of the information, including what is happening on all vdevs. That would also be useful in troubleshooting performance, without the full bloat of influxdb or prometheus and grafana.
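For reference, the relevant part of such a telegraf setup looks roughly like the sketch below (the option names are quoted from memory and should be treated as assumptions; check the inputs.zfs plugin documentation):

```toml
[[inputs.zfs]]
  # read the kstats the ZFS kernel module exposes
  kstatPath = "/proc/spl/kstat/zfs"
  kstatMetrics = ["arcstats", "zfetchstats", "vdev_cache_stats"]
  # also collect per-pool I/O metrics
  poolMetrics = true
```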
no, it is a really bad idea and goes counter to the UNIX philosophy. Today ZFS makes stats available, but reading them is not a free operation. So designing a monitoring system needs to meet very different business requirements. For this reason it is best to have integration to the best-in-class monitoring systems. I only mentioned a few of the open source tools that are popular. There are many more tools in the market.
For what it's worth, I've also seen huge performance decreases on my pool. Write speed has throttled down to 30MB/s from 600MB/s+.
0.8rc3 and kernel 4.9.16-gentoo.
If you've got a reasonable method for me to collect performance data I will also assist in this.
        NAME                                                  STATE     READ WRITE CKSUM
        zebras                                                ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-WDC_WD60EDAZ-11BMZB0_WD-WX61D88AZET6          ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX51D88NL080          ONLINE       0     0     0
          mirror-1                                            ONLINE       0     0     0
            ata-WDC_WD60EDAZ-11BMZB0_WD-WX61DB72TP5S          ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WXB1HB4JKAM6          ONLINE       0     0     0
          mirror-2                                            ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_WD-WX71DB8KYUPY          ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_WD-WX21D9421XU3          ONLINE       0     0     0
        special
          mirror-3                                            ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC4D8-part2  ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC5F3-part2  ONLINE       0     0     0
        logs
          mirror-4                                            ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC5F3-part4  ONLINE       0     0     0
            ata-KINGSTON_SA400S37240G_50026B76824BC4D8-part4  ONLINE       0     0     0
All direct attached from a Dell Perc h310 controller in IT mode.
@Setsuna-Xero do I understand correctly that you see this performance drop for both 0.8.0-rc3 and the 0.8.0 tag?
@behlendorf
Sorry, I forgot to include the previous kernel:
4.12.13 on 0.8rc3
I will be moving this array to another server with a 4.19.41 kernel as soon as the drive cages arrive, however.
It is striking to me that @Setsuna-Xero is seeing the performance drop with a RAID10 setup as well. Can it be that the RAID level makes the difference? I have another pool as RAIDZ2 which is not showing a performance drop.
I also have a write performance problem after upgrading from 0.7.19 to 0.8.0. I tried with an older kernel to exclude the missing SIMD problem, and my system is completely idle. Rsyncing the same VM image from a dedicated disk to the ZFS pool:
0.7.19, performance as expected:

0.8.0, performance bad:

I'm getting 2-3MB/s with cp/cq and tar. rsync gets an order of magnitude higher, at right around 30MB/s.
Previously, on whatever 0.7.x pool I had from two of these disks, I would get over 100MB/s write speed on a single mirror. Then once I moved to this pool, I had approximately 600MB/s, which then fell off to 20-30MB/s sometime after a kernel bump and moving to 0.8rc3.
I benchmarked sequential writes on a 6-disk RAIDz2 (all HDD) using Proxmox 6 with ZFS 0.8.1 and Kernel 5.0. The array struggled to maintain the single-disk sequential speed, around 200MB/sec.
An older ZoL build (0.7.13 with older kernel) shows more than double the speed with the same configuration, around 450MB/sec.
The 0.6.x branch was spinning like a tornado.
The 0.7.x branch dropped performance by about 30%.
And now there is a further performance drop.
Has anybody compiled and tested the master branch with commit https://github.com/zfsonlinux/zfs/commit/e5db31349484e5e859c7a942eb15b98d68ce5b4d ?
Another "me too" over here. After upgrading to the newest Proxmox (with Zol 0.8.1) I can't sustain write speeds for more than a few seconds before they tank and I get lockups.

Documenting this in case it helps. It seems clear that this is related to the lack of SIMD: higher RAID-Z levels use a lot of CPU, and scalar performance isn't enough.
cat /proc/spl/kstat/zfs/vdev_raidz_bench (the "scalar" row) on a Xeon 4108 gives:
gen_p (RAID-Z) is 1.13GB/sec
gen_pq (RAID-Z2) is 290MB/sec
and gen_pqr (RAID-Z3) is 132MB/sec.
SIMD makes everything 5-7x faster, so restoring SIMD should help this problem.
@amissus Which version did you test and show results for? 0.7.19 does not exist.
I'm sorry; version 0.7.13 has the expected performance for me, and >= 0.8 has degraded and unstable performance.
What exactly am I reading here?
[root@hostname~]# cat /proc/spl/kstat/zfs/vdev_raidz_bench
18 0 0x01 -1 0 5551518943 1459035087503366
implementation   gen_p        gen_pq       gen_pqr
original         383443168    135674622    67712690
scalar           1682391699   530611710    228126033
fastest          scalar       scalar       scalar
@msLinuxNinja I am reading RAID-Z at 1.6GB/sec, RAID-Z2 at 530MB/sec, and RAID-Z3 at 228MB/sec. So this is a fast CPU - some slower ones will require SIMD to get these numbers.
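Since the columns are evidently bytes per second, a one-liner along these lines (a sketch) turns the scalar row into MB/s for easier comparison:

```sh
# print the scalar row of the RAID-Z parity benchmark in MB/s
awk '$1 == "scalar" { printf "gen_p %.0f MB/s, gen_pq %.0f MB/s, gen_pqr %.0f MB/s\n",
                      $2/1e6, $3/1e6, $4/1e6 }' /proc/spl/kstat/zfs/vdev_raidz_bench
```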
For comparison:
$ cat /proc/spl/kstat/zfs/vdev_raidz_bench | grep scalar | awk '{ print $1, $2, $3, $4 }'
scalar 487264127 177733274 7658480
on
$ cat /proc/cpuinfo | grep "model name" | head -n1
model name : Intel(R) Xeon(R) CPU E5-2650L v2 @ 1.70GHz
@behlendorf This issue was created on 30 May; the fix for it landed in the master branch on 12 Jul. This is a very important case for us users. When do you plan to do the next release of ZFS with this commit?
What is the project's release policy?
I didn't find any information about it on GitHub or the ZoL website.
It seems zfs 0.8.2 was released, but without the fix in e5db31349484e5e859c7a942eb15b98d68ce5b4d.
I don't know the reason for it not being included, but it seems there will be a few more months of crawling performance.
Does this issue concern the kernel 3.10.0-1062.1.1.el7.x86_64 as well?
@DannCos the 3.10.0-1062.1.1.el7.x86_64 kernel is not affected by this issue.
I decided to conduct some tests under CentOS 7 (with 3.10.0-1062.1.1.el7.x86_64). The reason was that I was replacing a storage server, the old one running 0.7 and the new one running 0.8, and I experienced slow read performance on the new system.
Old server:
New server:
Both servers use about 20TB of storage and store 280 million files.
The old server would restore a 1GB backup with 100k files in about 1.5 minutes, where the new one would do the same folder in 17 minutes.
Note: writes seem to be decent on both systems; reads are the main thing affected.
Both tests were performed on an idle system right after rebooting (to ensure that no cache got hit).
atime turned off, lz4 compression turned on, dedup off.
It made me search and I found this thread regarding performance issues, so I wanted to test out various versions of ZoL as well as ZFS on FreeBSD 12.
For this I set up another machine:
All tests below use the same zpool create parameters: atime=off, dedup=off, compression=lz4, ashift=12, and a reboot being performed between every test.
The test directory structure is 11294 megabytes and 311153 inodes.
It's also worth noting that the only data stored on the pool is the test directory structure, nothing else. Whether performance becomes worse as the dataset grows, I don't know (hopefully it doesn't).
Backup/restore is performed using rsync on a local network (1 gigabit/s) with no other communication happening:
ZFS 0.6 (Installed via Ubuntu 16.04):
zfs striped mirror backup: 2 min 1 sec
zfs striped mirror restore: 3 min 39 sec
zfs raidz2 backup: 2 min 22 sec
zfs raidz2 restore: 3 min 36 sec
zfs striped backup: 2 min 15 sec
zfs striped restore: 3 min 26 sec
ZFS 0.7 (Installed via CentOS 7.7 using zfs-release.el7_6):
zfs striped mirror backup: 2 min 8 sec
zfs striped mirror restore: 3 min 16 sec
zfs raidz2 backup: 2 min 10 sec
zfs raidz2 restore: 3 min 18 sec
zfs striped backup: 2 min 8 sec
zfs striped restore: 3 min 23 sec
ZFS 0.8 (Installed via CentOS 7.7 using zfs-release.el7_7):
zfs striped mirror backup: 2 min 9 sec
zfs striped mirror restore: 4 min 45 sec
zfs raidz2 backup: 2 min 9 sec
zfs raidz2 restore: 5 min 54 sec
zfs striped backup: 2 min 8 sec
zfs striped restore: 5 min 24 sec
Backup times (writing to ZFS) stay pretty consistent in my case, likely also being limited by the 1G link between machines; the average is about 2 minutes and 10 seconds, or about 700mbps.
What surprises me about the drop between 0.7 and 0.8 is the read performance, especially for raidz2: from 3 minutes and 18 seconds to 5 minutes and 54 seconds. That's a 78% increase in restoration time.
Just for fun, I tried to give FreeBSD 12 a try:
zfs striped mirror: 4 min 22 sec
zfs raidz2: 5 min 20 sec
zfs striped: 5 min 4 sec
Whether it performs better under FreeBSD 11.x I haven't had the time to test yet.
Now, I'd expect performance to be roughly the same on the same hardware.
My tests still do not explain the massive slowdown I experience between the two real systems with more powerful hardware; hopefully adding more memory to a system (64 vs 128GB) shouldn't make performance worse.
I know this issue is mainly about write performance; however, I find it important that read performance gets mentioned as well, especially under 3.10.0-1062.1.1.el7.x86_64, which should not be affected by the SIMD change.
It makes me believe that there may be some other regression between 0.7 and 0.8 affecting overall performance, beyond SIMD.
If people want me to test with some other settings, I'm more than happy to do so. Ideally, I want my backup server to remain snappy, so that if restores are needed they can actually be performed quickly.
@lucasRolff could you do a benchmark for the 0.8.3 version which was released a few days ago?
Is this issue fully resolved?
I just did a test with 0.8.3 and kernel 5.4.14. I do see better IOPS.
Average of 7 runs:
read: 301 IOPS (lowest out of seven: 273)
write: 209 IOPS (lowest out of seven: 192)
This is certainly better than what I had before (https://github.com/zfsonlinux/zfs/issues/8836#issuecomment-497673636).
The read speed is very good, at the same level as or better than 0.7.13, but the write speed is still behind 0.7.13.
@mabod out of curiosity, how did you run the benchmark? I just want to compare results.
I explained it in this thread: it is a fio benchmark, and the fio option files are in this thread too.
@interduo - I moved my backup servers to 100% SSD storage and (sadly) using a hardware raid 6 :)
Eventually, I'll give ZFS a try again on spinning disks and see how it performs.
It does not sound like SIMD is the only problem with this:
zfs striped mirror restore: 3 min 39 sec
zfs striped mirror restore: 4 min 45 sec
@FlorianHeigl
@mabod
Did you do your tests on the 0.8.4 release? Could you post results?
I cannot compare my test results anymore because I have replaced all 4 HDs in that RAID10 in the meantime. Sorry.
@interduo I was thinking that the SIMD issue would only really affect RaidZ/compression/encryption but not a mirror, and so it might be something else.
Re-reading this now, I don't think that is actually the case.
I'm not sure if I can quickly run a few tests; if yes, I'll update.
I'm a bit late to this party...
For those of us building our own kernel for private use, is it possible to avoid "the SIMD issue" by reintroducing the symbols that are no longer exported and, if so, how would you do that?
I've been running 0.8.4 on 4.14.23 for about a week now (with the impression that reads seem a bit faster compared to 0.7.12 and writes probably slower, judging from compiler job durations). I'm building kernel 4.19.133 as we speak, so now would be a good time to restore those SIMD exports...
Thanks!
The commit message surprises me a bit: I would have expected checksumming to use the crc32 intrinsic from SSE4 (4.2 IIRC), and that's not mentioned. Good thing even my slow beater (N3150) has AES and AVX!
Edit: the name surprises me too, suggesting the patch was already needed in the 4.14 kernel.
@RJVB it is, but not in 4.14.0; the change was made as a backport in some later version, I can't remember which one right now.
There is also a second patch for the newer kernels: https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/kernel/export_kernel_fpu_functions_5_3.patch
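For those building their own kernel, the implied workflow is roughly the following sketch (the paths, the patch file name, and the kernel/module versions are assumptions; adjust them to what is actually installed):

```sh
# hypothetical workflow: patch the kernel to re-export the FPU entry points,
# rebuild it, then rebuild the ZFS dkms module against the new tree
cd /usr/src/linux-4.19.133
patch -p1 < export_kernel_fpu_functions_4_14.patch   # re-export kernel_fpu_begin/end
make -j"$(nproc)" && make modules_install && make install
dkms build zfs/0.8.4 -k 4.19.133 && dkms install zfs/0.8.4 -k 4.19.133
```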
Saw that. I have been wondering if there's a compelling reason to migrate to a 5.x kernel, beyond "latest is always greatest" or features I didn't know I couldn't do without... Either way it seemed smart to live with the latest 4.x kernel for a while first.
Now just to be certain: can I assume that the re-exported functions will be picked up automagically during the ZFS 0.8.4 (dkms) kernel module build (I don't see any NixOS patches to ZFS)?
@RJVB yes, OpenZFS checks each kernel capability individually during the build process, regardless of the kernel version.
And then one of the kernel modules simply fails to build: https://github.com/openzfs/zfs/issues/10601 :-/
I take it this patch has been tested with ZFS?
After working around the build failure I could finally boot a VM into my new 4.19 kernel, with the ZFS 0.8.4 kmods ready to roll. The VM runs under VirtualBox, using "raw disk" access to actual external drives connected via USB3. When I imported a pool (created recently by splitting off a dedicated mirror vdev from my main Linux rig's root pool) I discovered it had a number of corrupted items.
I don't know if the corruption occurred during the previous time I'd used that pool, or during import. The identified items were all directories, curiously (in a dataset that has copies=1 because it has its own registry that doubles as an online backup), and the errors could be cleared by making an identical copy (cp -prd /path/to/foo{,.bak}) and then replacing the original with that clone. I don't have the impression I lost anything... The remaining items don't seem to correspond to existing files; some are of the type "metadata:".
Can I suppose that every single directory on (at least) every single dataset with copies=1 would have been affected if this were due to an issue with my kernel patches *) or the workarounds I applied to get the ZFS kmods to build?
*): I also use the ConKolivas patches (which I had to refactor for 4.19.133) and a patch to make zswap use B-Trees.