Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 17.04
Linux Kernel | 4.10.0-33-generic
Architecture | x86_64
ZFS Version | 0.7.0
SPL Version | 0.7.0
Pool created:
zpool create -o ashift=12 -O recordsize=4k -O compression=off -O checksum=off ev3 /dev/ev3mema
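(For anyone reproducing this, the effective settings can be double-checked with standard commands; nothing here is specific to my setup:)
zfs get recordsize,compression,checksum ev3
zpool get ashift ev3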
Then I run fio:
fio --filename=/ev3/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=15G --group_reporting --name=test
Performance decreased over time; now (after a few minutes):
Jobs: 1 (f=1): [w(1)] [48.6% done] [0KB/3576KB/0KB /s] [0/894/0 iops] [eta 11m:18s]
zpool iostat -v ev3 1
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
ev3 15.2G 739M 2.62K 69.2K 15.9M 420M
ev3mema 15.2G 739M 2.62K 69.2K 15.9M 420M
---------- ----- ----- ----- ----- ----- -----
If I restart fio, it does between 100 and 2000 IOPS (though it now seems to mostly stay in the 100-200 range), while zfs writes >1GB/s to the device.
This makes no sense to me.
With 128k iosize, IOPS stay around ~1300 and zfs writes ~300MB/s to the device.
With 1B iosize, IOPS drop to 100 and zfs writes 1.2GB/s to the drive. W00T.
With 4K iosize and iodepth=1, fio starts at ~10K IOPS but drops to 700 while zfs writes 1.6GB/s.
This is without removing the /ev3/fiotest file; it is left dirty.
Also, z_wr_iss threads are eating ~40% of all CPUs.
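(For reference, one way to watch those threads, nothing fancy:)
ps -eLo pcpu,comm --sort=-pcpu | grep z_wr_iss | head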
If I create a ZVOL and run the same fio benchmark on that, it behaves as it should: 150K IOPS and 600MB/s. On subsequent runs (without recreating the ZVOL) it isn't nearly as stable and performs about 30% worse, but since fio is doing buffered writes I see ZFS writing to the device long after fio itself has finished, and I'm guessing there's some amplification as well (200%?).
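(For clarity, the ZVOL runs were set up roughly along these lines - the name and size here are illustrative, not the exact commands I used:)
zfs create -V 15G -o volblocksize=4k ev3/zvtest
fio --filename=/dev/zvol/ev3/zvtest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --group_reporting --name=test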
Switching fio to direct=1 and bs=4k shows write amplification of 1000%:
write: io=913540KB, bw=64970KB/s, iops=16242, runt= 14061msec
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
ev3 14.1G 1.81G 868 67.3K 434K 677M
ev3mema 14.1G 1.81G 868 67.3K 434K 677M
---------- ----- ----- ----- ----- ----- -----
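(Rough arithmetic behind the ~1000% figure, using the numbers from this run:)
# fio reported io=913540KB over runt=14061msec, while the pool sustained ~677MB/s
echo "scale=2; (677 * 14.061) / (913540 / 1024)" | bc   # ~10.7x, i.e. roughly 1000%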
On subsequent runs it rises even more (15MB/s vs 650MB/s).
With a larger, matched volblocksize=32k and iosize=32k, it looks stable at 200% write amplification (even with logbias=throughput, so even that doesn't make sense to me).
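(The setup for that test was along these lines; the volume name is illustrative:)
zfs create -V 14G -o volblocksize=32k ev3/zv32k
zfs set logbias=throughput ev3/zv32k
fio --filename=/dev/zvol/ev3/zv32k --ioengine=libaio --direct=1 --rw=randwrite --bs=32k --iodepth=128 --group_reporting --name=test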
FYI this is fio on the block device itself:
fio --filename=/dev/ev3mema --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --size=14G --group_reporting --name=file1
write: io=14336MB, bw=1840.4MB/s, iops=471118, runt= 7790msec
This is fio on ext4 using this block device:
fio --filename=/ev3ext4/fiotest --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --size=14G --group_reporting --name=file1
write: io=14336MB, bw=558071KB/s, iops=139517, runt= 26305msec
(block device shows ~600MB/s)
Am I missing something or is it seriously broken?
It might be related to the file or ZVOL taking up almost all space in the pool (the device is only 16G), but it still isn't pretty.
I'm testing the same scenarios with an ordinary 60GB SSD now and will test smaller data sizes on the EV3 as well.
So...
Yes, it is related to free space, at least partly:
fio --filename=/zfstest/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=54700M --group_reporting --name=file1
700-1500 IOPS (~3-6MB/s), but:
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 53.6G 1.86G 0 18.6K 0 177M
sdh 53.6G 1.86G 0 18.6K 0 177M
---------- ----- ----- ----- ----- ----- -----
(the above is the first run and buffered; rewriting it would take a looong time and I hope it's preallocated already...)
so let's call it a write amplification of 50x
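(The arithmetic, for the record:)
# ~3-6 MB/s reported by fio vs ~177 MB/s hitting the vdev
echo "scale=1; 177 / 3.5" | bc   # ~50x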
fio --filename=/zfstest/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=10000M --group_reporting --name=file1
write: io=10000MB, bw=114634KB/s, iops=28658, runt= 89328msec
the drive was writing between 100 and 200MB/s
this is the second run:
write: io=10000MB, bw=137612KB/s, iops=34403, runt= 74412msec
interestingly, the drive wasn't writing that much more (maybe 20MB/s more on average)
I'll continue with ZVOLs from now on:
zfs create -V 51700M -o volblocksize=4k zfstest/zvitest
NAME USED AVAIL REFER MOUNTPOINT
zfstest 53.7G 98.2M 96K /zfstest
I'd call it full
fio --filename=/dev/zvol/zfstest/zvitest --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --group_reporting --name=file1
^ dropped to 0 IOPS pretty much immediately because it still buffers
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 617M 54.9G 0 193 0 6.35M
sdh 617M 54.9G 0 193 0 6.35M
---------- ----- ----- ----- ----- ----- -----
iostat reports the same, with 100% utilization; system load has risen to ~30 while the CPUs are doing pretty much nothing
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 1.02G 54.5G 0 0 0 0
sdh 1.02G 54.5G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
zfs destroy zfstest/zvitest
^ has been running for a good minute now, doing nothing
... [ 8901.588042] blk_update_request: I/O error, dev sdh, sector 25367848
I just killed the SSD :-D
I'll grab another and redo the test.
If anybody minds a running commentary in there, let me know... :-)
Now testing with a larger drive (480G)
fio --filename=/dev/zvol/zfstest/zvitest --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [w(1)] [47.1% done] [0KB/33676KB/0KB /s] [0/8419/0 iops] [eta 01h:26m:15s]
(oscillates between 4-10K)
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 283G 161G 3.92K 6.46K 32.1M 220M
sdh 283G 161G 3.92K 6.46K 32.1M 220M
---------- ----- ----- ----- ----- ----- -----
The ZVOL has volblocksize=8k (sorry, forgot to change that). Running fio with bs=8k just increases fio-reported throughput; the rest stays the same (IOPS, actual drive writes...).
The pool is not full.
Still, the write amplification is pretty terrible: >5x instead of the ~2x expected from the volblocksize mismatch.
Is ZFS trying to saturate the drives even if it doesn't have to? (In other words, is it just coalescing because it has to on slower drives?) That would explain the terrible numbers I see when using the EV3 - it's essentially a ramdisk, tops out at ~800K IOPS and 6GB/s, and it wasn't even saturated before, while the SSD now shows 95% utilization.
But still: 100 IOPS of 1-byte writes creating 1.5GB/s of device writes doesn't sound right.
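(One way to see how much of this is the aggregator vs. metadata would be the request-size and queue breakdowns from zpool iostat; I haven't captured them for these runs, but for reference:)
zpool iostat -r ev3 5
zpool iostat -q ev3 5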
I don't see a bug here. I suggest moving this discussion to the mailing list; a bug tracker is not the right tool for discussions.
I don't think this behaviour is either expected or right.
There's no scenario in which a 100B/s application write workload should trigger 1.5GB/s of real, sustained writes on the drive.
If this is documented somewhere, or can be explained, then I'm open to suggestions; until then, with all due respect, I consider it a bug.
Btw, this is all happening on my desktop/lab machine, so I can provide whatever data is needed, do whatever debugging would help, and try whatever you suggest. Just point me in the right direction...
I am facing the same problem.
# Create a 1 GB zpool on /dev/sdc1 and a small 64M zvol
[root@blackhole ~]# zpool create tank /dev/sdc1
[root@blackhole ~]# zfs create tank/vol1 -V 64M -b 64k
# First pass, to fill the small volume
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 0.711457 s, 94.3 MB/s
# Second pass: being an overwrite, it is slower than the first pass; still, performance remains quite good
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 0.975994 s, 68.8 MB/s
# zpool and zfs status
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 64.7M 943M - 2% 6% 1.00x ONLINE -
[root@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 66.4M 814M 96K /tank
tank/vol1 66M 816M 64.1M -
# Create a big but completely unused volume
[root@blackhole ~]# zfs create tank/vol2 -V 750M -b 64k
# zpool shows no changes at all (nothing was written, so nothing was allocated)
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 65.6M 942M - 2% 6% 1.00x ONLINE -
# zfs list reports the lower total available space, but that's all (no real space was allocated, as we found with zpool list)
[root@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 821M 58.5M 96K /tank
tank/vol1 66M 60.5M 64.1M -
tank/vol2 755M 813M 56K -
# WRITE SPEED ON THE FIRST, SMALL VOLUME IS NOW VERY LOW...
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 70.5518 s, 951 kB/s
# ...but zpool list shows only 6% utilization! (note: fragmentation is way higher than on the previous run, but it alone does not explain the extreme dip in write speed)
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 66.7M 941M - 21% 6% 1.00x ONLINE -
# O_DIRECT writes, albeit slower than normal, are not so bad (as shown by zpool iostat -r, the zfs aggregator is working much better in this case)
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k oflag=direct
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 2.93807 s, 22.8 MB/s
# Let's try converting the second, big volume to a sparse device. As shown by zpool list, nothing really changed on disk (as expected); on the other side, zfs list now shows much higher free space
[root@blackhole ~]# zfs set refreservation=none tank/vol2
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 66.9M 941M - 21% 6% 1.00x ONLINE -
[root@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 68.4M 812M 96K /tank
tank/vol1 66M 814M 64.1M -
tank/vol2 56K 812M 56K -
# Write speed is now back to normal levels!
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 1.09085 s, 61.5 MB/s
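(For comparison, the big volume could also have been created sparse from the start, which avoids the refreservation entirely; this should be equivalent to the refreservation=none change above. The name is hypothetical:)
zfs create -s -V 750M -b 64k tank/vol2_sparse   # -s = sparse volume, no refreservation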
To recap:
- non-sparse volumes, despite not really eating into the metaslab free space (I checked them with zdb -mmm), seem to badly interact with space allocation. It seems that they somehow reserve real space at metaslab allocation time, greatly influencing the vdev allocation policy;
- as checked with zdb -mmm, all metaslabs have plenty of free space. This means that, as far as I know, we are not facing a first-fit vs best-fit allocation policy change;
- when over ~70% of total space (as shown by zfs list) is allocated (but not really used/consumed!), write speed starts to decrease. When crossing ~80%, each additional allocation greatly lowers write speed;
- O_DIRECT writes seem much less affected (albeit not completely) by the slowdown, probably because the zfs IO aggregator is much more effective in that case. They even lower total pool fragmentation...
Some questions:
- it is my understanding that sparse and non-sparse volumes differ only in the refreservation property, but are absolutely the same on-disk (as shown by zdb). Is this true? If so, why do I see dramatically different behavior?
The entire thread, with some other data (and a calculated write amplification of 50X), can be found here: http://list.zfsonlinux.org/pipermail/zfs-discuss/2017-September/029292.html
Well, another interesting test run...
# This is what seems an empty, healthy pool (0% FRAG and CAP)
[root@localhost ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.94G 64.4M 7.87G - 0% 0% 1.00x ONLINE -
# But write speed is real slow... first pass:
[root@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 14.15 s, 4.7 MB/s
# Second pass
[root@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 30.976 s, 2.2 MB/s
Why such a low write speed? Well, have a look at "zfs list":
[root@localhost ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 7.65G 45.5M 24K /tank
tank/vol1 66M 47.5M 64.0M -
tank/vol2 7.58G 7.62G 12K -
As you can see, I created a big "vol2" volume that takes up most of the space. In short, it seems that when available/allocatable space falls below a 256-512 MB threshold (experimentally determined), performance begins to suffer.
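(To see exactly how much space the reservations are holding back, the standard properties are enough, e.g.:)
zfs list -r -o name,used,avail,refer,refreservation,usedbyrefreservation tank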
What surprises me is that, as far as I know, non-sparse volumes do not really allocate anything on-disk. In other words, there is no true "pre-allocated" file/volume on ZFS.
Does anyone have a plausible explanation of what's going on here?
Thanks.
@zviratko You are seeing high write amplification when doing random writes, as well as slower performance than expected. The write amplification is likely due to indirect blocks. E.g. with recordsize=4k, every random 4k write will need to write 4KB of user data, plus at least 2x 128K (before compression) level-1 indirect blocks. Depending on your file size, level-2 indirect block writes may not "amortize out" over the txg either. So 32x inflation would be expected. In addition to this, if the indirect blocks are not all cached, then you will need to read the indirect block before making the modification, which has a very adverse effect on write latency.
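A back-of-the-envelope sketch of why the indirect blocks dominate (assuming the default 128K indirect block size and 128-byte block pointers):
echo "128*1024/128" | bc    # 1024 block pointers per L1 indirect block
echo "1024*4/1024" | bc     # so at recordsize=4k, one L1 covers only 4 MB of file
# A 15G file therefore has ~3840 distinct L1 blocks; sparse random 4K writes rarely
# share an L1 within a txg, and each one drags a full (128K pre-compression)
# indirect-block rewrite along with it.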
@shodanshok You are seeing poor write performance when there is not much space available to the dataset you're writing to. This problem is unrelated to the one originally reported here; in the future please file a separate issue to discuss it.
The behavior you're seeing is because ZFS is trying hard to not run you out of space. When the dataset you're writing to is low on space, we need to periodically flush a txg to check how much space is really used, to ensure that we don't go over the prescribed amount of space. Note that the space I'm talking about is reported by "zfs list" (which may be constrained by quotas and reservations). It has nothing to do with "zpool list" (which may have tons of free space).
You can reduce the impact of this by reducing spa_asize_inflation.
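On Linux this is a zfs module parameter; a sketch of checking and lowering it at runtime (assuming your build exposes it as writable; the default is 24):
cat /sys/module/zfs/parameters/spa_asize_inflation     # default: 24
echo 6 > /sys/module/zfs/parameters/spa_asize_inflation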
@ahrens a lot of the write amplification could be explained, but I think I demonstrated that a lot of it cannot.
Explain this: how do 100 IOPS of 1-byte writes turn into 1GB/s of sustained writes to the pool? (And I think that was with 4K volblocksize.) That cannot ever be right.
@zviratko I'm not sure what's going on there. Do you have more detailed information about that run? The closest output I see above shows 894 IOPS and 420MB/s. This could be explained if we have to write 2 levels of indirect blocks for each data block: 894 IOPS × (4KB + 4 × 128KB) = 460MB/s.
@ahrens unfortunately it's been some time since I ran these tests, but I think the scenario was this:
/dev/ev3mema is an NVRAM device (presented as an ordinary block device), 16G in size, essentially a RAMdisk and very, very fast.
zpool create -o ashift=12 -O recordsize=4k -O compression=off -O checksum=off ev3 /dev/ev3mema
fio --filename=/ev3/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=15G --group_reporting --name=test
^ notice it fills most of the pool size (15GB out of 16GB), which is what triggers the problem
The pool/file might have been "dirty" before (other tests might have run).
I also tested some of these scenarios (can't vouch for this one) with other drives - a 4TB SATA drive and several SSDs - and the faster the drive, the worse it gets. Could it be some sort of buffer thrashing where incomplete data gets continuously written?
For the record, we're running a full-flash ZFS pool at work and we are seeing some unexplained amplification there, which is what triggered my investigation; it performs significantly differently in the lab than when tested concurrently with the production workload (which eats whatever it gets) - the production pool is also severely overprovisioned (maybe five-fold) and quite fast.
Unfortunately I don't have the NVRAM drive available anymore, but I'll see if it's reproducible with an ordinary ramdisk.
Ah, I didn't realize you had so little free space. Yes, I suspect that the small, 93% full pool is seriously exacerbating the issue. ZFS will only be able to write a very small amount of user data per txg, so you could be hitting 2 or even 3 levels of indirect blocks per user write. There are probably a bunch of other pieces of metadata that don't amortize well when the pool has <1GB free space (and you have <1GB "available" at the DMU level, so you have @shodanshok's problem as well).
Could be, but going from 100B/s to 1,000,000,000B/s means the amplification is 10,000,000x. Not a sane number in any case.
To be fair, it's obviously at least 400KB/s with a 4k blocksize, but that's still ~2500x... :)
I am facing the same problem: 8-10x write amplification with ZoL (volblocksize=8K) and fio (randwrite, bs=8K, direct=1).