Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 17.04
Linux Kernel | 4.10.0-33-generic
Architecture | x86_64
ZFS Version | 0.7.0
SPL Version | 0.7.0
Pool created:
zpool create -o ashift=12 -O recordsize=4k -O compression=off -O checksum=off ev3 /dev/ev3mema
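(For anyone reproducing this, the effective settings can be double-checked with standard commands; nothing here is specific to my setup:)
zfs get recordsize,compression,checksum ev3
zpool get ashift ev3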
Then I run fio:
fio --filename=/ev3/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=15G --group_reporting --name=test
Performance decreased over time; now (after a few minutes):
Jobs: 1 (f=1): [w(1)] [48.6% done] [0KB/3576KB/0KB /s] [0/894/0 iops] [eta 11m:18s]
zpool iostat -v ev3 1
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
ev3 15.2G 739M 2.62K 69.2K 15.9M 420M
ev3mema 15.2G 739M 2.62K 69.2K 15.9M 420M
---------- ----- ----- ----- ----- ----- -----
If I restart fio, it does between 100 and 2000 IOPS (though it now seems to mostly stay in the 100-200 range), while zfs writes >1GB/s to the device.
This makes no sense to me.
With 128k iosize, IOPS stay around ~1300 and zfs writes ~300MB/s to the device.
With 1B iosize, IOPS drop to 100 and zfs writes 1.2GB/s to the drive. W00T.
With 4K iosize and iodepth=1, fio starts at ~10K IOPS but drops to 700 while zfs writes 1.6GB/s.
This is without removing the /ev3/fiotest file; it is left dirty.
Also, z_wr_iss threads are eating ~40% of all CPUs.
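(For reference, one way to watch those threads, nothing fancy:)
ps -eLo pcpu,comm --sort=-pcpu | grep z_wr_iss | head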
If I create a ZVOL and run the same fio benchmark on that, it behaves as it should: 150K IOPS and 600MB/s. On subsequent runs (without recreating the ZVOL) it isn't nearly as stable and performs about 30% worse, but since fio is doing buffered writes I see ZFS writing to the device long after fio itself has finished, and I'm guessing there's some amplification as well (200%?).
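(For clarity, the ZVOL runs were set up roughly along these lines - the name and size here are illustrative, not the exact commands I used:)
zfs create -V 15G -o volblocksize=4k ev3/zvtest
fio --filename=/dev/zvol/ev3/zvtest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --group_reporting --name=test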
Switching fio to direct=1 and bs=4k shows write amplification of 1000%:
write: io=913540KB, bw=64970KB/s, iops=16242, runt= 14061msec
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
ev3 14.1G 1.81G 868 67.3K 434K 677M
ev3mema 14.1G 1.81G 868 67.3K 434K 677M
---------- ----- ----- ----- ----- ----- -----
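(Rough arithmetic behind the ~1000% figure, using the numbers from this run:)
# fio reported io=913540KB over runt=14061msec, while the pool sustained ~677MB/s
echo "scale=2; (677 * 14.061) / (913540 / 1024)" | bc   # ~10.7x, i.e. roughly 1000%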
On subsequent runs it rises even more (15MB/s vs 650MB/s).
With a larger, matched volblocksize=32k and iosize=32k, it looks stable at 200% write amplification (even with logbias=throughput, so even that doesn't make sense to me).
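(The setup for that test was along these lines; the volume name is illustrative:)
zfs create -V 14G -o volblocksize=32k ev3/zv32k
zfs set logbias=throughput ev3/zv32k
fio --filename=/dev/zvol/ev3/zv32k --ioengine=libaio --direct=1 --rw=randwrite --bs=32k --iodepth=128 --group_reporting --name=test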
FYI this is fio on the block device itself:
fio --filename=/dev/ev3mema --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --size=14G --group_reporting --name=file1
write: io=14336MB, bw=1840.4MB/s, iops=471118, runt= 7790msec
This is fio on ext4 using this block device:
fio --filename=/ev3ext4/fiotest --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --size=14G --group_reporting --name=file1
write: io=14336MB, bw=558071KB/s, iops=139517, runt= 26305msec
(block device shows ~600MB/s)
Am I missing something or is it seriously broken?
It might be related to the file or ZVOL taking up almost all space in the pool (the device is only 16G), but it still isn't pretty.
I'm testing the same scenarios with an ordinary 60GB SSD now and will test smaller data sizes on the EV3 as well.
So...
Yes, it is related to free space, at least partly:
fio --filename=/zfstest/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=54700M --group_reporting --name=file1
700-1500 IOPS (~3-6MB/s), but:
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 53.6G 1.86G 0 18.6K 0 177M
sdh 53.6G 1.86G 0 18.6K 0 177M
---------- ----- ----- ----- ----- ----- -----
(the above is the first run and buffered; rewriting it would take a looong time and I hope it's preallocated already...)
so let's call it a write amplification of 50x
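(The arithmetic, for the record:)
# ~3-6 MB/s reported by fio vs ~177 MB/s hitting the vdev
echo "scale=1; 177 / 3.5" | bc   # ~50x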
fio --filename=/zfstest/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=10000M --group_reporting --name=file1
write: io=10000MB, bw=114634KB/s, iops=28658, runt= 89328msec
the drive was writing between 100 and 200MB/s
this is the second run:
write: io=10000MB, bw=137612KB/s, iops=34403, runt= 74412msec
interestingly, the drive wasn't writing that much more (maybe 20MB/s more on average)
I'll continue with ZVOLs from now on:
zfs create -V 51700M -o volblocksize=4k zfstest/zvitest
NAME USED AVAIL REFER MOUNTPOINT
zfstest 53.7G 98.2M 96K /zfstest
I'd call it full
fio --filename=/dev/zvol/zfstest/zvitest --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --group_reporting --name=file1
^ dropped to 0 IOPS pretty much immediately because it still buffers
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 617M 54.9G 0 193 0 6.35M
sdh 617M 54.9G 0 193 0 6.35M
---------- ----- ----- ----- ----- ----- -----
iostat reports the same, with 100% utilization; system load has risen to ~30 while the CPUs are doing pretty much nothing
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 1.02G 54.5G 0 0 0 0
sdh 1.02G 54.5G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
zfs destroy zfstest/zvitest
^ has been running for a good minute now, doing nothing
... [ 8901.588042] blk_update_request: I/O error, dev sdh, sector 25367848
I just killed the SSD :-D
I'll grab another and redo the test.
If anybody minds a running commentary in there, let me know... :-)
Now testing with a larger drive (480G)
fio --filename=/dev/zvol/zfstest/zvitest --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=128 --group_reporting --name=file1
file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [w(1)] [47.1% done] [0KB/33676KB/0KB /s] [0/8419/0 iops] [eta 01h:26m:15s]
(oscillates between 4-10K)
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
zfstest 283G 161G 3.92K 6.46K 32.1M 220M
sdh 283G 161G 3.92K 6.46K 32.1M 220M
---------- ----- ----- ----- ----- ----- -----
The ZVOL has volblocksize=8k (sorry, forgot to change that). Running fio with bs=8k just increases fio-reported throughput; the rest stays the same (IOPS, actual drive writes...).
The pool is not full.
Still, the write amplification is pretty terrible: >5x instead of the ~2x expected from the volblocksize mismatch.
Is ZFS trying to saturate the drives even if it doesn't have to? (In other words, is it just coalescing because it has to on slower drives?) That would explain the terrible numbers I see when using the EV3 - it's essentially a ramdisk, tops out at ~800K IOPS and 6GB/s, and it wasn't even saturated before, while the SSD now shows 95% utilization.
But still: 100 IOPS of 1-byte writes creating 1.5GB/s of device writes doesn't sound right.
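(One way to see how much of this is the aggregator vs. metadata would be the request-size and queue breakdowns from zpool iostat; I haven't captured them for these runs, but for reference:)
zpool iostat -r ev3 5
zpool iostat -q ev3 5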
I don't see a bug here. I suggest moving this discussion to the mailing list; a bug tracker is not the right tool for discussions.
I don't think this behaviour is either expected or right.
There's no scenario in which a 100B/s application write workload should trigger 1.5GB/s of real, sustained writes on the drive.
If this is documented somewhere, or can be explained, then I'm open to suggestions; until then, with all due respect, I consider it a bug.
Btw, this is all happening on my desktop/lab machine, so I can provide whatever data is needed, do whatever debugging would help, and try whatever you suggest. Just point me in the right direction...
I am facing the same problem.
# Create a 1 GB zpool on /dev/sdc1 and a small 64M zvol
[root@blackhole ~]# zpool create tank /dev/sdc1
[root@blackhole ~]# zfs create tank/vol1 -V 64M -b 64k
# First pass, to fill the small volume
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 0.711457 s, 94.3 MB/s
# Second pass: being an overwrite, it is slower than the first pass; still, performance remains quite good
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 0.975994 s, 68.8 MB/s
# zpool and zfs status
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 64.7M 943M - 2% 6% 1.00x ONLINE -
[root@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 66.4M 814M 96K /tank
tank/vol1 66M 816M 64.1M -
# Create a big but completely unused volume
[root@blackhole ~]# zfs create tank/vol2 -V 750M -b 64k
# zpool shows no changes at all (nothing was written, so nothing was allocated)
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 65.6M 942M - 2% 6% 1.00x ONLINE -
# zfs list reports the lower total available space, but that's all (no real space was allocated, as we found with zpool list)
[root@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 821M 58.5M 96K /tank
tank/vol1 66M 60.5M 64.1M -
tank/vol2 755M 813M 56K -
# WRITE SPEED ON THE FIRST, SMALL VOLUME IS NOW VERY LOW...
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 70.5518 s, 951 kB/s
# ...but zpool list shows only 6% utilization! (note: fragmentation is way higher than on the previous run, but it alone does not explain the extreme dip in write speed)
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 66.7M 941M - 21% 6% 1.00x ONLINE -
# O_DIRECT writes, albeit slower than normal, are not so bad (as shown by zpool iostat -r, the zfs aggregator is working much better in this case)
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k oflag=direct
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 2.93807 s, 22.8 MB/s
# Let's try converting the second, big volume to a sparse device. As shown by zpool list, nothing really changed on disk (as expected); on the other side, zfs list now shows much higher free space
[root@blackhole ~]# zfs set refreservation=none tank/vol2
[root@blackhole ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1008M 66.9M 941M - 21% 6% 1.00x ONLINE -
[root@blackhole ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 68.4M 812M 96K /tank
tank/vol1 66M 814M 64.1M -
tank/vol2 56K 812M 56K -
# Write speed is now back to normal levels!
[root@blackhole ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 1.09085 s, 61.5 MB/s
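(For comparison, the big volume could also have been created sparse from the start, which avoids the refreservation entirely; this should be equivalent to the refreservation=none change above. The name is hypothetical:)
zfs create -s -V 750M -b 64k tank/vol2_sparse   # -s = sparse volume, no refreservation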
To recap:
- non-sparse volumes, despite not really eating into the metaslab free space (I checked them with zdb -mmm), seem to badly interact with space allocation. It seems that they somehow reserve real space at metaslab allocation time, greatly influencing the vdev allocation policy;
- as checked with zdb -mmm, all metaslabs have plenty of free space. This means that, as far as I know, we are not facing a first-fit vs best-fit allocation policy change;
- when over ~70% of total space (as shown by zfs list) is allocated (but not really used/consumed!), write speed starts to decrease. When crossing ~80%, each additional allocation greatly lowers write speed;
- O_DIRECT writes seem much less affected (albeit not completely) by the slowdown, probably because the zfs IO aggregator is much more effective in that case. They even lower total pool fragmentation...
Some questions:
- it is my understanding that sparse and non-sparse volumes differ only in the refreservation property, but are absolutely the same on-disk (as shown by zdb). Is this true? If so, why do I see dramatically different behavior?
The entire thread, with some other data (and a calculated write amplification of 50X), can be found here: http://list.zfsonlinux.org/pipermail/zfs-discuss/2017-September/029292.html
Well, another interesting test run...
# This is what seems an empty, healthy pool (0% FRAG and CAP)
[root@localhost ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.94G 64.4M 7.87G - 0% 0% 1.00x ONLINE -
# But write speed is real slow... first pass:
[root@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 14.15 s, 4.7 MB/s
# Second pass
[root@localhost ~]# dd if=/dev/zero of=/dev/zvol/tank/vol1 bs=4k
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 30.976 s, 2.2 MB/s
Why such a low write speed? Well, have a look at "zfs list":
[root@localhost ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 7.65G 45.5M 24K /tank
tank/vol1 66M 47.5M 64.0M -
tank/vol2 7.58G 7.62G 12K -
As you can see, I created a big "vol2" volume that takes up most of the space. In short, it seems that when available/allocatable space falls below a 256-512 MB threshold (experimentally determined), performance begins to suffer.
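(To see exactly how much space the reservations are holding back, the standard properties are enough, e.g.:)
zfs list -r -o name,used,avail,refer,refreservation,usedbyrefreservation tank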
What surprises me is that, as far as I know, non-sparse volumes do not really allocate anything on-disk. In other words, there is no true "pre-allocated" file/volume on ZFS.
Does anyone have a plausible explanation of what's going on here?
Thanks.
@zviratko You are seeing high write amplification when doing random writes, as well as slower performance than expected. The write amplification is likely due to indirect blocks. E.g. with recordsize=4k, every random 4k write will need to write 4KB of user data, plus at least 2x 128K (before compression) level-1 indirect blocks. Depending on your file size, level-2 indirect block writes may not "amortize out" over the txg either. So 32x inflation would be expected. In addition to this, if the indirect blocks are not all cached, then you will need to read the indirect block before making the modification, which has a very adverse effect on write latency.
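A back-of-the-envelope sketch of why the indirect blocks dominate (assuming the default 128K indirect block size and 128-byte block pointers):
echo "128*1024/128" | bc    # 1024 block pointers per L1 indirect block
echo "1024*4/1024" | bc     # so at recordsize=4k, one L1 covers only 4 MB of file
# A 15G file therefore has ~3840 distinct L1 blocks; sparse random 4K writes rarely
# share an L1 within a txg, and each one drags a full (128K pre-compression)
# indirect-block rewrite along with it.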
@shodanshok You are seeing poor write performance when there is not much space available to the dataset you're writing to. This problem is unrelated to the one originally reported here; in the future please file a separate issue to discuss it.
The behavior you're seeing is because ZFS is trying hard to not run you out of space. When the dataset you're writing to is low on space, we need to periodically flush a txg to check how much space is really used, to ensure that we don't go over the prescribed amount of space. Note that the space I'm talking about is reported by "zfs list" (which may be constrained by quotas and reservations). It has nothing to do with "zpool list" (which may have tons of free space).
You can reduce the impact of this by reducing spa_asize_inflation.
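On Linux this is a zfs module parameter; a sketch of checking and lowering it at runtime (assuming your build exposes it as writable; the default is 24):
cat /sys/module/zfs/parameters/spa_asize_inflation     # default: 24
echo 6 > /sys/module/zfs/parameters/spa_asize_inflation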
@ahrens a lot of the write amplification could be explained, but I think I demonstrated that a lot of it cannot.
Explain this: how do 100 IOPS of 1-byte writes turn into 1GB/s of sustained writes to the pool? (And I think that was with 4K volblocksize.) That cannot ever be right.
@zviratko I'm not sure what's going on there. Do you have more detailed information about that run? The closest output I see above shows 894 IOPS and 420MB/s. This could be explained if we have to write 2 levels of indirect blocks for each data block: 894 IOPS × (4KB + 4 × 128KB) = 460MB/s.
@ahrens unfortunately it's been some time since I ran these tests, but I think the scenario was this:
/dev/ev3mema is an NVRAM device (presented as an ordinary block device), 16G in size, essentially a RAMdisk and very, very fast.
zpool create -o ashift=12 -O recordsize=4k -O compression=off -O checksum=off ev3 /dev/ev3mema
fio --filename=/ev3/fiotest --ioengine=libaio --direct=0 --rw=randwrite --bs=4k --iodepth=128 --size=15G --group_reporting --name=test
^ notice it fills most of the pool size (15GB out of 16GB), which is what triggers the problem
The pool/file might have been "dirty" before (other tests might have run).
I also tested some of these scenarios (can't vouch for this one) with other drives - a 4TB SATA drive and several SSDs - and the faster the drive, the worse it gets. Could it be some sort of buffer thrashing where incomplete data gets continuously written?
For the record, we're running a full-flash ZFS pool at work and we are seeing some unexplained amplification there, which is what triggered my investigation; it performs significantly differently in the lab than when tested concurrently with the production workload (which eats whatever it gets) - the production pool is also severely overprovisioned (maybe five-fold) and quite fast.
Unfortunately I don't have the NVRAM drive available anymore, but I'll see if it's reproducible with an ordinary ramdisk.
Ah, I didn't realize you had so little free space. Yes, I suspect that the small, 93% full pool is seriously exacerbating the issue. ZFS will only be able to write a very small amount of user data per txg, so you could be hitting 2 or even 3 levels of indirect blocks per user write. There are probably a bunch of other pieces of metadata that don't amortize well when the pool has <1GB free space (and you have <1GB "available" at the DMU level, so you have @shodanshok's problem as well).
Could be, but going from 100B/s to 1,000,000,000B/s means the amplification is 10,000,000x. Not a sane number in any case.
To be fair, it's obviously at least 400KB/s with a 4k blocksize, but that's still ~2500x... :)
I am facing the same problem: 8-10x write amplification with ZoL (volblocksize=8K) and fio (randwrite, bs=8K, direct=1).