ZFS filesystem IOPS are slower than zvol: 8K random write gets ~18K IOPS on a zvol but only ~1.2K IOPS on a ZFS filesystem

Created on 19 Apr 2017 · 8 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | Red Hat Enterprise Linux Server
Distribution Version | 7.2 (Maipo)
Linux Kernel | 3.10.0-327.el7.x86_64
Architecture |
ZFS Version | 0.6.5.9-1.el7
SPL Version | spl-0.6.5.9-1.el7

Describe the problem you're observing

ZFS filesystem random write IOPS are much lower than on a zvol.

Describe how to reproduce the problem

Two Intel S3500 300 GB SSDs are used to make the zpool:
zpool create test -o ashift=13 /dev/sda /dev/sdb -f
zpool set listsnapshots=on test
zfs set primarycache=metadata test
zfs set atime=off test

$cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_shrink_shift=14
options zfs zfs_prefetch_disable=1
options zfs zfs_nocacheflush=1
options zfs zfs_dirty_data_max=2147483648
options zfs zfs_dirty_data_max_max=4294967296
options zfs zfs_vdev_async_write_max_active=15
options zfs zfs_vdev_async_write_min_active=5
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20

zfs filesystem
zfs create test/test2
[root@localhost test/test2]
$fio -ioengine=libaio -bs=8k -thread -rw=randwrite -filename=test -name="test" -size=10G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting
test: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 10 threads
Jobs: 10 (f=10): [w(10)] [100.0% done] [0KB/10600KB/0KB /s] [0/1325/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=10): err= 0: pid=35140: Tue Apr 11 16:44:20 2017
write: io=618696KB, bw=10311KB/s, iops=1288, runt= 60006msec
slat (usec): min=13, max=35634, avg=7753.37, stdev=4485.33
clat (usec): min=0, max=40, avg= 1.49, stdev= 0.87
lat (usec): min=14, max=35636, avg=7755.51, stdev=4485.43
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 2],
| 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 2],
| 99.00th=[ 3], 99.50th=[ 4], 99.90th=[ 13], 99.95th=[ 15],

zfs vol
zfs create -b 4K -V 100G test/test
[root@backup94 /root/]
$fio -ioengine=libaio -bs=8k -direct=1 -thread -rw=randwrite -filename=/dev/test/test -name="test" -size=100G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting
test: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 10 threads
Jobs: 10 (f=10): [w(10)] [100.0% done] [0KB/167.7MB/0KB /s] [0/21.5K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=10): err= 0: pid=42542: Fri Apr 14 11:11:26 2017
write: io=8857.9MB, bw=151171KB/s, iops=18896, runt= 60001msec
slat (usec): min=4, max=97, avg=10.33, stdev= 4.12
clat (usec): min=32, max=2673, avg=516.47, stdev=185.98
lat (usec): min=39, max=2690, avg=526.99, stdev=186.23
clat percentiles (usec):
| 1.00th=[ 118], 5.00th=[ 197], 10.00th=[ 398], 20.00th=[ 430],
| 30.00th=[ 450], 40.00th=[ 462], 50.00th=[ 478], 60.00th=[ 494],
| 70.00th=[ 516], 80.00th=[ 588], 90.00th=[ 772], 95.00th=[ 916],
| 99.00th=[ 1144], 99.50th=[ 1224], 99.90th=[ 1368], 99.95th=[ 1400],
| 99.99th=[ 1448]

Include any warning/errors/backtraces from the system logs


All 8 comments

It looks like the difference is actually more than a factor of 10: 1288 IOPS vs 18,896.

Did you leave recordsize at the default 128k?

If you want to make the test comparable, create a new filesystem and set the recordsize to 8k (zfs set recordsize=8k test/test3). Also, it doesn't make much sense to set ashift=13 (8K) and then force a zvol blocksize of 4K.
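
A minimal sketch of that comparison, assuming the pool is still named test and datasets use the default mountpoints (so test/test3 mounts at /test/test3):

zfs create -o recordsize=8k test/test3
cd /test/test3
$fio -ioengine=libaio -bs=8k -thread -rw=randwrite -filename=test -name=test -size=10G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting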

zpool create test /dev/sda /dev/sdb -f
zpool set listsnapshots=on test
zfs set primarycache=metadata test
zfs set atime=off test

zpool ashift value:
test    ashift    0    default

zvol
zfs create -V 100G test/test
test/test volblocksize 8K -
$fio -ioengine=libaio -bs=8k -direct=1 -thread -rw=randwrite -filename=/dev/test/test -name=test -size=10G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting
test: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 10 threads
Jobs: 1 (f=1): [_(9),w(1)] [25.2% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 03m:13s]
test: (groupid=0, jobs=10): err= 0: pid=47596: Wed Apr 19 13:20:15 2017
write: io=25476MB, bw=434776KB/s, iops=54346, runt= 60001msec
slat (usec): min=13, max=9034, avg=181.59, stdev=91.16
clat (usec): min=0, max=2148, avg= 0.91, stdev= 1.40
lat (usec): min=14, max=9039, avg=182.80, stdev=91.38
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 2],
| 99.00th=[ 2], 99.50th=[ 2], 99.90th=[ 4], 99.95th=[ 6],
| 99.99th=[ 9]
bw (KB /s): min=16016, max=84384, per=9.99%, avg=43419.55, stdev=15272.75
lat (usec) : 2=91.97%, 4=7.91%, 10=0.11%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 1000=0.01%
lat (msec) : 4=0.01%
cpu : usr=1.31%, sys=23.45%, ctx=2918039, majf=0, minf=22
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=3260872/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

zfs filesystem
zfs create test/test2
test/test2 recordsize 128K default
$fio -ioengine=libaio -bs=8k -thread -rw=randwrite -filename=test -name="test" -size=10G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting
test: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 10 threads
Jobs: 10 (f=10): [w(10)] [100.0% done] [0KB/11264KB/0KB /s] [0/1408/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=10): err= 0: pid=47299: Wed Apr 19 13:18:57 2017
write: io=687592KB, bw=11458KB/s, iops=1432, runt= 60008msec
slat (usec): min=13, max=31759, avg=6976.94, stdev=3807.26
clat (usec): min=0, max=27, avg= 1.39, stdev= 0.67
lat (usec): min=13, max=31761, avg=6978.79, stdev=3807.37
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 2],
| 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 9], 99.95th=[ 12],
| 99.99th=[ 16]
bw (KB /s): min= 800, max= 1632, per=10.02%, avg=1148.42, stdev=132.45
lat (usec) : 2=63.19%, 4=36.57%, 10=0.14%, 20=0.10%, 50=0.01%
cpu : usr=0.07%, sys=1.52%, ctx=172437, majf=0, minf=18
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=85949/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=687592KB, aggrb=11458KB/s, minb=11458KB/s, maxb=11458KB/s, mint=60008msec, maxt=60008msec

What are you doing? You still have 128k recordsize set...

zfs create test/test2
test/test2 recordsize 128K default

You need to run:

zfs set recordsize=8k test/test2
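
Keep in mind that recordsize only applies to data written after the property is changed; the 10G file that fio already laid out keeps its 128K blocks, so delete it (or use a new filename) before re-running the test. A quick sanity check, assuming the default /test/test2 mountpoint:

zfs get recordsize test/test2
rm /test/test2/test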

Why did you set ashift=13? On modern hardware, you want ashift=12. Making it larger often has no benefit. As for the zpool ashift, that just means you did not override it. The actual value is per vdev and can be obtained from zdb -l /path/to/zfs/vdev/partition.
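
For example, a sketch of reading it back from a vdev label (the partition path is an assumption; when zpool create is given a whole disk it normally puts the data on the first partition, e.g. /dev/sda1):

zdb -l /dev/sda1 | grep ashift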

Anyway, if you care about 8K random IO, you want to do:

zfs create -V 100G test/test
zfs create -o recordsize=8k test/test2

Do not do zfs set primarycache=metadata test. I realize that you likely want to test uncached performance, but in reality, there is always some amount of cache. You are better off making sure that you are testing random IO on an amount of space that is double your RAM. That will give a partially cached result that is more relevant to the real world.
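
A sketch of what that looks like with the same fio job; the 64G size is an assumption for a machine with roughly 32 GB of RAM, so scale it to about twice your own:

$fio -ioengine=libaio -bs=8k -thread -rw=randwrite -filename=test -name=test -size=64G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting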

There are situations where zfs set primarycache=metadata test might improve performance when the metadata usage of the indirect block tree exceeds 3/8 of system RAM (assuming ARC defaults) and you only care about random IO. There is a theoretical performance cliff where the indirect block tree does not fit into RAM and performance on random IO does suffer when close to it. In one case at work where we were close (and contention for ARC meant the indirect block tree was being evicted), I was able to improve IOPS from 23,000 to 26,000 solely by doing primarycache=metadata. That being said, I doubt that you are anywhere near it.

Since I am on the topic, the size of the indirect block tree is primarily that of the lowest level of indirect blocks. That can be determined by dividing the size of the zvol (or file) by the volblocksize (or recordsize) to get the number of direct blocks, dividing by 128 to get the number of L1 indirect blocks and multiplying by 16K. The space usage of the upper levels of the indirect block tree can be obtained by dividing by 128 each time. Compression also plays a role when dealing with on-disk storage, but for in-memory cache, this should be accurate. I wrote this on the off chance that you have a situation where primarycache=metadata actually helps.
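
As a worked example of that arithmetic, take the 100G zvol above with an 8K volblocksize (uncompressed, in-memory sizes):

100 GiB / 8 KiB = 13,107,200 data blocks
13,107,200 / 128 = 102,400 L1 indirect blocks, x 16 KiB ≈ 1.6 GiB
102,400 / 128 = 800 L2 indirect blocks, x 16 KiB ≈ 12.5 MiB, and so on for the higher levels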

That being said, I suggest reading this:

http://open-zfs.org/wiki/Performance_tuning

I update it as I learn more ways of effectively improving ZFS performance. This is an area of active research for me as I have just started a job this week where my first goal is to raise ZFS performance on zvols. We probably should link to it from the wiki and possibly also from the man pages to help people trying to get the most out of ZFS on their hardware.

My remarks about ditto blocks appear to have been incorrect. I have removed them from my previous comment. This is an area that I have just started researching for work. The remarks on the indirect block tree are correct, though. I will update the OpenZFS wiki after I have fully studied this effect.

That being said, the wiki has been updated with links to the OpenZFS performance tuning and hardware pages:

https://github.com/zfsonlinux/zfs/wiki

We probably should move the links to more prominent locations, but this will suffice for the short term.

Thanks very much. The ZFS filesystem default recordsize is 128k, which is not good for random IO. After I changed the recordsize to 8k, random IO got much better; the IOPS are now very close to the zvol.

$zfs create -o recordsize=4k test/test2
$fio -ioengine=libaio -bs=8k -thread -rw=randwrite -filename=test -name="test" -size=10G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting
test: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 10 threads
test: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 10 (f=10): [w(10)] [100.0% done] [0KB/311.2MB/0KB /s] [0/39.9K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=10): err= 0: pid=56594: Thu Apr 20 09:26:20 2017
write: io=12484MB, bw=213061KB/s, iops=26632, runt= 60001msec
slat (usec): min=26, max=6563, avg=372.91, stdev=169.64
clat (usec): min=0, max=77, avg= 0.99, stdev= 0.60
lat (usec): min=26, max=6566, avg=374.19, stdev=169.88
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 2], 95.00th=[ 2],
| 99.00th=[ 2], 99.50th=[ 3], 99.90th=[ 5], 99.95th=[ 7],
| 99.99th=[ 10]
bw (KB /s): min=10560, max=59792, per=10.02%, avg=21340.53, stdev=9195.90
lat (usec) : 2=86.85%, 4=13.01%, 10=0.12%, 20=0.02%, 50=0.01%
lat (usec) : 100=0.01%
cpu : usr=0.63%, sys=36.62%, ctx=1895453, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1597984/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=12484MB, aggrb=213060KB/s, minb=213060KB/s, maxb=213060KB/s, mint=60001msec, maxt=60001msec

zfs vol iops test
$fio -ioengine=libaio -bs=8k -direct=1 -thread -rw=randwrite -filename=/dev/test/test -name=test -size=10G -iodepth=1 -numjobs=10 -runtime=60 -group_reporting
test: (g=0): rw=randwrite, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 10 threads
Jobs: 1 (f=1): [_(3),w(1),_(6)] [13.8% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 07m:23s]
test: (groupid=0, jobs=10): err= 0: pid=6404: Thu Apr 20 09:53:52 2017
write: io=13929MB, bw=237711KB/s, iops=29713, runt= 60001msec
slat (usec): min=14, max=3588, avg=333.24, stdev=165.66
clat (usec): min=0, max=91, avg= 1.22, stdev= 0.76
lat (usec): min=14, max=3590, avg=334.87, stdev=166.20
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 2],
| 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 8], 99.95th=[ 10],
| 99.99th=[ 14]
bw (KB /s): min= 9344, max=99328, per=10.03%, avg=23845.53, stdev=14192.41
lat (usec) : 2=72.53%, 4=27.10%, 10=0.31%, 20=0.06%, 50=0.01%
lat (usec) : 100=0.01%
cpu : usr=0.99%, sys=14.00%, ctx=1579274, majf=0, minf=15
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1782861/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=13929MB, aggrb=237710KB/s, minb=237710KB/s, maxb=237710KB/s, mint=60001msec, maxt=60001msec
