ZFS: Varying read amplification when dd uses small block sizes; it occupies bandwidth and decreases IOPS

Created on 27 Aug 2020 · 12 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7.7
Linux Kernel | 3.10.0-1062.18.1.el7.x86_64
Architecture | x86_64
ZFS Version | 0.8.4 and 0.7.13
SPL Version | 0.8.4 and 0.7.13

Describe the problem you're observing

Read amplification varies with the read block size (for block sizes less than 1 MiB).

I used the dd command with different bs values to read the same files (primarycache=none); I just wanted to see the impact when the cache is not hit, and the result surprised me.
There are many 4 KiB/8 KiB reads in my environment, so I ran the test with dd.

If you read a 1 MiB file with a 4 KiB block size, the read bandwidth reported by "zpool iostat -lv 1" reaches 256 MiB/s, a 256x overhead!
The test is in the zfs local posix layer. no network.

| recordsize=1M, reading a 1 MiB file | read bandwidth reported by zpool iostat |
|--------------------------------------|------------------------------------------|
| dd bs=4KiB    | 256 MiB/s |
| dd bs=8KiB    | 128 MiB/s |
| dd bs=16KiB   | 64 MiB/s |
| dd bs=32KiB   | 32 MiB/s |
| dd bs=64KiB   | 16 MiB/s |
| dd bs=128KiB  | 8 MiB/s |
| dd bs=1024KiB | 1 MiB/s |

Different ashift values, recordsizes, and read block sizes produce different results.
The table above shows only the performance impact; there is also a capacity impact.

zpool iostat count inaccuracy

E.g.:

recordsize=4K
ashift=9
dd bs=4KiB
test file size: 4 KiB

tank        1.27G  1.81T      7      0  3.98K      0  958us      -  958us      -    1us      -      -      -      -      -
  raidz2    1.27G  1.81T      7      0  3.98K      0  958us      -  958us      -    1us      -      -      -      -      -
    sdb1        -      -      0      0    509      0  196us      -  196us      -    1us      -      -      -      -      -
    sdc1        -      -      0      0    509      0  196us      -  196us      -    1us      -      -      -      -      -
    sdd1        -      -      0      0    509      0  196us      -  196us      -    1us      -      -      -      -      -
    sde1        -      -      0      0    509      0  196us      -  196us      -    1us      -      -      -      -      -
    sdf1        -      -      0      0    509      0  196us      -  196us      -    1us      -      -      -      -      -
    sdg1        -      -      0      0    509      0  196us      -  196us      -    1us      -      -      -      -      -
    sdh1        -      -      0      0    509      0    6ms      -    6ms      -    1us      -      -      -      -      -
    sdi1        -      -      0      0    509      0  196us      -  196us      -    3us      -      -      -      -      -
    sdj1        -      -      0      0      0      0      -      -      -      -      -      -      -      -      -      -
    sdk1        -      -      0      0      0      0      -      -      -      -      -      -      -      -      -      -

I can't understand why the per-disk read is only 509 bytes.
I used dd to read a 4096-byte file, yet the total shown is only 3.98K (KiB or KB?).
The counters seem imprecise.

Many thanks.

Describe how to reproduce the problem

Create zpool

   for i in {b..k}; do parted -s /dev/sd$i mklabel gpt; parted -s /dev/sd$i mkpart p1 2048s 200G; done
   modprobe zfs
   zpool create tank -o ashift=12 raidz2 /dev/sd{b..k}1
   zfs create tank/1024
   zfs set recordsize=1M tank/1024
   zfs set primarycache=none tank
   echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout

Create file

mkdir test_10_1M
for i in {0..9}
do
   dd if=/dev/zero of=test_10_1M/$(openssl rand -hex 8)_$i bs=1048576 count=1 oflag=sync &
   sleep 1
done

Read file one by one

cd /tank/1024/test_10_1M
 for i in ./*; do dd if=$i of=/dev/null bs=1M ; sleep 2 ; done
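
To see the amplification from the table above, a loop like the following can be used while "zpool iostat -lv 1" runs in a second terminal. This is only a sketch; it assumes the pool and the test files created in the steps above, and the block sizes are the ones from the table:

   cd /tank/1024/test_10_1M
   # Read the same 1 MiB files with different block sizes; with
   # primarycache=none every read misses the cache, so zpool iostat
   # shows how much data is actually fetched from disk.
   for bs in 4K 8K 16K 32K 64K 128K 1M; do
      echo "=== dd bs=$bs ==="
      for i in ./*; do dd if=$i of=/dev/null bs=$bs 2>/dev/null; done
      sleep 2
   done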

Include any warning/errors/backtraces from the system logs

Performance Question

Most helpful comment

In bioinformatics, a lot of software reads and writes files line by line; that is why there are so many small block accesses.

The main question is whether those apps generate random or sequential IO. If you request 4k but sequentially, the ARC can absorb any amplification. If it's truly random and works only with small blocks, you may have to set a smaller recordsize. I've had an interesting time with LMDB, which works with PAGESIZE-sized blocks by default and is mainly random by its architecture. A 4k recordsize on an HDD pool makes it faster than a 128k-1M recordsize, unless you have enough ARC to cache, ideally, 90%+ of the workload in RAM.

All 12 comments

Oh sorry, it looks like I added the wrong label, and I can't remove it....

Mistakes are only human! :)
I'm working on a patch to make the issue options a little clearer for people who just (basically) want some support. ^^

The problem is simplified as follows:
You save a 1M file in a 1M ZFS block on the hard disk. So far so good.
But: a 1M recordsize and a 1M file also mean that every read of any part of that file will fetch the full 1M.
So if you ask to fetch the file in 4K portions, those are separate requests; it will basically fetch 1M for every 4K part of the file requested. That is 1 MiB / 4 KiB = 256 requests, each pulling a full 1 MiB record, which is exactly the 256x overhead in the table above.

However: this should be covered by the ARC, or at the very least you should be able to tune the ARC to cover this issue.
But in short:
If you want to do many small reads, it's not the best design to use a 1M ZFS recordsize.

Hi Ornias1993, thank you.
In bioinformatics, a lot of software reads and writes files line by line; that is why there are so many small block accesses.

@homerl In those cases I would advise not exceeding a 128K recordsize, and you can further optimize by creating separate datasets with even smaller recordsizes on a case-by-case basis :)

Do remember:
higher recordsizes improve compression ratios, so it is a trade-off between compression and transfer speed in this case. I think 128K or 64K as a general recordsize for those files is pretty decent :)
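
As a concrete illustration of the per-dataset approach, something like this could be used (a sketch only; the dataset names and recordsizes are made-up examples, not a recommendation for any specific workload):

   # Separate datasets per workload, each with its own recordsize.
   zfs create -o recordsize=1M -o compression=lz4 tank/archives     # large sequential files
   zfs create -o recordsize=128K -o compression=lz4 tank/general    # mixed / default workload
   zfs create -o recordsize=16K -o compression=lz4 tank/lineio      # line-by-line / small random IO
   # recordsize only applies to newly written blocks, so existing files
   # must be copied into the new dataset to be rewritten at the new size.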

@Ornias1993
I'm testing recordsizes from 4K to 1M. If the recordsize is set too low, ZFS consumes more usable space.
For the moment, setting recordsize to 128K/64K is a balanced choice.
With a 128K recordsize the read amplification still exists; it's just 1/8 of that with a 1M recordsize.
I will suggest that they (they are biologists, not IT engineers) increase the read block size, but not all software can be modified.
Thank you very much.

For small reads, I often use a 16k recordsize, but never smaller.

You could also look into optimizing the ARC. These reads should be mostly covered by the (L2)ARC, IMHO.

In bioinformatics, a lot of software reads and writes files line by line; that is why there are so many small block accesses.

The main question is whether those apps generate random or sequential IO. If you request 4k but sequentially, the ARC can absorb any amplification. If it's truly random and works only with small blocks, you may have to set a smaller recordsize. I've had an interesting time with LMDB, which works with PAGESIZE-sized blocks by default and is mainly random by its architecture. A 4k recordsize on an HDD pool makes it faster than a 128k-1M recordsize, unless you have enough ARC to cache, ideally, 90%+ of the workload in RAM.

Hi Ornias1993, Gmelikov,
Aha, sequential IO is fine. There are two headache cases:

  1. One or more big files (the ZFS ARC is too small compared with these files), with many compute nodes reading/writing the hot files.
  2. E.g. sorting a big file by line (or some other algorithms): there is a lot of random IO, the data gets split into many tiny files (more metadata stress), and all of that IO pours into the zpool. 3 or 4 such jobs can knock out one 8+2 zpool.

I removed all of the L2ARC devices from the other zpools (on a single ZFS server we have multiple zpools) and added them to the hot zpool, and I got a good result. I'm trying to make this automatic :)

If every zpool is under heavy load, I will have to add more L2ARC devices next time.
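
For reference, moving a cache device between pools is just a remove and an add (a sketch; the pool and device names here are examples only):

   zpool remove coldpool /dev/nvme0n1p1     # detach the L2ARC device from a less busy pool
   zpool add tank cache /dev/nvme0n1p1      # attach it as L2ARC on the hot pool
   zpool iostat -v tank 1                   # watch the cache device fill up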

Hi,
coming back to the original intent of this question:
A huge ARC and L2ARC can avoid the issue, but I can't add them without limit.
What causes the read amplification?
Any better ideas for how to debug it?
The 128K recordsize shows the same kind of amplification, about 1/8 of the read amplification compared with the 1MB recordsize.
Thank you.

zfs set recordsize=1M tank/1024
zfs set primarycache=none tank
dd bs=

Each read has to retrieve the whole 1MB block from disk. You have explicitly requested that the data not be cached in memory, so every read system call will retrieve 1MB from disk, even reads of less than 1MB.

Hopefully this answers your question of why ZFS behaves this way. If your real-world workload looks like this, you should get a big performance improvement by not disabling the ARC (i.e. use the default primarycache=all). If you can't dedicate a lot of memory to ZFS, you can limit the ARC size with the zfs_arc_max module parameter.
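
For example, to cap the ARC at 8 GiB (the size is just an example; the value is in bytes), something like this should work:

   echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max                # runtime change
   echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf   # persist across reboots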

Hi Ahrens,
Thank you.
I have hit a bad case.
There are many sequential and random 4K/8K read requests.
If I set recordsize=4K/8K, then when I save a 9MB file it consumes 20+ MB of space, which wastes a lot of capacity.
Oh, sorry, I had the wrong idea: in a RAID system, each read operation needs to read the whole stripe.
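
The capacity overhead of a small recordsize on raidz2 can be measured directly; here is a sketch (the dataset names are examples) that compares allocated vs. logical space for the same data:

   zfs create -o recordsize=4K tank/rs4k
   zfs create -o recordsize=1M tank/rs1m
   # incompressible data so compression doesn't hide the overhead
   dd if=/dev/urandom of=/tank/rs4k/file bs=1M count=9 oflag=sync
   dd if=/dev/urandom of=/tank/rs1m/file bs=1M count=9 oflag=sync
   # "used" includes raidz parity/padding overhead, "logicalused" is the file size
   zfs get -o name,property,value used,logicalused tank/rs4k tank/rs1m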
