Type | Version/Name
--- | ---
Distribution Name | Arch Linux
Distribution Version | rolling
Linux Kernel | Linux 5.2.7-arch1-1-ARCH #3 SMP, PREEMPT_VOLUNTARY
Architecture | x86-64
ZFS Version | 0.8.2
SPL Version |
I am observing unfair sync write scheduling and severe userspace process I/O starvation in certain situations. It appears that an fsync call on a file with a lot of unwritten dirty data stalls the system and forces a FIFO-like sync write order, where no other process gets its share of write bandwidth until all of that dirty data is flushed.
On my home system this causes severe stalls when a guest VM with a cache=writeback virtio-scsi disk issues a SCSI flush while a lot of its dirty data is still sitting in the hypervisor's RAM. All other writers on the hypervisor block completely and userspace processes start failing with various timeouts and lock-ups. It effectively acts as a DoS.
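For context, the guest disk setup boils down to something like the following QEMU invocation (a sketch only; the image path and device IDs are placeholders and my guests are actually managed through libvirt, but the relevant parts are the virtio-scsi controller and cache=writeback):
qemu-system-x86_64 -enable-kvm -m 8G \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=/path/to/guest.qcow2,if=none,id=drive0,format=qcow2,cache=writeback \
  -device scsi-hd,drive=drive0,bus=scsi0.0
With cache=writeback, guest flush commands end up as fsync/fdatasync on the backing file, which as far as I can tell is exactly the pattern reproduced below with fio.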
1). Prepare a reasonably-default dataset (a sketch of an equivalent zfs create command follows the property listing below).
zfs get all rootpool1/arch1/varcache
NAME PROPERTY VALUE SOURCE
rootpool1/arch1/varcache type filesystem -
rootpool1/arch1/varcache creation Thu Aug 15 23:37 2019 -
rootpool1/arch1/varcache used 13.4G -
rootpool1/arch1/varcache available 122G -
rootpool1/arch1/varcache referenced 13.4G -
rootpool1/arch1/varcache compressratio 1.08x -
rootpool1/arch1/varcache mounted yes -
rootpool1/arch1/varcache quota none default
rootpool1/arch1/varcache reservation none default
rootpool1/arch1/varcache recordsize 128K default
rootpool1/arch1/varcache mountpoint /var/cache local
rootpool1/arch1/varcache sharenfs off default
rootpool1/arch1/varcache checksum on default
rootpool1/arch1/varcache compression lz4 inherited from rootpool1
rootpool1/arch1/varcache atime off inherited from rootpool1
rootpool1/arch1/varcache devices on default
rootpool1/arch1/varcache exec on default
rootpool1/arch1/varcache setuid on default
rootpool1/arch1/varcache readonly off default
rootpool1/arch1/varcache zoned off default
rootpool1/arch1/varcache snapdir hidden default
rootpool1/arch1/varcache aclinherit restricted default
rootpool1/arch1/varcache createtxg 920 -
rootpool1/arch1/varcache canmount on default
rootpool1/arch1/varcache xattr sa inherited from rootpool1
rootpool1/arch1/varcache copies 1 default
rootpool1/arch1/varcache version 5 -
rootpool1/arch1/varcache utf8only off -
rootpool1/arch1/varcache normalization none -
rootpool1/arch1/varcache casesensitivity sensitive -
rootpool1/arch1/varcache vscan off default
rootpool1/arch1/varcache nbmand off default
rootpool1/arch1/varcache sharesmb off default
rootpool1/arch1/varcache refquota none default
rootpool1/arch1/varcache refreservation none default
rootpool1/arch1/varcache guid 394889745699357232 -
rootpool1/arch1/varcache primarycache all default
rootpool1/arch1/varcache secondarycache all default
rootpool1/arch1/varcache usedbysnapshots 0B -
rootpool1/arch1/varcache usedbydataset 13.4G -
rootpool1/arch1/varcache usedbychildren 0B -
rootpool1/arch1/varcache usedbyrefreservation 0B -
rootpool1/arch1/varcache logbias latency default
rootpool1/arch1/varcache objsetid 660 -
rootpool1/arch1/varcache dedup off default
rootpool1/arch1/varcache mlslabel none default
rootpool1/arch1/varcache sync standard default
rootpool1/arch1/varcache dnodesize legacy default
rootpool1/arch1/varcache refcompressratio 1.08x -
rootpool1/arch1/varcache written 13.4G -
rootpool1/arch1/varcache logicalused 14.5G -
rootpool1/arch1/varcache logicalreferenced 14.5G -
rootpool1/arch1/varcache volmode default default
rootpool1/arch1/varcache filesystem_limit none default
rootpool1/arch1/varcache snapshot_limit none default
rootpool1/arch1/varcache filesystem_count none default
rootpool1/arch1/varcache snapshot_count none default
rootpool1/arch1/varcache snapdev hidden default
rootpool1/arch1/varcache acltype off default
rootpool1/arch1/varcache context none default
rootpool1/arch1/varcache fscontext none default
rootpool1/arch1/varcache defcontext none default
rootpool1/arch1/varcache rootcontext none default
rootpool1/arch1/varcache relatime off default
rootpool1/arch1/varcache redundant_metadata all default
rootpool1/arch1/varcache overlay on local
rootpool1/arch1/varcache encryption off default
rootpool1/arch1/varcache keylocation none default
rootpool1/arch1/varcache keyformat none default
rootpool1/arch1/varcache pbkdf2iters 0 default
rootpool1/arch1/varcache special_small_blocks 0 default
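To reproduce this elsewhere, something like the following should give a comparable dataset (a sketch; the pool/dataset names and mountpoint are from my system, and everything not set explicitly stays at its defaults as in the listing above):
zfs create -o mountpoint=/var/cache -o compression=lz4 -o atime=off -o xattr=sa -o overlay=on rootpool1/arch1/varcache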
2). Prepare two terminal tabs and cd to this dataset's mount point in both. In them prepare the following fio commands:
"big-write"
fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1
and "small-write"
fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1
3). Let both jobs run once to prepare the necessary benchmark files. In the meantime observe iostat on the pool:
zpool iostat -qv rootpool1 0.1
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
rootpool1 273G 171G 0 7.15K 0 598M 0 0 0 0 0 0 213 8 0 0 0 0
mirror 273G 171G 0 7.15K 0 597M 0 0 0 0 0 0 213 8 0 0 0 0
ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2 - - 0 3.57K 0 300M 0 0 0 0 0 0 105 4 0 0 0 0
ata-Patriot_Burst_9128079B175300025792-part2 - - 0 3.56K 0 297M 0 0 0 0 0 0 108 4 0 0 0 0
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
...
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read trimq_write
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ pend activ
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
rootpool1 274G 170G 0 7.19K 0 565M 0 0 0 7 0 0 0 0 0 0 0 0
mirror 274G 170G 0 7.18K 0 565M 0 0 0 7 0 0 0 0 0 0 0 0
ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2 - - 0 3.59K 0 282M 0 0 0 3 0 0 0 0 0 0 0 0
ata-Patriot_Burst_9128079B175300025792-part2 - - 0 3.59K 0 282M 0 0 0 4 0 0 0 0 0 0 0 0
------------------------------------------------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
Note that when fio issues the 2G of async writes it calls fsync at the very end, which moves them from the async to the sync write class.
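If you want to see this for yourself, one way (a sketch, not part of the original reproduction) is to trace the sync calls fio makes when the job ends:
strace -f -e trace=fsync,fdatasync fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1
The single fsync() reported at the end is the point where, per the note above, the accumulated dirty data turns into sync-class writes.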
4). When both fio runs are finished, do the following: start "big-write" and then, after 2-3 seconds (when "Jobs: 1" appears), start "small-write". Note that the small 128K write never finishes before the 2G one; the second fio remains blocked until the first one completes.
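The same reproduction can be scripted for convenience (a sketch, assuming the benchmark files from step 3 already exist in the current directory):
fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1 &
sleep 3
time fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1
wait
If scheduling were fair, the timed 128K write would return in milliseconds; instead it does not return until the 2G flush has completed.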
Lowering zfs_dirty_data_max significantly (to 100-200M from the default of ~3G) mitigates the problem for me, but at the cost of roughly a 50% performance drop.
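For anyone who wants to try the same mitigation, this is roughly how I set it (the 200M value is just what worked on my system):
echo 209715200 > /sys/module/zfs/parameters/zfs_dirty_data_max
# and, to persist across reboots:
echo "options zfs zfs_dirty_data_max=209715200" >> /etc/modprobe.d/zfs.conf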
After some code investigation, the problem appears to be too deeply ingrained in the write path; there are multiple problems that cause this.
I am afraid my workaround is currently the only viable option for acceptable latency under overwhelming fsync load. ZFS is neither designed nor built to be bandwidth-fair to the entities consuming it.