ZFS: Lack of fairness of sync writes

Created on 8 Mar 2020 · 2 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | Arch Linux
Distribution Version | rolling
Linux Kernel | Linux 5.2.7-arch1-1-ARCH #3 SMP, PREEMPT_VOLUNTARY
Architecture | x86-64
ZFS Version | 0.8.2
SPL Version |

Describe the problem you're observing

I am observing unfair sync write scheduling and severe userspace process I/O starvation in certain situations. It appears that an fsync call on a file with a lot of unwritten dirty data stalls the whole system and forces a FIFO-like sync write order, where no other process gets its share until that dirty data has been flushed.
On my home system this causes severe stalls when the guest VM with a cache=writeback virtio-scsi disk decides to sync the SCSI barrier while holding a lot of dirty data in the hypervisor's RAM. All other writers on the hypervisor block completely and userspace starts falling over with various timeouts and lockups. It effectively acts as a DoS.
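
For reference, the guest setup in question is roughly the following (an illustrative sketch only; the image path and device IDs are placeholders, the key part is the writeback cache mode on a virtio-scsi disk):

# Illustrative QEMU invocation with a writeback-cached virtio-scsi disk.
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=/var/lib/libvirt/images/guest.qcow2,if=none,id=drive0,cache=writeback \
  -device scsi-hd,drive=drive0,bus=scsi0.0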

Describe how to reproduce the problem

1). Prepare a reasonably-default dataset (a creation sketch for an equivalent dataset follows the property listing below).

zfs get all rootpool1/arch1/varcache
NAME                      PROPERTY              VALUE                  SOURCE
rootpool1/arch1/varcache  type                  filesystem             -
rootpool1/arch1/varcache  creation              Thu Aug 15 23:37 2019  -
rootpool1/arch1/varcache  used                  13.4G                  -
rootpool1/arch1/varcache  available             122G                   -
rootpool1/arch1/varcache  referenced            13.4G                  -
rootpool1/arch1/varcache  compressratio         1.08x                  -
rootpool1/arch1/varcache  mounted               yes                    -
rootpool1/arch1/varcache  quota                 none                   default
rootpool1/arch1/varcache  reservation           none                   default
rootpool1/arch1/varcache  recordsize            128K                   default
rootpool1/arch1/varcache  mountpoint            /var/cache             local
rootpool1/arch1/varcache  sharenfs              off                    default
rootpool1/arch1/varcache  checksum              on                     default
rootpool1/arch1/varcache  compression           lz4                    inherited from rootpool1
rootpool1/arch1/varcache  atime                 off                    inherited from rootpool1
rootpool1/arch1/varcache  devices               on                     default
rootpool1/arch1/varcache  exec                  on                     default
rootpool1/arch1/varcache  setuid                on                     default
rootpool1/arch1/varcache  readonly              off                    default
rootpool1/arch1/varcache  zoned                 off                    default
rootpool1/arch1/varcache  snapdir               hidden                 default
rootpool1/arch1/varcache  aclinherit            restricted             default
rootpool1/arch1/varcache  createtxg             920                    -
rootpool1/arch1/varcache  canmount              on                     default
rootpool1/arch1/varcache  xattr                 sa                     inherited from rootpool1
rootpool1/arch1/varcache  copies                1                      default
rootpool1/arch1/varcache  version               5                      -
rootpool1/arch1/varcache  utf8only              off                    -
rootpool1/arch1/varcache  normalization         none                   -
rootpool1/arch1/varcache  casesensitivity       sensitive              -
rootpool1/arch1/varcache  vscan                 off                    default
rootpool1/arch1/varcache  nbmand                off                    default
rootpool1/arch1/varcache  sharesmb              off                    default
rootpool1/arch1/varcache  refquota              none                   default
rootpool1/arch1/varcache  refreservation        none                   default
rootpool1/arch1/varcache  guid                  394889745699357232     -
rootpool1/arch1/varcache  primarycache          all                    default
rootpool1/arch1/varcache  secondarycache        all                    default
rootpool1/arch1/varcache  usedbysnapshots       0B                     -
rootpool1/arch1/varcache  usedbydataset         13.4G                  -
rootpool1/arch1/varcache  usedbychildren        0B                     -
rootpool1/arch1/varcache  usedbyrefreservation  0B                     -
rootpool1/arch1/varcache  logbias               latency                default
rootpool1/arch1/varcache  objsetid              660                    -
rootpool1/arch1/varcache  dedup                 off                    default
rootpool1/arch1/varcache  mlslabel              none                   default
rootpool1/arch1/varcache  sync                  standard               default
rootpool1/arch1/varcache  dnodesize             legacy                 default
rootpool1/arch1/varcache  refcompressratio      1.08x                  -
rootpool1/arch1/varcache  written               13.4G                  -
rootpool1/arch1/varcache  logicalused           14.5G                  -
rootpool1/arch1/varcache  logicalreferenced     14.5G                  -
rootpool1/arch1/varcache  volmode               default                default
rootpool1/arch1/varcache  filesystem_limit      none                   default
rootpool1/arch1/varcache  snapshot_limit        none                   default
rootpool1/arch1/varcache  filesystem_count      none                   default
rootpool1/arch1/varcache  snapshot_count        none                   default
rootpool1/arch1/varcache  snapdev               hidden                 default
rootpool1/arch1/varcache  acltype               off                    default
rootpool1/arch1/varcache  context               none                   default
rootpool1/arch1/varcache  fscontext             none                   default
rootpool1/arch1/varcache  defcontext            none                   default
rootpool1/arch1/varcache  rootcontext           none                   default
rootpool1/arch1/varcache  relatime              off                    default
rootpool1/arch1/varcache  redundant_metadata    all                    default
rootpool1/arch1/varcache  overlay               on                     local
rootpool1/arch1/varcache  encryption            off                    default
rootpool1/arch1/varcache  keylocation           none                   default
rootpool1/arch1/varcache  keyformat             none                   default
rootpool1/arch1/varcache  pbkdf2iters           0                      default
rootpool1/arch1/varcache  special_small_blocks  0                      default
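
For reference, an equivalent dataset can be created along these lines (a sketch based on the properties above; on my pool compression, atime and xattr are inherited from the pool root, so set them explicitly if your pool root differs):

# Only mountpoint and overlay are set locally in the listing above;
# compression=lz4, atime=off and xattr=sa come inherited from rootpool1.
zfs create -o mountpoint=/var/cache -o overlay=on rootpool1/arch1/varcache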

2). Open two terminal tabs and cd to this dataset's mount point in both. In them prepare the following fio commands:
"big-write"

fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1

and "small-write"

fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1

3). Let them run once to prepare the necessary benchmark files. In the meantime, observe I/O on the pool with zpool iostat:

zpool iostat -qv rootpool1 0.1
                                                         capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool                                                    alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rootpool1                                                273G   171G      0  7.15K      0   598M      0      0      0      0      0      0    213      8      0      0      0      0
  mirror                                                 273G   171G      0  7.15K      0   597M      0      0      0      0      0      0    213      8      0      0      0      0
    ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2      -      -      0  3.57K      0   300M      0      0      0      0      0      0    105      4      0      0      0      0
    ata-Patriot_Burst_9128079B175300025792-part2            -      -      0  3.56K      0   297M      0      0      0      0      0      0    108      4      0      0      0      0
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
...
                                                         capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool                                                    alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
rootpool1                                                274G   170G      0  7.19K      0   565M      0      0      0      7      0      0      0      0      0      0      0      0
  mirror                                                 274G   170G      0  7.18K      0   565M      0      0      0      7      0      0      0      0      0      0      0      0
    ata-PNY_CS900_480GB_SSD_PNY111900057103045C0-part2      -      -      0  3.59K      0   282M      0      0      0      3      0      0      0      0      0      0      0      0
    ata-Patriot_Burst_9128079B175300025792-part2            -      -      0  3.59K      0   282M      0      0      0      4      0      0      0      0      0      0      0      0
------------------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Note that when fio issues 2G of writes it calls fsync at the very end, which moves them from the async to the sync class.
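
The same write-everything-then-fsync pattern can be produced without fio as well, for example with dd (an illustrative equivalent, not part of the original test):

# Roughly equivalent in effect to "big-write": 2G of 32K writes, then one fsync.
dd if=/dev/zero of=bigfile bs=32k count=65536 conv=fsync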

4). When both fios are finished, do the following: start "big-write" and then, after 2-3 seconds (when "Jobs: 1" appears), start "small-write". Note that the small 128K write never finishes before the 2G one; the second fio remains blocked until the first one completes.
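
For convenience, the race can also be scripted; a sketch assuming the job files from step 3 are already in place and the current directory is the dataset's mount point:

#!/bin/sh
# Give the big sync-heavy writer a head start, then time how long the
# tiny 128K write + fsync takes to complete.
fio --name=big-write --ioengine=sync --rw=write --bs=32k --direct=1 \
    --size=2G --numjobs=1 --end_fsync=1 &
sleep 3
time fio --name=small-write --ioengine=sync --rw=write --bs=128k --direct=1 \
    --size=128k --numjobs=1 --end_fsync=1
wait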

Include any warning/errors/backtraces from the system logs

All 2 comments

Lowering zfs_dirty_data_max significantly (to values in the 100-200M range, down from the default 3G on my system) mitigates the problem for me, but at the cost of roughly a 50% performance drop.
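
For reference, the parameter can be changed at runtime through the module parameter interface (value in bytes):

# Lower the dirty data cap to 200M at runtime (not persistent across reboots).
echo $((200 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
# To persist it, set the module option instead, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_dirty_data_max=209715200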

After some code investigation, the problem appears to be deeply ingrained in the write path.
Several issues combine to cause it:

  1. The writer's task_id is effectively lost at the VFS layer (vnops). Apart from zio priority, there is no QoS-related data tracked for itxs and zios. While current()'s task_id could be added to these structures, at the moment there is simply no data on which any kind of QoS could be based.
  2. The DMU throttle that implements the zfs_dirty_data_max logic and bounds the total size of dirty dbufs in a transaction group is unfair. On a very busy system with one full txg already syncing, a single process can eat up the entire dirty data allowance in RAM, yet it is throttled no harder than the interactive processes that arrive later, behind that single DDoSer's write flow. Transaction allocation is effectively only as fair as the CPU scheduler, relying on thread priorities. Even worse, unless their files are opened with the O_SYNC flag, latency-sensitive applications that rely on multiple writes plus fsync still go through the DMU throttle and take a severe penalty to their write speed.
  3. The ZIL is serialized per dataset. I repeated the experiment from the OP on two separate datasets (see the sketch after this list), and in some cases the second, small fio did manage to complete before the first, big one. The speed of the second fio was still abysmal, as nothing in the SPA/ZIO layer guarantees fairness between ZILs. Such FIFO ordering is incompatible with any form of bandwidth sharing, and this is the original cause of my problems: the ZIL processes 2G from the first fio and only then the 128K from the small one.
  4. The SPA allocation throttle and the vdev schedulers are oblivious to the source of the zios they handle. They are fair neither to processes nor to any other entity, and look only at the graph ordering of zios and their priority class.
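
For reference, the two-dataset variant of the experiment mentioned in point 3 looks roughly like this (a sketch; the dataset names and mount points are placeholders, not my exact layout):

# Run the writers against two separate datasets so each gets its own ZIL.
zfs create rootpool1/arch1/test-a
zfs create rootpool1/arch1/test-b
(cd /rootpool1/arch1/test-a && fio --name=big-write --ioengine=sync --rw=write \
    --bs=32k --direct=1 --size=2G --numjobs=1 --end_fsync=1) &
sleep 3
(cd /rootpool1/arch1/test-b && time fio --name=small-write --ioengine=sync --rw=write \
    --bs=128k --direct=1 --size=128k --numjobs=1 --end_fsync=1)
wait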

I am afraid my workaround above (lowering zfs_dirty_data_max) is currently the only viable option for acceptable latency under overwhelming fsync load. ZFS is neither designed nor built to be bandwidth-fair to the entities that consume it.
