ZFS: Staggered-write file fragmentation on multi-core systems

Created on 1 Feb 2018 · 4 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | NixOS
Distribution Version | 18.03.git.d492cdc789c (Impala)
Linux Kernel | 4.9.77 #1-NixOS SMP
Architecture | Intel(R) Atom(TM) CPU C2750
ZFS Version | 0.7.5-1
SPL Version | 0.7.5-1

Describe the problem you're observing

After writing a tool to display file fragmentation as seen by zdb, I noticed that even a file newly created by cp, on a pool with plentiful free space, would have 2-3 fragments per MB; for a 300MB file, that's 700-900 fragments.

After disabling 7 of the system's 8 cores through /sys/devices/system/cpu/cpu[1-7]/online and redoing the write, I saw only 8 to 10 fragments. Re-enabling the cores caused new writes to be as fragmented as ever.
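Roughly, the experiment looked like this (a sketch rather than the exact commands I ran; the source and destination paths are placeholders, and it uses the printzblock/printzfrag helpers defined in the reproduction steps below):

# Baseline: copy a large file with all 8 cores online, then count fragments.
cp /path/to/300MB.source /tank/testfile-8core
sync                        # let the TxG commit before inspecting with zdb
printzfrag /tank/testfile-8core

# Take cores 1-7 offline, leaving only cpu0.
for c in /sys/devices/system/cpu/cpu[1-7]/online; do
  echo 0 | sudo tee "$c" > /dev/null
done

# Repeat the copy to a fresh file and compare fragment counts.
cp /path/to/300MB.source /tank/testfile-1core
sync
printzfrag /tank/testfile-1core

# Bring the cores back online afterwards.
for c in /sys/devices/system/cpu/cpu[1-7]/online; do
  echo 1 | sudo tee "$c" > /dev/null
done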

The fragments manifest as staggered writes; individual sections are reordered, but never written very far from where they 'should' be, i.e. instead of an 'ABCDEFGHI' ordering, I might instead see 'BACEDFIGH'.

This has also been confirmed on a newly-created, single-disk pool.

Describe how to reproduce the problem

Replicate my procedure from above. To examine the resulting files, you could use the printzfrag tool:

# Dump the on-disk block pointers for the file at $1, using the pool that
# backs its filesystem and the file's inode (object) number.
printzblock() {
  sudo zdb -ddddd "$(df --output=source --type=zfs "$1" | tail -n +2)" "$(stat -c %i "$1")"
}

# Count discontiguous L0 (data) blocks by walking the DVAs (vdev:offset:size,
# in hex) and comparing each block's offset to the end of the previous one.
printzfrag() {
  printzblock "$@" \
    | grep ' L0' \
    | awk '{print $3}' \
    | gawk -F: \
      'BEGIN {
        pos=0
        segments=0
      } {
         segstart = strtonum("0x" $2)   # offset within the vdev
         size = strtonum("0x" $3)       # allocated size
         if (pos != segstart) {         # not contiguous with the previous block
           segments += 1
           print old, " -- ", $0, " -- ", pos - segstart
         }
         pos = segstart + size
         old = $0
      } END {
         print segments
      }'
}

This will output something similar to http://ix.io/F27.

The last column shows each fragment's offset from its 'expected' location, exhibiting an easily recognisable pattern of staggered writes.

The tool isn't intended for and won't work well on multi-vdev pools.
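If you have no spare hardware, a throwaway file-backed single-vdev pool should be enough to reproduce the effect (a sketch; the pool name, backing file, and sizes are arbitrary):

# Create a single-vdev pool backed by a sparse file.
truncate -s 4G /var/tmp/zfrag.img
sudo zpool create zfragtest /var/tmp/zfrag.img

# Write a test file and inspect its layout.
sudo cp /path/to/300MB.source /zfragtest/testfile
sync
printzfrag /zfragtest/testfile

# Tear the pool down again.
sudo zpool destroy zfragtest
sudo rm /var/tmp/zfrag.img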

Labels: Stale, Performance

All 4 comments

@Baughn nice observation. As you probably surmised, this behavior is an artifact of the way the I/O pipeline concurrently allocates new blocks. In practice, most blocks are large and will get prefetched sequentially, so the impact of this is usually minimal. But this is an area which could potentially be optimized; the tricky bit is sequentially laying out an unknown number of blocks while maintaining maximum concurrency in the pipeline.

> In practice, most blocks are large and will get prefetched sequentially so the impact of this is usually minimal.

I can confirm that this usually happens, or at least that I'm getting full-speed streaming reads from these files. Regarding its being 'minimal', though... I believe it's pretty common to use a higher recordsize for filesystems mainly consisting of larger files, e.g. video, and it wouldn't take much of an increase to blow out the default prefetch size.
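For context, here is one way to compare a dataset's recordsize against the prefetcher's per-stream limit (assuming the relevant knob is the zfetch_max_distance module parameter; 'tank/videos' is a made-up dataset name):

# Record size of a dataset tuned for large media files (often raised to 1M).
zfs get recordsize tank/videos

# Maximum distance the prefetcher will read ahead of a stream,
# in bytes; typically 8 MiB (8388608) by default.
cat /sys/module/zfs/parameters/zfetch_max_distance

With a 1 MiB recordsize, that window covers only a handful of records, so the margin for out-of-order blocks is much thinner than with the default 128 KiB records.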

Reducing zfs_sync_taskq_batch_pct seems to make the TxG commit much more "linear" and promotes more write aggregation; I suspect this is why. The default of 75% is far larger than needed on systems with many processors; I end up using somewhere between 2 and 5 threads for it in most cases.
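For anyone wanting to experiment with this, the tunable is a zfs module parameter. A sketch of inspecting and setting it follows; the value 25 and the file name are just examples, and on NixOS the equivalent of /etc/modprobe.d is typically boot.extraModprobeConfig:

# Current value: percentage of online CPUs used for sync taskq threads.
cat /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct

# Persistent setting, read when the module loads and the pool is imported.
echo 'options zfs zfs_sync_taskq_batch_pct=25' \
  | sudo tee /etc/modprobe.d/zfs-sync-taskq.conf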

As part of tuning, I find it helpful to open up the ZIO throttle, provide adequate async write threads, and find the point at which zfs_sync_taskq_batch_pct can generate data just fast enough to make full use of the disks. That point provides the most potential for write merges and rate-limits TxG sync RMW reads. After that, dial in the threads, then set the ZIO throttle just a little looser than it needs to be to maintain the previous results. This gives us writes sequential enough to easily stream to block devices over a WAN; I've never seen a TxG commit so clean.
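The comment doesn't name the exact tunables, but the knobs usually meant by the "ZIO throttle" and "async write threads" are the following module parameters; this is only how to inspect them, not a recommendation for particular values:

# Allocation throttle (the "ZIO throttle") and async write queue depths,
# alongside the sync taskq percentage discussed above.
grep . /sys/module/zfs/parameters/zio_dva_throttle_enabled \
       /sys/module/zfs/parameters/zfs_vdev_async_write_min_active \
       /sys/module/zfs/parameters/zfs_vdev_async_write_max_active \
       /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct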

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
