ZFS: Large recordsize (> 1 MB) == inflated disk writes? (0.6.5.4-1)

Created on 25 Jul 2016 · 16 comments · Source: openzfs/zfs

Forgive me if I leave out anything that would be helpful. I can reproduce this at will, although getting logs out of the system in question may take a few days.

I've been experimenting with larger recordsizes (4 MB, 16 MB) for a very specific workload that would benefit from them when combined with SMR HDDs. All seems well from initial tests when looking at dstat, zpool iostat, etc. in terms of bandwidth; however, the logical writes are being inflated by up to 4x when committed to disk. That is, I see ~440 MB/s of disk IO through the kernel (from what ZFS reports, etc.), but the logical throughput is barely over 100 MB/s. This is with a 16 MB recordsize.

With a 4 MB max recordsize, quick testing shows the inflation from logical to disk throughput is approximately 2x.

The odd thing is that the allocated disk space rises with the logical throughput: that ~440 MB/s of disk IO doesn't match up over time with pool utilization (i.e., the logical throughput drives pool allocation, not the writes being committed to the disks). This holds for both 4 MB and 16 MB recordsizes (I did not test other sizes). 1 MB records behave perfectly normally, but are non-optimal due to the quirkiness of SMR drives (I'd love 100+ MB record sizes if they were an option).

Versions:
Stock ZFS in TOSS 2.4-9 (ZoL 0.6.5.4-1)

Pool setup:
RAIDZ3 over 20 8 TB SMR drives
ashift=12
relatime=on
reservation=20G
logbias=throughput
Everything else defaults (compression is not enabled)

Module parameters:
zfs_txg_timeout = 60000 (60 seconds)
zfs_vdev_async_max_active = 128
zfs_top_maxinflight = 128

This occurs during the following workload:
10 processes (each with a separate output directory), each writing 100 MB files (in 1 MB blocks), with a single fsync at the end of each file.
e.g.:
for i in {1..100000}
do dd if=/dev/zero of=/test_pool/dir_1/$i.file bs=1M count=100 conv=fdatasync
done
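
A rough sketch of how the ten parallel writers described above might be launched (directory names are illustrative):

for d in {1..10}
do
  mkdir -p /test_pool/dir_$d
  ( for i in {1..100000}
    do dd if=/dev/zero of=/test_pool/dir_$d/$i.file bs=1M count=100 conv=fdatasync
    done ) &
done
wait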

No other pool IO. No reads during writes.

Results from ~14 hours of writes at 16 MB recordsize:
dstat/zpool iostat report an average of 440 MB/s over the entire period
Logical data written = ~5.5 TB (somewhere around 110 MB/s)

Given the dstat/zpool iostat figures (440 MB/s over roughly 14 hours, i.e. ~50,000 seconds) I would have expected closer to 22 TB written overnight.

Let me know what further information I can provide to track this down, as I'm sure this isn't enough to really delve into the behavior.

Performance

All 16 comments

@thewacokid one possibility is that you're triggering multiple writes of the same 16M block due to the 1M block size you've specified on the dd command line. As a quick test I'd suggest increasing that to 16M and seeing whether the behavior changes.

I'll try that when I get a chance; however, I don't see how sequential writes (with a sync every 100 MB or so) would cause a 4x overwrite penalty. Am I missing something about how ZFS allocates blocks for new writes?

Is there anything you'd like me to measure if I get a chance to retest? I could dump the transaction groups, if that would shed light on this.

The thing to keep in mind is ZFS can only write full blocks to disk, in your case those blocks are going to be 16M. If you're writing to a file in 1M chunks it's not unlikely the txg sync will happen while you're partially through dirtying one of these 16M blocks. That partial block must be written to disk since it's dirty, and it must be written again in the next txg when the rest of the block is dirtied. Minimizing the number of system calls required to fully dirty the block should make this less likely (but not impossible).

This is going to hurt even worse for SMR drives since it means you'll be writing the block twice and then freeing up the first copy and making holes. One way you could check this is to use zdb to dump the block layout of the files you've just written. Use stat <file> to get the inode number of the file, this maps directly to the ZFS object number. You can then dump the object with zdb <pool/dataset> <object nr>.
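
For example, with an illustrative file path and object number (assuming the files live in the pool's root dataset):

stat -c '%i' /test_pool/dir_1/1.file    # prints the inode number, which maps to the ZFS object number
zdb -ddddd test_pool 1234               # dumps that object, including its block pointers and block sizes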

I must be totally confused about something. With standard ZFS files (i.e., not zvols), I was under the impression the recordsize was a _maximum_ value, and anything between [ashift minimum * stripe width (in this case 4 KB * 20 = 80 KB)] and [recordsize] was a valid block size?

Is that an untrue statement?

@thewacokid you're not crazy, that's largely true. The caveat is that once the file size exceeds the maximum block size, meaning multiple blocks are required, all of those blocks must be of the same size. The block sizes also must be a power of two in the range of ashift to recordsize, inclusive.

So you're absolutely right, if you write a 900 byte file it will use a 1K block size. If you write a 17M file it needs to use two 16M blocks (assuming recordsize=16M).

Okay, that makes more sense!

Moving on to the issue I'm seeing: I'm unsure how I'd be hitting txg syncs, since the timeout is set well above the IO time required for a single file (60 seconds).

Each file is synced at the tail end; does that action sync the entire transaction group? If so, this behavior makes sense, but I'm unsure how to fix/alter our target workload to utilize large blocks with multiple writers (each syncing at the tail end of each file).

It's starting to feel like syncing any file at all is equivalent to syncing the entire filesystem, and I hope that's me misunderstanding this behavior.

A txg sync gets triggered based on a few different criteria, not just time.

Setting a zfs_txg_timeout=60 means that dirty data in the ARC will be written at least every 60 seconds. The zfs_dirty_data_sync=64M module option will also trigger a sync when there's more than this amount of dirty data in the ARC.

For most workloads it's not a good idea to let large amounts of dirty data accumulate. In your case you may want to try increasing this as well as zfs_dirty_data_max. These module options are described in the zfs-module-parameters(5) man page. There are also many other tunables which you might find useful.
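
For example, something along these lines raises both thresholds at runtime (values are purely illustrative, not recommendations):

echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max    # allow up to 4 GiB of dirty data
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_sync   # trigger a txg sync once 1 GiB is dirty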

A good way to get visibility into how the txgs are being formed is to enable the txg history in proc. That'll give you an idea of how often they're happening and how large they are.

echo 1 >/sys/module/zfs/parameters/zfs_txg_history
cat /proc/spl/kstat/zfs/<pool>/txgs

Additionally, if you're willing to custom-build the latest ZFS source from GitHub you'll have access to some really nice extended zpool iostat commands. Personally I've found the new request size histograms (-r) and latency histograms (-w) really insightful.
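
For example (pool name and interval are illustrative):

zpool iostat -r test_pool 5    # request size histograms, refreshed every 5 seconds
zpool iostat -w test_pool 5    # latency histograms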

It's starting to feel like syncing any file at all is equivalent to syncing the entire filesystem, and I hope that's me misunderstanding this behavior.

That's definitely not the case. When you call fsync on a file, only that file will be synced to disk. But just as with any filesystem, that data may be written out sooner depending on how the system is tuned.

In your case with SMR drives and huge blocks you're probably on the right track trying to tune things to cache as much as possible in the ARC before writing anything out.

Aha, I knew I missed something. The zfs_dirty_data_sync parameter will allow us to keep a lot more in memory prior to syncing the file.

We have a very peculiar workload here, but we're trying to minimize the pain of working with SMR drives. What we're shooting for is essentially a combination of settings that will allow writes in a file to queue up until we explicitly sync the file, and I simply forgot about the above parameter.

Thanks a bunch for your insight, I was really starting to worry I'd lost my mind! I'll work on tuning this and update so others can gain some insight as well. Sorry for thinking something was wrong instead of simply blaming my lack of knowledge!

Another thing you might try is to leave the block size set at 1M but increase zfs_vdev_aggregation_limit to 16M. This way, as long as you're doing 1M-aligned IO, you should never write partial blocks and leave holes. ZFS will aggregate these 1M blocks into larger 16M IOs to the disk.
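
As a rough sketch (pool name illustrative, value in bytes):

zfs set recordsize=1M test_pool
echo 16777216 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit    # 16 MiB aggregation limit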

I'll try that as well. I was mainly trying to tune for the best workload to the disks, as well as minimizing resilver time in the event of disk failure (since resilvers on SMR drives can be extremely painful if the original write pattern was poor).

Just to follow up and get this closed: increasing the aggregation limit along with the dirty_data_sync limit dramatically improves real throughput. No overwrites are happening, and the SMR drives resilver (with this workload) at almost 40 MB/s (to the single drive). That's far in excess of what we've been able to achieve in the past.

Thanks for clearing up my confusion @behlendorf!

@thewacokid, can you please share what your final parameters for this SMR pool were? I'm about to try something very similar here, and would benefit from your results.

We ended up not running > 1 MB recordsizes in production. While we were able to get the larger recordsizes to commit consistently well, read performance and memory management overhead proved too great a compromise. There is an open issue, #6518, about read performance with recordsizes larger than 1 MB.

However, the config that we settled on for fairly consistent write performance (with our very sequential and large file oriented workload with explicit syncing) was:
zfs_dirty_data = 4 GB (this was the biggest step forward)
zfs_arc_min = 4 GB
zfs_arc_max = 8 GB (these two arc configs were for memory management, not SMR specific)
zfs_txg_timeout = 120 seconds (since we explicitly sync)
zfs_vdev_aggregation_limit = 16 MB

I'm sure there are more tunables, but this got us the performance and relative consistency that we required. These settings also held up fairly consistently in testing with the newer 0.7 codebase.
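
For reference, expressed as persistent module options in /etc/modprobe.d/zfs.conf, those settings would look roughly like this (assuming "zfs_dirty_data" above refers to zfs_dirty_data_max; values in bytes, timeout in seconds):

# /etc/modprobe.d/zfs.conf -- mirrors the list above
options zfs zfs_dirty_data_max=4294967296 zfs_arc_min=4294967296 zfs_arc_max=8589934592 zfs_txg_timeout=120 zfs_vdev_aggregation_limit=16777216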

the config that we settled on for fairly consistent write performance (with our very sequential and large file oriented workload with explicit syncing) was:
zfs_dirty_data = 4 GB (this was the biggest step forward)
zfs_arc_min = 4 GB
zfs_arc_max = 8 GB (these two arc configs were for memory management, not SMR specific)
zfs_txg_timeout = 120 seconds (since we explicitly sync)
zfs_vdev_aggregation_limit = 16 MB

Thanks for the info, @thewacokid. So (given the read performance/memory management issues) did you end up using the above with recordsize=1M? Or something else?

That's correct, 1 MB recordsize on all SMR pools. I briefly experimented with 4 MB as a happy medium, but the read performance under non-ideal loads left quite a bit to be desired.

I wanted to note that thewacokid's solution worked like a charm for me too. I have a 6-SMR-disk raidz array for a backup server (basically a NAS plus a Linux desktop running CrashPlan for cloud disaster recovery) and the bottleneck is now the gigabit Ethernet jack on the back. The solution easily made the disks 10x faster than before for very large continuous writes.

Note: it is also much better to run these drives as individual SATA drives (preferably internal to a tower) than either via USB or via an eSATA enclosure with a 4-port multiplier.
