Type | Version/Name
--- | ---
Distribution Name | Proxmox VE (some configs below were repeated on Ubuntu 20.04 with the same result)
Distribution Version | 6.2
Linux Kernel | 5.4.34-1-pve, 5.4.41-1-pve, and 5.4.44-1-pve (tested across all three)
Architecture | 2x Xeon Gold 6154, 192GB, boot from ext4 SATA SSD or 4x SSD DC P4510 (ZFS as root)
ZFS Version | 0.8.4-pve1 (also experienced with 0.8.3)
SPL Version | 0.8.4-pve1 (also experienced with 0.8.3)
zfs parameters | spa_slop_shift=7 (also tried default), zfs_arc_meta_min set to 16GB or 32GB
zpool parameters | atime=off, xattr=sa, recordsize=1M
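A minimal sketch of how the parameters above would typically be applied, assuming a placeholder pool named `tank` (the atime/xattr/recordsize settings are per-dataset zfs properties rather than pool properties):

```sh
# Module parameters, persisted in /etc/modprobe.d/zfs.conf and applied at module load:
#   options zfs spa_slop_shift=7
#   options zfs zfs_arc_meta_min=34359738368   # 32 GiB, expressed in bytes

# Dataset properties (set per dataset rather than per pool):
zfs set atime=off tank
zfs set xattr=sa tank
zfs set recordsize=1M tank
```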
After a clean boot, system memory usage climbs to a steady 118-120GB of 188GB as the ARC populates (as reported by htop). No other memory-heavy operations are taking place on the system; it is basically idle apart from this testing.
In testing usage scenarios sensitive to metadata eviction (repeated indexing/walking of a large directory structure, HDDs remaining spun down while refreshing directory contents, etc.), I've found that beyond a certain file transfer throughput to (and possibly from) a zpool, the directory metadata appears to be purged on a hair trigger. If the throughput remains relatively low (low hundreds of MB/s), the transfer can continue for days with all directory contents remaining cached, but at higher throughputs (600+ MB/s) it takes just a few dozen GB of transfer for directory traversal to revert to its uncached / cold-boot speed. For a raidz zpool with thousands of directories, this means an operation that took <10 seconds to traverse all directories now takes tens of minutes to complete (as long as it took when the cache was cold after boot).
I've tested this with zpool configurations of varying numbers of raidz/z2/z3 vdevs. In all cases, the vdevs were wide enough to support 2GB/s bursts (observable via iostat 0.1), and the 'high' throughputs that trigger the cache misses are still low enough that they don't appear to be running up against the write throttle (observed bursts of 2GB/s for a fraction of zfs_txg_timeout, with 1+ seconds of zero writes between timeouts).
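For concreteness, the reproduction sequence described above looks roughly like this (paths and pool names are placeholders):

```sh
# 1. Warm the metadata cache and time a full directory walk.
time find /tank -type d > /dev/null    # cold run: slow
time find /tank -type d > /dev/null    # warm run: <10 s when served from the ARC

# 2. Sustain high write throughput to the pool (600+ MB/s) for a few dozen GB,
#    e.g. by copying a large dataset in from a fast source.
cp -a /sourcepool/bigdata /tank/

# 3. Time the walk again; on the affected systems it reverts to cold-boot speed.
time find /tank -type d > /dev/null
```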
For my setup:
After a cold boot and a find run against the zpool (~1.2 million files / ~100k dirs):
The metadata_size and arc_meta_used will fluctuate depending on the type of activity, but even a few seconds of sufficiently high data throughput can cause the cache to drop directory data.
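These counters are exposed in /proc/spl/kstat/zfs/arcstats and can be watched while the transfer runs, e.g.:

```sh
grep -E '^(arc_meta_used|arc_meta_min|metadata_size|dnode_size) ' \
    /proc/spl/kstat/zfs/arcstats

# or refreshed every couple of seconds during a transfer:
watch -n 2 "grep -E '^(arc_meta_used|metadata_size|dnode_size) ' /proc/spl/kstat/zfs/arcstats"
```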
The specific vdev/zpool configuration(s) in use do not appear to have any impact. I've observed the cache misses triggered with the following widely varying scenarios:
Some other observations:
Nothing abnormal to report.
Have you tried increasing zfs_arc_dnode_limit_percent (or zfs_arc_dnode_limit) to avoid flushing the dnodes too aggressively?
(See also man zfs-module-parameters)
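Both tunables can be inspected and changed at runtime via sysfs; for example (the value set below is only illustrative):

```sh
# Read the current values (defaults: 10% of arc_meta_limit, and 0 = use the percent value):
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit

# Raise the percentage at runtime (25 is only an example, not a recommendation):
echo 25 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
```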
> Have you tried increasing zfs_arc_dnode_limit_percent (or zfs_arc_dnode_limit) to avoid flushing the dnodes too aggressively?
zfs_arc_dnode_limit automatically rises to match zfs_arc_meta_min, which is set to 32GB (and overrides zfs_arc_dnode_limit_percent).
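For cross-checking, arcstats should also report the dnode limit actually in force, even while the module parameter itself still reads 0:

```sh
# Effective dnode limit as enforced by the ARC, alongside the raw module parameter:
grep -E '^(arc_dnode_limit|arc_meta_min|dnode_size) ' /proc/spl/kstat/zfs/arcstats
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit
```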
dnode_size remains stable at 5.04GB during all operations above (as well as when directory queries become cache misses once the data throughput/copy operation resumes).
As an additional data point: while performing repeated find operations on the zpool to confirm dnode_size remained constant for this reply (with the copy operation stopped), I noted that with just a single background task reading data from the zpool (in this case a media scanner), several back-to-back find runs appeared to execute at uncached speed. It wasn't until the 3rd or 4th repeat that the directory metadata appeared to 'stick' in the cache. Once the directory metadata was cached, with the background sequential read task continuing, I could watch mfu_evictable_metadata slowly climb again; repeating the find knocks it back down. This system has seen a lot of sequential throughput over the past two days, without me repeating the find until just now.
It's as if dnode/metadata is fighting with data mru/mfu somehow and is failing to follow the respective parameters. The only way I can get the directory metadata to remain cached is to repeat a find across the zpool at a rate sufficient to prevent eviction. The higher the data throughput, the more frequently I need to repeat the directory traversal to keep it in the arc. If I repeat the find a bunch of times in a row, that appears to 'buy me some time', but if data transfer throughput is increased, then I must increase the frequency of the find operation to compensate or else it will revert to uncached performance. With sufficiently high data throughput, even constantly traversing all directories may not be sufficient to keep them cached.
This should not be occurring with arc_meta_min set higher than the peak observed metadata_size / dnode_size. A possible workaround is to run find across the zpool from cron every minute (sketched below), but that shouldn't be necessary given that the current parameters should already be sufficient to keep this dnode metadata in the ARC.
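A sketch of that cron workaround, again with a placeholder pool path (this treats the symptom rather than the cause):

```sh
# Drop a per-minute warm-up walk into /etc/cron.d (/tank is a placeholder path):
echo '* * * * * root /usr/bin/find /tank > /dev/null 2>&1' > /etc/cron.d/zfs-warm-metadata
```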
Edit: in the time it took me to write those last two paragraphs, with the copy operation resumed (1GB/s from one zpool to another), the find operation once again returned to uncached performance. dnode_size remains at 5.04GB and metadata_size at 7.16GB.
It's even worse: I have found that metadata gets evicted even when there is no memory pressure or other data throughput at all, i.e. just a simple
rsync -av --dry-run /dir/subdirs/with/1mio+/files/altogether /tmp (which is effectively lstat()'ing all files recursively)
on a freshly booted system will make the ARC go crazy.
On my VM with 8 GB of RAM, I see the ARC collapsing during the initial and subsequent runs, and although all metadata fits in RAM, a second rsync run never performs well and is never served entirely from RAM (which is what I would expect). We can see Slab and SUnreclaim grow in /proc/meminfo.
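A quick sketch of watching that growth from both the kernel and the SPL side while the rsync runs (/proc/spl/kmem/slab should be present wherever the spl module is loaded):

```sh
grep -E '^(Slab|SUnreclaim):' /proc/meminfo

# the SPL/ZFS kmem caches keep their own accounting too:
head -n 20 /proc/spl/kmem/slab
```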
I discussed this on IRC and was advised to drop the dnode cache via echo 2 > /proc/sys/vm/drop_caches, but this does not really work for me.
There is definitely something stupid going on here, and ZFS caching apparently puts a spoke in its own wheel...
Possibly related:
- #10331 - fix dnode eviction typo in arc_evict_state()
- #10563 - dbuf cache size is 1/32nd what was intended
- #10600 - Revise ARC shrinker algorithm
- #10610 - Limit dbuf cache sizes based only on ARC target size by default
- #10618 - Restore ARC MFU/MRU pressure