Type | Version/Name
--- | ---
Distribution Name | Proxmox VE (some configs below were repeated on Ubuntu 20.04 with the same result)
Distribution Version | 6.2
Linux Kernel | 5.4.34-1-pve, 5.4.41-1-pve, and 5.4.44-1-pve (tested across all three)
Architecture | 2x Xeon Gold 6154, 192GB, boot from ext4 SATA SSD or 4x SSD DC P4510 (ZFS as root)
ZFS Version | 0.8.4-pve1 (also experienced with 0.8.3)
SPL Version | 0.8.4-pve1 (also experienced with 0.8.3)
zfs parameters | spa_slop_shift=7 (also tried default), zfs_arc_meta_min set to 16GB or 32GB
zpool parameters | atime=off, xattr=sa, recordsize=1M
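A minimal sketch of how the parameters above would typically be applied, assuming a placeholder pool named `tank` (the atime/xattr/recordsize settings are per-dataset zfs properties rather than pool properties):

```sh
# Module parameters, persisted in /etc/modprobe.d/zfs.conf and applied at module load:
#   options zfs spa_slop_shift=7
#   options zfs zfs_arc_meta_min=34359738368   # 32 GiB, expressed in bytes

# Dataset properties (set per dataset rather than per pool):
zfs set atime=off tank
zfs set xattr=sa tank
zfs set recordsize=1M tank
```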
After a clean boot, system memory usage climbs to a steady 118-120GB of 188GB as the ARC populates (as reported by htop). No other memory-heavy operations are taking place on the system; it is basically idle apart from this testing.
In testing usage scenarios sensitive to metadata eviction (repeated indexing/walking of a large directory structure, HDDs remaining spun down while refreshing directory contents, etc.), I've found that beyond a certain file transfer throughput to (and possibly from) a zpool, the directory metadata appears to be purged on a hair trigger. If the throughput remains relatively low (low hundreds of MB/s), the transfer can continue for days with all directory contents remaining cached, but at higher throughputs (600+ MB/s) it takes just a few dozen GB of transfer for directory traversal to revert to its uncached / cold-boot speed. For a raidz zpool with thousands of directories, this means an operation that took <10 seconds to traverse all directories now takes tens of minutes to complete (as long as it took when the cache was cold after boot).
I've tested this with zpool configurations of varying numbers of raidz/z2/z3 vdevs. In all cases, the vdevs were wide enough to support 2GB/s bursts (observable via iostat 0.1), and the 'high' throughputs that trigger the cache misses are still low enough that they don't appear to be running up against the write throttle (observed bursts of 2GB/s for a fraction of zfs_txg_timeout, with 1+ seconds of zero writes between timeouts).
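For concreteness, the reproduction sequence described above looks roughly like this (paths and pool names are placeholders):

```sh
# 1. Warm the metadata cache and time a full directory walk.
time find /tank -type d > /dev/null    # cold run: slow
time find /tank -type d > /dev/null    # warm run: <10 s when served from the ARC

# 2. Sustain high write throughput to the pool (600+ MB/s) for a few dozen GB,
#    e.g. by copying a large dataset in from a fast source.
cp -a /sourcepool/bigdata /tank/

# 3. Time the walk again; on the affected systems it reverts to cold-boot speed.
time find /tank -type d > /dev/null
```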
For my setup:
After a cold boot and a find run against the zpool (~1.2 million files / ~100k dirs):
The metadata_size and arc_meta_used will fluctuate depending on the type of activity, but even a few seconds of sufficiently high data throughput can cause the cache to drop directory data.
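These counters are exposed in /proc/spl/kstat/zfs/arcstats and can be watched while the transfer runs, e.g.:

```sh
grep -E '^(arc_meta_used|arc_meta_min|metadata_size|dnode_size) ' \
    /proc/spl/kstat/zfs/arcstats

# or refreshed every couple of seconds during a transfer:
watch -n 2 "grep -E '^(arc_meta_used|metadata_size|dnode_size) ' /proc/spl/kstat/zfs/arcstats"
```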
The specific vdev/zpool configuration(s) in use do not appear to have any impact. I've observed the cache misses triggered with the following widely varying scenarios:
Some other observations:
Nothing abnormal to report.
Have you tried increasing zfs_arc_dnode_limit_percent (or zfs_arc_dnode_limit) to avoid flushing the dnodes too aggressively?
(See also man zfs-module-parameters)
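Both tunables can be inspected and changed at runtime via sysfs; for example (the value set below is only illustrative):

```sh
# Read the current values (defaults: 10% of arc_meta_limit, and 0 = use the percent value):
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit

# Raise the percentage at runtime (25 is only an example, not a recommendation):
echo 25 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
```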
> Have you tried increasing zfs_arc_dnode_limit_percent (or zfs_arc_dnode_limit) to avoid flushing the dnodes too aggressively?
zfs_arc_dnode_limit automatically rises to match zfs_arc_meta_min, which is set to 32GB (and overrides zfs_arc_dnode_limit_percent).
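For cross-checking, arcstats should also report the dnode limit actually in force, even while the module parameter itself still reads 0:

```sh
# Effective dnode limit as enforced by the ARC, alongside the raw module parameter:
grep -E '^(arc_dnode_limit|arc_meta_min|dnode_size) ' /proc/spl/kstat/zfs/arcstats
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit
```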
dnode_size remains stable at 5.04GB during all operations above (as well as when directory queries become cache misses once the data throughput/copy operation resumes).
As an additional data point: while performing repeated find operations on the zpool to confirm dnode_size remained constant for this reply (with the copy operation stopped), I noted that with just a single background task reading data from the zpool (in this case a media scanner), several back-to-back find runs appeared to execute at uncached speed. It wasn't until the 3rd or 4th repeat that the directory metadata appeared to 'stick' in the cache. Once the directory metadata was cached, with the background sequential read task continuing, I could watch mfu_evictable_metadata slowly climb again; repeating the find knocks it back down. This system has seen a lot of sequential throughput over the past two days, without me repeating the find until just now.
It's as if dnode/metadata is fighting with data mru/mfu somehow and is failing to follow the respective parameters. The only way I can get the directory metadata to remain cached is to repeat a find across the zpool at a rate sufficient to prevent eviction. The higher the data throughput, the more frequently I need to repeat the directory traversal to keep it in the arc. If I repeat the find a bunch of times in a row, that appears to 'buy me some time', but if data transfer throughput is increased, then I must increase the frequency of the find operation to compensate or else it will revert to uncached performance. With sufficiently high data throughput, even constantly traversing all directories may not be sufficient to keep them cached.
This should not be occurring with arc_meta_min set higher than the peak observed metadata_size / dnode_size. A possible workaround is to run find across the zpool from cron every minute (sketched below), but that shouldn't be necessary given that the current parameters should already be sufficient to keep this dnode metadata in the ARC.
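A sketch of that cron workaround, again with a placeholder pool path (this treats the symptom rather than the cause):

```sh
# Drop a per-minute warm-up walk into /etc/cron.d (/tank is a placeholder path):
echo '* * * * * root /usr/bin/find /tank > /dev/null 2>&1' > /etc/cron.d/zfs-warm-metadata
```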
Edit: in the time it took me to write those last two paragraphs, with the copy operation resumed (1GB/s from one zpool to another), the find operation once again returned to uncached performance. dnode_size remains at 5.04GB and metadata_size at 7.16GB.
It's even worse: I have found that metadata gets evicted even when there is no memory pressure or other data throughput at all, i.e. just a simple
rsync -av --dry-run /dir/subdirs/with/1mio+/files/altogether /tmp (which is effectively lstat()'ing all files recursively)
on a freshly booted system will make the ARC go crazy.
On my VM with 8 GB of RAM, I see the ARC collapsing during the initial and subsequent runs, and although all metadata fits in RAM, a second rsync run never performs well and is never served entirely from RAM (which is what I would expect). We can see Slab and SUnreclaim grow in /proc/meminfo.
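A quick sketch of watching that growth from both the kernel and the SPL side while the rsync runs (/proc/spl/kmem/slab should be present wherever the spl module is loaded):

```sh
grep -E '^(Slab|SUnreclaim):' /proc/meminfo

# the SPL/ZFS kmem caches keep their own accounting too:
head -n 20 /proc/spl/kmem/slab
```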
I discussed this on IRC and was advised to drop the dnode cache via echo 2 > /proc/sys/vm/drop_caches, but this does not really work for me.
There is definitely something stupid going on here, and ZFS caching apparently puts a spoke in its own wheel...
Possibly related:
- #10331 - fix dnode eviction typo in arc_evict_state()
- #10563 - dbuf cache size is 1/32nd what was intended
- #10600 - Revise ARC shrinker algorithm
- #10610 - Limit dbuf cache sizes based only on ARC target size by default
- #10618 - Restore ARC MFU/MRU pressure