ZFS: L2ARC shall not lose valid pool metadata

Created on 20 Sep 2020 · 6 comments · Source: openzfs/zfs

Describe the feature you would like to see added to OpenZFS

Requirements:

  • ZFS shall keep all (cached & still valid) pool metadata in L2ARC if tunables and size of pool metadata, L2ARC and ARC allow
  • The code change shall be minimally invasive, without requiring a redesign of L2ARC or on-disk format changes
  • The feature shall be relevant for pools with L2ARC and secondarycache set to _all_ or _metadata_, and have no impact on pools without L2ARC or with secondarycache set to _none_ or e.g. _data_
  • The feature shall be compatible with pools with one or multiple top level L2ARC vdevs
  • The feature shall be enabled/disabled via a zfs tunable
  • The amount of metadata shall be limited to a percentage of the L2ARC size via a zfs tunable to avoid having a negative impact on pools with very small blocksize and/or small L2ARC size relative to pool size
  • The impact of the feature shall be visible via zfs observables
  • The feature shall not cause ARC buffers to be moved from MRU to MFU
  • The feature shall not falsify zfs statistics / observables

Idea:

  • Before the L2ARC feed thread deletes (overwrites or trims) the L2ARC area containing the oldest data, ARC_STATE_L2C_ONLY metadata in this area is read back into the ARC (a rough sketch follows after this list)
  • This is only performed if conditions allow; otherwise behaviour remains as it was before this feature was added.
  • To ensure the metadata read from the end of the L2ARC is written back to the L2ARC before it can be evicted from ARC, it might make sense to internally set a metadata L2ARC_HEADROOM to 0 when the feature is enabled and the condition allows
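
A minimal, non-compilable sketch of how such a hook might look. The l2arc_keep_meta tunable, l2arc_rescue_condition_ok(), l2arc_read_back_into_arc() and the *_hdr_in_region() iterators are hypothetical names invented for this proposal; only the header/device types, HDR_HAS_L1HDR(), HDR_ISTYPE_METADATA() and ARCSTAT_BUMP() are existing OpenZFS pieces:

```c
/*
 * Hypothetical hook, called by the feed thread just before it evicts
 * (overwrites or trims) an L2ARC region. All l2arc_rescue_* and
 * *_hdr_in_region() helpers are illustrative only.
 */
static void
l2arc_rescue_meta_before_evict(l2arc_dev_t *dev, uint64_t start, uint64_t size)
{
	arc_buf_hdr_t *hdr;

	if (!l2arc_keep_meta)		/* proposed tunable, default off */
		return;

	for (hdr = l2arc_first_hdr_in_region(dev, start, size); hdr != NULL;
	    hdr = l2arc_next_hdr_in_region(dev, hdr)) {
		/*
		 * Headers without an L1 header are in the arc_l2c_only
		 * state; only those holding metadata are candidates.
		 */
		if (HDR_HAS_L1HDR(hdr) || !HDR_ISTYPE_METADATA(hdr))
			continue;

		/* Skip the rescue when the capacity condition fails. */
		if (!l2arc_rescue_condition_ok(dev)) {
			ARCSTAT_BUMP(arcstat_l2_keep_meta_skip);
			continue;
		}

		/*
		 * Read the buffer back into ARC without promoting it to
		 * MFU, so a later feed cycle can rewrite it to L2ARC.
		 */
		l2arc_read_back_into_arc(hdr);
	}
}
```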

How will this feature improve OpenZFS?

  • The L2ARC feed thread will no longer delete still-valid metadata which is only cached in L2ARC.
  • With this, it could now make sense to warm up an L2ARC by reading the complete pool metadata (if zdb or zpool scrub had an option to use the ARC for metadata)
  • There should be no downside as the feature only activates when circumstances allow and is user-configurable
  • See first section

Additional context

Condition:

  • The pool L2ARC has enough free space to store the cached metadata completely. The calculation could be _similar_ to the following pseudo-code (assuming use of pool instead of vdev parameters; percent tunables are divided by 100); a hedged C sketch follows below:
    (l2arc_dev->meta_buf_to_be_evicted_asize) <
        pool->l2arc_available_asize
        + MIN(pool->l2arc_data_buf_asize,
              (pool->l2arc_data_buf_asize + pool->l2arc_meta_buf_asize)
                  * l2arc_meta_limit_percent / 100)
        - pool->l2arc_dev_count * 2 * L2ARC_WRITE_MAX * L2ARC_TRIM_AHEAD / 100
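
To make the arithmetic concrete, here is a self-contained C model of that condition, using plain integers and an illustrative stats struct rather than real OpenZFS types (all names are stand-ins; percent values are passed as 0-100):

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN(a, b)	((a) < (b) ? (a) : (b))

/* Illustrative pool-wide aggregates; all sizes in bytes. */
typedef struct pool_l2arc_stats {
	uint64_t l2arc_available_asize;	/* free space across all cache vdevs */
	uint64_t l2arc_data_buf_asize;	/* asize of data buffers in L2ARC */
	uint64_t l2arc_meta_buf_asize;	/* asize of metadata buffers in L2ARC */
	uint64_t l2arc_dev_count;	/* number of L2ARC top-level vdevs */
} pool_l2arc_stats_t;

/*
 * Return true when the metadata about to be evicted from one cache device
 * still fits into the pool-wide budget, i.e. the rescue may proceed.
 */
static bool
l2arc_meta_rescue_allowed(const pool_l2arc_stats_t *p,
    uint64_t meta_to_be_evicted_asize, uint64_t meta_limit_percent,
    uint64_t write_max, uint64_t trim_ahead_percent)
{
	uint64_t total = p->l2arc_data_buf_asize + p->l2arc_meta_buf_asize;
	uint64_t budget = MIN(p->l2arc_data_buf_asize,
	    total * meta_limit_percent / 100);
	uint64_t reserve = p->l2arc_dev_count * 2 * write_max *
	    trim_ahead_percent / 100;

	/* Guard against unsigned underflow before applying the formula. */
	if (reserve > p->l2arc_available_asize + budget)
		return (false);

	return (meta_to_be_evicted_asize <
	    p->l2arc_available_asize + budget - reserve);
}
```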

Remarks:

  • Reading back metadata in the ARC without impacting MRU/MFU assignment might require adding or updating functions in arc.c
  • Depending on the implementation of the condition, the asize of the data and metadata buffers stored in the pool's L2ARC needs to be available in the code.
  • Ensuring the metadata read from the end of the L2ARC is not immediately (in the same feed-thread pass) written back to (the same) L2ARC should allow for some load-balancing in case of multiple L2ARC top-level vdevs (e.g. when another L2ARC vdev was just added).

Tunables:

  • vfs.zfs.l2arc.keep_meta: (or a better name), 0=old behaviour, 1=new behaviour, default=0 (at the moment)
  • vfs.zfs.l2arc.meta_limit_percent: 0-100, default=100 (keeps original behaviour if keep_meta=0)

Observables:

  • kstat.zfs.misc.arcstats.l2_keep_meta_skip: (or a better name), a counter of feed cycles in which the condition paused the feature. Alternatively, the number of feed cycles in which metadata was evicted (there might already be a kstat for this). A declaration sketch covering both the tunables and this counter follows below.
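
A sketch of how the proposed knobs could be declared in module/zfs/arc.c, assuming the names from this proposal (l2arc_keep_meta, l2arc_meta_limit_percent, l2_keep_meta_skip); ZFS_MODULE_PARAM, ZMOD_RW and ARCSTAT_BUMP are existing mechanisms, everything else here is hypothetical:

```c
/* Proposed tunables (surface as vfs.zfs.l2arc.* on FreeBSD). */
static int l2arc_keep_meta = 0;				/* 0 = old behaviour */
static unsigned int l2arc_meta_limit_percent = 100;	/* % of L2ARC for metadata */

ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, keep_meta, INT, ZMOD_RW,
	"Rescue L2ARC-only metadata before its region is overwritten");
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_limit_percent, UINT, ZMOD_RW,
	"Max percentage of L2ARC space usable for rescued metadata");

/*
 * Proposed observable: a new counter in arc_stats, initialized as
 * { "l2_keep_meta_skip", KSTAT_DATA_UINT64 } and bumped via
 * ARCSTAT_BUMP(arcstat_l2_keep_meta_skip) whenever the condition
 * pauses the rescue.
 */
```
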
Label: Feature

All 6 comments

To me this sounds like additional complication with no obvious benefits. ZFS already has a small non-evictable metadata cache in RAM for the most important pool metadata. On top of that, normal ARC and L2ARC operation should ensure that (meta)data accessed at least sometimes is cached. If for some reason you need all of your metadata to reside on SSDs, just add a special metadata vdev to your pool; that will be much more efficient from all perspectives than using L2ARC. L2ARC should be used for cases where you cannot predict the active data set in advance, and in that context making some (meta)data more special than others, even if accessed only rarely, is a step in the wrong direction.

From a purely mechanical standpoint, I think there will be a problem with checksum verification. Since the L2ARC header in RAM does not store it, unless there is an actual read request with a full block pointer, the code reloading blocks from L2ARC into ARC won't be able to verify the checksum.

The motivation is the wish to have an L2ARC which stores data and metadata, but prioritizes metadata. Basically behaving as with secondarycache=metadata, but in addition also storing data on an opportunistic basis. Have your cake and eat it too.

And without requiring a complete redesign of the L2ARC, separate partitions for data and metadata, or a secondarycache property configurable per L2ARC top-level vdev instead of once per pool; such a scheme would most likely end up using the physical L2ARC vdev inefficiently anyway.

In the end the idea is to keep the L2ARC as it is, and just prevent losing perfectly fine pool metadata when its storage area in the persistent L2ARC is being overwritten. The idea is not to store the complete pool metadata in the L2ARC, but yes, that could happen depending on L2ARC size, tunables and access patterns.

The special vdevs are very interesting but require interface ports and drive slots. And as their redundancy should be no less than that of the data disks of the pool, a raidz2 pool would require the ability to house and connect ~3 additional drives. While this is no issue for big iron, for SOHO setups it is quite often not possible.

Keeping rarely accessed metadata in the L2ARC should not be an issue. The L2ARC just has to be bigger than 0.1% (128 KiB blocksize) to ~3% (4 KiB blocksize) of the pool size, and/or a tunable like vfs.zfs.l2arc.meta_limit_percent has to be set to a value <100%. The tunable would ensure that enough of the L2ARC remains available for randomly accessed (non-meta)data.
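
A back-of-the-envelope illustration using the percentages above (assumed ratios, not measurements, ignoring ashift and compression effects):

```
10 TiB pool, 128 KiB recordsize:   ~0.1% metadata  ≈  10 GiB
10 TiB pool,   4 KiB blocksize:    ~3%   metadata  ≈ 300 GiB
```

So a cache device of a few hundred GiB can typically hold all metadata at large recordsizes, while small-block pools are the case where a meta_limit_percent cap would matter.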

Regarding your point about the zfs mechanics, do I understand your explanation correctly?

Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we just read back L2ARC blocks, we would have no parent block and would therefore be missing the checksum needed to verify that the block was not corrupted?

Is this not a problem that also applies to reading back the persistent L2ARC? Was this solved with the log blocks? If yes, couldn't we use those log blocks to check that the data is uncorrupted?

FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?

Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we just read back L2ARC blocks, we would have no parent block and would therefore be missing the checksum needed to verify that the block was not corrupted?

Right. The L2ARC block checksum is identical to the normal block checksum, since the block uses the same compression/encryption and is just stored in a different place. It does not require separate storage.
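
For context, a simplified (compilable, but purely illustrative) model of where the checksum lives; the real definitions are blkptr_t/zio_cksum_t in include/sys/spa.h and the L2ARC-only header in the ARC code:

```c
#include <stdint.h>

/* Stand-in for the real zio_cksum_t (a 256-bit checksum). */
typedef struct zio_cksum_model { uint64_t zc_word[4]; } zio_cksum_model_t;

/* Stand-in for blkptr_t: the checksum travels with the parent's pointer. */
typedef struct blkptr_model {
	/* ...DVAs, properties, birth TXGs elided... */
	zio_cksum_model_t blk_cksum;	/* checksum of the pointed-to block */
} blkptr_model_t;

/* Stand-in for the in-RAM L2ARC header: location only, no checksum. */
typedef struct l2hdr_model {
	uint64_t b_daddr;	/* offset of the cached copy on the device */
	uint64_t b_asize;	/* allocated size on the device */
} l2hdr_model_t;
```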

Is this not a problem that also applies to reading back the persistent L2ARC? Was this solved with the log blocks? If yes, couldn't we use those log blocks to check that the data is uncorrupted?

Persistent L2ARC does not reload the data into ARC, it only reconstructs the previous L2ARC headers on pool import. The log blocks have their own checksums, which don't cover the actual data blocks. Any possible corruption is detected later when a read is attempted by the application, in which case the read is just silently redirected to the main storage.

Due to the smaller size of metadata, the same amount of L2ARC space will contain more metadata than data and thus have a higher hit probability. Also (if I have not misunderstood the discussion) having data in the (L2)ARC is not really helpful if the corresponding metadata is not also cached and would need to be read from spinning rust. Getting rid of the separation would result in simpler code, but metadata would lose its VIP handling, and users would lose mechanisms to adapt their pool to their needs. In my opinion, until somebody performs an in-depth analysis on this topic which indisputably shows that the pros of getting rid of the separation outweigh the cons, including a rewrite of the ZFS code with the possibility of introducing errors, the implemented separation of metadata/data caching is clearly worth it.

Interesting, so the persistent L2ARC only reads back and checks the ARC L2ARC headers, and the L2ARC blocks are only checked when accessed due to a cache hit.

As we shall verify all data read from persistent media against its checksum, an implementation of this feature seems to require:

  • A mechanism to find the parent metadata of an L2ARC-only metadata block, to be able to verify the checksum of the block being read back (a conceptual sketch follows after this list).
  • Only metadata whose parent metadata is cached in ARC/L2ARC shall be "rescued" from L2ARC overwrite.
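
A conceptual, non-compilable sketch of the rescue path under those two requirements. l2arc_find_parent_bp() and l2arc_arc_read_no_mfu() are hypothetical helpers invented here, while spa_t, blkptr_t, arc_buf_hdr_t and SET_ERROR() are existing OpenZFS pieces:

```c
static int
l2arc_rescue_meta_block(spa_t *spa, arc_buf_hdr_t *hdr)
{
	blkptr_t bp;

	/*
	 * Locate the parent metadata block (it must itself be cached in
	 * ARC/L2ARC) and extract the block pointer naming this block.
	 */
	if (l2arc_find_parent_bp(spa, hdr, &bp) != 0)
		return (SET_ERROR(ENOENT));	/* parent not cached: skip */

	/*
	 * Issue a normal ARC read against that block pointer; the ZIO
	 * pipeline then verifies the checksum for us, and a mismatch is
	 * transparently serviced from the main pool vdevs as usual.
	 */
	return (l2arc_arc_read_no_mfu(spa, &bp, hdr));
}
```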

FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?

I think so: correct use of the metadata property can make a very big difference when traversing datasets with millions of files. For example, I have an rsnapshot machine where the ARC caches both data and metadata, while the L2ARC caches metadata only. The performance improvement when iterating over these files (e.g. by rsync), compared to a similarly configured XFS, really is massive. Using secondarycache=metadata was a significant improvement over the default secondarycache=all setting.
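
For reference, the configuration described above uses the standard properties (the dataset name is just an example):

```
# ARC caches data and metadata; L2ARC caches metadata only
zfs set primarycache=all tank/rsnapshot
zfs set secondarycache=metadata tank/rsnapshot
```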

So I would really like to maintain the data/metadata separation we have now.
