Requirements:
Idea:
Condition:
(l2arc_dev->"meta_buf_to_be_evicted_asize")
<
(pool->l2arc_available_asize + MIN(pool->l2arc_data_buf_asize, (pool->l2arc_data_buf_asize + pool->l2arc_meta_buf_asize) * l2arc_meta_limit_percent / 100%) - pool->l2arc_dev_count * 2 * L2ARC_WRITE_MAX * L2ARC_TRIM_AHEAD/100%)
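For illustration only, a minimal C sketch of how the condition above could be evaluated. All of the per-device and per-pool `*_asize` fields and `l2arc_meta_limit_percent` are hypothetical accounting values introduced by this proposal; only `l2arc_write_max` and `l2arc_trim_ahead` correspond to existing module parameters, and the values used here are purely illustrative.

```c
/*
 * Sketch of the proposed condition.  The *_asize fields and
 * l2arc_meta_limit_percent are assumed new accounting values from this
 * proposal; l2arc_write_max and l2arc_trim_ahead stand in for the
 * existing tunables referenced by the macros above.  Underflow/overflow
 * handling is omitted for brevity.
 */
#include <stdbool.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

typedef struct {
	uint64_t meta_buf_to_be_evicted_asize;	/* metadata about to be overwritten */
} l2arc_dev_hypo_t;

typedef struct {
	uint64_t l2arc_available_asize;	/* free space across all cache devices */
	uint64_t l2arc_data_buf_asize;	/* data currently cached in L2ARC */
	uint64_t l2arc_meta_buf_asize;	/* metadata currently cached in L2ARC */
	uint64_t l2arc_dev_count;	/* number of cache devices */
} pool_hypo_t;

/* Assumed tunables; the values are illustrative, not defaults. */
static uint64_t l2arc_meta_limit_percent = 75;	/* max share of L2ARC for metadata */
static uint64_t l2arc_write_max = 8 << 20;	/* per-feed write size */
static uint64_t l2arc_trim_ahead = 0;		/* percent trimmed ahead of the hand */

bool
l2arc_meta_fits(const l2arc_dev_hypo_t *dev, const pool_hypo_t *pool)
{
	uint64_t cached = pool->l2arc_data_buf_asize + pool->l2arc_meta_buf_asize;
	uint64_t meta_budget = MIN(pool->l2arc_data_buf_asize,
	    cached * l2arc_meta_limit_percent / 100);
	uint64_t trim_reserve = pool->l2arc_dev_count * 2 *
	    l2arc_write_max * l2arc_trim_ahead / 100;

	return (dev->meta_buf_to_be_evicted_asize <
	    pool->l2arc_available_asize + meta_budget - trim_reserve);
}
```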
Remarks:
Tunables:
Observables:
To me this sounds like additional complication with no obvious benefit. ZFS already has a small non-evictable metadata cache in RAM for the most important pool metadata. On top of that, normal ARC and L2ARC operation should ensure that (meta)data which is accessed at least occasionally gets cached. If for some reason you need all of your metadata to reside on SSDs, just add a special metadata vdev to your pool; that will be much more efficient from every perspective than using the L2ARC. The L2ARC should be used for cases where you cannot predict the active data set in advance, and in that context making some (meta)data more special than the rest, even if it is accessed only rarely, is a step in the wrong direction.
From a purely mechanical standpoint, I think there will be a problem with checksum verification. Since the L2ARC header in RAM does not store the checksum, the code reloading blocks from the L2ARC into the ARC won't be able to verify it unless there is an actual read request carrying the full block pointer.
The motivation is the wish to have an L2ARC which stores both data and metadata, but prioritizes metadata. Basically behaving as with secondarycache=metadata, but in addition also storing data on an opportunistic basis. Have your cake and eat it too.
Without requiring a complete redesign of the L2ARC. Without requiring separate partitions for data and metadata plus a secondarycache property configurable per L2ARC top-level vdev instead of once per pool, which in the end would most likely result in ineffective use of the physical L2ARC vdev.
In the end the idea is to keep the L2ARC as it is, and just prevent losing perfectly fine pool metadata when its storage area in the persistent L2ARC is being overwritten. The idea is not to store the complete pool metadata in the L2ARC, although yes, that could happen depending on L2ARC size, tunables and access patterns.
The special vdevs are very interesting, but they require interface ports and drive slots. And since their redundancy should be no less than that of the data disks of the pool, a raidz2 pool would require the ability to house and connect ~3 additional drives. While this is no issue for big iron, for SOHO setups it is quite often not possible.
Keeping rarely accessed metadata in the L2ARC should not be an issue. The L2ARC just has to be bigger than 0.1% (128 KiB blocksize) to ~3% (4 KiB blocksize) of the pool size, and/or a tunable like vfs.zfs.l2arc.meta_limit_percent has to be set to a value below 100%. The tunable would ensure that enough of the L2ARC remains available for randomly accessed (non-meta)data.
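As a rough sanity check on those percentages, here is a back-of-the-envelope sketch assuming metadata is dominated by one 128-byte block pointer per data block (dnodes, spacemaps and other structures ignored, so this is an illustration rather than an exact accounting):

```c
/*
 * Back-of-the-envelope check of the 0.1%..3% range quoted above.
 * Simplifying assumption: one 128-byte block pointer of indirect
 * metadata per data block, nothing else.
 */
#include <stdio.h>

int
main(void)
{
	const double blkptr_size = 128.0;		/* sizeof (blkptr_t) */
	const double recordsizes[] = { 4096.0, 131072.0 };

	for (int i = 0; i < 2; i++) {
		printf("recordsize %6.0f B -> metadata ~%.2f%% of pool size\n",
		    recordsizes[i], 100.0 * blkptr_size / recordsizes[i]);
	}
	return (0);
}
/* Prints ~3.12% for 4 KiB records and ~0.10% for 128 KiB records. */
```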
Regarding your point about the ZFS mechanics, do I understand your explanation correctly?
Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we tried to just read back L2ARC blocks, we would have no parent block and so would be missing the checksum needed to verify that the block was not corrupted?
Does this problem not also apply to reading back the persistent L2ARC? Was it solved with the log blocks? If yes, couldn't we use those log blocks to check that the data is uncorrupted?
FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?
Normally a block is read from the L2ARC by following a pointer stored in its parent block/buffer, which also contains the checksum of the L2ARC block? So if we tried to just read back L2ARC blocks, we would have no parent block and so would be missing the checksum needed to verify that the block was not corrupted?
Right. The L2ARC block checksum is identical to the normal block checksum, since the L2ARC uses the same compression/encryption; the block is just stored in a different place. It does not require separate storage.
Does this problem not also apply to reading back the persistent L2ARC? Was it solved with the log blocks? If yes, couldn't we use those log blocks to check that the data is uncorrupted?
Persistent L2ARC does not reload the data into the ARC, it only reconstructs the previous L2ARC headers on pool import. The log blocks have their own checksums, which do not cover the actual data blocks. Any possible corruption is detected later, when a read is attempted by an application, in which case the read is just silently redirected to the main storage.
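A simplified model of the behaviour described here, just to make the data flow concrete. The types and function names below are illustrative stand-ins, not the actual OpenZFS structures or routines: the checksum lives in the parent block pointer, the in-RAM L2ARC header only records where the copy sits on the cache device, and a failed verification falls back to the main pool.

```c
/*
 * Simplified model of the read path described above.  Types and function
 * names are illustrative stand-ins, not the OpenZFS ones; the point is
 * only where the checksum lives and what happens when it does not match.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parent block pointer: the only place the block's checksum is stored. */
typedef struct {
	uint64_t dva;		/* location on the main pool */
	uint64_t cksum;		/* checksum of the (compressed) block */
} blkptr_model_t;

/* In-RAM L2ARC header: cache device + offset only, no checksum field. */
typedef struct {
	int      l2_dev;	/* which cache device */
	uint64_t l2_daddr;	/* byte offset on that device */
} l2hdr_model_t;

/* Stubs standing in for the real I/O and checksum routines. */
static bool
read_from_l2(const l2hdr_model_t *h, void *buf, size_t len)
{
	(void) h; (void) buf; (void) len;
	return (true);		/* pretend the cache-device read succeeded */
}

static bool
read_from_main_pool(const blkptr_model_t *bp, void *buf, size_t len)
{
	(void) bp; (void) buf; (void) len;
	return (true);		/* pretend the pool read succeeded */
}

static uint64_t
compute_cksum(const void *buf, size_t len)
{
	(void) buf; (void) len;
	return (0);		/* placeholder checksum */
}

/*
 * Read a block, preferring the L2ARC copy.  Verification is only possible
 * because the caller supplies the parent block pointer; if the L2ARC copy
 * fails the check, the read silently falls back to the main pool.
 */
bool
model_read(const blkptr_model_t *bp, const l2hdr_model_t *l2hdr,
    void *buf, size_t len)
{
	if (l2hdr != NULL && read_from_l2(l2hdr, buf, len) &&
	    compute_cksum(buf, len) == bp->cksum)
		return (true);	/* verified L2ARC hit */

	/* L2ARC miss or corrupted copy: redirect to the main storage. */
	return (read_from_main_pool(bp, buf, len));
}
```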
Due to the smaller size of metadata, the same amount of L2ARC space will contain more metadata than data, and thus have a higher hit probability. Also (if I have not misunderstood the discussion), having data in the (L2)ARC is not really helpful if the corresponding metadata is not also cached and would need to be read from spinning rust. Getting rid of the separation would result in simpler code, but metadata would lose its VIP handling, and users would lose mechanisms to adapt their pools to their needs. In my opinion, until somebody performs an in-depth analysis of this topic which indisputably shows that the pros of getting rid of the separation outweigh the cons, including a rewrite of the ZFS code with the possibility of introducing errors, the implemented separation of metadata/data caching is clearly worth it.
Interesting, so the persistent L2ARC only reads back and checks the L2ARC headers for the ARC, and the L2ARC blocks themselves are only checked when accessed due to a cache hit.
As we should verify all data read from persistent media against its checksum, an implementation of this feature seems to require:
FYI, in Solaris 11, the metadata/data separation has been removed entirely. Can we be sure keeping the complexity of separate metadata/data caching is worth the trouble?
I think so: correct use of the metadata setting can make a very big difference when traversing datasets with millions of files. For example, I have an rsnapshot machine where the ARC caches both data and metadata, while the L2ARC caches metadata only. The performance improvement when iterating over these files (e.g. by rsync), compared to a similarly configured XFS, really is massive. Using secondarycache=metadata was a significant improvement over the default secondarycache=all setting.
So I would really like to maintain the data/metadata separation we have now.