ZFS: Metadata Allocation Class

Created on 15 Sep 2015 · 17 Comments · Source: openzfs/zfs

Intel is working on ways to isolate large-block file data from metadata for ZFS on Linux. In addition to the size discrepancy with file data, metadata often has a more transient lifecycle and additional redundancy requirements (ditto blocks). Metadata is also a poor match for a RAIDZ tier, since it cannot be dispersed and the relative parity overhead is high. Mirrored redundancy is a better choice for metadata.
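
To make the parity-overhead point concrete, here is a small back-of-envelope Python sketch. It approximates the RAIDZ allocation-size rounding (data sectors plus parity sectors, rounded up to a multiple of nparity + 1); the disk count, ashift, and block size below are illustrative assumptions, not anything specified by this proposal.

```python
# Back-of-envelope RAIDZ vs. mirror overhead for a small metadata block.
# Approximates the RAIDZ allocation rounding: data sectors + parity sectors,
# rounded up to a multiple of (nparity + 1). Disk count and ashift are
# illustrative assumptions.

def raidz_asize(psize, ndisks, nparity, ashift=12):
    """Approximate bytes allocated for a psize-byte block on RAIDZ."""
    sector = 1 << ashift
    data_sectors = -(-psize // sector)                         # ceil division
    parity_sectors = nparity * -(-data_sectors // (ndisks - nparity))
    total = data_sectors + parity_sectors
    total += (-total) % (nparity + 1)                          # round up
    return total * sector

def mirror_asize(psize, copies=2, ashift=12):
    """Bytes allocated for a psize-byte block on an n-way mirror."""
    return -(-psize // (1 << ashift)) * (1 << ashift) * copies

block = 4096  # a typical small metadata block
print(raidz_asize(block, ndisks=8, nparity=2))   # 12288 -> 200% overhead
print(mirror_asize(block))                       #  8192 -> 100% overhead
```

A single 4K metadata block on an 8-disk RAIDZ2 allocates about 12K (200% overhead), versus 8K (100% overhead) on a 2-way mirror, which is why mirrored VDEVs suit metadata better.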

A metadata-only allocation tier is being added to the existing storage pool allocation class mechanism and will serve as the primary source for metadata allocations. File data remains in the _normal class_. Each top-level metadata VDEV is tagged as belonging to the _metadata allocation class_ and at runtime becomes associated with a pool's metadata allocation class. The remaining (i.e. non-designated) top-level VDEVs default to the _normal allocation class_. In addition to generic metadata, the performance-sensitive deduplication table (DDT) can also benefit from having its own separate allocation class.

| Allocation Class | Purpose |
| --- | --- |
| Normal | Default source for all allocations |
| Log | ZIL records only |
| Metadata (new) | All metadata blocks (non-level-0 file blocks) |
| Dedup Table (new) | Deduplication Table Data (DDT) |

More details to follow.
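
As a rough conceptual sketch of the class routing described above (a metadata or dedup-table block goes to its dedicated class when one exists; everything else, and any overflow, lands in the normal class), selection could be modeled as follows. The class names, the DDT check, and the fall-back rule are illustrative assumptions, not the actual metaslab code.

```python
# Conceptual sketch of allocation-class routing: metadata and DDT blocks
# prefer their dedicated classes, everything else (and any overflow) goes
# to the normal class. Class names and the fall-back rule are illustrative,
# not the real metaslab interfaces.

from dataclasses import dataclass

@dataclass
class Block:
    size: int
    is_metadata: bool = False
    is_ddt: bool = False

def preferred_class(free_bytes, blk):
    """Pick the allocation class a block should come from."""
    if blk.is_ddt and "dedup" in free_bytes:
        return "dedup"
    if blk.is_metadata and "metadata" in free_bytes:
        return "metadata"
    return "normal"

def allocate(free_bytes, blk):
    """Allocate from the preferred class, falling back to normal when full."""
    cls = preferred_class(free_bytes, blk)
    if free_bytes.get(cls, 0) < blk.size:
        cls = "normal"                      # dedicated class exhausted
    free_bytes[cls] -= blk.size
    return cls

free = {"normal": 10 << 40, "metadata": 100 << 30, "dedup": 50 << 30}
print(allocate(free, Block(16 << 10, is_metadata=True)))   # -> metadata
print(allocate(free, Block(128 << 10)))                    # -> normal
```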

Feature

All 17 comments

I swear someone else was working on a metadata-specific vdev for much the same purpose.

@DeHackEd I have seen writings about the _concept_ but FWIW, no recollection of anyone else _working_ on it in an open source area.

Found today: https://github.com/zfsonlinux/zfs/issues/1071#issuecomment-10249646 mentioning Tegile, and Google finds e.g.:

It might be nice to allow (optionally!) one (or two?) of the metadata ditto blocks to reside in the ordinary pool, as well, making the metadata vdevs a kind of dedicated, write-through cache. (Different from L2ARC with secondarycache=metadata because they really would be holding authoritative copies of the metadata, but in dire straits could still be removed from the pool or used at lower fault tolerance -- one SSD instead of two in a mirror, etc.)

I just watched the livestream presentation on this. This is definitely a feature ZFS needs. I've struggled to keep metadata cached. The only way I've been able to keep a decent amount of metadata cached is to set the L2ARC to metadata only, but that takes many passes and still probably misses a significant amount. I would love to build pools with metadata on SSD tiers. Metadata typically accounts for about half of my disk I/O load.

There was a presentation from Nexenta a couple years back about tiered storage pools that had similar goals. http://www.open-zfs.org/w/images/7/71/ZFS_tiering.pdf I haven't heard anything of this effort since.

Hi @don-brady
May I ask what the current status of this feature is? Is there anything I can help with?
Thanks.

@tuxoko I'm hoping to post a public WIP branch soon. The creation/addition of VDEVs dedicated to specific metadata classes is functional. I'm currently working out accounting issues in the metaslab layer. We just started running ztest with metadata-only classes to help shake out any edge cases (found a few). Let me come up with a to-do list so others can help.

Sounds great!!

Hi @don-brady
Did some more work on, and measurements for, the #4365 implementation. It would be interesting to test this version as well.

It seems that the WIP pull request #5182 was never referenced here, or vice versa...

@don-brady,

It's been a year since the last update and I was wondering how this is progressing.

Specifically, I'm interested in DDT devices. Moving the DDT onto dedicated high-speed devices should allow dedupe to function nearly as fast as the current memory-only implementation, but require much less memory.

For storage of VMs, dedupe could easily save much more space than compression, but the current memory requirements usually make it too costly.
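
To put rough numbers on that, here is a quick Python estimate using the commonly cited figure of roughly 320 bytes of RAM per DDT entry; the pool size and block size are made-up examples, not anyone's actual setup.

```python
# Rough DDT sizing using the commonly cited ~320 bytes of RAM per
# deduplication-table entry. Pool size and block size are made-up examples.

DDT_ENTRY_BYTES = 320            # rule-of-thumb per-entry memory cost
TiB = 1 << 40

def ddt_ram_bytes(pool_bytes, avg_block_bytes):
    unique_blocks = pool_bytes // avg_block_bytes   # worst case: no duplicates
    return unique_blocks * DDT_ENTRY_BYTES

# 100 TiB of VM images at an 8K volblocksize -> roughly 3.9 TiB of DDT.
print(ddt_ram_bytes(100 * TiB, 8 * 1024) / TiB)
```

At 8K blocks, a fully unique 100 TiB of VM images implies a DDT of several TiB, which is why keeping the whole table in memory is rarely practical and dedicated fast DDT devices are attractive.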

See #5182 for the WIP. It's gone through a few iterations but I'm running an (old) version here. Very satisfied thus far. (No dedup, just regular metadata)

@pashford So far as I know, DDT metadata can reside on L2ARC devices. The only thing this would change is to permit writebacks to go to faster media, rather than the primary (spinning rust) storage. That seems like it's unlikely to be a huge improvement vs. just having the DDT hang around in L2ARC.

@DeHackEd,

Thanks for the information.

@nwf,

> The only thing this would change is to permit writebacks to go to faster media

If the writeback goes to faster media, then a future DDT miss in the ARC/L2ARC would also be served from that faster media, which would give a performance bump. As an example, if you have a 2PB pool of 7200RPM storage and a few fast SSDs (SATA, SAS or NVMe) as DDT devices, DDT performance WILL be better, especially if only a portion of the DDT is kept in memory.
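
A rough illustration of why the backing media matters (the service times below are typical ballpark figures, not measurements): if every DDT miss costs one random read, the sustainable miss rate per device is bounded by that device's random-read latency.

```python
# Ballpark only: if each DDT miss costs one random read, a single device's
# random-read latency bounds the sustainable miss rate. Typical figures,
# not measurements.

devices = {
    "7200 RPM HDD": 12.0,   # ms per random read (seek + rotation)
    "SATA SSD": 0.15,       # ms per small random read
}

for name, ms in devices.items():
    print(f"{name}: ~{1000 / ms:,.0f} DDT misses/s per device")
```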

Hi @don-brady, may I ask a silly question -- what are the redundancy requirements for DDT storage? I mean, would it be possible to reconstruct the deduplication table if the existing DDT data were lost?

( @don-brady ) To add, we are currently looking into enabling deduplication on our 0.5 PB research storage cluster, and I'd be very much interested in testing this feature. We are running zfsonlinux 0.6.5 (Ubuntu 16.04 LTS), but if you could point me in the direction of the most recent update ( https://github.com/zfsonlinux/zfs/pull/5182#issuecomment-315280160 ? ), I can start with build tests etc.

Is there anything left to do in this ticket, or should it be closed now that PR #5182 landed?

Yup, we can close this. Thanks.
