ZFS: Metadata Allocation Class

Created on 15 Sep 2015 · 17 Comments · Source: openzfs/zfs

Intel is working on ways to isolate large-block file data from metadata for ZFS on Linux. In addition to the size discrepancy with file data, metadata often has a more transient lifecycle and additional redundancy requirements (ditto blocks). Metadata is also a poor match for a RAIDZ tier, since it cannot be dispersed and the relative parity overhead is high. Mirrored redundancy is a better choice for metadata.
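
To make the parity-overhead point concrete, here is a small back-of-envelope Python sketch. It approximates the RAIDZ allocation-size rounding (data sectors plus parity sectors, rounded up to a multiple of nparity + 1); the disk count, ashift, and block size below are illustrative assumptions, not anything specified by this proposal.

```python
# Back-of-envelope RAIDZ vs. mirror overhead for a small metadata block.
# Approximates the RAIDZ allocation rounding: data sectors + parity sectors,
# rounded up to a multiple of (nparity + 1). Disk count and ashift are
# illustrative assumptions.

def raidz_asize(psize, ndisks, nparity, ashift=12):
    """Approximate bytes allocated for a psize-byte block on RAIDZ."""
    sector = 1 << ashift
    data_sectors = -(-psize // sector)                         # ceil division
    parity_sectors = nparity * -(-data_sectors // (ndisks - nparity))
    total = data_sectors + parity_sectors
    total += (-total) % (nparity + 1)                          # round up
    return total * sector

def mirror_asize(psize, copies=2, ashift=12):
    """Bytes allocated for a psize-byte block on an n-way mirror."""
    return -(-psize // (1 << ashift)) * (1 << ashift) * copies

block = 4096  # a typical small metadata block
print(raidz_asize(block, ndisks=8, nparity=2))   # 12288 -> 200% overhead
print(mirror_asize(block))                       #  8192 -> 100% overhead
```

A single 4K metadata block on an 8-disk RAIDZ2 allocates about 12K (200% overhead), versus 8K (100% overhead) on a 2-way mirror, which is why mirrored VDEVs suit metadata better.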

A metadata-only allocation tier is being added to the existing storage pool allocation class mechanism and will serve as the primary source for metadata allocations. File data remains in the _normal class_. Each top-level metadata VDEV is tagged as belonging to the _metadata allocation class_ and at runtime becomes associated with a pool's metadata allocation class. The remaining (i.e. non-designated) top-level VDEVs default to the _normal allocation class_. In addition to generic metadata, the performance-sensitive deduplication table (DDT) can also benefit from having its own separate allocation class.

| Allocation Class | Purpose |
| --- | --- |
| Normal | Default source for all allocations |
| Log | ZIL records only |
| Metadata (new) | All metadata blocks (non-level-0 file blocks) |
| Dedup Table (new) | Deduplication Table Data (DDT) |

More details to follow.
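
As a rough conceptual sketch of the class routing described above (a metadata or dedup-table block goes to its dedicated class when one exists; everything else, and any overflow, lands in the normal class), selection could be modeled as follows. The class names, the DDT check, and the fall-back rule are illustrative assumptions, not the actual metaslab code.

```python
# Conceptual sketch of allocation-class routing: metadata and DDT blocks
# prefer their dedicated classes, everything else (and any overflow) goes
# to the normal class. Class names and the fall-back rule are illustrative,
# not the real metaslab interfaces.

from dataclasses import dataclass

@dataclass
class Block:
    size: int
    is_metadata: bool = False
    is_ddt: bool = False

def preferred_class(free_bytes, blk):
    """Pick the allocation class a block should come from."""
    if blk.is_ddt and "dedup" in free_bytes:
        return "dedup"
    if blk.is_metadata and "metadata" in free_bytes:
        return "metadata"
    return "normal"

def allocate(free_bytes, blk):
    """Allocate from the preferred class, falling back to normal when full."""
    cls = preferred_class(free_bytes, blk)
    if free_bytes.get(cls, 0) < blk.size:
        cls = "normal"                      # dedicated class exhausted
    free_bytes[cls] -= blk.size
    return cls

free = {"normal": 10 << 40, "metadata": 100 << 30, "dedup": 50 << 30}
print(allocate(free, Block(16 << 10, is_metadata=True)))   # -> metadata
print(allocate(free, Block(128 << 10)))                    # -> normal
```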

Feature

All 17 comments

I swear someone else was working on a metadata-specific vdev for much the same purpose.

@DeHackEd I have seen writings about the _concept_ but FWIW, no recollection of anyone else _working_ on it in an open source area.

Found today: https://github.com/zfsonlinux/zfs/issues/1071#issuecomment-10249646 mentioning Tegile, and Google finds e.g.:

It might be nice to allow (optionally!) one (or two?) of the metadata ditto blocks to reside in the ordinary pool, as well, making the metadata vdevs a kind of dedicated, write-through cache. (Different from L2ARC with secondarycache=metadata because they really would be holding authoritative copies of the metadata, but in dire straits could still be removed from the pool or used at lower fault tolerance -- one SSD instead of two in a mirror, etc.)

I just watched the livestream presentation on this. This is definitely a feature ZFS needs. I've struggled to keep metadata cached. The only way I've been able to keep a decent amount of metadata cached is to set the L2ARC to metadata only, but that takes many passes and still probably misses a significant amount. I would love to build pools with metadata on SSD tiers. Metadata typically accounts for about half of my disk I/O load.

There was a presentation from Nexenta a couple years back about tiered storage pools that had similar goals. http://www.open-zfs.org/w/images/7/71/ZFS_tiering.pdf I haven't heard anything of this effort since.

Hi @don-brady
May I ask what the current status of this feature is? Is there anything I can help with?
Thanks.

@tuxoko I'm hoping to post a public WIP branch soon. The creation/addition of VDEVs dedicated to specific metadata classes is functional. I'm currently working out accounting issues in the metaslab layer. We just started running ztest with metadata-only classes to help shake out any edge cases (found a few). Let me come up with a to-do list so others can help.

Sounds great!!

Hi @don-brady
Did some more work on, and measurements for, the #4365 implementation. It would be interesting to test this version as well.

It seems that the WIP pull request #5182 was never referenced here, or vice versa...

@don-brady,

It's been a year since the last update and I was wondering how this is progressing.

Specifically, I'm interested in DDT devices. Moving the DDT onto dedicated high-speed devices should allow dedupe to function nearly as fast as the current memory-only implementation, but require much less memory.

For storage of VMs, dedupe could easily save much more space than compression, but the current memory requirements usually make it too costly.
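
To put rough numbers on that, here is a quick Python estimate using the commonly cited figure of roughly 320 bytes of RAM per DDT entry; the pool size and block size are made-up examples, not anyone's actual setup.

```python
# Rough DDT sizing using the commonly cited ~320 bytes of RAM per
# deduplication-table entry. Pool size and block size are made-up examples.

DDT_ENTRY_BYTES = 320            # rule-of-thumb per-entry memory cost
TiB = 1 << 40

def ddt_ram_bytes(pool_bytes, avg_block_bytes):
    unique_blocks = pool_bytes // avg_block_bytes   # worst case: no duplicates
    return unique_blocks * DDT_ENTRY_BYTES

# 100 TiB of VM images at an 8K volblocksize -> roughly 3.9 TiB of DDT.
print(ddt_ram_bytes(100 * TiB, 8 * 1024) / TiB)
```

At 8K blocks, a fully unique 100 TiB of VM images implies a DDT of several TiB, which is why keeping the whole table in memory is rarely practical and dedicated fast DDT devices are attractive.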

See #5182 for the WIP. It's gone through a few iterations but I'm running an (old) version here. Very satisfied thus far. (No dedup, just regular metadata)

@pashford So far as I know, DDT metadata can reside on L2ARC devices. The only thing this would change is to permit writebacks to go to faster media, rather than the primary (spinning rust) storage. That seems like it's unlikely to be a huge improvement vs. just having the DDT hang around in L2ARC.

@DeHackEd,

Thanks for the information.

@nwf,

> The only thing this would change is to permit writebacks to go to faster media

If the writeback goes to faster media, then a future DDT miss in the ARC/L2ARC would also be served from that faster media, which would give a performance bump. As an example, if you have a 2PB pool of 7200RPM storage and a few fast SSDs (SATA, SAS or NVMe) as DDT devices, DDT performance WILL be better, especially if only a portion of the DDT is kept in memory.
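
A rough illustration of why the backing media matters (the service times below are typical ballpark figures, not measurements): if every DDT miss costs one random read, the sustainable miss rate per device is bounded by that device's random-read latency.

```python
# Ballpark only: if each DDT miss costs one random read, a single device's
# random-read latency bounds the sustainable miss rate. Typical figures,
# not measurements.

devices = {
    "7200 RPM HDD": 12.0,   # ms per random read (seek + rotation)
    "SATA SSD": 0.15,       # ms per small random read
}

for name, ms in devices.items():
    print(f"{name}: ~{1000 / ms:,.0f} DDT misses/s per device")
```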

Hi @don-brady, may I ask a silly question -- what are the redundancy requirements for DDT storage? I mean, would it be possible to reconstruct the deduplication table if the existing DDT data were lost?

( @don-brady ) To add, we are currently looking into enabling deduplication on our 0.5 PB research storage cluster, and I'd be very much interested in testing this feature. We are running zfsonlinux 0.6.5 (Ubuntu 16.04 LTS), but if you could point me in the direction of the most recent update ( https://github.com/zfsonlinux/zfs/pull/5182#issuecomment-315280160 ? ), I can start with build tests etc.

Is there anything left to do in this ticket, or should it be closed now that PR #5182 landed?

Yup, we can close this. Thanks.
