ZFS: [Feature] Compression Migration Tool

Created on 21 Dec 2019 · 8 Comments · Source: openzfs/zfs

Describe the problem you're observing

Currently one can change the compression setting on a dataset, and new blocks will be compressed using the new algorithm. This works perfectly fine for many people during normal use.
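
For reference, the property change itself is a single command, but it only affects data written afterwards (the dataset name here is just a placeholder):

  zfs set compression=lz4 tank/data        # only blocks written from now on use lz4
  zfs get compression,compressratio tank/data

Existing blocks keep whatever algorithm they were written with until something rewrites them.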

However, there are three scenarios where we would want an easy way to recompress a complete dataset:

  • If one wants to change decompression speed for currently stored write-once-read-many data
  • If one wants to increase the compression ratio of currently compressed data
  • If we remove (or deprecate) a compression algorithm

While it's perfectly possible to send data to a new dataset and thus trigger a recompression (a rough sketch follows the list below), this has a few downsides:

  • It's not very accessible to less experienced users, for example (future) FreeNAS home users
  • It might mean downtime
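
A minimal sketch of that manual route, with placeholder names and gzip-9 purely as an example target:

  zfs snapshot tank/data@recompress
  zfs send tank/data@recompress | zfs receive -u -o compression=gzip-9 tank/data_new

Everything received into tank/data_new is written out again and therefore recompressed, but it requires a second copy of the data plus a final cut-over, which is where the downtime and the complexity come from.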

A preferred way to handle this would be a feature which recompresses existing data on disk, "in the background", just like a scrub or resilver. This also has the added benefit of letting us force it if we deprecate/replace/remove an algorithm.
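
For comparison, the scrub workflow this would mirror already gives users a background operation with visible progress and pause/stop control (pool name is a placeholder):

  zpool scrub tank        # start a background pass over all data
  zpool status tank       # shows scrub progress and state
  zpool scrub -p tank     # pause
  zpool scrub -s tank     # stop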

This feature would enable us to go beyond the deprecation requested in #9761.

Feature

All 8 comments

How do you plan to implement this? What is to happen to snapshots?
Recv already does this.

@scineram Snapshots would be a problem indeed.
I don't have a "plan" to implement this, otherwise I wouldn't file an issue ;)

How do you suggest we handle future removal of compression algorithms and zero-downtime changes of on-disk compression otherwise? I don't think recv covers this use case, or does it?

If so: Where is the documentation about using recv in this way?
It would have very low downtime, of course...

this requires block pointer rewrite

@richardelling Precisely, I didn't say it was going to be easy ;)

this requires block pointer rewrite

I personally would be fine if this feature initially behaved like/leveraged an auto-resumed local send/receive and some clone/upgrade-like switcheroo (and obeyed the same constraints, if unavoidable even temporarily using twice the required storage of the dataset being 'transformed') in the background with the user interface of a scrub (i.e. trigger it through a zfs subcommand, appears in zfs?/zpool status, gets resumed after reboots, can be paused, stopped, etc.).
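
An auto-resumed local send/receive is already possible today with resumable receive; a minimal sketch, with placeholder dataset names:

  zfs snapshot tank/data@migrate
  zfs send tank/data@migrate | zfs receive -s -o compression=gzip-9 tank/data_new
  # if the transfer is interrupted, fetch the token and resume where it stopped
  TOKEN=$(zfs get -H -o value receive_resume_token tank/data_new)
  zfs send -t "$TOKEN" | zfs receive -s tank/data_new

The requested feature would essentially wrap a loop like this behind a scrub-style interface.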

The applications for this go beyond just applying a different compression algorithm:

  • AFAIK this also applies to checksum algorithms (see the property example after this list).
  • Shouldn't this also convert xattrs to the sa format?
  • If there's sufficient free space on the pool, this can also be a form of defragmentation, right?
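
The first two bullets share the same new-writes-only limitation as compression today, for example (dataset name is a placeholder):

  zfs set checksum=sha512 tank/data   # new blocks only; existing blocks keep their old checksum
  zfs set xattr=sa tank/data          # newly written xattrs only; existing dir-based xattrs stay put

so a background rewrite would be the missing piece for all of these properties.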

One could hack something like this together using zfs send/recv; it'd probably involve a clone receive and some upgrade shenanigans, but it would definitely not be the same as having a canonical zfs subcommand with the above-mentioned UX; especially since it would somewhat cleanly resolve some "please unshoot my foot" situations that inexperienced and/or sleep-deprived users might get themselves into, for example choosing the wrong compression algorithm/level a year before realizing it, without the need to figure out and possibly script (recursive) zfs send and receive.
Also, zfs is probably in a better position to do a much cleaner in-place swap of the two versions of the dataset when the 'rewrite' is done, probably like a snapshot rollback, and will most likely not forget to delete the old version afterwards, unlike my hacky scripts, which break all the time. 😉
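
For reference, the "swap and clean up" such scripts end up doing looks roughly like this, a hedged sketch with placeholder names (@initial being the snapshot the new copy was originally received from, with writers quiesced for the final step):

  zfs snapshot tank/data@final
  zfs send -i @initial tank/data@final | zfs receive tank/data_new   # catch-up incremental
  zfs rename tank/data tank/data_old
  zfs rename tank/data_new tank/data
  zfs destroy -r tank/data_old             # the step that is easy to forget

A native subcommand could perform the equivalent switch atomically, more like a snapshot rollback, and would not leave the old copy lying around.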

Future future work ideas:

  • Defrag mode: Only rewrite fragmented datasets for some definition of fragmented. (Without knowing implementation details, sounds like it could be a two phase process like scrubs?)
  • -o encryption=on from off might be a useful thing to support, now that we allow unencrypted children
    -> A future PR might add a way to migrate between crypto ciphers.

@InsanePrawn

  • Good point, it should
  • Could be interesting
  • Considering all data would get read and rewritten sequentially, it would defrag the drive, yes.

One could hack something like this together using zfs send/recv; it'd probably involve a clone receive and some upgrade shenanigans, but it would definitely not be the same as having a canonical zfs subcommand with the above-mentioned UX

Yes, that's mostly the point... I think more advanced users can do things that get pretty close (and pretty hacky), but making it "as easy as possible" for the median user was the goal of my feature request...

@InsanePrawn, given enough space, yes, a transparent ZFS send/receive would be a way to go. All new writes go to the new dataset, and any read not yet available in the new dataset would fall back to the old dataset. Once the entire dataset is received, the old dataset is destroyed.

Theoretically, we could almost do it without enough space for the whole dataset: once a file is entirely copied to the new dataset, it could be deleted from the source dataset.

If something like this were implemented, resuming after a zpool export/import would also have to be part of the work. Otherwise, the pool would remain in a partially migrated state.

This does have the advantage of re-striping the data. Simple example: you have one vdev, and when it gets fullish you add a second vdev. The data on the first vdev, if not changed, stays only on that vdev, while newly written data will tend to favor the second vdev since it has the most free space. Something like the suggestion above can help balance data, even if we don't need to change checksum, compression or encryption algorithms.

Back to reality, snapshots & possibly even bookmarks would be a problem. Even clones of snapshots that reference the old dataset would still reference the old data & metadata (be it compression, checksum or encryption changes).

I think a simple "reseat" interface for a file/directory would be the most practical, i.e. an operation that transparently did this:

cp A TMP    # rewrite the file's blocks with the dataset's current properties
rm A
mv TMP A    # put the rewritten copy back under the original name

Perhaps not the easiest to implement. Lustre has a similar feature called "migrate", which is more about re-striping data.
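
For context, Lustre's lfs migrate rewrites a file in place with new layout parameters, e.g. (stripe count chosen arbitrarily):

  lfs migrate -c 4 /mnt/lustre/bigfile   # rewrite the file with a 4-way stripe layout

A ZFS "reseat" could expose a similarly file-scoped operation.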

Snapshots etc. should just keep referencing the old data.
