Zstd: `zstd` default behavior: to erase, or not to erase, that is the question

Created on 11 Aug 2016 · 6 Comments · Source: facebook/zstd

When invoking `gzip file`, `file` gets silently erased and replaced by a compressed version, `file.gz`.
The same happens on decompression, with `file.gz` silently erased and replaced by the original `file`.

This behavior mimics the old compress utility, from the early 1980s.
In turn, gzip's behavior was later mimicked by bzip2 and xz.

The only major novelty was brought by xz several years later, introducing the opt-in option `-k`, which preserves source files. It was later back-ported into recent versions of gzip. Before `-k`, the only way to avoid erasing the source file was to use redirection: `gzip < in > out`. Unfortunately, this method handles only one file per invocation, so compressing multiple files means launching the tool once per file, resulting in significant slowdowns.
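
As a minimal sketch of the two gzip behaviors described above (assuming a POSIX shell with gzip installed; the file names and scratch directory are placeholders chosen for illustration):

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'alpha\n' > a.txt
printf 'beta\n'  > b.txt
printf 'gamma\n' > c.txt

# Default behavior: a.txt is erased and replaced by a.txt.gz.
gzip a.txt

# Redirection keeps the sources, but each file needs its own gzip process.
for f in b.txt c.txt; do
    gzip < "$f" > "$f.gz"
done

ls
```

After this runs, `a.txt` is gone while `b.txt` and `c.txt` sit next to their `.gz` versions. The loop is what makes the redirection approach costly on large file sets.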

zstd currently supports both modes, using either `--rm` to automatically erase source files, or `-k` to preserve them.
The main difference from gzip is that, by default, it preserves source files.
Note that it's already possible to enforce one policy or the other locally with an alias, such as `alias zstd='zstd --rm'` or `alias zstd='zstd -k'`.
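
To make the two explicit modes concrete, here is a small sketch (assuming a POSIX shell; the file names are placeholders, and the guard simply skips the demo when zstd is not installed):

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'keep me\n'   > keep.txt
printf 'remove me\n' > remove.txt

if command -v zstd >/dev/null 2>&1; then
    zstd -q -k keep.txt       # explicit preserve: keep.txt and keep.txt.zst both exist
    zstd -q --rm remove.txt   # explicit erase: only remove.txt.zst remains
    ls
fi
```

Being explicit like this gives the same result whichever default the tool ships with, which is why the question here is only about the default.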

As we move towards v1.0, a remaining question is what the _default_ behavior of zstd should be. Keep in mind it's always possible to be explicit, so the question is only about the _default_ preference.

One reason to prefer "preserve by default" is that it feels safer. With an "erase by default" policy, you'd better be damn sure that data will be faithfully regenerated. Therefore, during the development versions (v0.x), "preserve by default" looked like the more reasonable option. However, this argument will be weakened with the release of v1.0, which signals high confidence in the regeneration process.

A new argument in favor of "preserve by default" is dictionary compression: any data compressed with a dictionary is now also dependent on this dictionary. If the dictionary gets lost or modified for whatever reason, any data depending on it will be undecodable.
That being said, setting different policies depending on whether compression uses a dictionary seems confusing, hence unreasonable. So the policy should be the same for all compression and decompression.

Obviously, the main argument in favor of "erase by default" is that this is the way gzip works today.

I would like to create a poll on this issue to gather feedback, but I don't see any option to do that on GitHub, so maybe just give your thoughts in the issue board,

or maybe use reaction icons, short of a better solution,
for example :+1: for (save) "preserve by default", and :-1: for (kill) "erase by default".

All 6 comments

:+1: Prefer non-destructive over destructive.

:+1:

:-1: If you're going to mimic gzip and xz then please do it as closely as possible. People are already familiar with the common params (`-0`..`-9`, `-k`, `-f`, `-c`, `-d`) and making it as simple as possible to replace gzip with zstd will go a long way to having it adopted in lots of places. I have to read the man pages every time I use rar, lrzip or (pk)zip; it's a real pain.

As shaneday said, in order to be super useful, mimicking gzip and xz makes it really easy to replace these tools with a "better" compressor and be done with it.

But the dictionary mode is super-scary - I'm used to my compressed files being "stand-alone decompressible", as that's the behavior for all the other tools.

I think these 2 scenarios should be treated with different behavior. Dictionary compression is scary because if you lose the dictionary, it's over and your files are lost. So don't delete the originals. But I'm never going to easily replace gzip with a dictionary-based tool, so when I'm using non-dictionary mode just delete the original.

I'd even go so far as to suggest that the default _file extension_ of dictionary vs non-dictionary compressed files should be different, to more clearly indicate to users which mode has been used to generate the files. (something like .zst and .dzst)

A dedicated file extension is an interesting suggestion, @chadnickbok.
We'll certainly look into it.

We had 2 separate polls, and they both ended up with "preserve source file by default" leading, by a very large margin.

So we'll start with this default behavior in zstd.

Keep in mind that `--rm` remains available as an opt-in option.
It can be set through an alias (for the command line) or as part of a script variable,
and it can be combined with `-k` (the last one wins),
so there are ways to select this behavior when it's preferred.
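
As a rough sketch of the script-variable approach and the "last one wins" rule (assuming a POSIX shell or bash, where the unquoted variable is split into words; `ZSTD_CMD` is a hypothetical variable name chosen for this example, not something zstd itself reads, and the guard skips the demo when zstd is not installed):

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' > one.txt
printf 'two\n' > two.txt

# Hypothetical script-level default: erase sources after compression.
ZSTD_CMD="zstd -q --rm"

if command -v zstd >/dev/null 2>&1; then
    $ZSTD_CMD one.txt       # --rm applies: one.txt is removed
    $ZSTD_CMD -k two.txt    # -k comes after --rm and wins: two.txt is kept
    ls
fi
```

Because the later flag overrides the earlier one, a script can set its preferred policy once and still make exceptions for individual calls.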
