Zstd: Clarify zstd compressor output compatibility guarantees across versions

Created on 22 Jan 2018 · 5 Comments · Source: facebook/zstd

Hi,

We recently upgraded zstd to 1.3.3 after reading about the performance improvements for high compression levels, and we were happy to see a very significant speedup (around 40% for level 19). However, we also noticed that the output of zstd 1.3.3 is not binary-identical to that of zstd 1.3.2. Unfortunately, that limits its usefulness for our particular use case: we rely on our compressed data not changing as we upgrade the libzstd library, which we'd like to do in order to get access to bugfixes, new features and performance improvements. We were previously using zlib, which I guess hasn't had a bitstream-impacting change in many years.

Is bit-identical output across versions a goal of the zstd project, or do you expect these changes to happen for the foreseeable future?

question

All 5 comments

Hi @jblazquez,

zstd only guarantees :

  • Format compliance : any compressed data produced by version v1+ respects the specification, and can therefore be decompressed correctly by any decoder from version v1+.
  • Reproducibility : for a given compression level, binary version, source data, and number of threads, the compressed data will always be the same.

However, zstd makes no guarantee of producing exactly the same compressed output when comparing 2 different versions. Such a restriction would greatly limit its capability to improve.

Bottom line : never expect 2 different versions of zstd to produce the same output. If they do, it's purely by chance.
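One practical consequence of these two guarantees is that a consumer can compare decompressed content instead of compressed bytes. The sketch below illustrates the idea only: `content_digest` is a hypothetical helper, and two zlib compression levels from the Python standard library stand in for "two zstd versions", since both situations give different compressed bytes for the same payload.

```python
import hashlib
import zlib


def content_digest(compressed, decompress):
    """SHA-256 of the decompressed payload, independent of compressor output."""
    return hashlib.sha256(decompress(compressed)).hexdigest()


payload = b"package contents " * 1000

# Assumption for illustration: zlib levels 1 and 9 play the role of an
# "old" and a "new" compressor version encoding the same source data.
old_build = zlib.compress(payload, 1)
new_build = zlib.compress(payload, 9)

# The compressed bytes are not expected to match...
same_bytes = old_build == new_build
# ...but both streams decode to the same payload, so the digests agree.
same_content = (
    content_digest(old_build, zlib.decompress)
    == content_digest(new_build, zlib.decompress)
)
print(same_bytes, same_content)
```

With zstd, the same pattern would use a zstd decoder in place of `zlib.decompress`; the format-compliance guarantee is what makes the content digest stable across versions.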

Thanks for clarifying the compatibility guarantees @Cyan4973. I think those two guarantees - especially the first one - should be enough for us to unlock our ability to upgrade.

@Cyan4973
We are currently considering the possibility of using zstd as the default compression for all our distro packages, but we would appreciate it if you could clarify the reproducibility guarantees that you mentioned above when the number of threads varies across different scenarios.

Does compression always produce the same output under all of the following restrictions and variables?

  • all compression operations always use the very same zstd version
  • fixed compression level, e.g. all use -18
  • varying number of hardware CPU cores, e.g. some single-core machines, some 4-core and some 8-core machines
  • fixed value of -T0, which uses the number of physical CPU cores (please note that the requirement above mixes single-core, 4-core and 8-core machines, as there are other compression algorithms that break the guarantee on a single-core machine)

Technically, -T0 varies the number of threads that you mentioned above, but we would like to use -T0 and still have a guarantee of reproducible output across different numbers of cores (single core + multi core).

Some tests show this may be the case, but we seek to have some official clarification before we assume that we can rely on it.
thanks in advance

Situation has evolved since this issue was opened, and is generally more friendly to reproducible builds.

With recent versions of zstd (v1.3.4+), the number of cores and the number of threads do not matter. -T0, -T1 (default), and generally -Tn all generate the same output.
The only important parameters are the zstd version number, and the compression level.

Things that can break this reproducibility pattern :

  • altering the compression level by adding advanced parameters (--long=, --zstd=, etc.), though identical advanced parameters will give the same result.
  • adding the --single-thread option. It's not the same as -T1, and will generate a slightly different output. However, --single-thread is stable with itself.

We will fix all bugs causing non-deterministic builds as long as they follow the constraints that @Cyan4973 laid out above. However, I'd definitely recommend adding zstd determinism tests that invoke zstd the same way you do in your builds. We test zstd for non-determinism, but you may invoke it in a different way that we've missed in our test coverage.
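A determinism test of the kind suggested here could look roughly like the sketch below. It assumes the `zstd` command-line tool is on `PATH` and uses only the `-T#`, `-q` and `-c` options plus a numeric level; the helper names (`zstd_command`, `compress_with_cli`, `group_by_digest`, `check_determinism`) and the default flags are hypothetical and should be replaced with whatever your build actually runs.

```python
import hashlib
import subprocess


def zstd_command(path, level=18, extra=("-T0",)):
    """Build the zstd CLI invocation; replace the flags with your build's own."""
    # -q silences progress output, -c writes the compressed stream to stdout.
    return ["zstd", f"-{level}", *extra, "-q", "-c", str(path)]


def compress_with_cli(command):
    """Run one invocation and return the compressed bytes from stdout."""
    return subprocess.run(command, check=True, capture_output=True).stdout


def group_by_digest(outputs):
    """Group labels by the SHA-256 of their output; one group means identical bytes."""
    groups = {}
    for label, data in outputs.items():
        groups.setdefault(hashlib.sha256(data).hexdigest(), []).append(label)
    return groups


def check_determinism(path, variants=("-T0", "-T1", "-T4"), runs=2):
    """Compress `path` several times per variant and report the digest groups."""
    outputs = {}
    for flag in variants:
        for run in range(runs):
            command = zstd_command(path, extra=(flag,))
            outputs[f"{flag} run {run}"] = compress_with_cli(command)
    return group_by_digest(outputs)
```

Calling `check_determinism` on a representative package file and asserting that it returns exactly one group would flag any divergence between runs or thread settings. Based on the comments above, adding `--single-thread` to `variants` would be expected to produce a second group rather than indicate a bug.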
