Hi,
We recently upgraded zstd to 1.3.3 after reading about the performance improvements for high compression levels, and we were happy to see that the performance increase was very significant (around 40% for level 19). However, we also noticed that the output of zstd 1.3.3 is not binary-identical to zstd 1.3.2, and unfortunately that limits its usefulness for our particular use case because we rely on our compressed data not changing as we upgrade the libzstd library, which we'd like to do in order to get access to bugfixes, new features and performance improvements. We were previously using zlib which I guess hasn't had a bitstream-impacting change in many years.
Is bit-identical output across versions a goal of the zstd project, or do you expect these changes to happen for the foreseeable future?
Hi @jblazquez,
zstd only guarantees :
v1+ respect the specification, and can therefore be decompressed correctly by any decoder from version v1+.
However, zstd makes no guarantee of producing exactly the same compressed output when comparing 2 different versions. Such a restriction would greatly limit its capability to improve.
Bottom line : never expect 2 different versions of zstd to produce the same output. If it does, it's purely by chance.
Thanks for clarifying the compatibility guarantees @Cyan4973. I think those two guarantees - especially the first one - should be enough for us to unlock our ability to upgrade.
@Cyan4973
We are currently considering using zstd as the default compression for all our distro packages, but we would appreciate it if you could clarify the reproducibility guarantees that you mentioned above, with varying thread counts in different scenarios.
Do the following restrictions and variables always produce the same output?
- compression level: e.g. all use -18
- machines: e.g. some single-core, some 4-core and some 8-core machines
- -T0, which uses the number of physical CPU cores (please note that the requirement above spans single-core, 4-core and 8-core machines, as there are other compression algorithms that break the guarantee on a single-core machine)
Technically, -T0 varies the number of threads that you mentioned above, but we would like to use -T0 and still have a guarantee of reproducible output across different numbers of cores (single-core and multi-core).
Some tests show this may be the case, but we seek to have some official clarification before we assume that we can rely on it.
thanks in advance
Situation has evolved since this issue was opened, and is generally more friendly to reproducible builds.
With recent versions of zstd (v1.3.4+), the number of cores and the number of threads do not matter. -T0, -T1 (default), and generally -Tn all generate the same output.
The only important parameters are the zstd version number, and the compression level.
Things that can break this reproducibility pattern :
- Advanced parameters (--long=, --zstd=, etc.), though identical advanced parameters will give the same result.
- The --single-thread command. It's not the same as -T1, and will generate a slightly different output. However, --single-thread is stable with itself.
We will fix all bugs causing non-deterministic builds as long as they follow the constraints that @Cyan4973 laid out above. However, I'd definitely recommend adding zstd determinism tests that invoke zstd the same way you do in your builds. We test zstd for non-determinism, but you may invoke it in a different way that we've missed in our test coverage.
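
For anyone wanting to set up the kind of determinism test recommended above, here is a minimal sketch in Python. It is not part of zstd or its test suite. The helper names (`output_digest`, `is_deterministic`, `zstd_commands`) are made up for illustration, and it assumes a `zstd` binary on the PATH that reads from stdin and writes to stdout with `-c -q`. Adjust the command line so it matches how your build actually invokes zstd.

```python
import hashlib
import subprocess


def output_digest(cmd, data):
    """Run cmd with data on stdin and return the SHA-256 of its stdout."""
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def is_deterministic(commands, data, runs=3):
    """Return True if every command, on every run, yields byte-identical output."""
    digests = {output_digest(cmd, data) for cmd in commands for _ in range(runs)}
    return len(digests) == 1


def zstd_commands(level, thread_counts):
    """Build one zstd command line per thread count (hypothetical helper).

    Assumes the zstd CLI: -<level> sets the compression level, -T<n> the
    number of worker threads, -c writes to stdout, -q silences progress.
    """
    return [["zstd", f"-{level}", f"-T{n}", "-c", "-q"] for n in thread_counts]
```

A build check could then call `is_deterministic(zstd_commands(18, (0, 1, 4)), payload)` on a representative package payload and fail the build if it returns `False`. Since only the zstd version and the compression level are expected to matter, it makes sense to run the same check again whenever either of them changes.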