We have found z-standard to be most useful in a variety of situations. In an interesting performance enhancement test to our product, we found that for very large files (particularly when they are sparse), using z-standard as a preprocessor of data, before file hashing can produce some incredible performance gains above any optimization we could easily do on the hash algorithm itself.
This benefit is greatly enhanced by using multithreaded compression; however, for it to be valid the results would have to be deterministic, in that, given the same input data, number of threads, and parameters it would always produce the same byte-for-byte compressed results.
I have a feeling this is not the case; I鈥檓 not even 100% sure if this is the case for single threaded compression.
If this is not, would you consider something like this? For us, it would be okay if a new deterministic multi-threaded mode impacts performance and compression sizes, as long as it is typically faster than the single threaded equivalence (assuming that single-thread compression is already deterministic).
I would be happy to contribute some of my time to help such an endeavor.
Thanks much,
James
for it to be valid the results would have to be deterministic, in that, given the same input data, number of threads, and parameters it would always produce the same byte-for-byte compressed results.
I have a feeling this is not the case
Actually, zstd is fully deterministic.
It's even more deterministic than that : when using zstd as the CLI program,
for a given file, and a given version of zstd, compressed outcome is always the same for the same set of compression parameters (typically a compression level), though the nb of compression threads can be anything, from 1 to N, and the outcome will nonetheless remain exactly identical.
The objective is to make the compression step reproducible when executed on different hardware with different thread capabilities. It is one of the core properties that helped make zstd ArchLinux's default compression algorithm for their package distribution (pacman).
To be more complete, by default, the zstd CLI behaves as if it was always multi-threaded, even when only 1 thread is requested.
Conversely, when specifying the advanced command --single-thread, the compressed result will now be different (from regular -T#).
The situation is also different when using libzstd's API, as in this case, it is single-threaded by default, so the outcome will be similar to --single-thread.
The single-thread mode is also completely deterministic.
Single-thread and multi-thread modes produce different outputs, but both are fully deterministic, and the multi-thread mode doesn't care about the nb of threads specified.
This is such great news to hear! My admiration for the quality and usefulness of z-standard continues to increase.
Thanks much for the information; it is most helpful.
Most helpful comment
Actually,
zstdis fully deterministic.It's even more deterministic than that : when using
zstdas the CLI program,for a given file, and a given version of
zstd, compressed outcome is always the same for the same set of compression parameters (typically a compression level), though the nb of compression threads can be anything, from 1 to N, and the outcome will nonetheless remain exactly identical.The objective is to make the compression step reproducible when executed on different hardware with different thread capabilities. It is one of the core properties that helped make
zstdArchLinux's default compression algorithm for their package distribution (pacman).To be more complete, by default, the
zstdCLI behaves as if it was always multi-threaded, even when only 1 thread is requested.Conversely, when specifying the advanced command
--single-thread, the compressed result will now be different (from regular-T#).The situation is also different when using
libzstd's API, as in this case, it is single-threaded by default, so the outcome will be similar to--single-thread.The single-thread mode is also completely deterministic.
Single-thread and multi-thread modes produce different outputs, but both are fully deterministic, and the multi-thread mode doesn't care about the nb of threads specified.