Zstd: Deterministic multi-threaded compression

Created on 10 Apr 2020  路  2Comments  路  Source: facebook/zstd

We have found z-standard to be most useful in a variety of situations. In an interesting performance enhancement test to our product, we found that for very large files (particularly when they are sparse), using z-standard as a preprocessor of data, before file hashing can produce some incredible performance gains above any optimization we could easily do on the hash algorithm itself.

This benefit is greatly enhanced by using multithreaded compression; however, for it to be valid the results would have to be deterministic, in that, given the same input data, number of threads, and parameters it would always produce the same byte-for-byte compressed results.

I have a feeling this is not the case; I鈥檓 not even 100% sure if this is the case for single threaded compression.

If this is not, would you consider something like this? For us, it would be okay if a new deterministic multi-threaded mode impacts performance and compression sizes, as long as it is typically faster than the single threaded equivalence (assuming that single-thread compression is already deterministic).

I would be happy to contribute some of my time to help such an endeavor.

Thanks much,
James

question

Most helpful comment

for it to be valid the results would have to be deterministic, in that, given the same input data, number of threads, and parameters it would always produce the same byte-for-byte compressed results.
I have a feeling this is not the case

Actually, zstd is fully deterministic.

It's even more deterministic than that : when using zstd as the CLI program,
for a given file, and a given version of zstd, compressed outcome is always the same for the same set of compression parameters (typically a compression level), though the nb of compression threads can be anything, from 1 to N, and the outcome will nonetheless remain exactly identical.

The objective is to make the compression step reproducible when executed on different hardware with different thread capabilities. It is one of the core properties that helped make zstd ArchLinux's default compression algorithm for their package distribution (pacman).

To be more complete, by default, the zstd CLI behaves as if it was always multi-threaded, even when only 1 thread is requested.

Conversely, when specifying the advanced command --single-thread, the compressed result will now be different (from regular -T#).
The situation is also different when using libzstd's API, as in this case, it is single-threaded by default, so the outcome will be similar to --single-thread.

The single-thread mode is also completely deterministic.
Single-thread and multi-thread modes produce different outputs, but both are fully deterministic, and the multi-thread mode doesn't care about the nb of threads specified.

All 2 comments

for it to be valid the results would have to be deterministic, in that, given the same input data, number of threads, and parameters it would always produce the same byte-for-byte compressed results.
I have a feeling this is not the case

Actually, zstd is fully deterministic.

It's even more deterministic than that : when using zstd as the CLI program,
for a given file, and a given version of zstd, compressed outcome is always the same for the same set of compression parameters (typically a compression level), though the nb of compression threads can be anything, from 1 to N, and the outcome will nonetheless remain exactly identical.

The objective is to make the compression step reproducible when executed on different hardware with different thread capabilities. It is one of the core properties that helped make zstd ArchLinux's default compression algorithm for their package distribution (pacman).

To be more complete, by default, the zstd CLI behaves as if it was always multi-threaded, even when only 1 thread is requested.

Conversely, when specifying the advanced command --single-thread, the compressed result will now be different (from regular -T#).
The situation is also different when using libzstd's API, as in this case, it is single-threaded by default, so the outcome will be similar to --single-thread.

The single-thread mode is also completely deterministic.
Single-thread and multi-thread modes produce different outputs, but both are fully deterministic, and the multi-thread mode doesn't care about the nb of threads specified.

This is such great news to hear! My admiration for the quality and usefulness of z-standard continues to increase.

Thanks much for the information; it is most helpful.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sergeevabc picture sergeevabc  路  3Comments

itsnotvalid picture itsnotvalid  路  3Comments

AbdulrahmanAltabba picture AbdulrahmanAltabba  路  3Comments

terrelln picture terrelln  路  3Comments

robert3005 picture robert3005  路  4Comments