Zstd: Some questions about ZSTD_compress

Created on 22 Oct 2020  ·  6 Comments  ·  Source: facebook/zstd

Assuming that L(x) is the compressed size obtained by calling ZSTD_compress on buffer x, is it possible to ensure that L(x)+L(y) <= L(x+y)?
That is, that compressing the concatenation of buffers x and y gives a smaller result than compressing x and y separately.

question

All 6 comments

Hi @LucienXian,

No, it's not possible to make that guarantee in general. In fact, it will tend towards the opposite.

Why do you ask?

Oh, sorry, I wrote the formula in reverse. What I meant to ask is: can it be guaranteed that L(x+y) <= L(x)+L(y)?

I need to compress streaming data, and every time I have a compressed block (<= 4k), I write it to storage.
Suppose the first batch of streamed data reaches 4000 bytes. I compress it, and the compressed size might be 3000 bytes. Later, another 1000 bytes of raw data streams in. I compress those 1000 bytes on their own, say to 800 bytes (1000 -> 800). I then compress the combined raw data (4000+1000 bytes) and write the result to storage.
So I want to ensure that L(x+y) <= L(x)+L(y).

@LucienXian most of the time L(x+y) <= L(x)+L(y) will hold. But there can certainly be cases where it doesn't.

It sounds like in this case you are compressing each chunk twice. Instead, you could use streaming compression. For the first 4KB chunk, call ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_flush). Then continue the compression with the next chunk in the same manner. When you're ready to complete the compression, call ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end) to complete the frame.

Alternatively, you can assume that L(x+y) <= L(x)+L(y) and pass L(x)+L(y) as the output buffer size. If compression then fails because the result doesn't fit, recompress x and y independently and concatenate the frames. This should be unlikely. ZSTD_decompress() handles multiple concatenated frames, so decompression will still work as expected.

Thank you so much for your suggestion. But I have a question: when I use it like this, decompression fails.

// Pseudocode
// Compress: two independent frames, concatenated back to back
compressed_buf = ZSTD_compress(buf1) + ZSTD_compress(buf2);
// Decompress
rSize = ZSTD_getFrameContentSize(compressed_buf, compressed_buf.size()); // returns only the first frame's content size
decompressed_buf = ZSTD_decompress(compressed_buf, rSize); // fails: rSize is too small to hold both frames

When a buffer holds any number of concatenated frames, how can I get the total content size of all the frames from that buffer?
How should I use the APIs correctly?

ZSTD_getFrameContentSize() will return only the decompressed size of the first frame. If you need the total decompressed size, you need to use ZSTD_findDecompressedSize().

Thx! I got it.
