Zstd: Provide ZSTD_boundDecompressedSize()

Created on 21 Feb 2019  路  11Comments  路  Source: facebook/zstd

When the decompressed size isn't written into the frame, and you want to use the one pass function, you are forced to guess the decompressed size. We could estimate the decompressed size using the number of blocks. It will be no larger than 128 KB larger than the decompressed size if flush isn't used. If flush is used, or some other block level API, then we could estimate way too large.

Example usage (without error checking):

std::string decompress(void* src, size_t size, size_t maxAlloc) {
    size_t const bound = std::min(maxAlloc, ZSTD_boundDecompressedSize(src, size));
    std::string out;
    out.resize(bound);
    out.resize(ZSTD_decompress(&out[0], out.size(), src, size));
    return out;
}

Edit: Implemented by @shakeelrao in PR #1543. This issue is left open to add support for legacy mode before the release.

feature request good first issue

Most helpful comment

There isn't explicit documentations, but they are all predecessors to zstd, so they are very similar.

Check out ZSTD_findFrameCompressedSizeLegacy, and each implementation like ZSTDv01_findFrameCompressedSize.

Lets start by using the macro BLOCKSIZE, instead of inspecting the frame header, since it is a bit simpler, and I'm not 100% confident that blockSize <= windowSize for every legacy version. @Cyan4973 may be able to help here. The function ZSTD_getFrameParams() is available for versions >= 0.4. However, we only need legacy support so we can say that the bound always works, it isn't super valuable to be as tight as possible for legacy formats, so I don't think the complexity is worth the gains.

All 11 comments

Hi @terrelln,

I'm new to the project, but would be interested in implementing this feature. Do you have any reference points for understanding how to compute the upper bound? Should I read the RFC?

  1. Read the block header section of the format spec. Each block has at most Block_Maximum_Decompressed_Size bytes.
  2. We will loop over each frame in the input and either use the content size if available, or count the number of blocks and multiple by Block_Maximum_Decompressed_Size.
  3. We can reuse the code in ZSTD_findFrameCompressedSize(), which loops over every block in a frame to find the compressed size. We can write a helper function which finds both the compressed size, and bounds the decompress size. Then ZSTD_findFrameCompressedSize() can return only the compressed size.
  4. We should support multiple appended frames like ZSTD_findDecompressedSize().

Thanks, @terrelln! Is this the rough idea?

bound = 0
for each frame in src:
  if Frame_Content_Size of frame exists:
    bound += Frame_Content_Size of frame
  else
    for each block in frame:
       bound += Block_Maximum_Decompressed_Size
return bound

Also, is Block_Maximum_Decompressed_Size a predefined constant?

Actually, now that I have read more code, is it accurate to say this function is an adaption of ZSTD_findDecompressedSize, but instead of returning an error if ZSTD_getFrameContentSize returns ZSTD_CONTENTSIZE_UNKNOWN, we calculate the block-based upper-bound?

  • Block_Maximum_Decompressed_Size is defined in the spec (you can search for the phrase).
  • That pseudocode looks right. Note that we'll want to handle legacy frames. We can start by returning an error, and then add in support later.
  • Yeah, it is exactly that. We can probably roll the block-based upper-bound into a helper also used by ZSTD_findFrameCompressedSize() so we avoid some code duplication.

Should we leave this open until legacy mode is implemented or does it make sense to create a new issue?

Lets leave it open, I'll edit the message.

@terrelln Is there documentation on the legacy format? I'd be interested continuing my work on this feature by implementing legacy support.

There isn't explicit documentations, but they are all predecessors to zstd, so they are very similar.

Check out ZSTD_findFrameCompressedSizeLegacy, and each implementation like ZSTDv01_findFrameCompressedSize.

Lets start by using the macro BLOCKSIZE, instead of inspecting the frame header, since it is a bit simpler, and I'm not 100% confident that blockSize <= windowSize for every legacy version. @Cyan4973 may be able to help here. The function ZSTD_getFrameParams() is available for versions >= 0.4. However, we only need legacy support so we can say that the bound always works, it isn't super valuable to be as tight as possible for legacy formats, so I don't think the complexity is worth the gains.

Is this issue considered done now? Thanks again for all the help!

Thanks @shakeelrao for implementing the decompress bound function!

Was this page helpful?
0 / 5 - 0 ratings