Zstd: Complete flush of input into compressed stream without losing current dictionary

Created on 28 Sep 2017 · 7 Comments · Source: facebook/zstd

While writing a compressed data stream, at certain times it is necessary for me to ensure that all input data is flushed to storage.

1) After ZSTD_flushStream, zstdcat does not accept the resulting compressed data as valid and does not output the most recent input data.
2) After a combination of ZSTD_endStream + ZSTD_initCStream, the result is valid and contains all input data, but the overall compression ratio degrades.

What I need is a way to flush all data received so far into the compressed stream, accepting a small size penalty for the last batch, while retaining the currently assembled dictionary (and whatever else supports the compression) for further compression into the same stream.

If there is an API to make this happen, it eludes me.

question

All 7 comments

ZSTD_flushStream() is the right API for your need.
It sends whatever remains in its internal buffers, keeping the current context alive, hence preserving compression ratio.

The problem is the association with zstdcat.
zstdcat expects "full frames", aka finished streams.
The way to finish a frame is to call ZSTD_endStream().
If operations stop before reaching such an end point, zstdcat considers it must be an error (truncated input data).

Should you create your own receiver, using ZSTD_decompressStream() API, you'll be in charge and will be able to consider an "unfinished frame" as nonetheless valid.

Thanks for the quick answer!

Tomorrow I will write my own simple decompression program. If it does the job, it would save about 50% of the space for my workload. If that works, I may look into adding an option to zstdcat to do this.

The issue is that after a flush, the compressed data depends on what was before the flush.
This means to decode a chunk, you still need to read all the previous chunks, so you can't just use zstdcat on a not-well-finished chunk: you'd need to give it the previous chunks as well.
That sounds like perhaps too specific a use case to fit zstdcat itself.

On the other hand, you can already zstdcat the concatenation of chunks (if the final chunk is properly finished):
zstdcat <(cat chunks-*.zst) | head

@gyscos: after flushing, the subsequent compressed data will be written into the same file. The flush ensures that all data written so far can be read by another person at that point in time.

Cyan4973's comment about ZSTD_flushStream() was to the point. With a self-written decoder I can read all the data, which zstdcat isn't able to do.

I don't understand why zstdcat doesn't output all available data. Even if a premature end of file is an error, withholding part of the available data from the user is simply not acceptable. In case of an accident (e.g. simply a full filesystem), recovering as much data as possible is very important.

> I don't understand why zstdcat doesn't output all available data.

It comes from here:
https://github.com/facebook/zstd/blob/dev/programs/fileio.c#L1260

The decoder expects to load a "certain amount" of data, as indicated by the streaming interface. Since the last block has only been flushed, it is not marked as the final block, so more data should follow: at least another block header.

When reading the expected amount of data from the file, the function discovers that there is not enough data to reach that amount, triggers an error, and stops there. As a consequence, the last block is not even decoded, since the error is detected and raised before reaching that point.

Maybe this behaviour could be changed, though there are side effects to watch out for, such as infinite loops, a missing truncated-data warning, etc.

In the latest update to the dev branch, the decoder has been modified to decode and flush the last full block before triggering a "truncated data" error. This will affect zstdcat as well.
Note that, although it outputs all generated data up to the last full block, this is still considered an error from the decoder's perspective, meaning the CLI will exit with an error code and print an error message on stderr (which can be disabled using -qq).
This modification will be part of the next release.

Thank you very much indeed!

This is exactly the property I need/expect from a decoder.

It was a pleasure to interact with the zstd community. I hadn't expected such a quick refinement after my comment. I'm pleasantly surprised :-)
