Zstd: Q: Dictionary format maximum offsets

Created on 21 May 2020 · 6Comments · Source: facebook/zstd

As long as the amount of data decoded from this frame is less than or equal to Window_Size, sequence commands may specify offsets longer than the total length of decoded output so far to reference back to the dictionary, even parts of the dictionary with offsets larger than Window_Size.

After the total output has surpassed Window_Size however, this is no longer allowed and the dictionary is no longer accessible.

Is that really true? Offsets can be greater than WS until WS bytes has been decoded? Very much hoping I am reading that wrong.

documentation question

Source

klauspost

All 6 comments

It's correct: if any part of the dictionary is in the window, we allow references to any part of the dictionary. I agree that it's a somewhat awkward quirk of the spec, analyzed at that level.

However, it is motivated by (our expectations of) the realities of an implementation's memory management. The window generally lends itself to being backed by a circular buffer. As content is processed, it necessarily overwrites content window_size back in the stream, thus motivating the rolling limit on back-reference distance. But I would expect an implementation to use a dictionary in-place, as a monolithic buffer, against which no evictions need to be performed over the course of a compression. So why not make it all available?

I suppose this makes it unworkable to first copy the dictionary into the circular buffer and then just go from there. Is that what you wanted to do?

Our implementation doesn't take that strategy because it requires copying the dictionary, and using a circular buffer at all already means we need to have an implementation that can dispatch a contiguous offset space into two discrete memory buffers, so there isn't much simplification to be found from doing the copy.

felixhandte on 21 May 2020

I use a flat buffer for history - memcopying a bit outweighs the complexity of a circular buffer, but the problem is pretty much the same. Also, I just added a stricter check for offset lengths.

This complicates both a whole bunch. I guess there is nothing I can do now but be frustrated about it. I guess with fixed formats bad decisions are forever.

Guess my dictionary support just went from pretty simple to complex.

Thanks for the explanation.

klauspost on 21 May 2020

The scenario leading to this choice is along these lines :

Dictionary is "relatively large", e.g. ~100 KB
Data size to compress is "relatively small", e.g. ~1 KB
The window size has a direct relation with input data size, it directly commands memory allocation on the decoding size. For this small data, window size will be 1 or 2 KB, or even data size directly.
We still want compression and decompression processes to fully benefit from the whole dictionary, for efficiency, hence not limited to just the last 1 or 2 KB. Therefore, offsets are allowed to be larger than window size.

So that's the primary scenario leading to this design decision, made early 2016.
Since then, we have improved our understanding of dictionary compression scenarios, and we are aware that there are _different_ scenarios for which this design is less optimal, and a different spec would make implementation easier. But indeed, it required experience to get to that point, and by the time dictionary compression was created, we were, by definition, lacking such experience.

Cyan4973 on 21 May 2020

👍1

@Cyan4973 Thanks for the background. I do have these checks in place, so at least I don't have to add extra branches to the main loop.

I guess there is no restrictions on matches crossing from dictionary into the start of the decoded data. Since it isn't mentioned as "not allowed", I assume it is.

Your docs also doesn't mention if 0 offsets are explicitly disallowed. Since they don't make sense I assume it would be fine/expected to reject a dictionary where any initial offset is 0.

klauspost on 21 May 2020

I guess there is no restrictions on matches crossing from dictionary into the start of the decoded data. Since it isn't mentioned as "not allowed", I assume it is.

Yes, right. It's not exactly "greatly useful", but it's indeed allowed.

Your docs also doesn't mention if 0 offsets are explicitly disallowed

In a "normal" stream of offset values (encoded using fse), 0 cannot even be represented, so it's not a valid offset value.
You are right to note that, in the context of dictionary compression, it's possible to provide arbitrary repeat offset values, and since the format is more liberal (direct values, not fse encoded), it would be possible to provide 0 there. But an offset of 0 is indeed invalid.
(note : in the reference implementation, a repeat offset of 0 is automatically disabled. But it's indeed an implementation detail.)

Cyan4973 on 21 May 2020

Yes, right. It's not exactly "greatly useful", but it's indeed allowed.

Good. Shouldn't make much difference since I will keep the dictionary content separate anyway since I can't just slap it in front as regular history anyway. At least it is consistent, though it does imply a specific implementation.