Zstd: Encoding errors when using a dictionary using zstdmt

Created on 8 Dec 2017 · 10 comments · Source: facebook/zstd

Dear zstd team,
There seems to be a problem with zstd decoding, as described in this bug:
#883816
I see the same problem on my system.

When using a pre-made dictionary, zstdmt will generate corrupted files:

[guus@haplo]/dev/shm/corpus>zstd --rm -D ../dictionary *
[guus@haplo]/dev/shm/corpus>zstd --rm -D ../dictionary -d *
[guus@haplo]/dev/shm/corpus>zstdmt --rm -D ../dictionary *
[guus@haplo]/dev/shm/corpus>zstdmt --rm -D ../dictionary -d *
ar,S=18008914:2,.zst : Decoding error (36) : Corrupted block detected
ar,S=6386609:2,S.zst : Decoding error (36) : Corrupted block detected
ar,S=6382007:2,S.zst : Decoding error (36) : Corrupted block detected
[...]
[guus@haplo]/dev/shm/corpus>zstd --rm -D ../dictionary -d *.zst
ar,S=18008914:2,.zst : Decoding error (36) : Corrupted block detected
ar,S=6386609:2,S.zst : Decoding error (36) : Corrupted block detected
ar,S=6382007:2,S.zst : Decoding error (36) : Corrupted block detected
[...]

Not all files are corrupt; only about 1% have problems.

thank you!

Label: bug

All 10 comments

We would need a reproduction case to investigate.
I tried this test setup using multiple test files, but none of them have failed so far.
The bug likely requires specific files to show up.

I'd like to, but the files I tried it on are from my personal mailbox, so you can imagine that I do not want to share them. I could not reproduce it on maildirs from public mailing lists such as debian-devel. However, I did notice that it is always the largest files that are corrupted. Out of ~1500 files, 1000 of them are between 1 and 10 kilobytes in size, 400 of them are between 10 and 100 kilobytes in size, and then there's the rest, which go up to 25 megabytes in size. The debian-devel mailing list doesn't have such a distribution; the largest file there is only 90 kilobytes.

I've been testing this setup again with an extended test set containing large files (> 10 MB),
but none of them have failed so far.
I'm afraid I'll need a reproduction case to be able to investigate.

_Correction_: I was testing under the dev branch.
Switching to v1.3.2, as in the Debian test, I can observe the bug, specifically when using zstdmt compression with a dictionary on large files.
It seems the bug is already fixed in the dev branch, as I cannot reproduce it using the same files and dictionary. It's fairly hard to determine exactly which change helps, but I would typically look in this direction.

I would recommend testing the dev branch on the test set where you observed the issue.

In the meantime, I will also put a warning notice on v1.3.2.

I've managed to create a test case I feel comfortable sharing, which reproduces the issue 100% of the time on my machine. Since it's too large to attach to a GitHub issue, I've temporarily made it available here:

http://tinc-vpn.org/temp/zstd-testcase.tar.xz

Note: I get this issue on a CPU with 6 cores and 12 threads.

Thanks for the reproduction case @gsliepen.
I can confirm the diagnosis: v1.3.2 fails on a few files (about a dozen, all > 4 MB),
while the dev branch version works just fine for all files.
Therefore, it seems a fix has already been integrated, and it will be part of the next release.

In the meantime, if you urgently need a quick fix, you could select one of these options:

  • if using the CLI: compress files one by one, since the issue only shows up when multiple files are compressed in the same command;
  • switch to the dev branch, since the bug is no longer triggered there;
  • disable multi-threading, since it's only useful for large files (> several MB);
  • disable the dictionary, since it's only effective for small files (< several KB), while its impact on large files is negligible;
  • route messages to use either the dictionary (when they are small) or multi-threading (when they are large), but not both at the same time.

git bisect with this script says that it is fixed by commit fc8d293460d6767558843ac8231c249dcc704382 from PR #891.

Thank you for looking into it and for the test case. As far as I can see, the fix from commit fc8d293 doesn't solve the issue. I still get "Corrupted block detected" messages with the test case provided by @gsliepen.

Thanks for testing @mestia! We believe we've figured out the issue, but hadn't posted here yet. Commit fc8d293 merely hides the issue when compressing a single file; it remains when compressing multiple files.

When compressing with multiple threads and with a dictionary, the window size can get confused, and a larger window size can be used for the second chunk than for the first. Since the window size of the first chunk is what is written in the frame header, the second chunk can contain offsets that are too large for the declared window.
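For context on why the declared window size matters: per the zstd frame format (RFC 8878), the frame header stores the window size in a single Window_Descriptor byte, split into a 5-bit exponent and a 3-bit mantissa. A minimal sketch of that decoding rule (the function name is mine, not from zstd):

```python
def window_size_from_descriptor(wd: int) -> int:
    """Decode a Window_Descriptor byte per the zstd frame format:
    the upper 5 bits are an exponent, the lower 3 bits a mantissa."""
    exponent = wd >> 3
    mantissa = wd & 7
    window_base = 1 << (10 + exponent)   # minimum window is 1 KB (exponent 0)
    window_add = (window_base // 8) * mantissa
    return window_base + window_add

# A descriptor of 0x50 (exponent 10, mantissa 0) declares a 1 MiB window;
# a decoder must reject any match offset beyond the declared window,
# which is exactly the "Corrupted block detected" error seen above.
```

So if the second chunk was compressed with a larger effective window than the one the header declares, its long-distance offsets exceed this limit and decoding fails.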

The dev branch has hidden this issue, but the root cause isn't solved yet. @cyan4973 is looking into it.

If you have corrupted data that you need to recover, you can interpret the frame header using the zstd format specification, change the window size to be larger, and then you should be able to decompress the data successfully. You can verify you got the header right with `zstd -lv file.zst`. The compressed frame contains a checksum by default, so you can be confident the result is correct (you can check with `zstd -l file.zst`).
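As a hedged sketch of that recovery path: the helper below (its name is mine, not part of zstd) bumps the Window_Descriptor exponent in a frame header. It only handles the common layout where the Single_Segment flag is unset, so byte 5 is the Window_Descriptor; the content checksum still lets you verify the result afterwards, since it covers the decompressed payload rather than the header.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # magic number of a zstd frame, little-endian

def enlarge_window(frame: bytes, extra_log: int = 1) -> bytes:
    """Return a copy of `frame` whose declared window size is enlarged
    by `extra_log` powers of two. Sketch only: assumes the frame does
    NOT set the Single_Segment flag (bit 5 of the descriptor byte)."""
    magic, = struct.unpack_from("<I", frame, 0)
    if magic != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    if frame[4] & 0x20:  # Single_Segment_flag set
        raise ValueError("this frame has no Window_Descriptor byte")
    exponent = frame[5] >> 3
    mantissa = frame[5] & 7
    new_wd = (min(exponent + extra_log, 31) << 3) | mantissa
    return frame[:5] + bytes([new_wd]) + frame[6:]
```

After patching, `zstd -lv` on the rewritten file should report the larger window, and a successful decompression with a matching checksum confirms the payload was intact.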

We are considering creating a new release of zstd later in the week, once the deeper root cause is fully investigated and fixed. This bug requires several conditions to be met simultaneously in order to show up, and indeed, the current dev branch masks the issue by making it impossible to trigger _all_ these conditions together from the CLI. But that doesn't mean the root cause is properly tackled.
More to come after investigation.


Closed #944 via e28305fcca441564a57758d84da8ef73afe7d85d.

I can confirm that this fixed the issue. Thanks!

