Zstd: Encoding errors when using a dictionary using zstdmt

Created on 8 Dec 2017 · 10 comments · Source: facebook/zstd

Dear zstd team,
There seems to be a problem with zstd decoding, as described in this bug:
#883816
I see the same problem on my system.

When using a pre-made dictionary, zstdmt will generate corrupted files:

[guus@haplo]/dev/shm/corpus>zstd --rm -D ../dictionary *
[guus@haplo]/dev/shm/corpus>zstd --rm -D ../dictionary -d *
[guus@haplo]/dev/shm/corpus>zstdmt --rm -D ../dictionary *
[guus@haplo]/dev/shm/corpus>zstdmt --rm -D ../dictionary -d *
ar,S=18008914:2,.zst : Decoding error (36) : Corrupted block detected
ar,S=6386609:2,S.zst : Decoding error (36) : Corrupted block detected
ar,S=6382007:2,S.zst : Decoding error (36) : Corrupted block detected
[...]
[guus@haplo]/dev/shm/corpus>zstd --rm -D ../dictionary -d *.zst
ar,S=18008914:2,.zst : Decoding error (36) : Corrupted block detected
ar,S=6386609:2,S.zst : Decoding error (36) : Corrupted block detected
ar,S=6382007:2,S.zst : Decoding error (36) : Corrupted block detected
[...]

Not all files are corrupt; only about 1% have problems.

thank you!

Label: bug

All 10 comments

We would need a reproduction case to investigate.
I tried this test setup using multiple test files, but none of them have failed so far.
The bug likely requires specific files to show up.

I'd like to, but the files I tried it on are from my personal mailbox, so you can imagine that I do not want to share them. I could not reproduce it on maildirs from public mailing lists such as debian-devel. However, I did notice that it is always the largest files that are corrupted. Out of ~1500 files, 1000 of them are between 1 and 10 kilobytes in size, 400 of them are between 10 and 100 kilobytes in size, and then there's the rest, which go up to 25 megabytes in size. The debian-devel mailing list doesn't have such a distribution; the largest file there is only 90 kilobytes.

I've been testing this setup again with an extended test set containing large files (> 10 MB),
but none of them have failed so far.
I'm afraid I'll need a reproduction case to be able to investigate.

_Correction_: I was testing under the dev branch.
Switching to v1.3.2, as in the Debian test, I can observe the bug, specifically when using zstdmt compression with a dictionary on large files.
It seems the bug is already fixed in the dev branch, as I cannot reproduce it using the same files and dictionary. It's fairly hard to determine exactly which change helps, but I would typically look in this direction.

I would recommend testing the dev branch on the test set where you observed the issue.

In the meantime, I will also put a warning notice on v1.3.2.

I've managed to create a test case I feel comfortable sharing, which reproduces the issue 100% of the time on my machine. Since it's too large to attach to a GitHub issue, I've temporarily made it available here:

http://tinc-vpn.org/temp/zstd-testcase.tar.xz

Note: I get this issue on a CPU with 6 cores and 12 threads.

Thanks for the reproduction case @gsliepen.
I can confirm the diagnosis: v1.3.2 fails on a few files (about a dozen, all > 4 MB),
while the dev branch version works just fine for all files.
Therefore, it seems a fix has already been integrated, and it will be part of the next release.

In the meantime, if you urgently need a quick fix, you could select one of these options:

  • if using the CLI: compress files one by one, since the issue only shows up when multiple files are compressed in the same command;
  • switch to the dev branch, since the bug is no longer triggered there;
  • disable multi-threading, since it's only useful for large files (> several MB);
  • disable the dictionary, since it's only effective for small files (< several KB), while its impact on large files is negligible;
  • route messages to use either the dictionary (when they are small) or multi-threading (when they are large), but not both at the same time.

git bisect with this script says that it is fixed by commit fc8d293460d6767558843ac8231c249dcc704382 from PR #891.

Thank you for looking into it and for the test case. As far as I can see, the fix from commit fc8d293 doesn't solve the issue. I still get "Corrupted block detected" messages with the test case provided by @gsliepen.

Thanks for testing @mestia! We believe we've figured out the issue, but hadn't posted here yet. Commit fc8d293 merely hides the issue when compressing a single file; it remains when compressing multiple files.

When compressing with multiple threads and with a dictionary, the window size can get confused, and a larger window size can be used for the second chunk than for the first. Since the window size of the first chunk is what is written in the frame header, the second chunk can contain offsets that are too large for the declared window.
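For context on why the declared window size matters: per the zstd frame format (RFC 8878), the frame header stores the window size in a single Window_Descriptor byte, split into a 5-bit exponent and a 3-bit mantissa. A minimal sketch of that decoding rule (the function name is mine, not from zstd):

```python
def window_size_from_descriptor(wd: int) -> int:
    """Decode a Window_Descriptor byte per the zstd frame format:
    the upper 5 bits are an exponent, the lower 3 bits a mantissa."""
    exponent = wd >> 3
    mantissa = wd & 7
    window_base = 1 << (10 + exponent)   # minimum window is 1 KB (exponent 0)
    window_add = (window_base // 8) * mantissa
    return window_base + window_add

# A descriptor of 0x50 (exponent 10, mantissa 0) declares a 1 MiB window;
# a decoder must reject any match offset beyond the declared window,
# which is exactly the "Corrupted block detected" error seen above.
```

So if the second chunk was compressed with a larger effective window than the one the header declares, its long-distance offsets exceed this limit and decoding fails.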

The dev branch has hidden this issue, but the root cause isn't solved yet. @cyan4973 is looking into it.

If you have corrupted data that you need to recover, you can interpret the frame header using the zstd format specification, change the window size to be larger, and then you should be able to decompress the data successfully. You can verify you got the header right with `zstd -lv file.zst`. The compressed frame contains a checksum by default, so you can be confident the result is correct (you can check with `zstd -l file.zst`).
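As a hedged sketch of that recovery path: the helper below (its name is mine, not part of zstd) bumps the Window_Descriptor exponent in a frame header. It only handles the common layout where the Single_Segment flag is unset, so byte 5 is the Window_Descriptor; the content checksum still lets you verify the result afterwards, since it covers the decompressed payload rather than the header.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # magic number of a zstd frame, little-endian

def enlarge_window(frame: bytes, extra_log: int = 1) -> bytes:
    """Return a copy of `frame` whose declared window size is enlarged
    by `extra_log` powers of two. Sketch only: assumes the frame does
    NOT set the Single_Segment flag (bit 5 of the descriptor byte)."""
    magic, = struct.unpack_from("<I", frame, 0)
    if magic != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    if frame[4] & 0x20:  # Single_Segment_flag set
        raise ValueError("this frame has no Window_Descriptor byte")
    exponent = frame[5] >> 3
    mantissa = frame[5] & 7
    new_wd = (min(exponent + extra_log, 31) << 3) | mantissa
    return frame[:5] + bytes([new_wd]) + frame[6:]
```

After patching, `zstd -lv` on the rewritten file should report the larger window, and a successful decompression with a matching checksum confirms the payload was intact.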

We are considering creating a new release of zstd later in the week, once the deeper root cause is fully investigated and fixed. This bug requires several conditions to be met simultaneously in order to show up, and indeed, the current dev branch masks the issue by making it impossible to trigger _all_ these conditions together from the CLI. But that doesn't mean the root cause is properly tackled.
More to come after investigation.


Closed #944 via e28305fcca441564a57758d84da8ef73afe7d85d.

I can confirm that this fixed the issue. Thanks!

