Using command-line zstd, experimenting with maximum compression options. Everything is fine until I get to a 13 GB file. Command line:
zstd --ultra -22 --zstd=wlog=31,clog=30,hlog=30,slog=26 -c
Decompression prints this error message:
"pa.zstd : Decoding error (36) : Restored data doesn't match checksum"
The decompressed file has the correct size, but corrupted contents.
zstd: Version 1.4.0, built from source using bundled Makefile
OS: Ubuntu 18.04.1 LTS (64-bit)
gcc: 7.4.0
CPU: dual Xeon E5-2643v3 (hyperthreading is off)
RAM: 128 GB DDR4-2133 ECC Registered
Test data size: 13,409,043,938 bytes
Test data: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/067/695/GCA_900067695.1_Pabies01/GCA_900067695.1_Pabies01_genomic.fna.gz
(Un-gzipped and renamed into "pa.fna")
This experiment was for the Sequence Compression Benchmark ( http://kirr.dyndns.org/sequence-compression-benchmark/ ). zstd works fine with the default levels. With the above options it also worked fine for smaller files, producing slightly stronger compression than plain "--ultra -22".
I'd like to include a stronger compression mode in the benchmark, but only if it works on inputs of all sizes.
Let me know if you need any additional information or help with testing the fix.
Thanks for the report @KirillKryukov,
We recently received a similar report associated with usage of wlog=31.
This is a fairly rare setting, which is probably not well tested.
We mostly need a reproducible test case to investigate.
Unfortunately, the proposed links are currently down for me (both http and ftp).
If we can get access to the source file, this will be very useful to reproduce the scenario, understand what's going wrong, and ship a fix.
Link works for me. But OK, I'll mirror the file on my server. Please wait a bit.
Please try this one: http://kirill.med.u-tokai.ac.jp/data/temp/Picea%20abies%20[genbank%20GCA_5F900067695.1%202016-11-09].fna.gz
(I also checked the original FTP link again - still working fine for me).
Thanks @KirillKryukov,
I think the initial error was related to a security policy.
I'm now using a different system, which can access both sources.
Downloading currently...
Thanks for the repro case!
@KirillKryukov I've been able to reproduce the bug.
I've found a bug which I believe is the one, but I won't be able to confirm until tomorrow, when my compression run with the bug fix finishes.
A fix is to load the dictionary in chunks of at most ZSTD_CHUNKSIZE_MAX (~512 MB), and do overflow correction between chunks when needed. I'll put up a pull request soon.
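To make the scale concrete, here is a small sketch of the chunking arithmetic. The 512 MiB constant is an assumption standing in for ZSTD_CHUNKSIZE_MAX (the exact value lives in zstd's source), and the loop-free calculation below is illustrative, not zstd's actual code:

```shell
# Illustrative only: how the reported 13,409,043,938-byte input would be
# split into chunks of at most ~512 MiB (stand-in for ZSTD_CHUNKSIZE_MAX),
# with overflow correction running between consecutive chunks.
CHUNK_MAX=$((512 * 1024 * 1024))   # assumed ~512 MiB
TOTAL=13409043938                  # the test file size from the report
CHUNKS=$(( (TOTAL + CHUNK_MAX - 1) / CHUNK_MAX ))   # ceiling division
echo "$TOTAL bytes -> $CHUNKS chunks of at most $CHUNK_MAX bytes"
```

So a file of this size would be loaded in 25 chunks rather than in one pass.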
Note that this bug will only happen when all of the following conditions apply:
- You are compressing more data than ZSTD_CHUNKSIZE_MAX.
- You have a large enough window log (wlog=31 & strategy >= btopt, or wlog >= 30 & strategy == btultra2), or are explicitly setting the overlap log with a large enough window log.
- You are using multithreading (default on the CLI).
Thanks @terrelln for the quick work and updates!
Now I am confused. In my benchmark, 4-thread zstd compresses ~4 times faster than the default: Compression speed of zstd settings on spruce genome.
The -4t settings have -T4 in the command line, the -1t settings don't (default): Used commands.
Therefore, it seems that the default zstd CLI does not use multithreading. What am I missing?
Second question: Your bug description sounds rather general. Do you think it occurs for all inputs above certain size (when the same options are used), or only for "unlucky" inputs?
By default we use 2 threads, one for IO and one for compression, which triggers the same code path as any number of threads.
This bug isn't specific to particular inputs, except that they need to be many GB in size. But it will only be triggered by a specific set of parameters.
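Until the fix ships, one possible workaround (my reading of the conditions above, not an official recommendation) is to avoid the multithreaded code path with the real zstd CLI flag --single-thread, or to stay below the triggering window log. Sketched here on a tiny generated file so it runs quickly; the original report used the 13 GB pa.fna:

```shell
# Workaround sketch (assumes zstd is on PATH; tiny.fna stands in for pa.fna).
printf 'ACGT%.0s' $(seq 1 1000) > tiny.fna   # 4000-byte dummy input
# Avoid the multithreaded code path entirely:
zstd --ultra -22 --single-thread -f -c tiny.fna > tiny.zst
# Round-trip check: decompressed output must match the input.
zstd -d -c tiny.zst | cmp - tiny.fna && echo OK
rm -f tiny.fna tiny.zst
```

Alternatively, a smaller window log (e.g. --zstd=wlog=27) should sidestep the conditions listed above, at some cost in ratio on multi-GB inputs.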
I see. Thanks & good luck with the fix.