Using command-line zstd, experimenting with maximum compression options. Everything is fine until I get to a 13 GB file. Command line:
zstd --ultra -22 --zstd=wlog=31,clog=30,hlog=30,slog=26 -c
Decompression prints this error message:
"pa.zstd : Decoding error (36) : Restored data doesn't match checksum"
The decompressed file has the correct size, but corrupted contents.
zstd: Version 1.4.0, built from source using bundled Makefile
OS: Ubuntu 18.04.1 LTS (64-bit)
gcc: 7.4.0
CPU: dual Xeon E5-2643v3 (hyperthreading is off)
RAM: 128 GB DDR4-2133 ECC Registered
Test data size: 13,409,043,938 bytes
Test data: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/067/695/GCA_900067695.1_Pabies01/GCA_900067695.1_Pabies01_genomic.fna.gz
(Un-gzipped and renamed into "pa.fna")
This experiment was for the Sequence Compression Benchmark ( http://kirr.dyndns.org/sequence-compression-benchmark/ ). zstd works fine with the default levels. With the above options it also worked fine for smaller files, producing slightly stronger compression than plain "--ultra -22".
I'd like to include a stronger compression mode in the benchmark, but only if it works on inputs of all sizes.
Let me know if you need any additional information or help with testing the fix.
Thanks for the report @KirillKryukov,
We recently received a similar report associated with usage of wlog=31.
This is a fairly rare setting, which is probably not well tested.
We mostly need a reproducible test case to investigate.
Unfortunately, the proposed links are currently down for me (both http and ftp).
If we can get access to the source file, this will be very useful to reproduce the scenario, understand what's going wrong, and ship a fix.
Link works for me. But OK, I'll mirror the file on my server. Please wait a bit.
Please try this one: http://kirill.med.u-tokai.ac.jp/data/temp/Picea%20abies%20[genbank%20GCA_5F900067695.1%202016-11-09].fna.gz
(I also checked the original FTP link again - still working fine for me).
Thanks @KirillKryukov,
I think the initial error was related to a security policy.
I'm now using a different system, which can access both sources.
Downloading currently...
Thanks for the repro case!
@KirillKryukov I've been able to reproduce the bug.
I've found a bug which I believe is the one, but I won't be able to confirm until tomorrow, when my compression run with the bug fix finishes.
A fix is to load the dictionary in chunks of at most ZSTD_CHUNKSIZE_MAX (~512 MB), and do overflow correction between chunks when needed. I'll put up a pull request soon.
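To make the scale concrete, here is a small sketch of the chunking arithmetic. The 512 MiB constant is an assumption standing in for ZSTD_CHUNKSIZE_MAX (the exact value lives in zstd's source), and the loop-free calculation below is illustrative, not zstd's actual code:

```shell
# Illustrative only: how the reported 13,409,043,938-byte input would be
# split into chunks of at most ~512 MiB (stand-in for ZSTD_CHUNKSIZE_MAX),
# with overflow correction running between consecutive chunks.
CHUNK_MAX=$((512 * 1024 * 1024))   # assumed ~512 MiB
TOTAL=13409043938                  # the test file size from the report
CHUNKS=$(( (TOTAL + CHUNK_MAX - 1) / CHUNK_MAX ))   # ceiling division
echo "$TOTAL bytes -> $CHUNKS chunks of at most $CHUNK_MAX bytes"
```

So a file of this size would be loaded in 25 chunks rather than in one pass.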
Note that this bug will only happen when all of the following conditions apply:
- You are compressing more data than ZSTD_CHUNKSIZE_MAX.
- You have a large enough window log (wlog=31 & strategy >= btopt, or wlog >= 30 & strategy == btultra2), or are explicitly setting the overlap log with a large enough window log.
- You are using multithreading (default on the CLI).
Thanks @terrelln for the quick work and updates!
Now I am confused. In my benchmark, 4-thread zstd compresses ~4 times faster than the default: Compression speed of zstd settings on spruce genome.
The -4t settings have -T4 in the command line, the -1t settings don't (default): Used commands.
Therefore, it seems that the default zstd CLI does not use multithreading. What am I missing?
Second question: Your bug description sounds rather general. Do you think it occurs for all inputs above certain size (when the same options are used), or only for "unlucky" inputs?
By default we use 2 threads, one for IO and one for compression, which triggers the same code path as any number of threads.
This bug isn't specific to particular inputs, except that they need to be many GB in size. But it will only be triggered by a specific set of parameters.
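Until the fix ships, one possible workaround (my reading of the conditions above, not an official recommendation) is to avoid the multithreaded code path with the real zstd CLI flag --single-thread, or to stay below the triggering window log. Sketched here on a tiny generated file so it runs quickly; the original report used the 13 GB pa.fna:

```shell
# Workaround sketch (assumes zstd is on PATH; tiny.fna stands in for pa.fna).
printf 'ACGT%.0s' $(seq 1 1000) > tiny.fna   # 4000-byte dummy input
# Avoid the multithreaded code path entirely:
zstd --ultra -22 --single-thread -f -c tiny.fna > tiny.zst
# Round-trip check: decompressed output must match the input.
zstd -d -c tiny.zst | cmp - tiny.fna && echo OK
rm -f tiny.fna tiny.zst
```

Alternatively, a smaller window log (e.g. --zstd=wlog=27) should sidestep the conditions listed above, at some cost in ratio on multi-GB inputs.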
I see. Thanks & good luck with the fix.