I tried to compress a file with zstd -22 --ultra --long=31 and then decompress it with zstd -d --long=31 and got the following error:
tmp.tar.zst : 6031 MB... tmp.tar.zst : Decoding error (36) : Corrupted block detected
Is this a known issue? Just checking before I try to create a repro and/or debug this since it takes about 2 hours to compress the file (it's ~7GB), so it may take a while to reproduce the issue. Unfortunately, the file I'm compressing isn't something I can pass around, so I can't just send it to you.
I've tried this with the same file on two different platforms (different hardware and different Linux kernel versions) and have run into the same thing both times. In both cases, the code was built from c0eb6c9c81e2dd31571267b37c12cdaa92b7ffd7 by running make with no extra options specified.
Hi @danluu ,
this seems related to #1653 , which was recently fixed.
Try commit 9038579ab2fcf35ea14da4bb1db72662fd12309e (or latest dev) if possible.
Indeed, testing high compression levels on very large data sets is quite lengthy ...
Thanks for the quick response! I can still get an error with 096714d1b83f2e3950e88bc07f61c4f68f0e238b, although it happens at a slightly different place when decompressing:
$ zstd -d --long=31 tmp.tar.zst
tmp.tar.zst : 6115 MB... tmp.tar.zst : Decoding error (36) : Corrupted block detected
From the linked issue, it sounds like it's possible that it doesn't matter what data is used, and only the size & settings matter? Let me see if I can get a repro with an all-0s file of the same size or a /dev/urandom file of the same size. If that doesn't work, I can try generating some synthetic data that's less unnatural.
I wasn't able to reproduce with a toy file from /dev/*, but the file from #1653 (http://kirill.med.u-tokai.ac.jp/data/temp/Picea%20abies%20[genbank%20GCA_5F900067695.1%202016-11-09].fna.gz) gives me a similar looking error.
Because the file is huge, I compressed with 32 threads, zstd -22 --ultra --long=31 -T32. On decompression (zstd -d --long=31), I get 6178 MB... tmp.fna.zst : Decoding error (36) : Corrupted block detected when using 096714d1b83f2e3950e88bc07f61c4f68f0e238b.
Thanks for the feedback @danluu ,
and sorry for how long the testing takes.
This issue was supposed to be fixed by #1659 .
It was tested with the same fna file, so it's surprising that it's still there.
Maybe the fix doesn't work well enough.
There is also a small remaining possibility that the issue was actually correctly fixed in #1659 (corresponding to commit 9038579ab2fcf35ea14da4bb1db72662fd12309e), but that a later commit introduced a problem with the same symptom (I'm suspecting 857e608b5138dafd579eabfda85803087d658e59, but that's just a wild guess).
To be investigated ...
No problem! Another curious thing is that I also tried single-threaded compression for a few "large" (> 10GB) files and they all hung at 6144MB compressed. One of them was the file linked above and two of them were files full of prime numbers that compressed and decompressed correctly when compressed with -T32.
For all of the above, I built the update with make and didn't do make clean. I've since done a make clean and started all of these again on the off chance it's due to an issue with the Makefile.
That's a very interesting data point, although there is potential for confusion when labelling "single-threaded". It could be either :

- the --single-thread command
- -T1 (which is the same as default)

--single-thread is the "real" single-threaded version. It's equivalent to ZSTD_compressStream(). This version is supposed to be immune to the bug in #1653, and if you nonetheless find that it still fails, then there is something else going on.
-T1 is actually "multi-threaded" by design, and just happens to use only one worker thread. The rest of the machinery works exactly the same as if there were N >= 2 threads, meaning there is a separate I/O thread, and work is split into pre-defined job sizes which are then passed onto workers, which preload data, etc. As a side note, that's why zstd generates exactly the same compressed content whatever the nb of threads (for a given version and compression level).
So -T1 will suffer exactly the same issue as any -T#, while --single-thread should not.
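To make the distinction concrete, here is a minimal sketch of the two setups at the library level (illustrative only; it assumes a build with ZSTD_MULTITHREAD, without which the nbWorkers setting is a no-op):

```c
#include <zstd.h>

/* nbWorkers == 0 : plain streaming path, what --single-thread uses.
 * nbWorkers == 1 : multi-threaded engine with one worker, what -T1 / default uses.
 * Returns 0 on success, or a zstd error code (check with ZSTD_isError()). */
static size_t compress_with_workers(void* dst, size_t dstCap,
                                    const void* src, size_t srcSize,
                                    int nbWorkers)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 22);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31);          /* --long=31 */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nbWorkers);

    ZSTD_outBuffer out = { dst, dstCap, 0 };
    ZSTD_inBuffer  in  = { src, srcSize, 0 };
    size_t remaining;
    do {
        remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
        if (ZSTD_isError(remaining)) break;
    } while (remaining != 0);

    ZSTD_freeCCtx(cctx);
    return remaining;
}
```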
Thanks for the clarification! My previous comment refers to default threading, no T specified.
Yes, so default is equivalent to -T1, which means it uses the multi-thread engine, with only 1 active worker.
That's the setup where we identified the issue with --long=31.
Also, the nature of the bug (presuming it's the same one) is that it overflows internal indexes.
And in order to skip all defenses related to index overflow, there are multiple conditions :
- a very large window size (--long=31)
- overlapLog==9 (automatically the case with highest compression levels), thus triggering a history preload of 2 GB
- re-using an existing compression context (CCtx), where the starting index is already high enough to trigger the overflow

Re-using an existing context is something which is done automatically internally. It will necessarily happen if nb_jobs > nb_workers.
But as a consequence, when specifying a very large number of workers, it can reach a point where nb_workers >= nb_jobs, hence each CCtx is only used once. And therefore, the bug should no longer be triggered.
This refers to your test using -T32 "that compressed and decompressed correctly". I don't remember the job size with the --long=31 setting, but I would hazard that it's probably 2 GB. Thus if the file is < 64 GB, -T32 will create more workers than necessary, and each worker will use a brand new CCtx, thus circumventing the problematic re-usage pattern.
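To illustrate that re-usage pattern (a rough sketch of the conditions listed above, not the actual ZSTDMT internals):

```c
#include <zstd.h>

/* Rough sketch of the triggering conditions, NOT the real ZSTDMT code :
 * one CCtx re-used across several large jobs. zstd deliberately lets
 * internal indexes keep growing across uses of the same context (to
 * avoid re-zeroing its tables), so later uses start from an index that
 * is already high -- the overflow scenario described above. */
static void reuse_one_cctx(const void* jobs[], const size_t jobSizes[],
                           int nbJobs, void* dst, size_t dstCap)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31);    /* --long=31 */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_overlapLog, 9);    /* 2 GB history preload */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 22);

    for (int i = 0; i < nbJobs; i++) {
        /* same context every iteration, as happens internally
         * whenever nb_jobs > nb_workers */
        ZSTD_compress2(cctx, dst, dstCap, jobs[i], jobSizes[i]);
    }
    ZSTD_freeCCtx(cctx);
}
```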
Finally, I should have mentioned it earlier :
zstd can also be compiled with all its assert() turned on, so that it would ideally abort during compression, hence much closer to the effective fault point.
This is done by using the build macro DEBUGLEVEL.
make V=1 clean zstd MOREFLAGS=-DDEBUGLEVEL=1
is one way to produce such a binary.
Thanks, I'll try that in the future!
I was able to reproduce both the failure on decompression and the hang on compression after 6144MB with default threading / -T1 with a clean build.
On the hangs, gdb shows them hanging at
(gdb) backtrace
#0 0x00007f756d4e8965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000046b5a5 in ZSTDMT_waitForLdmComplete ()
#2 0x000000000046b73a in ZSTDMT_tryGetInputRange ()
#3 0x00000000004711cc in ZSTDMT_compressStream_generic ()
#4 0x000000000041ee4b in ZSTD_compressStream2.part.29 ()
#5 0x00000000004b9ec5 in FIO_compressFilename_srcFile ()
#6 0x00000000004bc494 in FIO_compressMultipleFilenames ()
#7 0x00000000004067a1 in main ()
The other spawned process looks less interesting?
#0 0x00007f756d4e8965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000040a04b in POOL_thread ()
#2 0x00007f756d4e4dd5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f756d20dead in clone () from /lib64/libc.so.6
Thanks @danluu !
That's very helpful, it gives a clear hint of where to look !
Indeed, #1659 cleared the issue for the main compression path, but ldm has a separate index data structure, and may well be prone to the same problem.
I've had this happen for multiple different files when using -T0 --long=31 --ultra -22, and I'd say it's a very severe error because it silently corrupts your data, making it impossible to recover - thankfully I didn't yet trust zstd enough, and tried decompressing before deleting the original files...
Also, it's really not a niche use case, because it happened to me with at least three different files, and the above flag combination is simply the strongest compression level you can easily find in the help. (As in, in my opinion you should release this as a hotfix as soon as possible and mark it as the severe problem it is, so people can verify whether they unknowingly corrupted their data.)
This is an important bug to fix, and I'll look into it this week. If you replace --long=31 with --zstd=wlog=31 you will probably work around the bug.
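For context, my reading of the difference between those two spellings at the API level (a sketch; the helper name is mine):

```c
#include <zstd.h>

/* Sketch of my understanding of the two flag spellings :
 * --long=31      -> large window *plus* long-distance matching,
 *                   which exercises the ldm path implicated above;
 * --zstd=wlog=31 -> large window only, LDM left off,
 *                   hence the suggested workaround. */
static void set_window_mode(ZSTD_CCtx* cctx, int enableLdm)
{
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, enableLdm);
}
```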
I'd argue it is a niche use case since it only occurs with large data on the highest settings, and most users won't spend hours compressing their data. However, just because it is a niche use case doesn't mean it isn't worthy of a quick fix! We take any data corruption bugs extremely seriously.
I've been able to reproduce both the hang and the decoding error on the file pa.fna with sha1sum 4969c9cfc8880fdacfe9ff7d39a12ec2a69da047.
Decoding error:
./zstd --long=31 -22 --ultra -T0 -v < pa.fna -o pa.fna.zst
Hang:
./zstd --long=31 -22 --ultra -v < pa.fna -o pa.fna.zst
I suspect that the decoding error and hang are closely related.
I was not able to reproduce it at level 16, so it might have something to do with btultra2.
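(One way to test that hypothesis, hypothetical and not something run in this thread, would be to pin the strategy independently of the level via the advanced API:)

```c
#include <zstd.h>

/* Hypothetical probe : force btultra2 at a lower compression level,
 * to separate "level >= 20" from "strategy == btultra2" as the
 * triggering factor. */
static void pin_btultra2(ZSTD_CCtx* cctx)
{
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 16);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_strategy, ZSTD_btultra2);
}
```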
I've reproduced corruption and a hang much faster with this command:
./zstd --zstd=clog=27,hlog=10,slog=1,mml=7,tlen=7 --long=31 -1 -v < pa.fna -o pa.fna.zst
I believe I have a fix as well. I will test it with the longer-running commands and put up a PR. If you all could help me verify that the PR fixes it for you, it would be greatly appreciated.
I've put up a PR; if you have time to test that it fixes your problems, and that no asserts fail when built with make clean zstd MOREFLAGS="-DDEBUGLEVEL=1 -g", it would help a lot! It certainly fixes a bug, but I want to be sure it is the only one you have run into.
Thanks for the quick fix! I tried it on the internal dataset I originally saw data corruption with as well as some others and it worked fine on those. I'm also running on some larger datasets and will get back to you if any of them fail.
I'm using -22 --ultra --long=31 to compress and -d --long=31 to decompress and built with @Cyan4973 's suggestion, above (make V=1 clean zstd MOREFLAGS=-DDEBUGLEVEL=1).
The issue should be fixed, but please reopen the issue if this doesn't fix all the problems.
We are planning on releasing a new version of zstd with the fix in about a week.