Zstd: Failing regression test

Created on 31 Mar 2018  路  12Comments  路  Source: facebook/zstd

OpenBSD fails on a specific test (v1.3.4, make check) with:

- Compress without dictID
tmp                  : 21.55%   ( 42430 =>   9142 bytes, tmp.zst)
tmp.zst : Decoding error (36) : Restored data doesn't match checksum

The test itself does something strange; it uses different dictionaries for compressing, and uncompressing.

https://github.com/facebook/zstd/blob/7188862d325ffb1c83b5495fa2da5ab1ad35b778/tests/playTests.sh#L373-L376

Concerning --no-dictID the manual states:

       do not store dictionary ID within frame header (dictionary
       compression). The decoder will have to rely on implicit
       knowledge about which dictionary to use, it won麓t be able to
       check if it麓s correct.

Taking the above into account the observed behaviour of failing this test seems to be expected. Btw, I think the manpage contains a typo: --nodictID should be --no-dictID.

If I change the test to use the same dictionary (like below) the test runs successfully.

$ZSTD -f tmp -D tmpDict1 --no-dictID 
$ZSTD -d tmp.zst -D tmpDict1 -fo result 

Now my question: Am I misinterpreting the test?

Most helpful comment

Thanks @bket, that helps a lot, I think I found the culprit! You are using OpenBSD's libc, not glibc right?

In glibc, qsort() is actually a merge sort, so it is stable, but it seems like qsort() in OpenBSD isn't stable. If so, I think applying this patch will fix it:

diff --git i/lib/dictBuilder/cover.c w/lib/dictBuilder/cover.c
index b5a3957a..2b60aa1a 100644
--- i/lib/dictBuilder/cover.c
+++ w/lib/dictBuilder/cover.c
@@ -583,7 +583,7 @@ static int COVER_ctx_init(COVER_ctx_t *ctx, const void *samplesBuffer,
     }
     /* qsort doesn't take an opaque pointer, so pass as a global */
     g_ctx = ctx;
-    qsort(ctx->suffix, ctx->suffixSize, sizeof(U32),
+    mergesort(ctx->suffix, ctx->suffixSize, sizeof(U32),
           (ctx->d <= 8 ? &COVER_strict_cmp8 : &COVER_strict_cmp));
   }
   DISPLAYLEVEL(2, "Computing frequencies\n");

If it works, we'll have to gate it behind a macro in the actual fix. I'll follow up with a custom sort implementation, because we can do a lot better than qsort().

All 12 comments

The two dictionaries should be the same, except for bytes [4, 8), which have different dictionary IDs. I suspect that they are different, can you verify with something like sha1sum <(tail -c +9 tmpDict) <(tail -c +9 tmpDict1)? I suspect that the order of the *.c ../programs/*.c isn't deterministic on OpenBSD. If that isn't the case, I suspect there is some non-determinism somewhere else.

I suspect that they are different

Your suspicion seems correct, the checksums are different.

The fix should be as simple as (approximately)

dictFiles="*.c ../programs/*.c"
$ZSTD --train $dictFiles -o tmpDict
$ZSTD --train $dictFiles --dictID=1 -o tmpDict1

@bket are you planning on submitting a PR to fix this? If not I will, but I can't test if it worked or not.

I did some testing using the proposed fix:

--- tests/playTests.sh
+++ tests/playTests.sh
@@ -371,6 +371,9 @@ $ZSTD --train *.c ../programs/*.c -o tmpDict3 --maxdict=1K -v
 $ECHO "- Create dictionary with wrong parameter order (must fail)"
 $ZSTD --train *.c ../programs/*.c -o tmpDict3 --maxdict -v 4K && die "wrong order : --maxdict must be followed by argument "
 $ECHO "- Compress without dictID"
+dictFiles="*.c ../programs/*.c"
+$ZSTD --train $dictFiles -o tmpDict
+$ZSTD --train $dictFiles --dictID=1 -o tmpDict1
 $ZSTD -f tmp -D tmpDict1 --no-dictID
 $ZSTD -d tmp.zst -D tmpDict -fo result
 $DIFF $TESTFILE result

6 out of 10 test runs ran successfully. The other 4 failed with the same error as mentioned in my first post. Interesting to know is that I did another series of runs - accidentally without the above diff - with about the same success rate. I'm confident that the issues is not related to non-deterministic ordering of *.c ../programs/*.c.
So of the 20 runs half failed on the same test.

I'm not sure if it is relevant but I had a look at the output of diff -au tmpDict tmpDict1 of several runs. The readable bits all look the same. I've put an example of tmpDict and tmpDict1 of a failed run online.

I'm able to reproduce the above on another OpenBSD (amd64) box. It's an interesting interaction between OpenBSD and zstd....

Oops, thats my mistake, the shell doesn't expand the glob pattern until it is used, even if stored in a variable.

Can you see if the glob is actually non-deterministic? If it is, something like dictFiles=$(echo *.c ../programs/*.c) certainly works, but there might be a better way.

Thanks for looking into this!

Running with dictFiles=$(echo *.c ../programs/*.c) gives the same result: Failing 'Compress without dictID'. Success / failure rate is comparable to the previous series of runs.

I have a strong feeling that the non-determinism is in the file loading. If it were in the algorithm, I don't see why it would show up in OpenBSD and not Linux or OS X. I ran the command with MSAN, ASAN, and UBSAN, just in case, and everything looked good, and I wasn't able to reproduce any determinism.

Lets print out the exact files loaded by the trainer, so we can tell for certain if the problem is in the file loading or the algorithm. Can you run with this patch, and make sure the files are loaded in the same order on every run? I should have recommended this first, since the latency is so high, but oh well.

diff --git i/programs/dibio.c w/programs/dibio.c
index 112259dd..6b88b534 100644
--- i/programs/dibio.c
+++ w/programs/dibio.c
@@ -120,14 +120,16 @@ static unsigned DiB_loadFiles(void* buffer, size_t* bufferSizePtr,
         size_t const maxChunkSize = (size_t)MIN(chunkSize, SAMPLESIZE_MAX);
         U32 cnb;
         FILE* const f = fopen(fileName, "rb");
+        size_t bytes = 0;
         if (f==NULL) EXM_THROW(10, "zstd: dictBuilder: %s %s ", fileName, strerror(errno));
-        DISPLAYUPDATE(2, "Loading %s...       \r", fileName);
+        DISPLAYLEVEL(2, "Loading %s... ", fileName);
         for (cnb=0; cnb<nbChunks; cnb++) {
             size_t const toLoad = (size_t)MIN(maxChunkSize, remainingToLoad);
             if (toLoad > *bufferSizePtr-pos) break;
             {   size_t const readSize = fread(buff+pos, 1, toLoad, f);
                 if (readSize != toLoad) EXM_THROW(11, "Pb reading %s", fileName);
                 pos += readSize;
+                bytes += readSize;
                 sampleSizes[nbLoadedChunks++] = toLoad;
                 remainingToLoad -= targetChunkSize;
                 if (nbLoadedChunks == sstSize) { /* no more space left in sampleSizes table */
@@ -137,6 +139,7 @@ static unsigned DiB_loadFiles(void* buffer, size_t* bufferSizePtr,
                 if (toLoad < targetChunkSize) {
                     fseek(f, (long)(targetChunkSize - toLoad), SEEK_CUR);
         }   }   }
+        DISPLAYLEVEL(2, "%u B\n", (unsigned)bytes);
         fclose(f);
     }
     DISPLAYLEVEL(2, "\r%79s\r", "");

Output from a successful run:

- Compress without dictID
-----------------------------------------------------------------------
checkTag.c datagencli.c decodecorpus.c fullbench.c fuzzer.c invalidDictionaries.c legacy.c longmatch.c paramgrill.c poolTests.c roundTripCrash.c seqgen.c symbols.c zbufftest.c zstreamtest.c ../programs/bench.c ../programs/datagen.c ../programs/dibio.c ../programs/fileio.c ../programs/zstdcli.c
-----------------------------------------------------------------------
!  Warning : data size of samples too small for target dictionary size 
!  Samples should be about 100x larger than target dictionary size 
Loading longmatch.c... 3123 B
Loading decodecorpus.c... 68953 B
Loading ../programs/dibio.c... 14592 B
Loading roundTripCrash.c... 7897 B
Loading fullbench.c... 22929 B
Loading datagencli.c... 4523 B
Loading poolTests.c... 2755 B
Loading symbols.c... 4064 B
Loading ../programs/zstdcli.c... 42430 B
Loading legacy.c... 9555 B
Loading seqgen.c... 8031 B
Loading checkTag.c... 2008 B
Loading ../programs/fileio.c... 81412 B
Loading invalidDictionaries.c... 2106 B
Loading ../programs/datagen.c... 5995 B
Loading fuzzer.c... 86633 B
Loading zstreamtest.c... 97652 B
Loading zbufftest.c... 24613 B
Loading paramgrill.c... 38977 B
Loading ../programs/bench.c... 30128 B
^M                                                                               ^MTrying 5 different sets of parameters
^M20%       ^M                                                                               ^Mk=50
d=8
steps=4
Save dictionary of size 36265 into file tmpDict 
!  Warning : data size of samples too small for target dictionary size 
!  Samples should be about 100x larger than target dictionary size 
Loading longmatch.c... 3123 B
Loading decodecorpus.c... 68953 B
Loading ../programs/dibio.c... 14592 B
Loading roundTripCrash.c... 7897 B
Loading fullbench.c... 22929 B
Loading datagencli.c... 4523 B
Loading poolTests.c... 2755 B
Loading symbols.c... 4064 B
Loading ../programs/zstdcli.c... 42430 B
Loading legacy.c... 9555 B
Loading seqgen.c... 8031 B
Loading checkTag.c... 2008 B
Loading ../programs/fileio.c... 81412 B
Loading invalidDictionaries.c... 2106 B
Loading ../programs/datagen.c... 5995 B
Loading fuzzer.c... 86633 B
Loading zstreamtest.c... 97652 B
Loading zbufftest.c... 24613 B
Loading paramgrill.c... 38977 B
Loading ../programs/bench.c... 30128 B
^M                                                                               ^MTrying 5 different sets of parameters
^M20%       ^M                                                                               ^Mk=50
d=8
steps=4
Save dictionary of size 36265 into file tmpDict1 
^M                                                                               ^Mtmp                  : 17.49%   ( 42430 =>   7419 bytes, tmp.zst) 
^M                                                                               ^Mtmp.zst             : 42430 bytes                                                      

Output from a failed run:

- Compress without dictID
-----------------------------------------------------------------------
checkTag.c datagencli.c decodecorpus.c fullbench.c fuzzer.c invalidDictionaries.c legacy.c longmatch.c paramgrill.c poolTests.c roundTripCrash.c seqgen.c symbols.c zbufftest.c zstreamtest.c ../programs/bench.c ../programs/datagen.c ../programs/dibio.c ../programs/fileio.c ../programs/zstdcli.c
-----------------------------------------------------------------------
!  Warning : data size of samples too small for target dictionary size 
!  Samples should be about 100x larger than target dictionary size 
Loading longmatch.c... 3123 B
Loading decodecorpus.c... 68953 B
Loading ../programs/dibio.c... 14592 B
Loading roundTripCrash.c... 7897 B
Loading fullbench.c... 22929 B
Loading datagencli.c... 4523 B
Loading poolTests.c... 2755 B
Loading symbols.c... 4064 B
Loading ../programs/zstdcli.c... 42430 B
Loading legacy.c... 9555 B
Loading seqgen.c... 8031 B
Loading checkTag.c... 2008 B
Loading ../programs/fileio.c... 81412 B
Loading invalidDictionaries.c... 2106 B
Loading ../programs/datagen.c... 5995 B
Loading fuzzer.c... 86633 B
Loading zstreamtest.c... 97652 B
Loading zbufftest.c... 24613 B
Loading paramgrill.c... 38977 B
Loading ../programs/bench.c... 30128 B
^M                                                                               ^MTrying 5 different sets of parameters
^M20%       ^M                                                                               ^Mk=50
d=8
steps=4
Save dictionary of size 36265 into file tmpDict 
!  Warning : data size of samples too small for target dictionary size 
!  Samples should be about 100x larger than target dictionary size 
Loading longmatch.c... 3123 B
Loading decodecorpus.c... 68953 B
Loading ../programs/dibio.c... 14592 B
Loading roundTripCrash.c... 7897 B
Loading fullbench.c... 22929 B
Loading datagencli.c... 4523 B
Loading poolTests.c... 2755 B                                                                                                                                            
Loading symbols.c... 4064 B                                                                                                                                              
Loading ../programs/zstdcli.c... 42430 B                                                                                                                                 
Loading legacy.c... 9555 B                                                                                                                                               
Loading seqgen.c... 8031 B                                                                                                                                               
Loading checkTag.c... 2008 B                                                                                                                                             
Loading ../programs/fileio.c... 81412 B                                                                                                                                  
Loading invalidDictionaries.c... 2106 B                                                                                                                                  
Loading ../programs/datagen.c... 5995 B                                                                                                                                  
Loading fuzzer.c... 86633 B                                                                                                                                              
Loading zstreamtest.c... 97652 B                                                                                                                                         
Loading zbufftest.c... 24613 B                                                                                                                                           
Loading paramgrill.c... 38977 B                                                                                                                                          
Loading ../programs/bench.c... 30128 B                                                                                                                                   
^M                                                                               ^MTrying 5 different sets of parameters                                                 
^M20%       ^M                                                                               ^Mk=50                                                                      
d=8                                                                                                                                                                      
steps=4                                                                                                                                                                  
Save dictionary of size 36258 into file tmpDict1                                                                                                                         
^M                                                                               ^Mtmp                  : 17.41%   ( 42430 =>   7386 bytes, tmp.zst)                     
tmp.zst : Decoding error (36) : Restored data doesn't match checksum                                                                                                     

Output from diff -up good bad suggest that the only difference between the two runs is that a failed run has a smaller dictionary size?

--- good        Tue Apr  3 21:54:18 2018                                                                                                                                 
+++ bad Tue Apr  3 21:56:01 2018
@@ -55,6 +55,6 @@ Loading ../programs/bench.c... 30128 B
k=50                                                                           
 d=8
 steps=4
-Save dictionary of size 36265 into file tmpDict1 
tmp                  : 17.49%   ( 42430 =>   7419 bytes, tmp.zst)              
tmp.zst             : 42430 bytes                                              
+Save dictionary of size 36258 into file tmpDict1 
tmp                  : 17.41%   ( 42430 =>   7386 bytes, tmp.zst)              
+tmp.zst : Decoding error (36) : Restored data doesn't match checksum 

For reference purposes, I ran the tests with the diff below:

--- tests/playTests.sh.orig
+++ tests/playTests.sh
@@ -371,6 +372,12 @@ $ZSTD --train *.c ../programs/*.c -o tmpDict3 --maxdic
 $ECHO "- Create dictionary with wrong parameter order (must fail)"
 $ZSTD --train *.c ../programs/*.c -o tmpDict3 --maxdict -v 4K && die "wrong order : --maxdict must be followed by argument "
 $ECHO "- Compress without dictID"
+dictFiles=$(echo *.c ../programs/*.c)
+$ECHO "-----------------------------------------------------------------------"
+$ECHO $dictFiles
+$ECHO "-----------------------------------------------------------------------"
+$ZSTD --train $dictFiles -o tmpDict
+$ZSTD --train $dictFiles --dictID=1 -o tmpDict1
 $ZSTD -f tmp -D tmpDict1 --no-dictID
 $ZSTD -d tmp.zst -D tmpDict -fo result
 $DIFF $TESTFILE result

Thanks @bket, that helps a lot, I think I found the culprit! You are using OpenBSD's libc, not glibc right?

In glibc, qsort() is actually a merge sort, so it is stable, but it seems like qsort() in OpenBSD isn't stable. If so, I think applying this patch will fix it:

diff --git i/lib/dictBuilder/cover.c w/lib/dictBuilder/cover.c
index b5a3957a..2b60aa1a 100644
--- i/lib/dictBuilder/cover.c
+++ w/lib/dictBuilder/cover.c
@@ -583,7 +583,7 @@ static int COVER_ctx_init(COVER_ctx_t *ctx, const void *samplesBuffer,
     }
     /* qsort doesn't take an opaque pointer, so pass as a global */
     g_ctx = ctx;
-    qsort(ctx->suffix, ctx->suffixSize, sizeof(U32),
+    mergesort(ctx->suffix, ctx->suffixSize, sizeof(U32),
           (ctx->d <= 8 ? &COVER_strict_cmp8 : &COVER_strict_cmp));
   }
   DISPLAYLEVEL(2, "Computing frequencies\n");

If it works, we'll have to gate it behind a macro in the actual fix. I'll follow up with a custom sort implementation, because we can do a lot better than qsort().

Thank you @terrelln for your quick response!

You are using OpenBSD's libc, not glibc right?

Right!

If so, I think applying this patch will fix it:

I ran a series of tests using your last diff (for lib/dictBuilder/cover.c), which were all successful. On a second machine I did a couple of tests using only your last diff, excluding all other diffs (for programs/dibio.c, and tests/playTests). Also here all tests succeeded.

It seems that you have identified the issue, and solved it!

If it works, we'll have to gate it behind a macro in the actual fix.

Alternatively:

--- lib/dictBuilder/cover.c.orig
+++ lib/dictBuilder/cover.c
@@ -583,8 +583,13 @@ static int COVER_ctx_init(COVER_ctx_t *ctx, const void
     }
     /* qsort doesn't take an opaque pointer, so pass as a global */
     g_ctx = ctx;
+#if defined(__OpenBSD__)
+    mergesort(ctx->suffix, ctx->suffixSize, sizeof(U32),
+          (ctx->d <= 8 ? &COVER_strict_cmp8 : &COVER_strict_cmp));
+#else
     qsort(ctx->suffix, ctx->suffixSize, sizeof(U32),
           (ctx->d <= 8 ? &COVER_strict_cmp8 : &COVER_strict_cmp));
+#endif
   }
   DISPLAYLEVEL(2, "Computing frequencies\n");
   /* For each dmer group (group of positions with the same first d bytes):

Issue has been resolved. Thanks!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

planet36 picture planet36  路  3Comments

vade picture vade  路  3Comments

itsnotvalid picture itsnotvalid  路  3Comments

TheSil picture TheSil  路  3Comments

animalize picture animalize  路  3Comments