I just found a regression when using dictionaries with zstd 1.3.3 and c-blosc2. Here you can see the performance with zstd 1.3.0 and c-blosc2:
$ tests/test_dict_schunk
STARTING TESTS for tests/test_dict_schunk
[blocksize: 1 KB] cratio w/o dict: 3.5x (compr @ 554.0 MB/s, decompr @ 6446.3 MB/s)
.[blocksize: 1 KB] cratio with dict: 22.0x (compr @ 600.2 MB/s, decompr @ 5795.8 MB/s)
.[blocksize: 4 KB] cratio w/o dict: 13.2x (compr @ 1041.9 MB/s, decompr @ 15021.5 MB/s)
.[blocksize: 4 KB] cratio with dict: 68.3x (compr @ 888.7 MB/s, decompr @ 9891.1 MB/s)
.[blocksize: 32 KB] cratio w/o dict: 76.8x (compr @ 2434.2 MB/s, decompr @ 13590.0 MB/s)
.[blocksize: 32 KB] cratio with dict: 111.7x (compr @ 1616.7 MB/s, decompr @ 15001.4 MB/s)
.[blocksize: 256 KB] cratio w/o dict: 283.9x (compr @ 1997.7 MB/s, decompr @ 12209.5 MB/s)
.[blocksize: 256 KB] cratio with dict: 263.6x (compr @ 509.3 MB/s, decompr @ 11138.2 MB/s)
.[blocksize: automatic] cratio w/o dict: 283.9x (compr @ 2047.2 MB/s, decompr @ 12706.5 MB/s)
.[blocksize: automatic] cratio with dict: 263.6x (compr @ 507.8 MB/s, decompr @ 11844.3 MB/s)
. ALL TESTS PASSED Tests run: 10
and here right after upgrading to zstd 1.3.3:
STARTING TESTS for tests/test_dict_schunk
[blocksize: 1 KB] cratio w/o dict: 3.5x (compr @ 555.9 MB/s, decompr @ 10153.1 MB/s)
.[blocksize: 1 KB] cratio with dict: 22.3x (compr @ 620.6 MB/s, decompr @ 9059.2 MB/s)
.[blocksize: 4 KB] cratio w/o dict: 13.2x (compr @ 845.3 MB/s, decompr @ 14809.6 MB/s)
.[blocksize: 4 KB] cratio with dict: 45.5x (compr @ 2088.3 MB/s, decompr @ 12859.8 MB/s)
.[blocksize: 32 KB] cratio w/o dict: 76.8x (compr @ 2176.4 MB/s, decompr @ 15054.8 MB/s)
.[blocksize: 32 KB] cratio with dict: 202.5x (compr @ 5312.4 MB/s, decompr @ 14935.7 MB/s)
.[blocksize: 256 KB] cratio w/o dict: 283.9x (compr @ 1721.9 MB/s, decompr @ 12034.4 MB/s)
.[blocksize: 256 KB] cratio with dict: 49.6x (compr @ 4148.2 MB/s, decompr @ 11989.9 MB/s)
F (ERROR: Dict does not reach expected compression ratio)
Tests run: 8
You can see how the compression ratio for 256 KB blocksizes using dicts decreases quite a lot (to the point that the c-blosc2 test suite decides it is no longer acceptable).
You can reproduce this easily by downloading the c-blosc2 library and running the test. Here are quick instructions on how to do this for Unix:
$ git clone https://github.com/Blosc/c-blosc2
$ cd c-blosc2
$ mkdir build; cd build
$ cmake ..
$ tests/test_dict_schunk
Thanks for the notification @FrancescAlted !
I'll look into it.
A few preliminary questions:
What is the zstd compression level? Is it 1 for all tests?
The zstd compression level for all these tests is 9 (corresponding to c-blosc2 level 5).
Is the dictionary exactly the same for both versions ? Or is the one used in test v1.3.0 generated from libzstd v1.3.0, and the one in test v1.3.3 generated from libzstd v1.3.3 ?
Yes, the dict is generated on the fly, so for the 1.3.0 test the zstd 1.3.0 dict was used, whereas for the 1.3.3 test the zstd 1.3.3 dict was used instead.
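For reference, on-the-fly training goes through `ZDICT_trainFromBuffer`. Here is a minimal self-contained sketch, not the actual c-blosc2 code: the sample count, sizes, and contents are made up, and training can fail on degenerate samples, hence the error check.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zdict.h>   /* ZDICT_trainFromBuffer */

int main(void) {
  /* Samples are laid out back to back in one buffer, with one size entry
     per sample; c-blosc2 does something similar with one sample per block. */
  enum { NB_SAMPLES = 196, SAMPLE_SIZE = 417, DICT_MAXSIZE = 25000 };
  size_t samples_sizes[NB_SAMPLES];
  char* samples = malloc((size_t)NB_SAMPLES * SAMPLE_SIZE);
  for (int i = 0; i < NB_SAMPLES; i++) {
    samples_sizes[i] = SAMPLE_SIZE;
    memset(samples + (size_t)i * SAMPLE_SIZE, 'a' + (i % 4), SAMPLE_SIZE);
  }
  void* dict = malloc(DICT_MAXSIZE);
  size_t dict_size = ZDICT_trainFromBuffer(dict, DICT_MAXSIZE,
                                           samples, samples_sizes, NB_SAMPLES);
  if (ZDICT_isError(dict_size))
    fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dict_size));
  else
    printf("trained a %zu-byte dictionary\n", dict_size);
  free(dict);
  free(samples);
  return 0;
}
```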
Is it possible to check whether the regression happens in an intermediate version, such as v1.3.1 or v1.3.2?
It is possible, yes, but it may take a while. In case you want to make progress by yourself, c-blosc2 can easily be compiled against an external zstd library, so you can experiment more easily:
$ cmake -DPREFER_EXTERNAL_ZSTD=ON ..
This will try to link c-blosc against an external libzstd in your system (typically in /usr/local or /usr).
Thanks.
For information,
I just blindly followed the above instructions, and the outcome is:
$ cmake ..
$ tests/test_dict_schunk
-bash: tests/test_dict_schunk: No such file or directory
Oops, I forgot a `make` before running the test.
Thanks, I can now observe c-blosc2 compression results.
To test different zstd versions, I tried the configuration trick mentioned above : cmake -DPREFER_EXTERNAL_ZSTD=ON.
But it doesn't work: the link stage fails on lizard, as it's apparently expecting to use some files from zstd.
_edit_: it's failing to find HUF_* and XXH_* symbols, even though the relevant source files are located in the corresponding sub-directories. For some reason, the lizard link stage now fails all the time, whatever I change in the configuration. Even a make clean cannot recover, so I have to git clone again.
Quick question :
I noticed that, using same binary, compression ratio with dictionary enabled can vary between successive runs. Is that expected ?
.[blocksize: 4 KB] cratio with dict: 42.1x (compr @ 544.1 MB/s, decompr @ 4585.0 MB/s)
...
.[blocksize: 4 KB] cratio with dict: 53.0x (compr @ 484.4 MB/s, decompr @ 5560.4 MB/s)
...
.[blocksize: 4 KB] cratio with dict: 43.3x (compr @ 536.4 MB/s, decompr @ 4351.2 MB/s)
...
.[blocksize: 4 KB] cratio with dict: 60.8x (compr @ 476.7 MB/s, decompr @ 5817.7 MB/s)
Yeah, I noticed this randomness in the compression ratio when using dicts before too. I'd say that blosc should not be responsible for it (e.g. this is not seen when no dicts are used).
I have a few questions about the dictionary construction in Blosc; I'm looking at the code here.
What are the values of nbytes, nblocks, sample_size, and dict_maxsize? If sample_size < dict_maxsize, then this commit might cause the regression in compression ratio, because we don't have an accurate dict size. The fix would be to either provide more samples, or reduce dict_maxsize to sample_size. I would recommend scaling the sample size with the dict size.
Hi @terrelln, here are the values for one run:
nbytes (in chunk), nblocks (per chunk), sample_size (per block), dict_maxsize: 800000, 196, 417, 25000
[blocksize: 4 KB] cratio with dict: 60.8x (compr @ 504.1 MB/s, decompr @ 6122.6 MB/s)
so 81732 bytes (i.e. 196 blocks * 417 bytes per sample).
Whereas in another run:
nbytes (in chunk), nblocks (per chunk), sample_size (per block), dict_maxsize: 800000, 196, 417, 25000
[blocksize: 4 KB] cratio with dict: 46.4x (compr @ 642.8 MB/s, decompr @ 5511.2 MB/s)
So using the same number of blocks and the same sampling size, zstd is giving different compression ratios.
The total sample size is 81732 bytes, which is more than dict_maxsize == 25000. Or maybe I am reading the zstd docs incorrectly and sample_size should be taken as 417?
I'm looking into this regression further. I applied the following patch, and compiled with make -j LDFLAGS=-lxxhash. I found that when running tests/test_dict_schunk, the hash varies from run to run for the first 3 dictionaries, and rarely for the fourth. This means that one source of the non-determinism is the set of samples that Blosc passes to the dictionary builder.
The next step is to determine whether the regression is caused by the dictionary construction, or dictionary compression.
blosc/blosc.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/blosc/blosc.c b/blosc/blosc.c
index dcb6c2a..77d1608 100644
--- a/blosc/blosc.c
+++ b/blosc/blosc.c
@@ -48,6 +48,8 @@
#endif /* HAVE_ZSTD */
+#include <xxhash.h>
+
#if defined(_WIN32) && !defined(__MINGW32__)
#include <windows.h>
@@ -1574,6 +1576,13 @@ int blosc2_compress_ctx(blosc2_context* context, size_t nbytes,
sample_size = context->sourcesize % context->blocksize;
samples_sizes[nblocks - 1] = sample_size;
+ size_t total_size = 0;
+ for (size_t i = 0; i < nblocks; ++i)
+ total_size += samples_sizes[i];
+
+ fprintf(stderr, "dict_maxsize %zu\n", dict_maxsize);
+ fprintf(stderr, "total_size %zu; XXH64 = %llx\n", total_size, XXH64(samples_buffer, total_size, 0));
+
// Train from samples
void* dict_buffer = malloc(dict_maxsize);
size_t dict_actual_size = ZDICT_trainFromBuffer(
Commit https://github.com/facebook/zstd/commit/9a54a315aa28a6659b935bd6ce95cb962715ebbc and commit https://github.com/facebook/zstd/commit/d49eb40c03845d0961f2819f502c51a11bd7cbe5 both cause the regression. The TL;DR is that the regression is caused by the artificial structure of the data in the tests, and I don't expect to see the same thing happen in real data.
The algorithm is behaving as expected after these commits; before them there was a bug. The algorithm selects segments of length k (ranging from 16 to 2 KB) based on the number of distinct substrings of length d (by default 8) that they cover and that are not already present in the dictionary. Before those two commits, it would continue to add segments until the dictionary was full, even if they covered no new substrings of length d.
Because of the way the test data is artificially constructed, it contains a very limited number of distinct substrings of length 8. After only 2000 bytes the dictionary contains every distinct dmer, so the heuristic says adding more data to the dictionary won't help.
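To build intuition for that saturation, here is a small standalone sketch (a hypothetical helper, not code from zstd or c-blosc2) that counts the distinct 8-byte substrings in a buffer; on input as repetitive as the synthetic test data, the count plateaus almost immediately:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count distinct 8-byte substrings (dmers) using a small open-addressing
   hash set. Capacity must stay well above the number of distinct dmers,
   which holds for short, repetitive test inputs. */
static size_t count_distinct_dmers(const uint8_t* buf, size_t len) {
  const size_t cap = (size_t)1 << 22;               /* power of two */
  uint64_t* keys = malloc(cap * sizeof *keys);
  uint8_t* used = calloc(cap, 1);
  size_t count = 0;
  for (size_t i = 0; i + 8 <= len; i++) {
    uint64_t v;
    memcpy(&v, buf + i, 8);                         /* one dmer, d == 8 */
    size_t h = (size_t)((v * 0x9E3779B97F4A7C15ULL) >> 42);  /* top 22 bits */
    while (used[h] && keys[h] != v) h = (h + 1) & (cap - 1);
    if (!used[h]) { used[h] = 1; keys[h] = v; count++; }
  }
  free(keys);
  free(used);
  return count;
}

int main(void) {
  /* Highly repetitive input, loosely mimicking the synthetic test data. */
  enum { LEN = 1 << 20 };
  uint8_t* buf = malloc(LEN);
  for (size_t i = 0; i < LEN; i++) buf[i] = (uint8_t)(i % 32);
  printf("distinct dmers: %zu\n", count_distinct_dmers(buf, LEN)); /* 32 here */
  free(buf);
  return 0;
}
```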
Now the question is why does the compression ratio drop when the dictionary went from 25 KB to 2 KB, even though nothing interesting was added in the extra 23 KB? The answer is in how zstd loads dictionaries. At low levels, it uses a heuristic and only loads in every 3rd position. This is hurting compression ratio because we'll have to write up to 3 extra literals until we find a match. To confirm this I took the 2000 byte dictionary, and put two copies back to back, then four, then eight.
| # copies | size (bytes) | ratio |
| --- | --- | --- |
| 1 | 2000 | 49.6 |
| 2 | 4000 | 89.8 |
| 4 | 8000 | 164.8 |
| 8 | 16000 | 260.2 |
The only reason the compression ratio would improve like this is that we are loading more positions into zstd's hash table, so it can emit fewer literals. This data exposes a weakness in the heuristic used by the low compression levels, but in the common case the heuristic works much better than adding every position to the hash table.
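For reference, the back-to-back-copies experiment can be reproduced along these lines (a sketch only: it assumes `dict` and `dict_size` hold the trained 2000-byte dictionary, and the helper name is made up):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

/* Compress `src` with a dictionary made of `ncopies` back-to-back copies
   of `dict`, at level 1 (as c-blosc2 does), and report the ratio. */
static void try_copies(const void* dict, size_t dict_size, int ncopies,
                       const void* src, size_t src_size) {
  char* big = malloc(dict_size * ncopies);
  for (int i = 0; i < ncopies; i++)
    memcpy(big + (size_t)i * dict_size, dict, dict_size);
  ZSTD_CDict* cdict = ZSTD_createCDict(big, dict_size * ncopies, 1);
  ZSTD_CCtx* cctx = ZSTD_createCCtx();
  size_t bound = ZSTD_compressBound(src_size);
  void* dst = malloc(bound);
  size_t csize = ZSTD_compress_usingCDict(cctx, dst, bound, src, src_size, cdict);
  if (!ZSTD_isError(csize))
    printf("%d copies (%zu bytes): ratio %.1fx\n",
           ncopies, dict_size * ncopies, (double)src_size / csize);
  free(dst);
  free(big);
  ZSTD_freeCCtx(cctx);
  ZSTD_freeCDict(cdict);
}
```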
@FrancescAlted mentioned that :
The zstd compression level for all these tests is 9 (corresponding to c-blosc2 level 5).
@terrelln, is that confirmed in your tests ?
In that case, the dictionary should not be using the fast or dfast strategies, to which the above explanations apply.
For a small 2K dictionary, I would expect level 9 to use strategy btlazy2,
while it would use lazy2 for a larger 25K dictionary.
Both are supposed to reference all positions in the dictionary.
Some tentative ideas:
1) Does the difference in strategy produce an impact? (a quick check is sketched below)
2) Would it be beneficial to generate distinct substrings of length > 8?
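One quick way to check idea 1) is to ask libzstd which parameters it would pick for each dictionary size; a sketch using ZSTD_getCParams, which lives in the experimental API (hence the ZSTD_STATIC_LINKING_ONLY define):

```c
#include <stdio.h>
#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_getCParams is experimental API */
#include <zstd.h>

int main(void) {
  /* Compare the parameters chosen at level 9 for a 2 KB vs a 25 KB
     dictionary, with an unknown source size (0). */
  ZSTD_compressionParameters small = ZSTD_getCParams(9, 0, 2000);
  ZSTD_compressionParameters large = ZSTD_getCParams(9, 0, 25000);
  printf("2 KB dict:  windowLog=%u strategy=%d\n",
         small.windowLog, (int)small.strategy);
  printf("25 KB dict: windowLog=%u strategy=%d\n",
         large.windowLog, (int)large.strategy);
  return 0;
}
```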
The cdict used for compression is created with ZSTD_createCDict(dict_buffer, dict_actual_size, 1), so level 1 is used for dictionary compression, which confirms that fast or dfast will be used.
I was wrong about the dictionary loading. It is the reduced window size: the smaller dictionary makes us choose a smaller window log during dictionary compression. Changing the window log, with no other changes, restores the compression ratio.
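For completeness, with the newer stable advanced API (libzstd v1.4+) the window log can be pinned explicitly during dictionary compression; a minimal sketch, with made-up buffer and function names:

```c
#include <zstd.h>

/* Pin the window log so a small dictionary does not shrink the window
   (requires libzstd v1.4.0+ for the stable advanced API). */
static size_t compress_with_fixed_wlog(ZSTD_CCtx* cctx,
                                       void* dst, size_t dst_cap,
                                       const void* src, size_t src_size,
                                       const void* dict, size_t dict_size) {
  ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 1);
  ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 20);  /* e.g. 1 MB window */
  ZSTD_CCtx_loadDictionary(cctx, dict, dict_size);
  return ZSTD_compress2(cctx, dst, dst_cap, src, src_size);
}
```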
Fix merged into dev branch