Hi,
I am struggling a bit with a compression ratio that remains very low, while I think it should be higher.
I am compressing a huge number of small data chunks (1323 bytes each), representing terrain elevation data for a terrain rendering system.
Each chunk is made of 441x u16 (elevation) + 441x u8 (alpha).
Now, I understand small data doesn't compress well, so I tried the dictionary approach. But whatever I do, I can't even reach a 2x compression ratio.
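For reference, the chunk layout described above can be sketched as a packed struct (illustrative only; the type name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of one 1323-byte tile (a 21x21 grid):
 * 441 u16 elevation samples followed by 441 u8 alpha samples.
 * Packed so there is no padding between the two arrays. */
#pragma pack(push, 1)
typedef struct {
    uint16_t elevation[441]; /* 882 bytes */
    uint8_t  alpha[441];     /* 441 bytes */
} Tile;
#pragma pack(pop)
```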
Some weird things I have noticed:
Would you have some hints for me? Any idea how I could achieve a higher compression ratio? I need to be able to read/decompress each tile randomly (so streaming is not an option).
Basically, the source data is the SRTM dataset (roughly 15 GB zip-compressed). When split into small tiles, with each tile compressed individually using ZSTD, the total is about 100 GB (a dictionary doesn't even bring a 5% improvement).
Thanks a lot !
Greg
Hi,
Your elevation data is numerical; the Lempel-Ziv stage of zstd will not help much here.
However, the alpha channel seems highly redundant - putting elevation and alpha in separate buffers should help.
For 2D elevation data you should exploit the fact that close points have similar values - exactly as in lossless image compression: https://en.wikipedia.org/wiki/Lossless_JPEG
You scan line by line and encode only the difference from a predictor based on already decoded neighboring pixels - for numerical data (e.g. PDE solutions), the Lorenzo predictor is commonly used: https://computing.llnl.gov/vis/images/pdf/Lindstrom_EEG03.pdf
You can write a filter which transforms your elevation map into differences from the predictor, then just apply e.g. FSE itself: https://github.com/Cyan4973/FiniteStateEntropy
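To make the filter idea concrete, here is a minimal sketch (not code from this thread; function names are invented) of a Lorenzo residual filter over a w×h grid of u16 samples, falling back to a plain 1D delta on the first row and column, with arithmetic wrapping mod 2^16 so the transform stays exactly invertible:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replace each sample with its difference from the Lorenzo prediction
 * left + top - topleft (plain delta on the first row/column).
 * Wrapping u16 arithmetic makes the filter exactly invertible. */
static void lorenzo_forward(const uint16_t *src, uint16_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint16_t pred;
            if (x == 0 && y == 0)      pred = 0;
            else if (y == 0)           pred = src[x - 1];        /* 1D delta, first row */
            else if (x == 0)           pred = src[(y - 1) * w];  /* 1D delta, first column */
            else pred = (uint16_t)(src[y * w + x - 1]
                                 + src[(y - 1) * w + x]
                                 - src[(y - 1) * w + x - 1]);
            dst[y * w + x] = (uint16_t)(src[y * w + x] - pred);
        }
    }
}

/* Inverse: predict from already-decoded neighbors and add the residual. */
static void lorenzo_inverse(const uint16_t *src, uint16_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint16_t pred;
            if (x == 0 && y == 0)      pred = 0;
            else if (y == 0)           pred = dst[x - 1];
            else if (x == 0)           pred = dst[(y - 1) * w];
            else pred = (uint16_t)(dst[y * w + x - 1]
                                 + dst[(y - 1) * w + x]
                                 - dst[(y - 1) * w + x - 1]);
            dst[y * w + x] = (uint16_t)(src[y * w + x] + pred);
        }
    }
}
```

On smooth terrain, the residuals cluster around zero, which is what makes the subsequent entropy-coding stage effective.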
For a table of regular samples, such as u8 (alpha) or u16 (elevation),
I would recommend using Blosc from @FrancescAlted.
Blosc is a pre-filter which is very efficient for such data.
It prepares the data using a shuffle operation for the next round of compression.
In general, in combination with zstd (which is included in Blosc), it tends to work very well.
See http://www.blosc.org/blog/zstd-has-just-landed-in-blosc.html for more details.
I would also recommend separating the elevation and alpha channels for best results.
Hi guys,
thanks a lot for your input; both suggestions are quite pertinent.
I will experiment with Blosc and then evaluate a predictor such as Lorenzo's. Any idea which one would be the most efficient?
One point I didn't get (I am not an expert in compression theory, excuse me if my question is stupid): why do you recommend using FSE rather than the full ZSTD compression routine?
Thanks a lot ! Very helpful answers :)
Hi,
zstd is basically Lempel-Ziv (exploiting repeated byte sequences - not very helpful for numerical data) + an entropy coder (FSE).
blosc might help a bit, but if you want a big improvement you need to exploit the numerical nature of your data and the fact that close values should be similar - at the very least use a differential filter: transforming the {v_i} sequence into the {v_i - v_{i-1}} sequence, to which you then apply the entropy coder.
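A minimal sketch of that differential filter (illustrative names; wrapping u16 arithmetic keeps it lossless even when a delta is negative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* v_i -> v_i - v_{i-1}, in place, wrapping mod 2^16. */
static void delta_forward(uint16_t *v, int n)
{
    uint16_t prev = 0;
    for (int i = 0; i < n; i++) {
        uint16_t cur = v[i];
        v[i] = (uint16_t)(cur - prev);
        prev = cur;
    }
}

/* Inverse: running sum restores the original sequence. */
static void delta_inverse(uint16_t *v, int n)
{
    uint16_t prev = 0;
    for (int i = 0; i < n; i++) {
        v[i] = (uint16_t)(v[i] + prev);
        prev = v[i];
    }
}
```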
OK Jarek, understood, I will therefore try to implement the predictor solution first (if I can get my head around it). Thanks a lot guys !! I will keep you informed about the win !
Jarek, stupid question: when using the Lorenzo predictor, the first row and the first column of my 2D matrix (a regular grid storing the elevation values) are left unchanged (since computing the predicted value requires their neighbours) - do I understand correctly?
Nevermind, I figured out I can use a 1D predictor for the first row.
Now, I still have two questions:
thanks (nearly got it :))
Hi guys,
as promised, some feedback:
So, I am now stuck and I am not sure I can do anything better, let me know if I am wrong :) (maybe that with the 0xFFFF values ?)
In any case thanks a lot for the help, really appreciated !
What about using blosc on the predicted data ?
I haven't tried blosc yet; do you think it would be worth it? I will give it a shot. Do you recommend simply using the shuffle/bitshuffle routines, or should I rather use the full blosc_compress() function?
Another weird thing (at least, something I don't understand - yeah, I know, I keep asking for help :D): the alpha data is encoded after the elevation (i.e. one single buffer made of 441x u16 + 441x u8).
When I remove it (i.e. ending up with a 441x u16 buffer only), the compressed size only goes from 128MB down to 110MB, while I would expect the alpha channel, mostly filled with 255 or 0 (i.e. well suited for LZ), to compress down to almost nothing.
Also, applying the predictor on the alpha part (i.e. the 441x u8) decreases the compression ratio (an additional 10MB).
This doesn't make sense to me.
Should I compress elevation and alpha using two separate ZSTD compress calls? Does that make any sense?
Current figures (no dictionary used):
293MB uncompressed
152MB compressed (no predictor)
128MB compressed with Lorenzo on elevation only
137MB compressed with Lorenzo on elevation and on alpha
110MB compressed with Lorenzo on elevation (alpha data removed)
Assume nothing. Test possibilities, and keep the better ones.
Ultimately, since you have the data, you are in the best position to select the most suitable algorithms. Remotely, we can only hand-wave at directions.
Yes, I know... but you have solid experience in compression, and mine is close to zero :P
Anyway, the directions were pretty good so far.
Basically, Blosc seems to shuffle the data and apply a compressor - or does it do some other magic? I am not interested in the threading features, so I guess I will simply use the shuffling routines, then zstd.
The reason why shuffling would lead to better compression ratios is quite unintuitive to me... how does that work?
Oh, and I am happy to provide you with data :D - but I assume you have better and more interesting tasks to work on!
Hi. Let me chime in to say that the shuffle filter groups bytes according to their significance. The most significant bytes in a chunk are put in one (sequentially ordered) group, whereas the least significant bytes are put in another. Binary data typically has lower variability in the more significant bytes, which enhances the compression ratio.
Also, Blosc comes with a bitshuffle filter, which works the same as the shuffle described above, but on bits rather than bytes. In some cases this works better than plain shuffle.
Finally, I can't promise anything, but if you tell us where your data is, perhaps I can have a try at it using different parameters in Blosc.
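For illustration, a simplified version of the byte-shuffle idea for u16 samples (not Blosc's actual implementation, just the principle it applies) could look like:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write all low bytes first, then all high bytes, so the slowly-varying
 * high bytes end up in one long homogeneous run that LZ/entropy stages
 * can exploit. */
static void shuffle_u16(const uint16_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i]     = (uint8_t)(src[i] & 0xFF); /* low-byte plane  */
        dst[n + i] = (uint8_t)(src[i] >> 8);   /* high-byte plane */
    }
}

/* Reassemble each sample from its two byte planes. */
static void unshuffle_u16(const uint8_t *src, uint16_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint16_t)(src[i] | (src[n + i] << 8));
}
```

For SRTM-like elevations, most neighboring samples share the same high byte, so the high-byte plane becomes a long run of nearly identical values.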
Thanks a lot for your answer, Francesc, and thanks a lot for the explanation.
OK, so basically, if I understand correctly, I should rather encode my data with a predictor that leads to small positive integers, i.e. (x - xmin), rather than with a predictor that can produce either negative or positive values (0xFFFF or 0x0001 for instance) such as Lorenzo? Wouldn't that make sense?
I have uploaded my test data, consisting of 150k small independent files contained in a .zip (easier for upload).
Each file has the following layout:
I have uploaded 3 different versions:
I am now going to try shuffling bytes & bits using blosc routines prior to compression.
Some additional results:
125MB compressed with Lorenzo on elevation only + shuffle (all)
125MB compressed with Lorenzo on elevation only + shuffle (all) + bitshuffle (all)
128MB compressed with Lorenzo on elevation only + bitshuffle (all)
133MB compressed with delta predictor on elevation only
134MB compressed with delta predictor on elevation and alpha
123MB compressed with delta predictor on elevation only + shuffle (all)
125MB compressed with delta predictor on elevation and alpha + shuffle (all)
125MB compressed with delta predictor on elevation and alpha + shuffle elev + bitshuffle alpha
125MB compressed with delta predictor on elevation and alpha + shuffle elev
A standard shuffle on alpha doesn't make sense (it doesn't change anything, since the samples are u8).
I thought I would try a standard delta predictor (Xn = Xn - X(n-1)), since that removes the high-frequency change at the start of each row (1D predictor instead of 2D), which decreases the compressed size by 2MB on my test data.
To take advantage of the shuffle, however, I assume my elevation values should all be positive, so I might try to first find the lowest elevation value, then store Xn = Xn - Xmin, then shuffle.
Or do you have any better idea?
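The offset idea above, sketched in code (illustrative only; the minimum has to be stored alongside the chunk to reverse the transform):

```c
#include <assert.h>
#include <stdint.h>

/* Shift all samples by the chunk minimum so every stored value is a
 * small non-negative integer. Returns the minimum, which must be kept
 * with the chunk for decompression. */
static uint16_t offset_forward(uint16_t *v, int n)
{
    uint16_t min = v[0];
    for (int i = 1; i < n; i++)
        if (v[i] < min) min = v[i];
    for (int i = 0; i < n; i++)
        v[i] = (uint16_t)(v[i] - min);
    return min;
}

/* Inverse: add the stored minimum back. */
static void offset_inverse(uint16_t *v, int n, uint16_t min)
{
    for (int i = 0; i < n; i++)
        v[i] = (uint16_t)(v[i] + min);
}
```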
I managed to get an additional 2MB of compression (i.e. 121MB total now) using a dictionary on top of the "delta predictor on elevation only + shuffle (all)" scheme.
I guess I could use a bigger dictionary, but the dictionary training phase gets really long for anything above 10 MB of sample data (i.e. a 100kB dictionary). I should investigate why.
Now I think I am close to my goal; all that's left is to reduce the quality of my input data (i.e. quantize more).
Thanks once again for your precious help and for sharing your experience. Really appreciated.
Just tried with a 50MB sample buffer. Dictionary training is still ongoing after 15 minutes... Is that expected? Is there any way to speed up the dictionary training step?
I will let it complete to see whether it brings any compression improvement or not.
Meanwhile below some screenshots showing the two bottlenecks in my case...
15 min for a 50 MB sample set - that's very long.
Maybe the file content has some special pattern that the algorithm doesn't like...
The dictionary builder should display a progress number by default (add -v if not).
If the status doesn't progress, then it could be another problem, like an infinite loop.
Training is still ongoing :/
I am calling ZDICT_trainFromBuffer() directly; it seems I can get the same result using ZDICT_trainFromBuffer_advanced(), so I will try that now. Then I will try a debug build and see what's wrong.
> it seems I can get the same result using ZDICT_trainFromBuffer_advanced()
Yes,
just set the notificationLevel to 2, and you should be able to see on the console whether the algorithm makes any progress.
Compiling right now.
Since my samples are small (1323 bytes), 50MB means I have about 40,000 samples. Could that cause the issue?
I don't think so. It should be able to cope with that.
OK, it's running, currently at 1.58% (after 3 minutes on an i7 3770, release build with optimizations, 64-bit)...
It is progressing, so I assume it's simply (too) slow...
Still progressing: 7.3% after 10 minutes!
It should progress progressively faster.
It seems there are some patterns that the algorithm doesn't like.
I would bet there are a lot of long non-trivial repetitions, which have to be checked at each step.
Is there anything I can do, or should I simply stop using so many samples? (still at 13.42% after 20 minutes)
For now, just wait, let it work in the background.
For the future, consider providing less samples.
It's much longer than usual; I have already trained on hundreds of MB of samples, and it's definitely not that slow. But some patterns are more costly than others. It seems your samples are hitting a corner case with degenerate performance.
Actually, my plan was to run that dictionary training process at runtime, each time a user generates their data (in order to better adapt the dictionary to the actual data content).
I was trying to find out whether a bigger dictionary improves the compression ratio significantly.
With larger samples, you can expect an improvement, but only a small one.
10 MB is supposed to be good enough in most cases.
It eventually completed, after more than an hour and a half.
The win is about 1MB (121MB => 120MB).
Good to take - if there is a way to speed up the training :/
Note: the resulting dictionary is only 16kB...
We will probably have to develop some new dictionary training algorithm in the future, able to quickly "approximate" over a given input segment.
This possibility doesn't exist today, unfortunately.