Hi,
I am struggling a bit with a compression ratio that remains very low, while I think it should be higher.
I am compressing a huge number of small data chunks (1323 bytes each), representing terrain elevation data for a terrain rendering system.
Each chunk is made of 441x u16 (elevation) + 441x u8 (alpha).
Now, I understand small data doesn't compress well, so I tried the dictionary approach. But whatever I do, I can't even reach a 2x compression ratio.
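For reference, the chunk layout described above can be sketched as a packed struct (illustrative only; the type name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of one 1323-byte tile (a 21x21 grid):
 * 441 u16 elevation samples followed by 441 u8 alpha samples.
 * Packed so there is no padding between the two arrays. */
#pragma pack(push, 1)
typedef struct {
    uint16_t elevation[441]; /* 882 bytes */
    uint8_t  alpha[441];     /* 441 bytes */
} Tile;
#pragma pack(pop)
```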
Some weird things I have noticed:
Would you have some hints for me? Any idea how I could achieve a higher compression ratio? I need to be able to read/decompress each tile randomly (so streaming is not an option).
Basically, the source data is the SRTM dataset (roughly 15 GB zip-compressed). When split into small tiles, with each tile compressed individually using ZSTD, the total is about 100 GB (a dictionary doesn't even bring a 5% improvement).
Thanks a lot !
Greg
Hi,
Your elevation data is numerical; the Lempel-Ziv stage of zstd will not help much here.
However, the alpha channel seems highly redundant - putting elevation and alpha in separate buffers should help.
For 2D elevation data you should exploit the fact that close points have similar values - exactly as in lossless image compression: https://en.wikipedia.org/wiki/Lossless_JPEG
You scan line by line and encode only the difference from a predictor based on already decoded neighboring pixels - for numerical data (e.g. PDE solutions), the Lorenzo predictor is commonly used: https://computing.llnl.gov/vis/images/pdf/Lindstrom_EEG03.pdf
You can write a filter which transforms your elevation map into differences from the predictor, then just apply e.g. FSE itself: https://github.com/Cyan4973/FiniteStateEntropy
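To make the filter idea concrete, here is a minimal sketch (not code from this thread; function names are invented) of a Lorenzo residual filter over a w×h grid of u16 samples, falling back to a plain 1D delta on the first row and column, with arithmetic wrapping mod 2^16 so the transform stays exactly invertible:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replace each sample with its difference from the Lorenzo prediction
 * left + top - topleft (plain delta on the first row/column).
 * Wrapping u16 arithmetic makes the filter exactly invertible. */
static void lorenzo_forward(const uint16_t *src, uint16_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint16_t pred;
            if (x == 0 && y == 0)      pred = 0;
            else if (y == 0)           pred = src[x - 1];        /* 1D delta, first row */
            else if (x == 0)           pred = src[(y - 1) * w];  /* 1D delta, first column */
            else pred = (uint16_t)(src[y * w + x - 1]
                                 + src[(y - 1) * w + x]
                                 - src[(y - 1) * w + x - 1]);
            dst[y * w + x] = (uint16_t)(src[y * w + x] - pred);
        }
    }
}

/* Inverse: predict from already-decoded neighbors and add the residual. */
static void lorenzo_inverse(const uint16_t *src, uint16_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            uint16_t pred;
            if (x == 0 && y == 0)      pred = 0;
            else if (y == 0)           pred = dst[x - 1];
            else if (x == 0)           pred = dst[(y - 1) * w];
            else pred = (uint16_t)(dst[y * w + x - 1]
                                 + dst[(y - 1) * w + x]
                                 - dst[(y - 1) * w + x - 1]);
            dst[y * w + x] = (uint16_t)(src[y * w + x] + pred);
        }
    }
}
```

On smooth terrain, the residuals cluster around zero, which is what makes the subsequent entropy-coding stage effective.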
For a table of regular samples, such as u8 (alpha) or u16 (elevation),
I would recommend using Blosc from @FrancescAlted.
Blosc is a pre-filter which is very efficient for such data.
It prepares the data using a shuffle operation for the next round of compression.
In general, in combination with zstd (which is included in Blosc), it tends to work very well.
See http://www.blosc.org/blog/zstd-has-just-landed-in-blosc.html for more details.
I would also recommend separating the elevation and alpha channels for best results.
Hi guys,
thanks a lot for your input; both suggestions are quite pertinent.
I will experiment with Blosc and then evaluate a predictor such as Lorenzo's. Any idea which one would be the most efficient?
One point I didn't get (I am not an expert in compression theory, excuse me if my question is stupid): why do you recommend using FSE rather than the full ZSTD compression routine?
Thanks a lot ! Very helpful answers :)
Hi,
zstd is basically Lempel-Ziv (exploiting repeated byte sequences - not very helpful for numerical data) + an entropy coder (FSE).
blosc might help a bit, but if you want a big improvement you need to exploit the numerical nature of your data and the fact that close values should be similar - at the very least use a differential filter: transforming the {v_i} sequence into the {v_i - v_{i-1}} sequence, to which you then apply the entropy coder.
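A minimal sketch of that differential filter (illustrative names; wrapping u16 arithmetic keeps it lossless even when a delta is negative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* v_i -> v_i - v_{i-1}, in place, wrapping mod 2^16. */
static void delta_forward(uint16_t *v, int n)
{
    uint16_t prev = 0;
    for (int i = 0; i < n; i++) {
        uint16_t cur = v[i];
        v[i] = (uint16_t)(cur - prev);
        prev = cur;
    }
}

/* Inverse: running sum restores the original sequence. */
static void delta_inverse(uint16_t *v, int n)
{
    uint16_t prev = 0;
    for (int i = 0; i < n; i++) {
        v[i] = (uint16_t)(v[i] + prev);
        prev = v[i];
    }
}
```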
OK Jarek, understood, I will therefore try to implement the predictor solution first (if I can get my head around it). Thanks a lot guys !! I will keep you informed about the win !
Jarek, stupid question: when using the Lorenzo predictor, the first row and the first column of my 2D matrix (a regular grid storing the elevation values) are left unchanged (since computing the predicted value requires their neighbours) - do I understand correctly?
Nevermind, I figured out I can use a 1D predictor for the first row.
Now, I still have two questions:
thanks (nearly got it :))
Hi guys,
as promised, some feedback:
So, I am now stuck and I am not sure I can do anything better, let me know if I am wrong :) (maybe that with the 0xFFFF values ?)
In any case thanks a lot for the help, really appreciated !
What about using blosc on the predicted data ?
I haven't tried blosc yet; do you think it would be worth it? I will give it a shot. Do you recommend simply using the shuffle/bitshuffle routines, or should I rather use the full blosc_compress() function?
Another weird thing (at least, something I don't understand - yeah, I know, I keep asking for help :D): the alpha data is encoded after the elevation (i.e. one single buffer made of 441x u16 + 441x u8).
When I remove it (i.e. ending up with a 441x u16 buffer only), the compressed size only goes from 128MB down to 110MB, while I would expect the alpha channel, mostly filled with 255 or 0 (i.e. well suited for LZ), to compress down to almost nothing.
Also, applying the predictor on the alpha part (i.e. the 441x u8) decreases the compression ratio (an additional 10MB).
This doesn't make sense to me.
Should I compress elevation and alpha using two separate ZSTD compress calls? Does that make any sense?
Current figures (no dictionary used):
293MB uncompressed
152MB compressed (no predictor)
128MB compressed with Lorenzo on elevation only
137MB compressed with Lorenzo on elevation and on alpha
110MB compressed with Lorenzo on elevation (alpha data removed)
Assume nothing. Test possibilities, and keep the better ones.
Ultimately, since you have the data, you are in the best position to select the most suitable algorithms. Remotely, we can only hand-wave at directions.
Yes, I know... but you have solid experience in compression, and mine is close to zero :P
Anyway, the directions were pretty good so far.
Basically, Blosc seems to shuffle the data and apply a compressor - or does it do some other magic? I am not interested in the threading features, so I guess I will simply use the shuffling routines, then zstd.
The reason why shuffling would lead to better compression ratios is quite unintuitive to me... how does that work?
Oh, and I am happy to provide you with data :D - but I assume you have better and more interesting tasks to work on!
Hi. Let me chime in to say that the shuffle filter groups bytes according to their significance. The most significant bytes in a chunk are put in one (sequentially ordered) group, whereas the least significant bytes are put in another. Binary data typically has lower variability in the more significant bytes, which enhances the compression ratio.
Also, Blosc comes with a bitshuffle filter, which works the same as the shuffle described above, but on bits rather than bytes. In some cases this works better than plain shuffle.
Finally, I can't promise anything, but if you tell us where your data is, perhaps I can have a try at it using different parameters in Blosc.
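For illustration, a simplified version of the byte-shuffle idea for u16 samples (not Blosc's actual implementation, just the principle it applies) could look like:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write all low bytes first, then all high bytes, so the slowly-varying
 * high bytes end up in one long homogeneous run that LZ/entropy stages
 * can exploit. */
static void shuffle_u16(const uint16_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i]     = (uint8_t)(src[i] & 0xFF); /* low-byte plane  */
        dst[n + i] = (uint8_t)(src[i] >> 8);   /* high-byte plane */
    }
}

/* Reassemble each sample from its two byte planes. */
static void unshuffle_u16(const uint8_t *src, uint16_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint16_t)(src[i] | (src[n + i] << 8));
}
```

For SRTM-like elevations, most neighboring samples share the same high byte, so the high-byte plane becomes a long run of nearly identical values.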
Thanks a lot for your answer, Francesc, and thanks a lot for the explanation.
OK, so basically, if I understand correctly, I should rather encode my data with a predictor that leads to small positive integers, i.e. (x - xmin), rather than with a predictor that can produce either negative or positive values (0xFFFF or 0x0001 for instance) such as Lorenzo? Wouldn't that make sense?
I have uploaded my test data, consisting of 150k small independent files contained in a .zip (easier for upload).
Each file has the following layout:
I have uploaded 3 different versions:
I am now going to try shuffling bytes & bits using blosc routines prior to compression.
Some additional results:
125MB compressed with Lorenzo on elevation only + shuffle (all)
125MB compressed with Lorenzo on elevation only + shuffle (all) + bitshuffle (all)
128MB compressed with Lorenzo on elevation only + bitshuffle (all)
133MB compressed with delta predictor on elevation only
134MB compressed with delta predictor on elevation and alpha
123MB compressed with delta predictor on elevation only + shuffle (all)
125MB compressed with delta predictor on elevation and alpha + shuffle (all)
125MB compressed with delta predictor on elevation and alpha + shuffle elev + bitshuffle alpha
125MB compressed with delta predictor on elevation and alpha + shuffle elev
A standard shuffle on alpha doesn't make sense (it doesn't change anything, since the samples are u8).
I thought I would try a standard delta predictor (Xn = Xn - X(n-1)), since that removes the high-frequency change at the start of each row (1D predictor instead of 2D), which decreases the compressed size by 2MB on my test data.
To take advantage of the shuffle, however, I assume my elevation values should all be positive, so I might try to first find the lowest elevation value, then store Xn = Xn - Xmin, then shuffle.
Or do you have any better idea?
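The offset idea above, sketched in code (illustrative only; the minimum has to be stored alongside the chunk to reverse the transform):

```c
#include <assert.h>
#include <stdint.h>

/* Shift all samples by the chunk minimum so every stored value is a
 * small non-negative integer. Returns the minimum, which must be kept
 * with the chunk for decompression. */
static uint16_t offset_forward(uint16_t *v, int n)
{
    uint16_t min = v[0];
    for (int i = 1; i < n; i++)
        if (v[i] < min) min = v[i];
    for (int i = 0; i < n; i++)
        v[i] = (uint16_t)(v[i] - min);
    return min;
}

/* Inverse: add the stored minimum back. */
static void offset_inverse(uint16_t *v, int n, uint16_t min)
{
    for (int i = 0; i < n; i++)
        v[i] = (uint16_t)(v[i] + min);
}
```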
I managed to get an additional 2MB of compression (i.e. 121MB total now) using a dictionary on top of the "delta predictor on elevation only + shuffle (all)" scheme.
I guess I could use a bigger dictionary, but the dictionary training phase gets really long for anything above 10 MB of sample data (i.e. a 100kB dictionary). I should investigate why.
Now I think I am close to my goal; all that's left is to reduce the quality of my input data (i.e. quantize more).
Thanks once again for your precious help and for sharing your experience. Really appreciated.
Just tried with a 50MB sample buffer. Dictionary training is still ongoing after 15 minutes... Is that expected? Is there any way to speed up the dictionary training step?
I will let it complete to see whether it brings any compression improvement or not.
Meanwhile below some screenshots showing the two bottlenecks in my case...
15 min for a 50 MB sample set - that's very long.
Maybe the file content has some special pattern that the algorithm doesn't like...
The dictionary builder should display a progress number by default (add -v if not).
If the status doesn't progress, then it could be another problem, like an infinite loop.
Training is still ongoing :/
I am calling ZDICT_trainFromBuffer() directly; it seems I can get the same result using ZDICT_trainFromBuffer_advanced(), so I will try that now. Then I will try a debug build and see what's wrong.
> it seems I can get the same result using ZDICT_trainFromBuffer_advanced()
Yes,
just set the notificationLevel to 2, and you should be able to see on the console whether the algorithm makes any progress.
Compiling right now.
Since my samples are small (1323 bytes), 50MB means I have about 40,000 samples. Could that cause the issue?
I don't think so. It should be able to cope with that.
OK, it's running, currently at 1.58% (after 3 minutes on an i7 3770, release build with optimizations, 64-bit)...
It is progressing, so I assume it's simply (too) slow...
Still progressing: 7.3% after 10 minutes!
It should progress progressively faster.
It seems there are some patterns that the algorithm doesn't like.
I would bet there are a lot of long non-trivial repetitions, which have to be checked at each step.
Is there anything I can do, or should I simply stop using so many samples? (still at 13.42% after 20 minutes)
For now, just wait, let it work in the background.
For the future, consider providing less samples.
It's much longer than usual; I have already trained on hundreds of MB of samples, and it's definitely not that slow. But some patterns are more costly than others. It seems your samples are hitting a corner case with degenerate performance.
Actually, my plan was to run that dictionary training process at runtime, each time a user generates their data (in order to better adapt the dictionary to the actual data content).
I was trying to find out whether a bigger dictionary improves the compression ratio significantly.
With larger samples, you can expect an improvement, but only a small one.
10 MB is supposed to be good enough in most cases.
It eventually completed, after more than an hour and a half.
The win is about 1MB (121MB => 120MB).
Good to take - if there is a way to speed up the training :/
Note: the resulting dictionary is only 16kB...
We will probably have to develop some new dictionary training algorithm in the future, able to quickly "approximate" over a given input segment.
This possibility doesn't exist today, unfortunately.