minetest 🚀 - Using Zstandard for mapblock compression?

Thumbs up for replacing zlib !

Speed tests I did for minetestmapper a while ago showed that about 50% (IIRC) of the running time was actually spent in zlib code, decompressing map data...

As the comparison table on the zstd page shows about 25% worse compression for lz4, but with the benefit of a 2-3 times speed increase, I'd welcome an implementation that allows different compression algorithms to be used, so that depending on CPU speed, disk speed and disk size, a server operator can choose for faster or for more compact compression. The actual algorithm that was used should obviously be encoded in a block's data.

At the same time, instead of only compressing the node data, _all_ of the block's data could be compressed. IIRC, one important reason LevelDB databases are often smaller, is that LevelDB uses snappy to compress all data it stores.

Rogier-5 on 2 Sep 2016

The compression ratio is not much better compared to zlib but for the same ratio it is able to process much more data, that's very nice. I am not sure about the compatibility with our supported OS' but it looks well maintained there.

SmallJoker on 2 Sep 2016

I'm suspicious of zstandard's licensing at first, their example code isn't even licensed properly, so you can't even use their example code freely.

I would wait and see in a year or so if this is actually taking off anywhere....

sofar on 2 Sep 2016

👍1

If we're going the new shiny stuff™ route, we should also consider https://github.com/google/brotli.

sfan5 on 19 Sep 2016

For the Germans amongst us, here's a relevant blog post: http://blog.fefe.de/?ts=a90c6783

kahrl on 4 Oct 2016

Here's a testing branch with brotli support: https://github.com/sfan5/minetest/tree/brotli
All you need is libbrotli (AUR) and it will automatically compress new worlds with brotli.

Do not use this branch with your existing worlds, they will become unreadable to a "normal" Minetest.

sfan5 on 4 Oct 2016

Nice work, @sfan5 !

I would prefer zstd over brotli, as it seems to be better, unless its proven to perform worse for our use case.

One problem for both brotli and zstd will be however that we'd have to discontinue support for older distros (like ubuntu 12.04, the one @PilzAdam uses) which don't have the library. Both brotli and zstd only first appear in ubuntu 16.04.

Its the same issue as for C++11, all over again :(

Although one could argue that its much much easier to install or download + compile a new library than installing a totally new C++ compiler.

est31 on 4 Oct 2016

@est31

One problem for both brotli and zstd will be however that we'd have to discontinue support for older distros

Actually _lib_brotli isn't in any package repositories right now afaik, brotli doesn't have an official API (yet). zstd does so we wouldn't have that problem.

However, if we intend to support ubuntu 12.04 out of the box then we need to bundle brotli/zstd anyway and this won't matter.

sfan5 on 4 Oct 2016

i prefer lz4 than zlib, it's very fast and i already have a working version in my own version :). lz4 is cpu free

nerzhul on 4 Oct 2016

👍2

Bundling might be an idea, yeah. I hate the concept but if we want brotli/zstd any time soon we might be forced to.

est31 on 4 Oct 2016

👍2

We already bundle jsoncpp, lua, gmp and maybe other things too.

Megaf on 4 Oct 2016

lz4 bundling is easy, 3 files.

nerzhul on 4 Oct 2016

@est31 @nerzhul @Megaf

Edit: see here

Here's the steps I took to compare:

1) Download and extract MG_World.tar.bz2 from http://daconcepts.com/vanessa/hobbies/minetest.html
2) ./minetestserver --world ./MG_World --migrate sqlite3

3) sqlite3 ./MG_World/map.sqlite 'VACUUM;'
4) du -h --si ./MG_World/map.sqlite # note down size (zlib)
5) ./minetestserver --world ./MG_World --decompresstimes # note down bench (zlib)

6) ./minetestserver --world ./MG_World --recompress # note down stats
7) sqlite3 ./MG_World/map.sqlite 'VACUUM;'
8) du -h --si ./MG_World/map.sqlite # note down size (brotli)
9) ./minetestserver --world ./MG_World --decompresstimes # note down bench (brotli)

sfan5 on 4 Oct 2016

World size is not very important , but performance

nerzhul on 4 Oct 2016

@nerzhul
I chose compression quality 6 (same as zlib's default, but that's not important) because it made the compression unittest take exactly the same duration.

Which means that while performance should be about the same while the world is smaller.

sfan5 on 4 Oct 2016

@sfan5 is it possible for you to check the duration for the decompression too? I suspect there's a lot more decompression than compression going on.

The numbers in lz4 benchmarks look very attractive, though.

Ferk on 4 Oct 2016

We have to compare those against lz4 now.
And for me the most important thing should be lower cpu usage/time. Size is secondary.
I will be very happy if the world doesnt get any larger.

Megaf on 4 Oct 2016

@sfan5 noted. I think you should benchmark various mapblock serialization times

nerzhul on 4 Oct 2016

@Ferk @nerzhul

Edit: see here

sfan5 on 4 Oct 2016

I've ran the test by @sfan5 for my zstd branch. (not the most recent commit, but the commit called "unittest").

I got in fact an increase in space required (944148 K with the zlib compression, and 1062140 K with the zstd compression, an increase by 12%). Note that I didn't use a dictionary, while brotli has a hardcoded dictionary, and for the minetest use case a dictionary would be really useful I think, as it perfectly matches the dictionary use case (few KB of data). So there is certainly room for improvement.

The decompression time however is really really great.

For zstd:

############
Total decompression CPU time:     19.0198s
Decompression CPU time per block: 7.27373us
Decompression speed:              30393 KB/s
############

For zlib:

############
Total decompression CPU time:     72.0594s
Decompression CPU time per block: 27.5577us
Decompression speed:              7037.76 KB/s
############

A bit odd was that when I rewrote the code to use the streaming API, it suddenly got horribly slow (20x decrease). The benchmarks are from using the non streaming API.

est31 on 7 Oct 2016

I got in fact an increase in space required. Note that I didn't use a dictionary [...], and for the minetest use case a dictionary would be really useful I think

Brotli supports dictionaries too, I agree that we should try that approach. We might need/want different dicts for block data vs. metadata too.

The problem however is that every application that wants to read the Minetest's data needs the same dictionary and we can't ever change the dict. Unless we add dictionary versioning which would also require that applications keep up with dict updates done in Minetest.

A bit odd was that when I rewrote the code to use the streaming API, it suddenly got horribly slow (20x decrease). The benchmarks are from using the non streaming API.

The numbers from using the non-streaming API don't matter, to compare we need the benchmark while using the final design (streaming API).

sfan5 on 7 Oct 2016

So I did all of the benchmarks again with some improvements I made to the statistics it outputs:

map.sqlite zlib: 820.3 MiB
map.sqlite brotli: 719.4 MiB (12.3% smaller)

--recompress:

2016-10-07 13:51:16: ACTION[Main]: ############
2016-10-07 13:51:16: ACTION[Main]: Recompressed blocks:            2614854
2016-10-07 13:51:16: ACTION[Main]: Total data (uncomp.):           4008 MB
2016-10-07 13:51:16: ACTION[Main]: Total compression CPU time:     388.206s
2016-10-07 13:51:16: ACTION[Main]: Compression CPU time per block: 148.462us
2016-10-07 13:51:16: ACTION[Main]: Compression speed:              10573.3 KB/s
2016-10-07 13:51:16: ACTION[Main]: ############

Decompression benchmark:

2016-10-07 14:05:19: ACTION[Main]: ### zlib ###
2016-10-07 14:05:19: ACTION[Main]: Total blocks:                     2614854
2016-10-07 14:05:19: ACTION[Main]: Total data (comp.):               495 MB
2016-10-07 14:05:19: ACTION[Main]: Total decompression CPU time:     83.7842s
2016-10-07 14:05:19: ACTION[Main]: Decompression CPU time per block: 32.0416us
2016-10-07 14:05:19: ACTION[Main]: Decompression speed:              6052.89 KB/s
2016-10-07 14:05:19: ACTION[Main]: ############
2016-10-07 14:01:56: ACTION[Main]: ## brotli ##
2016-10-07 14:01:56: ACTION[Main]: Total blocks:                     2614854
2016-10-07 14:01:56: ACTION[Main]: Total data (comp.):               406 MB
2016-10-07 14:01:56: ACTION[Main]: Total decompression CPU time:     79.2955s
2016-10-07 14:01:56: ACTION[Main]: Decompression CPU time per block: 30.325us
2016-10-07 14:01:56: ACTION[Main]: Decompression speed:              5246.44 KB/s
2016-10-07 14:01:56: ACTION[Main]: ############

sfan5 on 7 Oct 2016

I wonder how faster unpacking will affect stuttering ingame.

Fixer-007 on 7 Oct 2016

The numbers from using the non-streaming API don't matter, to compare we need the benchmark while using the final design (streaming API).

Yes, only the final design should matter, which is the streaming API one way or another. However, the non streaming API shouldn't be slower that much, meaning that there is either a bug in zstd, or in the code to use it. So the only numbers matter when/if that bug has been removed.

est31 on 9 Oct 2016

@est31

The slowness with using the streaming API might be caused by the need to call unget() on the input stream a lot of times to put unused data back. That wouldn't be easily fixed though.

So the only numbers matter when/if that bug has been removed.

No, currently the only way to compare zlib and whatever else in a fair way is not to fix the bug.

sfan5 on 9 Oct 2016

@sfan5 I don't use any unget on the input stream at all, read the code.

est31 on 9 Oct 2016

@est31 Benchmark results for zstd (with the settings est provided):

map.sqlite zstd: 924.6 MiB (12.7% bigger)

--recompress:

2016-10-13 17:10:43: ACTION[Main]: ############
2016-10-13 17:10:43: ACTION[Main]: Recompressed blocks:            2614854
2016-10-13 17:10:43: ACTION[Main]: Total data (uncomp.):           4008 MB
2016-10-13 17:10:43: ACTION[Main]: Total compression CPU time:     116.124s
2016-10-13 17:10:43: ACTION[Main]: Compression CPU time per block: 44.4092us
2016-10-13 17:10:43: ACTION[Main]: Compression speed:              35346.9 KB/s
2016-10-13 17:10:43: ACTION[Main]: ############

Decompression benchmark:

2016-10-13 17:15:57: ACTION[Main]: ############
2016-10-13 17:15:57: ACTION[Main]: Total blocks:                     2614854
2016-10-13 17:15:57: ACTION[Main]: Total data (comp.):               576 MB
2016-10-13 17:15:57: ACTION[Main]: Total decompression CPU time:     34.0404s
2016-10-13 17:15:57: ACTION[Main]: Decompression CPU time per block: 13.0181us
2016-10-13 17:15:57: ACTION[Main]: Decompression speed:              17342.6 KB/s
2016-10-13 17:15:57: ACTION[Main]: ############

_Edit_: Looks like we need to make a tradeoff between these two:

A little faster, smaller data sizes (brotli)
Very fast, bigger data sizes (zstd)

sfan5 on 13 Oct 2016

❤2

I did some speed tests, so far for a relatively small world (1k blocks), with different compression settings per algorithm, and created a few graphs.

In the process, I merged sfan5's and est31's trees, and I added snappy and lz4 support as well. I also had to port to a newer version of libbrotli.

My tree is at https://github.com/Rogier-5/minetest/tree/compression. It includes the scripts to re-run the tests. See README.recompress.

I don't know why, but the performance of zstd was abysmal on my system. See the graphs. Above level 7, performance keeps getting worse.

The X axis shows the compression level (or acceleration for lz4).

The series lz4-c is plotted against the upper x-axis

plot-recompress-comp_blocktime
plot-decompress-block_size
plot-decompress-comp_blocktime

Rogier-5 on 20 Oct 2016

👍2

I added tests without compression. The block 'compression' speed is slightly better than for snappy and lz4. However, the total elapsed time is worse, undoublty caused by additional disk IO time.

plot-recompress-comp_blocktime
plot-recompress-time_elapsed

Rogier-5 on 20 Oct 2016

I finished running tests on a world with 21k blocks, and the results are roughly the same. I suppose results may be different on other systems (my system is quite old, but I do have an SSD).

The best option may be to implement support for several compression algorithms, and allow the user to configure the most suitable one. From my tests (i.e. on my system), brotli (level 5) is a good choice for database size, while lz4 is good for both compression and decompression speed. Zlib (level 3, or 4) may be a suitable compromise (i.e. default) candidate...

More user-friendly configuration values may be desirable as well. E.g. compression = {zlib|zstd|brotli|(etc...)|default|speed|size} ...

Keeping the '--recompress' command-line option might be a good idea as well. From time to time, users could even optimize the database to a high compression setting, while using a faster setting while running the server. Alternatively, there could be a background process that does this, like /clearobjects.

Regarding the subject of dictionaries: different dictionaries can be implemented as different compression algorithms, or by storing an additional u8 identifying the dictionary that was used. If a new, improved dictionary is added, it is much the same as adding a new compression algorithm: older servers can't read newer databases, unless the new server was configured to use the older dictionary or compression algorithm.

Of course, the dictionary(/ies) could also be stored in a table in the database itself, allowing full forward compatibility wrt dictionary improvements.

Rogier-5 on 20 Oct 2016

More user-friendly configuration values may be desirable as well.

I don't think so. The only thing exposed to the user should be the compression factor, that's it. I don't think we should compress with multiple algorithms other than for legacy client purposes.

Keeping the '--recompress' command-line option might be a good idea as well.

Yes, --recompress would be quite useful to have on master, but it should be rewritten to be less a hack (it shouldn't load all the lua stuff for example).

Regarding dictionaries, I think it would be hard to come up with an algorithm to classify which dictionary to compress a given mapblock with, so we should rather only compress to one dictionary at a time. With "dictionary" I mean a full set of dictionaries for each type of data compression is used for, so one dictionary for metadata, one for node data, etc. Also, coming up with a dictionary needs lots of data, so its better done offline. I propose we integrate dictionary management into the --recompress command, so we make it do:

go through the map, decompress every block, and build a dictionary with those data.
store that generated dictionary in the dictionary table.
iterate the map a second time, but now actually recompress it with the dictionary we have now generated.
delete the old dictionary from the dictionary table

That algorithm can be made robust towards premature ctrl+c by storing a parity flag inside the mapblock that points to one of the up to two dictionaries used. In step 1 we could check whether there are two dictionaries in the map, and then to prevent a third dictionary from being created recompress with the later dictionary in step 3, skipping step 2.

est31 on 20 Oct 2016

I have stumbled upon a performance improvement for zstd compression. The speedup is large: 30% reduction in compression time at compression levels 1 and 2, and 50% reduction at levels 3 to 7. See the latest commit on my branch. The patch itself is a quick hack. It will need some polishing before it's fit for release.

I read in the zstd docs that the (de)compression stream object can be reused, so I tried that, using c++11 thread local storage, so I enabled c++11 as well. I haven't investigated whether it is one or the other that makes the difference...

Rogier-5 on 20 Oct 2016

🎉2

@Rogier-5 awesome! can you update the diagrams?

est31 on 20 Oct 2016

(this is one of the things that can be made optionally depend on C++11, so use the faster version if c++11 is enabled, continue to use the slow version if not)

est31 on 20 Oct 2016

@est31: I did some more investigation using perf and strace, and it turns out the malloc library is actually the culprit! It will repeatedly munmap() a free()d 5MB area of memory, only to mmap() it again on the subsequent malloc() in the zstd code. The original code even mmap()ed and munmap()ed 2 areas of 5MB and one of 2MB for every compression operation... Every time again, at least one of the 5MB areas is zeroed with memset(). How's that for a way to avoid having to turn on the heater when it's cold :-)

One would have thought malloc() was smarter than that (and apply some hysteresis). It may even be considered a bug.

Anyway, this can be fixed (workaround-style: the real problem is obviously with malloc and/or zstd) by using mallopt(). See my latest commit, and the graphs below. The total compression time reduction is now closer to 90% for levels 3 and up.

Comparison graphs for zstd (original code, initial patch, and final code):

plot-recompress-comp_blocktime

New compression graphs (including a few more zstd levels):

plot-recompress-comp_blocktime
plot-decompress-block_size
plot-decompress-comp_blocktime

It can be seen that in spite of the significant additional compression effort at higher levels, much further reduction is not achieved. The reason is most probably that even at lower levels, much of the data is already compressed from 16KB down to an amount ranging from 20 to at most a few 100 bytes. There simply isn't much more redundancy to squeeze out.

Rogier-5 on 21 Oct 2016

👍1

Why was this just closed?

Megaf on 3 May 2017

Note that you should also take a look at #3946 if this is implemented. You might be able to get better performance by compressing the whole mapblock instead of compressing several sections separately and there are a number of other improvements to be made if you're going to break the serialization format.

ShadowNinja on 4 May 2017

Zstd support (comp and decomp) included in newest linux 4.14 for Btrfs https://lkml.org/lkml/2017/6/29/816 (and for squashfs according to phoronix).

I'm suspicious of zstandard's licensing at first

License was changed 29 days ago. Zstd readme now says "Zstandard is dual-licensed under BSD and GPLv2."
Wikipedia had this:

The reference implementation is licensed under the BSD License, published at GitHub. Since version 1.0, it had additional Grant of Patent Rights.[4]
From version 1.3.1[5] this patent grant was dropped and the license was changed to BSD + GPLv2 dual license[6]
https://github.com/facebook/zstd/pull/801
https://github.com/facebook/zstd/releases/tag/v1.3.1

I think if you want to introduce new compression algo, do it now for 0.5.0. IMO I would prefer speed over disk space, speed advantage that lz4, zstd, etc give is huuuge compared to slightly bigger database size.

Fixer-007 on 18 Sep 2017

👍2

zstd will be supported for ZFS and LZ4 was choosen too

nerzhul on 18 Sep 2017

Linux Kernel people say LZO is the fastest.

Megaf on 19 Sep 2017

So, what do we select? I think it's good to start voting.

schneejs on 17 Nov 2017

On my server, the OS provides AVX512 accelerated zlib libraries, so I strongly prefer to stay on zlib.

There are also negative implications: Alternative compression algorithms are NOT backwards compatible with old releases and should be non-default to begin with. Maps will break and it may be impossible to make downloadable player maps that are widely supported by older clients.

All the talk about "oh the kernel has FOO support" is nonsense because that support IS NOT EXPORTED TO USER SPACE. It won't do minetest any good that your kernel has zstd. Minetest can't use it. You will need to rely on a completely different implementation for user space, and that will probably lack things like decent parallelism and AVX512 acceleration, and therefore lag in performance compared to ... zlib.

sofar on 18 Nov 2017

AVX512 accelerated

:thinking: :thinking: :thinking: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Alternative compression algorithms are NOT backwards compatible with old releases

Old maps are going to be readable by new Minetest versions as usual.
We can't drop zlib anyway as it's also used for other things.

it may be impossible to make downloadable player maps that are widely supported by older clients

It's certainly possible to write a tool that deserializes maps in and serializes them again in a format understood e.g. by 0.4.16, this could also be integrated into Minetest itself (similar to --migrate).

like decent parallelism and [...]

Since we're mostly compressing 12k bytes at once I doubt threading has much effect (if it is used at all).
Benchmarks comparing how well Intel's zlib does compared to brotli/zstd would be a good idea though.

sfan5 on 18 Nov 2017

And it's pure Intel, arm and AMD cpu doesn't benefit from that.No need to convert the whole map, just add format version to 1 for legacy mapblocks and write new mapblocks using zstd when written or loaded

nerzhul on 18 Nov 2017

Many Intel optimizations also benefit AMD. Check out the various performance tests that phoronix has run on AMD's threadripper platform where they compare ClearLinux against non-specifically Intel optimized distros - you can decide for yourself based on the data whether "pure intel" (a complete misnomer) is such a bad thing for the competition. Article link because you're lazy: https://www.phoronix.com/scan.php?page=article&item=tr1950x-7900x-3os&num=2

sofar on 24 Nov 2017

I stay on using zlib. It is stable and many processors has optimizations for it.

schneejs on 24 Nov 2017

👍1

I stay on using zlib. It is stable and many processors has optimizations for it.

What optimizations exactly? And zstd is faster than zlib in any case, as seen from countless benchmarks.

clavinet on 25 Mar 2019

no processor has optimization for a such algorithm, zlib and zstd has optimisations for processors.

nerzhul on 25 Mar 2019

Zstd is faster and equal in size or better than gzip, and in my work projects we are actively implementing zstd compression.

So yes, it would have my support.

sofar on 25 Mar 2019

👍2

It also has multithreading support built in which speeds up compression even more at absolutely no loss of compression ratio from my tests at least.

clavinet on 25 Mar 2019

It also has multithreading support built in which speeds up compression

Minetest should already multithread mapblocks saves, but this isn't as much of a factor of importance IMHO

even more at absolutely no loss of compression ratio from my tests at
least.

As I already stated. The biggest point is that it's faster to compress to the same size (and decompress too)

sofar on 25 Mar 2019

🚀1

Minetest: Using Zstandard for mapblock compression?

Most helpful comment

All 51 comments

Related issues