Mostly for educational purposes, I wanted to check how zstd compresses very simple files and what the maximum achievable compression is (e.g. with deflate/gzip it's 1:1032, because the maximum match length is 258 bytes, which can in the best case be represented by 2 bits => 258*8 bits / 2 bits = 1032).
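As a quick sanity check of that deflate figure, here is a small sketch using Python's zlib (a raw zlib stream, so the framing overhead differs slightly from gzip's):

import zlib

data = b"\x00" * 100_000_000         # 100 MB of zeros
compressed = zlib.compress(data, 9)  # deflate at maximum level
print(len(data) / len(compressed))   # roughly 1030, close to the theoretical 1:1032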
The simplest case is just a big empty file.
Trying with zstd 1.3.6:
dd status=none if=/dev/zero bs=1000 count=100000|zstd -o 100Mof0.zst
/*stdin*\ : 0.01% (100000000 => 9144 bytes, 100Mof0.zst)
You can check it with other file sizes and you will arrive at something ~ 1:11000 for the default compression level. It gets slightly better with higher levels:
dd status=none if=/dev/zero bs=1000 count=100000|zstd -o 100Mof0.13.zst -13
/*stdin*\ : 0.01% (100000000 => 8405 bytes, 100Mof0.13.zst)
i.e. a compression factor of ~ 1:11900
The result looks like this:
00000000 28 b5 2f fd 04 60 54 00 00 10 00 00 01 00 fb ff |(./..`T.........|
00000010 39 c0 02 44 00 00 00 01 00 fd ff e4 0e 08 44 00 |9..D..........D.|
00000020 00 00 01 00 fd ff e4 0e 08 44 00 00 00 01 00 fd |.........D......|
00000030 ff e4 0e 08 44 00 00 00 01 00 fd ff e4 0e 08 44 |....D..........D|
00000040 00 00 00 01 00 fd ff e4 0e 08 44 00 00 00 01 00 |..........D.....|
00000050 fd ff e4 0e 08 44 00 00 00 01 00 fd ff e4 0e 08 |.....D..........|
00000060 44 00 00 00 01 00 fd ff e4 0e 08 44 00 00 00 01 |D..........D....|
00000070 00 fd ff e4 0e 08 44 00 00 00 01 00 fd ff e4 0e |......D.........|
[snip]
Block 1:
54 00 00 10 00 00 01 00 fb ff 39 c0 02
Blocks 2 to n:
44 00 00 00 01 00 fd ff e4 0e 08
So, the first block contains the only literal, '\0'. Afterwards, it gets referenced and copied. However, there seem to be a lot of identical short blocks after the first one. If you check what those blocks represent, each one stands for 131072 = 2^17 bytes of uncompressed data. That means the maximum achieved compression is 2^17 bytes / 11 bytes ~ 1:11916. Unlike with gzip, however, this seems to be a limitation not of the format but of the compressor. (At least it should not be this particular limit: zstd's maximum match length per sequence is 131075, which (I guess) could be represented by less than one bit with custom FSE tables.)
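For reference, the 3-byte block header at the start of each repeated 11-byte unit can be decoded by hand according to the frame format spec; a minimal sketch in Python:

# Layout of the 3-byte zstd block header (little-endian):
# bit 0 = Last_Block, bits 1-2 = Block_Type (0=Raw, 1=RLE, 2=Compressed),
# bits 3-23 = Block_Size (the compressed size for Compressed blocks).
header = bytes.fromhex("440000")           # start of the repeated unit above
value = int.from_bytes(header, "little")
last_block = value & 1                     # 0: not the last block
block_type = (value >> 1) & 3              # 2: Compressed block
block_size = value >> 3                    # 8: compressed payload bytes
print(last_block, block_type, block_size)
# => 3-byte header + 8-byte payload = the 11 bytes that regenerate 2^17 zeros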
I tried to find the cause of zstd creating so many small blocks; it seems there's a hard-coded block size limit of 2^17 uncompressed bytes, defined in the sources as #define ZSTD_BLOCKSIZELOG_MAX 17.
This means the maximum compression is currently limited by the amount of block header overhead.
I guess it depends on the use case whether this is significant in real usage but I could imagine that real files with high redundancy might also be affected to some degree (e.g. disk images that contain large sparse empty sections).
Is there a good reason for the block size limitation? Why does the compressor enforce a maximum number of input bytes to take into account for a block in the first place?
Btw. in zlib, the logic for when to switch to a new block seems to be quite simple: while compressing data, literals and references are collected into a fixed-size buffer (65k entries, IIRC); once that is full, the block is flushed.
Btw. I think long runs of the same single byte could also be represented efficiently using zstd's RLE literal encoding (why isn't it used?). However, the same issue still applies to long runs of, e.g., a repeating two-byte pattern.
Hi @jrudolph
These are excellent questions.
The maximum block size is indeed a hard limit of 128 KB, defined in the format specification. This limit can't be changed without breaking compatibility.
Note that a block can be smaller, it just cannot be larger.
There are several motivations behind it.
The main one is tied to the maximum size of the literals buffer.
Setting a maximum block size also sets a maximum amount of literals to decode. This in turn provides guarantees to the decoder, which are important for proper sizing of its resources.
Retrospectively, such guarantees could be provided differently. But it's just too late to change such a decision. Yes, for huge swathes of zeros, it puts an upper limit on the achievable compression ratio. In practice though, it's unlikely to matter: even sparse data contains _some_ data, and does not suffer that much.
The block size also nicely cuts input into pieces of approximately the right size for entropy statistics adaptation (on usual sources), so it serves several roles.
You are correct that the current encoder does not make use of RLE blocks. It's an issue of detecting this use case; we would need to add some code specifically for this purpose. It will make more sense to create such code once we have a smarter block splitter, one that cuts input at nicer positions in the source instead of just blindly filling 128 KB.
If RLE blocks were supported and generated, the upper limit of the compression ratio would be 4 bytes per 128 KB block, hence close to a ~1:30K compression factor.
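For illustration, such a 4-byte RLE block for 128 KB of zeros could be assembled like this (a sketch based on the block header layout in the format spec, not the library's actual code path):

# An RLE block is a 3-byte header (Block_Type = 1, Block_Size = regenerated
# length) followed by the single byte to repeat.
regenerated = 1 << 17                            # 131072 bytes of output
header_value = (regenerated << 3) | (1 << 1)     # Last_Block = 0, Block_Type = RLE
block = header_value.to_bytes(3, "little") + b"\x00"
print(block.hex(), len(block))                   # 02001000, 4 bytes per 128 KB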
Thanks a lot, @Cyan4973 for the explanation.
I guess one should always remember that the compression game is one of efficiency, and a big part of what makes a codec/implementation efficient is that it chooses the right trade-offs to be efficient in general. If you look at it from that perspective, the whole blocking architecture is a latency-throughput trade-off that was optimized for throughput by using blocks as (mini) batches for processing. With its separate literal and match sections (and the requirement to read bits backwards), zstd went even a step further (for the benefit of even higher throughput). I guess it's fine that it introduces some limits to achieve that goal.
Thanks for this thread. We have some very sparse edge cases that ran into this issue.
I'm embarrassed to report that compressing it and then compressing it again addresses the problem:
$ head -c 10000000 /dev/zero |zstd -c |wc -c
934
$ head -c 10000000 /dev/zero |zstd -c |zstd -c |wc -c
78
gzip is similar -- its maximum block or window size likewise yields repeating byte sequences in the compressed output.
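The same two-pass effect can be reproduced from Python with the third-party zstandard bindings (a sketch; exact sizes depend on the library version and compression level):

import zstandard

data = b"\x00" * 10_000_000
cctx = zstandard.ZstdCompressor()
once = cctx.compress(data)    # first pass: ~1 KB of mostly repeated block headers
twice = cctx.compress(once)   # second pass compresses that repetition away
print(len(once), len(twice))  # on the order of ~900 and ~100 bytes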