When compressing strings of a single character (i.e., streams of zeroes), strategies 6-9 (btlazy2, btopt, btultra, and btultra2) show dramatic slowdowns. In particular, btlazy2 is about 6.3x slower than lazy2, btopt and btultra are about 3.7x slower than btlazy2 (23.3x lazy2), and btultra2 is about 3.9x slower than btultra (92.5x lazy2). Combined with the dictionary size increase, this makes zstd -19 about 400x slower than zstd -9 on this kind of data.
Obviously I would expect the higher-level strategies to take more time, but the increase is out of proportion to what I see on less repetitive data (more like 10x slower for -19 vs. -9).
My test:
> for n in $(seq 9); do echo strategy=$n; dd status=none if=/dev/zero bs=1024k count=1000 | command time -f %U zstd --zstd=strategy=$n >/dev/null; done
strategy=1
0.34
strategy=2
0.36
strategy=3
0.41
strategy=4
0.93
strategy=5
1.33
strategy=6
8.41
strategy=7
31.03
strategy=8
31.13
strategy=9
123.05
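The multipliers quoted above follow directly from the user times in this run; as a quick sanity check (strategy numbers and times copied from the output, comment values are my own arithmetic):

```python
# User times (seconds) from the test run above, keyed by strategy number.
# Strategy names per zstd: 5=lazy2, 6=btlazy2, 7=btopt, 8=btultra, 9=btultra2.
times = {5: 1.33, 6: 8.41, 7: 31.03, 8: 31.13, 9: 123.05}

print(round(times[6] / times[5], 1))   # btlazy2  vs lazy2:   6.3
print(round(times[7] / times[6], 1))   # btopt    vs btlazy2: 3.7
print(round(times[7] / times[5], 1))   # btopt    vs lazy2:   23.3
print(round(times[9] / times[8], 2))   # btultra2 vs btultra: 3.95 ("about 3.9x")
print(round(times[9] / times[5], 1))   # btultra2 vs lazy2:   92.5
```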
The context: I was compressing a raw disk image (produced by ntfs_clone) with zstd, and decided to try increasing the compression with zstd -19. The image produced by ntfs_clone contains zero blocks wherever the space is unallocated. Two days later it finished compressing. This seemed somewhat ridiculous, so I dug in to see at what point it became so slow. At some point I wondered if zstd was hung, given that it didn't seem to be making progress.
What version of zstd are you using?
Edit: I can reproduce with zstd-1.4.0.
Aha: this is a problem with multithreading interactions. When I add --single-thread the problem goes away. I was a bit confused because I know we handle repetitive data specially, but we must be missing that optimization somehow when multithreading. I'll look into it.
for n in $(seq 9); do echo strategy=$n; dd status=none if=/dev/zero bs=1024k count=1000 | command time -f %U zstd --single-thread --zstd=strategy=$n,tlen=128 >/dev/null; done
strategy=1
0.27
strategy=2
0.27
strategy=3
0.30
strategy=4
0.54
strategy=5
0.63
strategy=6
0.61
strategy=7
0.36
strategy=8
0.40
strategy=9
0.43
Multithreaded compression works by breaking the source into segments, compressing each segment individually, then stitching the results together. To improve compression, we use the end of the previous segment as a dictionary, so each segment doesn't have to warm up its tables from scratch.
It is the process of loading this dictionary that is costing us CPU: the repetitive-data optimizations only apply while parsing, not while filling the match tables from the dictionary. This is very fixable.
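The segment-plus-dictionary scheme described above isn't specific to zstd; a minimal sketch of the same idea using zlib's `zdict` parameter from the Python standard library (this is an analogy, not zstd's actual implementation):

```python
import zlib

# Split the input into two segments, as a multithreaded compressor would.
data = b"example payload " * 4096
mid = len(data) // 2
seg1, seg2 = data[:mid], data[mid:]

# Segment 1 is compressed normally.
c1 = zlib.compressobj()
out1 = c1.compress(seg1) + c1.flush()

# Segment 2 is compressed with segment 1 as its dictionary, so matches can
# reach back into data the decoder has already seen. Loading this dictionary
# (filling the match tables) is the per-segment startup cost discussed above.
c2 = zlib.compressobj(zdict=seg1)
out2 = c2.compress(seg2) + c2.flush()

# Decompression mirrors the setup: the second decompressor is seeded with
# the same dictionary before it sees segment 2's stream.
d1 = zlib.decompressobj()
d2 = zlib.decompressobj(zdict=seg1)
assert d1.decompress(out1) + d2.decompress(out2) == data
```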
As a workaround you can compress with --single-thread.
Fixed by #1635.
Wow, what a difference! That fixes it, thanks.
An update on the performance graphs at high compression levels would be nice. We are stuck at version 0.6.0 in https://github.com/facebook/zstd/blob/dev/doc/images/DCspeed5.png
Yep, we could update that.
Note that the changes since v0.6 are not highly visible in this scenario: the maximum compression ratio is still "about the same". The intermediate points are slightly different, generally compressing a bit more, but nothing earth-shattering. We made bigger progress on _small_ data, but that's a scenario the linked graph does not show.