Runtime: GZipStream regression - Much worse compression for sparse data when specifying CompressionLevel.Optimal

Created on 19 Dec 2018 · 38 Comments · Source: dotnet/runtime

Expected behaviour

To produce identical compressed result for the same original byte array on .NET Framework and .NET Core.

Actual behaviour

Results of GZipStream on .NET Framework and .NET Core are different.

Repro

  1. An array of 1500 characters 'A'
  2. Compressed using GZipStream
  3. Verified that the results on .NET Framework (4.6.1) and .NET Core (2.1), which should be identical, are not.

Code available here.
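
For reference, a minimal sketch of such a repro (the actual linked code may differ in details):

```C#
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class ReproSketch
{
    static void Main()
    {
        // Step 1: an array of 1500 'A' characters.
        byte[] original = Encoding.ASCII.GetBytes(new string('A', 1500));

        // Step 2: compress using GZipStream.
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
            {
                gz.Write(original, 0, original.Length);
            }

            // Step 3: print the compressed bytes as hex to compare across runtimes.
            Console.WriteLine(BitConverter.ToString(ms.ToArray()).Replace("-", "").ToLowerInvariant());
        }
    }
}
```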

Results

Platform | Compressed result (hex)
---------|------------------------
.NET Framework 4.6.1 | 1f8b080000000000040073741c05a360148c825130dc00004d6f6cebdc050000
.NET Core 2.2.100 (Commit b9f2fa0ca8), Host 2.2.0 (Commit 1249f08fed) | 1f8b080000000000000b73741c05a321301a02a321301a028ec30c00004d6f6cebdc050000

Executed on Windows 10 Pro 64bit 1809 build 17763.134

area-System.IO.Compression bug tenet-performance


All 38 comments

Are they both valid and decompress to the correct data?
Is binary compatibility of the stream expected? It doesn't seem that doing so would be sensible overall because you'd never be able to improve compression performance if it were.

Are they both valid and decompress to the correct data?

Yes.

Is binary compatibility of the stream expected?

Yes. The sending side and receiving side could be implemented on different frameworks, running on different platforms.

The question here is why the results differ when running on the same platform. For instance, running a 1 MB byte array with all values identical through compression on .NET Framework yielded a result ~5x smaller than on .NET Core (2K vs 10K).
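
To be explicit about what "valid" means here, a round-trip check along these lines can confirm that a compressed output decompresses back to the original bytes (GZipRoundTrip is a hypothetical helper, not code from the issue):

```C#
using System.IO;
using System.IO.Compression;
using System.Linq;

static class GZipRoundTrip
{
    // Returns true if 'compressed' is valid gzip data that decompresses
    // back to exactly 'original'.
    public static bool Verify(byte[] original, byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gz = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gz.CopyTo(output);
            return output.ToArray().SequenceEqual(original);
        }
    }
}
```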

💭 Different CompressionLevel perhaps?
Could you try and see if specifying a compression level yields a consistent result? It would be kinda odd if they used different defaults, but there are both managed and ZLib implementations which may differ _somewhere_.

Thank you @BhaaLseN. CompressionLevel did help to identify what seems to be an inconsistency in the implementation.

This is what I was able to find playing with different CompressionLevel options.

Platform | Tested version | CompressionLevel.Fastest | CompressionLevel.Optimal
----------|------------------|------------------------------|-------------------------------
Windows| .NET Framework| 9218 | 2104
Windows| .NET Core | 9218 | 9180
WSL (Ubuntu)|.NET Core | 9218 | 2104

Note: to emphasize the difference, I've used 1MB (1024x1024) array filled with the same character, not the original 1500 size from the issue description.

I've also tested against Linux (Ubuntu 16.04.2 LTS) and it looks like .NET Core on Windows is the outlier.
@karelz should the issue be re-tagged as a bug instead of question?

I let area owners decide if it is a bug or question. Given that nothing seems to be wrong, I would not be surprised if we won't treat it as a bug.

Given that nothing seems to be wrong

Compression is far from optimal for .NET Core on Windows when the Optimal flag is used. That said, agreed, let's hear from the owners first 🙂 Thank you.

Given that nothing seems to be wrong, I would not be surprised if we won't treat it as a bug.

We don't expect/require that the compressed data will match across runtimes/versions, as long as it's valid. But there does appear to be a size regression here.
```C#
using System;
using System.IO;
using System.IO.Compression;

class Program
{
    static void Main()
    {
        byte[] data = new byte[1024 * 1024];
        for (int i = 0; i < data.Length; i++) data[i] = (byte)'a';

        Compress(data, CompressionLevel.NoCompression);
        Compress(data, CompressionLevel.Optimal);
        Compress(data, CompressionLevel.Fastest);
    }

    static void Compress(byte[] data, CompressionLevel level)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream gz = new GZipStream(ms, level, leaveOpen: true))
            {
                gz.Write(data, 0, data.Length);
            }
            Console.WriteLine(level + ":\t" + ms.Length);
        }
    }
}
```

results in the following for me on Windows:

```
C:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f net472
NoCompression: 1048759
Optimal: 1052
Fastest: 4609

C:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f netcoreapp3.0
NoCompression: 1048759
Optimal: 4590
Fastest: 4609
```
Note that the values for NoCompression and Fastest are the same, but the values for Optimal differ by 4x. Since Optimal is supposed to mean best compression ratio, there's obviously an issue when it's possible to be 4x smaller.

I spent some time here (I'm on Intel).
I tried to understand whether it's a regression from PR https://github.com/dotnet/corefx/pull/32732, so I reverted to before that commit (git reset --hard 32f43a75ff51ca633bb310ca3afe577c54a75d91) and tested with:

```C#
[Theory]
[InlineData(CompressionLevel.Fastest, 5000)]
[InlineData(CompressionLevel.Optimal, 2000)]
public void CompressRatio(CompressionLevel level, int lessThan)
{
    byte[] data = new byte[1024 * 1024];
    for (int i = 0; i < data.Length; i++)
        data[i] = (byte)'a';

    using (MemoryStream ms = new MemoryStream())
    {
        using (GZipStream gz = new GZipStream(ms, level, leaveOpen: true))
        {
            gz.Write(data, 0, data.Length);
        }
        Assert.True(ms.Length < lessThan, $"Compression level {level} result not less than {lessThan}, {ms.Length}");
        Console.WriteLine(ms.Length);
    }
}
```

But it also fails before that commit: `Compression level Optimal result not less than 2000, 4590`.

So I went deeper into the code and found that there is a symbol defined here https://github.com/dotnet/corefx/blob/master/src/Native/Windows/clrcompression/zlib-intel/x86.h#L25 that affects the level -> algorithm relationship: https://github.com/dotnet/corefx/blob/master/src/Native/Windows/clrcompression/zlib-intel/deflate.c#L134.
I cannot see which algorithm is used on the full .NET Framework: https://referencesource.microsoft.com/#System/parent/parent/parent/InternalApis/NDP_FX/inc/ZLibNative.cs,261

However, I found that USE_MEDIUM was added in version 1.2.11 (https://github.com/jtkukunas/zlib/commit/d6a0d5fcef51a3ecc5b1921dec6bdc88db007994#diff-6512e2889bbced8d558793c317fa04c4), but the full .NET Framework version appears to be older (1.2.3): https://referencesource.microsoft.com/#System/parent/parent/parent/InternalApis/NDP_FX/inc/ZLibNative.cs,73
I tried compiling without USE_MEDIUM, and the test passes with the deflate_slow algorithm.
Maybe to be consistent with the full .NET Framework we need to remove #define USE_MEDIUM.
I hope it helps

/cc @vkvenkat

This is one of those "Fix or Document" kind of cases. If CompressionLevel.Optimal is intended to be machine-specific, document as such.

@joshudson I'm not quite sure what you mean by "machine-specific", given that the verification tests ran on identical hardware configurations with different framework targets.

Also, I'm not sure how a badly performing Optimal case can be considered an acceptable result.

/cc @ahsonkhan, @ViktorHofer https://github.com/dotnet/corefx/issues/34157#issuecomment-449645039

Keeping bug active to track the secondary issue that @stephentoub identified with Optimal compression on the latest zlib (.NET Core version) being less-optimal than the older zlib in .NET Framework 4.x

@joshfree secondary issue? Are you referring to https://github.com/dotnet/corefx/issues/34157#issuecomment-449452688 or there's another issue raised?

@SeanFeldman Yup. Though I was specifically referring to this subsection of @stephentoub's comment:

results in the following for me on Windows:

```
C:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f net472
NoCompression: 1048759
Optimal: 1052
Fastest: 4609

C:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f netcoreapp3.0
NoCompression:  1048759
Optimal:        4590
Fastest:        4609
```

Note that the values for NoCompression and Fastest are the same, but the values for Optimal differ by 4x. Since Optimal is supposed to mean best compression ratio, there's obviously an issue when it's possible to be 4x smaller.

So I happened to look at this a little while investigating another issue and have some information to share. For input data that represents "normal" data Optimal is providing better compression than Fastest. I was testing using the XML versions of the ECMA specs (334/335) and saw a significant difference between the two settings on par with what we got in Desktop, so at least in this case there is no regression. For input data that is very regular (either 0, or a, or a repeating pattern) I'm seeing that Optimal and Fastest are both doing a similar job at compressing and Optimal is worse than desktop. We shouldn't discard this scenario since it is rather common to have sparse data and this regression impacts that as well. If I inject enough sparse data into my normal case I can reproduce similar differences with desktop as the regular case.
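
As a rough illustration of the "very regular" inputs described above, generators along these lines reproduce the sparse/repeating shapes being discussed (names are hypothetical, not from the test suite):

```C#
using System;

static class SparseTestData
{
    // All zero bytes.
    public static byte[] AllZeros(int length) => new byte[length];

    // A single byte value repeated, e.g. Repeated((byte)'a', 1024 * 1024).
    public static byte[] Repeated(byte value, int length)
    {
        var data = new byte[length];
        for (int i = 0; i < data.Length; i++) data[i] = value;
        return data;
    }

    // A short pattern repeated to fill the buffer.
    public static byte[] RepeatingPattern(byte[] pattern, int length)
    {
        var data = new byte[length];
        for (int i = 0; i < data.Length; i++) data[i] = pattern[i % pattern.Length];
        return data;
    }
}
```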

I went back to see where this regression was introduced and it appears to be between .NET Core 1.0 and 1.1. It's isolated to changes to CLRCompression.dll in that release (downgrading the 1.1 version to the 1.0 one undoes the regression). https://github.com/dotnet/corefx/commits/release/1.1.0/src/Native/Windows/clrcompression

So I would further scope this issue to be: CompressionLevel.Optimal does worse with sparse data than desktop and netcoreapp1.0.

There are only a couple of changes in the 1.0/1.1 window, so I plan to selectively revert them to see if I can identify the regression.

@bjjones @vkvenkat can one of you have a look at this? I believe the regression was caused by changes in zlib-intel. I can confirm we had good behavior in 1.0 before @bjjones changes https://github.com/dotnet/corefx/pull/9273, bad behavior after this was merged, and simply undoing those changes in the current codebase doesn't seem sufficient to restore perf.

@ericstj I did some testing (https://github.com/dotnet/corefx/issues/34157#issuecomment-449645039), and compiling without USE_MEDIUM I get the same performance on @stephentoub's test. Have you tried that?

Yes, removing USE_MEDIUM does improve the compression, but that flag existed back in 1.0 as well and we don't have this problem. https://github.com/dotnet/corefx/blob/8209c8cc3a642c24b0c6b0b26a4ade94038463fa/src/Native/Windows/clrcompression/zlib-intel/zutil.h#L142

I believe we would have been hitting that same define in 1.0. I'll go back and do some more forensics to see if the pre-processor evaluation changed.

Ok, so the 1.0 / 1.1 code differences were actually not the cause. In 1.0 we built CLRCompression out of an internal source repository which had none of the zlib-intel contributions and built from the main zlib codebase version 1.2.3, so your analysis on USE_MEDIUM appears to be correct. It looks like USE_MEDIUM has been on since the very first zlib-intel drop (which first shipped in 1.1.0).

https://github.com/dotnet/corefx/blob/8209c8cc3a642c24b0c6b0b26a4ade94038463fa/src/Native/Windows/clrcompression/zlib-intel/deflate.c#L314
https://github.com/dotnet/corefx/blob/8209c8cc3a642c24b0c6b0b26a4ade94038463fa/src/Native/Windows/clrcompression/zlib-intel/deflate.c#L135-L161

This isn't a typical regression where we can undo a minor change to restore the previous functionality. We've likely gotten used to the performance characteristics of deflate_medium since it's been on for some time. That said, it's a bit disingenuous to use something that's specifically designed to compromise compression ratio in favor of performance for a value we claim is CompressionLevel.Optimal.

@bjjones @vkvenkat @jtkukunas : is it possible to make deflate_medium do a better job with sparse data?

That said, it's a bit disingenuous to use something that's specifically designed to compromise compression ratio in favor of performance for a value we claim is CompressionLevel.Optimal

This is a good point, and it brings up another question. Why does CompressionLevel.Optimal map to zlib q6 anyway?

https://github.com/dotnet/corefx/blob/master/src/System.IO.Compression/src/System/IO/Compression/DeflateZLib/ZLibNative.cs#L41-L44

System.IO.Compression.Brotli maps CompressionLevel.Optimal to Brotli's q11 despite the fact that the perf penalty for q11 in Brotli is much worse than the penalty for q9 in zlib.

https://github.com/dotnet/corefx/blob/master/src/System.IO.Compression.Brotli/src/System/IO/Compression/enc/BrotliEncoder.cs#L28-L30
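
For illustration, the managed-to-native level translation being discussed looks roughly like this (a simplified sketch; the real constants live in the linked ZLibNative.cs and BrotliEncoder.cs):

```C#
using System;
using System.IO.Compression;

static class LevelMappingSketch
{
    // Illustrative only: the zlib quality level each managed CompressionLevel
    // is translated to, per the discussion above. System.IO.Compression.Brotli,
    // by contrast, maps Optimal to Brotli quality 11.
    public static int ToZLibLevel(CompressionLevel level)
    {
        switch (level)
        {
            case CompressionLevel.NoCompression: return 0; // store only
            case CompressionLevel.Fastest:       return 1; // zlib "best speed"
            case CompressionLevel.Optimal:       return 6; // zlib default, not 9
            default: throw new ArgumentOutOfRangeException(nameof(level));
        }
    }
}
```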

There's some history with using level 6. Desktop used level 6. Folks tried to use 9 and then reverted it due to perf impact without significant compression benefits. This was all before deflate_medium, so 6 and 9 were both using deflate_slow.
https://github.com/dotnet/corefx/pull/4589
https://github.com/dotnet/corefx/pull/5458

It might be interesting to look at performance of the Calgary Corpus data set contrasting deflate_medium to deflate_slow. I suspect that data is missing sparse data as described in this issue: it'd be good to add that to the comparison. I'll see about collecting some of these numbers.

In the meantime, I'd really like for the Intel folks to weigh in on the feasibility of a change to deflate_medium to better handle sparse data.

Ah, ok. The ptt5 sample from the Canterbury Corpus is a fairly sparse file (many runs of zeros), and it's the only one that shows a significant improvement between levels 6 and 9 in the results in https://github.com/dotnet/corefx/pull/5458

The reasoning behind using level 6 as 'optimal' in zlib makes sense generally. However, given the huge drop-off in perf at the higher Brotli quality levels, it's surprising the default setting there wasn't more balanced so the two compression libraries are more consistent in behavior/performance. I'll open a separate issue for discussion on that.

@ericstj Just discussed this issue with the ZLib-Intel folks. It looks like changes to deflate_medium are pretty invasive and we would need to test some ideas to see if we can improve on the compression ratio at a reasonable perf. We will get back on this in 2 weeks.

Also, I guess the comparison between levels 6 and 9 in #5498 was with public ZLib v1.2.8. It might be a good idea to re-assess this with the current ZLib-Intel (v1.2.11) in corefx for both small and large files - most files in Canterbury Corpus are < 1MB.

Sure, I was planning on getting some data comparing deflate_slow to deflate_medium. I can replay the 6 vs 9 comparison as well (using only deflate_slow of course, since the current 6 maps to deflate_medium).

So I'll have data for the following scenarios:

  • 6 : deflate_medium
  • 6 : deflate_slow
  • 9 : deflate_slow

Folks tried to use 9 and then reverted it due to perf impact without significant compression benefits.

It is not possible to determine in general that the compression benefits are low. Likely, the benefits were low on the corpus in question. Different users have vastly different data characteristics.

In my opinion it is very important to expose a very high compression mode. In many scenarios speed is not important. For example, any kind of data archival is often not on any latency critical path but any space savings gained are appreciated forever. Storage space is not always cheap (e.g. relational database space or simply large amounts of data stored on disk).

So if you are compressing 100MB of data per day in a background job for long-term archival then compression speed is completely unimportant.

@gspp that is a great point. Right now I want this issue to be constrained to evaluating our choice for "optimal". Let optimal mean an implementation-specific tradeoff of size vs. resource usage/performance. Let's open a new issue to propose adding CompressionLevel.Highest, which would map to an algorithm's best settings for size, forgoing resource usage and performance.

I measured all the files from the Canterbury corpus, the Calgary corpus, and Google's Brotli test data, and I created a sparse bitmap as a "real world example" to illustrate the specific issue we've been discussing (smile24.bmp).

  • m6 is the deflate_medium strategy at level 6 (the current CompressionLevel.Optimal)
  • s6 is the deflate_slow strategy at level 6 (the old CompressionLevel.Optimal, prior to zlib-intel)
  • s9 is the deflate_slow strategy at level 9
  • µs is the mean duration of the compression operation for the workload, in microseconds
  • % is the percent savings (difference from the original file size); larger is better, 100% is ideal/impossible
  • B is the size in bytes of the compressed data

| File | Bytes | m6-µs | s6-µs | s9-µs | m6-% | s6-% | s9-% | m6-B | s6-B | s9-B |
| ---- | ----- | ----- | ----- | ----- | ---- | ---- | ---- | ---- | ---- | ---- |
| smile24.bmp | 6375546 | 15,769.32 | 38,912.22 | 76,535.74 | 99.35 | 99.67 | 99.74 | 41574 | 20751 | 16682 |
| kennedy.xls | 1029744 | 20,154.30 | 26,736.70 | 594,471.06 | 78.63 | 79.71 | 79.90 | 220018 | 208923 | 207023 |
| book1 | 768771 | 43,679.78 | 64,006.38 | 88,303.90 | 58.47 | 59.22 | 59.35 | 319265 | 313539 | 312480 |
| book2 | 610856 | 28,907.70 | 39,192.91 | 49,532.37 | 65.78 | 66.17 | 66.26 | 209048 | 206636 | 206129 |
| pic | 513216 | 8,662.30 | 10,017.80 | 74,759.04 | 89.27 | 89.14 | 89.84 | 55056 | 55732 | 52134 |
| ptt5 | 513216 | 8,612.64 | 9,939.09 | 74,834.41 | 89.27 | 89.14 | 89.84 | 55056 | 55732 | 52134 |
| plrabn12.txt | 471162 | 29,484.27 | 42,517.96 | 54,463.16 | 58.19 | 58.90 | 59.00 | 196972 | 193641 | 193156 |
| lcet10.txt | 419235 | 22,042.88 | 29,370.63 | 39,428.60 | 65.50 | 65.87 | 65.99 | 144621 | 143087 | 142598 |
| news | 377109 | 15,486.09 | 19,603.39 | 22,389.77 | 61.13 | 61.61 | 61.69 | 146565 | 144780 | 144474 |
| mapsdatazrh | 285886 | 11,401.40 | 10,924.97 | 12,454.52 | 36.77 | 36.95 | 37.05 | 180759 | 180256 | 179969 |
| zeros | 262144 | 226.29 | 1,452.25 | 1,459.86 | 99.56 | 99.90 | 99.90 | 1157 | 271 | 271 |
| obj2 | 246814 | 9,268.89 | 11,716.66 | 21,546.60 | 67.00 | 67.02 | 67.18 | 81443 | 81411 | 81009 |
| quickfox_repeated | 176128 | 160.80 | 967.51 | 970.74 | 99.46 | 99.68 | 99.68 | 950 | 572 | 572 |
| alice29.txt | 148481 | 7,870.77 | 10,829.34 | 15,631.86 | 63.47 | 63.90 | 64.03 | 54239 | 53598 | 53402 |
| compressed_repeated | 144224 | 4,271.53 | 3,281.49 | 3,349.60 | 30.21 | 30.26 | 30.26 | 100660 | 100589 | 100589 |
| asyoulik.txt | 125179 | 6,897.01 | 9,898.32 | 11,884.34 | 60.32 | 60.95 | 61.04 | 49669 | 48880 | 48772 |
| TestDocument.pdf | 121993 | 5,650.56 | 4,320.09 | 4,315.58 | 5.51 | 5.54 | 5.54 | 115275 | 115239 | 115231 |
| bib | 111261 | 4,323.40 | 5,796.17 | 7,576.86 | 68.01 | 68.35 | 68.48 | 35590 | 35212 | 35065 |
| geo | 102400 | 5,937.75 | 7,526.69 | 7,786.26 | 33.13 | 33.24 | 33.25 | 68472 | 68363 | 68355 |
| trans | 93695 | 2,366.10 | 2,877.56 | 4,682.35 | 79.47 | 79.68 | 79.82 | 19240 | 19043 | 18910 |
| paper2 | 82199 | 4,028.95 | 5,534.28 | 6,934.14 | 63.34 | 63.81 | 63.92 | 30137 | 29747 | 29659 |
| progl | 71646 | 2,440.63 | 3,007.46 | 6,113.43 | 77.17 | 77.33 | 77.47 | 16358 | 16241 | 16140 |
| backward65536 | 65792 | 66.42 | 377.62 | 392.78 | 99.53 | 99.87 | 99.87 | 307 | 85 | 85 |
| paper1 | 53161 | 2,144.42 | 2,758.95 | 3,194.55 | 64.80 | 65.10 | 65.17 | 18711 | 18553 | 18518 |
| compressed_file | 50096 | 2,098.84 | 1,468.60 | 1,482.02 | -0.04 | -0.04 | -0.04 | 50116 | 50116 | 50116 |
| progp | 49379 | 1,508.51 | 1,746.35 | 5,125.77 | 77.20 | 77.28 | 77.40 | 11260 | 11221 | 11162 |
| paper3 | 46526 | 2,086.46 | 2,781.58 | 3,202.37 | 60.67 | 61.16 | 61.21 | 18300 | 18071 | 18049 |
| TestDocument.doc | 45568 | 563.22 | 766.41 | 4,540.80 | 85.09 | 85.25 | 85.35 | 6793 | 6723 | 6676 |
| progc | 39611 | 1,447.30 | 1,825.53 | 2,300.77 | 65.93 | 66.33 | 66.36 | 13494 | 13337 | 13324 |
| sum | 38240 | 1,280.88 | 1,636.87 | 9,633.46 | 66.10 | 66.13 | 66.44 | 12963 | 12950 | 12832 |
| paper6 | 38105 | 1,432.65 | 1,813.22 | 2,089.92 | 64.73 | 65.12 | 65.16 | 13438 | 13292 | 13274 |
| cp.html | 24603 | 721.12 | 809.83 | 913.71 | 67.39 | 67.67 | 67.75 | 8023 | 7955 | 7934 |
| TestDocument.txt | 21686 | 63.70 | 153.64 | 155.41 | 96.09 | 96.38 | 96.38 | 847 | 785 | 785 |
| obj1 | 21504 | 709.98 | 735.47 | 1,253.14 | 51.88 | 52.05 | 52.05 | 10347 | 10312 | 10311 |
| TestDocument.docx | 17705 | 539.33 | 400.62 | 651.45 | 29.83 | 29.90 | 29.94 | 12423 | 12412 | 12405 |
| paper4 | 13286 | 448.23 | 549.64 | 544.34 | 57.85 | 58.52 | 58.54 | 5600 | 5511 | 5509 |
| paper5 | 11954 | 407.82 | 453.77 | 462.31 | 57.92 | 58.42 | 58.42 | 5030 | 4970 | 4970 |
| fields.c | 11150 | 326.70 | 366.44 | 425.02 | 71.95 | 72.05 | 72.12 | 3128 | 3116 | 3109 |
| random_org_10k.bin | 10000 | 339.07 | 201.28 | 205.23 | -0.05 | -0.05 | -0.05 | 10005 | 10005 | 10005 |
| xargs.1 | 4227 | 139.63 | 146.77 | 146.63 | 58.81 | 59.07 | 59.07 | 1741 | 1730 | 1730 |
| grammar.lsp | 3721 | 107.49 | 117.51 | 121.54 | 67.02 | 67.32 | 67.32 | 1227 | 1216 | 1216 |
| monkey | 843 | 34.82 | 30.92 | 31.12 | 54.57 | 54.69 | 54.69 | 383 | 382 | 382 |
| ukkonooa | 119 | 14.02 | 15.33 | 16.12 | 42.02 | 43.70 | 43.70 | 69 | 67 | 67 |
| 64x | 64 | 12.51 | 14.09 | 12.89 | 90.62 | 90.62 | 90.62 | 6 | 6 | 6 |
| quickfox | 43 | 15.84 | 15.43 | 15.70 | -2.33 | -2.33 | -2.33 | 44 | 44 | 44 |
| 10x10y | 20 | 13.01 | 12.56 | 13.02 | 60.00 | 60.00 | 60.00 | 8 | 8 | 8 |
| xyzzy | 5 | 13.60 | 12.70 | 12.63 | -40.00 | -40.00 | -40.00 | 7 | 7 | 7 |
| x | 1 | 12.71 | 12.27 | 12.79 | -200.00 | -200.00 | -200.00 | 3 | 3 | 3 |
| empty | 0 | 12.21 | 12.07 | 12.31 | -∞ | -∞ | -∞ | 2 | 2 | 2 |

@ericstj Thanks for sharing this data. Could you please add another column for original size and sort the table by size?

Updated the data in place with sorting requested.

The differences in compression ratio are extremely small. It is a bit suspicious. Maybe it is worth trying lower levels just to make sure there is the potential for size differences at all. Maybe the configuration is somehow broken.

@gspp you can still see the sort of differences this issue was pointing out. For example smile24.bmp (my hand-crafted example), zeros (another sparse case), quickfox_repeated, TestDocument.txt, etc. Note that the difference in compressed size is significant for these (sometimes over 2x); it's just that when measured as a % of the original size it is rather small.

Here's another table that demonstrates all the size comparisons for this data (including "Fastest" as well as desktop). https://gist.github.com/ericstj/9b33d22bb19ae6aa463a3788c8c97b76

Here's the small change I made to help collect the measurements: https://github.com/ericstj/corefx/commit/c2f9237f0a477dc1147a76d08b6291a53a0e40bd
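
The actual harness is in the linked commit; as a standalone sketch, the three quantities in the tables (µs, %, B) can be gathered per file and level roughly like this (DeflateStream and the iteration count are assumptions here; selecting deflate_medium vs deflate_slow happens in the native zlib build, which this managed sketch cannot control):

```C#
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

static class CompressionMeasurement
{
    // Compresses a file repeatedly at the given level and reports the mean
    // duration in microseconds, the percent savings, and the compressed size.
    public static void Measure(string path, CompressionLevel level, int iterations = 10)
    {
        byte[] data = File.ReadAllBytes(path);
        long compressedLength = 0;
        var sw = Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
        {
            using (var ms = new MemoryStream())
            {
                // DeflateStream assumed; the linked commit shows the actual stream type used.
                using (var ds = new DeflateStream(ms, level, leaveOpen: true))
                {
                    ds.Write(data, 0, data.Length);
                }
                compressedLength = ms.Length;
            }
        }

        sw.Stop();
        double meanMicroseconds = sw.Elapsed.TotalMilliseconds * 1000.0 / iterations;
        double savingsPercent = data.Length == 0
            ? double.NegativeInfinity
            : 100.0 * (data.Length - compressedLength) / data.Length;
        Console.WriteLine($"{Path.GetFileName(path)}\t{level}\t{meanMicroseconds:F2} us\t{savingsPercent:F2}%\t{compressedLength} B");
    }
}
```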

I see @ericstj. I had not noticed that. The benchmark data set looks quite diverse to me as well. The results seem meaningful.

Could you try the branch sparsedatafix at jtkukunas/zlib and verify whether it resolves this issue? Thanks. https://github.com/jtkukunas/zlib/tree/sparsedatafix

Thanks @jtkukunas. I've ported your fix over to corefx and am rerunning the measurements (same machine same build). I'll post updated results later today.

Here's updated timing data for the new run. I also updated the size data in place here: https://gist.github.com/ericstj/9b33d22bb19ae6aa463a3788c8c97b76

The fix appears to give size improvements in most cases, no size regressions, and the perf seems comparable to what we had before. Seems like a good fix to take.

It still doesn't give parity with deflate_slow for some of the "real world" sparse/repeat cases.

| File | Bytes | m6-µs | s6-µs | s9-µs | m6-% | s6-% | s9-% | m6-B | s6-B | s9-B |
| ---- | ----- | ----- | ----- | ----- | ---- | ---- | ---- | ---- | ---- | ---- |
| smile24.bmp | 6375546 | 13,114.19 | 36,305.76 | 73,748.74 | 99.47 | 99.67 | 99.74 | 33748 | 20751 | 16682 |
| kennedy.xls | 1029744 | 18,451.64 | 24,823.10 | 558,952.47 | 78.63 | 79.71 | 79.90 | 220018 | 208923 | 207023 |
| book1 | 768771 | 42,354.79 | 61,795.43 | 81,634.92 | 58.47 | 59.22 | 59.35 | 319265 | 313539 | 312480 |
| book2 | 610856 | 28,037.45 | 36,451.96 | 47,469.13 | 65.78 | 66.17 | 66.26 | 209048 | 206636 | 206129 |
| pic | 513216 | 8,264.67 | 9,764.58 | 73,320.32 | 89.36 | 89.14 | 89.84 | 54615 | 55732 | 52134 |
| ptt5 | 513216 | 8,085.37 | 9,717.51 | 71,776.03 | 89.36 | 89.14 | 89.84 | 54615 | 55732 | 52134 |
| plrabn12.txt | 471162 | 27,217.41 | 41,206.83 | 52,387.87 | 58.19 | 58.90 | 59.00 | 196972 | 193641 | 193156 |
| lcet10.txt | 419235 | 20,405.16 | 27,493.17 | 37,617.29 | 65.50 | 65.87 | 65.99 | 144621 | 143087 | 142598 |
| news | 377109 | 14,430.22 | 19,299.06 | 21,726.47 | 61.13 | 61.61 | 61.69 | 146565 | 144780 | 144474 |
| mapsdatazrh | 285886 | 11,321.81 | 10,800.47 | 11,897.18 | 36.77 | 36.95 | 37.05 | 180759 | 180256 | 179969 |
| zeros | 262144 | 207.09 | 1,379.81 | 1,447.71 | 99.90 | 99.90 | 99.90 | 271 | 271 | 271 |
| obj2 | 246814 | 9,035.53 | 11,381.85 | 21,474.28 | 67.00 | 67.02 | 67.18 | 81443 | 81411 | 81009 |
| quickfox_repeated | 176128 | 156.83 | 925.93 | 958.22 | 99.46 | 99.68 | 99.68 | 946 | 572 | 572 |
| alice29.txt | 148481 | 7,517.47 | 10,435.39 | 15,115.83 | 63.47 | 63.90 | 64.03 | 54239 | 53598 | 53402 |
| compressed_repeated | 144224 | 4,225.56 | 3,172.82 | 3,254.75 | 30.21 | 30.26 | 30.26 | 100660 | 100589 | 100589 |
| asyoulik.txt | 125179 | 6,410.47 | 9,304.98 | 11,417.60 | 60.32 | 60.95 | 61.04 | 49669 | 48880 | 48772 |
| TestDocument.pdf | 121993 | 5,489.20 | 3,984.92 | 4,177.42 | 5.51 | 5.54 | 5.54 | 115275 | 115239 | 115231 |
| bib | 111261 | 4,210.31 | 5,536.25 | 7,320.29 | 68.01 | 68.35 | 68.48 | 35590 | 35212 | 35065 |
| geo | 102400 | 5,584.86 | 7,090.60 | 7,224.05 | 33.13 | 33.24 | 33.25 | 68472 | 68363 | 68355 |
| trans | 93695 | 2,243.56 | 2,733.24 | 4,480.96 | 79.47 | 79.68 | 79.82 | 19239 | 19043 | 18910 |
| paper2 | 82199 | 3,812.94 | 5,257.98 | 7,055.03 | 63.34 | 63.81 | 63.92 | 30137 | 29747 | 29659 |
| progl | 71646 | 2,303.06 | 2,891.23 | 6,211.89 | 77.17 | 77.33 | 77.47 | 16358 | 16241 | 16140 |
| backward65536 | 65792 | 62.34 | 359.85 | 383.22 | 99.87 | 99.87 | 99.87 | 85 | 85 | 85 |
| paper1 | 53161 | 2,023.28 | 2,730.17 | 3,203.59 | 64.80 | 65.10 | 65.17 | 18711 | 18553 | 18518 |
| compressed_file | 50096 | 2,096.48 | 1,397.13 | 1,430.56 | -0.04 | -0.04 | -0.04 | 50116 | 50116 | 50116 |
| progp | 49379 | 1,390.90 | 1,686.18 | 4,979.20 | 77.20 | 77.28 | 77.40 | 11260 | 11221 | 11162 |
| paper3 | 46526 | 2,005.80 | 2,629.90 | 3,183.97 | 60.67 | 61.16 | 61.21 | 18300 | 18071 | 18049 |
| TestDocument.doc | 45568 | 539.29 | 758.73 | 4,310.46 | 85.16 | 85.25 | 85.35 | 6762 | 6723 | 6676 |
| progc | 39611 | 1,384.91 | 1,726.21 | 2,311.14 | 65.93 | 66.33 | 66.36 | 13494 | 13337 | 13324 |
| sum | 38240 | 1,213.22 | 1,576.25 | 9,378.29 | 66.10 | 66.13 | 66.44 | 12962 | 12950 | 12832 |
| paper6 | 38105 | 1,362.99 | 1,732.85 | 2,047.83 | 64.73 | 65.12 | 65.16 | 13438 | 13292 | 13274 |
| cp.html | 24603 | 700.10 | 778.85 | 887.75 | 67.39 | 67.67 | 67.75 | 8023 | 7955 | 7934 |
| TestDocument.txt | 21686 | 62.14 | 147.33 | 149.39 | 96.09 | 96.38 | 96.38 | 847 | 785 | 785 |
| obj1 | 21504 | 681.50 | 704.85 | 1,259.24 | 51.90 | 52.05 | 52.05 | 10343 | 10312 | 10311 |
| TestDocument.docx | 17705 | 529.96 | 385.78 | 623.58 | 29.88 | 29.90 | 29.94 | 12415 | 12412 | 12405 |
| paper4 | 13286 | 441.97 | 532.22 | 545.02 | 57.85 | 58.52 | 58.54 | 5600 | 5511 | 5509 |
| paper5 | 11954 | 390.74 | 445.14 | 462.56 | 57.92 | 58.42 | 58.42 | 5030 | 4970 | 4970 |
| fields.c | 11150 | 315.79 | 346.24 | 415.34 | 71.95 | 72.05 | 72.12 | 3128 | 3116 | 3109 |
| random_org_10k.bin | 10000 | 325.19 | 193.79 | 195.84 | -0.05 | -0.05 | -0.05 | 10005 | 10005 | 10005 |
| xargs.1 | 4227 | 133.94 | 135.07 | 138.76 | 58.81 | 59.07 | 59.07 | 1741 | 1730 | 1730 |
| grammar.lsp | 3721 | 105.50 | 110.61 | 117.97 | 67.02 | 67.32 | 67.32 | 1227 | 1216 | 1216 |
| monkey | 843 | 32.25 | 30.54 | 29.62 | 54.57 | 54.69 | 54.69 | 383 | 382 | 382 |
| ukkonooa | 119 | 15.92 | 16.44 | 15.43 | 42.02 | 43.70 | 43.70 | 69 | 67 | 67 |
| 64x | 64 | 11.97 | 12.48 | 12.07 | 90.62 | 90.62 | 90.62 | 6 | 6 | 6 |
| quickfox | 43 | 14.43 | 14.24 | 14.17 | -2.33 | -2.33 | -2.33 | 44 | 44 | 44 |
| 10x10y | 20 | 11.95 | 12.26 | 11.94 | 60.00 | 60.00 | 60.00 | 8 | 8 | 8 |
| xyzzy | 5 | 12.08 | 11.46 | 12.00 | -40.00 | -40.00 | -40.00 | 7 | 7 | 7 |
| x | 1 | 11.78 | 12.08 | 11.85 | -200.00 | -200.00 | -200.00 | 3 | 3 | 3 |
| empty | 0 | 11.14 | 11.44 | 11.03 | -∞ | -∞ | -∞ | 2 | 2 | 2 |

Thanks @ericstj for compiling this data.

Out of the 49 test cases, it appears that 44 are what I would consider to be on-par in terms of compression ratio, ranging from slightly better to 1-2% larger. Of the remaining 5 cases, I see ukkonooa (3% larger), kennedy.xls (5% larger), TestDocument.txt (8% larger), smile24.bmp (63% larger), and quickfox_repeated (65% larger). I'm familiar with kennedy.xls, as it's part of the standard Canterbury corpus; however, I'm not familiar with the others. Are they available somewhere?

There are always compromises in these strategies with regards to what data patterns compress well, as well as performance. The above fixes are neutral with regards to performance and compression ratio, i.e., nothing regresses; however, I'm not sure how much more I'll be able to squeeze without making a tradeoff elsewhere.

In the meantime, I'm going to publish a new zlib release that includes the above fixes, so you can take advantage of these compression ratio improvements immediately.

Most of these data files are here: https://github.com/dotnet/corefx-testdata/tree/master/System.IO.Compression.TestData/UncompressedTestFiles

Smile24.bmp was a file I created to demonstrate this issue. smile24.zip

I was also able to test your change on a partner-provided data file and found that it addressed the concern that originally drew this issue to my attention (a large database that was compressing to 10 MB on desktop regressed to 40 MB on Core; it's now back to 10 MB with your fix).

I appreciate you having a look.

Okay. Thanks. Will take a look at those files.

For reference, the new zlib release that includes these fixes is v1.2.11.1_jtkv6.3 (https://github.com/jtkukunas/zlib/releases/tag/v1.2.11.1_jtkv6.3)
