Runtime: System.IO.Compression to support zlib thin wrapper over DEFLATE?

Created on 7 Apr 2016  路  23Comments  路  Source: dotnet/runtime

Currently we have access to this: https://msdn.microsoft.com/en-us/library/system.io.compression.deflatestream(v=vs.110).aspx

The _DeflateStream_ type can be used to compress/decompress DEFLATE data.

Yet in the wild world people normally wrap that with zlib (i.e. RFC1950), e.g. in the Git pack file, there are multiple zlib streams embedded within.

Given underneath we are calling zlib Native, any chance we could provide types that can handle zlib streams? As in:

https://github.com/icsharpcode/SharpZipLib/blob/master/src/Zip/Compression/Inflater.cs
https://docs.python.org/2/library/zlib.html (zlib.compress(string[, level]))

api-suggestion area-System.IO.Compression

Most helpful comment

This is an advanced scenario (only 3 people wanting it so far).

3 people plus all the people that would benefit from the change. I'm very disappointed at the short-sightedness here.

See dotnet/corefx#40170 for further reasons why this is important.

All 23 comments

I'm not sure I follow; could you give an example of what you'd like to be able to do? DeflateStream is already a managed wrapper around the zlib API that provides functionality in the .NET stream style.

echo -n "example text to compress and decompress" | python2 -c 'import sys,zlib; sys.stdout.write(zlib.compress(sys.stdin.read()))' | hd

We get this:

00000000  78 9c 4b ad 48 cc 2d c8  49 55 28 49 ad 28 51 28  |x.K.H.-.IU(I.(Q(|
00000010  c9 57 48 ce cf 2d 28 4a  2d 2e 56 48 cc 4b 51 48  |.WH..-(J-.VH.KQH|
00000020  49 85 71 01 2a 93 0f 09                           |I.q.*...|

In which the 78 9c is the magic header for zlib (http://stackoverflow.com/questions/9050260/what-does-a-zlib-header-look-like)

Or

echo -n "example text to compress and decompress" | gzip | hd

We get:

00000000  1f 8b 08 00 bf 59 06 57  00 03 4b ad 48 cc 2d c8  |.....Y.W..K.H.-.|
00000010  49 55 28 49 ad 28 51 28  c9 57 48 ce cf 2d 28 4a  |IU(I.(Q(.WH..-(J|
00000020  2d 2e 56 48 cc 4b 51 48  49 85 71 01 fb 2d 09 c5  |-.VH.KQHI.q..-..|
00000030  27 00 00 00                                       |'...|

The 1f 8b are gzip header.

With System.IO.Compression.DeflateStream we get:

4b ad 48 cc 2d c8 49 55  28 49 ad 28 51 28 c9 57
48 ce cf 2d 28 4a 2d 2e  56 48 cc 4b 51 48 49 85
71 01

This is the DEFLATE stream, not zlib stream. It is in the middle of both zlib and gzip stream, but without headers and tails.

Ah I see what the ask is now, thanks for the elaboration.

I agree that this is something we should have at the very least because it won't require very much effort to implement and increases the number of cases that we can handle. It does complicate dotnet/runtime#16574 though.

Off the top of my head, we could add this in a few different ways:

  • GZipStream is already a thin wrapper around DeflateStream so it would be trivial to add a ZlibStream that behaved similarly.
  • Expose the DeflateStream constructor that takes windowBits so that you could select the correct windowBits number depending on the type of compression you wanted to work with.
  • Add a CompressionType enum with values Raw, ZLib, and GZip. Add a constructor to DeflateStream that accepted that enum.

@ianhays Now with this "up for grabs", what is the conclusion regarding dotnet/runtime#16574 then? Managed or not, both should work, but eventually there needs to be a "policy" or "agreement" reached or pushed either from top or bottom.

My feeling is that native code should perform faster, (although happy to be proved wrong on this). Trivia difference for small runs, would add up when you are facing huge number of calls.

Another thought on this: talking about managed implementation, how about merging the effort with https://github.com/icsharpcode/SharpZipLib?

Now with this "up for grabs", what is the conclusion regarding dotnet/runtime#16574 then? Managed or not, both should work, but eventually there needs to be a "policy" or "agreement" reached or pushed either from top or bottom.

The purpose of dotnet/runtime#16574 is to provide a fallback implementation where a native compression library (e.g. zlib) doesn't exist. That issue actually shouldn't interfere with this one since any new API could be PlatformNotSupported in the managed implementation if it came to that.

My feeling is that native code should perform faster, (although happy to be proved wrong on this). Trivia difference for small runs, would add up when you are facing huge number of calls.

You are correct. Switching to native zlib increased our compression bechmarks by something like ~30% in cases where more than a few bytes were compressed/decompressed.

Another thought on this: talking about managed implementation, how about merging the effort with https://github.com/icsharpcode/SharpZipLib?

We already have a managed implementation in source control, I just deleted it since we didn't need it at the time. Also, as a general rule, we don't take code from other projects/libraries unless the code is available under a very liberal license like zlib or the owner contributes the code themselves under our license.

As far as this issue is concerned, implementing deflate support in CoreFX wouldn't require adding a managed Deflate algorithm, it would only require we add API around our native library (zlib) that is already built to handle that format. It's just a question of what we want that API to look like.

The work remaining on this is to propose an API for a ZLibStream class that compresses/decompresses deflate data that is wrapped in a ZLib header. The API (and implementation for that matter) for this should probably just be a copy/paste of GZipStream.

@xied75 do you want to complete formal API proposal then we can review formally?

Docs on the API review process and see Jon's API request for a great example of a strong proposal. The more concrete the proposal is (e.g. has examples of usage, real-world scenarios, fleshed out API), the more discussion it will start and the better the chances of us being able to push the addition through the review process.

@danmosemsft Sorry for the delayed response, I reckon a formal API proposal will need large block of time which I can't start right now. This is needed in my project thus I'll come to it sooner or later, if no others want to grab it first.

Just found we've already got these https://github.com/dotnet/corefx/tree/master/src/System.IO.Compression/src/System/IO/Compression/DeflateZLib

I wonder would these files warrant closing this issue?

@xied75 the nomenclature is somewhat confusing, but the files in that folder are for the DeflateStream that uses zlib as the underlying compression engine. It does not contain a public interface for handing zlib headers, only raw deflate.

Would this mean that dotnet core would be able to decompress a file that is created using zlib's pigz implementation http://zlib.net/pigz/ it seems that the current implementation of the GZipStream only decompresses part of the stream if it's compressed using that.

From looking at the pigz docs, it appears to be something entirely different from the deflated data stream prepended by a ZLIB header that is being discussed in this issue. So I don't see this issue affecting pigz.

I would also like access to the zlib internals to use the native implementation instead of a managed version I have. This is specifically to expose GZip functionality like the filename or comment.

This is for SharpCompress: https://github.com/adamhathcock/sharpcompress

RFC1950 support would be nice for gRPC in .NET. A common compression encoding is deflate. However the deflate content from servers and clients has headers from RFC1950.

This isn't the end of the world. gRPC .NET can provide gzip as an option instead.

Triage:
This is an advanced scenario (only 3 people wanting it so far). We are reluctant to expose our internal zlib abstraction and maintain its compatibility over time. It might prevent us from upgrading zlib easily.

@JamesNK how does this impact gRPC scenarios?

It isn't required. If it becomes more important then we can re-evaluate.

This is an advanced scenario (only 3 people wanting it so far).

3 people plus all the people that would benefit from the change. I'm very disappointed at the short-sightedness here.

See dotnet/corefx#40170 for further reasons why this is important.

Sadly, I spend most of my time coding and little of it tracking feature requests in .NET.

I'm number 4.... and I'm sure there are more out there like us.

@JimBobSquarePants we are hesitant to add features that would tie our hands long term and we do not believe they bring large value.

If it is that needed, I think there is always opportunity for standalone nuget package. If it is successful and widely used over time, I'd be willing to reconsider and take it into the platform.

I don't quite understand why the difference between frameworks matters. I bet we will close that issue on Wed with next round of triage. I admit I don't don't read it all yet ...

As I mentioned in dotnet/runtime#1550 my primary issue is that the GZip implementation doesn't expose the archive metadata that GZip (the archive format) uses. GZip is basically DEFLATE plus some optional metadata.

@adamhathcock is this the only way that problem can be solved? Do we have other way to expose format medatada?
What exactly is your scenario / usage pattern? How common is it? (in the context on this issue - the linked one is rather long and detailed about other things)

When using GZipStream to compress data like in a web server, you don't use the metadata fields. However, GZip (the utility) compresses a single file and adds metadata like the file name and timestamp. The ability to read and write this metadata is what I need.

I missed this when I made a new issue for it. I'm really disappointed by the decision made here. I've been waiting for zlib support for a long time now. I made a new issue requesting an API surface similar or almost exact with the API surface of GZipStream. This does not expose too many specifics of the underlying zlib implementation while still allowing the C# developer to handle zlib data in a way they are already familiar with. The ability to compress data to zlib with the existing CompressionLevel enum, and to decompress any zlib data stream would probably be good enough for the majority of use cases.

If this is something supporters of this issue would want, then show your support in the new issue, and comment about your use cases. I hope we can get them to change their minds on this topic.

Also, let's not forget that many similar languages to C# already have support for zlib, including Java and Python.

https://github.com/dotnet/runtime/issues/2236

Was this page helpful?
0 / 5 - 0 ratings