Runtime: Add a Forward-only API for System.IO.Compression

Created on 1 Jun 2017 · 36 Comments · Source: dotnet/runtime

This was started in https://github.com/dotnet/corefx/issues/9657

There seems to be a growing desire/need for a forward-only API that accesses compressed file formats (e.g. zip, gzip, tar, etc.) in a streaming manner. This means very large files, as well as streams like network streams, can be read and decompressed on the fly. Basically, the API reads from Stream objects and never seeks on them. This is how the Reader/Writer API from SharpCompress works.

Here's a sample from the unit tests:

```c#
using (Stream stream = new ForwardOnlyStream(File.OpenRead(path)))
using (IReader reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            reader.WriteEntryToDirectory(test.SCRATCH_FILES_PATH, new ExtractionOptions()
            {
                ExtractFullPath = true,
                Overwrite = true
            });
        }
    }
}
```

```c#
public interface IReader : IDisposable
{
    event EventHandler<ReaderExtractionEventArgs<IEntry>> EntryExtractionProgress;

    event EventHandler<CompressedBytesReadEventArgs> CompressedBytesRead;
    event EventHandler<FilePartExtractionBeginEventArgs> FilePartExtractionBegin;

    ArchiveType ArchiveType { get; }

    IEntry Entry { get; }

    /// <summary>
    /// Decompresses the current entry to the stream.  This cannot be called twice for the current entry.
    /// </summary>
    /// <param name="writableStream"></param>
    void WriteEntryTo(Stream writableStream);

    bool Cancelled { get; }
    void Cancel();

    /// <summary>
    /// Moves to the next entry by reading more data from the underlying stream.  This skips the current entry's data if it has not been read.
    /// </summary>
    /// <returns></returns>
    bool MoveToNextEntry();

    /// <summary>
    /// Opens the current entry as a stream that will decompress as it is read.
    /// Read the entire stream or use SkipEntry on EntryStream.
    /// </summary>
    EntryStream OpenEntryStream();
}
```

WriteEntryToDirectory is an extension method that provides some shortcuts for dealing with file operations, but what it really does is grab the internal stream and decompress it. The actual entry method is just IReader.WriteEntryTo(Stream). If the entry isn't extracted, the internal stream is simply moved forward without decompressing when possible (some formats require decompression anyway, since the compressed length can be unknown).
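For illustration, here is a minimal sketch of that core extraction path, written against the IReader interface above (the entry name check is hypothetical, added just for this example):

```c#
// Minimal sketch against the proposed IReader API above.
// Assumes: using System.IO; plus the SharpCompress-style types shown earlier.
using (IReader reader = ReaderFactory.Open(File.OpenRead(path)))
{
    while (reader.MoveToNextEntry())
    {
        // "readme.txt" is a hypothetical entry name for this example
        if (reader.Entry.Key == "readme.txt")
        {
            using (FileStream target = File.Create("readme.txt"))
            {
                reader.WriteEntryTo(target); // decompresses straight into target
            }
        }
        // unread entries are skipped, without decompression where the format allows
    }
}
```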

The Writer API works similarly.

There is also a generic API via ReaderFactory or WriterFactory that doesn't require knowledge of the format beforehand, plus a similar ArchiveFactory that is writeable (it uses WriterFactory internally to save) and could also be used for the current ZipArchive API and beyond.

As the author of SharpCompress, I'd like to push a lot of these ideas into the core library, but having native access to the compression streams (like the internal zlib) would be a great performance benefit for me. I haven't ever written any compression algorithm implementations myself, so I'm sure my managed implementations need a lot of love.

I would start by creating ZipReader and ZipWriter (as well as starting the generic API) using a lot of the internal code already in the core library to prove out the API. This kind of relates to https://github.com/dotnet/corefx/issues/14853, but forward-only access is something most libraries don't account for, so I'm not sure. Other easy additions would be Reader/Writer support for GZip, BZip2 and LZip (with an LZMA compressor).

Tar support linked with the previous single-file formats would be a larger addition. SharpCompress deals with tar.gz, tar.bz2 and tar.lz auto-magically by first detecting the compressed format and then the tar file inside. The above API works the same.
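As a rough illustration of that detection (not SharpCompress's actual code): gzip streams begin with the magic bytes 0x1F 0x8B, and a ustar tar archive carries the string "ustar" at offset 257 of its first 512-byte header block, so a detector can probe the outer stream first and then the decompressed side:

```c#
// Hypothetical detection sketch; SharpCompress's real implementation differs.
// Assumes: using System.IO; using System.IO.Compression;
static bool LooksLikeGZip(Stream s)
{
    // gzip magic bytes: 0x1F 0x8B (this consumes two bytes, so the
    // stream must be peekable/buffered in a real implementation)
    int b1 = s.ReadByte(), b2 = s.ReadByte();
    return b1 == 0x1F && b2 == 0x8B;
}

static bool LooksLikeTar(Stream s)
{
    // ustar magic lives at offset 257 of the first 512-byte header block;
    // reading it consumes the block, so a real reader must keep those bytes.
    byte[] block = new byte[512];
    int read = 0, n;
    while (read < block.Length && (n = s.Read(block, read, block.Length - read)) > 0)
        read += n;
    return read == 512 &&
           block[257] == (byte)'u' && block[258] == (byte)'s' &&
           block[259] == (byte)'t' && block[260] == (byte)'a' &&
           block[261] == (byte)'r';
}
```

A tar.gz stream is then handled by testing the outer bytes for gzip, wrapping the stream in new GZipStream(stream, CompressionMode.Decompress), and running the tar probe on the decompressed side; tar.bz2 and tar.lz work the same way with their own magic bytes.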

Thoughts?

Summoning a few people from the other issue
cc: @ianhays @karelz @qmfrederik

api-suggestion area-System.IO.Compression

Most helpful comment

We have encountered a few issues in System.IO.Packaging that users of DocumentFormat.OpenXml (a common Office document manipulation library) have hit that require streaming support in zip files. I'm not familiar with the domain enough to add much to the API discussion, but would love to see some movement on this issue.

All 36 comments

@ianhays what is the next step? You did +1 above, does it mean you like the idea? Do we need API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it)

I can put together a new API based on what I've done. It would be very similar to SharpCompress.

  1. Is a new writer API necessary? AFAIK, ZipArchive already supports writing to non-seekable streams on .NET Core (https://github.com/dotnet/corefx/issues/11497).
  2. For reading, I think the API should return an IEnumerable<T>. That way, the user can use a foreach and they don't have to manually call something like MoveNext in a while loop. (Some LINQ methods also become a possibility; see the sketch below.)
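For instance, the foreach shape could be layered over the proposed reader as an iterator; a sketch assuming the IReader/IEntry members from the original post:

```c#
// Sketch: adapt the MoveToNextEntry()/Entry pair into a lazy sequence.
// Assumes: using System.Collections.Generic; and the proposed IReader/IEntry.
// Enumerating twice cannot work, because the underlying stream is forward-only.
static IEnumerable<IEntry> Entries(IReader reader)
{
    while (reader.MoveToNextEntry())
        yield return reader.Entry;
}
```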

1) You're right, this probably isn't strictly necessary. It would be the inverse of a proposed Reader API and it would have a generic writing implementation while the current one is specific to Zip and could probably not apply in a generic case. If a generic API isn't desired then this could possibly be dropped.
2) I've gone back and forth on whether I wanted to do it that way. The reason I haven't is that people could be tempted to call ToList() on the reader, which would definitely not work and would defeat the point of the reader API when they should have used the archive API.

Then maybe the API should return a type that can be used in a foreach, but does not implement IEnumerable<T>. Though that could be confusing to someone who sees the API for the first time.
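That is possible because C#'s foreach binds by pattern, not by interface: the compiler only needs a public GetEnumerator() whose result exposes MoveNext() and Current. A sketch of such a type (the names are hypothetical):

```c#
// foreach works against this type even though it implements no interface;
// ToList() does not compile against it, since there is no IEnumerable<T>.
public class EntryCursor
{
    private readonly IReader _reader;
    public EntryCursor(IReader reader) { _reader = reader; }

    public EntryCursor GetEnumerator() { return this; }
    public bool MoveNext() { return _reader.MoveToNextEntry(); }
    public IEntry Current { get { return _reader.Entry; } }
}
```

With that shape, foreach compiles while the LINQ extension methods are unavailable, which addresses the ToList() concern at the cost of surprising first-time readers.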

If a generic API isn't desired then this could possibly be dropped.

I've gone back-and-forth about this myself. I opened https://github.com/dotnet/corefx/issues/9709 to track my thoughts long-term on a way to centralize our compression algorithms somehow. It's clear that as we continue adding more compression types to corefx, it's going to become trickier and trickier to integrate them smoothly and expose them to devs of all skill levels without being overwhelming or overly simplistic.

I've gone back and forth on whether I wanted to do it that way. The reason I haven't is that people could be tempted to call ToList() on the reader, which would definitely not work and would defeat the point of the reader API when they should have used the archive API.

I think in this case increased usability is more valuable than protecting devs from themselves. You could implement a code analysis rule specifically to look for instances where devs do a ToList() and suggest they avoid it.

https://github.com/dotnet/corefx/issues/9237 is also somewhat related, though less so.

@ianhays what is the next step? You did +1 above, does it mean you like the idea? Do we need API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it)

I do like the idea. My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, dotnet/corefx#9237, or dotnet/corefx#9709. Adding something like LZMA is easy because we can model it after DeflateStream/GZipStream with a similar API, but this issue (and the others linked above) is far more complex in the long-term implications of its design and API expansion.

OK, what is the next step? (you coming back with larger design?) What is rough ETA?

OK, what is the next step? (you coming back with larger design?)

Continue discussing design choices in this issue, dotnet/corefx#9709, and dotnet/corefx#9237. Wait for other people to be interested and comment/vote on these threads with their opinion.

What is rough ETA?

Whenever we reach quorum and have a sizable mass of users requesting the feature. It's not something that we're hurting for right now or that a lot of customers are asking for, so I don't feel comfortable assigning an ETA for it.

Here's a rough Reader API based on what I've done with SharpCompress; making the Reader itself IEnumerable might not quite be correct.

```c#
//The file format of the Archive.  Can be confusing as GZip is a single file format as well as a compression type (using Tar as archive type).
public enum ArchiveType
{
    Rar,
    Zip,
    Tar,
    SevenZip,
    GZip
}

//The compression type used on the Archive/Entry.  GZip, BZip2, LZip, Xz are single-file formats as well as compression types.
public enum CompressionType
{
    None,
    GZip,
    BZip2,
    PPMd,
    Deflate,
    Rar,
    LZMA,
    BCJ,
    BCJ2,
    LZip,
    Xz,
    Unknown
}

//possibly not needed
public class ReaderOptions
{
    //Password protected archives (out of scope?)
    public string Password { get; set; }

    //Take ownership of the stream
    public bool CloseStream { get; set; }
}

public static class ReaderFactory
{
    public static IReader Open(Stream stream, ReaderOptions options = null);
#if NETSTANDARD1_3
    public static IReader Open(FileInfo file, ReaderOptions options = null);
    public static IReader Open(string filePath, ReaderOptions options = null);
#endif
}

public interface IReader : IEnumerable<IReaderEntry>
{
    IEnumerator<IReaderEntry> GetEnumerator();

    ArchiveType ArchiveType { get; }

    IReaderEntry CurrentEntry { get; }

    void Dispose();
}

public interface IReaderEntry
{
    Stream OpenRead();
}

public interface IEntry
{
    //This is on the entry because some archive types (like ZIP) can have different compression types per entry
    CompressionType CompressionType { get; }
    string Key { get; }

    //everything below here is optional depending on archive type
    DateTime? ArchivedTime { get; }

    long? CompressedSize { get; }
    long? Crc { get; }
    long? Size { get; }

    DateTime? CreatedTime { get; }

    bool? IsDirectory { get; }
    bool? IsEncrypted { get; }
    bool? IsSplit { get; }

    DateTime? LastAccessedTime { get; }
    DateTime? LastModifiedTime { get; }
}

public static class IReaderExtensions
{
#if NETSTANDARD1_3
    public static void WriteEntryTo(this IReaderEntry reader, string filePath);
    /// <summary>
    /// Extract all remaining unread entries to specific directory, retaining filename
    /// </summary>
    public static void WriteAllToDirectory(this IReader reader, string destinationDirectory,
                                            ExtractionOptions options = ExtractionOptions.None);

    /// <summary>
    /// Extract to specific directory, retaining filename
    /// </summary>
    public static void WriteEntryToDirectory(this IReaderEntry reader, string destinationDirectory,
                                                ExtractionOptions options = ExtractionOptions.None);

    /// <summary>
    /// Extract to specific file
    /// </summary>
    public static void WriteEntryToFile(this IReaderEntry reader, string destinationFileName,
                                        ExtractionOptions options = ExtractionOptions.None);
#endif
}
[Flags]
public enum ExtractionOptions
{
    None = 0,
    /// <summary>
    /// overwrite target if it exists
    /// </summary>
    Overwrite = 1,

    /// <summary>
    /// extract with internal directory structure
    /// </summary>
    ExtractFullPath = 2,

    /// <summary>
    /// preserve file time
    /// </summary>
    PreserveFileTime = 4,

    /// <summary>
    /// preserve windows file attributes
    /// </summary>
    PreserveAttributes = 8,
}
```

Sample usage

```c#
public static void Main()
{
    using(var reader = ReaderFactory.Open("C:\\test.zip"))
    {
        foreach(var entry in reader)
        {
            if (!entry.IsDirectory)
            {
                entry.WriteEntryToDirectory("D:\\temp", ExtractionOptions.ExtractFullPath);
            }
        }
    }

    using(var reader = ReaderFactory.Open("C:\\test.zip"))
    {
        reader.WriteAllToDirectory("D:\\temp2", ExtractionOptions.ExtractFullPath);
    }

    //no explicit using statement for IReader
    foreach(var entry in ReaderFactory.Open("C:\\test.zip"))
    {
        if (entry.Key.StartsWith("x"))
        {
            using(var data = entry.OpenRead()) 
            {
                data.CopyTo(new MemoryStream());
            }
        }
    }
}
```

Can expand out to show the generic Archive API too.

@ianhays

My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, dotnet/corefx#9237, or dotnet/corefx#9709.

Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other.

Putting it wholesale together as a real thing in corefxlab makes sense. Evolving the API might be too slow.

Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other.

That is probably the best idea. We can design the large structure in corefxlab, then move small pieces to corefx one-by-one with their own justification so that we can actually get them through API review.

Here's a rough Reader API based on what I've done with SharpCompress; making the Reader itself IEnumerable might not quite be correct.

I like that too. Your rough API is pretty similar to what I had in mind.

A couple suggestions:

  • ReaderOptions should include an option to specify the ArchiveType explicitly.
  • ArchiveType.GZip should be called TarGZip or GZipTar to avoid confusion.
  • ExtractionOptions should have a PreservePermissions entry.
  • Should there be some relation between IReaderEntry and IEntry? Where is the latter used?
  • Instead of having IReader.WriteEntryTo and friends, I think those should be extension methods on the IReaderEntry (or IEntry) itself.
  • All of the instances of Reader/IReader should be replaced with ArchiveReader or IArchiveReader (or something like that).

On a more general note, could you explain why you split reading and writing into their own interfaces/classes? Having a single IArchive seems more intuitive and usable, despite opening up some unsupported operations, e.g. attempting to write to an archive that is being read.

public interface IEntry

It should be possible to combine this with the concept of a compression algorithm interface a la https://github.com/dotnet/corefx/issues/9709.

ReaderOptions should include an option to specify the ArchiveType explicitly.

Makes sense to explicitly try an ArchiveType.

ArchiveType.GZip should be called TarGZip or GZipTar to avoid confusion.

That's a way to go. Explicitly identify Tar archives with a matching single file compression.

ExtractionOptions should have a PreservePermissions entry.

Can do.

Should there be some relation between IReaderEntry and IEntry? Where is the latter used?

IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.

Instead of having IReader.WriteEntryTo and friends, I think those should be extension methods on the IReaderEntry (or IEntry) itself.

Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.

All of the instances of Reader/IReader should be replaced with ArchiveReader or IArchiveReader (or something like that).

I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

It should be possible to combine this with the concept of a compression algorithm interface a la dotnet/corefx#9709.

I don't think so. You could probably put DeflateStream, LZMAStream, etc. behind CompressionStream. IEntry is about reading the metadata of an entry in an archive, then choosing to read/extract the data or skip it.

That's a way to go. Explicitly identify Tar archives with a matching single file compression.

Yeah. It's not ideal though.

IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.
Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.

I see, thanks.

This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves. I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it.

Yeah. It's not ideal though.

Then we don't need it! Can already match based on headers.

To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves.

It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use-cases need to be handled in very different ways, and forcing the APIs together feels unnatural. Knowing which methods/properties can be used and when puts a lot of cognitive burden on the user. Keeping them separate avoids that burden without reducing performance, and the code is simpler.

I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it.

The Archive API does do read/writes in SharpCompress. The Reader/Writer API has a clean break as these actions in a forward-only scenario are very different.

I'm just offloading my experience with SharpCompress. I'm not saying there isn't a better way than the direction I've gone but I don't want the same mistakes repeated.

I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too.

Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too.

Trust me, I know. I've been dealing with and building SharpCompress for almost a decade.

For random access APIs (Archive): If there is no central directory, then you scan the file. If the sizes of the entries are known, then you can just seek to headers and it's not a big deal.

Zip is the only format that really does have a central directory. 7Zip has funky stuff (and I hate it) and everything else is basically <header><data><header><data> (RAR, Tar, etc.), so they need to be scanned.

My point is that for very large files, in certain cases, you want a forward-only API for performance reasons. Everything except 7Zip and Zip64 (when writing) can be done this way. Zip has post-data descriptors when writing out (though Zip64 doesn't support this). Tar can be written out as a forward-only archive if you know the data size for the header.
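To make the tar side concrete: each entry is a 512-byte header of octal text fields followed by the data padded to a 512-byte boundary, so writing is naturally forward-only once the entry size is known up front. A minimal pre-POSIX header writer might look like the sketch below (illustration only; no ustar, GNU, or PAX fields):

```c#
// Sketch: write one tar entry to a forward-only stream. The size must be
// known before any data is written because it lives in the leading header.
// Assumes: using System; using System.IO; using System.Text;
static void WriteTarEntry(Stream output, string name, byte[] data)
{
    byte[] header = new byte[512]; // zero-initialized, so unset fields are NUL

    void PutOctal(long value, int offset, int digits)
    {
        string s = Convert.ToString(value, 8).PadLeft(digits, '0');
        Encoding.ASCII.GetBytes(s, 0, digits, header, offset);
    }

    Encoding.ASCII.GetBytes(name, 0, Math.Min(name.Length, 100), header, 0);
    PutOctal(Convert.ToInt32("644", 8), 100, 7);                  // mode
    PutOctal(0, 108, 7);                                          // uid
    PutOctal(0, 116, 7);                                          // gid
    PutOctal(data.Length, 124, 11);                               // size
    PutOctal(DateTimeOffset.UtcNow.ToUnixTimeSeconds(), 136, 11); // mtime
    header[156] = (byte)'0';                                      // regular file

    // The checksum is computed with its own field filled with spaces,
    // then stored as six octal digits, NUL, space.
    for (int i = 148; i < 156; i++) header[i] = (byte)' ';
    int checksum = 0;
    foreach (byte b in header) checksum += b;
    Encoding.ASCII.GetBytes(Convert.ToString(checksum, 8).PadLeft(6, '0'),
                            0, 6, header, 148);
    header[154] = 0;

    output.Write(header, 0, 512);
    output.Write(data, 0, data.Length);
    int pad = (512 - data.Length % 512) % 512;
    output.Write(new byte[pad], 0, pad); // pad data to the next block boundary
    // a finished archive also ends with two all-zero 512-byte blocks
}
```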

Then we don't need it! Can already match based on headers.

I was more thinking about having it for the Writer.

It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use-cases need to be handled in very different ways, and forcing the APIs together feels unnatural. Knowing which methods/properties can be used and when puts a lot of cognitive burden on the user. Keeping them separate avoids that burden without reducing performance, and the code is simpler.

I definitely don't disagree with you. I personally have always disliked having the same class for writing/reading because it's more difficult to reason about. I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream.

I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage due to redundancy.

You've convinced me that we should at least have an IForwardArchive and an IArchive.

If I was designing this from scratch I would probably aim to have something like (very rough):

```c#
public interface IForwardArchive : IEnumerable
{
    // Reader functions
    IEnumerator GetEnumerator();
    IReaderEntry CurrentEntry { get; }

    // Writer functions
    void Write(string filename, Stream source, DateTime? modificationTime);

    // Shared functions
    ArchiveType Type { get; }
    void Dispose();
}

public interface IArchive : IForwardArchive
{
    IEnumerable<IArchiveEntry> Entries { get; }
    IEnumerable<IVolume> Volumes { get; }

    bool IsSolid { get; }
    bool IsComplete { get; }
    long TotalSize { get; }
    long TotalUncompressSize { get; }
}
```

I was more thinking about having it for the Writer.

Yes, with a writer you have to specify an archive type.

I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage due to redundancy.

You've convinced me that we should at least have an IForwardArchive and an IArchive.

If I was designing this from scratch I would probably aim to have something like (very rough):

I don't think that layout quite works. You shouldn't have CurrentEntry on IArchive; it doesn't have the concept of a current entry because it's random access. There is very little shared between IArchive and IForwardArchive, really. The concepts just don't match much, other than that you're dealing with entries. One case is only an enumerable; the other is a collection.

You also want to remove Write from the forward-only reading functions. For a forward-only writer, you want to explicitly create an entry each time and pass a byte stream to the writer, handling a potentially large stream or network stream one entry at a time. Again, mixing writing and reading in a forward-only context just creates confusion in the API.
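In other words, the writer would be a separate shape, roughly like the sketch below (names are hypothetical, modeled on SharpCompress's Writer API; the Write signature mirrors the one in the IForwardArchive sketch above):

```c#
// Sketch of a standalone forward-only writer: one explicit entry at a time,
// data supplied as a stream so large payloads are never buffered in memory.
public interface IWriter : IDisposable
{
    void Write(string entryPath, Stream source, DateTime? modificationTime);
}

// usage: write two entries to a non-seekable output stream
// using (IWriter writer = WriterFactory.Open(outputStream, ArchiveType.Tar))
// {
//     writer.Write("a.txt", File.OpenRead("a.txt"), null);
//     writer.Write("b.txt", File.OpenRead("b.txt"), null);
// }
```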

I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream.

I don't like Read/Write being on Stream at the same time either. One thing the Java API got correct was separating it :) Having to always check CanWrite or CanRead can be burdensome.

Should this proposal be considered in conjunction with the upcoming System.IO.Pipeline: https://github.com/dotnet/corefxlab/blob/091d12e/docs/specs/pipelines-io.md?

There seem to be compression APIs in place based on pipelines as well, under the System.IO.Pipelines.Compression namespace: https://github.com/dotnet/corefxlab/tree/091d12e/src/System.IO.Pipelines.Compression

It is currently unclear if System.IO.Pipeline will make it into CoreFX or not. There are still active discussions around that. It may just sit on top of CoreFX.

If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers (System.IO.Compression) first, so that more people/apps can benefit from the investments. @joshfree @KrzysztofCwalina can you please comment?

If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers (System.IO.Compression) first, so that more people/apps can benefit from the investments.

Yes, any actual compression investments are being done at a lower level. For instance one of our summer interns is working on System.IO.Compression.Brotli for a future release in .NET Core @ https://github.com/dotnet/corefxlab/tree/master/src/System.IO.Compression.Brotli.

We have encountered a few issues in System.IO.Packaging that users of DocumentFormat.OpenXml (a common Office document manipulation library) have hit that require streaming support in zip files. I'm not familiar with the domain enough to add much to the API discussion, but would love to see some movement on this issue.

I want to add my voice agreeing with @twsouthwick - this is critically needed for the OpenXML library.

thanks - dave

In the case of *.tar.* files, why add direct support for the compression cases?
I mean, it feels more natural to, for example, call the constructor with a GZipStream.

Moved to Future. This is not a must-fix for 3.0.

Triage:
Complex feature and entirely different API model. Significant investment. There seems to be interest in it, including partners like NuGet and WPF. It could probably be implemented on top of DeflateStream. The API design could be similar to the one required for TAR (dotnet/runtime#1540).

Agreed with the API similar to Tar!

The problem with Zip is that the format is streamable, but not all Zips are created with the required header info to make them stream-readable. I just throw exceptions in SharpCompress in this case.

The current implementation of Zip requires using the central directory at the end.

GZipStream is also a problem because the current implementation doesn't expose the file headers/metadata to use it properly as an archive format. (Related to dotnet/corefx#7570)

I have a library that parses XLSX files which would have made good use of this.

I like the general idea, despite the problems with the ZIP format. In most environments, one might be able to constrain oneself to compliant ZIPs, and it's very useful for TAR, too.

However, some of the methods might need Async variants, I guess, especially WriteEntryTo, which might potentially take a very long time. :-)
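Until then, an async helper can be composed from the pieces already in the proposal, e.g. this sketch over OpenEntryStream:

```c#
// Sketch: async extraction via OpenEntryStream + CopyToAsync, so the
// potentially long copy doesn't block a thread pool thread.
// Assumes: using System.IO; using System.Threading; using System.Threading.Tasks;
public static class ReaderAsyncExtensions
{
    public static async Task WriteEntryToAsync(this IReader reader, Stream target,
        CancellationToken cancellationToken = default(CancellationToken))
    {
        using (EntryStream source = reader.OpenEntryStream())
        {
            await source.CopyToAsync(target, 81920, cancellationToken)
                        .ConfigureAwait(false);
        }
    }
}
```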

Wanted to add a bit more information on this topic - I am one of the authors of Microsoft SimpleRemote, which we use for testing in Surface, and we ended up using a 3rd-party tar library because there was no way to do forward-only operations with what's included in the current Compression namespace.

This is less of an API defect and more of a problem with the zip format itself - Zip is easy to stream when writing, because it stores metadata for the files at the end of the archive, so you can easily write to the stream without ever seeking. Reading, however, is incredibly difficult, because you need to read the end of the file before operating on the archive.

To address this, if you try to open a non-seekable stream with ZipArchive, .NET internally reads the entire stream into a MemoryStream before processing. This works in some cases, but fails on large zip files, since MemoryStream cannot hold more than Int32.MaxValue bytes (about 2 GB).
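As an aside, a common workaround for that ceiling today is to spool the non-seekable source to a temporary file instead of memory; a sketch using only existing APIs:

```c#
// Sketch: give ZipArchive the seekable stream it needs without holding
// the whole payload in memory; the temp file is deleted on dispose.
// Assumes: using System.IO; using System.IO.Compression;
static ZipArchive OpenZipFromForwardOnly(Stream source)
{
    var spool = new FileStream(Path.GetTempFileName(), FileMode.Create,
        FileAccess.ReadWrite, FileShare.None, 81920, FileOptions.DeleteOnClose);
    source.CopyTo(spool);   // still reads/downloads everything, just not into RAM
    spool.Position = 0;
    return new ZipArchive(spool, ZipArchiveMode.Read, leaveOpen: false);
}
```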

Here's an example of where this is a problem: Let's say you're reading a zip stream over a network - this means you'd need to download the entire file before being able to access any of the contents. If we compare this to tar (or cpio, or any tar-like format), we don't have this restriction - we can freely read elements as they come in, since all metadata for a given file is just before that file.

This is a problem for any case where the source stream does not support seeking (if we're encrypting data using cryptostreams, for example, they also aren't seekable). Adding tar, or another archive that supports forward only reads and writes, would make these scenarios much more straightforward and performant.

One note, though - if we implement tar, please use something that, at minimum, supports USTAR (and, ideally, supports the 2001 PAX extensions, so we can write files with unlimited size - historic tar has an 8GB/entry cap).
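For reference, PAX stores those overrides as extended-header records of the form "NN key=value\n", where NN is the decimal length of the whole record including itself and the trailing newline; a sketch of producing one record (ASCII values only):

```c#
// Sketch: build a single PAX extended-header record,
// e.g. PaxRecord("size", "9663676416") for an entry past the historic 8GB cap.
// The self-referential length needs a small fixed-point loop.
static string PaxRecord(string key, string value)
{
    string body = " " + key + "=" + value + "\n";
    int length = body.Length + 1; // at least one length digit
    while (length.ToString().Length + body.Length != length)
        length = length.ToString().Length + body.Length;
    return length + body;
}
```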

SharpCompress does what you need but the Tar support isn't super great. Would love help expanding it.

@adamhathcock I'll take a look! From a quick glance at the code, it looks like SharpCompress Tar entries are USTAR + GNU extensions, correct? It should be pretty easy to add basic support for PAX extensions for long-file, long-link, and large sizes.
