Runtime: Add async ZipFile APIs

Created on 25 Jan 2016  路  28Comments  路  Source: dotnet/runtime

All ZipFile APIs are currently synchronous. This means manipulations to zip files will always block a thread. We should investigate using an async file and calling async IO APIs (ReadAsync/WriteAsync) to free up the thread while IO is happening. https://github.com/dotnet/corefx/pull/5680

api-suggestion area-System.IO.Compression

Most helpful comment

Any update on this? I just hit this as well when trying to stream directly to the response stream in my v3 Azure function.

All 28 comments

This needs someone to write up a formal API proposal then we can review and if approved, offer up for implementation. Anyone?

Docs on the API review process and see Jon's API request for a great example of a strong proposal. The more concrete the proposal is (e.g. has examples of usage, real-world scenarios, fleshed out API), the more discussion it will start and the better the chances of us being able to push the addition through the review process.

Forked the repo, and working on an api proposal.
Seems like most of the work outlined in this item is already done. e.g
https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L682
https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L732

That probably means this item is done, right?
@danmosemsft

Not quite, @abdulbeard. This particular issue is around adding ZipFile async APIs.

For example, currently we have ZipFile.ExtractToDirectory(string, string). This issue is around adding async versions of those API e.g. ZipFile.ExtractToDirectoryAsync(string, string).

Gotcha @ianhays . I did a terrible job reading the title 馃槢

Summary

Curently, System.IO.Compression.Zipfile doesn't have async methods for CreateFromDirectory, ExtractToDirectory, Open and OpenRead. This means that these zipfile operations are thread-blocking, and not asynchronous, and hence, comparatively less performant.

Proposed API

namespace System.IO.Compression {
    public partial class Zipfile {
        public static Task CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName, CancellationToken cancellationToken = default(CancellationToken));

        public static Task CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName,  CompressionLevel compressionLevel, bool includeBaseDirectory, CancellationToken cancellationToken = default(CancellationToken));

        public static Task CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName,  CompressionLevel compressionLevel, bool includeBaseDirectory, Encoding entryNameEncoding, CancellationToken cancellationToken = default(CancellationToken));

        public static Task ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, CancellationToken cancellationToken = default(CancellationToken));

        public static Task ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding entryNameEncoding, CancellationToken cancellationToken = default(CancellationToken));

        public static Task<ZipArchive> OpenAsync(string archiveFileName, ZipArchiveMode mode, CancellationToken cancellationToken = default(CancellationToken));

        public static Task<ZipArchive> OpenAsync(string archiveFileName, ZipArchiveMode mode, Encoding entryNameEncoding, CancellationToken cancellationToken = default(CancellationToken));

        public static Task<ZipArchive> OpenReadAsync(string archiveFileName, CancellationToken cancellationToken = default(CancellationToken));
    }
}

Expected Use and Benefits

These new functions are asynchronous versions of their synchronous counterparts. This allows the .net runtime to allow the thread to do other work while the asynchronous operation completes (as mandated by the TPL and async/await scheduling).

  • Zipfile operations will no longer be thread-blocking, improving performance.
  • Long-running Zipfile operations can now be cancelled via the CancellationToken, by the user, allowing them more control.
  • Asynchronous FileStream or other OS/runtime optimizations can be utilized to improve performance here.

Here's a proposal ^. Looking forward to constructive feedback and insight.

Should async APIs be added to ZipArchive at the same time? For example, consider the following code:

```c#
using (var archive = await ZipFile.OpenAsync("archive.zip", ZipArchiveMode.Update))
{
archive.CreateEntry("empty.txt");
}

using (var archive = await ZipFile.OpenReadAsync("archive.zip"))
{
foreach (var entry in archive.Entries)
{
Console.WriteLine(entry.FullName);
}
}
```

As far as I can tell from skimming the source code, even though the above code uses the proposed async APIs as much as it can, it still blocks when accessing Entries and, more insidiously, when disposing the first archive.

Here's a proposal for ZipArchive class:

namespace System.IO.Compression {
    public class ZipArchive {
        public Task<ReadOnlyCollection<ZipArchiveEntry>> GetEntriesAsync();
        public Task<ZipArchiveEntry> GetEntryAsync(string entryName, CancellationToken cancellationToken = default(cancellationToken));
    }
}

@abdulbeard How does that help? As far as I can tell, CreateEntry does not block in the current implementation. On the other hand, accessing Entries and calling Dispose() can block, depending on the mode.

@svick you're right. Async versions of CreateEntry don't do much, since their underlying functionality is not blocking.
On the matter of Dispose() being async, not sure there's a good way to do that.
In the matter of Entries, I can imagine replacing that with something like:

public Task<ReadOnlyCollection<ZipArchiveEntry>> GetEntriesAsync();
public Task<ZipArchiveEntry> GetEntryAsync(string entryName, CancellationToken cancellationToken = default(cancellationToken));

Amended the proposal above.

@JeremyKuhne could you give feedback on this proposal? Perhaps we can move it forward

@abdulbeard

On the matter of Dispose() being async, not sure there's a good way to do that.

Yeah, there isn't a great approach until https://github.com/dotnet/roslyn/issues/114 is added. But in the meantime, I think there are ways to make it work. The approaches I considered are:

  • Add FlushAsync(). This would write the current state of the ZipArchive to disk, which means that calling Dispose() right after FlushAsync() would not block.

    Example usage:

    c# using (var archive = await ZipFile.OpenAsync("archive.zip", ZipArchiveMode.Update)) { archive.CreateEntry("empty.txt"); await archive.FlushAsync(); }

  • Add DisposeAsync(). Once https://github.com/dotnet/roslyn/issues/114 is added, you would be able to use it with using.

    Example usage:

    c# var archive = await ZipFile.OpenAsync("archive.zip", ZipArchiveMode.Update) try { archive.CreateEntry("empty.txt"); } finally { await archive.DisposeAsync(); }

Looking at the two examples above, I think adding FlushAsync() is the better option.

@svick
To me it makes sense to wait on resolution of dotnet/roslyn/#114 before deciding on FlusyAsync vs DisposeAsync.
But I can also see value in finishing out this stage with FlushAsync, and then once we have an async dispose, let that replace FlushAsync.
But I'd love to hear from @JeremyKuhne etc. as well.

@danmosemsft @JeremyKuhne any suggestions for changes to ^, or shall we move forward?

Anything I can do to help move this along?

crickets..........

@abdulbeard Sorry, the actual owner here moved on to another project and this got lost in the shuffle. 馃槮 Thanks for poking on it.

@stephentoub can you give some direction/guidance on the disposal questions? @ericstj do you think this proposal provides what you were looking for?

Presumably IAsyncEnumerable<ZipArchiveEntry> would be the right choice for GetEntriesAsync...

@ahsonkhan, @ViktorHofer can one of you take this and drive it forward?

can you give some direction/guidance on the disposal questions?

We have added IAsyncDisposable and C# is adding await using that works with it. If disposing a ZipArchive does work that could be asynchronous (e.g. flushing a stream), having ZipArchive implement it is quite reasonable. Note that Stream itself now implements IAsyncDisposable as well.

Presumably IAsyncEnumerable would be the right choice for GetEntriesAsync

It depends. Is ZipArchive lazily reading from a stream, e.g. when you go to read the next entry, might that perform a read on the stream? If yes, then having it return an IAsyncEnumerable could make sense. If no, then it should probably return a synchronous enumerable.

Just curious is there any update on this issue? We have a .NET Core project that creates zip files in Azure blobs, and we are forced to use the sync version of ZipArchive apis because the async ones will cause unit tests to hang.

I have some code that is generating and streaming a zip archive on the fly, which fails if AllowSynchronousIO is false, which is the default from core 3.0.0-preview3 onwards:
https://github.com/aspnet/AspNetCore/issues/7644

System.InvalidOperationException: Synchronous operations are disallowed. Call WriteAsync or set AllowSynchronousIO to true instead.
at Microsoft.AspNetCore.Server.IIS.Core.HttpResponseStream.Write(Byte[] buffer, Int32 offset, Int32 count)
at Microsoft.AspNetCore.Server.IIS.Core.WrappingStream.Write(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.Compression.PositionPreservingWriteOnlyStreamWrapper.Write(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.BinaryWriter.Write(UInt32 value)
at System.IO.Compression.ZipArchiveEntry.WriteCentralDirectoryFileHeader()
at System.IO.Compression.ZipArchive.WriteFile()
at System.IO.Compression.ZipArchive.Dispose(Boolean disposing)
at System.IO.Compression.ZipArchive.Dispose()

Like @kccsf I have simillar issue with the Kestrel server configured to disallow synchronous IO. With the current implementation I can not use ZipArchive directly with the Kestrel's response stream - because of synchronous disposing.

cc @carlossanlop

Just hit this today as well. :\ After writing the entries, Dispose() throws the exception. Worked around by allowing sync I/O at that point for now, but would be nice to have some async API that I could call to remove the need for sync I/O.

Triage:
We considered this. Next step is API design and review.

Any update on this? I just hit this as well when trying to stream directly to the response stream in my v3 Azure function.

whats about thread safety of ZipArchive.CreateEntryFromFile / CreateEntryFromFileAsync? I didn't find any information about whether the current implementation is thread safe. Will the new proposal be thread safe?
Background: I assume, if I could process multiple files in parallel, the CPU and Disk I/O would be fully used.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

GitAntoinee picture GitAntoinee  路  3Comments

matty-hall picture matty-hall  路  3Comments

sahithreddyk picture sahithreddyk  路  3Comments

iCodeWebApps picture iCodeWebApps  路  3Comments

aggieben picture aggieben  路  3Comments