Runtime: System.IO.Compression: ZipArchiveEntry always stores uncompressed data in memory

Created on 13 Sep 2016 · 8 comments · Source: dotnet/runtime

If a ZipArchive is opened in Update mode, calling ZipArchiveEntry.Open will always result in the entry's data being decompressed and stored in memory.

This is because ZipArchiveEntry.Open calls ZipArchiveEntry.OpenInUpdateMode, which in turn reads the ZipArchiveEntry.UncompressedData property, which decompresses the data and stores it in a MemoryStream.
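A minimal sketch illustrating the behavior described above (entry contents and names are made up for the example): in Update mode the stream returned by Open is seekable and reports a Length, precisely because the entry has already been fully decompressed into a MemoryStream behind the scenes.

```csharp
using System;
using System.IO;
using System.IO.Compression;

class Demo
{
    static void Main()
    {
        // Create a small archive with one entry.
        using var ms = new MemoryStream();
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            var entry = archive.CreateEntry("data.txt");
            using var writer = new StreamWriter(entry.Open());
            writer.Write("hello");
        }

        ms.Position = 0;
        // Reopen in Update mode: Open() decompresses the whole entry into memory.
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Update, leaveOpen: true))
        {
            using var stream = archive.Entries[0].Open();
            // Seekable with a known Length, because it is backed by a MemoryStream
            // holding the fully decompressed data.
            Console.WriteLine(stream.CanSeek);
            Console.WriteLine(stream.Length);
        }
    }
}
```

With a 1 GB entry, that backing MemoryStream means roughly 1 GB of managed memory just to open the stream.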

This in itself is already fairly inconvenient: if I'm updating a large file (say 1 GB) that's compressed in a zip archive, I would want to read it from the ZipArchiveEntry in smaller chunks, save it to a temporary file, and update the entry in the ZipArchive in a similar way (i.e. limiting the memory overhead and preferring temporary files instead).

This also means that as soon as a ZipArchive is opened in Update mode, even reading a ZipArchiveEntry which you'll never update incurs this additional memory cost.

A short-term fix may be to expose ZipArchiveEntry.OpenInReadMode, which returns a DeflateStream instead. If you're doing mixed reading and writing of entries in a single ZipArchive, this alone should help you avoid some of the memory overhead.

api-suggestion area-System.IO.Compression

All 8 comments

I've noticed the same. In general ZipArchiveEntry.Open is very non-intuitive in its behavior.

For read-only / write-only you get a wrapped DeflateStream which doesn't tell you the length of the stream nor permit seeking. For read/write (update), ZipArchiveEntry reads and decompresses the entire entry into memory (in fact, into a MemoryStream backed by a single contiguous managed array) so that you have a seekable representation of the file. Once opened for update, the file is written back to the archive when the archive itself is closed.
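To make the read-only contrast concrete, here is a small sketch (entry name and contents invented for the example): in Read mode, the stream Open returns is a wrapped DeflateStream, so CanSeek is false and asking for Length throws.

```csharp
using System;
using System.IO;
using System.IO.Compression;

class Demo
{
    static void Main()
    {
        // Create a small archive with one entry.
        using var ms = new MemoryStream();
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            using var w = new StreamWriter(archive.CreateEntry("data.txt").Open());
            w.Write("hello");
        }

        ms.Position = 0;
        // Reopen in Read mode: Open() returns a streaming, forward-only view.
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Read, leaveOpen: true))
        {
            using var stream = archive.Entries[0].Open();
            Console.WriteLine(stream.CanSeek);   // wrapped DeflateStream: not seekable
            try
            {
                var _ = stream.Length;           // not supported either
            }
            catch (NotSupportedException)
            {
                Console.WriteLine("Length not supported");
            }
        }
    }
}
```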

I agree with @qmfrederik here that we need a better API. Rather than rely solely on the archive's open mode, we should allow the individual calls to Open to specify what kind of stream they want. We can then check that against how the archive was opened and throw if it is incompatible. Consider the addition:
```csharp
public Stream Open(FileAccess desiredAccess)
```
For an archive opened with ZipArchiveMode.Update we could allow FileAccess.Read, FileAccess.Write, or FileAccess.ReadWrite, where only the latter would do the MemoryStream expansion. Read and write would behave as they do today. In addition to solving the OP issue, this would address the case where someone is opening an archive for Update and simply adding a single file: we shouldn't need to copy that uncompressed data into memory just to write it to the archive.
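A hypothetical usage sketch of the proposed overload — note that `Open(FileAccess)` does not exist in the current API, and the entry names here are invented:

```csharp
// Hypothetical: the proposed Open(FileAccess) overload, not part of today's API.
using var archive = ZipFile.Open("big.zip", ZipArchiveMode.Update);

// Read-only view of an entry: a streaming DeflateStream, no MemoryStream expansion.
using (var reader = archive.GetEntry("logs/huge.log").Open(FileAccess.Read))
{
    // stream the data out in bounded-size chunks
}

// Write-only: add a new entry without buffering its uncompressed data in memory.
using (var writer = archive.CreateEntry("new.bin").Open(FileAccess.Write))
{
    // write directly through the compressor
}

// ReadWrite: only this path would pay the full in-memory decompression cost.
using var rw = archive.GetEntry("config.xml").Open(FileAccess.ReadWrite);
```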

We can also do better in the read case: we know the length of the data (assuming the archive is not inconsistent) and can expose it rather than throwing.

Another interesting consideration for this is something like the approach taken by System.IO.Packaging in desktop. It implemented an abstraction over the deflate stream that would change modes depending on how you interacted with it: https://referencesource.microsoft.com/#WindowsBase/Base/MS/Internal/IO/Packaging/CompressStream.cs,e0a52fedb240c2b8

Exclusive reads with small seeks would operate on a deflate stream; the same for exclusive writes. Large seeks or random access would fall back to an "emulation mode" wherein it would decompress everything to a stream that was partially backed by memory but would fall back to disk.

I don't really like this approach, since it hides some very expensive operations behind synchronous calls and introduces a potential trust boundary (a temp file) behind something that is expected to be purely computational. I think it makes sense to keep Zip lower level and not try to hide this in the streams we return. Perhaps we could allow the caller to provide a stream for temporary storage in the Update case.

Triage:
This would be nice to have.

@carlossanlop This is blocking many users from moving to .NET Core as writing large office files ends up hitting this by way of System.IO.Packaging->DocumentFormat.OpenXml

Maybe we should mark this 6.0.0

@twsouthwick @danmosemsft Is there a way to work around this issue (for OpenXml or otherwise) as a temporary solution? Alternatively re-discuss it for 5.0.0.

I do not have context on this area. That is @twsouthwick and @carlossanlop

Typically you can work around this if you open an archive only for create, or only for read. When you open an archive for update, our Zip implementation needs to buffer things to memory since it can't mutate in place. I agree that we should try to fix something here in 6.0.0.
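The workaround above — avoiding Update mode by reading from one archive and writing a fresh one — can be sketched as follows. This uses MemoryStreams for a self-contained example; in practice the source and destination would be FileStreams, and entries to replace would be skipped in the loop and rewritten afterwards.

```csharp
using System;
using System.IO;
using System.IO.Compression;

class Rewrite
{
    static void Main()
    {
        // Build a source archive (stands in for the existing zip on disk).
        using var srcStream = new MemoryStream();
        using (var a = new ZipArchive(srcStream, ZipArchiveMode.Create, leaveOpen: true))
        using (var w = new StreamWriter(a.CreateEntry("a.txt").Open()))
            w.Write("unchanged");
        srcStream.Position = 0;

        // Copy entry-by-entry into a new archive: Read mode streams each entry
        // through a DeflateStream, and Create mode compresses on the fly, so no
        // entry is ever fully buffered uncompressed in memory.
        using var dstStream = new MemoryStream();
        using (var src = new ZipArchive(srcStream, ZipArchiveMode.Read, leaveOpen: true))
        using (var dst = new ZipArchive(dstStream, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (var entry in src.Entries)
            {
                using var input = entry.Open();                        // streaming decompress
                using var output = dst.CreateEntry(entry.FullName).Open();
                input.CopyTo(output);                                  // bounded-memory copy
            }
            // new or replacement entries would be written here the same way
        }
        Console.WriteLine("copied");
    }
}
```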

