Right now corefx supports zip files as well as gz files. Would it be hard to get it to support tar files as well, for compatibility with other OSes that very often package files as tgz?
A C# implementation already exists at https://code.google.com/p/tar-cs/ and could be used either as a guideline or directly imported.
If this is something that is desired I could work on designing an API, but it shouldn't be hard to visualize what it might look like.
> Right now corefx supports zip files as well as gz files. Would it be hard to get it to support tar files as well, for compatibility with other OSes that very often package files as tgz?
We don't have any existing tar code to leverage, if that's what you mean. Tar is a pretty different format from Zip (particularly since it doesn't compress) so we would need to start mostly from scratch. That's not to say it isn't worthwhile, though. I'd love to be able to handle all popular compression formats.
I'm not sure though where we would even want to put this if we did add it. System.IO.Compression only kind of makes sense since a tar doesn't compress. I guess FileSystem? tar is so frequently associated with gzip it seems incorrect to not place it alongside it.
> If this is something that is desired I could work on designing an API, but it shouldn't be hard to visualize what it might look like.
In my opinion it would be ideal for it to be as similar to ZipArchive as possible.
@ericstj @jasonwilliams200OK
Thanks @ianhays.
From dotnet/corefx#9673:
Also .bz2 if possible (http://www.bzip.org/1.0.3/html/zlib-compat.html). :)
Usually bz2 is the compressed format which contains a tarball (as bz2 only compresses one file: https://en.wikipedia.org/wiki/Bzip2). We can probably use the same API methods to support bz2 (except for some format specific settings).
This way, tarball expansion / contraction might make sense in S.I.C as part of bz2 (or even zip) compression / decompression.
With .NET running xplat it would be nice to be able to use it with the compression/archive formats common to systems other than Windows. Tar is an archival format popularly used in Unix alongside some sort of single-file compression, and we don't currently have a way of dealing with it. I suggest we add an API for Tar that is similar to ZipArchive, as well as some extension methods similar to ZipFile. I'll focus on the former in this issue.
Tar has been around for a while. A long while. As a result, there are a few different accepted formats that aren't all compatible with each other. Most programs will detect the format of a tar and deal with it accordingly when de-archiving, but they usually will only archive in one format. It's reasonable for us to do the same, though an expansion point in the future would be to allow archiving in multiple formats for potential compat reasons. Unix Tar allows archiving in multiple formats, for example, but will default to the GNU tar format (though this is supposed to change in the next version).
To start, the API should mirror ZipArchive/ZipArchiveEntry.
public class TarArchive : IDisposable {
    public TarArchive(Stream stream);
    public TarArchive(Stream stream, bool leaveOpen);
    public ReadOnlyCollection<TarArchiveEntry> Entries { get; }
    public TarArchiveEntry CreateEntry(string entryName);
    public void Dispose();
    public TarArchiveEntry GetEntry(string entryName);
}
public class TarArchiveEntry {
    public TarArchiveEntry();
    public TarArchive Archive { get; }
    public string FullName { get; }
    public DateTimeOffset LastWriteTime { get; set; }
    public long Length { get; }
    public string Name { get; }
    public void Delete();
    public Stream Open();
}
The above API works with every format. We could also consider adding additional TarArchiveEntry properties for expanded metadata available only in newer formats, or we could make subclasses e.g. UStarTarArchiveEntry. I'd say that's more of an expansion point than something we should do straight away, however.
Tar doesn't have a central directory for entries like Zip does; entries are placed sequentially in the archive with no indexing. This makes finding a particular entry time-consuming, but it also means that implementing the format is comparatively simple. The lack of compression of individual files also greatly simplifies implementing a TarArchive class. The difficulty primarily comes from the different formats requiring very different measures for the more complicated features, e.g. sparse files in GNU or metadata files in PAX. There are also some more specific rules for edge cases like duplicate entries that we'll need robust tests to validate.
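To make the "sequential, no index" point concrete, here is a rough sketch of walking an archive front to back. It assumes the classic 512-byte header block with the name in bytes 0-99 and the size as octal ASCII in bytes 124-135; `ReadEntries` and `ReadFully` are hypothetical helpers made up for illustration, not anything in corefx. It ignores checksums, long names, and the GNU/PAX extensions entirely.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Build a one-entry archive in memory (checksum and other header fields left blank).
var archive = new MemoryStream();
var head = new byte[512];
Encoding.ASCII.GetBytes("hello.txt").CopyTo(head, 0);
Encoding.ASCII.GetBytes(Convert.ToString(5, 8).PadLeft(11, '0')).CopyTo(head, 124);
archive.Write(head, 0, 512);
var data = new byte[512];                      // entry data is padded to a full block
Encoding.ASCII.GetBytes("hello").CopyTo(data, 0);
archive.Write(data, 0, 512);
archive.Write(new byte[1024], 0, 1024);        // two zero blocks end the archive
archive.Position = 0;

foreach (var (entryName, entrySize) in ReadEntries(archive))
    Console.WriteLine($"{entryName}: {entrySize} bytes");

// Hypothetical helper: yields (name, size) for each entry, reading forward only.
IEnumerable<(string Name, long Size)> ReadEntries(Stream stream)
{
    var header = new byte[512];
    var scratch = new byte[512];
    while (ReadFully(stream, header, 512))
    {
        // An all-zero block marks the end of the archive.
        if (Array.TrueForAll(header, b => b == 0))
            yield break;

        int nameLen = Array.IndexOf(header, (byte)0, 0, 100);
        if (nameLen < 0) nameLen = 100;        // name fills the whole field
        string name = Encoding.ASCII.GetString(header, 0, nameLen);

        string sizeText = Encoding.ASCII.GetString(header, 124, 12).Trim('\0', ' ');
        long size = sizeText.Length == 0 ? 0 : Convert.ToInt64(sizeText, 8);

        yield return (name, size);

        // No Seek: skip the data by reading it, so non-seekable streams work too.
        long padded = (size + 511) / 512 * 512;
        for (long skipped = 0; skipped < padded; skipped += 512)
            if (!ReadFully(stream, scratch, 512))
                throw new EndOfStreamException("Archive ended inside entry data.");
    }
}

// Hypothetical helper: Stream.Read may return short reads, so loop until full.
bool ReadFully(Stream stream, byte[] buffer, int count)
{
    int total = 0;
    while (total < count)
    {
        int n = stream.Read(buffer, total, count - total);
        if (n == 0) return false;
        total += n;
    }
    return true;
}
```

Finding one entry by name with this approach means reading past every entry before it, which is the cost described above.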
My plan of attack is to split the work into chunks with each chunk being "up for grabs":
@ianhays I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work.
https://github.com/Microsoft/Docker-PowerShell/tree/master/src/Tar
I should also note that tar archives are fundamentally different from zip archives in that they are stream-oriented and do not contain a central directory of files. This means that both TarArchive.GetEntry and TarArchiveEntry.Open are unnatural: to implement either, you have to require a seekable stream, or you have to buffer the contents of the entire archive into memory or a temporary file (which obviously is uncompetitive from a performance perspective). And it's not realistic to require a seekable stream, since you'll want to support decompressing and extracting tar.gz files in one pass, and decompressors such as GZipStream are not seekable.
The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar.
> And it's not realistic to require a seekable stream, since you'll want to support decompressing and extracting tar.gz files in one pass, and decompressors such as GZipStream are not seekable.
In my above comment (under "Implementation") I was operating under the tentative plan that Entry indexing would require a seekable stream or throw an exception if it wasn't, but as you said this isn't likely to be frequently done since it will nearly always be wrapped in a GZip or LZMA stream. It may be worth adding anyways to cover edge cases, but I doubt it. Enumerating entries would be the preferred way of reading the archive.
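For illustration, the "index if seekable, otherwise throw" idea might look roughly like the sketch below. `BuildIndex` is a hypothetical helper, not a proposed API; it assumes the same basic 512-byte header layout (name in bytes 0-99, octal size in bytes 124-135) and assumes that a later entry with a duplicate name replaces the earlier one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// GZipStream reports CanSeek == false, so the hypothetical indexer refuses it.
using var gzip = new GZipStream(new MemoryStream(), CompressionMode.Decompress);
try
{
    BuildIndex(gzip);
}
catch (NotSupportedException e)
{
    Console.WriteLine(e.Message);
}

// Hypothetical helper: maps entry name -> (offset of its data, size in bytes).
Dictionary<string, (long Offset, long Size)> BuildIndex(Stream stream)
{
    if (!stream.CanSeek)
        throw new NotSupportedException("Indexing entries requires a seekable stream.");

    var index = new Dictionary<string, (long Offset, long Size)>();
    // BinaryReader.ReadBytes keeps reading until it has the requested count or hits EOF.
    using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);
    while (true)
    {
        byte[] header = reader.ReadBytes(512);
        if (header.Length < 512 || Array.TrueForAll(header, b => b == 0))
            break;                              // truncated input or end-of-archive block

        int nameLen = Array.IndexOf(header, (byte)0, 0, 100);
        if (nameLen < 0) nameLen = 100;
        string name = Encoding.ASCII.GetString(header, 0, nameLen);

        string sizeText = Encoding.ASCII.GetString(header, 124, 12).Trim('\0', ' ');
        long size = sizeText.Length == 0 ? 0 : Convert.ToInt64(sizeText, 8);

        // Assumption: for duplicate names, the last entry wins.
        index[name] = (stream.Position, size);

        // Jump over the data, which is padded to a multiple of 512 bytes.
        stream.Seek((size + 511) / 512 * 512, SeekOrigin.Current);
    }
    return index;
}
```

Even on a seekable stream this still has to visit every header once, so it only pays off when several entries are opened by name afterwards.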
> The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar.
The nice thing about not having a common parent with ZipArchive is that we can diverge the interface where it's necessary. While it would be ideal to have the API be similar, it isn't required. That said, I think we can at least keep the TarArchive/TarArchiveEntry structure if we just make some tweaks.
> I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work.
Thanks @jstarks, that looks very close to what I had in mind with the exception of some minor API differences (e.g. a unified TarArchive class rather than a TarReader/TarWriter, IEnumerable Entries, disposable TarArchive) and of course the removal of indexing entries. Regardless of the structure though, the implementation looks good and could be easily adapted into the code base with some minor tweaks.
Per discussion with @ianhays, a conservative estimate for all this is 5 weeks; if it were forward-only, it would take less time.
+1 on requesting this support in .NET Core. This would be super helpful for any cloud service that's trying to untar developer-uploaded files from a Linux machine. We are one of them (we use .NET Core). We would really prefer to avoid picking third-party libraries or resorting to running a shell process to achieve this. This becomes more crucial because file permissions on zip archives are not set correctly for files archived on Linux. The fact that neither this nor zip is fully functional out of the box on Linux makes it hard to support the Linux platform on our service in a clean way.
Triage:
We are interested in doing this. We would also like to add tarball (tar.gz) support, and have scenario parity with ZipFile.
Next step: Finish the API design and bring it for review.
As the author of SharpCompress, my key wish for this and Zip is to make the API stream-oriented and not require a seekable stream.
Then I can base my code on yours or not support it altogether!
Transferring to the dotnet/runtime repo.