Untar_data function gives an error "not a gzip file" when trying to download and untar a file.
Describe the bug
To Reproduce
untar_data('https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz')
untar_data('http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz')
untar_data('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz')
Expected behavior
I expect the file to be downloaded and untarred
Screenshots
Additional context
Getting an error that says "not a gzip file"
What is going on here
Here I will explain the current behavior of untar_data. Whether that behavior is desirable is a separate question; anyone with an opinion can share it here.
Here is an example of the use of untar_data in the lesson 1 notebook:
path = untar_data(URLs.PETS)
From the name of the variable, it is natural to assume that URLs.PETS is an actual url. The documentation of untar_data shows:
untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True)
"Download url if it doesn't exist to fname and un-tgz to folder dest"
So it also suggests that the first argument of untar_data is a url.
If you look into the exact content of URLs.PETS, which is simply a constant field of the class URLs, you will find that it is defined as
'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet'
However, if you try to visit that url or download the file with a command like wget, you will find that the url is invalid. The actual address is
'https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz'
with .tgz added at the end.
What I find confusing is that, although both the name URLs and the docs of untar_data suggest that its first argument is a url, URLs.PETS is not actually a valid url. Digging a bit deeper, I found the following lines in the source code of the download_data function, which is called by untar_data:
    if not fname.exists():
        print(f'Downloading {url}')
        download_url(f'{url}.tgz', fname)
So it is clear that the url variable here is not a true url, but one with .tgz missing from the end. We can also infer that the url parameter of untar_data, and every field of the URLs class, is the true url with .tgz removed from the end.
Now, going back to your post: to download any dataset whose url ends with .tgz, simply remove the .tgz and pass the result to untar_data, and it should work. Datasets in other formats are currently not supported by untar_data because, regardless of the dataset url and its file format, it always appends .tgz when downloading the data.
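To make the workaround concrete, here is a minimal sketch. Both helpers are hypothetical names introduced for illustration and are not part of fastai: requested_url only mirrors the f'{url}.tgz' expression quoted above, and strip_tgz removes the suffix so that the appended one restores the real address.

```python
def requested_url(url: str) -> str:
    """Mirror the f'{url}.tgz' expression from the quoted download_data lines."""
    return f'{url}.tgz'


def strip_tgz(url: str) -> str:
    """Drop a trailing '.tgz' so the suffix appended later restores the real address."""
    suffix = '.tgz'
    return url[:-len(suffix)] if url.endswith(suffix) else url


full_url = 'https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz'

# Passing the full url means the download targets an address ending in '.tgz.tgz'.
print(requested_url(full_url))

# Stripping the suffix first makes the download target the real address again.
print(requested_url(strip_tgz(full_url)))

# With fastai installed, the call would then be (assumption, not run here):
# path = untar_data(strip_tgz(full_url))
```

Note that this only helps for .tgz files. A url such as the food-101.tar.gz one is left unchanged by strip_tgz and would still get .tgz appended.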
Closing this. @odysseus0 you should put all that paragraph in a PR for the docs of untar_data!
@sgugger Do you think anything should change about the fact that the URLs class does not contain full urls, and that untar_data only supports tgz file urls, in a non-intuitive way?
Should I create a PR for such feature recommendation?
Sorry. I am still not very familiar with the general guidelines of contributing to open source.
untar_data is supposed to be an internal function with the fastai datasets.
Maybe you could create a new function that is more general? download_url already works for everything; I guess what's mostly missing is an uncompress function that can deal with any format, but that's not an easy feature.
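As a rough starting point for such a function, here is a sketch that uses only the Python standard library. The names uncompress and download_and_uncompress are hypothetical and not part of fastai, and the sketch assumes the format can be inferred from the file name. It covers the formats shutil.unpack_archive recognises (.zip, .tar, .tar.gz/.tgz, .tar.bz2, .tar.xz) plus a bare .gz file, and nothing else.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def uncompress(archive, dest):
    """Extract `archive` into the folder `dest`, choosing the method from the file name."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    name = archive.name
    if name.endswith('.gz') and not name.endswith('.tar.gz'):
        # A bare .gz holds a single compressed file rather than a tar archive.
        with gzip.open(archive, 'rb') as src, open(dest / archive.stem, 'wb') as out:
            shutil.copyfileobj(src, out)
    else:
        # unpack_archive infers the format from the extension and raises
        # shutil.ReadError for extensions it does not know.
        shutil.unpack_archive(str(archive), str(dest))
    return dest


def download_and_uncompress(url, dest):
    """Download `url` into `dest` if it is not there yet, then extract it in place."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit('/', 1)[-1]  # file name taken from the full url
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    return uncompress(archive, dest)
```

Unlike untar_data, this takes the full url, extension included. It does no progress reporting or checksum verification, and it trusts the contents of the archive, so it should only be pointed at sources you trust.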
That makes sense to me. I will start experimenting with it.
However, another issue is that the way untar_data and the URLs class are set up gives people the misconception that untar_data can take any url pointing to a compressed file as an argument. At the very minimum, we can add some notes to the docs of untar_data; ideally, we should also rename the URLs class to something like FastaiDatasets, so that it is clear that untar_data is only an internal function for now and should only be used with fastai datasets.
I know this is really minor and you probably have far more important things to look after. In that case, should I simply start a PR for it?