Datasets: Reuse downloaded ImageNet dataset .tar files

Created on 5 May 2019 · 6 Comments · Source: tensorflow/datasets

I am wondering how to reuse the tar files I previously downloaded manually from ImageNet.
They are ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar.

I tried setting download=False and pointing data_dir at the directory containing those two files. I also tried REUSE_CACHE_IF_EXISTS. But the builder just starts downloading a new copy of the dataset.

I am using tensorflow-datasets 1.0.2, by the way.

Thank you for any help!

help

All 6 comments

Oh, I should use manual_dir... Sorry, I just missed that part of the documentation.

Hi,
So it seems that setting manual_dir doesn't work. Maybe I'm wrong, but I checked the code, and manual_dir is not checked before the download manager starts downloading?

Yes, for now we try to download ImageNet instead of looking in the manual_dir. We should provide a way for the user to skip the download phase if the files already exist. Currently, there is no real way of doing this. You can try copying the ILSVRC2012_img_train.tar file into ~/tensorflow_datasets/downloads

The file has to be renamed as follows:

imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar.INFO

The .INFO file is just a text file containing metadata:

{"dataset_names": ["imagenet2012_corrupted", "imagenet2012"], "original_fname": "ILSVRC2012_img_train.tar", "urls": ["http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_train.tar"]}

And for ILSVRC2012_img_val.tar:

imag-net.org_chal_LSVR_2012_nnou_ILSV_img_x-BqbAuszwbY2-tld9ce__hGc6Xb3VBjOrRPjqBFauA.tar
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_x-BqbAuszwbY2-tld9ce__hGc6Xb3VBjOrRPjqBFauA.tar.INFO

with the info file content:

{"dataset_names": ["imagenet2012"], "original_fname": "ILSVRC2012_img_val.tar", "urls": ["http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_val.tar"]}

Thanks
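The renaming steps described above can be sketched in Python as follows. This is only a minimal illustration, not part of tensorflow-datasets: the helper `stage_imagenet_tar` and the `CACHE_ENTRIES` table are hypothetical names, while the hashed file names and .INFO contents are copied from the comment above. Whether a given tfds version accepts the staged files is not verified here.

```python
import json
import shutil
from pathlib import Path

# Hashed names and .INFO fields are copied from the comment above.
_URL_BASE = "http://www.image-net.org/challenges/LSVRC/2012/nnoupb/"
_PREFIX = "imag-net.org_chal_LSVR_2012_nnou_ILSV_img_"

# Hypothetical lookup table: original file name -> cached name and metadata.
CACHE_ENTRIES = {
    "ILSVRC2012_img_train.tar": {
        "cached_name": _PREFIX + "sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar",
        "dataset_names": ["imagenet2012_corrupted", "imagenet2012"],
    },
    "ILSVRC2012_img_val.tar": {
        "cached_name": _PREFIX + "x-BqbAuszwbY2-tld9ce__hGc6Xb3VBjOrRPjqBFauA.tar",
        "dataset_names": ["imagenet2012"],
    },
}


def stage_imagenet_tar(tar_path, downloads_dir):
    """Copy one manually downloaded tar into downloads_dir under its cached
    name and write the matching .INFO file. Returns the copied tar's path."""
    tar_path = Path(tar_path)
    entry = CACHE_ENTRIES[tar_path.name]  # raises KeyError for unknown files
    downloads_dir = Path(downloads_dir).expanduser()
    downloads_dir.mkdir(parents=True, exist_ok=True)

    target = downloads_dir / entry["cached_name"]
    shutil.copyfile(tar_path, target)

    info = {
        "dataset_names": entry["dataset_names"],
        "original_fname": tar_path.name,
        "urls": [_URL_BASE + tar_path.name],
    }
    # The .INFO file sits next to the tar, with ".INFO" appended to its name.
    Path(str(target) + ".INFO").write_text(json.dumps(info))
    return target
```

For example, `stage_imagenet_tar("ILSVRC2012_img_val.tar", "~/tensorflow_datasets/downloads")` would stage the validation archive. For the much larger train tar, moving the file instead of copying it may be preferable.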

Yes, this helps a lot!
Thanks!! I'll close this issue.

Hey - sorry to reopen this. Would it be possible to document explicitly what the expected folder structure and filenames are supposed to be (/downloads and /imagenet2012)?

This is what I am calling:

ds_test = tfds.load(name="imagenet2012", split="validation", data_dir='.data', download=True)

I am trying to:

  • Get the ImageNet files (validation and test tars) from an S3 bucket to my instance in the expected location (without downloading afresh each time). Actually, I am just using the validation tar because I am doing evaluation only; ideally I would rather not copy the training dataset across.
  • I tried renaming the validation tar to what was suggested above in this issue and putting it in a downloads folder within .data. I still couldn't find the right location.

Also, is there a reason why we have to rename the file to something other than ILSVRC2012_img_val.tar? torchvision doesn't do this, and it was much easier to get working for this use case (just point to the folder with the tars, and it works).

The structure should be:

# In ~/tensorflow_datasets/downloads/
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar.INFO

I agree that using the original filenames would be simpler. We add the hash to the filename to avoid collisions between file names from different datasets.
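To illustrate the collision argument, here is a small sketch. It is not the naming scheme tfds actually uses: the `cache_name` helper, the example.com URLs, and the choice of a truncated SHA-256 of the URL are all assumptions made purely for illustration.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def cache_name(url):
    """Illustrative only: file stem + short hash of the full URL + extension."""
    path = PurePosixPath(urlparse(url).path)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{path.stem}_{digest}{path.suffix}"


# Two hypothetical datasets that both ship a file called train.tar.
name_a = cache_name("http://example.com/dataset_a/train.tar")
name_b = cache_name("http://example.com/dataset_b/train.tar")

# Both keep a readable "train_" prefix, but the hash part differs, so the
# two archives can share one downloads folder without overwriting each other.
print(name_a)
print(name_b)
```

With plain original names, both archives would be stored as train.tar and the second download would overwrite the first.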


But if you already have the data generated in ~/tensorflow_datasets/imagenet2012/3.0.0, can't you just copy the preprocessed files to your S3 bucket directly? You could then access those files:

Either directly:

tfds.load('imagenet2012', data_dir='s3://my_bucket/tensorflow_datasets/')

Or, if that doesn't work, by first copying the preprocessed data from S3 to your local instance and then loading it with tfds.load.

This way, it will skip the download_and_prepare step, and directly reuse the pre-processed data.
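A minimal sketch of the local variant, assuming the preprocessed files have already been copied to a local data_dir. The helpers `prepared_dir` and `load_prepared` are hypothetical, and the version "3.0.0" is taken from the path mentioned above, so it may differ for other tfds releases. `download=False` is the tfds.load argument that tells it not to run download_and_prepare.

```python
from pathlib import Path


def prepared_dir(data_dir, name="imagenet2012", version="3.0.0"):
    """Return <data_dir>/<name>/<version>, the layout mentioned above."""
    return Path(data_dir).expanduser() / name / version


def load_prepared(data_dir, split="validation", name="imagenet2012", version="3.0.0"):
    """Load already-prepared data, failing early if the files are missing."""
    path = prepared_dir(data_dir, name, version)
    if not path.is_dir() or not any(path.iterdir()):
        raise FileNotFoundError(f"No prepared data found in {path}")

    # Imported lazily so the directory check works without tfds installed.
    import tensorflow_datasets as tfds

    # download=False: expect the data to be in data_dir, do not prepare it.
    return tfds.load(
        name,
        split=split,
        data_dir=str(Path(data_dir).expanduser()),
        download=False,
    )
```

The directory check only works for local paths. For the s3:// variant, the tfds.load call would have to be made directly, as in the comment above.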
