I am wondering how to reuse the tar files I manually downloaded from ImageNet earlier.
They are ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar.
I tried setting download=False and pointing data_dir to the directory containing those two files. I also tried REUSE_CACHE_IF_EXISTS. But the builder just starts downloading a new copy of the dataset.
I am using tensorflow-datasets 1.0.2, by the way.
Thank you for any help!
Oh, I should use manual_dir... Sorry, I just missed that part of the documentation.
Hi,
So it seems that setting manual_dir doesn't work. Maybe I'm wrong, but I checked the code, and manual_dir is not checked before the download manager starts downloading?
Yes, right now we try to download ImageNet instead of looking in manual_dir. We should provide a way for the user to skip the download phase if the files already exist. Currently, there is no real way of doing this. As a workaround, you can try copying the ILSVRC2012_img_train.tar file into ~/tensorflow_datasets/downloads.
The file has to be renamed as follows:
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar.INFO
The .INFO file is just a text file containing metadata:
{"dataset_names": ["imagenet2012_corrupted", "imagenet2012"], "original_fname": "ILSVRC2012_img_train.tar", "urls": ["http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_train.tar"]}
And for ILSVRC2012_img_val.tar:
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_x-BqbAuszwbY2-tld9ce__hGc6Xb3VBjOrRPjqBFauA.tar
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_x-BqbAuszwbY2-tld9ce__hGc6Xb3VBjOrRPjqBFauA.tar.INFO
with the info file content:
{"dataset_names": ["imagenet2012"], "original_fname": "ILSVRC2012_img_val.tar", "urls": ["http://www.image-net.org/challenges/LSVRC/2012/nnoupb/ILSVRC2012_img_val.tar"]}
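The renaming steps above can be sketched as a small script. This is only a rough sketch, not something from the library itself: the cache file names, URLs, and .INFO contents are copied from this comment and may differ in other tensorflow-datasets versions, and `stage_manual_downloads` and `CACHE_ENTRIES` are hypothetical names chosen for illustration.

```python
import json
import shutil
from pathlib import Path

# Cache names and metadata copied from the comment above. They are tied to
# that tensorflow-datasets version and are an assumption for any other release.
_BASE_URL = "http://www.image-net.org/challenges/LSVRC/2012/nnoupb/"
_PREFIX = "imag-net.org_chal_LSVR_2012_nnou_ILSV_img_"
CACHE_ENTRIES = {
    "ILSVRC2012_img_train.tar": (
        _PREFIX + "sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar",
        ["imagenet2012_corrupted", "imagenet2012"],
    ),
    "ILSVRC2012_img_val.tar": (
        _PREFIX + "x-BqbAuszwbY2-tld9ce__hGc6Xb3VBjOrRPjqBFauA.tar",
        ["imagenet2012"],
    ),
}


def stage_manual_downloads(src_dir, downloads_dir):
    """Copy the original tars into downloads_dir under their cache names
    and write the matching .INFO files. Returns the staged tar paths."""
    src_dir = Path(src_dir).expanduser()
    downloads_dir = Path(downloads_dir).expanduser()
    downloads_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for original_name, (cache_name, dataset_names) in CACHE_ENTRIES.items():
        target = downloads_dir / cache_name
        if not target.exists():
            # The tars are large; moving or linking instead of copying
            # would also work if disk space is tight.
            shutil.copy2(src_dir / original_name, target)
        info = {
            "dataset_names": dataset_names,
            "original_fname": original_name,
            "urls": [_BASE_URL + original_name],
        }
        (downloads_dir / (cache_name + ".INFO")).write_text(json.dumps(info))
        staged.append(target)
    return staged
```

It could be called as `stage_manual_downloads("/path/to/tars", "~/tensorflow_datasets/downloads")`, where `/path/to/tars` is a placeholder for wherever the two original tar files live.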
Thanks
Yes this helps a lot!
Thanks!! I'll close this issue.
Hey - sorry to reopen this, would it be possible to get it documented explicitly what the expected file folder structure and filenames are supposed to be? (/downloads and /imagenet2012?)
This is what I am calling:
ds_test = tfds.load(name="imagenet2012", split="validation", data_dir='.data', download=True)
I am trying to:
.data. Still couldn't find the right location. Also, is there a reason why we have to rename the file to something different from ILSVRC2012_img_val.tar? torchvision doesn't do this, and it was much easier to get working for this use case (just point to the folder with the tars, and it works).
The structure should be:
# In ~/tensorflow_datasets/downloads/
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar
imag-net.org_chal_LSVR_2012_nnou_ILSV_img_sIIAonqONCGKDlj942sP6Pc7w3f0rOotkWAgV8PKRbs.tar.INFO
I agree that using the original filenames would be simpler. We add the hash to the filename to avoid collisions between file names from different datasets.
But if you already have your data generated in ~/tensorflow_datasets/imagenet2012/3.0.0, can't you just copy the preprocessed files to your S3 bucket directly? Then access those files:
Either directly:
tfds.load('imagenet2012', data_dir='s3://my_bucket/tensorflow_datasets/')
Or, if that doesn't work, by first copying the preprocessed data from S3 to your local instance and then loading it with tfds.load.
This way, it will skip the download_and_prepare step, and directly reuse the pre-processed data.
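For the local-copy route, a minimal sketch might look like the following. It assumes the `<data_dir>/imagenet2012/3.0.0` layout mentioned above (the version directory will differ between releases), and `prepared_dir` and `load_prepared` are hypothetical helper names, not part of tensorflow-datasets. Passing `download=False` to `tfds.load` is meant to make it fail rather than start a download if the prepared data is not found.

```python
from pathlib import Path


def prepared_dir(data_dir, name="imagenet2012", version="3.0.0"):
    """Return <data_dir>/<name>/<version>, the layout mentioned above.

    The version string is an assumption; check what your release generated.
    """
    return Path(data_dir).expanduser() / name / version


def load_prepared(data_dir, split="validation"):
    """Load already-prepared local data without triggering a download."""
    target = prepared_dir(data_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"No prepared data at {target}")
    # Imported lazily so the path check above works without the library.
    import tensorflow_datasets as tfds
    return tfds.load(
        "imagenet2012",
        split=split,
        data_dir=str(Path(data_dir).expanduser()),
        download=False,
    )
```

For example, `load_prepared("~/tensorflow_datasets")` would check for the prepared files first and only then hand the directory to `tfds.load`.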