Datasets: Accessing an already downloaded dataset

Created on 3 Apr 2019 · 11 Comments · Source: tensorflow/datasets

How do I access a previously downloaded and extracted dataset?

I downloaded the Open Images V4 dataset with the following code:

import tensorflow_datasets as tfds
import tensorflow as tf
open_images_dataset = tfds.image.OpenImagesV4()
open_images_dataset.download_and_prepare(download_dir="/notebooks/dataset/")

This gave me the following folder structure:
[image: screenshot of the folder structure, not preserved]
Under extracted I have the following structure:
[image: screenshot of the "extracted" folder contents, not preserved]
These folders contain the actual image files (.jpg).

Now, how do I access them with the TensorFlow Datasets API? When using:
tfds.load(name="open_images_v4", data_dir="/notebooks/open_images_dataset/extracted/", download=False)
I only get the following error message:
AssertionError: Dataset open_images_v4: could not find data in /notebooks/open_images_dataset/extracted/. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
This also happens when choosing the parent folder of "extracted". The documentation doesn't really help in this regard.
When I call download_and_prepare() again, it only starts to download the whole dataset again, which is quite impractical because of its ~570 GB file size.

Most helpful comment

There are two main directories:

  • data_dir: Where the final dataset is installed. Defaults to /home/<user>/tensorflow_datasets/
  • download_dir: Where the original raw files are downloaded and extracted.

Here you have downloaded the intermediate files to your big drive, but installed the final dataset onto your home drive.
To download and install the files in the same directory, you should set only data_dir and not download_dir:

builder = tfds.image.OpenImagesV4(data_dir='/raid/tensorflow_datasets')
builder.download_and_prepare()

or

ds = tfds.load('open_images_v4', data_dir='/raid/tensorflow_datasets')

It will download the files into /raid/tensorflow_datasets/download and install the dataset into /raid/tensorflow_datasets/open_images_v4. You can also set both data_dir and download_dir to re-use your previous download/extract dir.
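The last point (setting both directories so that an earlier download is reused) could look roughly like the sketch below. The helper names `expected_layout` and `prepare_on_big_drive` are hypothetical, and the `download` subfolder name is taken from the comment above rather than checked against a specific TFDS version.

```python
import os


def expected_layout(data_dir, dataset_name):
    """Return the paths described above; nothing is created on disk."""
    return {
        # Subfolder name as stated in the comment; it may differ by version.
        "download_dir": os.path.join(data_dir, "download"),
        "dataset_dir": os.path.join(data_dir, dataset_name),
    }


def prepare_on_big_drive(dataset_name, data_dir, download_dir=None):
    """Install the final dataset under data_dir, optionally reusing raw files."""
    # Imported lazily so the path helper works without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(dataset_name, data_dir=data_dir)
    # download_dir=None lets TFDS choose its default location under data_dir.
    builder.download_and_prepare(download_dir=download_dir)
    return builder


print(expected_layout("/raid/tensorflow_datasets", "open_images_v4"))
```

To reuse the raw files from the question, one would call `prepare_on_big_drive("open_images_v4", "/raid/tensorflow_datasets", download_dir="/raid/openimages/dataset")`, assuming that directory still holds the earlier download.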

If this still doesn't work, can you paste the full logs in a gist?

All 11 comments

Thank you for the feedback about the documentation.
I think you have given a wrong path. You should pass an extract_dir to select the extracted dataset. Please try extract_dir="./notebooks/open_images_dataset/extracted/".

After preprocessing, the dataset is saved in ~/tensorflow_datasets/<dataset_name>. The download_dir you provided is only used during the download_and_prepare phase to tell the library where to download and extract the raw data (defaults to ~/tensorflow_datasets/download). It does not correspond to where the final dataset files (tfrecord files) are stored after preprocessing.

To answer your question, you should provide the same data_dir as when you created your data.

builder = tfds.image.OpenImagesV4()  # data_dir=None (default to ~/tensorflow_datasets/open_images_v4)
builder.download_and_prepare(download_dir="/notebooks/dataset/")

To reuse:

ds = tfds.load(name="open_images_v4")  # data_dir=None (reuse default)

The download_dir isn't used after the data has been generated a first time (unless you want to regenerate the data afterwards).
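Before calling tfds.load with download=False, it can help to check that prepared files really exist under the data_dir you are about to pass. The following is a rough sketch, assuming the usual <data_dir>/<dataset_name>/<version>/ layout (datasets with configs add one more directory level, which this ignores). `prepared_versions` and `load_prepared` are hypothetical helpers, not part of TFDS.

```python
import os


def prepared_versions(data_dir, dataset_name):
    """List the non-empty subdirectories of <data_dir>/<dataset_name>."""
    root = os.path.join(os.path.expanduser(data_dir), dataset_name)
    if not os.path.isdir(root):
        return []
    return sorted(
        entry
        for entry in os.listdir(root)
        if os.path.isdir(os.path.join(root, entry))
        and os.listdir(os.path.join(root, entry))
    )


def load_prepared(dataset_name, data_dir="~/tensorflow_datasets"):
    """Load an already prepared dataset without triggering a download."""
    if not prepared_versions(data_dir, dataset_name):
        raise FileNotFoundError(
            f"No prepared data for {dataset_name!r} under {data_dir!r}"
        )
    # Imported lazily so the directory check works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(
        dataset_name, data_dir=os.path.expanduser(data_dir), download=False
    )
```

An empty result from `prepared_versions` for the directory you intend to load from suggests the final dataset was written somewhere else, which is the situation described in this thread.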

So I tried downloading the whole dataset again. The download finished completely. However, the extraction process stopped at the last step and I got a ResourceExhaustedError saying "No space left on the device". This doesn't make sense, as there are 5.5 terabytes of free space left on the drive that I downloaded the data to.
What's also weird is that the directory that is mentioned in the error ("/home/my_username/tensorflow_datasets/open_images_v4/") is completely different to the one I downloaded the data to ("/raid/openimages/dataset"). Are there two different directories for extraction and for download? If so how do I change the extraction directory (I don't have enough space under /home/...)?
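One way to see why the error can appear despite free space on the download drive is to compare the free space of the filesystem behind each directory. The sketch below uses only the standard library. `free_gib` is a hypothetical helper, and the two paths are the ones mentioned in this comment.

```python
import os
import shutil


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds (or would hold) path."""
    path = os.path.abspath(os.path.expanduser(path))
    # Walk up to the nearest existing ancestor so missing directories work too.
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    return shutil.disk_usage(path).free / 2**30


for label, directory in [
    ("data_dir", "~/tensorflow_datasets"),
    ("download_dir", "/raid/openimages/dataset"),
]:
    print(f"{label}: {free_gib(directory):.1f} GiB free ({directory})")
```

If the two numbers differ widely, the directories sit on different filesystems, and the smaller one is the likely source of the "No space left" error.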

To answer your question, you should provide the same data_dir as when you created your data.

builder = tfds.image.OpenImagesV4()  # data_dir=None (default to ~/tensorflow_datasets/open_images_v4)
builder.download_and_prepare(download_dir="/notebooks/dataset/")

To reuse:

ds = tfds.load(name="open_images_v4")  # data_dir=None (reuse default)

The download_dir isn't used after the data has been generated a first time (unless you want to regenerate the data afterwards).

Also, neither of these approaches worked. I think it's because of the missing extraction. Is it possible to manually extract the downloaded data? If so, is there a folder hierarchy I have to adhere to so that TensorFlow can work with the data?

Thank you for the feedback about the documentation.
I think you have given a wrong path. You should pass an extract_dir to select the extracted dataset. Please try extract_dir="./notebooks/open_images_dataset/extracted/".

This didn't work either (there seems to be no extract_dir parameter).

There are two main directories:

  • data_dir: Where the final dataset is installed. Defaults to /home/<user>/tensorflow_datasets/
  • download_dir: Where the original raw files are downloaded and extracted.

Here you have downloaded the intermediate files to your big drive, but installed the final dataset onto your home drive.
To download and install the files in the same directory, you should set only data_dir and not download_dir:

builder = tfds.image.OpenImagesV4(data_dir='/raid/tensorflow_datasets')
builder.download_and_prepare()

or

ds = tfds.load('open_images_v4', data_dir='/raid/tensorflow_datasets')

It will download the files into /raid/tensorflow_datasets/download and install the dataset into /raid/tensorflow_datasets/open_images_v4. You can also set both data_dir and download_dir to re-use your previous download/extract dir.

If this still doesn't work, can you paste the full logs in a gist?

Thanks for the solution. Using the data_dir parameter did work.

Great. We should probably clarify the difference between download_dir and data_dir in our docs. Closing this, but feel free to re-open if this didn't solve your problem.

~/.keras/datasets is the path for the latest TF versions.

How do I use a local dataset with tfds?

How do I use a local dataset with tfds?

Do you have a more specific question? What have you tried?
I would recommend starting with https://www.tensorflow.org/datasets/overview and https://www.tensorflow.org/datasets/add_dataset.

@bm777 Maybe something like tfds.builder?

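from os import path
import tensorflow_datasets.public_api as tfds

# MY_DATASET_NAME is a placeholder for the registered name of your dataset.
# data_dir points at the directory that already holds the prepared files.
builder = tfds.builder(MY_DATASET_NAME, data_dir=path.join(path.expanduser("~"), ".keras", "datasets"))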

More on usage here: https://www.tensorflow.org/datasets/overview#tfdsbuilder
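Expanding on the snippet above, here is a hedged sketch of reading from a builder whose data has already been prepared. It assumes the dataset name is registered with TFDS and that its files live under the given data_dir. `keras_datasets_dir` and `load_from_builder` are hypothetical names.

```python
from os import path


def keras_datasets_dir():
    """The ~/.keras/datasets directory mentioned above, expanded for this user."""
    return path.join(path.expanduser("~"), ".keras", "datasets")


def load_from_builder(dataset_name, data_dir=None, split="train"):
    """Return a tf.data.Dataset for one split of an already prepared dataset."""
    # Imported lazily so the path helper works without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(dataset_name, data_dir=data_dir or keras_datasets_dir())
    # as_dataset only reads prepared files; it fails if they are missing.
    return builder.as_dataset(split=split)
```

Because `as_dataset` never downloads anything, a failure here points to a wrong data_dir or to a dataset that was never prepared in that location.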
