Datasets: `download_and_prepare` redownloads data every time when developing a new dataset

Created on 4 Mar 2019  ·  5 Comments  ·  Source: tensorflow/datasets

Short description
When developing a new dataset, it seems that download_and_prepare redownloads the data every time.

Environment information

  • Operating System: Ubuntu
  • Python version: 3.6
  • tensorflow-datasets/tfds-nightly version: tfds-nightly==1.0.1.dev201903040105
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow==1.13.1

Reproduction instructions

# Change the MNIST class name in the code so that it seems like it's a new dataset
# class MNISTZ

# Run once to see the data downloading, and kill it before it finishes
# Then run it again and see that it redownloads
python -m tensorflow_datasets.scripts.download_and_prepare --datasets=mnistz

Expected behavior
The downloads to be cached.

bug

All 5 comments

Looking at the code, it seems that the sha256 hashes are only stored after the dataset build completes, so an interrupted or in-development build re-downloads the files every time. It would be very nice if this were relaxed, at least for development.

We could simply add the sha to the INFO file.
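As a rough illustration of that idea, here is a minimal sketch that records the checksum next to the file as soon as the download finishes, instead of waiting for the whole build. The `.INFO` sidecar layout, the JSON format, and the helper names (`record_checksum`, `is_cached`) are assumptions made for this example, not the actual tfds implementation.

```python
import hashlib
import json
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(path):
    """Write a hypothetical '<path>.INFO' sidecar holding the file's sha256."""
    info_path = path + ".INFO"
    with open(info_path, "w") as f:
        json.dump({"sha256": sha256_of_file(path)}, f)
    return info_path


def is_cached(path):
    """Reuse a file only if its sidecar exists and the recorded hash still matches."""
    info_path = path + ".INFO"
    if not (os.path.exists(path) and os.path.exists(info_path)):
        return False
    with open(info_path) as f:
        recorded = json.load(f).get("sha256")
    return recorded == sha256_of_file(path)


with tempfile.TemporaryDirectory() as tmp:
    payload = os.path.join(tmp, "train-images.gz")
    with open(payload, "wb") as f:
        f.write(b"fake download contents")
    before = is_cached(payload)   # no sidecar yet, so not reusable
    record_checksum(payload)      # done right after the download completes
    after = is_cached(payload)    # sidecar present and hash matches
```

With something like this, a build that is killed after the download but before completion would still leave a verifiable file behind.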

This is really slowing me down :(

The current hack is to locally change the condition at https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/download/download_manager.py#L210.
From:

if (not self._force_download and resource.sha256 and
    resource.exists_locally()):

To:

if not self._force_download and resource.exists_locally():

But this has side effects elsewhere (for example, checksums are no longer computed).
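To make the difference between the two conditions concrete, here is a small self-contained sketch. `FakeResource` and the two helper functions are hypothetical stand-ins written for this example; they only mimic the shape of the condition quoted above and are not the real `download_manager.py` code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResource:
    """Hypothetical stand-in for a download resource; not the tfds class."""
    sha256: Optional[str]  # None until a checksum has been recorded
    on_disk: bool

    def exists_locally(self) -> bool:
        return self.on_disk


def can_reuse_original(resource, force_download=False):
    # Mirrors the original condition: a recorded checksum is required.
    return bool(not force_download and resource.sha256
                and resource.exists_locally())


def can_reuse_hacked(resource, force_download=False):
    # Mirrors the local hack: any file already on disk is reused.
    return bool(not force_download and resource.exists_locally())


# A dataset under development: the file is on disk, but no checksum is known yet.
in_development = FakeResource(sha256=None, on_disk=True)
print(can_reuse_original(in_development))  # False, so the file is downloaded again
print(can_reuse_hacked(in_development))    # True, so the file is reused unverified
```

The second result is the reason the hack speeds up development, and also the reason it is risky: the reused file is never checked against a checksum.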

Waiting for @pierrot0 for a more stable fix.

Is the proposition to cache on the builder level, or the data_dir level? Is there any reason mnistz from above would not want to use downloaded data from mnist, given the url is the same?

The proposition is to cache globally.
Patch will land soon.
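The thread does not say how the global cache would be implemented. One way to picture it, purely as an assumption, is a shared download directory where the file name is derived from the URL, so two builders pointing at the same URL resolve to the same file. The function name, the cache root, and the example URL below are all hypothetical.

```python
import hashlib
import os


def global_cache_path(url, cache_root):
    """Map a URL to a path that does not depend on which dataset requested it."""
    url_digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_root, url_digest)


# Hypothetical URL and cache root, used only for illustration.
url = "https://example.com/train-images.gz"
cache_root = os.path.join("downloads", "global")

mnist_path = global_cache_path(url, cache_root)   # requested by the mnist builder
mnistz_path = global_cache_path(url, cache_root)  # requested by the renamed mnistz builder

print(mnist_path == mnistz_path)  # True
```

Under this scheme, `mnistz` would pick up whatever `mnist` had already downloaded, since the builder name never enters the path.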
