Short description
When developing a new dataset, it seems that download_and_prepare redownloads the data every time.
Environment information
tensorflow-datasets/tfds-nightly version: tfds-nightly==1.0.1.dev201903040105
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow==1.13.1
Reproduction instructions
# Change the MNIST class name in the code so that it seems like it's a new dataset
# class MNISTZ
# Run once to see the data downloading, and kill it before it finishes
# Then run it again and see that it redownloads
python -m tensorflow_datasets.scripts.download_and_prepare --datasets=mnistz
Expected behavior
The downloads should be cached.
Looking at the code, it seems that the sha256 hashes are only stored after the dataset building completes, which causes the files to be re-downloaded every time. It would be very nice if this was relaxed, at least for development.
We could simply add the sha to the INFO file.
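To make the suggestion concrete, here is a minimal sketch of recording the sha256 in a sidecar INFO file as soon as a download finishes, rather than after the whole dataset has been built. This is not the tensorflow_datasets implementation: the helper names (sha256_of, info_path, record_download, is_cached), the ".INFO" suffix, and the JSON layout are all assumptions made for illustration.

```python
import hashlib
import json
import os


def sha256_of(path):
    """Return the hex sha256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def info_path(path):
    """Hypothetical sidecar location: '<downloaded file>.INFO'."""
    return path + ".INFO"


def record_download(path, url):
    """Write the url and sha256 next to the file right after downloading."""
    info = {"url": url, "sha256": sha256_of(path)}
    with open(info_path(path), "w") as f:
        json.dump(info, f)
    return info


def is_cached(path):
    """A file counts as cached only if its sidecar exists and the hash matches."""
    if not (os.path.exists(path) and os.path.exists(info_path(path))):
        return False
    with open(info_path(path)) as f:
        info = json.load(f)
    return info.get("sha256") == sha256_of(path)
```

With something like this, a run that is killed halfway through dataset building would still leave a verifiable record for every file that finished downloading, so the next run could skip those files without giving up checksum validation.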
This is really slowing me down :(
The current hack is to locally change the condition at https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/download/download_manager.py#L210:
From:
if (not self._force_download and resource.sha256 and
    resource.exists_locally()):
To:
if not self._force_download and resource.exists_locally():
But this has side effects elsewhere (for example, checksums are no longer computed).
Waiting for @pierrot0 for a more stable fix.
Is the proposition to cache on the builder level, or the data_dir level? Is there any reason mnistz from above would not want to use downloaded data from mnist, given the url is the same?
The proposition is to cache globally.
Patch will land soon.
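As a rough illustration of what caching globally could mean, the sketch below derives the cache location from the URL alone, so two builders (say mnist and mnistz) that request the same URL share one downloaded file. It is only a sketch under stated assumptions: cache_path and needs_download are hypothetical helpers, and hashing the URL to get a file name is one possible scheme, not necessarily what the eventual fix does.

```python
import hashlib
import os


def cache_path(download_dir, url):
    """Hypothetical scheme: the cache file name depends only on the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(download_dir, digest)


def needs_download(download_dir, url, force_download=False):
    """Download only when forced or when no file exists for this URL."""
    return force_download or not os.path.exists(cache_path(download_dir, url))
```

Because the dataset name never enters the path, renaming the builder class during development would not invalidate files that are already on disk.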