Short description
NonMatchingChecksumError when downloading a .tar, .gz, or .tar.gz file from gcloud.
Environment information
Reproduction instructions
import os

import tensorflow_datasets.public_api as tfds
import tensorflow as tf


class MyDataset(tfds.core.GeneratorBasedBuilder):
    """Short description of my dataset."""

    VERSION = tfds.core.Version("0.1.0")

    def _info(self):
        # Specifies the tfds.core.DatasetInfo object
        return tfds.core.DatasetInfo(
            builder=self,
            # This is the description that will appear on the datasets page.
            description=(
                "This is the dataset for x"
            ),
            # tfds.features.FeatureConnectors
            features=tfds.features.FeaturesDict(
                {
                    "image_rgb": tfds.features.Image(),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # Downloads the data and defines the splits
        # dl_manager is a tfds.download.DownloadManager that can be used to
        # download and extract URLs
        dl_paths = dl_manager.download_and_extract(
            {"foo": "gs://path_to_file.tar"}
        )
        return [
            tfds.core.SplitGenerator(
                name="train",
                num_shards=10,
                gen_kwargs={"images_dir_path": os.path.join(dl_paths["foo"], "train")},
            ),
        ]


if __name__ == "__main__":
    print("Testing")
    print(tfds.list_builders())
    # The builder is registered under the snake_case form of the class name.
    builder = tfds.builder("my_dataset")
    builder.download_and_prepare()
    ds = builder.as_dataset(split="train")
    info = builder.info
    print(info)
Logs
tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact gs://path_to_file.tar, downloaded to /home/emanuele/tensorflow_datasets/downloads/path_to_file.tar, has wrong checksum.
Expected behavior
The file should be downloaded and extracted as expected.
Closing, the issue was due to a missing checksum.
@EmanueleGhelfi how did you correct it?
I suggest the error message be fixed to say that "no checksum was specified" rather than "wrong checksum".
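To make the suggestion concrete, here is a minimal sketch of a check that reports a missing checksum separately from a mismatched one. It is not the TFDS implementation: `MissingChecksumError`, `validate_checksum`, and the `registered` dictionary are hypothetical names used only for illustration.

```python
class MissingChecksumError(Exception):
    """Hypothetical error: no checksum is registered for the URL."""


class NonMatchingChecksumError(Exception):
    """Stand-in for the error in the log: a checksum exists but differs."""


def validate_checksum(url, actual_sha256, registered):
    """Raise a distinct error for a missing vs. a mismatched checksum.

    `registered` is a hypothetical dict mapping URL -> expected sha256 hex digest.
    """
    expected = registered.get(url)
    if expected is None:
        # Nothing was registered: say so instead of reporting a mismatch.
        raise MissingChecksumError(
            "Artifact %s: no checksum was specified." % url
        )
    if expected != actual_sha256:
        raise NonMatchingChecksumError(
            "Artifact %s has wrong checksum." % url
        )
```

With a check shaped like this, the reproduction above would surface "no checksum was specified" for gs://path_to_file.tar rather than "wrong checksum".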
I followed these steps:
- Create an empty checksum file with touch path_to_tfds/url_checksums/my_dataset.txt
- Run the script with the DownloadManager field _register_checksums=True
Doing this makes the download manager register the checksums of the new dataset.
Note that subsequent loads of MyDataset do not need the register_checksums field.
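For anyone who wants to see what a registered entry amounts to, the sketch below computes the size and SHA-256 digest of an already downloaded file using only the standard library. The `url size sha256` line layout is an assumption about the checksum file format and may differ between TFDS versions; `checksum_line` is a hypothetical helper, not part of TFDS.

```python
import hashlib
import os


def checksum_line(url, local_path, chunk_size=1 << 20):
    """Return a 'url size sha256' line for a local copy of the artifact.

    The line layout is an assumption about the checksum file format.
    """
    sha256 = hashlib.sha256()
    with open(local_path, "rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    size = os.path.getsize(local_path)
    return "%s %d %s" % (url, size, sha256.hexdigest())
```

Letting the download manager register the checksums, as in the steps above, is likely the safer route, since it writes whatever format the installed version expects.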
Could you please explain how to create a checksum file, or point to a link about this?