Datasets: NonMatchingChecksumError when downloading from gcloud

Created on 19 Mar 2019 · 5 Comments · Source: tensorflow/datasets

Short description
NonMatchingChecksumError is raised when downloading a .tar, .gz, or .tar.gz file from gcloud.

Environment information

  • Operating System: Arch Linux
  • Python version: 3.6
  • tfds-nightly==1.0.1.dev201903180105
  • tf-nightly-gpu-2.0-preview==2.0.0.dev20190318

Reproduction instructions

import os

import tensorflow_datasets.public_api as tfds
import tensorflow as tf


class MyDataset(tfds.core.GeneratorBasedBuilder):
    """Short description of my dataset."""

    VERSION = tfds.core.Version("0.1.0")

    def _info(self):
        # Specifies the tfds.core.DatasetInfo object
        return tfds.core.DatasetInfo(
            builder=self,
            # This is the description that will appear on the datasets page.
            description=(
                "This is the dataset for x"
            ),
            # tfds.features.FeatureConnectors
            features=tfds.features.FeaturesDict(
                {
                    "image_rgb": tfds.features.Image(),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # Downloads the data and defines the splits
        # dl_manager is a tfds.download.DownloadManager that can be used to
        # download and extract URLs
        dl_paths = dl_manager.download_and_extract(
            {"foo": "gs://path_to_file.tar"}
        )
        return [
            tfds.core.SplitGenerator(
                name="train",
                num_shards=10,
                gen_kwargs={"images_dir_path": os.path.join(dl_paths["foo"], "train")},
            ),
        ]


if __name__ == "__main__":
    print("Testing")
    print(tfds.list_builders())
    # MyDataset is registered under its snake_case name, "my_dataset".
    builder = tfds.builder("my_dataset")
    builder.download_and_prepare()

    ds = builder.as_dataset(split="train")

    info = builder.info
    print(info)

Logs

tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact gs://path_to_file.tar, downloaded to /home/emanuele/tensorflow_datasets/downloads/path_to_file.tar, has wrong checksum.

Expected behavior
The file should be downloaded and extracted as expected.

bug

All 5 comments

Closing, the issue was due to a missing checksum.

@EmanueleGhelfi how did you correct it?

I did the following steps:

  • Create an empty checksum file with touch path_to_tfds/url_checksums/my_dataset.txt
  • Run the script with the download manager's _register_checksums field set to True.
    The download manager then records the checksums of the new dataset.
    Note that subsequent loads of MyDataset do not need the register_checksums field.
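
To make it clearer what a registered checksum entry might contain, here is a minimal standard-library sketch that computes the size and SHA-256 of a file that has already been downloaded. The helper name checksum_line and the "url size sha256" line layout are assumptions for illustration only; the layout may differ between tfds versions, so running once with checksum registration enabled, as described above, is still the way to produce the real file.

```python
import hashlib
import os
import tempfile


def checksum_line(url, local_path, chunk_size=1 << 20):
    """Return a 'url size sha256' line for the file at local_path.

    Hypothetical helper: the line layout is an assumption, not a
    confirmed tfds format.
    """
    digest = hashlib.sha256()
    with open(local_path, "rb") as f:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    size = os.path.getsize(local_path)
    return "{} {} {}".format(url, size, digest.hexdigest())


if __name__ == "__main__":
    # Demo on a throwaway file standing in for the downloaded archive.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
    print(checksum_line("gs://path_to_file.tar", tmp.name))
    os.remove(tmp.name)
```

Comparing such a line with what the download manager records can help confirm that the file on disk is the one that was registered.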

I suggest the error message be fixed to say that "no checksum was specified" rather than "wrong checksum".
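
A rough sketch of what this suggestion could look like, using plain Python rather than the tfds internals. The exception classes and the validate function are hypothetical names introduced here for illustration; they are not part of tensorflow_datasets.

```python
class MissingChecksumError(Exception):
    """Hypothetical: no checksum has been registered for the URL."""


class WrongChecksumError(Exception):
    """Hypothetical: the registered checksum differs from the file's."""


def validate(url, actual_sha256, registered):
    """Check actual_sha256 against a {url: sha256} mapping.

    Raises a different error for "nothing registered" than for
    "registered but different", so the message points at the real cause.
    """
    if url not in registered:
        raise MissingChecksumError(
            "No checksum was specified for {}.".format(url)
        )
    if registered[url] != actual_sha256:
        raise WrongChecksumError(
            "Artifact {} has wrong checksum.".format(url)
        )


if __name__ == "__main__":
    try:
        validate("gs://path_to_file.tar", "abc123", registered={})
    except MissingChecksumError as err:
        print(err)
```

With this split, the situation in this issue (an empty checksum file) would surface as a missing-checksum message instead of a mismatch.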

Could you please explain how to create a checksum file, or point to a link that covers it?