Datasets: No files found for wiki-40B datasets

Created on 4 Apr 2020 · 14 Comments · Source: tensorflow/datasets

Short description
I can't download the Wiki-40B datasets.

Environment information

  • Operating System: OSX
  • Python version: 3.7.5
  • tensorflow-datasets/tfds-nightly version: tfds-nightly 2.1.0.dev202004030105
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2.1.0

Reproduction instructions

python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=wiki40b/Wiki40B.it

Here, I've illustrated with the Italian dataset, but none of the other languages seem to download either.

Link to logs

I0403 15:30:20.204281 4533941696 download_and_prepare.py:197] Running download_and_prepare for dataset(s):
wiki40b/Wiki40B.it
I0403 15:30:20.204676 4533941696 dataset_builder.py:203] Load pre-computed datasetinfo (eg: splits) from bucket.
I0403 15:30:20.392930 4533941696 dataset_info.py:428] Loading info from GCS for wiki40b/Wiki40B.it/1.1.0
I0403 15:30:20.534743 4533941696 dataset_info.py:400] Field info.description from disk and from code do not match. Keeping the one from code.
I0403 15:30:20.534955 4533941696 dataset_info.py:400] Field info.redistribution_info from disk and from code do not match. Keeping the one from code.
I0403 15:30:20.535104 4533941696 download_and_prepare.py:136] download_and_prepare for dataset wiki40b/Wiki40B.it/1.1.0...
I0403 15:30:20.854214 4533941696 dataset_builder.py:337] Generating dataset wiki40b (/Users/bacon/tensorflow_datasets/wiki40b/Wiki40B.it/1.1.0)
Downloading and preparing dataset wiki40b/Wiki40B.it/1.1.0 (download: Unknown size, generated: 2.00 GiB, total: 2.00 GiB) to /Users/bacon/tensorflow_datasets/wiki40b/Wiki40B.it/1.1.0
...
I0403 15:30:21.059412 4533941696 pipeline.py:175] Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
I0403 15:30:21.064759 4533941696 dataset_builder.py:915] Generating split train
I0403 15:30:21.065811 4533941696 wiki40b.py:134] generating examples from = /bigstore/tfds-data/downloads/wiki40b/tfrecord_prod/train/it_examples-*
Traceback (most recent call last):
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/scripts/download_and_prepare.py", line 237, in <module>
    app.run(main)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/scripts/download_and_prepare.py", line 232, in main
    download_and_prepare(builder)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/scripts/download_and_prepare.py", line 153, in download_and_prepare
    download_config=dl_config,
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 53, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 367, in download_and_prepare
    download_config=download_config)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1181, in _download_and_prepare
    pipeline=pipeline,
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 919, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1229, in _prepare_split
    _ = pipeline | split_name >> _build_pcollection()   # pylint: disable=no-value-for-parameter
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 989, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 549, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/pipeline.py", line 536, in apply
    return self.apply(transform, pvalueish)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/pipeline.py", line 577, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 195, in apply
    return m(transform, input, options)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 225, in apply_PTransform
    return transform.expand(input)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 913, in expand
    return self._fn(pcoll, *args, **kwargs)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1223, in _build_pcollection
    pipeline, **split_generator.gen_kwargs)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/text/wiki40b.py", line 154, in _build_pcollection
    | beam.FlatMap(_extract_content))
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/tfrecordio.py", line 261, in __init__
    validate)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/tfrecordio.py", line 178, in __init__
    validate=validate)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/filebasedsource.py", line 125, in __init__
    self._validate()
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/options/value_provider.py", line 140, in _f
    return fnc(self, *args, **kwargs)
  File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/filebasedsource.py", line 186, in _validate
    'No files found based on the file pattern %s' % pattern)
OSError: No files found based on the file pattern /bigstore/tfds-data/downloads/wiki40b/tfrecord_prod/train/it_examples-*
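
Editor's note: the last frames of the traceback show Beam validating the file pattern while the source is being constructed. The snippet below is only a rough stand-in for that check using the standard library, not Beam's implementation; the helper name `count_matches` is hypothetical. On a machine that has no local /bigstore directory, the pattern from the log matches nothing, which is consistent with the error above.

```python
import glob


def count_matches(pattern: str) -> int:
    """Return how many local files match a glob pattern."""
    return len(glob.glob(pattern))


# Pattern copied from the log line emitted by wiki40b.py.
PATTERN = "/bigstore/tfds-data/downloads/wiki40b/tfrecord_prod/train/it_examples-*"

if count_matches(PATTERN) == 0:
    # Mirrors the wording of the OSError in the traceback.
    print(f"No files found based on the file pattern {PATTERN}")
```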

Expected behavior
I'd expect the datasets to download.

Additional context
The Wikipedia datasets download fine.

python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=wikipedia/20190301.aa
bug

All 14 comments

From the code, it seems that we have to manually download the data.
@Conchylicultor if this is the case, then we have to add some manual download instructions

This is a bug on our end. Could you try replacing _DATA_DIRECTORY with:

_DATA_DIRECTORY = "gs://tfds-data/downloads/wiki40b/tfrecord_prod"

in wiki40b.py?

@vijayphoenix, the dataset shouldn't require manual download in this case.

Also, as this is Wikipedia data, we can probably re-host the pre-generated version, so that generation can be skipped entirely.
Once I have checked internally, the dataset will be available directly at https://pantheon.corp.google.com/storage/browser/tfds-data/datasets/, probably next week.
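
Editor's note: comparing the path in the log with the value suggested above, the change amounts to swapping the leading /bigstore/ for gs://. The sketch below expresses that rewrite as a small function. The mapping is inferred from this thread only, and the name `bigstore_to_gcs` is hypothetical, not part of tensorflow_datasets.

```python
BIGSTORE_PREFIX = "/bigstore/"  # prefix seen in the failing log line
GCS_PREFIX = "gs://"            # prefix used in the suggested fix


def bigstore_to_gcs(path: str) -> str:
    """Rewrite '/bigstore/<bucket>/...' as 'gs://<bucket>/...'.

    Paths without the /bigstore/ prefix are returned unchanged.
    """
    if path.startswith(BIGSTORE_PREFIX):
        return GCS_PREFIX + path[len(BIGSTORE_PREFIX):]
    return path


print(bigstore_to_gcs("/bigstore/tfds-data/downloads/wiki40b/tfrecord_prod"))
# prints: gs://tfds-data/downloads/wiki40b/tfrecord_prod
```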

I replaced _DATA_DIRECTORY as described in the solution, but it's not working in Colab.

@Eshan-Agarwal what is the error? I don't see any stack trace in the Colab you shared.

@Conchylicultor in the root directory, the dataset was not completely downloaded and extracted. It shows this directory name: 1.1.0.incomplete6LH1PP

@Eshan-Agarwal it just means that the dataset is currently generating. Did you kill the program before it finished?
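
Editor's note: a quick way to see whether a generation run left such a directory behind is to scan the data directory for names containing ".incomplete". This is a minimal sketch using only the standard library. The default location ~/tensorflow_datasets is taken from the log earlier in this issue, and the helper name `find_incomplete_dirs` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_incomplete_dirs(data_dir: str = "~/tensorflow_datasets") -> List[Path]:
    """Return directories under data_dir whose name contains '.incomplete'."""
    root = Path(data_dir).expanduser()
    if not root.is_dir():
        return []
    # Matches names such as '1.1.0.incomplete6LH1PP' at any depth.
    return sorted(p for p in root.rglob("*.incomplete*") if p.is_dir())


for path in find_incomplete_dirs():
    print(path)
```

If a directory is listed while no generation process is running, the previous run most likely stopped before it finished.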

No, it stops automatically. I tried twice.

@Conchylicultor It works fine now. Maybe it terminated previously due to bad connectivity in Colab. Sorry for the trouble.

It works now because it loads the pre-generated data directly from https://pantheon.corp.google.com/storage/browser/tfds-data/datasets/, so it skips the download_and_prepare phase.
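
Editor's note: with the pre-generated data hosted, loading from Python might look like the sketch below. It assumes the config naming used in this issue (wiki40b/Wiki40B.it), which may differ in other tensorflow-datasets releases. The helper names are hypothetical, and the import sits inside the function only so that the snippet can be read and imported without the library installed.

```python
def wiki40b_name(lang: str) -> str:
    """Build the dataset name in the form used in this issue, e.g. 'wiki40b/Wiki40B.it'."""
    return f"wiki40b/Wiki40B.{lang}"


def load_wiki40b(lang: str, split: str = "train"):
    """Load one language of Wiki-40B through tfds.load (needs network access)."""
    import tensorflow_datasets as tfds  # imported lazily; requires tfds to be installed

    return tfds.load(wiki40b_name(lang), split=split)


print(wiki40b_name("it"))
# prints: wiki40b/Wiki40B.it
```

Calling load_wiki40b("it") would then return the train split as a tf.data.Dataset, provided the name resolves in the installed version.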

@Conchylicultor Thanks for updating this.

@geoffbacon Please close this issue if your problem was solved

Thanks @Conchylicultor, @vijayphoenix and @Eshan-Agarwal, this is fixed in tfds-nightly now.
