Short description
I can't download the Wiki-40B datasets.
Environment information
tensorflow-datasets/tfds-nightly version: tfds-nightly 2.1.0.dev202004030105
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2.1.0
Reproduction instructions
python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=wiki40b/Wiki40B.it
Here, I've illustrated with the Italian dataset, but none of the other languages seem to download either.
Link to logs
I0403 15:30:20.204281 4533941696 download_and_prepare.py:197] Running download_and_prepare for dataset(s):
wiki40b/Wiki40B.it
I0403 15:30:20.204676 4533941696 dataset_builder.py:203] Load pre-computed datasetinfo (eg: splits) from bucket.
I0403 15:30:20.392930 4533941696 dataset_info.py:428] Loading info from GCS for wiki40b/Wiki40B.it/1.1.0
I0403 15:30:20.534743 4533941696 dataset_info.py:400] Field info.description from disk and from code do not match. Keeping the one from code.
I0403 15:30:20.534955 4533941696 dataset_info.py:400] Field info.redistribution_info from disk and from code do not match. Keeping the one from code.
I0403 15:30:20.535104 4533941696 download_and_prepare.py:136] download_and_prepare for dataset wiki40b/Wiki40B.it/1.1.0...
I0403 15:30:20.854214 4533941696 dataset_builder.py:337] Generating dataset wiki40b (/Users/bacon/tensorflow_datasets/wiki40b/Wiki40B.it/1.1.0)
Downloading and preparing dataset wiki40b/Wiki40B.it/1.1.0 (download: Unknown size, generated: 2.00 GiB, total: 2.00 GiB) to /Users/bacon/tensorflow_datasets/wiki40b/Wiki40B.it/1.1.0
...
I0403 15:30:21.059412 4533941696 pipeline.py:175] Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
I0403 15:30:21.064759 4533941696 dataset_builder.py:915] Generating split train
I0403 15:30:21.065811 4533941696 wiki40b.py:134] generating examples from = /bigstore/tfds-data/downloads/wiki40b/tfrecord_prod/train/it_examples-*
Traceback (most recent call last):
File "/Users/bacon/miniconda/envs/model/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/scripts/download_and_prepare.py", line 237, in <module>
app.run(main)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/scripts/download_and_prepare.py", line 232, in main
download_and_prepare(builder)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/scripts/download_and_prepare.py", line 153, in download_and_prepare
download_config=dl_config,
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 53, in disallow_positional_args_dec
return fn(*args, **kwargs)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 367, in download_and_prepare
download_config=download_config)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1181, in _download_and_prepare
pipeline=pipeline,
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 919, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1229, in _prepare_split
_ = pipeline | split_name >> _build_pcollection() # pylint: disable=no-value-for-parameter
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 989, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 549, in __ror__
result = p.apply(self, pvalueish, label)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/pipeline.py", line 536, in apply
return self.apply(transform, pvalueish)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/pipeline.py", line 577, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 195, in apply
return m(transform, input, options)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 225, in apply_PTransform
return transform.expand(input)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 913, in expand
return self._fn(pcoll, *args, **kwargs)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1223, in _build_pcollection
pipeline, **split_generator.gen_kwargs)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/tensorflow_datasets/text/wiki40b.py", line 154, in _build_pcollection
| beam.FlatMap(_extract_content))
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/tfrecordio.py", line 261, in __init__
validate)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/tfrecordio.py", line 178, in __init__
validate=validate)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/filebasedsource.py", line 125, in __init__
self._validate()
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/options/value_provider.py", line 140, in _f
return fnc(self, *args, **kwargs)
File "/Users/bacon/miniconda/envs/model/lib/python3.7/site-packages/apache_beam/io/filebasedsource.py", line 186, in _validate
'No files found based on the file pattern %s' % pattern)
OSError: No files found based on the file pattern /bigstore/tfds-data/downloads/wiki40b/tfrecord_prod/train/it_examples-*
Expected behavior
I'd expect the datasets to download.
Additional context
The Wikipedia datasets download fine.
python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=wikipedia/20190301.aa
From the code, it seems that we have to manually download the data.
@Conchylicultor if this is the case, then we have to add some manual download instructions.
This is a bug on our end. Could you try replacing _DATA_DIRECTORY with:
_DATA_DIRECTORY = "gs://tfds-data/downloads/wiki40b/tfrecord_prod"
in wiki40b.py?
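For illustration, here is a minimal sketch of how the file pattern from the log would change with the new prefix. The helper name examples_pattern is hypothetical and is not taken from wiki40b.py. Only the directory value and the train/it_examples-* suffix come from this thread, and the way the real module joins them is an assumption.

```python
import posixpath

# Value suggested in the comment above (replaces the /bigstore/... prefix seen in the log).
_DATA_DIRECTORY = "gs://tfds-data/downloads/wiki40b/tfrecord_prod"


def examples_pattern(split, lang, data_directory=_DATA_DIRECTORY):
    """Hypothetical helper: build the glob pattern for one split and language.

    Assumes the <dir>/<split>/<lang>_examples-* layout shown in the log line
    "generating examples from = ...".
    """
    return posixpath.join(data_directory, split, "{}_examples-*".format(lang))


# Pattern for the Italian train split with the gs:// prefix.
print(examples_pattern("train", "it"))

# The pattern from the original error, for comparison.
print(examples_pattern("train", "it", data_directory="/bigstore/tfds-data/downloads/wiki40b/tfrecord_prod"))
```

The second pattern is the one Beam reported in the OSError above. No files matched it on the reporter's machine.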
@vijayphoenix, the dataset shouldn't require manual download in this case.
Also, as this is Wikipedia data, we can probably re-host the pre-generated version, so that generation can be skipped entirely.
Once I check internally, the dataset will be available directly in https://pantheon.corp.google.com/storage/browser/tfds-data/datasets/, probably next week.
I replaced _DATA_DIRECTORY as described in the solution above, but it's still not working (see Colab).
@Eshan-Agarwal What is the error? I don't see any stack trace in the Colab you shared.
@Conchylicultor In the root directory, the dataset was not completely downloaded and extracted. It shows a directory named 1.1.0.incomplete6LH1PP.
@Eshan-Agarwal It just means that the dataset is currently being generated. Did you kill the program before it finished?
No, it stops on its own. I tried twice.
@Conchylicultor It works fine now. Maybe it terminated earlier because of bad connectivity in Colab. Sorry for the trouble.
It works now because it loads the pre-generated data directly from https://pantheon.corp.google.com/storage/browser/tfds-data/datasets/, so it skips the download_and_prepare phase.
@Conchylicultor Thanks for updating this.
@geoffbacon Please close this issue if your problem was solved.
Thanks @Conchylicultor, @vijayphoenix and @Eshan-Agarwal, this is fixed in tfds-nightly now.