Short description
Running download_and_prepare() to fetch the wmt_translate dataset fails with errors caused by wrong keys in the downloaded dataset.
Environment information
tensorflow-datasets/tfds-nightly version: tensorflow-datasets 1.1.0
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow-gpu 2.0 beta1
Reproduction instructions
import tensorflow as tf
import tensorflow_datasets as tfds

def main():
    config = tfds.translate.wmt.WmtConfig(
        version="0.0.3",
        language_pair=("zh", "en"),
        subsets={
            tfds.Split.TRAIN: ["newscommentary_v14"],
        },
    )
    builder = tfds.builder("wmt_translate", config=config)
    builder.download_and_prepare()

if __name__ == "__main__":
    main()
Link to logs
I ran into two types of errors.
1st:
Traceback (most recent call last):
File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/ptvsd_launcher.py", line 43, in <module>
main(ptvsdArgs)
File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 434, in main
run()
File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 312, in run_file
runpy.run_path(target, run_name='__main__')
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/iceman126/NMT/test.py", line 16, in <module>
main()
File "/home/iceman126/NMT/test.py", line 13, in main
builder.download_and_prepare()
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
return fn(*args, **kwargs)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 287, in download_and_prepare
download_config=download_config)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 948, in _download_and_prepare
max_examples_per_split=download_config.max_examples_per_split,
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 816, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 972, in _prepare_split
example = self.info.features.encode_example(record)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 168, in encode_example
in utils.zip_dict(self._feature_dict, example_dict)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 165, in <dictcomp>
return {
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in zip_dict
yield key, tuple(d[key] for d in dicts)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in <genexpr>
yield key, tuple(d[key] for d in dicts)
KeyError: 'e'
2nd:
Traceback (most recent call last):
File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/ptvsd_launcher.py", line 43, in <module>
main(ptvsdArgs)
File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 434, in main
run()
File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 312, in run_file
runpy.run_path(target, run_name='__main__')
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/iceman126/NMT/test.py", line 16, in <module>
main()
File "/home/iceman126/NMT/test.py", line 13, in main
builder.download_and_prepare()
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
return fn(*args, **kwargs)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 287, in download_and_prepare
download_config=download_config)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 948, in _download_and_prepare
max_examples_per_split=download_config.max_examples_per_split,
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 816, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 972, in _prepare_split
example = self.info.features.encode_example(record)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 168, in encode_example
in utils.zip_dict(self._feature_dict, example_dict)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 165, in <dictcomp>
return {
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in zip_dict
yield key, tuple(d[key] for d in dicts)
File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in <genexpr>
yield key, tuple(d[key] for d in dicts)
TypeError: string indices must be integers
Expected behavior
It should download and extract the dataset.
Additional context
I printed the dicts inside the zip_dict function, and they did not contain the sentences.
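One plausible reading of the two tracebacks, sketched in plain Python without TFDS: if the writer loop unpacks each yielded item as `key, record` while the generator still yields a bare dict, `record` ends up being a language code string rather than the example. The `zip_dict_sketch` helper below is a hypothetical stand-in; only its `yield` line is taken from the traceback, and the way keys are collected is an assumption.

```python
import itertools

def zip_dict_sketch(*dicts):
    # Hypothetical stand-in for the zip_dict helper in the traceback.
    # Assumption: keys are gathered from all arguments into a set.
    for key in set(itertools.chain(*dicts)):
        yield key, tuple(d[key] for d in dicts)

feature_dict = {"zh": "text feature", "en": "text feature"}
legacy_example = {"zh": "你好", "en": "hello"}

# Unpacking a two-key dict iterates over its keys, not its values.
key, record = legacy_example
# record is now the string "en", not the example dict.

def try_encode(record):
    # Returns the name of the exception raised, if any.
    try:
        return dict(zip_dict_sketch(feature_dict, record))
    except (KeyError, TypeError) as err:
        return type(err).__name__

outcome = try_encode(record)
print(outcome)  # "KeyError" or "TypeError", depending on set iteration order
```

Under these assumptions, which error appears depends on the order in which the set yields its keys, which would match seeing both failures across runs.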
Same issue here with a custom dataset, see MVCE below:
EDIT: downgrading to 1.0.2 solved the issue.
import tensorflow_datasets.public_api as tfds
import tensorflow as tf
import numpy as np

class MyCustomDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version('0.1.0')

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            description="",
            features=tfds.features.FeaturesDict({
                "myfeature": tfds.features.Tensor(shape=(), dtype=tf.float32),
            }),
            urls=[],
            citation=""
        )

    def _split_generators(self, dl_manager):
        train = tfds.core.SplitGenerator(
            name="train",
            num_shards=10,
            gen_kwargs={
                "cases": np.arange(10),
            })
        return [train]

    def _generate_examples(self, cases):
        for i in range(10):
            yield {'myfeature': np.ones((), dtype=np.float32)}

mycustomdataset = MyCustomDataset()
mycustomdataset.download_and_prepare()
I sometimes get KeyError: 'e' (with a random character), and sometimes the ValueError below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-32de1e850b02> in <module>
35
36 mycustomdataset = MyCustomDataset()
---> 37 mycustomdataset.download_and_prepare()
/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py in disallow_positional_args_dec(fn, instance, args, kwargs)
50 _check_no_positional(fn, args, ismethod, allowed=allowed)
51 _check_required(fn, kwargs)
---> 52 return fn(*args, **kwargs)
53
54 return disallow_positional_args_dec(wrapped) # pylint: disable=no-value-for-parameter
/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in download_and_prepare(self, download_dir, download_config)
285 self._download_and_prepare(
286 dl_manager=dl_manager,
--> 287 download_config=download_config)
288
289 # NOTE: If modifying the lines below to put additional information in
/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in _download_and_prepare(self, dl_manager, download_config)
946 super(GeneratorBasedBuilder, self)._download_and_prepare(
947 dl_manager=dl_manager,
--> 948 max_examples_per_split=download_config.max_examples_per_split,
949 )
950
/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in _download_and_prepare(self, dl_manager, **prepare_split_kwargs)
814
815 # Prepare split will record examples associated to the split
--> 816 self._prepare_split(split_generator, **prepare_split_kwargs)
817
818 # Update the info object with the splits.
/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in _prepare_split(self, split_generator, max_examples_per_split)
969 hash_salt=split_generator.name)
970 for key, record in utils.tqdm(generator, unit=" examples",
--> 971 total=split_info.num_examples, leave=False):
972 example = self.info.features.encode_example(record)
973 writer.write(key, example)
ValueError: not enough values to unpack (expected 2, got 1)
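The ValueError can be reproduced without TFDS. The sketch below assumes only what the traceback shows, namely that the writer iterates with `for key, record in ...`; the function names are hypothetical. A generator that yields a one-key dict gives that loop a single value to unpack.

```python
def legacy_generate_examples():
    # Legacy style: yields only the example dict, with no key.
    for _ in range(3):
        yield {"myfeature": 1.0}

def consume(generator):
    # Same loop shape as in the traceback: each item is unpacked into two names.
    return [(key, record) for key, record in generator]

try:
    consume(legacy_generate_examples())
    message = None
except ValueError as err:
    message = str(err)

print(message)
```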
We are transitioning to a new reader/writer (see: https://www.tensorflow.org/datasets/splits#two_apis_s3_and_legacy), so this triggers some bugs, but once the migration is complete, things should be back to normal. Thank you both for reporting.
@iceman126, Wmt hasn't been updated to this new S3 feature yet, so you need to manually opt out by setting the version:
config = tfds.translate.wmt.WmtConfig(
    version=tfds.core.Version('0.0.3', experiments={tfds.core.Experiment.S3: False}),
    ...
)
Edit: @iceman126, alternatively, a fix for this use case should come soon, so this should be fixed in one of the next tfds-nightly releases this week.
@remydubois For new datasets, the new S3 reader is activated by default, so you need to update your _generate_examples function to yield key, example instead of just example.
(see doc: https://www.tensorflow.org/datasets/add_dataset#writing_an_example_generator)
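A minimal sketch of the updated generator shape, written as a plain function so it runs without TFDS. The name `generate_examples` and the choice of the loop index as the key are assumptions for illustration; in a real builder this logic would live in `_generate_examples`, and each key should be unique within the split.

```python
def generate_examples(cases):
    # Yield a (key, example) pair instead of a bare example dict.
    # Assumption: the case index is a suitable unique key.
    for i in cases:
        yield i, {"myfeature": 1.0}

pairs = list(generate_examples(range(10)))
keys = [key for key, _ in pairs]

# A consumer that unpacks two values per item now works.
for key, record in pairs:
    assert "myfeature" in record
```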
@Conchylicultor Thank you for your solution. This solved my problem.