Datasets: Dictionary key error when using wmt_translate dataset

Created on 7 Aug 2019 · 3 Comments · Source: tensorflow/datasets

Short description
Running download_and_prepare() to get the wmt_translate dataset raises errors that appear to be caused by wrong keys for the downloaded dataset.

Environment information

  • Operating System: Ubuntu 16.04
  • Python version: Python 3.6
  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 1.1.0
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow-gpu 2.0 beta1

Reproduction instructions

import tensorflow as tf
import tensorflow_datasets as tfds

def main():
    config = tfds.translate.wmt.WmtConfig(
        version="0.0.3",
        language_pair=("zh", "en"),
        subsets={
            tfds.Split.TRAIN: ["newscommentary_v14"],
        },
    )
    builder = tfds.builder("wmt_translate", config=config)
    builder.download_and_prepare()

if __name__ == "__main__":
    main()

Link to logs
I ran into two different errors.
1st:

Traceback (most recent call last):
  File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 434, in main
    run()
  File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 312, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/iceman126/NMT/test.py", line 16, in <module>
    main()
  File "/home/iceman126/NMT/test.py", line 13, in main
    builder.download_and_prepare()
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 287, in download_and_prepare
    download_config=download_config)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 948, in _download_and_prepare
    max_examples_per_split=download_config.max_examples_per_split,
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 816, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 972, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 168, in encode_example
    in utils.zip_dict(self._feature_dict, example_dict)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 165, in <dictcomp>
    return {
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'e'

2nd:

Traceback (most recent call last):
  File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 434, in main
    run()
  File "/home/iceman126/.vscode/extensions/ms-python.python-2019.6.24221/pythonFiles/lib/python/ptvsd/__main__.py", line 312, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/iceman126/NMT/test.py", line 16, in <module>
    main()
  File "/home/iceman126/NMT/test.py", line 13, in main
    builder.download_and_prepare()
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 287, in download_and_prepare
    download_config=download_config)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 948, in _download_and_prepare
    max_examples_per_split=download_config.max_examples_per_split,
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 816, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 972, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 168, in encode_example
    in utils.zip_dict(self._feature_dict, example_dict)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/features/features_dict.py", line 165, in <dictcomp>
    return {
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/home/iceman126/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 67, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
TypeError: string indices must be integers

Expected behavior
It should download and extract the dataset.

Additional context
I printed the dicts passed to the zip_dict function, and they did not contain the sentences.
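A minimal sketch of how both errors could come from the same line, assuming the writer loop unpacks each generated item as `key, record` (as the `_prepare_split` frame in the custom-dataset traceback further down this thread suggests). The `zip_dict` below is a simplified stand-in: only its `yield` line is taken from the traceback, the key-merging loop is an assumption, and `feature_dict` and the sentence strings are placeholders.

```python
import itertools


def zip_dict(*dicts):
    # Simplified stand-in for the library helper. Only the `yield` line
    # appears in the traceback; merging keys through a set is assumed.
    for key in set(itertools.chain(*dicts)):
        yield key, tuple(d[key] for d in dicts)


# Placeholder feature spec with the two language keys.
feature_dict = {"zh": "text feature", "en": "text feature"}

# A generator written for the old writer yields one bare dict per example.
item = {"zh": "source sentence", "en": "target sentence"}

# Unpacking a two-key dict iterates over its keys, so `record` ends up
# being the string "en" rather than the example dict.
key, record = item


def encode(record):
    # Report which exception the first merged key triggers.
    try:
        return dict(zip_dict(feature_dict, record))
    except (KeyError, TypeError) as err:
        return type(err).__name__


outcome = encode(record)
print(key, record, outcome)
```

Because the merged keys go through a set, whichever key comes first decides the failure: a single character such as 'e' is missing from the feature dict (KeyError), while a real key such as 'zh' is found there but then used to index the string (TypeError). String hashing is randomized per process, which would explain why the error changes between runs.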

bug

All 3 comments

Same issue here with a custom dataset, see MCVE below:

EDIT: downgrading to 1.0.2 solved the issue.

import tensorflow_datasets.public_api as tfds
import tensorflow as tf
import numpy as np

class MyCustomDataset(tfds.core.GeneratorBasedBuilder):

    VERSION = tfds.core.Version('0.1.0')

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            description="",
            features=tfds.features.FeaturesDict({
                "myfeature": tfds.features.Tensor(shape=(), dtype=tf.float32),
            }),
            urls=[],
            citation=""
        )

    def _split_generators(self, dl_manager):

        train = tfds.core.SplitGenerator(
            name="train",
            num_shards=10,
            gen_kwargs={
                "cases": np.arange(10),
            })


        return [train]

    def _generate_examples(self, cases):
        for i in range(10):
            yield {'myfeature': np.ones((), dtype=np.float32)}

mycustomdataset = MyCustomDataset()
mycustomdataset.download_and_prepare()

I sometimes get a KeyError such as KeyError: 'e', with a random character, and sometimes a ValueError as below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-32de1e850b02> in <module>
     35
     36 mycustomdataset = MyCustomDataset()
---> 37 mycustomdataset.download_and_prepare()

/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py in disallow_positional_args_dec(fn, instance, args, kwargs)
     50     _check_no_positional(fn, args, ismethod, allowed=allowed)
     51     _check_required(fn, kwargs)
---> 52     return fn(*args, **kwargs)
     53
     54   return disallow_positional_args_dec(wrapped)  # pylint: disable=no-value-for-parameter

/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in download_and_prepare(self, download_dir, download_config)
    285         self._download_and_prepare(
    286             dl_manager=dl_manager,
--> 287             download_config=download_config)
    288
    289         # NOTE: If modifying the lines below to put additional information in

/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in _download_and_prepare(self, dl_manager, download_config)
    946     super(GeneratorBasedBuilder, self)._download_and_prepare(
    947         dl_manager=dl_manager,
--> 948         max_examples_per_split=download_config.max_examples_per_split,
    949     )
    950

/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in _download_and_prepare(self, dl_manager, **prepare_split_kwargs)
    814
    815       # Prepare split will record examples associated to the split
--> 816       self._prepare_split(split_generator, **prepare_split_kwargs)
    817
    818     # Update the info object with the splits.

/hdd1/miniconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py in _prepare_split(self, split_generator, max_examples_per_split)
    969                                      hash_salt=split_generator.name)
    970     for key, record in utils.tqdm(generator, unit=" examples",
--> 971                                   total=split_info.num_examples, leave=False):
    972       example = self.info.features.encode_example(record)
    973       writer.write(key, example)

ValueError: not enough values to unpack (expected 2, got 1)
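The ValueError above is consistent with the loop at line 970 unpacking every generated item into `key, record`. A small stand-alone sketch of that failure, with no TensorFlow Datasets dependency; `legacy_generate_examples` is a hypothetical stand-in for the `_generate_examples` method in the report, and the float value is a placeholder:

```python
def legacy_generate_examples():
    # Mirrors the report: yields a bare example dict with one feature.
    for _ in range(10):
        yield {"myfeature": 1.0}


message = None
try:
    # Same unpacking shape as the loop shown in the traceback.
    for key, record in legacy_generate_examples():
        pass
except ValueError as err:
    message = str(err)

print(message)
```

Unpacking a dict iterates over its keys, and a one-key dict supplies only one value for the two targets, so the loop fails on the first example.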

We are transitioning to a new reader/writer (see: https://www.tensorflow.org/datasets/splits#two_apis_s3_and_legacy), which is triggering some bugs, but once the migration is complete, things should be back to normal. Thank you both for reporting.

@iceman126, Wmt hasn't been updated to the new S3 feature yet, so you need to manually opt out by setting the version:

config = tfds.translate.wmt.WmtConfig(
        version=tfds.core.Version('0.0.3', experiments={tfds.core.Experiment.S3: False}),              
        ...
)

Edit: @iceman126, alternatively, a fix for this use case should land in one of the next tfds-nightly releases this week.

@remydubois For new datasets, the new S3 reader is activated by default, so you need to update your _generate_examples function to yield key, example instead of just example.
(see doc: https://www.tensorflow.org/datasets/add_dataset#writing_an_example_generator)
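As a rough sketch of the "yield key, example" shape, here is a plain-Python generator that mirrors the custom dataset above. It is not the library's code: `generate_examples` stands in for the builder's `_generate_examples` method, the loop index is used as the key on the assumption that keys only need to be unique within a split, and the float value is a placeholder.

```python
def generate_examples():
    # Yield a (key, example) pair instead of a bare example dict.
    for i in range(10):
        yield i, {"myfeature": 1.0}


# A consumer that unpacks `key, record`, like the loop in the traceback,
# now receives the example dict as the record.
written = {}
for key, record in generate_examples():
    written[key] = record

print(len(written), written[0])
```

In the builder itself, the same change would amount to replacing the bare `yield {...}` in `_generate_examples` with `yield i, {...}`.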

@Conchylicultor Thank you for your solution. This solved my problem.
