Datasets: Add test and fake data for librispeech

Created on 5 Mar 2019 · 6 Comments · Source: tensorflow/datasets

Librispeech is missing a unit test and fake data.

Please follow:
https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md#testing-mydataset

bug

All 6 comments

@cyfra I want to start working on this issue. Please assign this to me.

Thanks @ChanchalKumarMaji!

When I run this code on my fake dataset:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow_datasets import testing
from tensorflow_datasets.audio import librispeech 


class LibrispeechTest(testing.DatasetBuilderTestCase):
  DATASET_CLASS = librispeech.Librispeech
  SPLITS = {
      "train": 1,
      "test": 1,
  }

if __name__ == "__main__":
  testing.test_main()

I get this error:

.Testing config clean100_plain_text
Downloading / extracting dataset librispeech (?? GiB) to /tmp/librispeech_testufs4ovss/tmp_6wchvyh/librispeech/clean100_plain_text/0.1.0...
E..s
======================================================================
ERROR: test_download_and_prepare_as_dataset (__main__.LibrispeechTest)
test_download_and_prepare_as_dataset (__main__.LibrispeechTest)
Run the decorated test method.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/test_utils.py", line 167, in decorated
    f(self, *args, **kwargs)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/dataset_builder_testing.py", line 192, in test_download_and_prepare_as_dataset
    self._download_and_prepare_as_dataset(builder)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/dataset_builder_testing.py", line 209, in _download_and_prepare_as_dataset
    builder.download_and_prepare(download_config=download_config)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 219, in download_and_prepare
    max_examples_per_split=download_config.max_examples_per_split)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 650, in _download_and_prepare
    for split_generator in self._split_generators(dl_manager):
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/audio/librispeech.py", line 195, in _split_generators
    self._vocab_text_gen(extracted_dirs[tfds.Split.TRAIN]))
TypeError: string indices must be integers

----------------------------------------------------------------------
Ran 5 tests in 0.190s

FAILED (errors=1, skipped=1)

I think I am missing something. @rsepassi @cyfra @Conchylicultor please help.

@ChanchalKumarMaji You need to pass a dictionary for the file paths. This can be done by assigning a dictionary to DL_EXTRACT_RESULT.
Its keys are tfds.Split.TRAIN, tfds.Split.TEST, and tfds.Split.VALIDATION, and its values are the paths to the directories holding the fake samples for the respective splits.

Reference: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image/mnist_test.py
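
A minimal sketch of why the first traceback ends in "TypeError: string indices must be integers". It assumes that, without DL_EXTRACT_RESULT, the builder receives a single path string where it expects a dict keyed by split; that is inferred from the error and the advice above, not confirmed against the library source. Plain strings stand in for the tfds.Split.* keys, and lookup_train_dir is a hypothetical helper, not part of tensorflow_datasets.

```python
def lookup_train_dir(extracted_dirs):
    """Hypothetical stand-in for indexing the extract result by split."""
    # "train" stands in for tfds.Split.TRAIN in this sketch.
    return extracted_dirs["train"]


# A dict keyed by split name can be indexed as the builder expects.
train_dir = lookup_train_dir({
    "train": "train-clean-100",
    "test": "test-clean",
    "validation": "dev-clean",
})

# A bare path string cannot be indexed by a split key.
try:
    lookup_train_dir("train-clean-100")
    error_message = None
except TypeError as err:
    error_message = str(err)

print(train_dir)
print(error_message)
```

The second call fails in the same way as the traceback, which is consistent with the fix being a dict rather than a single path.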

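# Imports were not shown in the original post; these are assumed.
import tensorflow_datasets as tfds
from tensorflow_datasets import testing
from tensorflow_datasets.audio import librispeech


class LibrispeechTest(testing.DatasetBuilderTestCase):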
  DATASET_CLASS = librispeech.Librispeech

  SPLITS = {
      "train": 1,
      "test": 1,
      "dev": 1,
  }

  DL_EXTRACT_RESULT = {
      tfds.Split.TRAIN: "train-clean-100",
      tfds.Split.TEST: "test-clean",
      tfds.Split.VALIDATION: "dev-clean",
  }

if __name__ == "__main__":
  testing.test_main()

gives the following error:

Traceback (most recent call last):
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/test_utils.py", line 167, in decorated
    f(self, *args, **kwargs)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/dataset_builder_testing.py", line 192, in test_download_and_prepare_as_dataset
    self._download_and_prepare_as_dataset(builder)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/dataset_builder_testing.py", line 212, in _download_and_prepare_as_dataset
    self._assertAsDataset(builder)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/tensorflow_datasets/testing/dataset_builder_testing.py", line 236, in _assertAsDataset
    self.assertLen(examples, expected_examples_number)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/absl/testing/absltest.py", line 784, in assertLen
    container_repr, len(container), expected_len), msg)
  File "/home/chanchal/anaconda3/lib/python3.6/site-packages/absl/testing/absltest.py", line 1609, in fail
    return super(TestCase, self).fail(self._formatMessage(prefix, msg))
AssertionError: [] has length of 0, expected 1.

My CHAPTERS.TXT is:

; Some pipe(|) separated metadata about the audio chapters included in the corpus.
;
; The meaning of the fields in left-to-right order is as follows:
;
; chapter_id: the ID of the chapter in the LibriVox's database
; reader_id: the ID of the reader in the LibriVox's database
; duration: how many minutes of this chapter are used in the corpus
; subset: the corpus subset to which this chapter is assigned
; project_id: the LibriVox project ID
; book_id: the Project Gutenberg's ID for the book on which the LibriVox project is based
; chapter_title: the title of the chapter on LibriVox
; project_title: the title of the LibriVox project
;
;ID    |READER|MINUTES| SUBSET           | PROJ.|BOOK ID| CH. TITLE  | PROJECT TITLE
01     | 11   | 19.77 | dev-clean        | 53   | 2     | In Chancer | Bleak House
02     | 12   | 10.30 | dev-clean        | 53   | 3     | In Fashion | Bleak House
03     | 13   | 7.67  | dev-other        | 68   | 7     | Letter XXV | Unbeaten Tracks in Japan
04     | 14   | 8.42  | dev-other        | 219  | 9     | Chapter 01 | Northanger Abbey
05     | 15   | 11.68 | test-clean       | 219  | 1     | Chapter 02 | Northanger Abbey
06     | 16   | 11.25 | test-clean       | 219  | 5     | Chapter 03 | Northanger Abbey
07     | 17   | 7.57  | test-other       | 219  | 9     | Chapter 04 | Northanger Abbey
08     | 18   | 12.76 | test-other       | 219  | 3     | Chapter 07 | Northanger Abbey
09     | 19   | 12.82 | train-clean-100  | 219  | 4     | Chapter 08 | Northanger Abbey
10     | 20   | 18.33 | train-clean-100  | 219  | 6     | Chapter 10 | Northanger Abbey
11     | 21   | 12.95 | train-clean-360  | 219  | 8     | Chapter 11 | Northanger Abbey
12     | 22   | 8.20  | train-clean-360  | 219  | 1     | Chapter 12 | Northanger Abbey
13     | 23   | 12.09 | train-other-500  | 219  | 4     | Chapter 15 | Northanger Abbey
14     | 24   | 6.19  | train-other-500  | 219  | 5     | Chapter 17 | Northanger Abbey

I created just one sample of each type.
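
To check which subsets a fake CHAPTERS.TXT actually covers, the pipe-separated format described in its header can be read with a few lines of Python. This is only a sketch under stated assumptions: parse_chapters is a hypothetical helper, not the parser used by the Librispeech builder, and it assumes no field contains a literal "|".

```python
def parse_chapters(lines):
    """Parse pipe-separated chapter metadata, skipping ';' comment lines."""
    # Field order follows the header comments of CHAPTERS.TXT.
    fields = ("chapter_id", "reader_id", "duration", "subset",
              "project_id", "book_id", "chapter_title", "project_title")
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        values = [value.strip() for value in line.split("|")]
        rows.append(dict(zip(fields, values)))
    return rows


sample = [
    ";ID    |READER|MINUTES| SUBSET           | PROJ.|BOOK ID| CH. TITLE  | PROJECT TITLE",
    "01     | 11   | 19.77 | dev-clean        | 53   | 2     | In Chancer | Bleak House",
    "09     | 19   | 12.82 | train-clean-100  | 219  | 4     | Chapter 08 | Northanger Abbey",
]
rows = parse_chapters(sample)
subsets = sorted({row["subset"] for row in rows})
print(subsets)
```

Comparing the resulting subset names against the fake directory names is a quick way to spot a split that has no matching chapter entry.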

Please help @rsepassi @cyfra @Conchylicultor

OK, now when I run librispeech_test.py I get no errors. Thanks @captain-pool for the help. @rsepassi @cyfra @Conchylicultor I am opening a pull request.
