Datasets: [data request] Mozilla Common Voice

Created on 1 Mar 2019 · 7 Comments · Source: tensorflow/datasets

Name of dataset: Common Voice
URL of dataset: https://voice.mozilla.org/en/datasets
License of dataset: Creative Commons Attribution Share-Alike 3.0 Unported license
Short description of dataset and use case(s): open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.

Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.

dataset request

rsepassi

👍8 🎉3 ❤2

All 7 comments

Hey @rsepassi, I'd like to work on this issue. Can you assign it to me? Thank you.

captain-pool on 1 Mar 2019

👍1

@captain-pool - please accept the collaborator invite.

cyfra on 1 Mar 2019

@rsepassi, @cyfra, the Mozilla Common Voice dataset has train.tsv, test.tsv, and validation.tsv files, which point to a set of mp3 files placed in a folder. If I read the tsv files using tf.data.TextLineDataset and try to separate the training and testing files so I can return them through SplitGenerator, I need to iterate over the entries of the tsv files. That iteration depends on the default graph when eager execution is not enabled, so getting the actual entries requires a sess.run().
So, can I use pandas to read the tsv files instead of TensorFlow library functions?

captain-pool on 2 Mar 2019

Never mind, I found a solution without using pandas.
It turns out that wrapping the Tensor object with tfds.as_numpy() does the job: it returns a generator to work on, without the need to call sess.run() explicitly.

captain-pool on 2 Mar 2019

In data generation, it is not necessary to use TF, so you can (and probably should) use regular Python libraries, though all file access should go through tf.io.gfile.

rsepassi on 2 Mar 2019
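
As an illustrative sketch of the advice above (not code from this thread), a split file such as train.tsv can be read with Python's standard `csv` module. The helper name `read_split_rows`, its `open_fn` parameter, and the `path`/`sentence` column names are assumptions made for this example. `open_fn` defaults to the built-in `open` so the snippet runs without TensorFlow; in a dataset builder you should be able to pass `tf.io.gfile.GFile` instead, so that file access goes through `tf.io.gfile`.

```python
import csv
import os
import tempfile


def read_split_rows(tsv_path, open_fn=open):
    """Return the rows of one split file (e.g. train.tsv) as a list of dicts.

    open_fn is an assumption of this sketch: it defaults to the built-in
    open so the example runs without TensorFlow. In a dataset builder,
    tf.io.gfile.GFile could be passed here instead.
    """
    with open_fn(tsv_path, "r") as f:
        # csv.DictReader only needs an iterable of text lines.
        return list(csv.DictReader(f, delimiter="\t"))


# Build a tiny stand-in for train.tsv; the column names are hypothetical.
tmp_dir = tempfile.mkdtemp()
train_tsv = os.path.join(tmp_dir, "train.tsv")
with open(train_tsv, "w", newline="") as f:
    f.write("path\tsentence\n")
    f.write("clip_0001.mp3\thello world\n")
    f.write("clip_0002.mp3\tgood morning\n")

rows = read_split_rows(train_tsv)

# Resolve each entry to an audio file path (the "clips" folder is assumed).
clip_paths = [os.path.join(tmp_dir, "clips", row["path"]) for row in rows]
```

Because the rows are plain Python dicts, splitting them into train/test/validation examples needs no graph or session.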

@rsepassi @cyfra @Conchylicultor Please take a look. I've added heavy configuration to support all the available languages, as asked by @rsepassi.

captain-pool on 8 Mar 2019

🎉1 😄1 👍1

This issue is being closed, as Mozilla Common Voice is now part of the TensorFlow Datasets package.
Please refer to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/audio/commonvoice.py and https://www.tensorflow.org/datasets/catalog/common_voice for more info.
Thank you for your contribution.