Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
Hey @rsepassi, I'd like to work on this issue. Can you assign it to me? Thank you.
@captain-pool - please accept the collaborator invite.
@rsepassi, @cyfra, the Mozilla Common Voice dataset has train.tsv, test.tsv and validation.tsv files, each of which points to a set of mp3 files placed in a folder. If I read the tsv files using tf.data.TextLineDataset and try to separate the training and testing files to return them through SplitGenerator, I need to iterate over the entries of the tsv files. When eager execution is not enabled, those entries are tensors in the default graph, so to get the actual values we need to call sess.run().
So, can I use pandas to read the tsv files instead of using TensorFlow library functions?
Never mind, I found a solution without using pandas.
Turns out, wrapping the Tensor object with tfds.as_numpy() does the job: it returns a generator to work with, without the need to call sess.run() explicitly.
In data generation, it is not necessary to use TF, so you can (and probably should) use regular Python libraries, though all file access should go through tf.io.gfile.
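As a rough illustration of that suggestion, here is a minimal sketch that reads one split's tsv file with the standard csv module instead of tf.data. The helper name `read_split`, the `open_fn` parameter, and the column names `path` and `sentence` are assumptions made for this example, not the code that was merged. Inside a TFDS builder you would pass `open_fn=tf.io.gfile.GFile` so that file access goes through tf.io.gfile; the default here is the built-in `open` only so the sketch runs without TensorFlow installed.

```python
import csv
import os
import tempfile


def read_split(tsv_path, clips_dir, open_fn=open):
    """Yield (audio_path, sentence) pairs from one split's tsv file.

    `open_fn` is a hypothetical hook: pass tf.io.gfile.GFile in a TFDS
    builder so reads go through tf.io.gfile.
    """
    with open_fn(tsv_path, "r") as f:
        # The column names "path" and "sentence" are assumed for this sketch.
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield os.path.join(clips_dir, row["path"]), row["sentence"]


# Small self-contained demo with a made-up two-row tsv file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_tsv = os.path.join(tmp_dir, "train.tsv")
    with open(demo_tsv, "w") as f:
        f.write("path\tsentence\n")
        f.write("sample_1.mp3\thello world\n")
        f.write("sample_2.mp3\tgood morning\n")
    rows = list(read_split(demo_tsv, "clips"))

for audio_path, sentence in rows:
    print(audio_path, "->", sentence)
```

Since the rows come back as plain Python strings, there is no graph or sess.run() involved, and the same function can be called once per split file (train.tsv, test.tsv, validation.tsv).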
@rsepassi @cyfra @Conchylicultor Please take a look. I've added heavy configuration to support all the available languages, as asked by @rsepassi.
This issue is being closed as Mozilla Common Voice is now part of the TensorFlow Datasets package.
Please refer to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/audio/commonvoice.py and https://www.tensorflow.org/datasets/catalog/common_voice for more info.
Thank you for your contribution.