Wav2letter: Dataset format documentation

Created on 22 Dec 2018 · 3 Comments · Source: flashlight/wav2letter

It would be helpful to document how to prepare a custom dataset in a form readable by
wav2letter++ pipelines.

What is the correct audio format, sampling rate, transcription format, directory structure, etc.?


smolendawid

👍8

Most helpful comment

Hi,
We are in the process of writing more documentation, including walk-through examples on a sample dataset. We should have these, with a much more detailed explanation, in the coming days. Here, I explain the most typical setting to get started.

how to prepare a custom dataset

Training the acoustic model: audio file -> subword units (graphemes, phonemes, etc.).

I consider graphemes here.

Token Dictionary: consists of all the graphemes you would like the acoustic model to predict. Typically, a token.txt file would look like this:

|
'
a
b
c 
...
... (and so on)
z

We use "|" to denote space.
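A minimal sketch of generating such a token file, assuming the elided entries are simply the lowercase English letters a to z. The helper names and the output path are hypothetical, not part of wav2letter++:

```python
import string


def build_token_list():
    """Return the grapheme tokens: '|' (space), "'" and then a-z.

    Assumption: the alphabet is lowercase English only.
    """
    return ["|", "'"] + list(string.ascii_lowercase)


def write_token_file(path):
    """Write one token per line to `path` (hypothetical location)."""
    with open(path, "w", encoding="utf-8") as f:
        for token in build_token_list():
            f.write(token + "\n")
```

Calling `write_token_file("token.txt")` would produce a 28-line file under these assumptions. For another language or token set, only `build_token_list` needs to change.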

Audio, Target, Word file: Create a separate directory each for train, valid and test. Each folder should contain .wav, .tkn and .wrd files numbered like 000000000.wav, 000000000.tkn, 000000000.wrd, 000000001.wav, 000000001.tkn, 000000001.wrd and so on. The folders are specified using the -datadir, -train and -valid flags during training, and the -test flag for testing/decoding.

Let's say the transcription for sample 000000001 is "hello world".
000000001.tkn would look like h e l l o | w o r l d
000000001.wrd would look like hello world
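The mapping from a word-level transcription to the .tkn line can be sketched as follows. The function name is hypothetical, and the sketch assumes every character of every word is a token in the dictionary:

```python
def words_to_tokens(transcription):
    """Convert a word-level transcription into a grapheme token line.

    Graphemes within a word are separated by spaces, and words are
    separated by the "|" token.
    """
    words = transcription.split()
    return " | ".join(" ".join(word) for word in words)


print(words_to_tokens("hello world"))  # h e l l o | w o r l d
```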

[~/speech/data/train] ls | sort | head -n 30
000000000.wav
000000000.id
000000000.tkn
000000000.wrd
000000001.wav
000000001.id
000000001.tkn
000000001.wrd
000000002.wav
000000002.id
000000002.tkn
000000002.wrd
000000003.wav
000000003.id
000000003.tkn
000000003.wrd
000000004.wav
000000004.id
000000004.tkn
000000004.wrd
000000005.wav
000000005.id
000000005.tkn
000000005.wrd
000000006.wav
000000006.id
000000006.tkn
000000006.wrd
000000007.wav
000000007.id
000000007.tkn
000000007.wrd
// Ignore '.id' files as they are not used in the pipelines now.
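A rough sketch of producing this layout from existing recordings. The helper names are hypothetical, and it assumes you already have a .wav file and a plain-text transcription for each sample. Ending each file with a newline is also an assumption:

```python
import os
import shutil


def sample_stem(index):
    """Zero-pad the sample index to nine digits, e.g. 1 -> '000000001'."""
    return f"{index:09d}"


def sample_paths(directory, index):
    """Return the .wav/.tkn/.wrd paths for one sample."""
    stem = sample_stem(index)
    return {ext: os.path.join(directory, f"{stem}.{ext}")
            for ext in ("wav", "tkn", "wrd")}


def write_sample(directory, index, wav_source, transcription):
    """Copy the audio and write the matching .tkn and .wrd files."""
    os.makedirs(directory, exist_ok=True)
    paths = sample_paths(directory, index)
    shutil.copyfile(wav_source, paths["wav"])
    words = transcription.split()
    with open(paths["wrd"], "w", encoding="utf-8") as f:
        f.write(" ".join(words) + "\n")
    with open(paths["tkn"], "w", encoding="utf-8") as f:
        # Graphemes separated by spaces, words separated by "|".
        f.write(" | ".join(" ".join(w) for w in words) + "\n")
    return paths
```

For example, `write_sample("data/train", 0, "recording.wav", "hello world")` would create 000000000.wav, 000000000.tkn and 000000000.wrd under data/train. The source path here is a placeholder.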

Testing/decoding

Lexicon : word -> list of graphemes
Language Model: You can use standard n-gram LMs, although the framework is generic enough to plug in ConvLMs, RNN LMs or anything else.
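As an illustration of the word -> graphemes mapping, here is a small sketch that builds a lexicon from a list of words. The helper names are hypothetical, and the on-disk layout (one word per line, a tab, then space-separated graphemes) is an assumption; check the official documentation for the exact format expected by the decoder:

```python
def build_lexicon(words):
    """Map each distinct word to its list of graphemes."""
    lexicon = {}
    for word in words:
        lexicon.setdefault(word, list(word))
    return lexicon


def format_lexicon(lexicon):
    """Render the lexicon as text, one sorted entry per line.

    Assumed layout: word, a tab, then space-separated graphemes.
    """
    lines = []
    for word, graphemes in sorted(lexicon.items()):
        spelling = " ".join(graphemes)
        lines.append(f"{word}\t{spelling}")
    return "\n".join(lines)
```

The word list would typically come from the .wrd files of the training set, possibly extended with the vocabulary of the language model.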

What is the correct audio format, sampling rate, transcription format, directory structure, etc.?

We use sndfile to load the audio files. It supports many formats, including .wav and .flac. You can specify the format using the -input flag.

For the sample rate, 16 kHz is the default, but you can specify a different one using the -samplerate flag. Note that we require all the train/valid/test data to have the same sample rate for now.
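Since every file must share one sample rate, it can help to check a dataset before training. The sketch below uses Python's standard `wave` module, so it only covers PCM .wav files. The helper names are hypothetical and not part of wav2letter++:

```python
import wave


def read_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM .wav file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def mismatched(rates, expected=16000):
    """Return the sorted paths whose sample rate differs from `expected`.

    `rates` maps a file path to its sample rate in Hz.
    """
    return sorted(path for path, rate in rates.items() if rate != expected)
```

One way to use it is to build `rates` by calling `read_sample_rate` on each .wav file in the train, valid and test folders, then resample anything that `mismatched` reports.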

Transcriptions should be provided in .tkn and .wrd files as described above.

There is no specific directory structure that we require as long as the above guidelines are followed.

vineelpratap on 22 Dec 2018

👍3

All 3 comments

Hi,
We just posted documentation for Data preparation.

There is also a new tutorial section where you can find examples for getting started with wav2letter++. You can check it out here.

We will also gladly accept PRs on any tutorials / examples of cool applications you build with wav2letter++ so that others can benefit. Thanks!

vineelpratap on 26 Dec 2018

thank you

smolendawid on 3 Jan 2019
