It would be helpful to document how to prepare a custom dataset into a form readable in
wav2letter++ pipelines.
What are the correct audio format, sampling rate, transcription format, directory structure, etc.?
Hi,
We are in the process of writing more documentation, including walk-through examples on a sample dataset. We should have these, with a much more detailed explanation, in the coming days. Here, I explain the most typical setting to get started.
how to prepare a custom dataset
Training the acoustic model: audio file -> subword units (graphemes, phonemes, etc.).
I consider graphemes here.
The token.txt file would look like this:
|
'
a
b
c
...
... (and so on)
z
We use "|" to denote space.
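As a rough illustration of the token list above, here is a minimal Python sketch that writes such a file. The function names and the one-token-per-line layout with a trailing newline are assumptions made for this example; they are not part of wav2letter++.

```python
import string


def build_grapheme_tokens():
    """Return the grapheme tokens in the order shown above:
    the word separator "|", the apostrophe, then a through z."""
    return ["|", "'"] + list(string.ascii_lowercase)


def write_token_file(path):
    """Write one token per line to `path` (hypothetical helper)."""
    tokens = build_grapheme_tokens()
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n")
    return tokens


if __name__ == "__main__":
    # "token.txt" is the file name used in this thread.
    write_token_file("token.txt")
```

If your transcriptions contain other characters (digits, accented letters), they would need to be added to the list or normalized away before training.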
Each sample consists of a .wav, a .tkn and a .wrd file, numbered like 000000000.wav, 000000000.tkn, 000000000.wrd, 000000001.wav, 000000001.tkn, 000000001.wrd and so on. The folders for these are specified using the -datadir, -train and -valid flags during training, and the -test flag for testing/decoding. Let's say the transcription for sample 000000001 is "hello world".
000000001.tkn would look like h e l l o | w o r l d
000000001.wrd would look like hello world
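The mapping from a transcription to the two files can be sketched in Python as follows. This is only an illustration under some assumptions: the helper names are made up, the text is lowercased to match the a-z token set, and each file ends with a newline.

```python
from pathlib import Path


def words_to_tokens(transcript):
    """Spell out each word letter by letter and join words with " | "."""
    words = transcript.lower().split()
    return " | ".join(" ".join(word) for word in words)


def write_sample_transcripts(out_dir, index, transcript):
    """Write <index>.wrd and <index>.tkn using nine-digit, zero-padded names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{index:09d}"
    wrd = " ".join(transcript.lower().split())
    tkn = words_to_tokens(transcript)
    (out_dir / f"{stem}.wrd").write_text(wrd + "\n", encoding="utf-8")
    (out_dir / f"{stem}.tkn").write_text(tkn + "\n", encoding="utf-8")
    return stem


if __name__ == "__main__":
    print(words_to_tokens("hello world"))  # prints: h e l l o | w o r l d
```

Calling write_sample_transcripts("data/train", 1, "hello world") would create 000000001.wrd and 000000001.tkn in that (hypothetical) folder; the matching 000000001.wav has to be placed there separately.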
[~/speech/data/train] ls | sort | head -n 30
000000000.wav
000000000.id
000000000.tkn
000000000.wrd
000000001.wav
000000001.id
000000001.tkn
000000001.wrd
000000002.wav
000000002.id
000000002.tkn
000000002.wrd
000000003.wav
000000003.id
000000003.tkn
000000003.wrd
000000004.wav
000000004.id
000000004.tkn
000000004.wrd
000000005.wav
000000005.id
000000005.tkn
000000005.wrd
000000006.wav
000000006.id
000000006.tkn
000000006.wrd
000000007.wav
000000007.id
000000007.tkn
000000007.wrd
// Ignore '.id' files as they are not used in the pipelines now.
What are the correct audio format, sampling rate, transcription format, directory structure, etc.?
We use sndfile for loading the audio files. It supports many different formats, including .wav and .flac. You can specify the format using the -input flag.
For the sample rate, 16 kHz is the default, but you can specify a different one using the -samplerate flag. Note that, for now, we require all the train/valid/test data to have the same sample rate.
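Since all splits must share one sample rate, it may be worth checking the files before training. The sketch below uses Python's standard wave module, so it only covers PCM .wav files (for .flac you would need a different reader). The function names are hypothetical, and the 16000 default simply mirrors the 16 kHz default mentioned above.

```python
import wave
from pathlib import Path


def wav_sample_rate(path):
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def find_samplerate_mismatches(data_dirs, expected_rate=16000):
    """Return (path, rate) pairs for every .wav whose rate differs
    from expected_rate, across all the given folders."""
    mismatches = []
    for data_dir in data_dirs:
        for wav_path in sorted(Path(data_dir).glob("*.wav")):
            rate = wav_sample_rate(wav_path)
            if rate != expected_rate:
                mismatches.append((str(wav_path), rate))
    return mismatches
```

For example, find_samplerate_mismatches(["data/train", "data/valid", "data/test"]) would list any files that need resampling, assuming those folder names match your own layout.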
Transcriptions should be provided in .tkn and .wrd files, as described above.
There is no specific directory structure that we require as long as the above guidelines are followed.
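Because the only structural requirement is that each sample has its audio, .tkn and .wrd files side by side, a quick consistency check can catch missing files early. The following Python sketch is one way to do that; the function name and the audio_ext parameter are assumptions for illustration, and .id files are ignored as noted above.

```python
from pathlib import Path


def find_incomplete_samples(data_dir, audio_ext="wav"):
    """Map each sample stem to the list of extensions it is missing.

    A sample is complete when <stem>.<audio_ext>, <stem>.tkn and
    <stem>.wrd all exist in data_dir. Other files are ignored.
    """
    data_dir = Path(data_dir)
    required = (audio_ext, "tkn", "wrd")
    stems = {
        p.stem for p in data_dir.iterdir()
        if p.suffix.lstrip(".") in required
    }
    missing = {}
    for stem in sorted(stems):
        absent = [
            ext for ext in required
            if not (data_dir / f"{stem}.{ext}").exists()
        ]
        if absent:
            missing[stem] = absent
    return missing
```

An empty result means every sample in the folder is complete; pass audio_ext="flac" if your audio is stored in that format.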
Hi,
We just posted documentation for Data preparation.
There is also a new tutorial section where you can find examples for getting started with wav2letter++. You can check it out here.
We will also gladly accept PRs with tutorials or examples of cool applications you build with wav2letter++, so that others can benefit. Thanks!
thank you