How to correctly do data preparation for this algorithm?
Hello, I'm trying to train "Who Needs Words? Lexicon-Free Speech Recognition" on my own dataset (from a complex language domain). I followed the steps in wav2letter's "Data Preparation" section, but it's still unclear how to use the .lexicon, token.txt and .lst files I generated with this architecture. The script prepare.py uses and pre-processes the LibriSpeech data; should I convert my data to that format and replace the paths in the .py? And what kind of language model should I train (char-based or word-based, with or without a lexicon)? I may just be misreading the library's documentation, but any guidance would be helpful. Thank you!
Hi @JpMCarrilho
Sorry for the delay. Let me explain what you need to have and how to prepare it. It will probably be simpler to just create a new prepare.py for your data.
Tokens.txt
This file should contain the tokens, so that the model returns a probability for each frame across those tokens. The tokens file thus just defines the classes for the frame classification problem. For English we use the letters a-z, the apostrophe, and "|" as the word delimiter.
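As a rough sketch of how such a token set could be collected, the snippet below gathers every character used in the transcriptions and adds the word delimiter. The function name `build_tokens` and the sample sentences are hypothetical, and writing one token per line is an assumption about the file layout rather than something stated above.

```python
def build_tokens(transcriptions, word_delimiter="|"):
    """Collect every character used in the transcriptions, plus the word delimiter."""
    chars = set()
    for line in transcriptions:
        for word in line.split():
            chars.update(word)
    # The delimiter is added explicitly, so drop it if it appears in the text.
    chars.discard(word_delimiter)
    return [word_delimiter] + sorted(chars)


# Hypothetical sample transcriptions, for illustration only.
tokens = build_tokens(["hello world", "it's fine"])

# Assumed layout: one token per line.
print("\n".join(tokens))
```

For a language other than English, the same loop would simply pick up whatever characters appear in the transcriptions, so it is worth inspecting the result for stray punctuation before training.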
Lexicon file
It should contain a mapping of all your train/dev words to their token sequences, like "hello h e l l o |". The format is the word, then its token sequence separated by spaces, with "|" at the end to mark the word boundary.
So you just need to collect the set of all words from the train transcriptions and map each one to its letter sequence.
The lexicon is used to map the target transcription (which is a standard word sequence) to the token sequence.
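The collection step described above could look roughly like this. It is only a sketch: `build_lexicon` and the sample transcriptions are hypothetical, and it assumes every word is spelled out letter by letter with "|" appended, as in the "hello" example.

```python
def build_lexicon(transcriptions, word_delimiter="|"):
    """Return lexicon lines of the form 'word t o k e n s |' for every unique word."""
    words = sorted({w for line in transcriptions for w in line.split()})
    # Joining a string with spaces separates its characters: "hello" -> "h e l l o".
    return [f"{w} {' '.join(w)} {word_delimiter}" for w in words]


# Hypothetical sample transcriptions, for illustration only.
lexicon_lines = build_lexicon(["hello world", "hello there"])

for line in lexicon_lines:
    print(line)
```

Writing `lexicon_lines` to a file, one entry per line, would then give a lexicon in the format described above.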
List file
The list file is the input data format for the model. Each line has the format
id file_path file_duration_in_ms target_transcription
The target transcription is just the word-level transcription.
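A minimal sketch of producing one such line is shown below. The helpers `wav_duration_ms` and `list_line`, the sample id and the path are all hypothetical. It assumes uncompressed WAV audio (Python's standard `wave` module does not read other formats), single-space separators, and ids and paths that contain no spaces.

```python
import wave


def wav_duration_ms(path):
    """Duration of an uncompressed WAV file in milliseconds."""
    with wave.open(path, "rb") as wav:
        return 1000.0 * wav.getnframes() / wav.getframerate()


def list_line(sample_id, path, duration_ms, transcription):
    """Format one list-file entry: id, path, duration in ms, word transcription."""
    return f"{sample_id} {path} {duration_ms:.2f} {transcription}"


# Hypothetical entry; in practice duration_ms would come from wav_duration_ms(path).
example = list_line("utt0001", "/data/audio/utt0001.wav", 2350.0, "hello world")
print(example)
```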
I hope this explains what you need and why. You could adapt prepare.py from the lexicon-free recipe, but what you need to modify depends on your language.
hi,
this may be a stupid question but,
If I have a dataset from which I define a set of representative tokens as in [a-z'|], doesn't this mean that the words in my dataset are eventually just lists of these tokens, i.e. hello: h e l l o | ?
So why construct the lexicon manually beforehand? Am I missing something?
Sorry, I don't understand what you mean. Do you mean that we don't need to prepare a lexicon because each entry is just a sequence of letters?
Yes. Is it just a convenience file/mapping, or does it involve a more sophisticated linguistic step that I'm missing?
You do need to provide the token set, because (as with LM training, for example) each token has to be assigned a class index. You can then skip the lexicon if you are sure there are no other tokens in your transcriptions (otherwise you will get a crash). We fall back to letters when a word is not present in the lexicon, and all of those letters must be in the token set. There is no sophisticated linguistic step, but the lexicon file lets you define special mappings for abbreviations, like "tv t | v |".
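To illustrate the behaviour described here, the sketch below maps a transcription to tokens, using the lexicon entry when one exists and falling back to the word's letters otherwise. It is not wav2letter's actual implementation: `words_to_tokens` is a hypothetical helper, and raising `ValueError` merely stands in for the crash mentioned above.

```python
def words_to_tokens(transcription, lexicon, tokens, word_delimiter="|"):
    """Map a word transcription to tokens, falling back to letters for unknown words."""
    output = []
    for word in transcription.split():
        spelling = lexicon.get(word)
        if spelling is None:
            # Fallback: spell the word out letter by letter and close it with "|".
            spelling = list(word) + [word_delimiter]
        unknown = [t for t in spelling if t not in tokens]
        if unknown:
            raise ValueError(f"word {word!r} uses tokens missing from the token set: {unknown}")
        output.extend(spelling)
    return output


tokens = set("abcdefghijklmnopqrstuvwxyz'") | {"|"}
# Special mapping for an abbreviation, as in the "tv" example above.
lexicon = {"tv": ["t", "|", "v", "|"]}

result = words_to_tokens("watch tv", lexicon, tokens)
print(" ".join(result))
```

Here "watch" is not in the lexicon and is spelled out from its letters, while "tv" uses the special mapping. A word containing a character outside the token set would raise the error.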
thank you!