Wav2letter: Training Lexicon-Free Speech Recognition in other languages

Created on 27 Jul 2020 · 6 Comments · Source: flashlight/wav2letter

Question

How to correctly do data preparation for this algorithm?

Additional Context

Hello, I'm trying to train "Who Needs Words? Lexicon-Free Speech Recognition" on my own dataset (from a complex language domain). I followed the steps in wav2letter's "Data Preparation" section, but it is still unclear how to use the .lexicon, token.txt and .lst files I generated with this architecture. The script prepare.py uses and pre-processes the LibriSpeech data; should I convert my data to this format and replace the paths in the .py file? What kind of language model should I train (character-based or word-based, with or without a lexicon)? I may simply be misreading the library, but any guidance would be helpful. Thank you!

Source

jpmcarrilho

Most helpful comment

Hi @JpMCarrilho

Sorry for the delay. Let me explain what you need to have and how to prepare it; it will probably be simpler to just create a new prepare.py for your data.

Tokens.txt

This file should contain the tokens, so that the model returns a probability for each token at every frame. The tokens file thus simply defines the classes for the frame classification problem. For English we use the letters a-z, the apostrophe, and "|" as the word delimiter.
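
As a rough illustration of the description above, the sketch below collects a token set from word-level transcriptions. It is not taken from the wav2letter recipes: the function name `build_tokens`, the sample sentences, and the one-token-per-line layout of the output text are assumptions made for this example.

```python
def build_tokens(transcriptions, word_sep="|"):
    """Collect every distinct character used in the transcriptions.

    The word separator is listed first; the remaining tokens are sorted
    so that the output is deterministic.
    """
    chars = set()
    for text in transcriptions:
        for word in text.split():
            chars.update(word)
    # The separator is added explicitly, so drop it if it occurs in the text.
    chars.discard(word_sep)
    return [word_sep] + sorted(chars)


# Hypothetical transcriptions, standing in for your own train/dev data.
tokens = build_tokens(["hello world", "don't go"])

# Assumed layout: one token per line. Write this text to your tokens file.
tokens_file_text = "\n".join(tokens) + "\n"
print(tokens_file_text)
```

For a language other than English, the same idea applies, but you may want to normalize the text first (case, punctuation, Unicode form) so that the token set stays small.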

Lexicon file

It should contain a mapping from every word in your train/dev sets to its token sequence, like "hello h e l l o |": the format is the word, followed by its tokens separated by spaces, with "|" at the end to mark the word boundary.

So you just need to collect the set of all words from the train transcriptions and map each one to its letter sequence.
The lexicon is used to map the target transcription (a standard word sequence) to the token sequence.
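
A minimal sketch of that word-collection step, assuming a purely letter-based spelling and a single space between the word and its tokens, as in the "hello h e l l o |" example. The function name `build_lexicon` and the sample sentences are hypothetical.

```python
def build_lexicon(transcriptions, word_sep="|"):
    """Return one lexicon line per distinct word: the word, its letters, then the separator."""
    words = sorted({word for text in transcriptions for word in text.split()})
    return [f"{word} {' '.join(word)} {word_sep}" for word in words]


# Hypothetical transcriptions, standing in for your own train/dev data.
lexicon_lines = build_lexicon(["hello world", "hello there"])
for line in lexicon_lines:
    print(line)
```

Each printed line can be written to the lexicon file as-is. Words that need a non-trivial spelling would have to be handled separately.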

List file

The list file is the input data format for the model. Each line has the form

id file_path file_duration_in_ms target_transcription

The target transcription is just the word-level transcription.
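
To make the list format concrete, here is a small sketch that builds one such line. The helper names, the sample id, the audio path, and the two-decimal duration formatting are assumptions for illustration; only the field order comes from the format above.

```python
def duration_ms(num_frames, sample_rate):
    """Audio duration in milliseconds from the frame count and sampling rate."""
    return 1000.0 * num_frames / sample_rate


def format_lst_line(sample_id, audio_path, duration, transcription):
    """Build one list-file line: id, path, duration in ms, word transcription.

    Fields are separated by spaces, so the id and the path must not
    contain spaces themselves.
    """
    words = " ".join(transcription.split())  # collapse repeated whitespace
    return f"{sample_id} {audio_path} {duration:.2f} {words}"


# Hypothetical sample: 32000 frames at 16 kHz is two seconds of audio.
line = format_lst_line(
    "utt001",
    "/data/audio/utt001.wav",
    duration_ms(32000, 16000),
    "hello   world",
)
print(line)
```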

I hope this explains what you need to have and why. You could adapt prepare.py from the lexicon-free recipe, but what you need to modify depends on your language.

tlikhomanenko on 7 Aug 2020


All 6 comments

Hi,

this may be a stupid question, but:

If I define a set of representative tokens for my dataset, as in [a-z'|], doesn't that mean every word in my dataset is ultimately just a list of these tokens, e.g. hello: h e l l o | ?

So why construct the lexicon manually beforehand? Am I missing something?

ozancaglayan on 7 Aug 2020

Sorry, I don't understand what you mean. Do you mean that we don't need to prepare a lexicon because each word is just a sequence of letters?

tlikhomanenko on 8 Aug 2020

Yes. Is it just a convenience file/mapping, or does it involve a more sophisticated linguistic step that I'm missing?

ozancaglayan on 8 Aug 2020

You need to provide the token set in any case, just as for LM training, for example, because you have to specify the class index for each token. You can then skip the lexicon if you are sure there are no other tokens in your transcriptions (otherwise it will crash). We fall back to letters if a word is not present in the lexicon, and all of those letters must be in the token set. There is no sophisticated linguistic step, but with a lexicon file you can define special mappings for abbreviations, like "tv t | v |".
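
The behaviour described here can be pictured with the following sketch. It is an illustration only, not the library's actual implementation: `words_to_tokens`, the in-memory lexicon dictionary, and the `ValueError` are all assumptions standing in for the real lookup and crash.

```python
def words_to_tokens(transcription, lexicon, tokens, word_sep="|"):
    """Map a word transcription to a token sequence.

    Words found in the lexicon use their listed spelling; all other words
    fall back to their letters followed by the word separator.
    """
    sequence = []
    for word in transcription.split():
        if word in lexicon:
            spelling = lexicon[word]
        else:
            spelling = list(word) + [word_sep]  # fall back to letters
        unknown = [t for t in spelling if t not in tokens]
        if unknown:
            # Stands in for the crash mentioned above.
            raise ValueError(f"tokens {unknown} of word '{word}' are not in the token set")
        sequence.extend(spelling)
    return sequence


# English token set from the thread: a-z, apostrophe and "|".
tokens = set("abcdefghijklmnopqrstuvwxyz'|")
# Special spelling for an abbreviation, as in the "tv t | v |" example.
lexicon = {"tv": ["t", "|", "v", "|"]}

result = words_to_tokens("tv on", lexicon, tokens)
print(" ".join(result))
```

Here "tv" takes its special spelling from the lexicon, while "on" is absent from it and is spelled letter by letter. A word containing a character outside the token set, such as "é", would raise the error.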

tlikhomanenko on 8 Aug 2020

thank you!

ozancaglayan on 8 Aug 2020
