Wav2letter: Training Lexicon-Free Speech Recognition in other languages

Created on 27 Jul 2020 · 6 Comments · Source: flashlight/wav2letter

Question

How to correctly do data preparation for this algorithm?

Additional Context

Hello, I'm trying to train "Who Needs Words? Lexicon-Free Speech Recognition" on my own dataset (from a complex language domain). I followed the steps in wav2letter's "Data Preparation" section, but it's still unclear how to use the .lexicon, token.txt and .lst files I generated with this architecture. The prepare.py script uses and pre-processes the LibriSpeech data; should I convert my data to that format and replace the paths in the .py file? And what kind of language model should I train (character- or word-based, with or without a lexicon)? I may simply be misreading the library, but any guidance would be helpful. Thank you!

Most helpful comment

Hi @JpMCarrilho

Sorry for the delay. Let me explain what you need and how to prepare it; it will probably be simpler to just write a new prepare.py for your data.

Tokens.txt

This file should contain the tokens; the model returns a probability for each token at every frame. The tokens file thus simply defines the classes for the frame classification problem. For English we use the letters a-z, the apostrophe, and "|" as the word delimiter.
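As a rough illustration, the token set for a new language could be collected from the training transcriptions themselves. The sketch below is not part of wav2letter; the function name `build_token_set` and the one-token-per-line output layout are assumptions made for this example.

```python
def build_token_set(transcriptions, word_sep="|"):
    """Collect every character that occurs in the transcriptions.

    Hypothetical helper (not a wav2letter API): returns the word
    separator first, followed by the sorted characters.
    """
    chars = set()
    for line in transcriptions:
        for word in line.split():
            chars.update(word)
    # The separator is a token of its own, so keep it out of the letters.
    chars.discard(word_sep)
    return [word_sep] + sorted(chars)


def tokens_file_text(tokens):
    """Render the tokens one per line (assumed file layout)."""
    return "\n".join(tokens) + "\n"


tokens = build_token_set(["hello world", "don't"])
print(tokens_file_text(tokens))
```

For a language with a larger or non-Latin alphabet, it is worth inspecting the resulting set by hand, since stray punctuation or digits in the transcriptions would otherwise become classes too.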

Lexicon file

It should contain a mapping from every word in your train/dev sets to its token sequence, for example "hello h e l l o |". The format is the word followed by its tokens separated by spaces, with "|" at the end to mark the word boundary.

So you just need to collect the set of all words from the training transcriptions and map each one to its letter sequence.
The lexicon is used to map the target transcription (a standard word sequence) to the token sequence.
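A minimal sketch of that step, assuming a purely letter-based spelling and a single space between the word and its tokens (as in the "hello" example above). The name `build_lexicon` is hypothetical and not taken from the wav2letter recipes.

```python
def build_lexicon(transcriptions, word_sep="|"):
    """Return one lexicon line per unique word: 'word t o k e n s |'.

    Hypothetical helper: each word is spelled out letter by letter and
    closed with the word separator.
    """
    words = sorted({w for line in transcriptions for w in line.split()})
    return [f"{w} {' '.join(w)} {word_sep}" for w in words]


for entry in build_lexicon(["hello world", "hello there"]):
    print(entry)
```

Entries that should not be spelled letter by letter (abbreviations, for instance) would need to be written or overridden by hand.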

List file

The list file is the input data format for the model. Each line has the form

id file_path file_duration_in_ms target_transcription

The target transcription is simply the word-level transcription.
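The following sketch shows one way such a line could be produced. It assumes uncompressed WAV input (read with Python's standard `wave` module) and space-separated fields; `wav_duration_ms` and `format_list_line` are hypothetical names chosen for this example, and the two-decimal duration format is an arbitrary choice.

```python
import wave


def wav_duration_ms(path):
    """Duration of an uncompressed WAV file in milliseconds."""
    with wave.open(path, "rb") as wav:
        return 1000.0 * wav.getnframes() / wav.getframerate()


def format_list_line(sample_id, audio_path, duration_ms, transcription):
    """Build one list-file line: id, path, duration in ms, transcription."""
    # Collapse repeated whitespace so the transcription is a clean word sequence.
    words = " ".join(transcription.split())
    return f"{sample_id} {audio_path} {duration_ms:.2f} {words}"


print(format_list_line("utt-001", "/data/utt-001.wav", 2350.0, "hello  world"))
```

In practice the duration would come from `wav_duration_ms(audio_path)`; other audio formats would need a different reader.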

I hope this explains what you need and why. You could adapt prepare.py from the lexicon-free recipe, but what you need to modify depends on your language.

All 6 comments

Hi,

This may be a stupid question, but:

If I define a set of representative tokens for my dataset, as in [a-z'|], doesn't that mean every word in my dataset is ultimately a list of these tokens, i.e. hello: h e l l o |?

So why construct the lexicon manually beforehand? Am I missing something?

Sorry, I don't understand what you mean. Do you mean that we don't need to prepare a lexicon because each entry is just a sequence of letters?

Yes. Is it just a convenience file/mapping, or does it involve a more sophisticated linguistic step that I'm missing?

You need to provide the token set, just as for LM training for example, because you have to specify the class index for each token. You can then skip the lexicon if you are sure there are no other tokens in your transcriptions (otherwise it will crash). We fall back to letters if a word is not present in the lexicon, and all of those letters must be in the token set. There is no sophisticated linguistic step, but with a lexicon file you can define special mappings for abbreviations, like "tv t | v |".
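The behaviour described here can be illustrated with a small sketch: look the word up in the lexicon, fall back to its letters otherwise, and fail if any resulting token is missing from the token set. This is only an illustration of the idea, not wav2letter's actual implementation; `words_to_tokens` and its arguments are hypothetical.

```python
def words_to_tokens(transcription, lexicon, tokens, word_sep="|"):
    """Map a word transcription to a token sequence.

    lexicon: dict from word to its list of tokens (may be empty).
    tokens:  set of all valid tokens.
    Words missing from the lexicon fall back to their letters.
    """
    result = []
    for word in transcription.split():
        spelling = lexicon.get(word)
        if spelling is None:
            # Fallback: spell the word letter by letter.
            spelling = list(word) + [word_sep]
        unknown = [t for t in spelling if t not in tokens]
        if unknown:
            raise ValueError(f"tokens {unknown} in '{word}' are not in the token set")
        result.extend(spelling)
    return result


tokens = set("abcdefghijklmnopqrstuvwxyz'|")
lexicon = {"tv": ["t", "|", "v", "|"]}  # special mapping for an abbreviation
print(words_to_tokens("tv on", lexicon, tokens))
```

Here "tv" uses its special lexicon entry while "on" is spelled from its letters; a word containing a character outside the token set raises an error instead of being silently dropped.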

thank you!
