Wav2letter: creating a language model for German

Created on 1 Mar 2019 · 5 Comments · Source: flashlight/wav2letter

Hi there,

I want to build a language model for German. Can I use KenLM for German as well, as described in the tutorial for the English model?
If yes, does the prepare_lm.py file need any changes? I think it is specific to the Librispeech data.
If no, which language model can be used instead? Any starting point for my search would help.

Thank you :)

question

megharangaswamy

👍1

All 5 comments

@megharangaswamy — yes, you should be able to train and use a German model with KenLM.

One option is to use KenLM without the prepare_lm.py script: as long as you can convert your LM into binary format (for example with KenLM's build_binary), you can pass it to the decoder.
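A minimal sketch of that KenLM-only route, written in Python for convenience. It assumes the KenLM executables `lmplz` and `build_binary` are installed and on the PATH; the file names and the n-gram order are hypothetical placeholders, not values from this thread.

```python
import shutil
import subprocess


def kenlm_commands(corpus_txt, arpa_path, binary_path, order=4):
    """Return the two KenLM command lines: estimate an ARPA LM, then binarize it."""
    # lmplz reads one sentence per line and writes an ARPA file.
    lmplz = ["lmplz", "-o", str(order), "--text", corpus_txt, "--arpa", arpa_path]
    # build_binary converts the ARPA file into KenLM's binary format.
    build_binary = ["build_binary", arpa_path, binary_path]
    return [lmplz, build_binary]


def build_lm(corpus_txt, arpa_path, binary_path, order=4):
    """Run both commands in sequence; fails early if a KenLM binary is missing."""
    for cmd in kenlm_commands(corpus_txt, arpa_path, binary_path, order):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling `build_lm("german_transcripts.txt", "german.arpa", "german.bin")` would then produce a binary LM file that can be handed to the decoder, provided the text file holds one normalized sentence per line.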

The prepare_lm.py script just extracts the data the LM needs from the transcripts (e.g. it outputs a lexicon in w2l format). If you have German transcripts available, you should be able to repurpose the script for them.
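As a rough illustration of the lexicon step, the sketch below builds lexicon lines from a list of transcripts. It is not the actual prepare_lm.py logic. It assumes a letter-based lexicon in which each line holds a word, a tab, and the word's letters separated by spaces and followed by a `|` word-boundary token; check this layout against the lexicon used in your own wav2letter setup.

```python
def build_lexicon(transcripts, boundary="|"):
    """Map every distinct word to its letter spelling, one lexicon line per word."""
    # Lowercasing is an assumption; apply the same normalization as for your token set.
    words = sorted({w for line in transcripts for w in line.lower().split()})
    # Assumed layout: "<word>\t<l e t t e r s> |"
    return [f"{w}\t{' '.join(w)} {boundary}" for w in words]


if __name__ == "__main__":
    for entry in build_lexicon(["Guten Morgen", "guten Abend"]):
        print(entry)
```

Because Python strings are Unicode, German letters such as ä, ö, ü and ß are treated as single tokens here, so they need to be present in the acoustic model's token set as well.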

jacobkahn on 5 Mar 2019

👍1

Dear @jacobkahn

I am confused about LM creation. Could you please clarify the following questions?

1) If I want to create an LM for a different language using KenLM, do I need a huge data set?
2) For training my acoustic model I used only 5.2 GB of German data, and this is all the data I have. Is it reasonable to use this data for LM creation?
3) Can I use any other LM available for German? For example, I came across a German LM created by the Zamia project.
Thank you :)

megharangaswamy on 15 Mar 2019

@megharangaswamy —

Having a larger dataset can improve the quality of your LM in most cases, but might not always help when decoding: training an LM on many words that are out-of-vocabulary with respect to your transcripts can sometimes lead to inconsistent scoring at decoding time.
In general, training an LM on the same transcripts that you train your acoustic model on has worked well for us.
If the LM is in ARPA format, you can use KenLM to convert it into a binary format that wav2letter can read.
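One way to get a feel for the out-of-vocabulary concern before training on an external corpus is to measure how many of its tokens never occur in the acoustic-model transcripts. The helper below is a hypothetical sketch of that check; the function name and the lowercase whitespace tokenization are assumptions, and the thread does not state any threshold for what counts as too many.

```python
def oov_rate(lm_corpus_lines, transcript_lines):
    """Fraction of LM-corpus tokens that do not appear in the transcript vocabulary."""
    vocab = {w for line in transcript_lines for w in line.lower().split()}
    tokens = [w for line in lm_corpus_lines for w in line.lower().split()]
    if not tokens:
        return 0.0
    oov = sum(1 for w in tokens if w not in vocab)
    return oov / len(tokens)


if __name__ == "__main__":
    external = ["das ist gut", "das haus"]
    transcripts = ["das ist ein test"]
    print(f"OOV rate: {oov_rate(external, transcripts):.2f}")
```

A high rate suggests that much of the external text lies outside what the acoustic model was trained to produce, which is the situation described above.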

jacobkahn on 19 Mar 2019

👍1

Hi, I would like to build a model for a few Indian languages. Like German, they have compound words, and there won't really be a definite lexicon, only definite lemmas. Can we follow the same procedure used for English?

Krishna-suraj on 19 Dec 2019

@Krishna-suraj,

For the acoustic model you should define the token set as usual for English; this could be letters or any other tokens (e.g. word pieces). The token set defines which tokens can be predicted for each frame.
For the language model you can train KenLM (or any other language model) on sequences of tokens, just as you would on sequences of words. In this case tokens are separated by spaces, and the token set should be the same as for the acoustic model. If you have a notion of word boundaries, you should use a dedicated token for them, for example |, so that it appears in your target transcriptions and also between words in the language model training text (this applies only if you train on sub-word units rather than on words).
For beam-search decoding you can use the lexicon-free decoder; details are at https://github.com/facebookresearch/wav2letter/wiki/Beam-Search-Decoder in the "Lexicon-free beam-search decoder" section.
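The token-level LM training text described above can be sketched as follows for the letter case. This is an illustrative assumption rather than the recipe's actual preprocessing: each word is split into letters, and a `|` token is placed after every word, including the last one. Whether the boundary token should trail the final word depends on how your acoustic-model targets are written, so match that convention.

```python
def to_token_line(sentence, boundary="|"):
    """Rewrite a sentence as space-separated letters with a word-boundary token."""
    words = sentence.lower().split()
    return " ".join(" ".join(w) + " " + boundary for w in words)


if __name__ == "__main__":
    print(to_token_line("Guten Morgen"))
```

Writing one such line per transcript yields a text file on which KenLM can be trained exactly as on word-level text; the n-grams are then over letters and `|` instead of words.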

As an example of how we did lexicon-free decoding for English on letters, you can check the recipe https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/lexicon_free/librispeech