Hi there,
I want to build a language model for German. Can I use KenLM for German as well, as described in the tutorial for the English model?
If yes,
are any changes needed in the prepare_lm.py file? I think it was written specifically for the LibriSpeech data.
If no,
which language model can be used? Any starting point for my search would help.
Thank you :)
@megharangaswamy — yes, you should be able to train and use a German model with KenLM.
One option is to use KenLM without the prepare_lm.py script. As long as you can convert your LM into binary format (you can use KenLM's build_binary), you can pass it to the decoder.
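A minimal sketch of that route, assuming KenLM's `lmplz` and `build_binary` executables are already built and on your PATH. The file names and the n-gram order are placeholders, not values from the tutorial:

```python
import shutil
import subprocess


def kenlm_commands(corpus, arpa, binary, order=4):
    """Return the two KenLM command lines: train an ARPA model, then binarize it."""
    # lmplz estimates an n-gram model from plain text, one sentence per line.
    train = ["lmplz", "-o", str(order), "--text", corpus, "--arpa", arpa]
    # build_binary converts the ARPA file into KenLM's binary format.
    convert = ["build_binary", arpa, binary]
    return [train, convert]


def build_lm(corpus, arpa, binary, order=4):
    """Run both steps, failing early if a KenLM tool is missing."""
    for cmd in kenlm_commands(corpus, arpa, binary, order):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH; build KenLM first")
        subprocess.run(cmd, check=True)
```

Calling `build_lm("german.txt", "german.arpa", "german.bin")` would then produce the binary file to pass to the decoder (the file names here are hypothetical).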
The prepare_lm.py script just extracts the data the LM needs from the transcripts (e.g. it outputs a lexicon in w2l format). You should be able to repurpose the script if you have German transcripts available.
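If you would rather not adapt the script, the lexicon step can be sketched roughly as below. This is not the code from prepare_lm.py. It assumes letter tokens, a tab between the word and its spelling, and a trailing `|` word-boundary token, so check these against your own token set before using it:

```python
def build_lexicon(transcripts):
    """Map each distinct word to its letter spelling plus a `|` boundary token."""
    words = set()
    for line in transcripts:
        # Lowercasing is an assumption; drop it if your tokens are case-sensitive.
        words.update(line.lower().split())
    # One lexicon line per word: "<word>\t<l e t t e r s> |"
    return [f"{w}\t{' '.join(w)} |" for w in sorted(words)]
```

For example, `build_lexicon(["guten Morgen", "guten Tag"])` yields one line each for `guten`, `morgen`, and `tag`.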
Dear @jacobkahn
I am confused about LM creation. Could you please clarify a few questions?
1) If I want to create an LM for a different language using KenLM, do I need a huge dataset?
2) To train my acoustic model I used only 5.2 GB of German data, and this is all the data I have. Is it reasonable to use this data for LM creation?
3) Can I use any other LM available for German? For example, I came across a German LM created by the Zamia project.
Thank you :)
@megharangaswamy —
Hi, I would like to build a model for a few Indian languages. Like German, they have compound words, so there won't really be a definite lexicon, only definite lemmas. Can we follow the same procedure used for English?
@Krishna-suraj,
| token for the word boundary, so your target transcriptions contain it, and the language model also has this token between words (this applies only if you train on sub-word units rather than on words). As an example of how we did lexicon-free decoding for English on letters, you can check the recipe https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/lexicon_free/librispeech
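As a rough illustration of what "this token between words" means for the LM training text, here is a small sketch. It is not taken from the recipe, and both the trailing `|` after the last word and the lowercasing are assumptions:

```python
def to_letter_tokens(sentence, boundary="|"):
    """Spell a sentence as space-separated letters with a boundary token after each word."""
    words = sentence.lower().split()
    return " ".join(" ".join(w) + " " + boundary for w in words)
```

Each line of the LM corpus would be converted this way before training a letter-level LM, so that the LM and the acoustic model targets share the same tokens.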