Hi, I am trying to use the KenLM decoder for beam search decoding in a speech recognition task. After I load the KenLM model with my lexicon and try to get start_state or set up the Trie, a segmentation fault occurs every time. Even when that step passes in a different environment, a segmentation fault still occurs once I enter the loop shown below.
At first I thought it was a memory issue, but I now have around 120 GB of memory, and the segmentation fault still happens even when I use a small set of text to create a toy binary and lexicon.
I have attached the lexicon, token dict and arpa file for reference (Files.zip), along with the error_backtrace that I receive.
I have also tried loading the arpa file directly.
Can you help me figure out what I am doing wrong and suggest a solution?
Here is a sample of the code I am using:
class W2lKenLMDecoder(W2lDecoder):
    def __init__(self, tgt_dict):
        super().__init__(tgt_dict)
        self.silence = (
            tgt_dict.index("<ctc_blank>")
            if "<ctc_blank>" in tgt_dict.indices
            else tgt_dict.bos()
        )
        self.lexicon = load_words("args.lexicon")
        self.word_dict = create_word_dict(self.lexicon)
        self.unk_word = self.word_dict.get_index("<unk>")
        self.lm = KenLM(args.kenlm, self.word_dict)
        self.trie = Trie(self.vocab_size, self.silence)  # segmentation fault occurs at this line
        start_state = self.lm.start(False)  # segmentation fault occurs at this line
        for i, (word, spellings) in enumerate(self.lexicon.items()):
            word_idx = self.word_dict.get_index(word)
            _, score = self.lm.score(start_state, word_idx)  # segmentation fault occurs at this line
            for spelling in spellings:
                spelling_idxs = [tgt_dict.index(token) for token in spelling]
                assert (
                    tgt_dict.unk() not in spelling_idxs
                ), f"{spelling} {spelling_idxs}"
                self.trie.insert(spelling_idxs, word_idx, score)
        self.trie.smear(SmearingMode.MAX)
Why do you have self.lm = KenLM("args.kenlm", self.word_dict)? With the quotes, the literal string "args.kenlm" is used as the file name for the KenLM model. Can you try self.lm = KenLM(args.kenlm, self.word_dict)?
Yeah, sorry, that is a typo. Earlier, to quickly check that the error was really in this part of the code, I hardcoded some file paths in my Jupyter notebook rather than passing them through args, and before posting here I must have forgotten to remove the quotes around args.kenlm. That part is correct in the actual code.
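For anyone who wants to rule out the quoted-path mistake discussed above, here is a minimal sketch that checks the input files exist before they are handed to the native bindings. The helper name check_decoder_files is hypothetical (it is not part of fairseq or wav2letter), and the keyword names are only labels for the error message.

```python
import os


def check_decoder_files(**paths):
    """Raise FileNotFoundError if any of the named paths is not an existing file.

    Hypothetical helper for illustration; keyword names are only used as labels.
    """
    missing = {name: p for name, p in paths.items() if not os.path.isfile(p)}
    if missing:
        details = ", ".join(f"{name}={p!r}" for name, p in missing.items())
        raise FileNotFoundError(f"decoder input file(s) not found: {details}")
    return paths


# Intended use inside the decoder's __init__, assuming `args` holds real paths:
#   check_decoder_files(lexicon=args.lexicon, kenlm=args.kenlm)
```

A literal string such as "args.kenlm" would fail this check with a readable Python error instead of reaching the C++ code.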
OK, could you first confirm that the binaries in the kenlm build dir, such as query, work fine with your LM?
Hi @tlikhomanenko, querying is working. Here is an image for reference.

Could you send an exact reproduction of your env setup (or a docker image, if it is reproducible there)? Several people have reported this problem, but I cannot repro it; it is possibly related to the python/conda env setup.
I am sharing the steps line by line here (machine used: Ubuntu 16/18):
To satisfy all the kenlm, openblas, glog, etc. dependencies:
sudo apt-get install liblzma-dev libbz2-dev libzstd-dev libsndfile1-dev libopenblas-dev libfftw3-dev libgflags-dev libgoogle-glog-dev
Installing Kenlm
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DKENLM_MAX_ORDER=20 -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j16
export KENLM_ROOT_DIR=<Kenlm root dir path>
conda create -n exp1 python=3.7
pip install packaging
pip install soundfile
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
cd ..
git clone https://github.com/facebookresearch/wav2letter.git -b v0.2
cd wav2letter/bindings/python
pip install -e .
I have already tried this with different torch versions (1.4, 1.5, 1.6, 1.7), but the error remains the same.
If you want a quick way to reproduce the problem, you can install wav2letter (python bindings) on Colab and use the files I have provided in Files.zip along with the code. You will see that the kernel dies when the code enters the loop that gets the LM scores.
Which conda version do you have? I will then try to repro your steps with the same conda version.
Conda version 4.8.3
Hi @tlikhomanenko, were you able to reproduce it? Is there any quick solution?
Hi, I'm having segfaults in the same fragment of code while trying to use a KenLM decoder for wav2vec 2.0. @amant555, were you able to figure out what the issue is? I've used the exact same setup commands for fairseq and wav2letter as you. I'm finding that the segfault occurs at the line self.trie.insert(spelling_idxs, word_idx, score). self.trie is a python binding for Trie.cpp under ~/install_dir/wav2letter/src/libraries/decoder/Trie.cpp (see issue https://github.com/facebookresearch/wav2letter/issues/462).
Sometimes I get a 'corrupted double linked-list' error instead of the segmentation fault, and sometimes a malloc() error. @tlikhomanenko Does this help diagnose the issue?
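To pin down which Python line triggers the crash (reports in this thread point at different lines), the standard library's faulthandler module can print the Python traceback when the process receives SIGSEGV or SIGABRT. This is only a diagnostic sketch; the function name enable_crash_traceback is made up for illustration, and it is best run from a plain script rather than a notebook.

```python
import faulthandler
import sys


def enable_crash_traceback():
    """Dump the Python traceback of all threads on fatal signals such as SIGSEGV.

    Writes to the interpreter's original stderr, so it should be called
    before constructing the decoder.
    """
    faulthandler.enable(file=sys.__stderr__, all_threads=True)
    return faulthandler.is_enabled()


enable_crash_traceback()
# ... then build the decoder (KenLM, Trie, trie.insert loop) as usual.
```

The same effect is available without code changes by starting the interpreter with python -X faulthandler.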
@tlikhomanenko Did you get a chance to solve this issue?
Hey, it got solved. Closing this issue.
@amant555, could you post how you solved it?
So sorry for the delay :( I haven't had time to repro. I was too busy with the cmake/installation/code refactor with others on the team. Hopefully the new version will be much simpler and more straightforward to install.
@amant555 How were you able to fix this issue? I'm still running into this problem and am unable to run the kenlm decoder.
Hi @tlikhomanenko @pkadambi,
Sorry for the late reply.
I was able to solve the issue by rebuilding kenlm without any of the flags I had used before (-DCMAKE_BUILD_TYPE=Release -DKENLM_MAX_ORDER=20 -DCMAKE_POSITION_INDEPENDENT_CODE=ON) and then rebuilding the wav2letter bindings.
This solved the segmentation fault.
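To check which of those flags an existing KenLM build directory was configured with, one option is to read its CMakeCache.txt, where CMake stores entries as NAME:TYPE=VALUE. The sketch below is only an illustration: the function name read_cmake_cache is hypothetical, and the thread does not confirm which of the three flags actually caused the crash.

```python
import re

# Flags mentioned in this thread; adjust as needed.
FLAGS = ("CMAKE_BUILD_TYPE", "KENLM_MAX_ORDER", "CMAKE_POSITION_INDEPENDENT_CODE")


def read_cmake_cache(text, keys=FLAGS):
    """Return {name: value} for the requested entries of a CMakeCache.txt body."""
    entry = re.compile(r"^([A-Za-z0-9_]+):[A-Z]+=(.*)$")
    found = {}
    for line in text.splitlines():
        match = entry.match(line.strip())  # comment lines (// or #) do not match
        if match and match.group(1) in keys:
            found[match.group(1)] = match.group(2)
    return found


# Intended use, assuming the build dir is <KENLM_ROOT_DIR>/build:
#   with open("<KENLM_ROOT_DIR>/build/CMakeCache.txt") as f:
#       print(read_cmake_cache(f.read()))
```

If any of these entries show non-default values, reconfiguring KenLM in a fresh build directory and then rebuilding the wav2letter bindings matches the fix described above.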