Wav2letter: While using kenlm decoder beam decoding getting error: segmentation fault

Created on 31 Oct 2020 · 15 Comments · Source: flashlight/wav2letter

Hi, I am trying to use the KenLM decoder for beam search decoding in a speech recognition task. After I load the KenLM model with my lexicon and try to get the start state or set up the Trie, a segmentation fault occurs every time. Even if those steps pass in a different environment, the segmentation fault then occurs when I start the loop shown below.

At first I thought it was a memory issue, but I now have around 120 GB of memory, and the segmentation fault still happens even when I build a toy binary and lexicon from a small set of text.
I have attached the lexicon, token dict and ARPA file for reference (Files.zip), along with the error_backtrace that I receive.
I have also tried loading the ARPA file directly.

Can you help me out on what I am doing wrong and suggest a solution for it?

Here is the sample of code I am using:

class W2lKenLMDecoder(W2lDecoder):
    def __init__(self, tgt_dict):
        super().__init__(tgt_dict)

        self.silence = (
            tgt_dict.index("<ctc_blank>")
            if "<ctc_blank>" in tgt_dict.indices
            else tgt_dict.bos()
        )
        self.lexicon = load_words(args.lexicon)
        self.word_dict = create_word_dict(self.lexicon)
        self.unk_word = self.word_dict.get_index("<unk>")

        self.lm = KenLM(args.kenlm, self.word_dict)
        self.trie = Trie(self.vocab_size, self.silence)  # segmentation fault occurs on this line

        start_state = self.lm.start(False)  # segmentation fault occurs on this line
        for i, (word, spellings) in enumerate(self.lexicon.items()):
            word_idx = self.word_dict.get_index(word)
            _, score = self.lm.score(start_state, word_idx)  # segmentation fault occurs on this line
            for spelling in spellings:
                spelling_idxs = [tgt_dict.index(token) for token in spelling]
                assert (
                    tgt_dict.unk() not in spelling_idxs
                ), f"{spelling} {spelling_idxs}"
                self.trie.insert(spelling_idxs, word_idx, score)
        self.trie.smear(SmearingMode.MAX)
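Since the crash is reported at several different lines, one way to narrow it down is Python's standard faulthandler module, which writes the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The following is only a minimal sketch: the log file name is hypothetical, and the commented-out constructor call stands in for the class above.

```python
import faulthandler

# Keep the file object alive for the lifetime of the process; faulthandler
# writes to its file descriptor when a fatal signal arrives.
# "segfault_traceback.log" is a hypothetical file name.
crash_log = open("segfault_traceback.log", "w")

# On SIGSEGV (and other fatal signals), dump the Python traceback of all
# threads to the log file before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)

# Placeholder: construct the decoder after this point, for example
# decoder = W2lKenLMDecoder(tgt_dict)
```

After a crash, the topmost frame in the log should indicate which Python line was executing when the native code faulted. Writing to a regular file rather than stderr is a deliberate choice here, because notebook environments may not expose a real file descriptor for stderr.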

All 15 comments

Why do you have self.lm = KenLM("args.kenlm", self.word_dict)? With the quotes, the literal string "args.kenlm" is used as the KenLM file name. Can you try self.lm = KenLM(args.kenlm, self.word_dict)?

Yeah, sorry, that is a typo. Earlier, to quickly check that the error really was in this part of the code, I hardcoded some file paths in my Jupyter notebook rather than passing them through args, so I must have forgotten to remove the quotes from args.kenlm before posting here. That part is correct in the actual code.

OK, could you first confirm that the binaries in the kenlm build dir, such as query, work fine with your LM?

Hi @tlikhomanenko, querying is working. Here is an image for reference:
[image]

Could you send an exact reproduction of your env setup (or a docker image, if it is reproducible there)? Several people have reported this problem, but I cannot repro it; it is possibly related to the python/conda env setup.

I am sharing the line-by-line steps here (machine used: Ubuntu 16/18):

  1. Install the dependencies for kenlm, openblas, glog, etc.
    sudo apt-get install liblzma-dev libbz2-dev libzstd-dev libsndfile1-dev libopenblas-dev libfftw3-dev libgflags-dev libgoogle-glog-dev

  2. Installing Kenlm

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DKENLM_MAX_ORDER=20 -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j16
export KENLM_ROOT_DIR=<Kenlm root dir path>
  3. As I am using wav2letter in fairseq, I build them together.
conda create -n exp1 python=3.7
pip install packaging
pip install soundfile
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
cd ..
git clone https://github.com/facebookresearch/wav2letter.git -b v0.2
cd wav2letter/bindings/python
pip install -e .

I have already tried this with different torch versions (1.4, 1.5, 1.6, 1.7), but the error remains the same.

If you want to quickly reproduce the problem, you can install wav2letter (python bindings) on Colab and use the files I have provided in Files.zip along with the code. You will see that the kernel dies when the code enters the loop that gets the LM scores.
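Because the crash kills the notebook kernel, it can help to run each suspect step in a child process and inspect how that process ended. The sketch below uses only the standard library; run_isolated is a hypothetical helper name, and the snippets passed to it are stand-ins for the real decoder setup steps.

```python
import signal
import subprocess
import sys


def run_isolated(snippet: str) -> str:
    """Run a Python snippet in a child process and classify how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by
        # the signal with that number (SIGSEGV for a segmentation fault).
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"failed with exit code {proc.returncode}"


# Stand-in snippet; replace it with the real steps one at a time.
# Assumption: with the v0.2 bindings the import would look something like
# run_isolated("from wav2letter.decoder import KenLM, Trie")
print(run_isolated("print('placeholder step')"))
```

Running the import, the KenLM load, the Trie construction and the scoring loop as separate snippets shows which step is the first to die with SIGSEGV, while the parent kernel keeps running.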

Which conda version do you have? Then will try to repro your steps with the same conda

Conda version 4.8.3

Hi @tlikhomanenko, were you able to reproduce it? Is there any quick solution?

Hi, I'm getting segfaults in the same fragment of code while trying to use a KenLM decoder for wav2vec 2.0. @amant555, were you able to figure out what the issue is? I've used exactly the same setup commands for fairseq and wav2letter as you. I'm finding that the segfault occurs at the line self.trie.insert(spelling_idxs, word_idx, score). self.trie is a python binding for Trie.cpp, located at ~/install_dir/wav2letter/src/libraries/decoder/Trie.cpp (see issue https://github.com/facebookresearch/wav2letter/issues/462).

Sometimes I get a 'corrupted double-linked list' error instead of the segmentation fault, and sometimes a malloc() error. @tlikhomanenko, does this help diagnose the issue?

@tlikhomanenko Did you get a chance to solve this issue?

Hey it got solved, closing this issue.

@amant555, could you post how you solved it?

So sorry for the delay :( I haven't had time to repro; I was too busy with the cmake/installation/code refactor with others on the team. Hopefully the new version will be much simpler and more straightforward to install.

@amant555 How were you able to fix this issue? I'm still running into this problem and am unable to run the kenlm decoder.

Hi @tlikhomanenko @pkadambi,
Sorry for the late reply.
I was able to solve the issue by rebuilding kenlm without any of the flags (-DCMAKE_BUILD_TYPE=Release -DKENLM_MAX_ORDER=20 -DCMAKE_POSITION_INDEPENDENT_CODE=ON) and then rebuilding the wav2letter bindings.

This solved the segmentation fault.
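The rebuild described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact procedure used in this thread: the directory paths are hypothetical, and removing the old build directory is an added step, included because CMake keeps previously passed -D values in its cache, so rerunning cmake in the same directory would otherwise retain them. The commands themselves are the ones from the setup steps earlier in the thread, minus the flags.

```python
import subprocess
from pathlib import Path


def rebuild_plan(kenlm_root: Path, bindings_dir: Path):
    """Return (working_directory, command) pairs for a flag-free rebuild."""
    build_dir = kenlm_root / "build"
    return [
        # Assumption: start from a clean build dir to drop cached -D values.
        (kenlm_root, ["rm", "-rf", "build"]),
        (kenlm_root, ["mkdir", "-p", "build"]),
        (build_dir, ["cmake", ".."]),  # no -D flags this time
        (build_dir, ["make", "-j16"]),
        # Reinstall the wav2letter python bindings against the new kenlm build.
        (bindings_dir, ["pip", "install", "-e", "."]),
    ]


def run_plan(plan, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cwd, cmd in plan:
        print(f"[{cwd}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


# Hypothetical locations; adjust them to the actual checkouts.
plan = rebuild_plan(Path("kenlm"), Path("wav2letter/bindings/python"))
run_plan(plan)  # dry run: only prints the commands
```

The dry run is the default so that the destructive rm step is never executed by accident. KENLM_ROOT_DIR still needs to be exported in the environment, and the target conda env activated, before running with dry_run=False.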
