DeepSpeech: Switch KenLM to trie-based language model

Created on 15 Feb 2018 · 10 Comments · Source: mozilla/DeepSpeech

enhancement

All 10 comments

@kdavis-mozilla what would be the benefit of switching to trie based language model?

@dbanka Trie-based models can be compressed [1], making the entire footprint smaller. Our current language models can't be compressed.

@dbanka Concretely, our current language model is 1.5 GB, and we've made a trie-based model which basically reproduces its quality and is 66 MB.
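
One general reason a trie layout saves space is that n-grams sharing a prefix store that prefix only once. The toy sketch below illustrates just that idea with nested dicts. It is not KenLM's actual on-disk layout, and `build_trie` and `count_nodes` are hypothetical helper names made up for this illustration.

```python
from typing import Iterable, Tuple


def build_trie(ngrams: Iterable[Tuple[str, ...]]) -> dict:
    """Insert each n-gram word by word so shared prefixes reuse nodes."""
    root: dict = {}
    for ngram in ngrams:
        node = root
        for word in ngram:
            # setdefault returns the existing child if the prefix was seen before.
            node = node.setdefault(word, {})
    return root


def count_nodes(node: dict) -> int:
    """Count every word node below the given node (the node itself excluded)."""
    return sum(1 + count_nodes(child) for child in node.values())


# Toy data (an assumption for illustration, not taken from the thread).
ngrams = [
    ("the", "cat", "sat"),
    ("the", "cat", "ran"),
    ("the", "dog", "sat"),
    ("the", "dog", "ran"),
]
trie = build_trie(ngrams)
flat_slots = sum(len(ngram) for ngram in ngrams)  # words stored in a flat table
trie_nodes = count_nodes(trie)                    # words stored in the trie
print(flat_slots, trie_nodes)  # 12 7
```

Real language models additionally store probabilities and backoff weights per n-gram, which is where options such as quantization come in. The sketch ignores those entirely.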

Reopening since we reverted the fixes.

The code snippet below builds a pruned, quantized 5-gram language model that is significantly better than the "quick-fix" language model.

The corpus used is described in Section 4, "Language Models", of the original LibriSpeech paper.

With little to no optimisation or hyper-parameter tuning we get a dev-clean WER of ~6.4 on a version of our internal implementation of DS1. You can adjust the order, pruning level, and quantization level to suit your needs :-) (e.g. a 3-gram 8-bit binary trie is <1GB and has a dev-clean WER of ~6.5)

Note that this code was written in a Jupyter notebook and uses the lmplz and build_binary commands from the kenlm library. The resulting language model is ~1.7GB. I've sent this to @kdavis-mozilla via IRC already.

import gzip
import os

from urllib import request

# Grab corpus.
url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
data_upper = '/tmp/upper.txt.gz'
request.urlretrieve(url, data_upper)

# Convert to lowercase, then remove the downloaded archive.
data_lower = '/tmp/lower.txt'
with open(data_lower, 'w', encoding='utf-8') as lower:
    # Mode 'rt' makes gzip.open decode the compressed stream as text.
    with gzip.open(data_upper, 'rt', encoding='utf-8') as upper:
        for line in upper:
            lower.write(line.lower())
os.remove(data_upper)

# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

# Quantize and produce trie binary.
binary_path = '/tmp/lm.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path} 
os.remove(lm_path)

Example output:

=== 1/5 Counting and sorting n-grams ===
Reading /tmp/lower.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 803288729 types 973676
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684112 2:3126698496 3:5862559744 4:9380094976 5:13679306752
Statistics:
1 973676 D1=0.647192 D2=1.04159 D3+=1.3919
2 41161096 D1=0.723617 D2=1.06317 D3+=1.36127
3 207278547 D1=0.804357 D2=1.09256 D3+=1.31993
4 60615302/438095063 D1=0.876863 D2=1.15052 D3+=1.32047
5 42225053/587120377 D1=0.914203 D2=1.27108 D3+=1.35262
Memory estimate for binary LM:
type      MB
probing 7822 assuming -p 1.5
probing 9594 assuming -r models -p 1.5
trie    4304 without quantization
trie    2457 assuming -q 8 -b 8 quantization 
trie    3556 assuming -a 22 array pointer compression
trie    1708 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*******#############################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz  VmPeak:31701288 kB  VmRSS:32144 kB  RSSMax:27359364 kB  user:1187.72    sys:465.288 CPU:1653.02 real:2043.65

Reading /tmp/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
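
The comment above mentions that the order, pruning level, and quantization level can be adjusted. Outside a notebook, one way to vary them is to build the same two command lines from parameters, as in the sketch below. The function names and defaults are hypothetical, the flags are the ones already used in the snippet above, and the pruning thresholds shown for the 3-gram case are an assumption, since the thread does not say which ones were used.

```python
from typing import List, Sequence


def lmplz_command(text_path: str, arpa_path: str, order: int = 5,
                  prune: Sequence[int] = (0, 0, 0, 1),
                  temp_prefix: str = "/tmp/", memory: str = "50%") -> List[str]:
    """Build the lmplz argument list for a given order and pruning setup."""
    # Sanity check added for this sketch: no more thresholds than orders.
    if len(prune) > order:
        raise ValueError("more pruning thresholds than n-gram orders")
    return ["lmplz",
            "--order", str(order),
            "--temp_prefix", temp_prefix,
            "--memory", memory,
            "--text", text_path,
            "--arpa", arpa_path,
            "--prune", *[str(threshold) for threshold in prune]]


def build_binary_command(arpa_path: str, binary_path: str,
                         quant_bits: int = 8,
                         pointer_bits: int = 255) -> List[str]:
    """Build the build_binary argument list for a quantized trie."""
    return ["build_binary",
            "-a", str(pointer_bits),
            "-q", str(quant_bits),
            "trie", arpa_path, binary_path]


# A 3-gram, 8-bit variant; prune=(0, 0, 1) is an assumed setting.
lm_cmd = lmplz_command("/tmp/lower.txt", "/tmp/lm.arpa", order=3, prune=(0, 0, 1))
bin_cmd = build_binary_command("/tmp/lm.arpa", "/tmp/lm.binary", quant_bits=8)
print(" ".join(lm_cmd))
print(" ".join(bin_cmd))
```

With KenLM installed, each list could be passed to `subprocess.run(cmd, check=True)`. The sketch only prints the commands, so it runs without the binaries being present.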

Are there going to be tools to extend the new language model with custom corpus data or individual phrases?

@pvanickova You'll be able to use all the features of KenLM to extend the language model.

@pvanickova You can do that, following data/lm/README.md and augmenting with your own data.

@lissyx perfect, thanks - so basically rebuilding the language model from scratch using the librivox corpus + my own corpus
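
A minimal sketch of that rebuild-from-scratch approach: concatenate the base corpus and a custom corpus into a single lowercased text file, then point `lmplz --text` at the merged file. The `merge_corpora` helper and the file names are hypothetical. Lowercasing mirrors the snippet earlier in the thread, while skipping blank lines is an extra assumption of this sketch.

```python
import os
import tempfile
from typing import Iterable


def merge_corpora(paths: Iterable[str], merged_path: str) -> int:
    """Concatenate text corpora into one lowercased file; return lines written."""
    lines_written = 0
    with open(merged_path, "w", encoding="utf-8") as merged:
        for path in paths:
            with open(path, encoding="utf-8") as corpus:
                for line in corpus:
                    text = line.strip().lower()
                    if text:  # skip blank lines (an assumption of this sketch)
                        merged.write(text + "\n")
                        lines_written += 1
    return lines_written


# Tiny stand-in files so the sketch runs without the real corpora.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "base.txt")
    custom = os.path.join(tmp, "custom.txt")
    merged = os.path.join(tmp, "merged.txt")
    with open(base, "w", encoding="utf-8") as f:
        f.write("A BASE SENTENCE\n")
    with open(custom, "w", encoding="utf-8") as f:
        f.write("My Custom Phrase\n\n")
    count = merge_corpora([base, custom], merged)
    with open(merged, encoding="utf-8") as f:
        merged_text = f.read()

print(count)
```

In practice, `base` would be the lowercased LibriSpeech text (`/tmp/lower.txt` in the snippet above) and `custom` your own data, with the merged file then used as the `--text` input.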

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
