@kdavis-mozilla what would be the benefit of switching to a trie-based language model?
@dbanka Trie-based models can be compressed [1], making the entire footprint smaller. Our current language models can't be compressed.
@dbanka Concretely, our current language model is 1.5 GB, and we've made a trie-based model which basically reproduces its quality and is 66 MB.
Reopening since we reverted the fixes.
The code snippet below builds a pruned, quantized 5-gram language model that is significantly better than the "quick-fix" language model.
The corpus used is described in Section 4, "Language Models", of the original LibriSpeech paper.
With little to no optimisation or hyper-parameter tuning we get a dev-clean WER of ~6.4 on a version of our internal implementation of DS1. You can adjust the order, pruning level, and quantization level to suit your needs :-) (e.g. a 3-gram 8-bit binary trie is <1GB and has a dev-clean WER of ~6.5)
Note that this code was written in a Jupyter notebook and uses the lmplz and build_binary commands from the kenlm library. The resulting language model is ~1.7GB. I've sent this to @kdavis-mozilla via IRC already.
import gzip
import io
import os
from urllib import request
# Grab corpus.
url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
data_upper = '/tmp/upper.txt.gz'
request.urlretrieve(url, data_upper)
# Convert to lowercase and cleanup.
data_lower = '/tmp/lower.txt'
with open(data_lower, 'w', encoding='utf-8') as lower:
    with io.TextIOWrapper(io.BufferedReader(gzip.open(data_upper)), encoding='utf8') as upper:
        for line in upper:
            lower.write(line.lower())
os.remove(data_upper)
# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1
# Quantize and produce trie binary.
binary_path = '/tmp/lm.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path}
os.remove(lm_path)
Example output:
=== 1/5 Counting and sorting n-grams ===
Reading /tmp/lower.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 803288729 types 973676
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684112 2:3126698496 3:5862559744 4:9380094976 5:13679306752
Statistics:
1 973676 D1=0.647192 D2=1.04159 D3+=1.3919
2 41161096 D1=0.723617 D2=1.06317 D3+=1.36127
3 207278547 D1=0.804357 D2=1.09256 D3+=1.31993
4 60615302/438095063 D1=0.876863 D2=1.15052 D3+=1.32047
5 42225053/587120377 D1=0.914203 D2=1.27108 D3+=1.35262
Memory estimate for binary LM:
type MB
probing 7822 assuming -p 1.5
probing 9594 assuming -r models -p 1.5
trie 4304 without quantization
trie 2457 assuming -q 8 -b 8 quantization
trie 3556 assuming -a 22 array pointer compression
trie 1708 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*******#############################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:31701288 kB VmRSS:32144 kB RSSMax:27359364 kB user:1187.72 sys:465.288 CPU:1653.02 real:2043.65
Reading /tmp/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
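The notebook above hard-codes the order, pruning thresholds, and quantization bits. As a rough sketch of how those knobs could be varied outside a notebook, the helpers below build the same lmplz and build_binary argument lists from parameters. The function names (lmplz_command, build_binary_command) are hypothetical, the flags are only the ones already used in the notebook, and the 3-gram pruning setting (0, 0, 1) is an illustrative assumption, not the configuration behind the ~6.5 WER figure quoted earlier.

```python
import shlex


def lmplz_command(text_path, arpa_path, order=5, prune=(0, 0, 0, 1),
                  memory="50%", temp_prefix="/tmp/"):
    """Build the lmplz argument list (hypothetical helper, same flags as the notebook)."""
    if len(prune) > order:
        raise ValueError("give at most one pruning threshold per n-gram order")
    # Assumption based on the lmplz help text: thresholds must be non-decreasing,
    # and the last value applies to any remaining higher orders.
    if list(prune) != sorted(prune):
        raise ValueError("pruning thresholds must be non-decreasing")
    return [
        "lmplz",
        "--order", str(order),
        "--temp_prefix", temp_prefix,
        "--memory", memory,
        "--text", text_path,
        "--arpa", arpa_path,
        "--prune", *[str(p) for p in prune],
    ]


def build_binary_command(arpa_path, binary_path, quant_bits=8, pointer_bits=255):
    """Build the build_binary argument list for a quantized trie (hypothetical helper)."""
    return [
        "build_binary",
        "-a", str(pointer_bits),   # upper bound on array pointer compression bits
        "-q", str(quant_bits),     # quantization bits for probabilities
        "trie",
        arpa_path,
        binary_path,
    ]


# Illustrative 3-gram, 8-bit variant; only prints the commands, runs nothing.
for cmd in (
    lmplz_command("/tmp/lower.txt", "/tmp/lm.arpa", order=3, prune=(0, 0, 1)),
    build_binary_command("/tmp/lm.arpa", "/tmp/lm.binary", quant_bits=8),
):
    print(shlex.join(cmd))
```

To actually run a command, each list can be passed to subprocess.run(cmd, check=True), provided the KenLM binaries are on the PATH.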
Are there going to be tools to extend the new language model with custom corpus data or individual phrases?
@pvanickova You'll be able to use all the features of KenLM to extend the language model.
@pvanickova You can do that, following data/lm/README.md and augmenting with your own data.
@lissyx perfect, thanks - so basically rebuilding the language model from scratch using the librivox corpus + my own corpus
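A minimal sketch of that "base corpus + my own corpus" step, assuming the custom data is plain or gzip-compressed UTF-8 text with one sentence per line. The merge_corpora and open_text helpers and the file paths are hypothetical and are not part of data/lm/README.md. Beyond the lowercasing done in the notebook above, this sketch also collapses runs of whitespace and drops empty lines.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file as UTF-8 text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def merge_corpora(paths, out_path):
    """Write lowercased, non-empty lines from every input file to out_path.

    Returns the number of lines written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open_text(path) as src:
                for line in src:
                    # Lowercase and normalize whitespace; skip blank lines.
                    text = " ".join(line.lower().split())
                    if text:
                        out.write(text + "\n")
                        written += 1
    return written
```

The merged file can then be passed to lmplz via --text in place of /tmp/lower.txt, for example merge_corpora(['/tmp/upper.txt.gz', 'my_corpus.txt'], '/tmp/lower.txt'), where my_corpus.txt is a placeholder name.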