DeepSpeech: Language model experiments

Created on 13 Nov 2019 · 20 comments · Source: mozilla/DeepSpeech

reuben

Most helpful comment

WER values for different combinations of AM and LM sizes:

| LM➡️ / ⬇️AM | 957.9MB | 110.8MB | 50.9MB | 6.5MB | 2.7MB* | no LM |
|-|-|-|-|-|-|-|
| .pb 189MB | 0.074233 | 0.086009 | 0.086305 | 0.108474 | 0.124479 | 0.147201 |
| .tflite 47MB | 0.077583 | 0.089014 | 0.089984 | 0.112732 | 0.130326 | 0.154728 |

* Trie only.
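
To read the table as a trade-off, the relative WER increase against the largest LM can be computed per acoustic model. This is a minimal sketch using only the numbers above; the dictionary layout and the `relative_increase` helper are illustrative names, not part of DeepSpeech.

```python
# WER values copied from the table above, keyed by acoustic model and LM size.
wer = {
    ".pb 189MB": {
        "957.9MB": 0.074233, "110.8MB": 0.086009, "50.9MB": 0.086305,
        "6.5MB": 0.108474, "2.7MB (trie only)": 0.124479, "no LM": 0.147201,
    },
    ".tflite 47MB": {
        "957.9MB": 0.077583, "110.8MB": 0.089014, "50.9MB": 0.089984,
        "6.5MB": 0.112732, "2.7MB (trie only)": 0.130326, "no LM": 0.154728,
    },
}


def relative_increase(am, lm, baseline="957.9MB"):
    """Fractional WER increase of (am, lm) over the same AM with the baseline LM."""
    base = wer[am][baseline]
    return wer[am][lm] / base - 1.0


for am, row in wer.items():
    for lm in row:
        print(f"{am:>13} + {lm:<18} WER {row[lm]:.4f}  "
              f"({relative_increase(am, lm):+.1%} vs 957.9MB LM)")
```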

reuben on 13 Jan 2020

👍4

All 20 comments

Is this simply taking the individual rare word cases out of the text or does it remove the sentence featuring that rare word?

I'm thinking the former would result in ungrammatical sentences, having a slight negative impact on the LM.

Purely on a hunch, it also seems plausible that such sentences may be less modern or largely gibberish (as @rhamnett mentions here: https://github.com/mozilla/DeepSpeech/pull/2528#issuecomment-553500642).

nmstoker on 21 Nov 2019

It removes all n-grams containing OOV words
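
As a toy illustration of that rule (not KenLM's actual implementation, which operates on ARPA files), dropping every n-gram that contains an out-of-vocabulary word could look like this. The `drop_oov_ngrams` name and the sample data are made up for the example.

```python
def drop_oov_ngrams(ngrams, vocab):
    """Keep only the n-grams in which every token belongs to vocab."""
    return [ng for ng in ngrams if all(tok in vocab for tok in ng)]


# Hypothetical vocabulary and n-grams, for illustration only.
vocab = {"the", "cat", "sat"}
ngrams = [("the", "cat"), ("cat", "sat"), ("sat", "zyzzyva"), ("the",)]

kept = drop_oov_ngrams(ngrams, vocab)
print(kept)
```

The sentence that contained the rare word is not removed; only the n-grams that include the word are.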

reuben on 21 Nov 2019

👍1

https://kheafield.com/code/kenlm/filter/

reuben on 21 Nov 2019

@reuben Is all you wanted to do here pruning or is there more?

kdavis-mozilla on 10 Jan 2020

There's more, I want to figure out a way to combine alternative sources of data (e.g. OpenWebText, OSCAR) with the LibriSpeech LM data so that we can improve accuracy without significantly impacting LibriSpeech-test-clean WER.

reuben on 10 Jan 2020

@reuben this table is quite exciting because while the 950MB LM is out of the question for a mobile app, 50MB is definitely reasonable and the WER doesn't suffer that much. This truly unlocks the mobile use case for DeepSpeech! Can you explain how you got the LM down to 50.9MB and 6.5MB - is it just KenLM pruning and quantization?

Are the LMs you were using the same as the ones hosted on OpenSLR? Theirs are 759MB for 3-gram ARPA LM unpruned, down to 13MB for 3-gram ARPA LM pruned with 3e-7.

adhsu on 14 Jan 2020

> @reuben this table is quite exciting because while the 950MB LM is out of the question for a mobile app, 50MB is definitely reasonable and the WER doesn't suffer _that_ much. This truly unlocks the mobile use case for DeepSpeech! Can you explain how you got the LM down to 50.9MB and 6.5MB - is it just KenLM pruning and quantization?
>
> Are the LMs you were using the same as the ones hosted on OpenSLR? Theirs are 759MB for 3-gram ARPA LM unpruned, down to 13MB for 3-gram ARPA LM pruned with 3e-7.

Can you explain how this unlocks the use case? Apart from disk usage, the current LM already gives a very good experience on mobile.

lissyx on 14 Jan 2020

@lissyx Oh, just that a 950MB LM is way too big to practically distribute in a mobile app, so it's not really usable in production as is. 100MB total for tflite+LM is doable though.
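
A quick way to check which combinations fit such a budget is to add the sizes from the table headers and pick the lowest WER that fits. This is a rough sketch: it takes the listed sizes at face value, ignores the rest of the app, and `best_under_budget` is a hypothetical helper name.

```python
# Sizes (MB) and WER values as listed in the table earlier in the thread.
AM_SIZE_MB = {".pb": 189.0, ".tflite": 47.0}
LM_SIZE_MB = {
    "957.9MB": 957.9, "110.8MB": 110.8, "50.9MB": 50.9,
    "6.5MB": 6.5, "2.7MB (trie only)": 2.7, "no LM": 0.0,
}
# WER per acoustic model, in the same order as LM_SIZE_MB.
WER = {
    ".pb": [0.074233, 0.086009, 0.086305, 0.108474, 0.124479, 0.147201],
    ".tflite": [0.077583, 0.089014, 0.089984, 0.112732, 0.130326, 0.154728],
}


def best_under_budget(budget_mb):
    """Return (wer, am, lm, total_mb) with the lowest WER that fits, or None."""
    best = None
    for am, am_mb in AM_SIZE_MB.items():
        for (lm, lm_mb), wer in zip(LM_SIZE_MB.items(), WER[am]):
            total = am_mb + lm_mb
            if total <= budget_mb and (best is None or wer < best[0]):
                best = (wer, am, lm, total)
    return best


print(best_under_budget(100.0))
```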

adhsu on 14 Jan 2020

> @lissyx Oh, just that a 950MB LM is way too big to practically distribute in a mobile app, so it's not really usable in production as is. 100MB total for tflite+LM is doable though.

I understand it is big, and it can be a problem in some (many?) cases, but it's not really blocking in general.

What's your use case, if you can share?

lissyx on 14 Jan 2020

Sorry, I didn't mean to imply that the current system doesn't work or anything, just that I feel like getting the total file size down to <100MB crosses a threshold where many, many more app developers would be willing to use it and include it in a production app distribution.

My use case is that I'm working on a mobile app for practicing English speaking. The user reads a sentence and we use ASR to recognize that they said it aloud. I'm excited to try to get the tflite model and small LM working for offline ASR.

Do you have any information on how those smaller LMs for the test were generated?

adhsu on 14 Jan 2020

> Do you have any information on how those smaller LMs for the test were generated?

I guess reuben changed the content of the vocabulary, maybe through filtering?

> just that I feel like getting total filesize down to <100MB crosses a threshold where many many more app developers would be willing to use it and include in a production app distribution.

What precisely is blocking you at the moment? Just the size in general, or does the size make it impossible for you to distribute? If the latter, how do you handle distribution?

lissyx on 14 Jan 2020

Distribution is through the Play Store and App Store. Most users just won't be willing to download a 1GB app and it's generally seen as pretty bad to have a huge app. That's mostly what I'm referring to.

There are also some store limitations - on Android it's technically possible but tricky to have your bundle be over 100MB.

adhsu on 14 Jan 2020

> Are the LMs you were using the same as the ones hosted on OpenSLR? Theirs are 759MB for 3-gram ARPA LM unpruned, down to 13MB for 3-gram ARPA LM pruned with 3e-7.

No. They were built from that dataset, though. I built lower-order (o=2, o=1) models and pruned more aggressively (--prune 0 1 on the -o 2 models). I then filtered the models to the top 100k words in the LibriSpeech LM corpus, following a procedure similar to the one documented in data/lm/generate_lm.py.
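
The steps described above can be sketched roughly as follows. This is not the script used for the table: file names are placeholders, lowercasing the corpus is an assumption, and the code only builds the KenLM command lines (`lmplz`, `filter`, `build_binary`) without running them, so the binaries are assumed to be on the PATH when you do.

```python
from collections import Counter


def top_k_words(lines, k):
    """Return the k most frequent whitespace-separated tokens (lowercased)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(k)]


def kenlm_commands(order, prune, text, arpa, filtered, binary):
    """Build argv lists for the three KenLM steps; nothing is executed here."""
    lmplz = ["lmplz", "--order", str(order), "--text", text,
             "--arpa", arpa, "--prune", *[str(p) for p in prune]]
    # `filter` reads the vocabulary (space-separated words) from stdin.
    filt = ["filter", "single", f"model:{arpa}", filtered]
    build = ["build_binary", "trie", filtered, binary]
    return lmplz, filt, build


# Tiny stand-in corpus; the real input would be the LibriSpeech LM text.
sample = ["the cat sat", "the cat", "the"]
print(top_k_words(sample, 2))

# Placeholder file names; order 2 with "--prune 0 1" as mentioned above.
for cmd in kenlm_commands(2, (0, 1), "corpus.txt", "lm.arpa",
                          "lm_filtered.arpa", "lm.binary"):
    print(" ".join(cmd))
```

To run the steps for real, pass each list to `subprocess.run`, feeding the top-100k vocabulary to the `filter` step on standard input.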

reuben on 14 Jan 2020

> Distribution is through the Play Store and App Store. Most users just won't be willing to download a 1GB app and it's generally seen as pretty bad to have a huge app. That's mostly what I'm referring to.
>
> There are also some store limitations - on Android it's technically possible but tricky to have your bundle be over 100MB.

OK, that's mostly what I was expecting. While I agree that 1GB can be problematic (not everybody has LTE with an unlimited plan or fiber-backed WiFi), it's still doable, if you want to experiment, to host the data elsewhere. This is what we do in some cases, though it brings other issues.

At some point, it gets really complicated to solve everything and keep it all inside the APK (or equivalent): if you want or need to support several languages, then either your app size explodes or you host the models elsewhere.

As far as I recall, Android's way of splitting was also an issue because we would be unable to properly mmap() the file, but that's from memory; I'm not sure it is really the case.

lissyx on 14 Jan 2020

@reuben Are the smaller LMs available anywhere? I couldn't find any downloads and I can't find any instructions for generating them. Thanks!

9define on 10 Feb 2020

Did you miss this comment? https://github.com/mozilla/DeepSpeech/issues/2529#issuecomment-574314842

reuben on 10 Feb 2020

😄1

Yes, I must have. Would it be possible to publish the LMs of those other sizes, as training seems to take quite some time? If it's possible to go from an existing LM (such as the release version) to a reduced-size LM, I would be happy to read through any docs or scripts that would allow me to do so. Sorry if the answer is obvious, but I'm not very well-versed in LMs, training, etc. Thanks!

9define on 12 Feb 2020

I haven't tried the filtering step myself but the rest of the KenLM steps required to create an LM are actually remarkably quick, which is why I was rather surprised when you said the "training seems to take quite some time".

I'd suggest you have a go at creating an LM, as the instructions are written up clearly in the relevant directory in this repo. Then look at the comment @reuben linked to and you should be able to figure out how to adapt the settings. KenLM itself is also fairly well documented online. Not only will you learn a little from doing it yourself, but you'll then be able to make an LM that's specifically suited to your needs.

nmstoker on 12 Feb 2020