DeepSpeech: Unsupervised pre-training

Created on 7 Sep 2020 · 4 comments · Source: mozilla/DeepSpeech

Does DeepSpeech have any kind of unsupervised pre-training, similar to wav2vec?

I'm working with thousands of hours of dictation -- it can be a bit noisy and the pronunciation, a little... hmm.

invalid

caseybasichis

All 4 comments

Hi @caseybasichis I have a student who has been working on integrating wav2vec into DeepSpeech. I'll ask him to comment here.

ftyers on 7 Sep 2020

🎉3

I'm @ftyers's student. We have tried training DeepSpeech on the HDF5 embeddings produced by wav2vec. The implementation is currently a customized Docker image (on Docker Hub) plus a monkey-patch to DeepSpeech; I'll soon tidy it up and move the changes into a fork of DeepSpeech instead. The code and instructions are available in the repository Contextualist/DeepSpeech-build.

Summary of what the monkey-patch does:

Add a loader for .h5context files (read as HDF5, extracting the field 'features').
Disable audio transcoding and augmentation.
Disable MFCC featurization.
Set the input dimension to 512 instead of 26 (the value used for MFCC).
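
The loader step in the list above might look roughly like the sketch below. This is not the code from Contextualist/DeepSpeech-build: the names `load_h5context` and `N_INPUT` are made up for illustration, and it assumes the 'features' dataset is either already shaped (frames, 512) or a flat array stored frame by frame. It needs the h5py and numpy packages.

```python
import h5py
import numpy as np

N_INPUT = 512  # feature dimension from the summary above (26 is the MFCC value)


def load_h5context(path, n_input=N_INPUT):
    """Return the 'features' field of an HDF5 file as a (frames, n_input) array."""
    with h5py.File(path, "r") as f:
        features = f["features"][()]  # read the whole dataset into memory
    features = np.asarray(features, dtype=np.float32)

    if features.ndim == 1:
        # Assumption: a flat array holds consecutive frames of n_input values.
        if features.size % n_input != 0:
            raise ValueError(f"{path}: {features.size} values is not a multiple of {n_input}")
        features = features.reshape(-1, n_input)

    if features.ndim != 2 or features.shape[1] != n_input:
        raise ValueError(f"{path}: unexpected feature shape {features.shape}")
    return features
```

Checking the shape at load time means a file with the wrong layout fails early, before it reaches a model whose input layer expects 512 values per frame.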

We've tried this with 136 hours of speech for pre-training plus 0.5 hours of speech for training, and the result is comparable to what we get from transfer learning (using DeepSpeech's released English model). We got a CER of 0.465.
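
For readers unfamiliar with the metric: CER (character error rate) is commonly computed as the character-level edit distance between the reference transcript and the model output, divided by the reference length, so 0.465 is roughly 46.5 character edits per 100 reference characters. The sketch below shows that standard definition for a single utterance. The function names are illustrative, and this is not necessarily the exact evaluation code used for the number above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

A score over a whole test set is usually the total edit distance divided by the total number of reference characters, not the mean of per-utterance values.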

Contextualist on 8 Sep 2020

❤3

Thanks for that summary, @Contextualist!

ftyers on 8 Sep 2020

While this looks interesting, GitHub issues are meant for reporting and discussing issues. Please continue this discussion on Discourse.

lissyx on 8 Sep 2020
