DeepSpeech: Unsupervised pre-training

Created on 7 Sep 2020 · 4 Comments · Source: mozilla/DeepSpeech

Does DeepSpeech have any kind of unsupervised pre-training, similar to wav2vec?

I'm working with thousands of hours of dictation -- it can be a bit noisy and the pronunciation, a little... hmm.

invalid

All 4 comments

Hi @caseybasichis, I have a student who has been working on integrating wav2vec into DeepSpeech. I'll ask him to comment here.

I'm @ftyers's student. We have tried training DeepSpeech on the HDF5 embeddings produced by wav2vec. The current implementation is a customized Docker image (on Docker Hub) plus a monkey-patch to DeepSpeech; I'll soon tidy it up and apply the changes on top of a fork of DeepSpeech instead. The code and instructions are available in the repository Contextualist/DeepSpeech-build.

Summary of what the monkey-patch does:

  • Add a loader for .h5context files (read as HDF5 and extract the field 'features').
  • Disable audio transcoding and augmentation.
  • Disable MFCC featurization.
  • Set the input dimension to 512 instead of 26 (the value used for MFCC).
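
To make the first and last items concrete, here is a rough sketch of what such a loader could look like. It is not the code from Contextualist/DeepSpeech-build: `load_h5context` is a hypothetical helper, and it assumes that the 'features' field holds either a flat array or a 2-D array whose total size is a multiple of 512. It needs the third-party packages h5py and NumPy.

```python
import h5py
import numpy as np

FEATURE_DIM = 512  # input dimension named in the summary above (MFCC would be 26)


def load_h5context(path, feature_dim=FEATURE_DIM):
    """Hypothetical helper: read the 'features' field of an .h5context file.

    Assumption: the stored array is flat or 2-D and its size is a multiple
    of feature_dim, so it can be viewed as (frames, feature_dim).
    """
    with h5py.File(path, "r") as f:
        # [()] reads the whole dataset into memory as a NumPy array.
        features = np.asarray(f["features"][()], dtype=np.float32)
    if features.size % feature_dim != 0:
        raise ValueError(
            f"{path}: {features.size} values cannot be split into "
            f"frames of {feature_dim}"
        )
    return features.reshape(-1, feature_dim)
```

The returned array would take the place of the MFCC matrix, which is why featurization, transcoding and augmentation have to be switched off: they all operate on raw audio, and there is none left at this point.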

We've tried this with 136 hrs of speech for pre-training plus 0.5 hr of speech for training, and the results are comparable to those of transfer learning from DeepSpeech's released English model. We got a CER of 0.465.
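
For readers unfamiliar with the metric: CER (character error rate) is the character-level edit distance between the transcript and the reference, divided by the reference length, so 0.465 corresponds to roughly 46.5 edits per 100 reference characters. A minimal sketch follows, assuming corpus-level aggregation (total edits over total reference characters); the function names are illustrative, and the evaluation that produced the number above may aggregate differently.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

For example, `cer(["abcd"], ["abcf"])` is one substitution over four reference characters.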

Thanks for that summary, @Contextualist!

While this looks interesting, GitHub issues are meant for discussing issues. Please continue this discussion on Discourse.
