Does DeepSpeech have any kind of unsupervised pre-training, similar to wav2vec?
I'm working with thousands of hours of dictation -- it can be a bit noisy and the pronunciation, a little... hmm.
Hi @caseybasichis, I have a student who has been working on integrating wav2vec into DeepSpeech. I'll ask him to comment here.
I'm the student @ftyers mentioned. We have tried training DeepSpeech on the HDF5 embeddings produced by wav2vec. Currently the implementation is a customized Docker image (on Docker Hub) and a monkey-patch to DeepSpeech (soon I'll tidy it up and rework it on top of a fork of DeepSpeech instead). The code and instructions are available in the repository Contextualist/DeepSpeech-build.
Summary of what the monkey-patch does:
Input is taken from .h5context files (read as HDF5, extracting the field 'features').

We've tried this on 136 hours of speech for pre-training plus 0.5 hours of speech for training, and it yields results comparable to transfer learning from DeepSpeech's released English model. We got a CER of 0.465.
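For anyone reading along who hasn't worked with the metric: CER (character error rate) is usually defined as the character-level edit distance between the model's output and the reference transcript, divided by the reference length. Here is a minimal sketch under that standard definition. The function names are illustrative only and are not taken from DeepSpeech or the monkey-patch, and the exact normalization used to get the number above may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference chars into an empty hypothesis costs i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("hello world", "hallo world"))
```

Under this definition, a CER of 0.465 would mean that the number of character edits needed to fix the output is roughly 46.5% of the number of characters in the reference.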
Thanks for that summary @Contextualist !
While this looks interesting, GitHub issues are meant for discussing issues. Please continue this discussion on Discourse.