DeepSpeech: WER Results

Created on 8 Jun 2017 · 29 Comments · Source: mozilla/DeepSpeech

I'm trying to get a sense of what WER values folks are seeing after training (such as @dahlem in #569). I was focusing on TED (but am curious about other datasets as well), and never got below 39% on the test set after around 10 epochs.

What values are people's best models producing?
Has anyone had success with any architecture/hyperparameter configurations other than the defaults?

aaronzira
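
For readers comparing the numbers in this thread: WER is usually computed as the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Below is a minimal sketch of that definition. The function name is made up for illustration, and it is an assumption that the figures quoted here were computed this way; DeepSpeech's own evaluation code may normalize text or aggregate over utterances differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper, not DeepSpeech's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)
```

For example, word_error_rate("the cat sat on the mat", "the cat sat on mat") counts one deleted word out of six reference words, i.e. about 16.7%. Note that WER can exceed 100% when the hypothesis contains many insertions.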

Most helpful comment

@sruteesh Good news: they agreed to share the model. You can download it at https://drive.google.com/file/d/0B9d9mk1oWvx2VWVuQnk2RFlvY00/view?usp=sharing The model was trained on about 1,700 hours of data with ~20 global steps. Here are some details about the model.

Dataset:
Librivox (960 hours)
TED (200 hours)
Voxforge (118 hours)
Internal datasets (465 hours)

WER result:
Librivox-clean-test: 13.0097%
Librivox-other-test: 31.4808%
Ted-test: 24.2525%
Voxforge-test: 23.2385%

gardenia22 on 1 Oct 2017

All 29 comments

Training on TED for 10 epochs isn't enough, as you've seen, to get a good WER.

First, you need to train for more epochs; we default to 50.

Second, the TED data set itself isn't really large enough to train on, so the WER is always a bit high for TED. I think we've gotten to the mid/high 20% range.

kdavis-mozilla on 9 Jun 2017

Thanks! What sorts of values for hidden layers, hidden units, dropout, and learning rate were you using to get that WER? Without significant changes to the default hyperparams, I don't see how it's possible to not overfit long before 50 epochs.

aaronzira on 10 Jun 2017

The default 2048-wide layers + 30% dropout + default learning rate. However, the language model was an older one, somewhere in the history of master.

I agree it was overfitting at 50. I'm not sure where the best epoch count was; we haven't done early stopping yet (see issue #538).

kdavis-mozilla on 13 Jun 2017
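
The early stopping mentioned above can be sketched in a few lines. This is only an illustration of the general technique, not what issue #538 ended up implementing: the function name, the `patience` and `min_delta` parameters, and the idea of monitoring a per-epoch dev-set loss are all assumptions made for the example.

```python
def best_epoch_with_early_stopping(dev_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-based.

    Training is considered stopped once the dev loss has failed to improve
    by more than `min_delta` for `patience` consecutive epochs.
    Hypothetical helper for illustration only.
    """
    if not dev_losses:
        raise ValueError("dev_losses must not be empty")

    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss - min_delta:
            # New best: remember it (a real trainer would also save a checkpoint here).
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            return best_epoch, epoch
    # Ran out of epochs without triggering the stop condition.
    return best_epoch, len(dev_losses) - 1
```

With made-up dev losses [3.0, 2.5, 2.2, 2.3, 2.4, 2.6, 2.1] and patience=3, this returns (2, 5): the best epoch is 2 and training stops at epoch 5, so the later dip at epoch 6 is never reached. A larger patience trades extra compute for robustness against such noisy curves.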

I ran 15 epochs on TED with the default hyperparameters, and it is already overfitting. Here is the learning curve of my training process. I got 37% WER on the test set.

I also noticed a drop at epoch 8 on the training set, which is confusing. Any idea why this happened? The training process was not continuous: I stopped it 4 times, and the number of GPUs differed between the resumed runs. Should I use the same number of GPUs throughout one training run?

gardenia22 on 15 Jun 2017

@gardenia22 By default I mean those in run-wer-automation.sh. Is this also what you mean?

kdavis-mozilla on 1 Jul 2017

@kdavis-mozilla No, I use the hyperparameters in run-ted.sh.

gardenia22 on 3 Jul 2017

@gardenia22 OK. Interesting.

The best WER we've had training on the TED training data set and testing on the TED test data set was 31.96%; see the old results here[1].

What we've currently found, as I guess you have too, is that the TED training data set is not large enough to get a low WER on the TED test set. So we're combining various training sets and then testing on the TED test data set. This is ongoing work.

kdavis-mozilla on 3 Jul 2017

@kdavis-mozilla Yes, you are right, I am doing the same thing. Training on the Librivox (~1000 hours) and TED (~200 hours) data sets, I have so far got 27.52% on the TED test set and 25.98% on the Librivox test set.

gardenia22 on 3 Jul 2017

@gardenia22 The best result we've had so far is 12% WER, training on Librivox's training data set and testing on Librivox's clean test data set. The language model used in this case was, as far as I remember, based on the Librivox training data set.

kdavis-mozilla on 3 Jul 2017

@kdavis-mozilla Thanks for sharing the results with me :). The best result I got on Librivox's clean test data set is 15.0096% WER. It looks like there is still much room for improvement. Could you please share the hyperparameters used in your best model, such as the number of epochs, the dropout rate, etc.?

gardenia22 on 4 Jul 2017

@gardenia22 The hyperparameter settings are reflected in run-librivox.sh.

However, the language model used in the 12% run was based off of the Librivox training data set. The current language model is trained using a combination of several (Librivox, Fisher, TED...) training data sets.

kdavis-mozilla on 4 Jul 2017

@gardenia22 Is it possible for you to share your model that got 15% WER on Librivox? I am trying out transfer learning on an Indian English dataset (800+ hours of mixed data: lectures and conversations), and I'm currently getting 35% WER when training from scratch. I was hoping that training on top of the Librivox model would improve the performance.

sruteesh on 26 Sep 2017

@sruteesh I can check with my former employer to see if it is OK to make the model public, because I did this when I interned there and used some internal datasets for training.

gardenia22 on 27 Sep 2017

@gardenia22 Great. Thank you so much.
I will try to run training on top of this model and hopefully get good results.

sruteesh on 1 Oct 2017

@gardenia22 Any idea which TensorFlow version the model was built on?
I am getting the following error while importing the model.
Traceback (most recent call last):
  File "DeepSpeech.py", line 1744, in <module>
    tf.app.run()
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "DeepSpeech.py", line 1704, in main
    train()
  File "DeepSpeech.py", line 1523, in train
    config=session_config) as session:
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 315, in MonitoredTrainingSession
    return MonitoredSession(session_creator=session_creator, hooks=all_hooks)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 601, in __init__
    session_creator, hooks, should_recover=True)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 434, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 767, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
    return self._sess_creator.create_session()
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 375, in create_session
    init_fn=self._scaffold.init_fn)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 256, in prepare_session
    config=config)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 188, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/root/deepspeech_tensorflow-1.0.1/deespeech_libri_trained/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key b3/Adam_1 not found in checkpoint
  [[Node: save_1/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2_9/tensor_names, save_1/RestoreV2_9/shape_and_slices)]]

I have faced a similar issue before, and I think it is caused by a mismatch between the TensorFlow versions used for training and for restoring the model.
Also, have you made any changes to the hyperparameters compared to those in run-librivox.sh?
Thanks

sruteesh on 3 Oct 2017

@sruteesh I am running on TensorFlow 1.1.0. The hyperparameters are as follows:
--train_batch_size 12 --dev_batch_size 12 --test_batch_size 12 --learning_rate 0.0001 --epoch 18 --display_step 18 --validation_step 1 --dropout_rate 0.30 --default_stddev 0.046875

gardenia22 on 5 Oct 2017

@sruteesh I was able to get it running on tensorflow 1.3.0 so if you're willing & able to move to a newer version, I think you might be okay then

@gardenia22 many thanks for posting this - it works for me and does reasonably well with a private data set of my own voice recordings (but my English English accent must be throwing it off somewhat I think!)

nmstoker on 8 Oct 2017

@nmstoker Thanks for sharing your results! Can you share the WER on your private dataset? I'm just curious about how well our model generalizes.

gardenia22 on 8 Oct 2017

@nmstoker Were you able to restore the model in TF 1.3.0, or did you use the output_graph.pb for inference? I'm able to use the output_graph.pb for inference, but I'm unable to restore the model even in TF 1.3.0.
I am getting the following error:
tensorflow.python.framework.errors_impl.NotFoundError: Key h3/Adam not found in checkpoint
which basically says that some of the variables weren't saved when the model was saved.

@gardenia22 Thanks for sharing the code, but it looks like it has other dependencies that I don't have, since it is quite old. I'm stuck.

sruteesh on 9 Oct 2017
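
A note on the two errors above: the missing keys (b3/Adam_1 and h3/Adam) match the names TensorFlow 1.x gives to the Adam optimizer's slot variables, which are created next to each trainable variable. One plausible reading, which nobody in this thread confirmed, is that the shared checkpoint holds the model weights but not the optimizer state, while the training graph asks the saver to restore both. The sketch below shows how one could check that hypothesis by comparing name lists. It is plain Python with hypothetical helper names and made-up variable lists; obtaining the real lists from the graph and from the checkpoint is left out.

```python
def split_restorable(graph_variable_names, checkpoint_keys):
    """Split the graph's variable names into (restorable, missing).

    `restorable` are present in the checkpoint, `missing` are not.
    Hypothetical helper for illustration only.
    """
    available = set(checkpoint_keys)
    restorable = [name for name in graph_variable_names if name in available]
    missing = [name for name in graph_variable_names if name not in available]
    return restorable, missing


def is_adam_slot(name):
    """True for names that look like Adam slot variables, e.g. 'h3/Adam' or 'b3/Adam_1'."""
    return name.rsplit("/", 1)[-1] in ("Adam", "Adam_1")


# Made-up example: a graph with two weights plus their Adam slots,
# and a checkpoint that only contains the weights.
graph_names = ["h3", "b3", "h3/Adam", "h3/Adam_1", "b3/Adam", "b3/Adam_1"]
checkpoint_names = ["h3", "b3"]

restorable, missing = split_restorable(graph_names, checkpoint_names)
only_optimizer_state_missing = all(is_adam_slot(name) for name in missing)
```

If only optimizer slots turn out to be missing, restoring just the `restorable` variables (in TF 1.x a Saver accepts an explicit variable list) and letting the optimizer state initialize fresh would be one thing to try. Whether that fits into the DeepSpeech.py of that time without further changes is an open question.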

I was using the model for inference, just doing a few manual runs with my recordings and seeing what it output. Haven't tried restoring the model yet. I'll try to get some time for that next weekend and update you how it goes.

nmstoker on 9 Oct 2017

@nmstoker Could you comment on the updates? I see that was 17 days ago

nutelina on 25 Oct 2017

Sorry, I haven't had time to do much more on this in the last few weeks @nutelina (I only work on this as an occasional / weekend project)

nmstoker on 26 Oct 2017

Thanks for the reply @nmstoker. I was looking for a good STT NN to train, so the WER would be a good indication. Regarding your data sets, they seem much too small; there is a good YouTube video from Baidu which says at least 10,000 hours are needed. Video: https://youtu.be/g-sndkf7mCs and its GitHub page: https://github.com/baidu-research/ba-dls-deepspeech

nutelina on 26 Oct 2017

@gardenia22 , when you did the training with the combined data set (Ted, Librivox, VoxForge and internal), did you form the training dataset by shuffling all these resources and sorting them according to the length of the audio file? Or, did you train with Ted first, followed by Librivox, and so on?

Thank you very much for making your model available. I did a preliminary evaluation using TED test set and the performance is very impressive! It beats the model I got by training DeepSpeech on just the TED corpus for 10 epochs.

abuvaneswari on 16 Dec 2017

@abuvaneswari The audio files are sorted in ascending order of length. I trained on Librivox alone first, then on TED and the other datasets mixed together with Librivox.

gardenia22 on 18 Dec 2017
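
The ordering described above can be sketched as follows. This is only an illustration under assumptions: samples are represented as (path, duration in seconds) pairs, the helper names are made up, and the batch size of 12 is borrowed from the hyperparameters posted earlier in the thread. How DeepSpeech's own data loading handled this at the time is not shown here.

```python
def order_by_length(samples):
    """Sort (path, duration_seconds) pairs from shortest to longest audio."""
    return sorted(samples, key=lambda sample: sample[1])


def make_batches(samples, batch_size=12):
    """Group length-sorted samples into consecutive batches.

    Neighbouring samples have similar durations, which keeps padding small.
    """
    ordered = order_by_length(samples)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def training_stages(librivox, others):
    """Stage 1: Librivox only. Stage 2: Librivox mixed with the other sets."""
    return [order_by_length(librivox), order_by_length(librivox + others)]
```

For instance, training_stages(librivox_samples, ted_samples + voxforge_samples) yields two sorted lists, one per stage, and each can be passed to make_batches. Sorting by length mainly reduces wasted padding per batch; the trade-off is that batches are no longer random samples of the data.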

I have used the TIMIT test set to test the pre-trained model, and the WER is about 27%. The result is not very good, so I would like to know: has anyone else tested on the TIMIT set and got a similar result?