DeepSpeech 0.6.1 - Metadata duration parameter is not correct for small audio files

Created on 17 Apr 2020 · 18 Comments · Source: mozilla/DeepSpeech

The 'duration' value in the Metadata is not correct, especially when the audio sample is short (one or two words only).

TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.1-52-g8431251
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=38400
{"metadata":{"confidence":-4.8416},"words":[{"word":"visteon","time":0.24,"duration":0.36}]}

Please find the test WAV file here:
https://drive.google.com/file/d/1_vlqyDldadvWZkLEN23b51cF0Z68Jh4m/view?usp=sharing

In this example, the reported duration value is incorrect.

The result above was obtained after applying the following patch:
https://gist.github.com/reuben/70fdb0bb81b5155aeda3864fbf97766f

For the same file without the patch, the results are as follows:
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.1-51-g18403f0
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=38400
{"metadata":{"confidence":-4.83948},"words":[{"word":"visteon","time":0,"duration":0.6}]}
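
As a rough cross-check of the two outputs above, the sketch below computes the clip length from the logged WAV parameters and the end time (time + duration) of the recognised word in each run. It assumes that res.buffer_size is a byte count of raw PCM data, which the log does not state, and the helper names are made up for this illustration.

```python
import json

def clip_seconds(buffer_size, sample_rate, bits_per_sample, num_channels):
    # Assumption: buffer_size counts bytes of raw PCM audio.
    bytes_per_frame = (bits_per_sample // 8) * num_channels
    return buffer_size / (bytes_per_frame * sample_rate)

def word_spans(json_text):
    # Return (word, start, end) for every word in the client's JSON output.
    result = json.loads(json_text)
    return [(w["word"], w["time"], round(w["time"] + w["duration"], 2))
            for w in result["words"]]

# JSON lines copied from the two runs above.
patched = '{"metadata":{"confidence":-4.8416},"words":[{"word":"visteon","time":0.24,"duration":0.36}]}'
unpatched = '{"metadata":{"confidence":-4.83948},"words":[{"word":"visteon","time":0,"duration":0.6}]}'

print(clip_seconds(38400, 16000, 16, 1))  # 1.2
print(word_spans(patched))                # [('visteon', 0.24, 0.6)]
print(word_spans(unpatched))              # [('visteon', 0, 0.6)]
```

Under that assumption the clip is 1.2 s long, and both runs place the end of the word at 0.6 s; the two runs differ in where the word starts.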

cnair-83

All 18 comments

Could you please repro on current master without local patches? Or can you repro only with this file or any other that is built the same way?

lissyx on 17 Apr 2020

> DeepSpeech: v0.6.1-52-g8431251

This is still not pure DeepSpeech v0.6.1. Could you please ensure using the same tree?

lissyx on 17 Apr 2020

ping @cnair-83 ?

lissyx on 28 Apr 2020

I am trying to reproduce the same issue with the 0.7 release model, but the 0.7 native client does not support the lm. Is this expected behaviour?

cnair-83 on 28 Apr 2020

Yes, the language model + trie are now packaged in the scorer. See, for example, [[1](https://github.com/mozilla/DeepSpeech/blob/master/doc/USING.rst#using-a-pre-trained-model)].

kdavis-mozilla on 28 Apr 2020

With the 0.7 pre-trained model and scorer, the duration parameter seems to be correct, but the recognition is not accurate.

0.7 pre-trained model test results for the same wav file.

TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=38400
{
"metadata":{"confidence":2.86337},"words":[{"word":"mister","time":0.38,"duration":0.24},{"word":"yarn","time":0.7,"duration":0.16}],
"alternatives":[
{"metadata":{"confidence":3.02035},"words":[{"word":"mistor","time":0.38,"duration":0.24},{"word":"yarn","time":0.7,"duration":0.16}]},
{"metadata":{"confidence":3.81517},"words":[{"word":"mester","time":0.38,"duration":0.24},{"word":"yarn","time":0.7,"duration":0.16}]}
]
}
Here it recognises "mister yarn" instead of "visteon".

I will export my model to 0.7 and will confirm the results.
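
The 0.7 JSON output in this comment carries the top result plus a list of alternatives. A minimal sketch for listing each candidate transcript with its confidence, assuming only the field names visible in the output above (the function name is made up for this illustration):

```python
import json

def transcripts(json_text):
    # Return (confidence, text) for the top result followed by each alternative.
    result = json.loads(json_text)
    candidates = [result] + result.get("alternatives", [])
    return [(c["metadata"]["confidence"], " ".join(w["word"] for w in c["words"]))
            for c in candidates]

# Output copied from the 0.7.0 run above.
output = '''{
"metadata":{"confidence":2.86337},"words":[{"word":"mister","time":0.38,"duration":0.24},{"word":"yarn","time":0.7,"duration":0.16}],
"alternatives":[
{"metadata":{"confidence":3.02035},"words":[{"word":"mistor","time":0.38,"duration":0.24},{"word":"yarn","time":0.7,"duration":0.16}]},
{"metadata":{"confidence":3.81517},"words":[{"word":"mester","time":0.38,"duration":0.24},{"word":"yarn","time":0.7,"duration":0.16}]}
]
}'''

for confidence, text in transcripts(output):
    print(confidence, text)
```

For this clip, all three candidates share the same word timings and differ only in the spelling of the first word.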

cnair-83 on 28 Apr 2020

As "visteon" is such an uncommon word, I doubt if it's in its trie. So it's not a surprise.

If you need uncommon words such as "visteon", I suggest training your own scorer.

kdavis-mozilla on 28 Apr 2020

Exactly. I already have an lm and trie that I am using with the 0.6.1 model. I will generate a scorer for 0.7 and will confirm whether the duration is fine when the recognition is accurate.

cnair-83 on 28 Apr 2020

Also, the fact that it previously recognized "visteon" and now recognizes "mister yarn" could in itself explain the correct duration value.

lissyx on 28 Apr 2020

@cnair-83 Did you try 0.7.1?

kdavis-mozilla on 18 May 2020

Please find below the results with the 0.7.1 release model and scorer.

/deepspeech --model deepspeech-0.7.0-models.tflite --scorer deepspeech-0.7.0-models.scorer --json --audio /data/local/test_data/2.wav
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=58880
{
"metadata":{"confidence":-25.3747},"words":[{"word":"alexa","time":0.78,"duration":0.28}],
"alternatives":[
{"metadata":{"confidence":-10.7203},"words":[{"word":"alex","time":0.78,"duration":0.24}]},
{"metadata":{"confidence":-24.7374},"words":[{"word":"alaka","time":0.78,"duration":0.3}]}
]
}

/deepspeech --model deepspeech-0.7.0-models.tflite --scorer deepspeech-0.7.0-models.scorer --json --audio /data/local/test_data/3.wav
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=58880
{
"metadata":{"confidence":-25.784},"words":[{"word":"alexa","time":0.76,"duration":0.36}],
"alternatives":[
{"metadata":{"confidence":-11.6792},"words":[{"word":"alex","time":0.76,"duration":0.32}]},
{"metadata":{"confidence":-10.141},"words":[{"word":"alec","time":0.76,"duration":0.3}]}
]
}

https://drive.google.com/drive/folders/1oYwmXVswgJUpfZDHu8r1GDJmSUjaMkfB?usp=sharing

It seems that neither the start time nor the duration is accurate for either sample.

cnair-83 on 18 May 2020

./deepspeech --model deepspeech-0.7.0-models.tflite --scorer deepspeech-0.7.0-models.scorer --json --audio /data/local/test_data/3.wav --beam_width 128
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=58880
{
"metadata":{"confidence":-25.784},"words":[{"word":"alexa","time":0.76,"duration":0.38}],
"alternatives":[
{"metadata":{"confidence":-11.6792},"words":[{"word":"alex","time":0.76,"duration":0.32}]},
{"metadata":{"confidence":-10.141},"words":[{"word":"alec","time":0.76,"duration":0.32}]}
]
}

./deepspeech --model deepspeech-0.7.0-models.tflite --scorer deepspeech-0.7.0-models.scorer --json --audio /data/local/test_data/3.wav --beam_width 64
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=58880
{
"metadata":{"confidence":-25.784},"words":[{"word":"alexa","time":0.76,"duration":0.68}],
"alternatives":[
{"metadata":{"confidence":-11.6792},"words":[{"word":"alex","time":0.76,"duration":0.34}]},
{"metadata":{"confidence":-10.141},"words":[{"word":"alec","time":0.76,"duration":0.34}]}
]
}

In the two examples above, beam_width 128 gives a duration of 0.38 and beam_width 64 gives a duration of 0.68, which seems closer to the actual audio. But the start time is still not accurate.
Why does beam_width make such a big difference in the duration calculation here?
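
To make the comparison between the two runs explicit, here is a small sketch that pairs up the words of two JSON outputs and reports the change in duration. It assumes both runs return the same word sequence, so words can be matched by position; the function name is made up for this illustration, and the inputs are the top results from the two runs above with the alternatives omitted.

```python
import json

def duration_diff(json_a, json_b):
    # Pair the top-result words of two runs by position and report
    # (word, duration in run A, duration in run B, difference B - A).
    words_a = json.loads(json_a)["words"]
    words_b = json.loads(json_b)["words"]
    return [(a["word"], a["duration"], b["duration"],
             round(b["duration"] - a["duration"], 2))
            for a, b in zip(words_a, words_b)]

beam_128 = '{"metadata":{"confidence":-25.784},"words":[{"word":"alexa","time":0.76,"duration":0.38}]}'
beam_64 = '{"metadata":{"confidence":-25.784},"words":[{"word":"alexa","time":0.76,"duration":0.68}]}'

print(duration_diff(beam_128, beam_64))  # [('alexa', 0.38, 0.68, 0.3)]
```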

cnair-83 on 28 May 2020

The problem with start_time for the first word is that it shows the time step at which the first letter is recognised, which is the time step at the end of that letter. Ideally we should get the time step at which the first letter starts. But with the current implementation, it is difficult to get the start time of the first letter of the first word.

For instance, in the example above, for the word "alexa" the letter 'a' is recognised at time step 38 (0.76 s), which is the end of the letter 'a'. Ideally the word would start at time step 33 (0.66 s), where the letter 'a' starts.
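
The step-to-time arithmetic in this comment can be sketched as follows. The 20 ms step length is not stated in the thread; it is inferred from step 38 being reported as 0.76 and step 33 as 0.66, and the names are illustrative only.

```python
# Assumption: one decoder time step covers 0.02 s (inferred, not stated in the thread).
SECONDS_PER_STEP = 0.02

def step_to_seconds(step):
    # Convert a time step index to seconds, rounded to the client's two decimals.
    return round(step * SECONDS_PER_STEP, 2)

reported_start = step_to_seconds(38)  # step at which the first 'a' is emitted
ideal_start = step_to_seconds(33)     # step at which the first 'a' begins
offset = round(reported_start - ideal_start, 2)

print(reported_start, ideal_start, offset)  # 0.76 0.66 0.1
```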

cnair-83 on 2 Jun 2020

> For instance, in the example above, for the word "alexa" the letter 'a' is recognised at time step 38 (0.76 s), which is the end of the letter 'a'. Ideally the word would start at time step 33 (0.66 s), where the letter 'a' starts.

Do you reproduce with other words that would not start / end with the same letter?

lissyx on 2 Jun 2020

./deepspeech --model deepspeech-0.7.0-models.tflite --scorer deepspeech-0.7.0-models.scorer --audio ../../test_data/en-US_testing/pcm_general_ckohls_american_accent/withagc/1583528025758_but_as_the_name_of_the_russian_president.wav --json --beam_width 512
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=96000
{
"metadata":{"confidence":-44.8398},"words":[{"word":"what","time":0.54,"duration":0.12},{"word":"is","time":0.68,"duration":0.1},{"word":"the","time":0.8,"duration":0.08},{"word":"name","time":0.92,"duration":0.16},{"word":"of","time":1.1,"duration":0.12},{"word":"the","time":1.24,"duration":0.0999999},{"word":"russian","time":1.38,"duration":0.32},{"word":"president","time":1.8,"duration":0.4}],
"alternatives":[
{"metadata":{"confidence":-37.1093},"words":[{"word":"what","time":0.54,"duration":0.12},{"word":"is","time":0.68,"duration":0.1},{"word":"the","time":0.8,"duration":0.08},{"word":"name","time":0.92,"duration":0.16},{"word":"of","time":1.1,"duration":0.12},{"word":"the","time":1.24,"duration":0.0999999},{"word":"russian","time":1.38,"duration":0.32},{"word":"resident","time":1.8,"duration":0.4}]},
{"metadata":{"confidence":-38.2863},"words":[{"word":"what","time":0.54,"duration":0.12},{"word":"is","time":0.68,"duration":0.1},{"word":"the","time":0.8,"duration":0.08},{"word":"name","time":0.92,"duration":0.16},{"word":"of","time":1.1,"duration":0.12},{"word":"the","time":1.24,"duration":0.0999999},{"word":"russian","time":1.38,"duration":0.32},{"word":"present","time":1.8,"duration":0.42}]}
]
}

In this example too, the start_time for the first word 'what' is 0.54 s, which is the time step where the letter 'w' ends. Ideally the start time of the word 'what' would be 0.47 s, where the letter 'w' starts.
Please find the sample file at https://drive.google.com/drive/folders/1oYwmXVswgJUpfZDHu8r1GDJmSUjaMkfB?usp=sharing

cnair-83 on 2 Jun 2020

> In this example too, the start_time for the first word 'what' is 0.54 s, which is the time step where the letter 'w' ends. Ideally the start time of the word 'what' would be 0.47 s, where the letter 'w' starts.

I think I remember discussions around that specifically when the feature was landed and we had to find a middle ground that works for any language. @reuben might be able to shed more light, since he was reviewing that code :/

lissyx on 2 Jun 2020

I don't remember any specific choice of first vs last letter being decided on, or even coded. I think what you're seeing is just a coincidence. If you can figure out a better way to capture the timings, PRs are welcome. But the first task would be to collect a diverse set of audios as a test set to show that your changes don't regress timings too badly on files that aren't your own.
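
One possible shape for such a regression check, as a sketch only: compare word start times from the current decoder and from a changed decoder against hand-labelled reference timings, and flag clips where the error grows. The data layout, the function names, and the 0.02 s tolerance are all assumptions made for this illustration, not part of DeepSpeech.

```python
from statistics import mean

def start_time_error(reference, predicted):
    # Mean absolute start-time error over words matched by position and text.
    # Both arguments are lists of (word, start_seconds); returns None if nothing matches.
    errors = [abs(ref_start - pred_start)
              for (ref_word, ref_start), (pred_word, pred_start) in zip(reference, predicted)
              if ref_word == pred_word]
    return mean(errors) if errors else None

def regressions(test_set, baseline, candidate, tolerance=0.02):
    # test_set:  {clip name: [(word, hand-labelled start)]}
    # baseline:  {clip name: [(word, start)]} from the current decoder
    # candidate: {clip name: [(word, start)]} from the changed decoder
    # Returns the clips whose error grew by more than `tolerance` seconds.
    worse = []
    for clip, reference in test_set.items():
        old = start_time_error(reference, baseline[clip])
        new = start_time_error(reference, candidate[clip])
        if old is not None and new is not None and new > old + tolerance:
            worse.append(clip)
    return worse

# Example with the values reported earlier in this thread for the word 'what'.
print(round(start_time_error([("what", 0.47)], [("what", 0.54)]), 2))  # 0.07
```

Matching words by position keeps the sketch short; a real test set would also need a policy for clips where the two decoders produce different transcripts.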