DeepSpeech: Text timeline

Created on 20 Dec 2017 · 15 Comments · Source: mozilla/DeepSpeech

Could you please make it possible to get exact time at which each word was said?

enhancement

All 15 comments

@kaszperro I might be confused, but isn't that what https://discourse.mozilla.org/t/audacity-vamp-plugin-for-deep-speech/23349 does?

I'll have to check, because that page doesn't say a word about such functionality.

@lissyx
The plugin does VAD first and then passes the audio snippets on to Deep Speech. So no. You get some time metadata this way, but it’s very coarse-grained. Additionally, the granularity is based on the pauses in the recorded speech. So it’s kinda random.
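To illustrate why VAD-derived timing is coarse, here is a minimal sketch of an energy-threshold voice activity detector. It is not the plugin's actual code; the function name `vad_segments`, the 20 ms frame size, the threshold, and the silence length are all assumptions chosen for illustration. Each returned span covers a whole stretch of speech between pauses, so the only timestamps available are the span boundaries.

```python
def vad_segments(samples, sample_rate, frame_ms=20, threshold=0.01,
                 min_silence_frames=5):
    """Return (start_sec, end_sec) spans of speech found by an energy threshold.

    Hypothetical helper: all parameter defaults are illustrative assumptions.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None   # frame index where the current speech span began
    silence = 0    # consecutive silent frames seen inside the current span

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # index of the first silent frame
                segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
                start, silence = None, 0

    # Close a span that is still open when the audio ends.
    if start is not None:
        end = n_frames - silence
        segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
    return segments
```

Each span would then be transcribed separately, so every word inside a span shares the same start and end time. Where the spans fall depends only on where the speaker happened to pause.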

I second this. It would be great to have the possibility to get timestamps for every character, natively returned by DeepSpeech.

The way I understand it, every output character is based on a particular segment of the input audio anyways, so I assume that piece of information should be there and would just have to get returned by stt.
Also I guess internally DeepSpeech uses this information already to parse characters into words.

On 16 Feb 2018, at 17:01, Michael Leue notifications@github.com wrote:

> I second this. It would be great to have the possibility to get timestamps for every character, natively returned by DeepSpeech.
>
> The way I understand it, every output character is based on a particular segment of the input audio anyways, so I assume that piece of information should be there anyways and would just have to get returned by stt.

It is not.

> Also I guess internally DeepSpeech uses this information already to parse characters into words.

It does not.

Returning that information requires extending the CTC algorithm in TensorFlow, which is a non-trivial change. Patches always welcome, of course :)

Seems like a challenge 😊

@reuben
Thanks for the explanations. I quickly skimmed your article https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/ earlier. There you talk about "slices of audio" being used to predict character probabilities, so I thought there would be some way of linking back in the opposite direction as well.

Anyways, thanks for responding.

I'm interested in doing something like this - at a minimum, it would be quite nice to be able to get the character probabilities before they go to the CTC step. I'm going to be doing my own digging to see how this might work, but if anyone has any leads I'd appreciate it.

In my opinion, you would need a training set with timestamped words (unfortunately I'm not aware of one). Correct me if I'm wrong.

I'm confident that we can get the character probabilities by unfreezing and running the model ourselves at a minimum, but that wouldn't give timestamps because they would still have to be inferred. And as @reuben said, ideally this would be done by CTC itself.

It's possible we could get a 'good enough' time stamp once we have the character probabilities without actually rewriting the CTC algorithm, but it's hard to say without trying.
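A rough sketch of what such a 'good enough' approach might look like: greedy (best-path) CTC decoding over per-frame character probabilities, recording the frame at which each character is first emitted. This is a simplified stand-in, not the decoder DeepSpeech ships. The function names, the convention that the blank label is the last index, and the 20 ms frame step are assumptions made for illustration.

```python
def greedy_ctc_with_times(probs, alphabet, frame_sec=0.02):
    """Best-path CTC decode that also records when each character is emitted.

    probs: list of per-frame probability lists, one entry per label, with the
           CTC blank assumed to be the LAST index (an assumption, not verified).
    frame_sec: assumed time step between frames.
    Returns a list of (character, time_sec) pairs.
    """
    blank = len(alphabet)
    chars = []
    prev = blank
    for t, frame in enumerate(probs):
        best = max(range(len(frame)), key=frame.__getitem__)
        # CTC collapse rule: skip blanks and immediate repeats of the same label.
        if best != blank and best != prev:
            chars.append((alphabet[best], t * frame_sec))
        prev = best
    return chars


def words_with_times(chars):
    """Group (character, time) pairs into (word, start_time) pairs on spaces."""
    words = []
    current, start = "", None
    for ch, t in chars:
        if ch == " ":
            if current:
                words.append((current, start))
            current, start = "", None
        else:
            if not current:
                start = t
            current += ch
    if current:
        words.append((current, start))
    return words
```

The times produced this way mark where the label's probability peaks, which need not coincide with where the sound actually starts in the audio, so they should be treated as approximate.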

For the time being, I'm actually mostly interested in the character probabilities, and less so in the time stamps, although I do want those further down the line.

No need to unfreeze the model, just fetch the "raw_logits" node which is the output of the acoustic model (and input to CTC decoder).
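Assuming the values fetched from that node are unnormalized per-frame scores (one list of scores per audio frame, which is an assumption about the node's output and not something confirmed in this thread), a softmax over each frame turns them into character probabilities. The sketch below does not show how to fetch the node itself; `logits_to_probs` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame of scores."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_probs(raw_logits):
    """Convert a [frames][labels] list of scores into per-frame probabilities."""
    return [softmax(frame) for frame in raw_logits]
```

Each output row sums to 1 and can be read as the model's belief over the alphabet (plus the CTC blank) for that frame.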

So, some news that @reuben was too shy to share: https://github.com/mozilla/DeepSpeech/issues/1656#issuecomment-431296511

SO excited for the word timeline!

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
