DeepSpeech: Text timeline

Created on 20 Dec 2017 · 15 Comments · Source: mozilla/DeepSpeech

Could you please make it possible to get the exact time at which each word was spoken?

enhancement

kaszperro

👍10

All 15 comments

@kaszperro I might be confused, but isn't that what https://discourse.mozilla.org/t/audacity-vamp-plugin-for-deep-speech/23349 does?

lissyx on 20 Dec 2017

I'll have to check, because there isn't a word there about such functionality.

kaszperro on 20 Dec 2017

@lissyx
The plugin does VAD first and then passes the audio snippets on to Deep Speech. So no. You get some time metadata this way, but it’s very coarse-grained. Additionally, the granularity is based on the pauses in the recorded speech. So it’s kinda random.
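
To illustrate why VAD-based timing is coarse, here is a minimal sketch of an energy-threshold VAD. It is not the plugin's actual implementation; the function name `vad_segments`, the threshold, and the frame size are all assumptions chosen for illustration. It only reports where each speech segment starts and ends, so every word inside a segment shares the same time span.

```python
from typing import List, Tuple


def vad_segments(samples: List[float], sample_rate: int,
                 frame_ms: int = 20, threshold: float = 0.01,
                 min_silence_frames: int = 5) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose mean energy exceeds a threshold.

    Hypothetical helper: a real VAD (and the plugin discussed here) may work
    quite differently.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments = []
    start = None   # frame index where the current speech segment began
    silence = 0    # consecutive silent frames seen inside the segment
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # The segment ends at the first frame of the silent run.
                end = i - silence + 1
                segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
                start, silence = None, 0
    if start is not None:
        # Audio ended while still inside a segment.
        end = n_frames - silence
        segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
    return segments
```

Each returned span would be transcribed as a whole, so the only timestamps available are the segment boundaries, and those depend on where the speaker happened to pause.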

JanX2 on 25 Dec 2017

I second this. It would be great to have the possibility to get timestamps for every character, natively returned by DeepSpeech.

The way I understand it, every output character is based on a particular segment of the input audio anyways, so I assume that piece of information should be there and would just have to get returned by stt.
Also I guess internally DeepSpeech uses this information already to parse characters into words.

mleue on 16 Feb 2018

On 16 Feb 2018, at 17:01, Michael Leue wrote:

> I second this. It would be great to have the possibility to get timestamps for every character, natively returned by DeepSpeech.
>
> The way I understand it, every output character is based on a particular segment of the input audio anyways, so I assume that piece of information should be there anyways and would just have to get returned by stt.

It is not.

> Also I guess internally DeepSpeech uses this information already to parse characters into words.

It does not.

Returning that information requires extending the CTC algorithm in TensorFlow, which is a non-trivial change. Patches always welcome, of course :)

reuben on 16 Feb 2018

👍2

Seems like a challenge 😊

kaszperro on 16 Feb 2018

@reuben
Thanks for the explanations. I skimmed through your article earlier (https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/), where you talk about "slices of audio" being used to predict character probabilities, so I thought there would be some way of linking back in the opposite direction as well.

Anyways, thanks for responding.

mleue on 16 Feb 2018

I'm interested in doing something like this - at a minimum, it would be quite nice to be able to get the character probabilities before they go to the CTC step. I'm going to be doing my own digging to see how this might work, but if anyone has any leads I'd appreciate it.

mathematiguy on 12 Oct 2018

In my opinion, you would need a training set with timestamped words (unfortunately, I'm not aware of one). Correct me if I'm wrong.

kaszperro on 12 Oct 2018

I'm confident that we can get the character probabilities by unfreezing and running the model ourselves at a minimum, but that wouldn't give timestamps because they would still have to be inferred. And as @reuben said, ideally this would be done by CTC itself.

It's possible we could get a 'good enough' time stamp once we have the character probabilities without actually rewriting the CTC algorithm, but it's hard to say without trying.

For the time being, I'm actually mostly interested in the character probabilities, and less so in the time stamps, although I do want those further down the line.
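
A rough sketch of the "good enough" idea above, assuming we already have a matrix of per-frame character probabilities: take the most likely class in each frame, collapse repeats, drop blanks, and remember the frame where each character first appears. This is plain best-path decoding, not the beam search DeepSpeech uses, and the function names, the blank-is-last-index convention, and the 20 ms frame step are assumptions for illustration.

```python
from typing import List, Tuple


def greedy_ctc_with_times(probs: List[List[float]], alphabet: str,
                          frame_sec: float = 0.02) -> List[Tuple[str, float]]:
    """Best-path CTC decode that also records when each character is emitted.

    Assumes the blank label is the last class (index len(alphabet)) and that
    each frame covers frame_sec seconds; both are illustrative assumptions.
    """
    blank = len(alphabet)
    out = []
    prev = blank
    for t, frame in enumerate(probs):
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the label changes and is not the blank.
        if best != prev and best != blank:
            out.append((alphabet[best], t * frame_sec))
        prev = best
    return out


def words_with_times(chars: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Group timed characters into (word, start_time) pairs, splitting on spaces."""
    words = []
    current, start = "", None
    for ch, t in chars:
        if ch == " ":
            if current:
                words.append((current, start))
            current, start = "", None
        else:
            if not current:
                start = t
            current += ch
    if current:
        words.append((current, start))
    return words
```

The times obtained this way mark the frame where the network's output peaks for a character, which need not coincide with where the sound actually begins in the audio, so they should be treated as approximate.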

mathematiguy on 15 Oct 2018

No need to unfreeze the model, just fetch the "raw_logits" node which is the output of the acoustic model (and input to CTC decoder).
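
If the fetched values are unnormalized scores, as the name "raw_logits" suggests (an assumption, not something confirmed in this thread), a per-frame softmax would turn them into character probabilities. A minimal sketch, with a hypothetical helper name and a plain list-of-lists layout of one row per audio frame:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one frame of logits."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_probs(raw_logits: List[List[float]]) -> List[List[float]]:
    """Apply softmax to every frame; each output row sums to 1."""
    return [softmax(frame) for frame in raw_logits]
```

How the array is actually fetched from the graph, and its exact shape, depend on the DeepSpeech version and are not shown here.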

reuben on 15 Oct 2018

🎉1

So, some news that @reuben was too shy to share: https://github.com/mozilla/DeepSpeech/issues/1656#issuecomment-431296511

lissyx on 19 Oct 2018

👍2

SO excited for the word timeline!

spencer-brown on 3 Nov 2018

👍5

Closing because this is being taken care of in https://github.com/mozilla/DeepSpeech/pull/1893 and https://github.com/mozilla/DeepSpeech/pull/1892

lissyx on 20 Feb 2019