Deepspeech: Feature request: streaming decoder (fast DS_IntermediateDecode calls)

Created on 19 Jan 2019 · 10 Comments · Source: mozilla/DeepSpeech

Can a streaming recognition service be added to the DeepSpeech client? Currently an audio file is recorded first and transcribed by the engine afterwards. However, most of the big STT services let you stream realtime audio from a mic and get results back simultaneously. That feature would give a boost to applications of this project that need realtime recognition.

Chidhambararajan

Most helpful comment

I refactored all of this and got it working. Just need to tidy it up a bit and then I'll post a PR.

dabinat on 18 May 2019

👍3

All 10 comments

Is the vad_transcriber insufficient? If so, why?

kdavis-mozilla on 19 Jan 2019

@kdavis-mozilla I think @Chidhambararajan is referring to the ability to get the transcription as soon as enough logits have accumulated to do so.
For example: I've tried to use the .NET client to stream the Windows audio output, get the transcriptions, and show them on screen the way most subtitles work. What is the problem in my case? We can stream the audio, but we can't get transcriptions from the stream without stopping it.
I tried to write something like "intermediateDecodeAndRelease", which would execute the decoding and throw away the old logits, but due to my limited knowledge of C++ it did not work :(

Related to #1757

carlfm01 on 19 Jan 2019

I think it all boils down to the fact that the decoding step is not yet streamable

lissyx on 23 Jan 2019

👍1

@Chidhambararajan That being said, we already have streaming for the audio feeding, and on a desktop with a decent CPU or a GPU it should be faster than realtime, as well as on a mid-range Android smartphone with the TFLite quantized model.

So you can build realtime transcription today, though not perfectly yet, and it should improve once we have a streaming decoder (soon).

lissyx on 20 Feb 2019

👍1

Currently, the decoder we use (native_client/ctcdecode) exposes a batch API that takes a probabilities matrix and returns a list of decoded strings. The implementation is a beam search loop over all the time steps in the input probabilities. To implement a streaming decoder, one would have to refactor the decoder API from a single ctc_beam_search_decoder() call into a state-struct style API which is split into three stages: decoder_init, decoder_next, and decoder_finish or decoder_decode. At the start, you set the state for the decoder with the decoder_init() call:

https://github.com/mozilla/DeepSpeech/blob/a4b35d2f2487de69ce5ef2926fcd26e2698c7d69/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L26-L46

Which returns a decoder state struct which contains all of the variables needed for the main loop. Then eventually you feed a batch of probabilities into the decoder with a decoder_next() step which performs N steps of the main loop over time:

https://github.com/mozilla/DeepSpeech/blob/a4b35d2f2487de69ce5ef2926fcd26e2698c7d69/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L48-L121

And then finally you'd have a decoder_finish() or decoder_decode() step that does the final score adjustments if necessary and returns a list of decoded strings.

https://github.com/mozilla/DeepSpeech/blob/a4b35d2f2487de69ce5ef2926fcd26e2698c7d69/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L124-L175

This step could then be called from DS_IntermediateDecode to quickly get the current decoding of the stream without having to always start from scratch. With this API in place, after a batch is computed with the acoustic model we can immediately feed the probabilities into the decoder_next() step. I'm fairly certain there are more performance gains to be had in the decoder, but this would be an amazing first step.

reuben on 11 May 2019

👍1

In the end, here's how the API would be used:

DS_SetupStream() -> decoder_init
DS_FeedAudioContent() -> ... -> StreamingState::processBatch -> decoder_next
DS_IntermediateDecode() and DS_FinishStream() -> decoder_decode
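
To make the mapping above concrete, here is a minimal sketch of the state-struct shape in C++. It is not code from the DeepSpeech repository: the `DecoderState` struct and its fields are hypothetical, and a greedy best-path CTC decoder stands in for the real beam search so the example stays short. Only the three-stage split (`decoder_init`, `decoder_next`, `decoder_decode`) follows the proposal in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical decoder state: everything the time-step loop needs to keep
// between calls. A real beam search would store its prefix tree here instead.
struct DecoderState {
  std::string alphabet;  // one character per non-blank class
  std::size_t blank_id;  // index of the CTC blank class (assumed to be last)
  std::size_t prev_id;   // class chosen at the previous time step
  std::string text;      // transcript accumulated so far
};

// Would be called from DS_SetupStream().
DecoderState decoder_init(const std::string& alphabet) {
  // prev_id starts as blank so the first non-blank class is always emitted.
  return DecoderState{alphabet, alphabet.size(), alphabet.size(), ""};
}

// Would be called from StreamingState::processBatch. Consumes one batch of
// per-time-step class probabilities and advances the state in place.
void decoder_next(DecoderState& state,
                  const std::vector<std::vector<double>>& probs) {
  for (const auto& step : probs) {
    assert(step.size() == state.alphabet.size() + 1);
    // Greedy stand-in for beam search: pick the most probable class.
    std::size_t best = 0;
    for (std::size_t i = 1; i < step.size(); ++i) {
      if (step[i] > step[best]) best = i;
    }
    // CTC collapse: emit only when the class is not blank and has changed.
    if (best != state.blank_id && best != state.prev_id) {
      state.text += state.alphabet[best];
    }
    state.prev_id = best;
  }
}

// Would be called from DS_IntermediateDecode() and DS_FinishStream(). It only
// reads the state, so it can be called after every batch without redoing work.
std::string decoder_decode(const DecoderState& state) {
  return state.text;
}
```

The point of the split is that `decoder_next` does all the per-time-step work as batches arrive, so an intermediate decode never has to restart from the first time step. With a real beam search, `decoder_decode` would still need to rank the live beams, but it would not replay earlier steps.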