Deepspeech: Feature request: streaming decoder (fast DS_IntermediateDecode calls)

Created on 19 Jan 2019 · 10 Comments · Source: mozilla/DeepSpeech

Can a streaming recognition service be added to the DeepSpeech client? Currently an audio file is recorded first and transcribed by the engine afterwards. However, most of the big STT services can stream real-time audio from a microphone and return results as the audio arrives. That feature would broaden this project's applications in real-time recognition.

All 10 comments

Is the vad_transcriber insufficient? If so, why?

@kdavis-mozilla I think @Chidhambararajan is referring to the ability to get the transcription as soon as we have accumulated enough logits to do so.
For example: I've tried to use the .NET client to stream the Windows audio output, get the transcriptions, and show them on screen the way most subtitles work. The problem in my case is that we can stream the audio, but we can't get transcriptions from the stream without stopping it.
I tried to write something like "intermediateDecodeAndRelease", which would execute the decoding and throw away the old logits, but due to my limited knowledge of C++ it did not work :(

Related to #1757

I think it all boils down to the fact that the decoding step is not yet streamable

@Chidhambararajan That being said, we already have streaming for the audio feeding. On a desktop with a decent CPU or a GPU it should be faster than real time, and likewise on a mid-range Android smartphone with the TFLite quantized model.

So you can build real-time transcription today, though not perfectly, and it should improve once we have a streaming decoder (soon).

Currently, the decoder we use (native_client/ctcdecode) exposes a batch API that takes a probabilities matrix and returns a list of decoded strings. The implementation is a beam search loop over all the time steps in the input probabilities. To implement a streaming decoder, one would have to refactor the decoder API from a single ctc_beam_search_decoder() call into a state-struct style API which is split into three stages: decoder_init, decoder_next, and decoder_finish or decoder_decode. At the start, you set the state for the decoder with the decoder_init() call:

https://github.com/mozilla/DeepSpeech/blob/a4b35d2f2487de69ce5ef2926fcd26e2698c7d69/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L26-L46

This returns a decoder state struct that contains all of the variables needed for the main loop. You then feed batches of probabilities into the decoder with decoder_next(), which performs N steps of the main loop over time:

https://github.com/mozilla/DeepSpeech/blob/a4b35d2f2487de69ce5ef2926fcd26e2698c7d69/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L48-L121

Finally, you'd have a decoder_finish() or decoder_decode() step that does the final score adjustments if necessary and returns a list of decoded strings:

https://github.com/mozilla/DeepSpeech/blob/a4b35d2f2487de69ce5ef2926fcd26e2698c7d69/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L124-L175

This step could then be called from DS_IntermediateDecode to quickly get the current decoding of the stream without having to always start from scratch. With this API in place, after a batch is computed with the acoustic model we can immediately feed the probabilities into the decoder_next() step. I'm fairly certain there are more performance gains to be had in the decoder, but this would be an amazing first step.

In the end, here's how the API would be used:

DS_SetupStream() -> decoder_init
DS_FeedAudioContent() -> ... -> StreamingState::processBatch -> decoder_next
DS_IntermediateDecode() and DS_FinishStream() -> decoder_decode
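
To make the shape of that split concrete, here is a minimal sketch of a state-struct decoder with the three stages. It is not the DeepSpeech implementation: to stay short it swaps the beam search for greedy (best-path) CTC decoding, and the DecoderState fields, the function signatures, and the convention that the blank label is the last class are all assumptions made for illustration. Only the names decoder_init, decoder_next and decoder_decode come from the proposal above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical decoder state: everything the per-time-step loop needs in
// order to resume where the previous batch stopped.
struct DecoderState {
    std::string alphabet;  // class i maps to alphabet[i]; blank is index alphabet.size()
    int prev_class;        // best class at the previous time step (-1 before any input)
    std::string text;      // collapsed transcription so far
};

// Stage 1: create the state once per stream.
DecoderState decoder_init(const std::string& alphabet) {
    return DecoderState{alphabet, -1, ""};
}

// Stage 2: consume a batch of time steps. Each row holds
// alphabet.size() + 1 probabilities (the last one is the blank).
void decoder_next(DecoderState& state,
                  const std::vector<std::vector<double>>& probs) {
    const int blank = static_cast<int>(state.alphabet.size());
    for (const auto& row : probs) {
        assert(row.size() == state.alphabet.size() + 1);
        int best = 0;
        for (int c = 1; c <= blank; ++c) {
            if (row[c] > row[best]) best = c;
        }
        // CTC collapse: emit only when the class changes, and never emit blank.
        if (best != blank && best != state.prev_class) {
            state.text.push_back(state.alphabet[best]);
        }
        state.prev_class = best;
    }
}

// Stage 3: read the current result without touching the state, so it can be
// called repeatedly while the stream is still being fed.
std::string decoder_decode(const DecoderState& state) {
    return state.text;
}
```

Because decoder_decode() only reads the state, an intermediate result costs nothing beyond the work already done in decoder_next(), which is the property DS_IntermediateDecode() needs. A real beam search version would keep the prefix tree and beam in the state struct instead of a single string.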

I refactored all of this and got it working. Just need to tidy it up a bit and then I'll post a PR.

See PR #2121.

This has now been merged with master so I'll close this issue.

