Deepspeech: batching during inferencing

Created on 18 Sep 2017 · 21 Comments · Source: mozilla/DeepSpeech

Hello,
Does native_client support running inference on more than one audio file at the same time? I would like to use my GPU for inference and improve its utilization by batching requests from multiple audio files.

abuvaneswari

👍4

Most helpful comment

Chiming in again: I didn't have access to evaluate.py since we were on an old 0.2.0 alpha release. But having updated to 0.4.1, I've been very pleased by evaluate.py's performance. 5 hours to transcribe 300 hours of audio on one GPU machine, bloody awesome.

So consider me happy, where this is concerned.

mathematiguy on 12 Feb 2019

👍3

All 21 comments

My program to test every file listed in a CSV file:
https://pastebin.mozilla.org/9032686
It reports the elapsed time and the number of inferences done.

Screen capture: https://pastebin.mozilla.org/9032687

elpimous on 19 Sep 2017

Inference on a 15-second WAV file on a GTX 1080 GPU takes 7 seconds. Is that expected? It seems like a long time to me.

abuvaneswari on 20 Sep 2017

Are you using the existing binaries, or did you compile native_client yourself? (Compile it with the CUDA option.)

elpimous on 20 Sep 2017

Yes, I compiled native_client with the CUDA option.

abuvaneswari on 20 Sep 2017

Well, on my small TX2, without overclocking, I just ran a test on a batch of 101 WAV files (3 s per file on average):
101 inferences in 48.903 s, so inference takes about wavetime/3.
To estimate your own case, just compare the GPU boards in an online benchmark.

elpimous on 20 Sep 2017

My program to test every file listed in a CSV file:
https://pastebin.mozilla.org/9032686
It reports the elapsed time and the number of inferences done.

Screen capture: https://pastebin.mozilla.org/9032687

Hi @elpimous, thanks for the answer.
The links are broken now. Could you please share them again?
Best regards

nicolaspanel on 28 Nov 2018

@elpimous @kdavis-mozilla
It would be great to have this feature.
I can work on the PR, but since it will take some time to develop, could you first confirm that it is something you are interested in?
Best regards

nicolaspanel on 5 Dec 2018

👍1

@nicolaspanel I find it interesting. @reuben, what's your take?

kdavis-mozilla on 5 Dec 2018

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

reuben on 5 Dec 2018

It would be great to have this feature in the prebuilt DeepSpeech binaries as well, so that more than one audio file can be run through inference at the same time. Currently, I've written a Python script that passes the audio file names one by one.

nullbyte91 on 6 Dec 2018

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

Right now, the Python client.py looks like this:

ds = Model(...)
fin = wave.open(args.audio, 'rb')
fs = fin.getframerate()  # 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)  # audio.shape => (n_frames,)
transcript = ds.stt(audio, fs)

We could just add a Model#stt_batch -> List[str] method that expects an array of shape (BATCH_SIZE, max(n_frames)) and the frame rate as inputs:

ds = Model(...)
fin = wave.open(args.audio, 'rb')
fs = fin.getframerate()  # 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)  # audio.shape => (n_frames,)
audios = audio.reshape((1, audio.shape[0]))
transcript = ds.stt_batch(audios, fs)

The tricky part is of course the underlying DeepSpeech/native_client/deepspeech.cc code.

nicolaspanel on 6 Dec 2018
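The proposal above implies that clips of different lengths get packed into one (BATCH_SIZE, max(n_frames)) array. A minimal sketch of that packing step, assuming zero padding on the right and 16-bit mono samples; `pad_batch` is a hypothetical helper, and `stt_batch` itself is only the proposed method, not an existing API. The true lengths are returned as well, since a batch implementation would presumably need them to ignore the padding.

```python
import numpy as np

def pad_batch(clips, pad_value=0):
    """Stack 1-D int16 clips into one (batch, max_len) array plus true lengths."""
    if not clips:
        raise ValueError("need at least one clip")
    lengths = np.array([len(clip) for clip in clips], dtype=np.int64)
    # Allocate the padded array, then copy each clip into the start of its row.
    batch = np.full((len(clips), int(lengths.max())), pad_value, dtype=np.int16)
    for row, clip in enumerate(clips):
        batch[row, :len(clip)] = clip
    return batch, lengths

# Two toy "clips" of 5 and 3 samples standing in for real audio buffers.
clips = [np.arange(5, dtype=np.int16), np.arange(3, dtype=np.int16)]
audios, n_frames = pad_batch(clips)
print(audios.shape, n_frames.tolist())
```

With real audio, each clip would come from np.frombuffer as in the client.py excerpt, and `audios` is what a method like the proposed stt_batch would receive.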

(Caveat: We're currently running on 0.2.0-alpha.7, but my findings seem to be consistent with everyone else's.)

I was just about to put through a feature request rather similar to this.

In our work we've noticed an inference rate of about 0.3s for every 1s of audio uploaded, but we can also see that both the CPU and GPU are underutilised. In CPU-only inference on my 4 core (8 thread) laptop, the CPU hovers around 25% during inference and inference runs slower than real time.

With a GPU we achieve the previously mentioned (by @elpimous) 0.3s of inference for every 1 second of audio, but even there we can see that the GPU is pretty underutilised by inspecting nvidia-smi. At the moment, for our 300 hours of audio, GPU inference still takes about 5-6 days to run completely, which is disappointingly slow.

We can spin up other instances, or more expensive instances but if we could achieve a significant enough speedup by batching this would be a big win for evaluating and optimising new models.

All of this is to say, thanks @nicolaspanel for looking into this. I'll also be keen to hear how much of an impact this makes on local inference times.

mathematiguy on 30 Dec 2018
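To relate the figures quoted above, here is a small sketch that converts an amount of audio and a real-time factor (seconds of compute per second of audio) into an estimated inference time. The function name is hypothetical, and the estimate counts inference only, so it is a lower bound rather than a prediction of wall-clock time.

```python
def inference_hours(audio_hours, real_time_factor):
    """Estimate pure inference time for a given amount of audio.

    real_time_factor is seconds of compute per second of audio.
    """
    if audio_hours < 0 or real_time_factor < 0:
        raise ValueError("inputs must be non-negative")
    return audio_hours * real_time_factor

audio_hours = 300.0  # corpus size quoted above
rtf = 0.3            # about 0.3 s of compute per 1 s of audio, as quoted above
hours = inference_hours(audio_hours, rtf)
days = hours / 24.0
print(f"{hours:.1f} hours = {days:.2f} days")
```

At those figures the estimate is about 90 hours, or 3.75 days, of pure inference.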

At the moment, for our 300 hours of audio, GPU inference still takes about 5-6 days to run completely, which is disappointingly slow.

It might help if you could explain your use case.

lissyx on 30 Dec 2018

I’m just transcribing audio to evaluate our model performance. We want evaluations to be fast so we can try out lots of model parameters.

I could sample instead, but since it’s the Christmas break I didn’t mind leaving a long running job.

It seems to me that batching jobs will also help speed up our current transcription API, which can take audio files over an hour long. Also, it sits on an AWS EC2 instance at the moment, so faster inference means we can possibly reduce costs.

mathematiguy on 31 Dec 2018

For evaluating model performance I'd strongly encourage you to use evaluate.py rather than the clients, as it's optimized for throughput rather than latency. It takes the same arguments as DeepSpeech.py but only does evaluation, so it'll only look at test_files, test_batch_size, etc. You'll need to point it at a checkpoint (with --checkpoint_dir) rather than at a frozen model.

reuben on 31 Dec 2018
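As an illustration of the advice above, this sketch assembles an evaluate.py command line from the three flags named in the comment (test_files, test_batch_size, checkpoint_dir). The paths and batch size are hypothetical, and depending on the release other flags may also be required; it only prints the command rather than running it.

```python
import shlex

def evaluate_command(test_files, checkpoint_dir, test_batch_size):
    """Build an argv list for evaluate.py using the flags named above."""
    return [
        "python", "evaluate.py",
        "--test_files", test_files,
        "--test_batch_size", str(test_batch_size),
        "--checkpoint_dir", checkpoint_dir,
    ]

# Hypothetical paths and batch size, for illustration only.
argv = evaluate_command("data/test.csv", "checkpoints/", 16)
print(" ".join(shlex.quote(arg) for arg in argv))
```

The same list can be passed to subprocess.run from the DeepSpeech checkout if you prefer to launch it from Python.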

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

@reuben @kdavis-mozilla since everyone here seems interested in such a feature, maybe we could include it in upcoming releases (https://github.com/mozilla/DeepSpeech/projects). What do you think?

Like I said, I can contribute if needed.

nicolaspanel on 12 Feb 2019

So consider me happy, where this is concerned.

mathematiguy on 12 Feb 2019

👍3

@nicolaspanel Are you working on this feature? I am also interested in such a feature.

CP-4 on 5 Nov 2019

@reuben Has someone started working on that feature?

phtephanx on 16 Jan 2020

It would be great to have this feature in the prebuilt DeepSpeech binaries as well, so that more than one audio file can be run through inference at the same time. Currently, I've written a Python script that passes the audio file names one by one.

@nullbyte91 I'm trying to run audio files one by one through a Python script with the DeepSpeech 0.6.1 model.
Could you please help me out?

rakeshku93 on 27 Apr 2020
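For running files one by one, a minimal sketch might look like the following. It assumes 16-bit mono PCM WAV input and takes the speech-to-text call as a plain callable, because the Model constructor and stt signature differ between DeepSpeech releases; `read_wav_int16` and `transcribe_files` are hypothetical helper names. The point is to load the model once and reuse it for every file.

```python
import wave

import numpy as np

def read_wav_int16(path):
    """Read a 16-bit PCM WAV file; return (samples, frame_rate)."""
    with wave.open(path, "rb") as fin:
        if fin.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit samples")
        frames = fin.readframes(fin.getnframes())
        return np.frombuffer(frames, dtype=np.int16), fin.getframerate()

def transcribe_files(paths, stt):
    """Call stt(samples, frame_rate) once per file; return {path: transcript}."""
    results = {}
    for path in paths:
        samples, frame_rate = read_wav_int16(path)
        results[path] = stt(samples, frame_rate)
    return results
```

With a model loaded as `ds`, the callable could wrap `ds.stt`, for example `lambda audio, fs: ds.stt(audio, fs)` for releases where stt takes the frame rate as in the client.py excerpt earlier in the thread; check the signature for your release.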

For evaluating model performance I'd strongly encourage you to use evaluate.py rather than the clients, as it's optimized for throughput rather than latency. It takes the same arguments as DeepSpeech.py but only does evaluation, so it'll only look at test_files, test_batch_size, etc. You'll need to point it at a checkpoint (with --checkpoint_dir) rather than at a frozen model.

It is also possible to load a frozen graph like this (posting it just in case someone else needs it). A model_path flag was added to FLAGS.

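# Assumes the surrounding evaluate.py context: tfv1 is tensorflow.compat.v1,
# and FLAGS, session and load_graph_for_evaluation are already defined there.
if FLAGS.model_path:
    # Read the serialized frozen GraphDef from disk.
    with tfv1.gfile.FastGFile(FLAGS.model_path, 'rb') as fin:
        graph_def = tfv1.GraphDef()
        graph_def.ParseFromString(fin.read())

    # Import the frozen graph and fetch one tensor per trainable variable.
    var_names = [v.name for v in tfv1.trainable_variables()]
    var_tensors = tfv1.import_graph_def(graph_def, return_elements=var_names)

    # build a { var_name: var_tensor } dict
    var_tensors = dict(zip(var_names, var_tensors))

    training_graph = tfv1.get_default_graph()

    # Copy each frozen value into the matching variable of the evaluation graph.
    assign_ops = []
    for name, restored_tensor in var_tensors.items():
        training_tensor = training_graph.get_tensor_by_name(name)
        assign_ops.append(tfv1.assign(training_tensor, restored_tensor))

    init_from_frozen_model_op = tfv1.group(*assign_ops)
    session.run(init_from_frozen_model_op)
else:
    # No frozen model given: fall back to loading from a checkpoint.
    load_graph_for_evaluation(session)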