DeepSpeech: batching during inference

Created on 18 Sep 2017 · 21 Comments · Source: mozilla/DeepSpeech

Hello,
Does native_client support running inference on more than one audio file at the same time? I would like to run inference on my GPU and improve its utilization by batching the requests from multiple audio files.

All 21 comments

Running inference on a 15-second WAV file on a GTX 1080 GPU takes 7 seconds. Is that expected? It seems like a long time to me.

Are you using the prebuilt binaries, or did you compile native_client yourself (with the CUDA option)?

Yes, I compiled native_client with the CUDA option.

Well, on my small TX2, without overclocking, I just ran a test on a batch of 101 WAV files (3 s per file on average):
101 inferences in 48.903 s, so inference takes about wavetime/3.
To judge your result, you just have to compare the GPU boards in an online benchmark.

My program to test every file listed in a CSV file:
https://pastebin.mozilla.org/9032686
It reports the elapsed time and the number of inferences done.

Screenshot: https://pastebin.mozilla.org/9032687
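
When comparing boards, a common figure is the real-time factor: processing time divided by audio duration. The sketch below only plugs in the figures quoted in the comment above; the function name is hypothetical, and the 3 s average is approximate, so the result is indicative at best.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the comment: 101 files of roughly 3 s each, 48.903 s in total.
n_files = 101
avg_clip_seconds = 3.0  # approximate average, as stated in the comment
total_processing_seconds = 48.903

rtf = real_time_factor(total_processing_seconds, n_files * avg_clip_seconds)
print(f"real-time factor: {rtf:.3f}")
```

With these rounded inputs the ratio comes out at about 0.16, which is closer to a sixth of the audio duration than a third, so both numbers are best read as rough estimates.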

Hi @elpimous, thanks for the answer.
The links are broken now; could you please share them again?
Best regards

@elpimous @kdavis-mozilla
It would be great to have this feature.
I can work on a PR, but since it will take some time to develop, could you first confirm that it is something you are interested in?
Best regards

@nicolaspanel I find it interesting. @reuben, what's your take?

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

It would be great to have this feature in the prebuilt DeepSpeech binary as well, so that it can run inference on more than one audio file at the same time. Currently, I've written a Python script that passes the audio file names one by one.

Right now, the Python client.py looks like this:

import wave
import numpy as np
from deepspeech import Model

ds = Model(...)
fin = wave.open(args.audio, 'rb')
fs = fin.getframerate()  # 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)  # audio.shape => (n_frames,)
transcript = ds.stt(audio, fs)

We could just add a Model#stt_batch -> List[str] method that expects an array of shape (BATCH_SIZE, max(n_frames)) and the frame rate as inputs:

ds = Model(...)
fin = wave.open(args.audio, 'rb')
fs = fin.getframerate()  # 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)  # audio.shape => (n_frames,)
audios = audio.reshape((1, audio.shape[0]))  # audios.shape => (1, n_frames), a batch of one
transcripts = ds.stt_batch(audios, fs)  # proposed method (does not exist yet), returns List[str]

The tricky part is, of course, the underlying code in DeepSpeech/native_client/deepspeech.cc.
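
For a real batch, clips of different lengths would first have to be packed into the (BATCH_SIZE, max(n_frames)) array the proposal describes. A minimal sketch of that step, assuming zero-padding at the end of each clip is acceptable; `pad_batch` is a hypothetical helper, not part of DeepSpeech, and `stt_batch` itself is still only a proposal.

```python
import numpy as np


def pad_batch(clips):
    """Stack 1-D int16 clips into a (batch, max_frames) array.

    Shorter clips are zero-padded at the end. The original lengths are
    returned too, so a consumer could ignore the padding.
    """
    if not clips:
        raise ValueError("clips must not be empty")
    lengths = np.array([len(clip) for clip in clips], dtype=np.int64)
    batch = np.zeros((len(clips), int(lengths.max())), dtype=np.int16)
    for row, clip in enumerate(clips):
        batch[row, :len(clip)] = clip
    return batch, lengths


# Two toy "clips" standing in for decoded WAV data.
clips = [
    np.array([1, 2, 3], dtype=np.int16),
    np.array([4], dtype=np.int16),
]
audios, n_frames = pad_batch(clips)
# audios.shape => (2, 3); audios could then be handed to the proposed ds.stt_batch(audios, fs)
```

Returning the lengths alongside the padded array is a design choice of this sketch: without them, the native code would have no way to tell trailing silence from padding.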

(Caveat: we're currently running 0.2.0-alpha.7, but my findings seem to be consistent with everyone else's.)

I was just about to put through a feature request rather similar to this.

In our work we've noticed an inference rate of about 0.3 s for every 1 s of audio uploaded, but we can also see that both the CPU and GPU are underutilised. With CPU-only inference on my 4-core (8-thread) laptop, CPU usage hovers around 25% during inference, and inference runs slower than real time.

With a GPU we achieve the 0.3 s of inference per 1 s of audio previously mentioned by @elpimous, but even there nvidia-smi shows that the GPU is pretty underutilised. At the moment, for our 300 hours of audio, GPU inference still takes about 5-6 days to run completely, which is disappointingly slow.

We can spin up more instances, or more expensive ones, but if we could achieve a significant enough speedup by batching, this would be a big win for evaluating and optimising new models.

All of this is to say, thanks @nicolaspanel for looking into this. I'll also be keen to hear how much of an impact this makes on local inference times.
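
To put a number on the underutilisation rather than eyeballing nvidia-smi, its query mode can be polled from a script. This is a rough sketch, assuming nvidia-smi is on the PATH and reports a numeric value for `utilization.gpu` (some boards report `[N/A]` instead, which this sketch does not handle); both function names are hypothetical.

```python
import subprocess


def parse_utilization(output):
    """Parse nvidia-smi CSV output: one integer percentage per line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def query_gpu_utilization():
    """Ask nvidia-smi for the current utilisation of every GPU, in percent."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_utilization(output)


# Example: sample the GPUs once while inference is running.
# print(query_gpu_utilization())
```

Sampling this in a loop during an inference run gives an average utilisation that can be compared before and after any batching change.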

It might help if you could explain your use case.

I'm just transcribing audio to evaluate our model performance. We want evaluations to be fast so we can try out lots of model parameters.

I could sample instead, but since it's the Christmas break I didn't mind leaving a long-running job.

It seems to me that batching jobs would also help speed up our current transcription API, which can take audio files over an hour long. It also sits on an AWS EC2 instance at the moment, so faster inference means we could possibly reduce costs.

For evaluating model performance I'd strongly encourage you to use evaluate.py rather than the clients, as it's optimized for throughput rather than latency. It takes the same arguments as DeepSpeech.py but only does evaluation, so it'll only look at test_files, test_batch_size, etc. You'll need to point it at a checkpoint (with --checkpoint_dir) rather than at a frozen model.
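
As a rough illustration of that advice, the sketch below assembles an evaluate.py invocation from the three flags named above. The file paths and the batch size are placeholders, and `build_evaluate_command` is a hypothetical helper, not part of DeepSpeech.

```python
import sys


def build_evaluate_command(test_files, checkpoint_dir, test_batch_size):
    """Build the argument list for running evaluate.py on a test CSV."""
    return [
        sys.executable, "evaluate.py",
        "--test_files", test_files,
        "--test_batch_size", str(test_batch_size),
        "--checkpoint_dir", checkpoint_dir,
    ]


# Placeholder values: adjust to your own data and checkpoint location.
cmd = build_evaluate_command("data/test.csv", "checkpoints/", 32)
print(" ".join(cmd))
# To actually run it from the DeepSpeech source directory:
# import subprocess; subprocess.run(cmd, check=True)
```

A larger test_batch_size generally trades memory for throughput, so the value that works best will depend on the GPU.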

@reuben @kdavis-mozilla Since everyone here seems interested in such a feature, maybe we could include it in an upcoming release (https://github.com/mozilla/DeepSpeech/projects). What do you think?

As I said, I can contribute if needed.

Chiming in again: I didn't have access to evaluate.py since we were on an old 0.2.0 alpha release. But having updated to 0.4.1, I've been very pleased by evaluate.py's performance. 5 hours to transcribe 300 hours of audio on one GPU machine, bloody awesome.

So consider me happy, where this is concerned.
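
To put those figures side by side with the earlier client-based rate, here is a small back-of-the-envelope calculation. It uses only numbers quoted in this thread (300 hours of audio, about 0.3 s of inference per second of audio with the client, 5 hours with evaluate.py) and assumes the two runs are otherwise comparable.

```python
audio_hours = 300.0
client_rtf = 0.3        # about 0.3 s of inference per 1 s of audio, as reported for the client
evaluate_hours = 5.0    # wall-clock time reported for evaluate.py

client_hours = audio_hours * client_rtf       # time the client rate would imply
evaluate_rtf = evaluate_hours / audio_hours   # real-time factor of the evaluate.py run
speedup = client_hours / evaluate_hours       # evaluate.py versus the client rate

print(f"client estimate: {client_hours:.0f} h")
print(f"evaluate.py real-time factor: {evaluate_rtf:.4f}")
print(f"speedup: {speedup:.0f}x")
```

So the batched run is roughly 18 times faster than the 0.3 rate would predict, and the gap to the 5-6 days actually observed with the client is larger still.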

@nicolaspanel Are you working on this feature? I am also interested in such a feature.

@reuben Has someone started working on that feature?

@nullbyte91 I'm trying to run audio files one by one through a Python script with the DeepSpeech 0.6.1 model.
Could you please help me out?
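
A minimal sketch of such a one-by-one script, reusing the WAV-reading pattern from client.py earlier in this thread. To stay independent of any particular release, the speech-to-text call is passed in as a callable; the exact `Model` constructor and `stt` arguments differ between DeepSpeech versions, so check the client.py shipped with yours. Both helper names are hypothetical.

```python
import wave

import numpy as np


def read_wav_int16(path):
    """Return (samples, sample_rate) for a 16-bit mono WAV file."""
    with wave.open(path, "rb") as fin:
        if fin.getsampwidth() != 2 or fin.getnchannels() != 1:
            raise ValueError(f"{path}: expected 16-bit mono audio")
        rate = fin.getframerate()
        samples = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    return samples, rate


def transcribe_files(paths, stt):
    """Call stt(samples) on each file in turn and return {path: transcript}."""
    results = {}
    for path in paths:
        samples, _rate = read_wav_int16(path)
        results[path] = stt(samples)
    return results


# Usage, assuming ds is an already loaded deepspeech Model:
# transcripts = transcribe_files(["a.wav", "b.wav"], ds.stt)
# For releases whose stt also takes the sample rate, wrap it instead:
# transcripts = transcribe_files(["a.wav"], lambda audio: ds.stt(audio, 16000))
```

The model is loaded once and reused for every file, which avoids paying the model load time per file.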

It is also possible to load a frozen graph instead of a checkpoint, like this (posting it in case someone else needs it). A model_path flag was added to FLAGS.

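# Assumes tfv1 is `import tensorflow.compat.v1 as tfv1`, and that FLAGS and
# session are already set up by the surrounding script.
if FLAGS.model_path:
    # Read the frozen GraphDef from disk.
    with tfv1.gfile.GFile(FLAGS.model_path, 'rb') as fin:
        graph_def = tfv1.GraphDef()
        graph_def.ParseFromString(fin.read())

    # Import the frozen tensors that correspond to the trainable variables.
    var_names = [v.name for v in tfv1.trainable_variables()]
    var_tensors = tfv1.import_graph_def(graph_def, return_elements=var_names)

    # build a { var_name: var_tensor } dict
    var_tensors = dict(zip(var_names, var_tensors))

    training_graph = tfv1.get_default_graph()

    # Copy each frozen value into the matching variable of the live graph.
    assign_ops = []
    for name, restored_tensor in var_tensors.items():
        training_tensor = training_graph.get_tensor_by_name(name)
        assign_ops.append(tfv1.assign(training_tensor, restored_tensor))

    init_from_frozen_model_op = tfv1.group(*assign_ops)
    session.run(init_from_frozen_model_op)
else:
    # Default path: restore the weights from a checkpoint.
    load_graph_for_evaluation(session)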