Bert: Reduce prediction time for question answering

Created on 28 Feb 2019 · 9 Comments · Source: google-research/bert

Hi,

I am running BERT on a machine with a GPU (Tesla K80, 12 GB). For question answering, prediction for a single question takes more than 5 seconds. Can we reduce it to below 1 second?

Do we need to configure anything to make this possible?

Thank you

shivamani-ans

👍1

Most helpful comment

That's how the Estimator works. I managed to export the model as a .pb file but haven't had time to test it yet. I used:
```
import tensorflow as tf  # TensorFlow 1.x

# max_seq_length, estimator, predict_input_fn and init_checkpoint are assumed
# to be defined by the surrounding script (e.g. run_squad.py).
def serving_input_fn():
    # Placeholders for the features the exported model expects at serving time.
    unique_ids = tf.placeholder(tf.int32, [None], name='unique_ids')
    input_ids = tf.placeholder(tf.int32, [None, max_seq_length], name='input_ids')
    input_mask = tf.placeholder(tf.int32, [None, max_seq_length], name='input_mask')
    segment_ids = tf.placeholder(tf.int32, [None, max_seq_length], name='segment_ids')
    input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
        'unique_ids': unique_ids,
        'input_ids': input_ids,
        'input_mask': input_mask,
        'segment_ids': segment_ids,
    })()
    return input_fn

save_hook = tf.train.CheckpointSaverHook("google_bertbert_large", save_secs=1)
# Note: predict() returns a lazy generator; nothing runs until it is iterated.
estimator.predict(input_fn=predict_input_fn, hooks=[save_hook])
estimator._export_to_tpu = False

# Now you will get graph.pbtxt, which is used in the SavedModel, and then:

estimator.export_savedmodel(export_dir_base="export", serving_input_receiver_fn=serving_input_fn, checkpoint_path=init_checkpoint)
```

anasuna on 1 Mar 2019

👍2

All 9 comments

Is the problem that the model is loaded again from scratch for every prediction?

anasuna on 28 Feb 2019

Yes

For example, I have 5 questions coming from parallel requests. For every request the model gets loaded again, and the response takes more than 5 seconds.

shivamani-ans on 28 Feb 2019

Check this: https://github.com/google-research/bert/issues/146

anasuna on 1 Mar 2019

Thank you for your response.

I ran the above code by placing it in run_squad.py, and saved_model.pb was generated. Can we use that .pb file for further predictions, and if so, how should I use it?

Can you please let us know how to proceed?

shivamani-ans on 4 Mar 2019

I found the exact cause, as below:

The TensorFlow graph is recreated and the checkpoint is reloaded EVERY time you want to use a trained model to make a prediction on new data.

shivamani-ans on 5 Mar 2019
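
One common way to avoid the reload described above is to export the model once and keep a single loaded copy alive in the serving process, so each request only runs inference. Below is a minimal sketch of that idea, not tested against BERT itself. It assumes TensorFlow 1.x, where `tf.contrib.predictor.from_saved_model` is available; `load_bert_predictor` and `LoadOnceModel` are hypothetical names introduced here for illustration and are not part of the BERT repository.

```python
import threading


def load_bert_predictor(export_dir):
    """Load an exported SavedModel as a callable (assumes TensorFlow 1.x)."""
    # Imported lazily so the caching logic below can be used without TensorFlow.
    from tensorflow.contrib import predictor
    return predictor.from_saved_model(export_dir)


class LoadOnceModel:
    """Builds the model on first use and reuses it for every later call."""

    def __init__(self, loader):
        self._loader = loader          # zero-argument function returning a callable
        self._model = None
        self._lock = threading.Lock()  # guards the first load under parallel requests

    def __call__(self, features):
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so only one thread performs the load.
                if self._model is None:
                    self._model = self._loader()
        return self._model(features)


# Hypothetical usage; the export path is an assumption:
# model = LoadOnceModel(lambda: load_bert_predictor("export/<timestamp>"))
# outputs = model({"unique_ids": ..., "input_ids": ...,
#                  "input_mask": ..., "segment_ids": ...})
```

`export_savedmodel` writes into a timestamped subdirectory of `export_dir_base`, so the path given to the loader would be that subdirectory. The input keys should match the names declared in `serving_input_fn`.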

Hi,

@shivamani-ans @anasuna How did you test the QA model using your own question, without using the dev-test file?
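
For a single custom question, one possible approach is to build the four input features by hand instead of reading them from a SQuAD-format file. The sketch below assumes the question and context have already been converted to WordPiece token ids (for example with the tokenizer in the repository's tokenization.py). `build_features` is a hypothetical helper, and the default ids for `[CLS]`, `[SEP]` and padding are assumptions that should be checked against the vocab file in use. Unlike run_squad.py, it simply truncates a long context rather than using a sliding window.

```python
def build_features(question_ids, context_ids, max_seq_length,
                   cls_id=101, sep_id=102, pad_id=0):
    """Pack one question/context pair as [CLS] question [SEP] context [SEP]."""
    # Reserve three positions for [CLS] and the two [SEP] tokens.
    max_context = max_seq_length - len(question_ids) - 3
    if max_context < 1:
        raise ValueError("max_seq_length is too small for this question")
    context_ids = list(context_ids)[:max_context]

    input_ids = [cls_id] + list(question_ids) + [sep_id] + context_ids + [sep_id]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    input_mask = [1] * len(input_ids)  # 1 marks real tokens, 0 marks padding

    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    input_mask += [0] * padding
    segment_ids += [0] * padding

    # Batch of size one, keyed like the placeholders in serving_input_fn.
    return {
        "unique_ids": [0],
        "input_ids": [input_ids],
        "input_mask": [input_mask],
        "segment_ids": [segment_ids],
    }
```

The returned dictionary could then be passed to a loaded predictor. Mapping the predicted start and end positions back to answer text still requires the original tokens, which this sketch does not cover.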