BERT: inference time on CPU takes so long

Created on 11 Feb 2019 · 7 Comments · Source: google-research/bert

I am fine-tuning a classification model using BERT; however, inference on CPU is very slow. It takes nearly 15 seconds for one call (the 15s is only for prediction, not for loading the model). Below is the code for the inference:

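import datetime

print("time 10: ", datetime.datetime.now())
# estimator.predict returns a generator, so this call returns almost immediately.
result = estimator.predict(input_fn=predict_input_fn)
print("time 11: ", datetime.datetime.now())
predicts = []
i = 0
# The model actually runs when the first prediction is pulled from the generator.
for prediction in result:
    print("time 11-1: ", datetime.datetime.now())
    probabilities = [p for p in prediction["probabilities"]]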

And here is the timing output:

time 10: 2019-02-11 13:06:25.594185
time 11: 2019-02-11 13:06:25.594229
time 11-1: 2019-02-11 13:06:39.175300
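
To make the numbers above easier to read, here is a small sketch that computes the elapsed time between the three timestamps. It only uses the values printed in the output; the variable names are made up for illustration. It suggests that the estimator.predict call itself returns almost immediately and that nearly all of the time passes while waiting for the first prediction.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the timing output in the question.
t10 = datetime.strptime("2019-02-11 13:06:25.594185", FMT)
t11 = datetime.strptime("2019-02-11 13:06:25.594229", FMT)
t11_1 = datetime.strptime("2019-02-11 13:06:39.175300", FMT)

# Time spent inside the estimator.predict(...) call itself.
predict_call_s = (t11 - t10).total_seconds()
# Time spent waiting for the first item of the returned generator.
first_result_s = (t11_1 - t11).total_seconds()

print(f"estimator.predict() call:     {predict_call_s:.6f} s")
print(f"waiting for first prediction: {first_result_s:.6f} s")
```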

How can we serve faster?

Thank you very much.


ntson2002

Most helpful comment

The idea I found is to export it as a static graph and use TensorFlow Serving to do the predictions. But I am stuck on how to export it. I get "Couldn't find trained model at model" when using estimator.export_savedmodel.

anasuna on 28 Feb 2019

👍2

All 7 comments

I think TensorFlow estimators reload the graph during predictions (estimator.predict(input_fn=predict_input_fn)). Therefore, you may still be including model load time in your prediction time. Please confirm this.
How big is your passage? If it's too big, you can chunk the passage and pass the most relevant chunk as an input to the BERT model.

kaushalshetty on 13 Feb 2019

👍2
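
The chunking idea in the comment above can be sketched roughly as follows. This is only an illustration: the function name chunk_tokens and the max_len and stride values are assumptions, not something from the thread, and how to pick the "most relevant" chunk is left open because the comment does not say.

```python
def chunk_tokens(tokens, max_len=128, stride=64):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len and stride are illustrative defaults, not values from the thread.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the passage.
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


# Example with a dummy "passage" of 300 integer tokens.
passage = list(range(300))
chunks = chunk_tokens(passage)
print(len(chunks), [len(c) for c in chunks])
```

Note that BERT also needs room for its special tokens ([CLS] and [SEP]), so in practice the window would have to be slightly shorter than the model's maximum sequence length.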

Thank you for your comment.

I think TensorFlow estimators reload the graph during predictions (estimator.predict(input_fn=predict_input_fn)). Therefore, you may still be including model load time in your prediction time. Please confirm this.

I think that is the reason it takes so long. Do you have any idea how to avoid this?

How big is your passage? If it's too big, you can chunk the passage and pass the most relevant chunk as an input to the BERT model.

My input length is 128. I think that is not too big.

ntson2002 on 13 Feb 2019
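
One possible way to avoid paying the setup cost on every call is to keep a single prediction generator alive and feed it new inputs through a queue. The sketch below shows only the general pattern in plain Python: PersistentPredictor and fake_predict are hypothetical names, and fake_predict is a stand-in that merely counts how often its "setup" runs. With a real Estimator, the input_fn would presumably have to wrap such a generator in a dataset (for example via tf.data.Dataset.from_generator); that part is an assumption and is not shown here.

```python
import queue


class PersistentPredictor:
    """Keeps one long-lived prediction generator so setup is paid only once."""

    def __init__(self, predict_fn):
        self._inputs = queue.Queue()
        # predict_fn is expected to return a lazy generator of predictions.
        self._outputs = predict_fn(self._input_stream)

    def _input_stream(self):
        # Endless stream of examples; blocks until predict() supplies one.
        while True:
            yield self._inputs.get()

    def predict(self, example):
        self._inputs.put(example)
        return next(self._outputs)

    def close(self):
        self._outputs.close()


setup_calls = 0


def fake_predict(input_fn):
    """Stand-in for an expensive predict call (hypothetical, not a TF API)."""
    global setup_calls
    setup_calls += 1  # simulates the one-time graph/model loading
    for example in input_fn():
        yield {"probabilities": [0.5, 0.5], "num_tokens": len(example.split())}


predictor = PersistentPredictor(fake_predict)
first = predictor.predict("a short sentence")
second = predictor.predict("another one")
predictor.close()
print("setup ran", setup_calls, "time(s) for 2 predictions")
```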

You can try the PyTorch implementation of BERT from this repo: https://github.com/huggingface/pytorch-pretrained-BERT. I am not very comfortable with TensorFlow.

kaushalshetty on 14 Feb 2019

👍1

Reducing the input length and the number of layers can make it faster than before.

susoooon on 15 Feb 2019

👍1
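
As a rough illustration of why the input length and the number of layers matter, the sketch below estimates the multiply-accumulate operations of the encoder's matrix multiplications. It is a back-of-the-envelope model, not a benchmark: the function name encoder_macs is made up, the defaults assume a BERT-Base-sized model (12 layers, hidden size 768, feed-forward size 4 × hidden), which the thread does not confirm, and embeddings, layer norm, and softmax are ignored.

```python
def encoder_macs(seq_len, num_layers=12, hidden=768):
    """Approximate multiply-accumulates for the encoder's matrix products.

    Assumes a feed-forward size of 4 * hidden; ignores embeddings,
    layer normalization, softmax, and the classification head.
    """
    projections = 4 * seq_len * hidden * hidden    # Q, K, V and output projections
    attention = 2 * seq_len * seq_len * hidden     # QK^T scores and weighted sum of V
    feed_forward = 8 * seq_len * hidden * hidden   # two dense layers (h -> 4h -> h)
    return num_layers * (projections + attention + feed_forward)


baseline = encoder_macs(seq_len=128)
shorter = encoder_macs(seq_len=64)
shallower = encoder_macs(seq_len=128, num_layers=6)

print(f"seq_len 64 vs 128: {shorter / baseline:.3f} of the baseline cost")
print(f"6 layers vs 12:    {shallower / baseline:.3f} of the baseline cost")
```

Under these assumptions the cost scales linearly with the number of layers and slightly faster than linearly with the sequence length, because of the attention term.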

anasuna on 28 Feb 2019

👍2

I have solved this issue by using estimator.export_savedmodel. However, although the inference time is reduced from ~10s to ~1-2s per call, this is still too slow for a product. Do you have any ideas on how to use BERT in production?

ntson2002 on 3 Apr 2019

👍1

Can you elaborate on what you did with estimator.export_savedmodel? I am trying to do the inference but am stuck at "it's loading the graph every time". I am using FLAGS.do_predict in run_classifier.py for inference.