I am fine-tuning a classification model using BERT, but inference on CPU is very slow. It takes nearly 15 seconds for one call (the 15 s is for prediction only, not for loading the model). Below is the inference code:
    print("time 10: ", datetime.datetime.now())
    result = estimator.predict(input_fn=predict_input_fn)
    print("time 11: ", datetime.datetime.now())
    predicts = []
    i = 0
    for prediction in result:
        print("time 11-1: ", datetime.datetime.now())
        probabilities = [p for p in prediction["probabilities"]]
And here is the timing output:
time 10: 2019-02-11 13:06:25.594185
time 11: 2019-02-11 13:06:25.594229
time 11-1: 2019-02-11 13:06:39.175300
How can we serve predictions faster?
Thank you very much.
I think TensorFlow estimators reload the graph on every call to estimator.predict(input_fn=predict_input_fn). Therefore, you may still be including model load time in your prediction time. Please confirm this.
How big is your passage? If it's too big, you can chunk the passage and pass the most relevant chunk as an input to the BERT model.
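The timestamps in the question fit this explanation. estimator.predict returns a generator, so the step from "time 10" to "time 11" only creates that generator, and the real work starts when the first prediction is pulled from it. A small sketch that computes both gaps from the three logged timestamps (nothing here is measured anew, the values are copied from the log):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log output in the question.
t10 = datetime.strptime("2019-02-11 13:06:25.594185", FMT)
t11 = datetime.strptime("2019-02-11 13:06:25.594229", FMT)
t11_1 = datetime.strptime("2019-02-11 13:06:39.175300", FMT)

# Time spent in the estimator.predict(...) call itself.
call_seconds = (t11 - t10).total_seconds()
# Time until the first prediction comes out of the generator.
first_result_seconds = (t11_1 - t11).total_seconds()

print(f"predict() call:   {call_seconds:.6f} s")
print(f"first prediction: {first_result_seconds:.6f} s")
```

Almost all of the time is spent before the first result arrives, which is where graph construction and checkpoint loading would happen if the estimator does them on each predict call.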
Thank you for your comment.
I think tensorflow estimators reload the graph during predictions (
estimator.predict(input_fn=predict_input_fn)). Therefore, you may still be including model load time during predictions. Do confirm on this.
I think that is the reason it takes so long. Do you have any idea how to avoid this?
How big is your passage? If it's too big, you can chunk the passage and pass the most relevant chunk as
an input to the BERT model.
My input length is 128. I think that is not too big.
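One pattern sometimes used to avoid paying the load cost on every request is to call estimator.predict only once and keep its generator alive, feeding it new inputs through an endless input generator. The sketch below shows only the call pattern in plain Python. PersistentPredictor, FakeEstimator and make_input_fn are hypothetical names introduced for illustration, not part of TensorFlow or BERT. With a real estimator, make_input_fn would need to wrap the stream in something like tf.data.Dataset.from_generator, which is not shown, and any buffering in the input pipeline could break the one-in, one-out assumption.

```python
class PersistentPredictor:
    """Calls estimator.predict once and reuses its generator for later requests."""

    def __init__(self, estimator, make_input_fn):
        self._estimator = estimator
        self._make_input_fn = make_input_fn  # hypothetical adapter (assumption)
        self._next_features = None
        self._predictions = None

    def _feature_stream(self):
        # Endless generator: yields whatever the latest request stored.
        while True:
            yield self._next_features

    def predict(self, features):
        self._next_features = features
        if self._predictions is None:
            # The expensive setup happens on the first next() below, once.
            input_fn = self._make_input_fn(self._feature_stream)
            self._predictions = self._estimator.predict(input_fn=input_fn)
        return next(self._predictions)


class FakeEstimator:
    """Stand-in for a real estimator, only to demonstrate the pattern."""

    def __init__(self):
        self.load_count = 0

    def predict(self, input_fn):
        self.load_count += 1  # simulates the graph/checkpoint load
        for features in input_fn():
            yield {"probabilities": [0.5, 0.5], "input": features}


estimator = FakeEstimator()
predictor = PersistentPredictor(estimator, make_input_fn=lambda stream: stream)
first = predictor.predict("sentence one")
second = predictor.predict("sentence two")
print("loads:", estimator.load_count)
```

The simulated load runs once even though two predictions were requested. Whether this is preferable to exporting the model depends on the deployment, since the generator lives inside a single process.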
You can try the PyTorch implementation of BERT from this repo: https://github.com/huggingface/pytorch-pretrained-BERT. I am not very comfortable with TensorFlow.
Reducing the input length and the number of layers can make inference faster than before.
The idea I found is to export the model as a static graph and use TensorFlow Serving to do the predictions, but I am stuck on how to export it. I get "Couldn't find trained model at model" when using estimator.export_savedmodel.
I have solved this issue by using estimator.export_savedmodel. However, although the inference time dropped from ~10 s to ~1-2 s per call, that is still too slow for a product. Do you have any ideas on how to use BERT in production?
Can you elaborate on what you did with estimator.export_savedmodel? I am trying to do inference and am stuck because it loads the graph every time. I am using FLAGS.do_predict in run_classifier.py for inference.
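The thread does not show the export code that was used, so the following is only a minimal sketch of one way it could look. It assumes TensorFlow 1.x (tf.contrib does not exist in 2.x), an estimator whose model_dir already contains a trained checkpoint (the "Couldn't find trained model" error usually indicates that none was found there), and that the model_fn reads the four features used in run_classifier.py. The names feature_shapes and export_and_load are hypothetical helpers introduced here.

```python
def feature_shapes(max_seq_length):
    """Input shapes the model_fn is assumed to read (batch dimension is None)."""
    return {
        "input_ids": [None, max_seq_length],
        "input_mask": [None, max_seq_length],
        "segment_ids": [None, max_seq_length],
        "label_ids": [None],
    }


def export_and_load(estimator, export_dir_base, max_seq_length=128):
    """Export the estimator once, then return an in-memory predictor."""
    # Imported here so the helper above stays usable without TensorFlow.
    import tensorflow as tf  # TensorFlow 1.x assumed
    from tensorflow.contrib import predictor

    def serving_input_receiver_fn():
        # One int32 placeholder per feature, fed directly at serving time.
        features = {
            name: tf.placeholder(tf.int32, shape, name=name)
            for name, shape in feature_shapes(max_seq_length).items()
        }
        return tf.estimator.export.ServingInputReceiver(features, features)

    export_dir = estimator.export_savedmodel(export_dir_base, serving_input_receiver_fn)
    if isinstance(export_dir, bytes):
        export_dir = export_dir.decode("utf-8")
    # The graph is loaded here once; the returned object can be called repeatedly.
    return predictor.from_saved_model(export_dir)
```

The returned predictor would be called with a dict of arrays keyed by the same feature names, and the exported directory could also be served by TensorFlow Serving instead of being loaded in process.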