Describe the bug
Inference with onnxruntime is slower than with the PyTorch model.
System information
To Reproduce
# Imports assumed from usage; the original post omitted them.
import time
import onnxruntime
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer
def preprocess(tokenizer, text):
    max_seq_length = 128
    tokens = tokenizer.tokenize(text)
    tokens.insert(0, "[CLS]")
    tokens.append("[SEP]")
    segment_ids = []
    for i in range(len(tokens)):
        segment_ids.append(0)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    # Pad all three sequences with zeros up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    input_ids = torch.tensor([input_ids], dtype=torch.long)
    input_mask = torch.tensor([input_mask], dtype=torch.long)
    segment_ids = torch.tensor([segment_ids], dtype=torch.long)
    return input_ids, input_mask, segment_ids
"""
Inference on pretrained pytorch model
"""
def inference_pytorch(model, input_ids, input_mask, segment_ids):
    with torch.no_grad():
        outputs = model(input_ids, input_mask, segment_ids)
    logits = outputs[0]
    logits = F.softmax(logits, dim=1)
    return logits
"""
This function stores pretrained bert model
into onnx format
"""
def convert_bert_to_onnx(text):
    model_dir = "/home/ramesh/github/pytorch-pretrained-model-to-onnx"
    # config and model must be loaded here; otherwise `model` is undefined in the export call
    config = BertConfig.from_pretrained(model_dir)
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir, config=config)
    input_ids, input_mask, segment_ids = preprocess(tokenizer, text)
    torch.onnx.export(model, (input_ids, input_mask, segment_ids), "bert.onnx",
                      input_names=["input_ids", "input_mask", "segment_ids"],
                      output_names=["output"])
    print("model converted to onnx format successfully")
def inference(model_name, examples):
    onnx_inference = []
    pytorch_inference = []
    model_dir = "/home/ramesh/github/pytorch-pretrained-model-to-onnx/models"
    # onnx session
    ort_session = onnxruntime.InferenceSession(model_name)
    # pytorch pretrained model and tokenizer
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    config = BertConfig.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir, config=config)
    # model.to("cpu")
    for example in examples:
        # ONNX inference
        input_ids, input_mask, segment_ids = preprocess(tokenizer, example)
        ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(input_ids),
                      ort_session.get_inputs()[1].name: to_numpy(input_mask),
                      ort_session.get_inputs()[2].name: to_numpy(segment_ids)}
        ort_outs = ort_session.run(["output"], ort_inputs)
        torch_onnx_output = torch.tensor(ort_outs[0], dtype=torch.float32)
        onnx_logits = F.softmax(torch_onnx_output, dim=1)
        logits_label = torch.argmax(onnx_logits, dim=1)
        onnx_inference.append(logits_label[0])
        # Pretrained BERT PyTorch model
        torch_out = inference_pytorch(model, input_ids, input_mask, segment_ids)
        logits_label = torch.argmax(torch_out, dim=1)
        label = logits_label.detach().cpu().numpy()
        pytorch_inference.append(label[0])
        # compare ONNX Runtime and PyTorch results
        # np.testing.assert_allclose(to_numpy(torch_out), onnx_logits, rtol=1e-03, atol=1e-05)
        # print("Exported model has been tested with ONNXRuntime, and the result looks good!")
    return onnx_inference, pytorch_inference
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

if __name__ == '__main__':
    text = "tick tock tick"
    convert_bert_to_onnx(text)
    examples, labels = load_data("dev.tsv")
    start_time = time.time()
    top_n = 50
    # returns results from pytorch pretrained model and onnx
    onnx_labels, pytorch_labels = inference("bert.onnx", examples[:top_n])
    print("\n ************ \n")
    print("total time ", time.time() - start_time)
    print("accuracy score of pytorch model", accuracy_score(labels[:top_n], pytorch_labels[:top_n]))
    print("accuracy score of onnx model", accuracy_score(labels[:top_n], onnx_labels[:top_n]))
Current behavior
Inference time of ONNX on 50 examples: 47 seconds
Inference time of PyTorch on 50 examples: 8 seconds
Does anyone have an idea why ONNX inference takes longer? Any leads would be appreciated :)
Thank you.
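The script as posted prints a single total that covers session creation, model loading, and both backends, so it may help to time each backend on its own. The following is a minimal sketch, not the author's method: `time_inference` and its `warmup` parameter are hypothetical names introduced here for illustration.

```python
import time


def time_inference(run_once, inputs, warmup=1):
    """Return wall-clock seconds spent calling run_once on every input.

    Hypothetical helper: run_once is any callable taking one example.
    """
    # Warm-up calls run first and are excluded from the measurement,
    # so one-time setup cost does not inflate the reported time.
    for item in inputs[:warmup]:
        run_once(item)
    start = time.perf_counter()
    for item in inputs:
        run_once(item)
    return time.perf_counter() - start
```

Under these assumptions, each backend would be wrapped in its own callable (one that calls `ort_session.run`, one that calls `inference_pytorch`) and passed to `time_inference` with the same list of examples.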
@rameshjesswani, could you also try exporting the model with dynamic length, as I mentioned in the other thread:
dynamic_axes = {
    'input_id': {0: 'batch', 1: 'max_seq_len'},
    'sequence_id': {0: 'batch', 1: 'max_seq_len'},
    'input_mask': {0: 'batch', 1: 'max_seq_len'},
    'qp_scores': {0: 'batch'},
}
torch.onnx.export(model, (input_ids, segment_ids, input_mask), config["onnx_model"], verbose=False,
                  opset_version=11, input_names=['input_id', 'sequence_id', 'input_mask'],
                  output_names=['qp_scores'],
                  do_constant_folding=True, dynamic_axes=dynamic_axes)
Could you also share your model with us? We can look into why the optimization doesn't take effect.
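The keys of `dynamic_axes` need to match the names given in `input_names` and `output_names`, so one way to avoid a mismatch is to build the dictionary from those lists. This is a small sketch under that assumption; `make_dynamic_axes` is a hypothetical helper, and the axis labels `batch` and `max_seq_len` are taken from the snippet above.

```python
def make_dynamic_axes(input_names, output_names):
    """Build a dynamic_axes dict whose keys match the exported names.

    Hypothetical helper: inputs get dynamic batch and sequence axes,
    outputs get a dynamic batch axis only.
    """
    axes = {name: {0: "batch", 1: "max_seq_len"} for name in input_names}
    axes.update({name: {0: "batch"} for name in output_names})
    return axes


input_names = ["input_ids", "input_mask", "segment_ids"]
output_names = ["output"]
dynamic_axes = make_dynamic_axes(input_names, output_names)
```

The resulting `dynamic_axes` would then be passed to `torch.onnx.export` together with the same `input_names` and `output_names`.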
Thanks @yufenglee. I tried exporting with dynamic length. Now ONNX inference is faster than PyTorch.
Here is the comparison after exporting with dynamic length:
Inference time of ONNX on 872 examples: 141.43 seconds
Inference time of PyTorch on 872 examples: 176.8 seconds
One more question: do you expect a larger improvement in ONNX inference time compared to PyTorch?
many thanks :)
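Once the sequence axis is dynamic, the exported model no longer requires every input to be padded to 128 tokens. The thread does not report trying this, so the following is only a sketch of one possible follow-up: padding each batch to its longest sequence. `pad_to_longest` is a hypothetical helper that works on plain lists of token ids.

```python
def pad_to_longest(batch_token_ids, pad_id=0):
    """Pad every sequence to the longest one in the batch.

    Hypothetical helper: returns (input_ids, input_mask, segment_ids)
    as lists of lists, mirroring the outputs of preprocess().
    """
    longest = max(len(ids) for ids in batch_token_ids)
    input_ids, input_mask, segment_ids = [], [], []
    for ids in batch_token_ids:
        pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * pad)
        # 1 marks real tokens, 0 marks padding
        input_mask.append([1] * len(ids) + [0] * pad)
        # single-sentence input, so every segment id is 0
        segment_ids.append([0] * longest)
    return input_ids, input_mask, segment_ids
```

Whether shorter inputs actually reduce inference time would have to be measured on the model and hardware in question.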
You can get better performance with graph optimization by replacing this code:
# onnx session
ort_session = onnxruntime.InferenceSession(model_name)
with
so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = onnxruntime.InferenceSession(model_name, so)
To verify the BERT optimization, add a session option that writes out the optimized model:
so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "optimized.onnx"
ort_session = onnxruntime.InferenceSession(model_name, so)
After the session is created, optimized.onnx is written to disk. You can open it in Netron to view the graph. The optimized graph for CPU looks like the following:
input_ids  segment_ids  input_mask
    |           |           |
   Cast        Cast        Cast
     \          |          /
     EmbedLayerNormalization
         |            |
     Attention        |     --- start of one layer
         |            |
       MatMul        /
         |          /
        Add        /
          \       /
             Add
              |
     LayerNormalization
         |            |
       MatMul         |
         |            |
      BiasGelu        |
         |            |
       MatMul        /
         |          /
        Add        /
          \       /
             Add
              |
     LayerNormalization     --- end of one layer
         |            |
     Attention        |     --- start of next layer (total 12 layers for BERT base model)
         |            |
        ...
If you run onnxruntime on GPU, the optimized graph is slightly different.
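Besides viewing the graph in Netron, the check can be scripted by counting operator types in optimized.onnx. The sketch below assumes the `onnx` Python package is installed and that the fused operator names match the CPU graph shown above; `count_op_types`, `missing_fused_ops`, and `check_optimized_model` are hypothetical helpers, not part of onnxruntime.

```python
from collections import Counter

# Fused operators that appear in the CPU graph described above.
EXPECTED_FUSED_OPS = (
    "EmbedLayerNormalization",
    "Attention",
    "LayerNormalization",
    "BiasGelu",
)


def count_op_types(op_types):
    """Count how often each operator type occurs."""
    return Counter(op_types)


def missing_fused_ops(op_counts, expected=EXPECTED_FUSED_OPS):
    """Return the expected fused operators that never occur."""
    return [op for op in expected if op_counts.get(op, 0) == 0]


def check_optimized_model(path):
    """Load an ONNX file and list the expected fused operators it lacks."""
    import onnx  # imported lazily; assumes the onnx package is available

    model = onnx.load(path)
    counts = count_op_types(node.op_type for node in model.graph.node)
    return missing_fused_ops(counts)
```

An empty list from `check_optimized_model("optimized.onnx")` would suggest the fusions were applied; for BERT base, the `Attention` count would be expected to equal the 12 layers mentioned above.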
I am using the same parameters that you specified. Here is how I export the PyTorch model to ONNX:
def convert_bert_to_onnx(text, model_dir, task_name, onnx_model_name):
    config = BertConfig.from_pretrained(model_dir)
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir, config=config)
    model.to("cpu")
    input_ids, input_mask, segment_ids = preprocess(tokenizer, text)
    # Keys must match input_names/output_names; values are symbolic axis names
    dynamic_axes = {
        'input_ids': {0: 'batch', 1: 'max_seq_len'},
        'input_mask': {0: 'batch', 1: 'max_seq_len'},
        'segment_ids': {0: 'batch', 1: 'max_seq_len'},
        'output': {0: 'batch'},
    }
    torch.onnx.export(model, (input_ids, input_mask, segment_ids), onnx_model_name,
                      input_names=["input_ids", "input_mask", "segment_ids"],
                      output_names=["output"], opset_version=10, do_constant_folding=True,
                      dynamic_axes=dynamic_axes)
    print("SST model converted to onnx format successfully")
Here is how I am performing inference:
# onnx session
so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
so.intra_op_num_threads = 1
ort_session = onnxruntime.InferenceSession(onnx_model_name, so)
for example in examples:
    input_ids, input_mask, segment_ids = preprocess(tokenizer, example)
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(input_ids),
                  ort_session.get_inputs()[1].name: to_numpy(input_mask),
                  ort_session.get_inputs()[2].name: to_numpy(segment_ids)}
    ort_outs = ort_session.run(["output"], ort_inputs)
    torch_onnx_output = torch.tensor(ort_outs[0], dtype=torch.float32)
    onnx_logits = F.softmax(torch_onnx_output, dim=1)
    logits_label = torch.argmax(onnx_logits, dim=1)
    label = logits_label.detach().cpu().numpy()
    onnx_inference.append(label[0])
Thanks :)
@rameshjesswani, have you checked whether the graph of the optimized model is the same as the one I described above?
On my machine, the inference time of onnxruntime is 47% of that of PyTorch 1.3 when running BERT base (batch_size=8, max_seq_len=128) on the SQuAD dev set. You might see different results on your hardware.
@tianleiwu it's not the same; it looks different. I am using BERT large and PyTorch 1.2. Could you tell me where I can share the model so you can test it on your hardware? Thank you :)
Also, does batch size affect the inference time much in onnxruntime? I am comparing PyTorch and onnxruntime using batch size = 1.

@rameshjesswani, I used PyTorch 1.2 to export a BERT large model for SQuAD, based on transformers. The graph is like the one you posted.
You can add a session option to output the optimized model and open the optimized.onnx file to view the optimized graph:
so.optimized_model_filepath = "optimized.onnx"
The optimized graph is like the following:

On my machine, the inference time of onnxruntime is 73% of that of PyTorch 1.2 when running BERT large (batch_size=1, max_seq_len=128) on the SQuAD dev set. In your experiment, the ratio is 80%, so our results are very close.
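The 80% figure follows from the timings reported earlier in the thread (141.43 seconds for ONNX against 176.8 seconds for PyTorch on 872 examples). A short worked example of that arithmetic, with `runtime_ratio` as a hypothetical helper name:

```python
def runtime_ratio(onnx_seconds, pytorch_seconds):
    """Fraction of the PyTorch time that ONNX Runtime needs."""
    return onnx_seconds / pytorch_seconds


# Timings reported in the thread for 872 examples.
ratio = runtime_ratio(141.43, 176.8)
print(f"ONNX Runtime takes {ratio:.0%} of the PyTorch time")  # 80%
print(f"speedup: {1 / ratio:.2f}x")  # 1.25x
```

The same conversion turns the 47% and 73% ratios quoted for BERT base and BERT large into speedups of roughly 2.1x and 1.4x.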
@tianleiwu thanks. Just to summarize: using BERT base, you observed that the inference time of onnxruntime is 47% of that of PyTorch 1.3, and using BERT large, it is 73% of that of PyTorch.
There is a large difference between the improvement for BERT base and for BERT large relative to PyTorch. Is this the expected behavior?
@rameshjesswani, this is expected since the model sizes are different. Note that the inference time improvement can differ for different models on different hardware.
BERT large is unlikely to run on CPU in production because of the latency. It needs a powerful GPU (like V100 or T4) to achieve reasonable latency (<5 ms) for real-time inference.
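To compare a deployment against a latency target such as the <5 ms mentioned above, per-request latencies can be summarized with a percentile. This is a minimal sketch using the nearest-rank method; the choice of the 95th percentile and the names `latency_percentile` and `meets_budget` are assumptions made for illustration, not something the thread specifies.

```python
import math


def latency_percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest value with at least pct% of samples at or below it
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_budget(samples_ms, budget_ms=5.0, pct=95):
    """True if the chosen percentile is strictly below the latency budget."""
    return latency_percentile(samples_ms, pct) < budget_ms
```

The samples would come from timing individual inference calls at batch size 1 on the target hardware.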
@tianleiwu many thanks. I am closing this issue as I have got the answer :)
Can you share which device you used?