Onnxruntime: Inference time of onnxruntime vs pytorch

Created on 8 Jan 2020 · 12 Comments · Source: microsoft/onnxruntime

Describe the bug
Inference with onnxruntime is slower than with the PyTorch model.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
ONNX Runtime installed from (source or binary): binary
ONNX Runtime version: 1.1.0
Python version: 3.6

To Reproduce

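# Imports are assumed from usage; the original post does not show them.
# load_data (used below) is the author's own helper and is not shown either.
import time

import onnxruntime
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer


def preprocess(tokenizer, text):
    """Tokenize text and zero-pad ids, mask, and segment ids to max_seq_length."""
    max_seq_length = 128
    tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = [0] * len(input_ids)

    # Pad to the fixed length; longer inputs are left as-is (no truncation).
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    # Add a leading batch dimension of size 1.
    input_ids = torch.tensor([input_ids], dtype=torch.long)
    input_mask = torch.tensor([input_mask], dtype=torch.long)
    segment_ids = torch.tensor([segment_ids], dtype=torch.long)

    return input_ids, input_mask, segment_ids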

"""
Inference on pretrained pytorch model
"""

def inference_pytorch(model, input_ids, input_mask, segment_ids):

    with torch.no_grad():
        outputs = model(input_ids, input_mask, segment_ids)

    logits = outputs[0]
    logits = F.softmax(logits, dim=1)
    return logits

"""
This function stores pretrained bert model
into onnx format
"""

def convert_bert_to_onnx(text):

    model_dir = "/home/ramesh/github/pytorch-pretrained-model-to-onnx"
    config = BertConfig.from_pretrained(model_dir)
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    # Load the model to export.
    model = BertForSequenceClassification.from_pretrained(model_dir, config=config)
    input_ids, input_mask, segment_ids = preprocess(tokenizer, text)

    torch.onnx.export(model, (input_ids, input_mask, segment_ids), "bert.onnx",
                      input_names=["input_ids", "input_mask", "segment_ids"],
                      output_names=["output"])

    print("model converted to onnx format successfully")


def inference(model_name, examples):

    onnx_inference = []
    pytorch_inference = []
    model_dir = "/home/ramesh/github/pytorch-pretrained-model-to-onnx/models"
    #onnx session
    ort_session = onnxruntime.InferenceSession(model_name)
    #pytorch pretrained model and tokenizer
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    config = BertConfig.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir, config=config)
    # model.to("cpu")

    for example in examples:
        """
        Onnx inference
        """
        input_ids, input_mask, segment_ids = preprocess(tokenizer, example)
        ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(input_ids),
                        ort_session.get_inputs()[1].name: to_numpy(input_mask),
                        ort_session.get_inputs()[2].name: to_numpy(segment_ids)}
        ort_outs = ort_session.run(["output"], ort_inputs)
        torch_onnx_output = torch.tensor(ort_outs[0], dtype=torch.float32)
        onnx_logits = F.softmax(torch_onnx_output, dim=1)

        logits_label = torch.argmax(onnx_logits, dim=1)
        onnx_inference.append(logits_label[0])

        """
        Pretrained bert pytorch model
        """
        #

        torch_out = inference_pytorch(model, input_ids, input_mask, segment_ids)

        logits_label = torch.argmax(torch_out, dim=1)
        label = logits_label.detach().cpu().numpy()
        pytorch_inference.append(label[0])

        #
        # # compare ONNX Runtime and PyTorch results
        # np.testing.assert_allclose(to_numpy(torch_out), onnx_logits, rtol=1e-03, atol=1e-05)
        #
        # print("Exported model has been tested with ONNXRuntime, and the result looks good!")

    return onnx_inference, pytorch_inference


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

if __name__ == '__main__':

    text = "tick tock tick"
    convert_bert_to_onnx(text)

    examples, labels = load_data("dev.tsv")
    start_time = time.time()
    top_n = 50
    # returns results from pytorch pretrained model and onnx
    onnx_labels, pytorch_labels = inference("bert.onnx", examples[:top_n])
    print("\n ************ \n")

    print("total time ", time.time() - start_time)
    print("accuracy score of pytorch model", accuracy_score(labels[:top_n], pytorch_labels[:top_n]))
    print("accuracy score of onnx model", accuracy_score(labels[:top_n], onnx_labels[:top_n]))

Current behavior

Inference time of Onnx on 50 examples: 47 seconds
Inference time of Pytorch on 50 examples: 8 seconds
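
The script above prints a single combined `total time`, so how the two per-backend figures were measured is not shown. Here is a minimal sketch of timing each backend separately, with a few warm-up calls excluded. The names `time_backend`, `run_fn`, and `warmup` are hypothetical, and the workload below is a stand-in rather than real inference:

```python
import time


def time_backend(run_fn, inputs, warmup=3):
    """Return the total seconds spent calling run_fn on every item in inputs.

    The first `warmup` items are run once beforehand and not timed, so that
    one-time setup costs do not skew the comparison.
    """
    for item in inputs[:warmup]:
        run_fn(item)
    start = time.perf_counter()
    for item in inputs:
        run_fn(item)
    return time.perf_counter() - start


# Stand-in workload; in the script above, run_fn would wrap either
# ort_session.run(...) or inference_pytorch(...).
elapsed = time_backend(lambda x: x * x, list(range(1000)))
print(f"elapsed: {elapsed:.6f} s")
```

Timing the two backends in separate loops like this also keeps session creation and model loading out of the measurement.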

Does anyone have an idea why ONNX inference takes longer? Any leads would be appreciated :)

Thank you.

performance

rameshjes

👍3

All 12 comments

@rameshjesswani, could you also try exporting the model with dynamic length, as I mentioned in the other thread:

    dynamic_axes = {
        'input_id': {0:'batch',1:'max_seq_len'},
        'sequence_id': {0:'batch',1:'max_seq_len'},
        'input_mask': {0:'batch',1:'max_seq_len'},
        'qp_scores': {0:'batch'},
    }
    torch.onnx.export(model, (input_ids, segment_ids, input_mask), config["onnx_model"], verbose=False,
        opset_version=11, input_names=['input_id', 'sequence_id', 'input_mask'],
        output_names=['qp_scores'],
        do_constant_folding=True, dynamic_axes=dynamic_axes)

2803

Could you also share your model with us? We can take a look at why the optimization doesn't take effect.
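
For the dynamic-length export suggested above, the keys of `dynamic_axes` need to match the names given in `input_names` and `output_names`. One way to keep them in sync is to build the dictionary from the same name lists. This is only a sketch: `build_dynamic_axes` is a hypothetical helper, and the axis labels `batch` and `max_seq_len` are just the ones used in the snippet above.

```python
def build_dynamic_axes(input_names, output_names):
    """Mark axis 0 (batch) and axis 1 (sequence length) as dynamic for every
    input, and axis 0 (batch) as dynamic for every output."""
    axes = {name: {0: "batch", 1: "max_seq_len"} for name in input_names}
    axes.update({name: {0: "batch"} for name in output_names})
    return axes


input_names = ["input_ids", "input_mask", "segment_ids"]
output_names = ["output"]
dynamic_axes = build_dynamic_axes(input_names, output_names)
print(dynamic_axes)

# The result would then be passed along with the same name lists:
# torch.onnx.export(model, args, path, input_names=input_names,
#                   output_names=output_names, dynamic_axes=dynamic_axes)
```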

yufenglee on 14 Jan 2020

Thanks @yufenglee. I tried exporting with dynamic length, and ONNX inference is now faster than PyTorch.
Here is the comparison after exporting with dynamic length:

Inference time of Onnx on 872 examples: 141.43 seconds
Inference time of Pytorch on 872 examples: 176.8 seconds
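
As a quick worked check of these two figures (plain arithmetic on the numbers above), ONNX takes about 80% of the PyTorch time, which is roughly 162 ms versus 203 ms per example:

```python
onnx_seconds = 141.43
pytorch_seconds = 176.8
num_examples = 872

# Fraction of the PyTorch time that ONNX needs, and the equivalent speedup.
ratio = onnx_seconds / pytorch_seconds
speedup = pytorch_seconds / onnx_seconds

# Average latency per example in milliseconds.
onnx_ms = 1000 * onnx_seconds / num_examples
pytorch_ms = 1000 * pytorch_seconds / num_examples

print(f"ratio: {ratio:.2%}, speedup: {speedup:.2f}x")
print(f"per example: ONNX {onnx_ms:.1f} ms, PyTorch {pytorch_ms:.1f} ms")
```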

Just another question: do you expect further improvement in ONNX inference time compared to PyTorch?

many thanks :)

rameshjes on 14 Jan 2020

You can get better performance with GraphOptimization by replacing code:

    #onnx session
    ort_session = onnxruntime.InferenceSession(model_name)

with

so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = onnxruntime.InferenceSession(model_name, so)

yufenglee on 14 Jan 2020

To verify the BERT optimization, add a session option that writes out the optimized model, for example:

so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "optimized.onnx"
ort_session = onnxruntime.InferenceSession(model_name, so)

After running, optimized.onnx is written to disk. You can then open it in Netron to view the graph. The optimized graph for CPU looks like the following:

 input_ids  segment_ids  input_mask
    |            |           |
   Cast         Cast        Cast
    \            |          /
      EmbedLayerNormalization
                 |        |
               Attention  |                    --- start of one layer
                 |        |
               MatMul    /
                 |      /
                Add    /
                  \   /   
                   Add
                    |
              LayerNormalization
                    |        |
                 MatMul      |
                    |        |
                 BiasGelu    |
                    |        |
                  MatMul    /
                    |      /
                   Add    /
                     \   /
                      Add
                       |
            LayerNormalization                  ---  end of one layer
                 |        |
               Attention  |                     --- start of next layer (total 12 layers for BERT base model)
                 |        |
                    ...

If you run onnxruntime on a GPU, the optimized graph is slightly different.
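
Besides viewing the graph in Netron, the check can be scripted by counting operator types in the optimized model. This is a rough sketch: `count_op_types` and `missing_fused_ops` are hypothetical helpers, the fused operator names are taken from the CPU diagram above, and the stand-in nodes below only mimic the `op_type` attribute of real graph nodes.

```python
from collections import Counter
from types import SimpleNamespace


def count_op_types(nodes):
    """Count how often each operator type appears among the graph nodes."""
    return Counter(node.op_type for node in nodes)


def missing_fused_ops(counts,
                      expected=("EmbedLayerNormalization", "Attention", "BiasGelu")):
    """Return the expected fused operators that do not appear at all."""
    return [op for op in expected if counts[op] == 0]


# Stand-in nodes for illustration. With the onnx package installed, the real
# nodes would come from onnx.load("optimized.onnx").graph.node.
fake_nodes = [
    SimpleNamespace(op_type=name)
    for name in ["Cast", "EmbedLayerNormalization", "Attention", "MatMul",
                 "Add", "LayerNormalization", "MatMul", "BiasGelu"]
]
counts = count_op_types(fake_nodes)
print(counts["Attention"], missing_fused_ops(counts))
```

If the list of missing operators is not empty, the fusions in the diagram probably did not apply to the exported model.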

tianleiwu on 14 Jan 2020

👍1

You can get better performance with GraphOptimization by replacing code:

    #onnx session
    ort_session = onnxruntime.InferenceSession(model_name)

with

so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = onnxruntime.InferenceSession(model_name, so)

I am using the same parameters that you specified. Here is how I export the PyTorch model to ONNX:

def convert_bert_to_onnx(text, model_dir, task_name, onnx_model_name):

    config = BertConfig.from_pretrained(model_dir)
    tokenizer = BertTokenizer.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir, config=config)
    model.to("cpu")
    input_ids, input_mask, segment_ids = preprocess(tokenizer, text)

    # Keys must match input_names/output_names; values name the dynamic axes.
    dynamic_axes = {
        'input_ids': {0: 'batch', 1: 'max_seq_len'},
        'input_mask': {0: 'batch', 1: 'max_seq_len'},
        'segment_ids': {0: 'batch', 1: 'max_seq_len'},
        'output': {0: 'batch'},
    }

    torch.onnx.export(model, (input_ids, input_mask, segment_ids), onnx_model_name,
                      input_names=["input_ids", "input_mask", "segment_ids"],
                      output_names=["output"], opset_version=10,
                      do_constant_folding=True, dynamic_axes=dynamic_axes)

    print("SST model converted to onnx format successfully")

Here is how I am performing inference:

#onnx session
so = onnxruntime.SessionOptions()
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
so.intra_op_num_threads = 1
ort_session = onnxruntime.InferenceSession(onnx_model_name, so)

onnx_inference = []
for example in examples:
    input_ids, input_mask, segment_ids = preprocess(tokenizer, example)

    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(input_ids),
                  ort_session.get_inputs()[1].name: to_numpy(input_mask),
                  ort_session.get_inputs()[2].name: to_numpy(segment_ids)}
    ort_outs = ort_session.run(["output"], ort_inputs)
    torch_onnx_output = torch.tensor(ort_outs[0], dtype=torch.float32)
    onnx_logits = F.softmax(torch_onnx_output, dim=1)

    logits_label = torch.argmax(onnx_logits, dim=1)
    label = logits_label.detach().cpu().numpy()
    onnx_inference.append(label[0])

Thanks :)

rameshjes on 15 Jan 2020

@rameshjesswani, have you checked whether the graph of the optimized model is the same as the one I described above?

On my machine, the inference time of onnxruntime is 47% of that of PyTorch 1.3 when running BERT base (batch_size=8, max_seq_len=128) on the SQuAD dev set. You may see different results on your hardware.

tianleiwu on 16 Jan 2020

@tianleiwu it's not the same; it looks different. I am using BERT large and PyTorch 1.2. Could you tell me where I can share this model with you, so you can test it on your hardware? Thank you :)

Also, does batch size affect the inference time much in onnxruntime? I am comparing PyTorch and onnxruntime using batch size = 1.
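
One way to experiment with batch size is to group the tokenized examples and pad each group only to its longest member. This is a sketch under the assumption that the model was exported with dynamic batch and sequence axes. `make_batches` and `pad_batch` are hypothetical helpers, and the token ids are made up for illustration:

```python
def make_batches(items, batch_size):
    """Split items into consecutive chunks of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch.

    Returns (input_ids, input_mask); the mask is 1 for real tokens, 0 for padding.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    input_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, input_mask


token_ids = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]]
for batch in make_batches(token_ids, batch_size=2):
    ids, mask = pad_batch(batch)
    print(ids, mask)
```

Each padded batch could then be converted to an int64 array and fed to the session in one `run` call, so the timings for different batch sizes can be compared directly.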

rameshjes on 16 Jan 2020

@rameshjesswani, I used PyTorch 1.2 to export a BERT large model for SQuAD based on transformers. The graph is like the one you posted.

You can add a session option to output the optimized model and open the optimized.onnx file to view the optimized graph:

so.optimized_model_filepath = "optimized.onnx"

The optimized graph is like the following: (graph image not preserved)

On my machine, the inference time of onnxruntime is 73% of that of PyTorch 1.2 when running BERT large (batch_size=1, max_seq_len=128) on the SQuAD dev set. In your experiment, the ratio is 80%, so our results are very close.

tianleiwu on 17 Jan 2020

👍1

@tianleiwu thanks. Just to summarize: with BERT Base, you observed that the inference time of onnxruntime is 47% of that of PyTorch 1.3, and with BERT Large, it is 73% of that of PyTorch.

The improvement over PyTorch differs a lot between BERT Base and BERT Large. Is this the expected behavior?

rameshjes on 20 Jan 2020

@rameshjesswani, this is expected since the model sizes are different. Note that the inference time improvement can differ for different models on different hardware.

BERT-large is unlikely to be run on CPU in production because of the latency. It needs a powerful GPU (like V100 or T4) to achieve reasonable latency (<5 ms) for real-time inference.

tianleiwu on 22 Jan 2020

👍1

@tianleiwu many thanks. I am closing this issue as I have got the answer :)

rameshjes on 22 Jan 2020

@tianleiwu thanks. Just to summarize: with BERT Base, you observed that the inference time of onnxruntime is 47% of that of PyTorch 1.3, and with BERT Large, it is 73% of that of PyTorch.

Can you share which device you used?