Onnxruntime: BERT performance slower than default pytorch on CPU

Created on 9 Jan 2020 · 18 Comments · Source: microsoft/onnxruntime

Describe the bug
I have exported a BERT model from huggingface's transformers models.

Batch size: 1, sequence length: 256
Pytorch: 0.149689 seconds
ONNX: 0.281283 seconds

Batch size: 8, sequence length: 256
Pytorch: 0.761311 seconds
ONNX: 2.792252 seconds

https://github.com/huggingface/transformers/blob/master/examples/benchmarks.py#L366

Urgency
January/2020

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • ONNX Runtime installed from: binary (pip install onnxruntime)
  • ONNX Runtime version: 1.1.0
  • Python version: 3.7.4
  • Visual Studio version (if applicable):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CPU only
  • GPU model and memory:

To Reproduce

model = AutoModel.from_pretrained(model_name, config=config)
.....

torch.onnx.export(model, sequence, "bert_" + str(slice_size) + ".onnx",
                                   input_names=['input'],
                                   output_names=['output'],
                                   dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}}, verbose=True)
  • ONNX model execution
sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
runtimes = timeit.repeat(lambda: sess.run([], {'input':np.random.randn(batch_size, slice_size).astype(np.longlong)}), repeat=average_over, number=3)
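A side note on the benchmark input: casting np.random.randn(...) to an integer type truncates toward zero, so most token IDs end up as 0 and some are negative. That is unlikely to explain the gap by itself, but it removes one variable. A minimal sketch that draws IDs uniformly from a vocabulary instead; the helper name make_input_ids is hypothetical, and the vocabulary size of 30522 (bert-base-uncased) is an assumption, since the thread does not name the model:

```python
import numpy as np


def make_input_ids(batch_size, seq_len, vocab_size=30522, seed=0):
    """Return random, non-negative int64 token IDs of shape (batch_size, seq_len).

    vocab_size=30522 is an assumption (bert-base-uncased); adjust for your model.
    """
    rng = np.random.default_rng(seed)
    # The upper bound is exclusive, so every ID is a valid vocabulary index.
    return rng.integers(0, vocab_size, size=(batch_size, seq_len), dtype=np.int64)


ids = make_input_ids(batch_size=8, seq_len=256)
print(ids.shape, ids.dtype)
```

The resulting array can be passed as the 'input' feed in place of the randn expression.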

Expected behavior
ONNX converted version should be faster.

performance

All 18 comments

Please enable optimization like the following and try again:

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, so)

Currently, some BERT optimizations are not enabled by default.

@ykim362,

Please enable optimization like the following and try again:
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, so)

Currently, some BERT optimizations are not enabled by default.

Please also try using OpenMP with the following setting:
so.intra_op_num_threads=1
Then tune OMP_NUM_THREADS to get the best performance.

And could you please share the device configuration, i.e., the configuration of your CPU?
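Putting the two suggestions together, a minimal sketch might look like the following. The helper names configure_openmp_threads and make_session are hypothetical. It assumes an OpenMP-enabled onnxruntime build and that OMP_NUM_THREADS has to be set before the library is loaded, which is how OpenMP runtimes typically read it:

```python
import os


def configure_openmp_threads(num_threads):
    """Set OMP_NUM_THREADS for this process (do this before importing onnxruntime)."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)


def make_session(model_path, omp_threads):
    """Create a session using the settings suggested in this thread."""
    configure_openmp_threads(omp_threads)
    # Imported here so that the environment variable is already in place.
    import onnxruntime as ort

    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    so.intra_op_num_threads = 1
    return ort.InferenceSession(model_path, so)
```

For example, make_session("bert_256.onnx", omp_threads=8), where the file name is only a placeholder. Since the variable is read once, comparing several thread counts is easiest with one fresh process per value.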

Hi @tianleiwu and @yufenglee, I am also facing the same issue (posted here). I have tried the above parameters, but there is no improvement in inference time.

Configuration of CPU:

Memory: 15.4 GB
Processor: Intel Core i7-8550U CPU @ 1.8GHz x 8
Graphics: Intel UHD Graphics 620
GNOME: 3.28.2
OS type: 64-bit

Thanks :)

Thanks, @tianleiwu and @yufenglee .
I've tried both settings (ort.GraphOptimizationLevel.ORT_ENABLE_ALL and so.intra_op_num_threads=1).
I got better performance, but ONNX is still a little behind.

Batch size: 1, sequence length: 256
Pytorch: 0.149689 seconds
ONNX: 0.198 seconds (best threads setting: OMP_NUM_THREADS=32)

Batch size: 8, sequence length: 256
Pytorch: 0.761311 seconds
ONNX: 1.661 seconds (best threads setting: OMP_NUM_THREADS=40)

My machine has 24 AVX2-capable cores (48 threads), Haswell architecture.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               1212.621
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4601.77
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

@tianleiwu, @yufenglee
Would you be able to advise the best settings? And, are there any models available with your optimized settings?
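For anyone repeating the thread tuning reported above, a sweep over OMP_NUM_THREADS can be scripted by launching the benchmark in a fresh process for each value. This is only a sketch under stated assumptions: the function name sweep_omp_threads is hypothetical, and the benchmark command is assumed to print the per-call latency in seconds as the last line of its stdout.

```python
import os
import subprocess
import sys


def sweep_omp_threads(command, thread_counts):
    """Run `command` once per thread count and return {thread_count: latency}.

    Assumes the command prints a latency in seconds as its last stdout line.
    """
    results = {}
    for n in thread_counts:
        # A fresh process per value, so the OpenMP runtime sees the new setting.
        env = dict(os.environ, OMP_NUM_THREADS=str(n))
        done = subprocess.run(command, env=env, capture_output=True,
                              text=True, check=True)
        results[n] = float(done.stdout.strip().splitlines()[-1])
    return results


def best_thread_count(results):
    """Return the thread count with the lowest measured latency."""
    return min(results, key=results.get)
```

Usage would be along the lines of sweep_omp_threads([sys.executable, "bench.py"], [8, 16, 24, 32, 40, 48]), where bench.py stands for your own benchmark script.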

@ykim362, we looked at your model. Could you try exporting the model with a dynamic length for the sequence axis as well?

torch.onnx.export(model, sequence, "bert_" + str(slice_size) + ".onnx",
                                   input_names=['input'],
                                   output_names=['output'],
                                   dynamic_axes={'input': {0: 'batch', 1: 'seq_len'}, 'output': {0: 'batch'}}, verbose=True)

We also found that our latest optimization in 1.1 is not applied to your model automatically. We can provide a script to fuse the model if the dynamic length change doesn't give you the performance you expect.

@ykim362, please add two parameters when calling torch.onnx.export: opset_version=10 (or 11) and do_constant_folding=True, in addition to the dynamic axes mentioned by yufenglee. Those are the settings used in our tests.
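Combining the export suggestions from the last two comments, the call could be sketched like this. The helper names bert_export_kwargs and export_bert are hypothetical; the keyword arguments themselves are the ones named in this thread.

```python
def bert_export_kwargs(opset_version=11):
    """Keyword arguments for torch.onnx.export, as suggested in this thread."""
    return {
        "input_names": ["input"],
        "output_names": ["output"],
        # Both the batch and the sequence axis of the input are dynamic.
        "dynamic_axes": {"input": {0: "batch", 1: "seq_len"},
                         "output": {0: "batch"}},
        "opset_version": opset_version,  # 10 or 11
        "do_constant_folding": True,
    }


def export_bert(model, sample_input, path, opset_version=11):
    """Export `model` to ONNX at `path` using the settings above."""
    # Imported lazily so the kwargs helper stays free of dependencies.
    import torch

    torch.onnx.export(model, sample_input, path,
                      **bert_export_kwargs(opset_version))
```

Here model and sample_input are the same objects passed to torch.onnx.export in the original report.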

Thanks, @yufenglee.
I tried the dynamic sequence length. The performance was the same as reported yesterday.

Thanks, @tianleiwu .
I am already using those settings.

I've shared the newly converted model in the same directory fyi. ('bert_dynamic_len_11.onnx')

@ykim362, where did you upload the model? And could you also share the script you use to benchmark the ONNX model?

sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
runtimes = timeit.repeat(lambda: sess.run([], {'input':np.random.randn(batch_size, slice_size).astype(np.longlong)}), repeat=average_over, number=3)

Another thing you may try is the Nuphar execution provider, which compiles the model for optimized inference on CPU. You may follow its tutorial on how to run a BERT model. To try it out, you may build from source or use the Docker image with prebuilt Nuphar:

docker pull mcr.microsoft.com/azureml/onnxruntime:latest-nuphar

You may add the following lines in Python with a Nuphar-enabled build:

import timeit

import numpy as np
import onnxruntime as ort
from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference

# slice_size, batch_size and average_over come from the benchmark script above.
model_path = "bert_" + str(slice_size) + ".onnx"
# Run symbolic shape inference and overwrite the model file in place.
SymbolicShapeInference.infer_shapes(model_path, model_path, auto_merge=True)
sess = ort.InferenceSession(model_path)
runtimes = timeit.repeat(lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}), repeat=average_over, number=3)
# Each entry in runtimes covers number=3 calls, so divide by 3 for the per-call time.
ort_average_time = sum(runtimes) / float(len(runtimes)) / 3.0

I measured ~35% speed-up in batch=8, on Xeon E5-2690v4 (dual sockets, 14-core/28-HT each socket)

Thanks, @KeDengMS !
I will follow the instructions. Is the speed-up from int8 quantization? Or, does Nuphar also improve fp32 speed?

The model is floating point, and Nuphar works for both fp32 and int8. The speed-up is mainly from fusing ops automatically, and running element-wise ops like Erf in parallel. Quantization to int8 might give you more speed-ups.
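For context on why Erf appears in the graph: as far as I know, the GELU activation in the transformers BERT implementation of that time is computed with the error function, so the exported model contains it as a chain of element-wise ops. A scalar sketch of the formula, purely for illustration (this is not how Nuphar implements it):

```python
import math


def gelu(x):
    """GELU via the error function: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


print(gelu(1.0))
```

Each element is computed independently of the others, which is why this kind of op can be split across threads.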

@KeDengMS Thanks for the clarification! I will try both fp32 and int8.

@faxu @yualan @tianleiwu Thanks! I will follow the tutorials.

@ykim362 Did you resolve this? My understanding is that onnxruntime should be faster even without the Nuphar runtime. I'm having the same problem.

The PyTorch runtime is faster for me as well. I followed all the tricks here to speed it up, to no avail.

@DomHudson, @JustinMBrown,

Here are latest Jupyter Notebooks:

Bert model for SQuAD (CPU inference)

Bert model for SQuAD (GPU inference)

You could try them on your machine and let me know the result. Note that OnnxRuntime currently needs one run to warm up, so you need to measure many runs instead of looking at the first run.
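A minimal sketch of a timing helper that discards warm-up runs before measuring, as described above. The name benchmark and its defaults are hypothetical, and the workload timed here is a stand-in rather than a real session:

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=20):
    """Call fn `warmup` times untimed, then time `runs` individual calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the statistics
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
    }


# Stand-in workload; with a real session this would be something like
# lambda: sess.run(None, {"input": input_ids}).
stats = benchmark(lambda: sum(range(10000)), warmup=2, runs=10)
print(sorted(stats))
```

Reporting the median alongside the mean makes a single slow outlier run easier to spot.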
