Describe the bug
Hi, I'm trying to run a simple CNN model with onnxruntime on GPU. However, inference is 1.5~2x slower than PyTorch.
Urgency
..
System information
To Reproduce
Pytorch code:
import torch
import torch.nn as nn
from time import time
class Model(nn.Module):
    def __init__(self, num_class):
        super(Model, self).__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(7*7*256, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_class),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.backbone(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

if __name__ == "__main__":
    model = Model(1000)
    model.eval()
    input_tensor = torch.randn(32, 3, 112, 112)
    input_tensor = input_tensor.cuda()
    model = model.cuda()
    # warm up
    for i in range(3):
        _ = model(input_tensor)
    # run
    print('Start running..')
    t0 = time()
    for i in range(100):
        _ = model(input_tensor)
    t1 = time()
    print('PTH time: %f ms' %((t1-t0)*10))
    print('Export to onnx..')
    torch.onnx.export(model, input_tensor, "test_model.onnx",
                      verbose=False,
                      input_names=['input1'],
                      output_names=['output1'],
                      dynamic_axes={'input1':{0: 'batch'}, 'output1': {0: 'batch'}})
ORT code:
import onnxruntime as ort
import numpy as np
from time import time
print(ort.get_device())
model = 'test_model.onnx'
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, so)
inputs = np.empty((32, 3, 112, 112), dtype=np.float32)
inputs = {sess.get_inputs()[0].name: inputs}
# warm up
for i in range(3):
    _ = sess.run(['output1'], inputs)
t0 = time()
for i in range(100):
    _ = sess.run(['output1'], inputs)
t1 = time()
print('ORT time: %f ms' %((t1-t0)*10))
Expected behavior
Pytorch average forward time: 2.003217ms
onnxruntime average forward time: 3.717608 ms
There is a large speed gap between PyTorch and onnxruntime.
Screenshots
None
Additional context
The command used to build onnxruntime from source is as follows:
./build.sh --config RelWithDebInfo --build_wheel --use_cuda --skip_onnx_tests --parallel --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda
Thank you!
If you set the logger severity level to verbose with ort.set_default_logger_severity(0), you'll see that not all nodes have been placed on the GPU. See the following output:
2019-11-14 23:25:25.3927876 [I:onnxruntime:Default, cuda_execution_provider.cc:1114 onnxruntime::CUDAExecutionProvider::GetCapability] Force fallback to CPU execution provider for Op type: Gather node name:
2019-11-14 23:25:25.3929819 [I:onnxruntime:Default, cuda_execution_provider.cc:1114 onnxruntime::CUDAExecutionProvider::GetCapability] Force fallback to CPU execution provider for Op type: Unsqueeze node name:
2019-11-14 23:25:25.3933796 [I:onnxruntime:Default, cuda_execution_provider.cc:1114 onnxruntime::CUDAExecutionProvider::GetCapability] Force fallback to CPU execution provider for Op type: Concat node name:
2019-11-14 23:25:25.3939818 [V:onnxruntime:, inference_session.cc:438 onnxruntime::InferenceSession::TransformGraph] Node placements
2019-11-14 23:25:25.3941451 [V:onnxruntime:, inference_session.cc:445 onnxruntime::InferenceSession::TransformGraph] Provider: [CUDAExecutionProvider]: [Conv (), Relu (), MaxPool (), Conv (), Relu (), MaxPool (), Conv (), Relu (), MaxPool (), Conv (), Relu (), MaxPool (), Shape (), Reshape (), Gemm (), Relu (), Gemm (), Softmax (), ]
2019-11-14 23:25:25.3945522 [V:onnxruntime:, inference_session.cc:445 onnxruntime::InferenceSession::TransformGraph] Provider: [CPUExecutionProvider]: [Gather (), Unsqueeze (), Concat (), ]
This is because the Shape node is registered to produce its output on the CPU. This cascades to the following 3 nodes (Gather, Unsqueeze and Concat). The relevant piece of code is here.
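To check placement without reading the log by eye, the "Provider: [...]" lines can be grouped per execution provider. This is only a sketch: `node_placements` is a hypothetical helper, and it assumes the verbose log keeps exactly the format shown above (node names in parentheses, no commas inside them).

```python
import re

# Matches the "Provider: [Name]: [Op (), Op (), ]" tail of a verbose log line.
PLACEMENT = re.compile(r"Provider: \[(\w+)\]: \[(.*)\]\s*$")

def node_placements(log_text):
    """Return {provider_name: [op_type, ...]} parsed from verbose ORT log text."""
    placements = {}
    for line in log_text.splitlines():
        match = PLACEMENT.search(line)
        if not match:
            continue
        provider, nodes = match.groups()
        # Each entry looks like "Gather ()"; keep only the op type before "(".
        ops = [n.split("(")[0].strip() for n in nodes.split(",") if n.strip()]
        placements.setdefault(provider, []).extend(ops)
    return placements
```

Any op listed under CPUExecutionProvider in the result is a candidate for the kind of fallback described above.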
@pranavsharma Thank you.
I modified the model to avoid the view operator:
        self.classifier = nn.Sequential(
            nn.MaxPool2d(kernel_size=7, stride=1),
            nn.Conv2d(256, 1024, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, num_class, kernel_size=1),
            nn.Softmax(dim=1)
        )
.......
    def forward(self, x):
        x = self.backbone(x)
        x = self.classifier(x)
        return x
Now all nodes are placed on the GPU; however, onnxruntime is still much slower than PyTorch.
Pytorch average forward time: 1.614020ms
onnxruntime average forward time: 3.353541ms
This is weird...
Any idea?
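For reference, the spatial sizes in both versions of the model can be checked by hand. The sketch below uses the standard output-size formula for convolution and pooling windows (floor mode, no dilation); `out_size` is a hypothetical helper and not part of either script.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a conv/pool window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 112  # input height/width used in the scripts above
for _ in range(4):
    size = out_size(size, kernel=3, stride=1, padding=1)  # Conv2d 3x3, padding=1
    size = out_size(size, kernel=2, stride=2)              # MaxPool2d 2x2, stride 2
backbone_size = size

# The modified classifier starts with MaxPool2d(kernel_size=7, stride=1).
pooled_size = out_size(backbone_size, kernel=7, stride=1)
print(backbone_size, pooled_size)
```

The backbone output is 7x7 with 256 channels, which is where the 7*7*256 input size of the original Linear layer comes from, and the 7x7 pooling in the modified classifier reduces it to 1x1, so no reshape is needed before the 1x1 convolutions.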
The comparison is not apples to apples. In the PyTorch script, you're making the input available on GPU (input_tensor = input_tensor.cuda()) before calling Run. The ORT Python script doesn't have this capability. Hence the input is copied from CPU to GPU as part of the Run call. The same applies to the output. In order to rule this out, I wrote a test using our internal C++ APIs where I made the input and output available on GPU a priori and measured only the Run call, and found that this model takes about 320 us. You can find the test here.
Thank you for your answer!
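One way to make the two Python measurements comparable is to time the same work on both sides, including the host/device copies. The helper below is a minimal sketch of the warm-up-then-average pattern used in both scripts; `average_ms` and its `sync` parameter are hypothetical names introduced here for illustration.

```python
from time import perf_counter

def average_ms(fn, warmup=3, iters=100, sync=None):
    """Average wall-clock time of fn() in milliseconds.

    sync, if given, is called before starting and before stopping the clock
    so that asynchronous work queued by fn() is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    t0 = perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    t1 = perf_counter()
    # seconds -> milliseconds, divided by the number of timed iterations
    return (t1 - t0) * 1000.0 / iters
```

On the PyTorch side this could be called as `average_ms(lambda: model(cpu_batch.cuda()).cpu(), sync=torch.cuda.synchronize)`, where `cpu_batch` is an assumed host tensor, so that the timed region contains the same copies that `sess.run` performs on a NumPy input.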
@pranavsharma How are we meant to perform this in production? I notice you included test code to get this to work properly.
My data is already on GPU and I want to just use it in session.Run(..).
Nevermind, just found: https://github.com/microsoft/onnxruntime/issues/1621
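For readers landing here later: to my knowledge, newer onnxruntime Python packages expose an I/O binding interface for keeping inputs and outputs on the device across Run calls. The sketch below assumes such a version built with the CUDA provider; `run_on_device` is a hypothetical helper name, and this was not available in the Python API discussed in the comments above.

```python
def run_on_device(sess, input_name, output_name, x_ortvalue, n_iters=100):
    """Run sess repeatedly with the input and output bound on the GPU.

    sess is an onnxruntime.InferenceSession and x_ortvalue an
    onnxruntime.OrtValue that already lives on the CUDA device.
    """
    binding = sess.io_binding()
    # Bind the device-resident input so it is not re-copied on every run.
    binding.bind_ortvalue_input(input_name, x_ortvalue)
    # Ask ORT to allocate the output on the CUDA device as well.
    binding.bind_output(output_name, "cuda")
    for _ in range(n_iters):
        sess.run_with_iobinding(binding)
    # The returned OrtValue still lives on the device.
    return binding.get_outputs()[0]
```

With the model from this thread, the input could be prepared once with `ort.OrtValue.ortvalue_from_numpy(batch, 'cuda', 0)` (where `batch` is an assumed float32 NumPy array) and passed in with the names 'input1' and 'output1'.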