onnxruntime is 1.5~2x slower than PyTorch on GPU

Created on 15 Nov 2019 · 6 Comments · Source: microsoft/onnxruntime

Describe the bug
Hi, I'm trying to run a simple CNN model with onnxruntime on GPU. However, inference is 1.5~2x slower than with PyTorch.

Urgency
..

System information

OS Platform and Distribution: Linux Ubuntu 18.04
ONNX Runtime installed from: source
ONNX Runtime version: 1.0.0
Python version: 3.6
GCC/Compiler version: 7.4.0
CUDA/cuDNN version: CUDA 10.1, cudnn 7.6.0
GPU model and memory: RTX2080 Ti, 11G
Pytorch version: 1.3.1

To Reproduce
Pytorch code:

import torch
import torch.nn as nn
from time import time

class Model(nn.Module):
    def __init__(self, num_class):
        super(Model, self).__init__()

        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

        self.classifier = nn.Sequential(
            nn.Linear(7*7*256, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_class),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.backbone(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x


if __name__ == "__main__":
    model = Model(1000)
    model.eval()

    input_tensor = torch.randn(32, 3, 112, 112)
    input_tensor = input_tensor.cuda()
    model = model.cuda()

    # warm up
    for i in range(3):
        _ = model(input_tensor)

    # run
    print('Start running..')
    t0 = time()
    for i in range(100):
        _ = model(input_tensor)
    t1 = time()
    print('PTH time: %f ms' %((t1-t0)*10))

    print('Export to onnx..')
    torch.onnx.export(model, input_tensor, "test_model.onnx", 
                        verbose=False, 
                        input_names=['input1'], 
                        output_names=['output1'],
                        dynamic_axes={'input1':{0: 'batch'}, 'output1': {0: 'batch'}})

ORT code:

import onnxruntime as ort
import numpy as np
from time import time

print(ort.get_device())
model = 'test_model.onnx'

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, so)

inputs = np.empty((32, 3, 112, 112), dtype=np.float32)
inputs = {sess.get_inputs()[0].name: inputs}

# warm up
for i in range(3):
    _ = sess.run(['output1'], inputs)

t0 = time()
for i in range(100):
    _ = sess.run(['output1'], inputs)
t1 = time()
print('ORT time: %f ms' %((t1-t0)*10))

Expected behavior
PyTorch average forward time: 2.003217 ms
onnxruntime average forward time: 3.717608 ms

There is a large speed gap between PyTorch and onnxruntime.

Screenshots
None

Additional context

The command used to build onnxruntime from source is as follows:
./build.sh --config RelWithDebInfo --build_wheel --use_cuda --skip_onnx_tests --parallel --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda

Thank you!

performance

TianzhongSong

Most helpful comment

The comparison is not apples to apples. In the PyTorch script, you make the input available on the GPU (input_tensor = input_tensor.cuda()) before running the model. The ORT Python script doesn't have this capability, so the input is copied from CPU to GPU as part of every Run call, and the output is likewise copied back from GPU to CPU. To rule this out, I wrote a test using our internal C++ APIs in which the input and output are made available on the GPU a priori, and measured only the Run call: this model takes about 320 µs. You can find the test here.
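
To get a feel for how much data each Run call has to move in the original script, here is a rough back-of-the-envelope sketch. The tensor shapes come from the scripts in the issue; the 12 GB/s host-to-device bandwidth is purely an assumed figure for illustration (real throughput depends on the PCIe link and on whether the host memory is pinned), so treat the result as an order-of-magnitude estimate, not a measurement.

```python
# Rough estimate of the per-call host<->device traffic in the ORT script.
# Shapes are taken from the issue; the bandwidth is an ASSUMED figure.

BYTES_PER_FLOAT32 = 4


def tensor_bytes(shape):
    """Size in bytes of a dense float32 tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_FLOAT32


input_bytes = tensor_bytes((32, 3, 112, 112))  # copied CPU -> GPU on every run
output_bytes = tensor_bytes((32, 1000))        # copied GPU -> CPU on every run

# Hypothetical effective transfer rate; NOT measured on the reporter's machine.
ASSUMED_BANDWIDTH_BYTES_PER_S = 12e9

copy_ms = (input_bytes + output_bytes) / ASSUMED_BANDWIDTH_BYTES_PER_S * 1e3

print("input:  %d bytes (%.2f MiB)" % (input_bytes, input_bytes / 2**20))
print("output: %d bytes" % output_bytes)
print("estimated copy time per run: %.2f ms" % copy_ms)
```

The input alone is about 4.6 MiB per call, so under this assumption the copies cost a noticeable fraction of a millisecond on every Run, before counting any synchronization they force. That overhead is absent from the PyTorch loop, where the tensor is moved to the GPU once.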

pranavsharma on 16 Nov 2019

👍5

All 6 comments

If you set the logger severity level to verbose with ort.set_default_logger_severity(0), you'll see that not all nodes have been placed on the GPU. See the following output:

2019-11-14 23:25:25.3927876 [I:onnxruntime:Default, cuda_execution_provider.cc:1114 onnxruntime::CUDAExecutionProvider::GetCapability] Force fallback to CPU execution provider for Op type: Gather node name:
2019-11-14 23:25:25.3929819 [I:onnxruntime:Default, cuda_execution_provider.cc:1114 onnxruntime::CUDAExecutionProvider::GetCapability] Force fallback to CPU execution provider for Op type: Unsqueeze node name:
2019-11-14 23:25:25.3933796 [I:onnxruntime:Default, cuda_execution_provider.cc:1114 onnxruntime::CUDAExecutionProvider::GetCapability] Force fallback to CPU execution provider for Op type: Concat node name:
2019-11-14 23:25:25.3939818 [V:onnxruntime:, inference_session.cc:438 onnxruntime::InferenceSession::TransformGraph] Node placements
2019-11-14 23:25:25.3941451 [V:onnxruntime:, inference_session.cc:445 onnxruntime::InferenceSession::TransformGraph]  Provider: [CUDAExecutionProvider]: [Conv (), Relu (), MaxPool (), Conv (), Relu (), MaxPool (), Conv (), Relu (), MaxPool (), Conv (), Relu (), MaxPool (), Shape (), Reshape (), Gemm (), Relu (), Gemm (), Softmax (), ]
2019-11-14 23:25:25.3945522 [V:onnxruntime:, inference_session.cc:445 onnxruntime::InferenceSession::TransformGraph]  Provider: [CPUExecutionProvider]: [Gather (), Unsqueeze (), Concat (), ]

This is because the Shape node is registered to produce its output on the CPU. That placement cascades to the 3 nodes that follow it (Gather, Unsqueeze and Concat), which are forced to fall back to the CPU execution provider. The relevant piece of code is here.
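
A quick way to check placement without reading the verbose log by eye is to parse the "Node placements" lines. The sketch below is only an illustration: the regular expression is an assumption based on the log format shown above (which may differ in other ONNX Runtime versions), parse_node_placements is a hypothetical helper and not part of onnxruntime, and the sample text is an abbreviated copy of the log lines above.

```python
import re

# ASSUMPTION: matches lines shaped like the ones in the log above, e.g.
#   "... Provider: [CPUExecutionProvider]: [Gather (), Unsqueeze (), ]"
PLACEMENT_RE = re.compile(r"Provider: \[(\w+)\]: \[(.*)\]")


def parse_node_placements(log_text):
    """Map each execution provider name to the list of op types placed on it."""
    placements = {}
    for line in log_text.splitlines():
        match = PLACEMENT_RE.search(line)
        if match is None:
            continue
        provider, nodes = match.groups()
        # Each entry looks like "Conv ()": op type, then the (empty) node name.
        ops = [
            entry.split("(")[0].strip()
            for entry in nodes.split(",")
            if entry.strip()
        ]
        placements[provider] = ops
    return placements


# Abbreviated copy of the two placement lines from the log above.
sample_log = """\
... TransformGraph]  Provider: [CUDAExecutionProvider]: [Conv (), Relu (), MaxPool (), Shape (), Reshape (), Gemm (), Softmax (), ]
... TransformGraph]  Provider: [CPUExecutionProvider]: [Gather (), Unsqueeze (), Concat (), ]
"""

placements = parse_node_placements(sample_log)
print(placements.get("CPUExecutionProvider", []))
# prints ['Gather', 'Unsqueeze', 'Concat']
```

An empty list for CPUExecutionProvider would indicate that every node was placed on the GPU.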

pranavsharma on 15 Nov 2019

Other similar issues

https://github.com/microsoft/onnxruntime/issues/1368
https://github.com/microsoft/onnxruntime/issues/1246

pranavsharma on 15 Nov 2019

@pranavsharma Thank you.

I modified the model to avoid the view operator:

self.classifier = nn.Sequential(
    nn.MaxPool2d(kernel_size=7, stride=1),
    nn.Conv2d(256, 1024, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, num_class, kernel_size=1),
    nn.Softmax(dim=1)
)
.......
def forward(self, x):
    x = self.backbone(x)
    x = self.classifier(x)
    return x

Now all nodes are placed on the GPU, but onnxruntime is still much slower than PyTorch.

PyTorch average forward time: 1.614020 ms
onnxruntime average forward time: 3.353541 ms

This is weird...
Any idea?

TianzhongSong on 15 Nov 2019

pranavsharma on 16 Nov 2019

👍5

Thank you for your answer!

TianzhongSong on 19 Nov 2019

@pranavsharma How are we meant to perform this in production? I notice you included test code to get this to work properly.

My data is already on GPU and I want to just use it in session.Run(..).

Never mind, just found: https://github.com/microsoft/onnxruntime/issues/1621
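
For anyone landing here with the same question: ONNX Runtime releases later than the 1.0.0 discussed in this issue expose an I/O binding API in the Python package, which lets inputs and outputs stay on the GPU across runs. The following is a minimal sketch under that assumption, not a tested benchmark. run_with_gpu_io and bench are hypothetical helper names, the 'input1' and 'output1' names come from the export call in the issue, and bench needs a CUDA-enabled onnxruntime build and a GPU.

```python
def run_with_gpu_io(sess, input_name, output_name, x_gpu):
    """Run `sess` with an input OrtValue that already lives on the GPU.

    The output is bound to CUDA memory as well, so the call itself does not
    copy the input or output between host and device. Returns the output as
    an OrtValue that stays on the device.
    """
    binding = sess.io_binding()
    binding.bind_ortvalue_input(input_name, x_gpu)  # input is already on the GPU
    binding.bind_output(output_name, "cuda")        # let ORT allocate the output on the GPU
    sess.run_with_iobinding(binding)
    return binding.get_outputs()[0]


def bench(model_path="test_model.onnx", runs=100):
    """Hypothetical timing loop; requires onnxruntime with CUDA support."""
    from time import time

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
    x = np.random.rand(32, 3, 112, 112).astype(np.float32)
    # Copy the input to the GPU once, outside the timed loop.
    x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

    for _ in range(3):  # warm up
        run_with_gpu_io(sess, "input1", "output1", x_gpu)

    t0 = time()
    for _ in range(runs):
        run_with_gpu_io(sess, "input1", "output1", x_gpu)
    t1 = time()
    print("ORT (I/O binding) time: %f ms" % ((t1 - t0) * 1000 / runs))
```

The sketch creates a fresh binding on every call to keep the helper simple; reusing one binding across calls would likely shave off a little more overhead.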