Onnxruntime: ONNXRuntime Using GPU Slower Than Using CPU?

Created on 12 Nov 2019  路  13Comments  路  Source: microsoft/onnxruntime

Describe the bug
A clear and concise description of what the bug is.

I have been using ONNXRuntime for a while and I found the ONNXRuntime GPU instance is usually 5-200 times slower than the CPU instance. I just wonder what is the reason behind this, or there is something wrong with my implementation.

The CPU is just an Intel E5 CPU and the GPU is a V100 card.

Urgency
If there are particular important use cases blocked by this or strict project-related timelines, please share more information and dates. If there are no hard deadlines, please specify none.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04.3
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: 1.0.0
  • Python version: 3.6.8
  • Visual Studio version (if applicable):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.0/7.6.3
  • GPU model and memory: V100

To Reproduce
Describe steps/code to reproduce the behavior:

I am using the official model logreg_iris.onnx as an example.

import onnxruntime
import numpy as np
import time
from onnxruntime.datasets import get_example

example_model = get_example("logreg_iris.onnx")
sess = onnxruntime.InferenceSession(example_model)

input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape
input_type = sess.get_inputs()[0].type

print(input_name, input_shape, input_type)

output_name = sess.get_outputs()[0].name
output_shape = sess.get_outputs()[0].shape
output_type = sess.get_outputs()[0].type

print(output_name, output_shape, output_type)

x = np.random.random(input_shape)
x = x.astype(np.float32)

print(sess.get_providers())

sess.set_providers(['CPUExecutionProvider'])
start_time = time.time()
for i in range(1000):
    result = sess.run([output_name], {input_name: x})
end_time = time.time()
time_elapsed = end_time - start_time
print(time_elapsed)


sess.set_providers(['CUDAExecutionProvider'])
start_time = time.time()
for i in range(1000):
    result = sess.run([output_name], {input_name: x})
end_time = time.time()
time_elapsed = end_time - start_time
print(time_elapsed)

Got the following outputs:

root@79d922d8c004:/mnt# python onnx_test.py 
float_input [3, 2] tensor(float)
label [3] tensor(int64)
['CUDAExecutionProvider', 'CPUExecutionProvider']
0.011708498001098633
1.7355647087097168

Expected behavior
A clear and concise description of what you expected to happen.

I expect the GPU instance should be faster than CPU instance. But it is not. Is it due to memory copy between host and device?

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here. If the issue is about a particular model, please share the model details as well to facilitate debugging.

support

Most helpful comment

A few notes:

1) The model only contains traditional ML operators, and there are no CUDA implementations of those ops.
2) If you add an initial call to sess.Run prior to doing the timing the CPU and CUDA times are much closer, so there's some work in the first call that is cached.
3) There's a TODO in the code about not calling a per-Run initialization method on unused execution providers. That's the cause of the CUDA run being slower as that (unnecessary) setup is expensive relative to the extremely small model which is taking less than a millisecond in total to run.

So for this model, as none of the nodes can run on CUDA, adding the CUDA execution provider adds some overhead with no benefit from doing so, and that overhead is significant given the very small model.

All 13 comments

If you set the session log severity level to VERBOSE, it'll print which nodes were assigned to GPU and which ones to CPU. Let us know what this prints. It might explain the behavior. We don't have CUDA implementations for all ops.

If you set the session log severity level to VERBOSE, it'll print which nodes were assigned to GPU and which ones to CPU. Let us know what this prints. It might explain the behavior. We don't have CUDA implementations for all ops and hence memory copies between nodes that are on CPU and GPU increase the latency.

Is there any doc on this verbose?

I am currently using a 9900K CPU and a 2080Ti GPU on my home computer. I run the inference one time (instead of 1000 times), with verbose. This is what was printed, and I was very confused by that. There is a lot of time spend Session successfully initialized and Begin execution in the GPU instance. But it seems that all the ops were on conducted on CPU so it should be similar to what it is on the CPU instance.

root@591ec2895f68:/mnt# python onnx_test.py 
2019-11-12 01:53:28.776392356 [I:onnxruntime:, inference_session.cc:545 Initialize] Initializing session.
2019-11-12 01:53:28.777603627 [V:onnxruntime:, inference_session.cc:438 TransformGraph] Node placements
2019-11-12 01:53:28.777612480 [V:onnxruntime:, inference_session.cc:440 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2019-11-12 01:53:28.777622407 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2019-11-12 01:53:28.777627385 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2019-11-12 01:53:28.777641500 [I:onnxruntime:, session_state_initializer.cc:177 SaveInitializedTensors] Saving initialized tensors.
2019-11-12 01:53:28.777646732 [I:onnxruntime:, session_state_initializer.cc:222 SaveInitializedTensors] Done saving initialized tensors
2019-11-12 01:53:28.777678387 [I:onnxruntime:, inference_session.cc:620 Initialize] Session successfully initialized.
float_input [3, 2] tensor(float)
label [3] tensor(int64)
['CUDAExecutionProvider', 'CPUExecutionProvider']
2019-11-12 01:53:28.778672090 [I:onnxruntime:, inference_session.cc:545 Initialize] Initializing session.
2019-11-12 01:53:28.778715144 [V:onnxruntime:, inference_session.cc:438 TransformGraph] Node placements
2019-11-12 01:53:28.778721320 [V:onnxruntime:, inference_session.cc:440 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2019-11-12 01:53:28.778726298 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2019-11-12 01:53:28.778731216 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2019-11-12 01:53:28.778742335 [I:onnxruntime:, session_state_initializer.cc:177 SaveInitializedTensors] Saving initialized tensors.
2019-11-12 01:53:28.778746963 [I:onnxruntime:, session_state_initializer.cc:222 SaveInitializedTensors] Done saving initialized tensors
2019-11-12 01:53:28.778765919 [I:onnxruntime:, inference_session.cc:620 Initialize] Session successfully initialized.
2019-11-12 01:53:28.778809343 [I:onnxruntime:, sequential_executor.cc:51 Execute] Begin execution
6.508827209472656e-05
2019-11-12 01:53:28.779271477 [I:onnxruntime:, inference_session.cc:545 Initialize] Initializing session.
2019-11-12 01:53:28.779280612 [I:onnxruntime:, inference_session.cc:559 Initialize] Adding default CPU execution provider.
2019-11-12 01:53:28.779324679 [V:onnxruntime:, inference_session.cc:438 TransformGraph] Node placements
2019-11-12 01:53:28.779332636 [V:onnxruntime:, inference_session.cc:440 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2019-11-12 01:53:28.779342979 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2019-11-12 01:53:28.779349260 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2019-11-12 01:53:28.779362716 [I:onnxruntime:, session_state_initializer.cc:177 SaveInitializedTensors] Saving initialized tensors.
2019-11-12 01:53:28.779368882 [I:onnxruntime:, session_state_initializer.cc:222 SaveInitializedTensors] Done saving initialized tensors
2019-11-12 01:53:28.779382934 [I:onnxruntime:, inference_session.cc:620 Initialize] Session successfully initialized.
2019-11-12 01:53:29.716479965 [I:onnxruntime:, sequential_executor.cc:51 Execute] Begin execution
0.9371423721313477

A few notes:

1) The model only contains traditional ML operators, and there are no CUDA implementations of those ops.
2) If you add an initial call to sess.Run prior to doing the timing the CPU and CUDA times are much closer, so there's some work in the first call that is cached.
3) There's a TODO in the code about not calling a per-Run initialization method on unused execution providers. That's the cause of the CUDA run being slower as that (unnecessary) setup is expensive relative to the extremely small model which is taking less than a millisecond in total to run.

So for this model, as none of the nodes can run on CUDA, adding the CUDA execution provider adds some overhead with no benefit from doing so, and that overhead is significant given the very small model.

A few notes:

  1. The model only contains traditional ML operators, and there are no CUDA implementations of those ops.
  2. If you add an initial call to sess.Run prior to doing the timing the CPU and CUDA times are much closer, so there's some work in the first call that is cached.
  3. There's a TODO in the code about not calling a per-Run initialization method on unused execution providers. That's the cause of the CUDA run being slower as that (unnecessary) setup is expensive relative to the extremely small model which is taking less than a millisecond in total to run.

So for this model, as none of the nodes can run on CUDA, adding the CUDA execution provider adds some overhead with no benefit from doing so, and that overhead is significant given the very small model.

Thank you @skottmckay for the explanation. For note 3, how do I add the TODO code in this scenario?

The GraphPartitioner that is called in InferenceSession::TransformGraph would need to keep track of which execution providers were assigned to nodes, and make that info available to InferenceSession so that InferenceSession::Run only calls OnRunStart() for that subset of execution providers.

@leimao can this be closed now?

It seems that there is no Python interface to this right?

It seems that there is no Python interface to this right?

What are you looking for in the Python interface?

If you set the session log severity level to VERBOSE, it'll print which nodes were assigned to GPU and which ones to CPU. Let us know what this prints. It might explain the behavior. We don't have CUDA implementations for all ops and hence memory copies between nodes that are on CPU and GPU increase the latency.

Is there any doc on this verbose?

import onnxruntime as ort
ort.set_default_logger_severity(0)

If you set the session log severity level to VERBOSE, it'll print which nodes were assigned to GPU and which ones to CPU. Let us know what this prints. It might explain the behavior. We don't have CUDA implementations for all ops and hence memory copies between nodes that are on CPU and GPU increase the latency.

Is there any doc on this verbose?

import onnxruntime as ort
ort.set_default_logger_severity(0)

I did this and I think this is equivalent to yours:

log_severity_level = 0
model_name = "logreg_iris.onnx"

sess_options = onnxruntime.SessionOptions()
# Log severity level for a particular Run() invocation. 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2.
sess_options.log_severity_level=log_severity_level

# Available official model examples
# https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/datasets
example_model = get_example(model_name)
sess = onnxruntime.InferenceSession(example_model, sess_options=sess_options)

My questions was what is the Python interface to what @skottmckay just said.

What I mentioned were C++ code changes that would need to happen internally to avoid the unnecessary call. It's not something that can be enabled/disabled via python.

What I mentioned were C++ code changes that would need to happen internally to avoid the unnecessary call. It's not something that can be enabled/disabled via python.

Sounds good. I will close the issue then.

Was this page helpful?
0 / 5 - 0 ratings