Onnxruntime: Cuda Memory Allocation throws error on Session->Run

Created on 20 Oct 2020 · 9 Comments · Source: microsoft/onnxruntime

Hey there, I don't know if this is a bug or if I am using the API wrong. I want to use a GPU allocator for the Ort::MemoryInfo; I think the tensor will then be placed on the GPU directly?

My goal is to run multiple ONNX models on the same image. The images are huge and only PCIe x1 is available, so the cost of transferring the image each time is significant.

Describe the bug
We use the following code for a CPU Memory Allocation:

memory_info = new Ort::MemoryInfo("Cpu", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
val = Ort::Value::CreateTensor<float>(memory_info_, input_tensor_values[k].data(), input_tensor_size, input_node_dims.data(), 4);
output_tensors[j] = session_ptr->Run(Ort::RunOptions{nullptr}, input_node_names.data(), &val, num_input_nodes, output_node_names.data(), num_output_nodes);

This works as expected, but (I think) the model gets copied to the GPU each time.

Next I tried to use the CUDA Memory Allocator in the following way

//we have a vector of allocators and define a memory info + session option for each model
vector<Ort::Allocator*>  cuda_allocator;
...
memory_info = new Ort::MemoryInfo("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
session_options = new Ort::SessionOptions();
session_options->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session_options->AddConfigEntry(kOrtSessionOptionsConfigUseEnvAllocators, "1");
shared_ptr<Ort::Session> new_pointer(new Ort::Session(*env, model_data, (size_t)model_size, *session_options));

cuda_allocator.push_back(new Ort::Allocator(*session, *memory_info));
cuda_allocator.back()->Alloc(4000000000); // does not matter what we put in here

... 

// next we create the input tensor and try to run it as above
val = Ort::Value::CreateTensor<float>(cuda_allocator[i]->GetInfo(), input_tensor_values[k].data(), input_tensor_size, input_node_dims.data(), 4);
output_tensors[j] = session_ptr->Run(Ort::RunOptions{nullptr}, input_node_names.data(), &val, num_input_nodes, output_node_names.data(), num_output_nodes);

but the run method throws the following error:

2020-10-20 22:21:57.0518971 [E:onnxruntime:Evaluation, cuda_call.cc:119 onnxruntime::CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=XXXX ; expr=cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, x_data, s_.filter_desc, w_data, s_.conv_desc, s_.y_tensor, y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize);
2020-10-20 22:21:57.0521207 [E:onnxruntime:, sequential_executor.cc:318 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Conv node. Name:'Conv_0' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, x_data, s_.filter_desc, w_data, s_.conv_desc, s_.y_tensor, y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize)
2020-10-20 22:21:57.0534116 [E:onnxruntime:Evaluation, cuda_call.cc:119 onnxruntime::CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=XXXX ; expr=cudaEventRecord(current_deferred_release_event, nullptr);

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 x64
ONNX Runtime installed from (source or binary): binary
ONNX Runtime version: 1.5.2
Python version: --
Visual Studio version (if applicable): 2019
GCC/Compiler version (if compiling from source): Visual Studio Compiler
CUDA/cuDNN version: 10.2/8.0.3
GPU model and memory: RTX 2070 Super / 8GB

Thanks in advance!

support

marcown

All 9 comments

The problem is with this line:

val = Ort::Value::CreateTensor<float>(cuda_allocator[i]->GetInfo(), input_tensor_values[k].data(), input_tensor_size, input_node_dims.data(), 4);

input_tensor_values[k].data() provides a pointer to memory on the CPU, whereas the tensor is advertised to ORT as a tensor on CUDA (cuda_allocator[i]->GetInfo()). This is a mismatch: when creating the tensor with cuda_allocator[i]->GetInfo(), you must provide a buffer that lives on CUDA.

Could you please try that?

hariharans29 on 21 Oct 2020

Thank you for the response!

Could you provide an example of how to get the data onto the GPU / into a CUDA buffer before using it as an input to CreateTensor? I cannot find one in the docs or code examples.

marcown on 22 Oct 2020

I couldn't find an example directly. But you are almost there: once you have used the CUDA allocator to allocate memory on the device, you can use cudaMemcpy (not part of the ORT API; it is part of the CUDA toolkit) to copy the CPU data over to the device-allocated memory. You should then be able to construct the OrtValue from this buffer and use it. I'll try to add an example soon, but I may not be able to get it done immediately.

hariharans29 on 22 Oct 2020

Ok, I will try it and post my results. Thank you for the fast responses! :)

marcown on 22 Oct 2020

Sure - I quickly hacked some code - (maybe a little raw):

#include <cuda_runtime.h>

const std::array<int64_t, 2> x_shape = {3, 2};
std::array<float, 3 * 2> x_values = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};

// allocate memory on CUDA
// De-allocate using Free() when the OrtValue using it is no longer going to be used
void* input_data = cuda_allocator.Alloc(x_values.size() * sizeof(float));
ASSERT_NE(input_data, nullptr);

// initialize the memory on CUDA with data we want (blocking call)
// MISSING PIECE IN YOUR CODE
cudaMemcpy(input_data, x_values.data(), sizeof(float) * x_values.size(), cudaMemcpyHostToDevice);

// Create an OrtValue tensor backed by data on CUDA memory
Ort::Value bound_x = Ort::Value::CreateTensor<float>(info_cuda, reinterpret_cast<float*>(input_data), x_values.size(),
x_shape.data(), x_shape.size());

cudaMemcpy() is defined in cudart (part of the CUDA toolkit), so the app must link against the CUDA runtime library. It is not enough to link with onnxruntime, as this is not an onnxruntime API (notice the CUDA header include). It shouldn't be very hard to get it to work, as it is a popular function and there are a lot of snippets on the web. I'll try to add a test for IOBinding in test_inference.cc.
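The snippet above notes that the device memory has to be released with Free() once the OrtValue using it is no longer needed. One way to make that harder to forget is a small scope guard. The following is only a sketch: ScopedAllocation and CountingHostAllocator are hypothetical names, not part of ONNX Runtime. The guard is templated on the allocator type and assumes only that the allocator exposes void* Alloc(size_t) and void Free(void*), as the allocator in this thread does. The host-memory stand-in exists purely so the guard can be tried without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Hypothetical helper (not an ONNX Runtime type): owns one allocation and
// returns it to the allocator when the guard goes out of scope.
// Assumes Allocator provides void* Alloc(std::size_t) and void Free(void*).
template <typename Allocator>
class ScopedAllocation {
 public:
  ScopedAllocation(Allocator& allocator, std::size_t bytes)
      : allocator_(&allocator), ptr_(allocator.Alloc(bytes)), bytes_(bytes) {}

  ~ScopedAllocation() { reset(); }

  // Non-copyable: two guards must never free the same pointer.
  ScopedAllocation(const ScopedAllocation&) = delete;
  ScopedAllocation& operator=(const ScopedAllocation&) = delete;

  // Movable: ownership is transferred and the source is left empty.
  ScopedAllocation(ScopedAllocation&& other) noexcept
      : allocator_(other.allocator_),
        ptr_(std::exchange(other.ptr_, nullptr)),
        bytes_(std::exchange(other.bytes_, 0)) {}

  ScopedAllocation& operator=(ScopedAllocation&& other) noexcept {
    if (this != &other) {
      reset();
      allocator_ = other.allocator_;
      ptr_ = std::exchange(other.ptr_, nullptr);
      bytes_ = std::exchange(other.bytes_, 0);
    }
    return *this;
  }

  void* get() const { return ptr_; }
  std::size_t size_bytes() const { return bytes_; }

 private:
  void reset() {
    if (ptr_ != nullptr) {
      allocator_->Free(ptr_);
      ptr_ = nullptr;
    }
  }

  Allocator* allocator_;
  void* ptr_;
  std::size_t bytes_;
};

// Host-memory stand-in used only to exercise the guard without a GPU.
// It counts live allocations so leaks are easy to spot.
struct CountingHostAllocator {
  int live = 0;
  void* Alloc(std::size_t bytes) {
    ++live;
    return std::malloc(bytes);
  }
  void Free(void* p) {
    --live;
    std::free(p);
  }
};
```

With the real CUDA allocator, the guard would have to outlive every OrtValue created on top of get(), since the tensor only borrows the buffer.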

hariharans29 on 22 Oct 2020

Thank you, I got it working with the following code:

    vector<void*> input_data(partitions);
    ....

    shared_ptr<Ort::Session> session_ptr = container_collection[0].get_session();
    Ort::MemoryInfo memory_info_cuda("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
    Ort::Allocator memory_allocator(*session_ptr, memory_info_cuda);

    ...

    input_data[k] = memory_allocator.Alloc(sizeof(float) * input_tensor_size);
    cudaMemcpy(input_data[k], input_tensor_values[k].data(), sizeof(float) * input_tensor_size, cudaMemcpyHostToDevice);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_allocator.GetInfo(), reinterpret_cast<float*>(input_data[k]), input_tensor_size, input_node_dims.data(), 4);
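In the code above, the same size appears three times: in Alloc, in cudaMemcpy, and as the element count passed to CreateTensor. If any of them disagrees with the shape, the device buffer ends up too small or only partly filled. A small helper that derives both numbers from the shape keeps them in sync. This is a sketch only; ElementCount and ByteSize are hypothetical helper names, not ONNX Runtime functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Number of elements described by a tensor shape (product of all dimensions).
// An empty shape describes a scalar, so the result is 1.
// Throws on negative dimensions or if the product overflows std::size_t.
inline std::size_t ElementCount(const std::vector<std::int64_t>& shape) {
  std::size_t count = 1;
  for (std::int64_t dim : shape) {
    if (dim < 0) {
      throw std::invalid_argument("shape contains a negative dimension");
    }
    const std::size_t d = static_cast<std::size_t>(dim);
    if (d != 0 && count > std::numeric_limits<std::size_t>::max() / d) {
      throw std::overflow_error("element count overflows size_t");
    }
    count *= d;
  }
  return count;
}

// Number of bytes needed to store a tensor of element type T with this shape.
template <typename T>
std::size_t ByteSize(const std::vector<std::int64_t>& shape) {
  const std::size_t count = ElementCount(shape);
  if (count > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
    throw std::overflow_error("byte size overflows size_t");
  }
  return count * sizeof(T);
}
```

For a float input with dims {1, 3, 224, 224}, ByteSize<float>(dims) would be the value passed to Alloc and cudaMemcpy, and ElementCount(dims) the element count passed to CreateTensor.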

However, I have another question:

When I set kOrtSessionOptionsConfigUseEnvAllocators to 1, multiple sessions should use the same allocator, is that correct? But we only create the memory_allocator once, with session_ptr pointing to the first session (see above):

Ort::Allocator memory_allocator(*session_ptr, memory_info_cuda);

There is no difference in the final inference of the multiple sessions if we set this option.

marcown on 22 Oct 2020

Hi @pranavsharma - Could you please help answer the shared allocator question above?

hariharans29 on 22 Oct 2020

Please take a look at this test case that demonstrates the usage of the shared allocator feature: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/test/shared_lib/test_inference.cc#L904. Yes, using a shared allocator shouldn't make any difference to the result of the inference; it might consume less memory though. Also, at this point only CPU allocators can be shared.