Hey there, I don't know if this is a bug or if I am using the API wrong. I want to use a GPU allocator for the Ort::MemoryInfo; I think the tensor will then be placed on the GPU directly?
My goal is to run multiple ONNX models on the same image. The images are huge and only PCIe x1 is available, so the cost of transferring the image each time is significant.
Describe the bug
We use the following code for a CPU Memory Allocation:
memory_info = new Ort::MemoryInfo("Cpu", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
val = Ort::Value::CreateTensor<float>(memory_info_, input_tensor_values[k].data(), input_tensor_size, input_node_dims.data(), 4);
output_tensors[j] = session_ptr->Run(Ort::RunOptions{nullptr}, input_node_names.data(), &val, num_input_nodes, output_node_names.data(), num_output_nodes);
This works as expected, but (I think) the model gets copied to the GPU each time.
Next I tried to use the CUDA memory allocator in the following way:
//we have a vector of allocators and define a memory info + session option for each model
vector<Ort::Allocator*> cuda_allocator;
...
memory_info = new Ort::MemoryInfo("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
session_options = new Ort::SessionOptions();
session_options->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session_options->AddConfigEntry(kOrtSessionOptionsConfigUseEnvAllocators, "1");
shared_ptr<Ort::Session> new_pointer(new Ort::Session(*env, model_data, (size_t)model_size, *session_options));
cuda_allocator.push_back(new Ort::Allocator(*session, *memory_info));
cuda_allocator.back()->Alloc(4000000000); // does not matter what we put in here
...
// next we create the input tensor and try to run it as above
val = Ort::Value::CreateTensor<float>(cuda_allocator[i]->GetInfo(), input_tensor_values[k].data(), input_tensor_size, input_node_dims.data(), 4);
output_tensors[j] = session_ptr->Run(Ort::RunOptions{nullptr}, input_node_names.data(), &val, num_input_nodes, output_node_names.data(), num_output_nodes);
But the Run method throws the following error:
2020-10-20 22:21:57.0518971 [E:onnxruntime:Evaluation, cuda_call.cc:119 onnxruntime::CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=XXXX ; expr=cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, x_data, s_.filter_desc, w_data, s_.conv_desc, s_.y_tensor, y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize);
2020-10-20 22:21:57.0521207 [E:onnxruntime:, sequential_executor.cc:318 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Conv node. Name:'Conv_0' Status Message: CUDNN error executing cudnnFindConvolutionForwardAlgorithmEx( CudnnHandle(), s_.x_tensor, x_data, s_.filter_desc, w_data, s_.conv_desc, s_.y_tensor, y_data, 1, &algo_count, &perf, algo_search_workspace.get(), AlgoSearchWorkspaceSize)
2020-10-20 22:21:57.0534116 [E:onnxruntime:Evaluation, cuda_call.cc:119 onnxruntime::CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=XXXX ; expr=cudaEventRecord(current_deferred_release_event, nullptr);
System information
Thanks in advance!
The problem is with this line:
val = Ort::Value::CreateTensor<float>(cuda_allocator[i]->GetInfo(), input_tensor_values[k].data(), input_tensor_size, input_node_dims.data(), 4);
input_tensor_values[k].data() provides a pointer to memory on the CPU, whereas the tensor is advertised to ORT as a tensor on CUDA (cuda_allocator[i]->GetInfo()). This is a mismatch: when you create the tensor with cuda_allocator[i]->GetInfo(), you must give it a buffer that lives on CUDA.
Could you please try that?
Thank you for the response!
Could you provide an example of how to get the data onto the GPU / into a CUDA buffer before using it as an input to CreateTensor? I cannot find one in the docs or code examples.
I couldn't find an example directly. But you are almost there: once you have used the CUDA allocator to allocate memory on the device, you can use cudaMemcpy (not part of the ORT API; it is part of the CUDA toolkit) to copy the CPU data over to the device-allocated memory. You should then be able to construct the OrtValue using this buffer and use it. I'll try to add an example soon, but I may not be able to get it done immediately.
Ok, I will try it and post my results. Thank you for the fast responses! :)
Sure - I quickly hacked some code - (maybe a little raw):
const std::array
std::array
// allocate memory on CUDA
// De-allocate using Free() when the OrtValue using it is no longer going to be used
void* input_data = cuda_allocator.Alloc(x_values.size() * sizeof(float));
ASSERT_NE(input_data, nullptr);
// initialize the memory on CUDA with data we want (blocking call)
// MISSING PIECE IN YOUR CODE
cudaMemcpy(input_data, x_values.data(), sizeof(float) * x_values.size(), cudaMemcpyHostToDevice);
// Create an OrtValue tensor backed by data on CUDA memory
Ort::Value bound_x = Ort::Value::CreateTensor(info_cuda, reinterpret_cast<float*>(input_data), x_values.size(),
x_shape.data(), x_shape.size());
cudaMemcpy() is defined in cudart (part of the CUDA toolkit), so the app must link against the CUDA runtime library. Linking with onnxruntime alone is not enough, because this is not an onnxruntime API (note the CUDA header include). It shouldn't be very hard to get working, as it is a widely used function and there are plenty of snippets on the web. I'll try to add a test for IOBinding in test_inference.cc.
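The comment in the snippet above says the device buffer has to be released with Free() once the OrtValue using it is no longer needed. One way to make that harder to forget is a small RAII guard. The following is only a sketch under assumptions: ScopedAllocation and CountingHostAllocator are hypothetical names that are not part of ONNX Runtime, and the guard is written against any type that exposes Alloc(size_t) and Free(void*) in the way the thread uses Ort::Allocator. The stand-in allocator uses host memory so the sketch compiles without ONNX Runtime or CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Owns one block obtained from allocator.Alloc() and hands it back with
// allocator.Free() when the guard goes out of scope.
// AllocatorT is any type exposing void* Alloc(size_t) and void Free(void*).
template <typename AllocatorT>
class ScopedAllocation {
public:
    ScopedAllocation(AllocatorT& allocator, std::size_t bytes)
        : allocator_(&allocator), ptr_(allocator.Alloc(bytes)) {}

    ~ScopedAllocation() {
        if (ptr_ != nullptr) allocator_->Free(ptr_);
    }

    // Copying would lead to a double Free(), so only moves are allowed.
    ScopedAllocation(const ScopedAllocation&) = delete;
    ScopedAllocation& operator=(const ScopedAllocation&) = delete;
    ScopedAllocation(ScopedAllocation&& other) noexcept
        : allocator_(other.allocator_),
          ptr_(std::exchange(other.ptr_, nullptr)) {}
    ScopedAllocation& operator=(ScopedAllocation&&) = delete;

    void* get() const { return ptr_; }

private:
    AllocatorT* allocator_;
    void* ptr_;
};

// Hypothetical stand-in allocator (host memory) used only to exercise the
// guard; it counts live blocks so leaks are easy to spot.
struct CountingHostAllocator {
    int live = 0;
    void* Alloc(std::size_t bytes) { ++live; return std::malloc(bytes); }
    void Free(void* p) { --live; std::free(p); }
};
```

With the code from this thread, the guard would presumably be instantiated as ScopedAllocation<Ort::Allocator> and its get() pointer passed to cudaMemcpy and CreateTensor; that combination is untested here. The tensor does not own the buffer, so the guard has to outlive every Ort::Value built on top of it.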
Thank you, I got it working with the following code:
vector<void*> input_data(partitions);
....
shared_ptr<Ort::Session> session_ptr = container_collection[0].get_session();
vector<void*> input_data(partitions);
Ort::MemoryInfo memory_info_cuda("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
Ort::Allocator memory_allocator(*session_ptr, memory_info_cuda);
...
input_data[k] = memory_allocator.Alloc(sizeof(float) * input_tensor_size);
cudaMemcpy(input_data[k], input_tensor_values[k].data(), sizeof(float) * input_tensor_size, cudaMemcpyHostToDevice);
Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_allocator.GetInfo(), reinterpret_cast<float*>(input_data[k]), input_tensor_size, input_node_dims.data(), 4);
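In the working code above, three numbers have to describe the same tensor: the byte count given to Alloc and cudaMemcpy, the element count given to CreateTensor, and the shape in input_node_dims. A minimal sketch of deriving the first two from the shape, assuming float data; element_count and float_tensor_bytes are hypothetical helper names, not ONNX Runtime functions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Number of elements described by a tensor shape (product of all dimensions).
// Throws if a dimension is negative, e.g. a dynamic axis still reported as -1.
inline std::size_t element_count(const std::vector<int64_t>& dims) {
    std::size_t count = 1;
    for (int64_t d : dims) {
        if (d < 0) {
            throw std::invalid_argument("shape has an unresolved (negative) dimension");
        }
        count *= static_cast<std::size_t>(d);
    }
    return count;
}

// Bytes needed to hold a float tensor of the given shape.
inline std::size_t float_tensor_bytes(const std::vector<int64_t>& dims) {
    return element_count(dims) * sizeof(float);
}
```

Here input_tensor_size would be expected to equal element_count(input_node_dims), and the size passed to Alloc and cudaMemcpy to equal float_tensor_bytes(input_node_dims). If they disagree, the copy or the model can touch memory outside the buffer.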
However, I have another question:
When I set kOrtSessionOptionsConfigUseEnvAllocators to 1, multiple sessions should use the same allocator, is that correct? But we only create the memory_allocator once, with session_ptr pointing to the first session (see above):
Ort::Allocator memory_allocator(*session_ptr, memory_info_cuda);
There is no difference in the final inference of the multiple sessions if we set this option.
Hi @pranavsharma - Could you please help answer the shared allocator question above?
Please take a look at this test case that demonstrates the usage of the shared allocator feature: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/test/shared_lib/test_inference.cc#L904. Yes, using a shared allocator shouldn't make any difference to the result of the inference; it might consume less memory though. Also, at this point only CPU allocators can be shared.
Ok, thank you!