I've built a library that loads an engine, and this works well.
Now I am testing on two cards. When the library is initialized, I call cudaSetDevice (with the index of the target GPU) to assign the net to the correct device.
At the end of the lib, there are two pointers I'm using:
std::shared_ptr<nvinfer1::ICudaEngine> m_Engine;
std::shared_ptr<nvinfer1::IExecutionContext> m_Context;
After these are initialized, I run a test to see if the inference results are correct, and they are.
However, afterwards, when running inference using just the pointers in the library, the first card runs fine, but the second throws this error: Cudnn Error in nvinfer1::rt::CommonContext::configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
OS: Windows 10
TensorRT Version: TensorRT 6
GPU: 2080 ti and 2080
Nvidia Driver: 441
CUDA: 10.1
CUDNN: 7.4
Any ideas?
Hi @ttdd11,
I'm not too familiar with this, but it might be related to https://github.com/NVIDIA/TensorRT/issues/143
You might have to clean up some of the resources before using the second card instead of re-using them, but that's just a guess.
Not sure if you're using the same IExecutionContext for both devices, but if you are, createExecutionContextWithoutDeviceMemory() may be necessary to achieve that: https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a773f668a04e6d9aa6b2ac1a02c37f251
It looks like the IExecutionContext may be tied directly to the memory of the device, so re-using it on a different device would likely cause a problem.
It's not the same execution context. Each thread has a member that is an execution context.
I'm following the documentation for running on multiple GPUs: calling cudaSetDevice, then building a unique engine and a unique execution context.
Can I provide some more information that would make debugging this a bit easier?
Do you guys have sample code that does something similar to this on multiple GPUs? I.e., different threads with different contexts and engines running in parallel?
@rmccorm4 Let me know if there is anything I can provide to assist in this.
I'll get back to you on this one, sorry for the delay.
Using cudaSetDevice before copying data works.
It's actually pretty simple (now that it's working :)).
I built a class that inherits from a thread. Each thread has the context and engine as members. I pass the GPU id to the class (0, 1, ...) and call cudaSetDevice before building the engine and the context.
Now we have two threads, each with the context and engine mapping to the correct card.
When running inference using these threads, you have to call cudaSetDevice within that thread before copying your buffer to the GPU (using a buffer manager or a CUDA copy), every single time you send a batch to the card.
So the threads run asynchronously with tonnes of calls to cudaSetDevice, which I didn't think was necessary, but it works in practice. I ran this on 4 cards all afternoon with no problem.
I was thinking there would be a threading issue when calling cudaSetDevice (i.e. I call set device from one thread while the other thread is copying memory), but I logged that case and it didn't cause any issues. It seems that cudaSetDevice only operates within the calling thread, which is good.
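For anyone asking for an example, here is a minimal sketch of the pattern described above, not the original code. It assumes one worker thread per GPU that selects its device once before building the engine and context, and again before every batch. So that it compiles without a GPU, it uses a mock: `fakeSetDevice` and `fakeGetDevice` are hypothetical stand-ins for `cudaSetDevice` and `cudaGetDevice`, backed by a `thread_local` variable that imitates the CUDA runtime keeping the current device per host thread. `GpuWorker` and `runWorkers` are also made-up names, and the TensorRT calls are left as comments.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// ASSUMPTION: mock of the CUDA runtime's per-thread "current device".
// In a real build, replace fakeSetDevice/fakeGetDevice with
// cudaSetDevice/cudaGetDevice from <cuda_runtime.h>.
thread_local int g_currentDevice = 0;

int fakeSetDevice(int id) { g_currentDevice = id; return 0; }  // 0 mimics cudaSuccess
int fakeGetDevice() { return g_currentDevice; }

// Hypothetical worker: one instance per GPU. In real code it would also own
// its own ICudaEngine and IExecutionContext.
class GpuWorker {
public:
    explicit GpuWorker(int gpuId) : m_gpuId(gpuId) {}

    // Launch the worker thread and process `batches` batches on it.
    void start(int batches) {
        m_thread = std::thread([this, batches] {
            fakeSetDevice(m_gpuId);  // select the device before building engine/context
            // (real code: deserialize the engine and create the execution context here)
            for (int i = 0; i < batches; ++i) {
                fakeSetDevice(m_gpuId);  // select the device again before every batch
                // (real code: copy input to the GPU, enqueue inference, copy output back)
                m_seen.push_back(fakeGetDevice());  // record which device was current
            }
        });
    }

    void join() {
        if (m_thread.joinable()) m_thread.join();
    }

    int gpuId() const { return m_gpuId; }
    const std::vector<int>& seen() const { return m_seen; }  // read only after join()

private:
    int m_gpuId;
    std::thread m_thread;
    std::vector<int> m_seen;
};

// Run one worker per GPU concurrently. Returns true if every batch of every
// worker saw that worker's own device as the current one.
bool runWorkers(int gpuCount, int batches) {
    assert(gpuCount > 0 && batches > 0);

    std::vector<GpuWorker> workers;
    workers.reserve(gpuCount);  // construct all workers before any thread captures `this`
    for (int id = 0; id < gpuCount; ++id) workers.emplace_back(id);

    for (auto& w : workers) w.start(batches);
    for (auto& w : workers) w.join();

    for (const auto& w : workers) {
        if (static_cast<int>(w.seen().size()) != batches) return false;
        for (int device : w.seen()) {
            if (device != w.gpuId()) return false;
        }
    }
    return true;
}
```

Calling `runWorkers(4, 1000)` would mimic the four-card run mentioned above. Because the mock's device id is `thread_local`, a call from one thread never changes what another thread sees, which is the behaviour observed with `cudaSetDevice` in this thread.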
@ttdd11 Can you provide your example for me? I recently started studying TensorRT (in Python).
Thank you.
Hi, @ttdd11. Would you mind providing a code demo that runs separate instances in multiple threads with TensorRT? I tried to use one context and one engine object to run multiple inferences asynchronously with the enqueueV2() function from multiple threads, but it did not work. I'd appreciate your reply, thanks.