TensorRT: Unable to run separate instances in multiple threads

Created on 13 Nov 2019 · 9 Comments · Source: NVIDIA/TensorRT

I've built a library that loads an engine, and this works well.

Now I am testing on two cards. When the library is initialized, I call cudaSetDevice (with the GPU index) to assign the net to the correct device.

At the end of the lib, there are two pointers I'm using:

    std::shared_ptr<nvinfer1::ICudaEngine>             m_Engine; 
    std::shared_ptr<nvinfer1::IExecutionContext>    m_Context;

After these are initialized, I run a test to see if the inference results are correct, and they are.

However, afterwards, when running inference using just these pointers in the library, the first card runs fine, but the second throws this error: Cudnn Error in nvinfer1::rt::CommonContext::configure: 7 (CUDNN_STATUS_MAPPING_ERROR)

Environment

OS: Windows 10
TensorRT Version: TensorRT 6
GPU: 2080 ti and 2080
Nvidia Driver: 441
CUDA: 10.1
CUDNN: 7.4

Any ideas?

Windows 6.x Multi-GPU help wanted needs-info

ttdd11

Most helpful comment

Using cudaSetDevice before copying data works.

It's actually pretty simple (now that it's working :)).

I built a class that inherits from a thread. Each thread has the context and the engine as members. I pass the GPU id to the class (0, 1, ...) and call cudaSetDevice before building the engine and the context.

Now we have two threads, each with the context and engine mapping to the correct card.

When running inference in these threads, you have to call cudaSetDevice within that thread before copying your buffer to the GPU (using a buffer manager or a CUDA copy), every single time you send a batch to the card.
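The pattern described above could be sketched roughly as follows. This is only an illustration, not the actual library code: `GpuWorker` and `runOnAllGpus` are hypothetical names, and `setDeviceStandIn` / `getDeviceStandIn` are stand-ins (backed by a `thread_local` variable) so the sketch compiles without the CUDA toolkit. In real code they would be `cudaSetDevice` / `cudaGetDevice`, and the marked comments would hold the TensorRT engine, context and copy calls.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Stand-ins (assumption) for cudaSetDevice / cudaGetDevice from <cuda_runtime.h>.
// CUDA tracks the current device per host thread; thread_local mimics that.
thread_local int g_currentDevice = 0;
void setDeviceStandIn(int id) { g_currentDevice = id; }
int  getDeviceStandIn()       { return g_currentDevice; }

// Hypothetical worker: one per GPU, owning that GPU's engine and context.
class GpuWorker {
public:
    explicit GpuWorker(int gpuId) : m_gpuId(gpuId) { assert(gpuId >= 0); }

    // Processes `batches` batches; returns how many ran with the right device selected.
    int run(int batches) {
        setDeviceStandIn(m_gpuId);       // select the device before building engine + context
        // ... build or deserialize the engine, create the execution context ...
        int ok = 0;
        for (int i = 0; i < batches; ++i) {
            setDeviceStandIn(m_gpuId);   // select the device again before every copy
            // ... copy input to the GPU, run inference, copy output back ...
            if (getDeviceStandIn() == m_gpuId) ++ok;
        }
        return ok;
    }

private:
    int m_gpuId;
};

// Launches one worker thread per GPU and returns the per-GPU success counts.
std::vector<int> runOnAllGpus(int gpuCount, int batches) {
    std::vector<int> results(gpuCount, 0);
    std::vector<std::thread> threads;
    for (int id = 0; id < gpuCount; ++id) {
        threads.emplace_back([id, batches, &results] {
            GpuWorker worker(id);
            results[id] = worker.run(batches);   // each thread writes only its own slot
        });
    }
    for (auto& t : threads) t.join();
    return results;
}
```

Calling `runOnAllGpus(4, 10)` would start four workers, each selecting its own device id before every batch.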

So the threads run asynchronously with tons of calls to cudaSetDevice, which I didn't think was necessary but works in practice. I ran this on 4 cards all afternoon with no problems.

I thought there would be a threading issue when calling cudaSetDevice (i.e. I call set device from one thread while the other thread is copying memory), but I logged that case and it didn't cause any issues. It seems that cudaSetDevice only affects the calling thread, which is good.
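A small check along these lines can model that observation. It is a sketch under assumptions: `setDeviceStandIn` / `getDeviceStandIn` are stand-ins backed by a `thread_local` variable, which mirrors how the CUDA runtime keeps the current device per host thread; swapping in `cudaSetDevice` / `cudaGetDevice` would let the same check run against real hardware. `deviceAfterOtherThreadSets` is a hypothetical helper name.

```cpp
#include <thread>

// Stand-ins (assumption) for cudaSetDevice / cudaGetDevice.
thread_local int g_device = 0;
void setDeviceStandIn(int id) { g_device = id; }
int  getDeviceStandIn()       { return g_device; }

// Selects `myId` on the calling thread, lets a second thread select `otherId`,
// then returns the device the calling thread still sees.
int deviceAfterOtherThreadSets(int myId, int otherId) {
    setDeviceStandIn(myId);
    std::thread other([otherId] { setDeviceStandIn(otherId); });
    other.join();
    return getDeviceStandIn();   // the other thread's selection does not leak in
}
```

In this model the calling thread keeps its own device, so `deviceAfterOtherThreadSets(0, 1)` returns 0.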

ttdd11 on 27 Nov 2019

All 9 comments

Hi @ttdd11,

I'm not too familiar with this, but might be related to https://github.com/NVIDIA/TensorRT/issues/143

You might have to clean up some of the resources before using on the second card instead of re-using them, but that's just a guess.

rmccorm4 on 13 Nov 2019

Not sure if you're using the same IExecutionContext for both devices, but maybe using createExecutionContextWithoutDeviceMemory() is necessary to achieve that: https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a773f668a04e6d9aa6b2ac1a02c37f251

It looks like the IExecutionContext may be tied directly to the memory of the device, so re-using it on a different device would likely cause a problem.

rmccorm4 on 13 Nov 2019

It's not the same execution context. Each thread has a member that is an execution context.

I'm following the documentation to run on multiple GPUs, using the cudaSetDevice, followed by building a unique engine and unique execution context.

Can I provide some more information that would make debugging this a bit easier?

ttdd11 on 14 Nov 2019

Do you guys have sample code that does something similar to this on multiple GPUs? I.e. different threads with different contexts and engines running in parallel?

ttdd11 on 15 Nov 2019

@rmccorm4 Let me know if there is anything I can provide to assist in this.

ttdd11 on 15 Nov 2019

I'll get back to you on this one, sorry for the delay.

rmccorm4 on 15 Nov 2019

@ttdd11 Can you provide your example for me? I recently started studying TensorRT (in Python).
Thank you.

Eric-Zhang1990 on 25 Dec 2019

Hi @ttdd11, would you mind providing a code demo that runs separate instances in multiple threads with TensorRT? I tried to use one context and one engine object to run multiple inferences asynchronously with the enqueueV2() function from multiple threads, but it did not work. I'd appreciate your reply, thanks.