I've set up TensorRT to work with my YOLOv3 model, and I'm running inference on each frame of a video stream. When I run a single video stream and process one frame at a time, the TensorRT version of the model gets a solid speedup over the regular model (going from 43 fps to 57 fps). However, when I process larger batches, for example frames from 5 different videos (batching 1 frame from each video into a batch of 5), I don't see any speedup with TensorRT.
I'm trying to understand why I see a speedup with a batch size of 1 but not with a batch size of 5. Any ideas why this might be happening, or what I can look into to improve batch performance? I'm running in float32 but would still expect a speedup at larger batch sizes for the TensorRT model.
Here is an outline of my steps for creating and running the tensorRT engine:
1. torch.onnx.export with the dynamic batches param
2. profile.set_shape(inp.name, min=(batch_size, *shape), opt=(batch_size, *shape), max=(batch_size, *shape))
3. context.set_binding_shape(0, (BATCH_SIZE, 3, IMAGE_SIZE))
4. context.active_optimization_profile = 0

Not sure if there's anything else I should be doing, but these steps seem to be fine for handling inference with larger batch sizes. I'm running this on the latest TensorRT version.
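For reference, the profile.set_shape call in the steps above passes the same batch size as min, opt, and max, so the profile covers exactly one batch size. Below is a minimal sketch of a helper that builds the three shape tuples for such a call. The helper name profile_shapes and the 3x416x416 input size are hypothetical and do not come from the original post. The TensorRT call itself is left as a comment so the snippet runs without TensorRT installed.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


def profile_shapes(
    min_batch: int, opt_batch: int, max_batch: int, sample_shape: Sequence[int]
) -> Tuple[Shape, Shape, Shape]:
    """Build (min, opt, max) input shapes for one optimization profile.

    sample_shape is the per-sample shape without the batch dimension,
    for example (3, H, W). Hypothetical helper, for illustration only.
    """
    if not (1 <= min_batch <= opt_batch <= max_batch):
        raise ValueError("expected 1 <= min_batch <= opt_batch <= max_batch")
    dims = tuple(int(d) for d in sample_shape)
    return (min_batch, *dims), (opt_batch, *dims), (max_batch, *dims)


# Same pattern as in the post: all three shapes pinned to one batch size.
# (3, 416, 416) is an assumed input size, not taken from the post.
min_s, opt_s, max_s = profile_shapes(5, 5, 5, (3, 416, 416))

# With TensorRT available, these would be passed along the lines of:
# profile.set_shape(inp.name, min=min_s, opt=opt_s, max=max_s)
```

Passing different values for min_batch, opt_batch, and max_batch would produce a profile that spans a range of batch sizes. Whether that helps here is untested.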
Hello @prathik-naidu, thanks for reporting.
A larger batch size means a larger workload per inference call. That extra work can raise occupancy on each SM if the GPU is "hungry" (underutilized) at small batch sizes. However, if batch size 1 already provides enough work to fully occupy the device, increasing the batch size gives no further performance gain.
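One way to check this is to compare throughput in frames per second rather than per-batch latency. If the latency of a batch of N is about N times the batch-1 latency, throughput is flat and the device was probably already saturated at batch size 1. A minimal sketch follows. The function names are hypothetical, and the latencies in the usage lines are illustrative values, not measurements from this thread.

```python
def throughput_fps(batch_size: int, latency_s: float) -> float:
    """Frames processed per second for one batch size."""
    if batch_size < 1 or latency_s <= 0:
        raise ValueError("batch_size must be >= 1 and latency_s must be > 0")
    return batch_size / latency_s


def batch_scaling(latency_b1_s: float, latency_bn_s: float, batch_size: int) -> float:
    """Throughput at batch_size divided by throughput at batch size 1.

    A value near 1.0 means batching brought no gain.
    A value near batch_size means near-ideal scaling.
    """
    return throughput_fps(batch_size, latency_bn_s) / throughput_fps(1, latency_b1_s)


# Illustrative numbers only: batch-5 latency is exactly 5x the batch-1 latency,
# so throughput does not change and the ratio comes out as 1.0.
flat = batch_scaling(latency_b1_s=0.25, latency_bn_s=1.25, batch_size=5)

# Illustrative numbers only: batch-5 latency is only 2x the batch-1 latency,
# so the batch processes 2.5x as many frames per second.
gain = batch_scaling(latency_b1_s=0.25, latency_bn_s=0.5, batch_size=5)
```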
Could you provide an Nsight Compute dump for both batch size 1 and batch size 5 for further triage? Thanks.
Please use the 'nvidia-smi' command to check GPU-Util as you increase the batch size. If GPU-Util is already close to 100%, a larger batch size will not give any speedup.
If the GPU is not busy (below 90%), you may want to check your preprocessing pipeline.
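The check above can be scripted while inference is running. This is a minimal sketch that reads GPU-Util through nvidia-smi's CSV query output. The helper names are hypothetical, and the 90% threshold is simply the figure mentioned above. The snippet assumes nvidia-smi is on the PATH and skips any line that is not a plain integer.

```python
import subprocess
from typing import List


def parse_gpu_util(csv_text: str) -> List[int]:
    """Parse one utilization percentage per line, one line per GPU."""
    values = []
    for line in csv_text.splitlines():
        token = line.strip()
        if token.isdigit():  # skip blanks and non-numeric entries
            values.append(int(token))
    return values


def query_gpu_util() -> List[int]:
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_util(result.stdout)


def is_busy(util_percent: int, threshold: int = 90) -> bool:
    """True if utilization is at or above the threshold (90% by default)."""
    return util_percent >= threshold


# Example use while the batched inference loop runs in another process:
# for idx, util in enumerate(query_gpu_util()):
#     print(f"GPU {idx}: {util}% busy={is_busy(util)}")
```

A single reading can be misleading, so sampling several times during a run gives a better picture.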
Closing since there has been no response for a long time. Please reopen if you still have questions. Thanks!