TensorRT: No speedup on batch size larger than 1

Created on 3 Jul 2020 · 3 Comments · Source: NVIDIA/TensorRT

I've set up TensorRT to work on my YOLOv3 model, where I'm running inference on each frame of a video stream. When I run with a single video stream and process each frame one at a time, the TensorRT version of the model gets a solid speedup over the regular model (going from 43 fps to 57 fps). However, when I process frames in larger batches, for example 5 different videos with 1 frame from each video batched together into a batch of size 5, I don't see any speedup with TensorRT.

I'm trying to understand why I see a speedup with a batch size of 1 but not with a batch size of 5. Any ideas why this might be happening, or what I can look into to improve batch performance? I'm running with float32 but would still expect a speedup at larger batch sizes with the TensorRT model.
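When comparing batch sizes, it helps to put both runs on the same footing by counting total frames per second rather than batches per second. The snippet below is only a sketch of that arithmetic; the helper names `frames_per_second` and `speedup` are made up for illustration and are not part of TensorRT.

```python
def frames_per_second(batch_size: int, seconds_per_batch: float) -> float:
    """Total frames processed per second, summed over every frame in the batch."""
    if seconds_per_batch <= 0:
        raise ValueError("seconds_per_batch must be positive")
    return batch_size / seconds_per_batch


def speedup(baseline_fps: float, optimized_fps: float) -> float:
    """Ratio of optimized throughput to baseline throughput."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return optimized_fps / baseline_fps


# Figures quoted in the question for batch size 1 (regular model vs TensorRT).
print(f"batch 1 speedup: {speedup(43.0, 57.0):.2f}x")  # prints: batch 1 speedup: 1.33x
```

For the batched case, the same `speedup` call would take the two `frames_per_second(5, ...)` values measured for the regular and TensorRT models, so a batch of 5 that takes 0.5 s counts as 10 frames per second.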

Here is an outline of my steps for creating and running the tensorRT engine:

1. Export the YOLO model to ONNX using torch.onnx.export with the dynamic batches param
2. Convert ONNX to a TensorRT engine
   - parse the ONNX model
   - create a single optimization profile for a specific batch size: profile.set_shape(inp.name, min=(batch_size, *shape), opt=(batch_size, *shape), max=(batch_size, *shape))
   - build the engine
3. Load the TensorRT engine + context
   - select the right TensorRT engine based on the input batch size passed to the inference function
   - set the binding shape: context.set_binding_shape(0, (BATCH_SIZE, 3, IMAGE_SIZE))
   - set the optimization profile: context.active_optimization_profile = 0
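One thing that can be checked without a GPU is whether the shape tuples passed to the profile and to the binding actually match. The sketch below uses plain Python tuples only. `profile_shapes` is a hypothetical helper, and the 416x416 input size is an assumption for illustration, not a value from the post.

```python
from typing import Dict, Tuple


def profile_shapes(batch_size: int, shape: Tuple[int, ...]) -> Dict[str, Tuple[int, ...]]:
    """Build identical min/opt/max shapes for a fixed-batch optimization profile."""
    full = (batch_size, *shape)
    return {"min": full, "opt": full, "max": full}


BATCH_SIZE = 5
CHANNELS = 3
IMAGE_SIZE = (416, 416)  # assumed (height, width); not stated in the post

shapes = profile_shapes(BATCH_SIZE, (CHANNELS, *IMAGE_SIZE))

# If IMAGE_SIZE is a tuple, it has to be unpacked to get a flat 4-D shape.
flat_binding = (BATCH_SIZE, CHANNELS, *IMAGE_SIZE)
nested_binding = (BATCH_SIZE, CHANNELS, IMAGE_SIZE)

print(flat_binding == shapes["opt"])    # True
print(nested_binding == shapes["opt"])  # False: third element is itself a tuple
```

If `IMAGE_SIZE` in the original code is already unpacked elsewhere, or the real binding shape is built differently, this check does not apply. It is only meant to show that the binding shape should equal the profile's shape for that batch size.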

Not sure if there's anything else I should be doing but these steps seem to be fine for handling inference with larger batch sizes. I'm running this on the latest TensorRT version.

Performance question triaged

prathik-naidu

Most helpful comment

Please use the 'nvidia-smi' command to check GPU-Util as you increase the batch size. If GPU-Util is already close to 100%, a larger batch size will not give any speedup.
If the GPU is not busy (less than 90%), you may want to check your preprocessing pipeline.
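The check described above can be scripted so that it runs alongside the inference loop. This is a rough sketch: it relies on the `--query-gpu=utilization.gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`, while the function names and the 90% cutoff (taken from the comment above) are illustrative choices rather than an official recipe.

```python
import subprocess
from typing import List


def parse_gpu_util(output: str) -> List[int]:
    """Parse one utilization percentage per GPU; skip non-numeric lines such as '[N/A]'."""
    return [int(line.strip()) for line in output.splitlines() if line.strip().isdigit()]


def query_gpu_util() -> List[int]:
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_util(result.stdout)


def likely_bottleneck(util_percent: int, threshold: int = 90) -> str:
    """Rule of thumb from the comment above: a busy GPU means batching will not help much."""
    return "gpu-bound" if util_percent >= threshold else "check-input-pipeline"
```

Calling `query_gpu_util()` a few times while batch size 1 and batch size 5 are running, and passing each reading to `likely_bottleneck`, gives a quick first indication of whether the GPU or the preprocessing side is the limit. A single sample can be noisy, so averaging several readings is probably safer.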

zhangkui669 on 13 Nov 2020

All 3 comments

Hello @prathik-naidu, thanks for reporting.

A larger batch size means a larger workload, and the increased workload can raise occupancy on each SM if the GPU is "hungry" at a small batch size. However, if there is already enough work to fully occupy the device at batch size one, then there is no further perf gain from increasing the batch size.
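A deliberately crude toy model may make this saturation argument concrete. It assumes the device retires a fixed number of work units per time step and that each frame needs a fixed number of units. Both numbers below are invented for illustration, and real GPU scheduling is far more complicated.

```python
def batch_latency(batch_size: int, work_per_frame: int, device_capacity: int) -> int:
    """Time steps needed for one batch if the device retires device_capacity units per step."""
    total_work = batch_size * work_per_frame
    return -(-total_work // device_capacity)  # integer ceiling division


def throughput(batch_size: int, work_per_frame: int, device_capacity: int) -> float:
    """Frames completed per time step for a given batch size."""
    return batch_size / batch_latency(batch_size, work_per_frame, device_capacity)


CAPACITY = 100  # made-up units of parallel work per time step

# "Hungry" GPU: one frame uses only a quarter of the device, so batching helps up to a point.
print([throughput(b, 25, CAPACITY) for b in (1, 4, 8)])   # [1.0, 4.0, 4.0]

# Saturated GPU: one frame already fills the device, so batching changes nothing.
print([throughput(b, 100, CAPACITY) for b in (1, 5)])     # [1.0, 1.0]
```

In this model, throughput stops growing as soon as one batch fills the device, which matches the flat result described for a GPU that is already fully occupied at batch size one.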

Could you provide Nsight Compute dumps for both batch size 1 and batch size 5 for further triage? Thanks.