Hi there, I have a question about the nms operator.
If I use torchvision.ops.nms to filter bboxes with the boxes and scores placed on the GPU, about 900MB of GPU memory is used. There is no such problem if the boxes and scores stay on the CPU. Meanwhile, the GPU version takes about 0.0007s and the CPU version about 0.0018s.
I do not understand why this operator uses so much GPU memory. Is there any nms configuration to reduce GPU memory usage?
My torchvision version is 0.4.0. Thanks~
@ThomsonW it may depend on how you measure GPU memory. For example, with nvidia-smi you also see the allocated CUDA context, which is not related to the requested memory size.
Please provide more details (code snippet, pytorch/torchvision/cuda versions, how you measure the memory) if you'd like more support on this.
You can also take a look at the PyTorch forum for similar questions: https://discuss.pytorch.org/search?q=GPU%20memory
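For instance, a quick way to separate PyTorch's own allocations from what nvidia-smi reports (a minimal sketch, assuming a torch version recent enough to provide torch.cuda.memory_reserved):

import torch

t = torch.rand(1).cuda()                 # the first CUDA op creates the CUDA context
print(torch.cuda.memory_allocated())     # bytes currently held by tensors
print(torch.cuda.memory_reserved())      # bytes reserved by the caching allocator
# nvidia-smi will additionally report several hundred MB for the context itself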
Hi @vfdev-5, thanks very much for your quick reply.
Indeed, this is the post-processing of RetinaFace:
bboxes = pred[:,:4]
scores = pred[:,4]
landms = pred[:,5:]
The CPU version is:
bboxes = torch.Tensor(bboxes)
scores = torch.Tensor(scores)
indices = torchvision.ops.nms(bboxes, scores, iou_threshold=0.4)
r_bboxes = bboxes[indices, :]
r_scores = scores[indices]
....
And the GPU version is:
bboxes = torch.Tensor(bboxes).cuda()
scores = torch.Tensor(scores).cuda()
indices = torchvision.ops.nms(bboxes, scores, iou_threshold=0.4).cpu()
r_bboxes = bboxes[indices, :].cpu()
r_scores = scores[indices].cpu()
....
That is the only difference between the two versions.
The version info:
python 3.6.8
torch 1.2.0
torchvision 0.4.0
cuda 10.2.89
GPU RTX2080 Ti
How I measure the GPU memory:
I use nvidia-smi and look at the process's GPU memory usage by PID.
If I only run inference, 630MB of GPU memory is used. If I then run nms in the CPU version, the usage does not change. But if I switch to the GPU version, the GPU memory usage grows to 1500MB, about 870MB more than the CPU version.
The time cost also differs: the CPU version takes about 0.0018s while the GPU version takes about 0.0007s.
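A rough way to time both versions fairly would be something like this (a sketch; torch.cuda.synchronize() is called so the GPU number measures the kernel itself, not just the asynchronous launch):

import time
import torch
import torchvision

def time_nms(bboxes, scores, iters=100):
    # warm-up call so one-time initialization is excluded
    torchvision.ops.nms(bboxes, scores, iou_threshold=0.4)
    if bboxes.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torchvision.ops.nms(bboxes, scores, iou_threshold=0.4)
    if bboxes.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters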
If you'd like to trace memory allocated by PyTorch only, please try torch.cuda.memory_allocated(). Maybe updating to the latest version could reduce memory consumption. Here is my reproduction of what you describe with torch 1.7.0 and torchvision 0.8.1:
!nvidia-smi -i 0
> 0 GeForce GTX 108... On | 00000000:02:00.0 Off
> 0% 31C P8 11W / 280W | 4MiB / 11178MiB
import torch
import torchvision
torch.__version__, torchvision.__version__
> ('1.7.0', '0.8.1')
# See CUDA context
t = torch.rand(1).cuda()
!nvidia-smi -i 0
> 0% 32C P2 58W / 280W | 599MiB / 11178MiB
# Torch CUDA mem allocated in bytes
torch.cuda.memory_allocated()
> 512
preds = torch.rand(12, 5)
bboxes = torch.Tensor(preds[:, :4]).cuda()
scores = torch.Tensor(preds[:, 4]).cuda()
indices = torchvision.ops.nms(bboxes, scores, iou_threshold=0.4).cpu()
# Torch CUDA mem allocated in bytes
torch.cuda.memory_allocated()
> 1536
!nvidia-smi -i 0
> 0% 33C P2 58W / 280W | 599MiB / 11178MiB
I traced the memory PyTorch allocates during nms, which is ~6000 bytes, but the GPU memory usage reported by nvidia-smi changed a lot as I said.
It is very confusing; the only thing I changed is putting the data into CUDA memory.
At least this does not look like torchvision's problem.
Thank you very much~~
@ThomsonW could you please provide a fully executable code snippet so that I can reproduce it on my side with the latest pytorch/torchvision, together with the measurements and values you get, in order to compare. Thanks
@vfdev-5 sorry for the late reply.
The issue is located in RetinafaceTRT.post_process.
The libdecodeplugin.so and retina_mnet.engine files are generated with the TensorRT C++ API.
The whole code is modified from https://github.com/wang-xinyu/tensorrtx
The code is here:
import ctypes
import os
import random
import sys
import threading
import time
import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import torchvision
INPUT_W = 640
INPUT_H = 480
IOU_THRESHOLD = 0.4
VIS_THRESHOLD = 0.6
class RetinafaceTRT(object):
    def __init__(self, engine_file_path):
        self.cfx = cuda.Device(0).make_context()
        stream = cuda.Stream()
        TRT_LOGGER = trt.Logger(trt.Logger.INFO)
        runtime = trt.Runtime(TRT_LOGGER)

        with open(engine_file_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        context = engine.create_execution_context()

        host_inputs = []
        cuda_inputs = []
        host_outputs = []
        cuda_outputs = []
        bindings = []
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(cuda_mem))
            if engine.binding_is_input(binding):
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)

        self.stream = stream
        self.context = context
        self.engine = engine
        self.host_inputs = host_inputs
        self.cuda_inputs = cuda_inputs
        self.host_outputs = host_outputs
        self.cuda_outputs = cuda_outputs
        self.bindings = bindings
    def infer(self, input_image_path, is_imgSave=False):
        self.cfx.push()
        stream = self.stream
        context = self.context
        engine = self.engine
        host_inputs = self.host_inputs
        cuda_inputs = self.cuda_inputs
        host_outputs = self.host_outputs
        cuda_outputs = self.cuda_outputs
        bindings = self.bindings

        input_image, image_raw, origin_h, origin_w = self.preprocess_image(input_image_path)
        np.copyto(host_inputs[0], input_image.ravel())
        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
        context.execute_async(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
        stream.synchronize()
        self.cfx.pop()

        output = host_outputs[0]
        result_boxes, result_scores, result_landm = self.post_process(output, origin_h, origin_w)
        if is_imgSave:
            for i in range(len(result_boxes)):
                box = result_boxes[i]
                landm = result_landm[i]
                score = result_scores[i]
                if score < VIS_THRESHOLD:
                    continue
                plot_one_box(box, landm, image_raw, label="{}:{:.2f}".format("score", result_scores[i]))
            parent, filename = os.path.split(input_image_path)
            save_name = os.path.join(parent, "output_" + filename)
            cv2.imwrite(save_name, image_raw)
        else:
            det = {"bbox": [], "score": [], "landmark": []}
            for idx, score in enumerate(result_scores):
                if score >= VIS_THRESHOLD:
                    det["score"].append(score.numpy().tolist())
                    det["bbox"].append(result_boxes[idx, :].numpy().tolist())
                    det["landmark"].append(result_landm[idx, :].tolist())
            return det
    def destroy(self):
        self.cfx.pop()

    def preprocess_image(self, input_image_path):
        image_raw = cv2.imread(input_image_path)
        h, w, c = image_raw.shape
        # image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)
        image = image_raw
        r_w = INPUT_W / w
        r_h = INPUT_H / h
        if r_h > r_w:
            tw = INPUT_W
            th = int(r_w * h)
            tx1 = tx2 = 0
            ty1 = int((INPUT_H - th) / 2)
            ty2 = INPUT_H - th - ty1
        else:
            tw = int(r_h * w)
            th = INPUT_H
            tx1 = int((INPUT_W - tw) / 2)
            tx2 = INPUT_W - tw - tx1
            ty1 = ty2 = 0
        image = cv2.resize(image, (tw, th))
        image = cv2.copyMakeBorder(image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, (128, 128, 128))
        image = image.astype(np.float32)
        image -= (104, 117, 123)
        image = np.transpose(image, [2, 0, 1])
        image = np.expand_dims(image, axis=0)
        image = np.ascontiguousarray(image)
        return image, image_raw, h, w

    def scaleback(self, origin_h, origin_w, boxes, landms):
        r_w = INPUT_W / origin_w
        r_h = INPUT_H / origin_h
        if r_h > r_w:
            boxes[:, 1] -= (INPUT_H - r_w * origin_h) / 2
            boxes[:, 3] -= (INPUT_H - r_w * origin_h) / 2
            boxes /= r_w
            landms[:, 1] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 3] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 5] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 7] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 9] -= (INPUT_H - r_w * origin_h) / 2
            landms /= r_w
        else:
            boxes[:, 0] -= (INPUT_W - r_h * origin_w) / 2
            boxes[:, 2] -= (INPUT_W - r_h * origin_w) / 2
            boxes /= r_h
            landms[:, 0] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 2] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 4] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 6] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 8] -= (INPUT_W - r_h * origin_w) / 2
            landms /= r_h
        return boxes, landms
    def post_process(self, output, origin_h, origin_w):
        num = int(output[0])
        pred = np.reshape(output[1:num * 15 + 1], (-1, 15))
        pred = pred[np.where(pred[:, 4] > 0.5)[0], :]
        boxes = pred[:, :4]
        scores = pred[:, 4]
        landms = pred[:, 5:]
        boxes = torch.Tensor(boxes)
        scores = torch.Tensor(scores)
        indices = torchvision.ops.nms(boxes, scores, iou_threshold=IOU_THRESHOLD)
        result_boxes = boxes[indices, :]
        result_scores = scores[indices]
        result_landms = landms[indices, :]
        result_boxes, result_landms = self.scaleback(origin_h, origin_w, result_boxes, result_landms)
        return result_boxes, result_scores, result_landms
if __name__ == "__main__":
    PLUGIN_LIBRARY = "lib/libdecodeplugin.so"
    ctypes.CDLL(PLUGIN_LIBRARY)
    engine_file_path = "lib/retina_mnet.engine"

    retinaface_wrapper = RetinafaceTRT(engine_file_path)
    retinaface_wrapper.infer("/root/Desktop/worlds-largest-selfie.jpg")
    retinaface_wrapper.destroy()
@ThomsonW OK, now I understand better what you are doing. However, I think that pycuda and torch instantiate two different CUDA contexts, and that is what you see when you pass a torch tensor to CUDA. I checked this quickly locally and it does look like that:
!nvidia-smi -i 0
> 0% 29C P8 14W / 280W | 4MiB / 11178MiB
import pycuda.driver as cuda
import pycuda.autoinit
!nvidia-smi -i 0
> 0% 31C P2 58W / 280W | 139MiB / 11178MiB
import numpy
a = numpy.random.randn(16, 4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
!nvidia-smi -i 0
> 0% 31C P2 58W / 280W | 141MiB / 11178MiB
import torch
preds = torch.rand(12, 5)
bboxes = torch.Tensor(preds[:, :4]).cuda()
scores = torch.Tensor(preds[:, 4]).cuda()
!nvidia-smi -i 0
> 0% 33C P2 58W / 280W | 736MiB / 11178MiB
I'm not sure if we can reuse the same CUDA context for torch. @ptrblck any thoughts?
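One untested direction (a sketch only, based on pycuda's documented Device.retain_primary_context(); I have not checked whether TensorRT works correctly this way): instead of cuda.Device(0).make_context(), retain the device's primary context, which is the one the CUDA runtime, and therefore PyTorch, uses, so both libraries would share a single context.

import pycuda.driver as cuda
import torch

cuda.init()
dev = cuda.Device(0)
ctx = dev.retain_primary_context()  # primary context instead of a brand-new one
ctx.push()
# ... pycuda / TensorRT work would run in the primary context here ...
t = torch.rand(1).cuda()            # PyTorch initializes CUDA on the same primary context
ctx.pop()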
Anyway, let me close this issue as unrelated to the torchvision library. Feel free to ask more questions if needed.