Vision: torchvision.ops.nms uses too much GPU memory

Created on 1 Dec 2020 · 7 comments · Source: pytorch/vision

Hi there, I have a question about the nms operator.
If I use torchvision.ops.nms to filter bboxes with the input boxes and scores placed on the GPU, about 900 MB of GPU memory is used. There is no such problem if the boxes and scores stay on the CPU. Meanwhile, the GPU version takes about 0.0007 s and the CPU version about 0.0018 s.
I do not understand why this operator uses so much GPU memory. Is there any configuration for nms to save GPU memory?

My torchvision version is 0.4.0. Thanks~

question


All 7 comments

@ThomsonW it may depend on how you measure GPU memory. For example, with nvidia-smi you also see the allocated CUDA context, which is not related to the requested memory size.
Please provide more details (code snippet, pytorch/torchvision/CUDA versions, how you measure the memory) if you'd like more support on that.
You can also take a look at the PyTorch forum for similar questions: https://discuss.pytorch.org/search?q=GPU%20memory
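
For reference, a minimal sketch contrasting the two kinds of measurement (assuming a recent PyTorch; memory_reserved() was called memory_cached() in older releases):

import subprocess
import torch

x = torch.rand(1000, 4, device="cuda")      # the first CUDA op also initializes the CUDA context
print(torch.cuda.memory_allocated())        # bytes held by live tensors only
print(torch.cuda.memory_reserved())         # bytes reserved by PyTorch's caching allocator
# nvidia-smi reports context + reserved memory for the whole process, a much larger number
print(subprocess.run(["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
                     capture_output=True, text=True).stdout)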

Hi @vfdev-5, thanks very much for your quick reply.

Indeed, this is the post-processing of RetinaFace.

bboxes = pred[:,:4]
scores = pred[:,4]
landms = pred[:,5:]

The CPU version is:

bboxes = torch.Tensor(bboxes)
scores = torch.Tensor(scores)
indices = torchvision.ops.nms(bboxes, scores, iou_threshold=0.4)
r_bboxes = bboxes[indices, :]
r_scores = scores[indices]
....

And the GPU version is:

bboxes = torch.Tensor(bboxes).cuda()
scores = torch.Tensor(scores).cuda()
indices = torchvision.ops.nms(bboxes, scores, iou_threshold=0.4).cpu()
r_bboxes = bboxes[indices, :].cpu()
r_scores = scores[indices].cpu()
....

That is the only difference between the two versions.

The version info:
python 3.6.8
torch 1.2.0
torchvision 0.4.0
cuda 10.2.89
GPU RTX2080 Ti

How I measure the GPU memory:
I use nvidia-smi and look at the per-process GPU memory usage by PID.
If I only do inference, 630 MB of GPU memory is used. If I run NMS with the CPU version, the usage does not change. But if I switch my code to the GPU version, the GPU memory usage rises to about 1500 MB, 870 MB more than the CPU version.
Also, the time cost is different: the CPU version takes about 0.0018 s while the GPU version takes about 0.0007 s.
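
As an aside, CUDA kernels launch asynchronously, so a fair timing of the GPU path should synchronize before stopping the clock. A minimal sketch, with hypothetical random boxes standing in for the decoded detections:

import time
import torch
import torchvision

xy = torch.rand(1000, 2, device="cuda") * 100
boxes = torch.cat([xy, xy + torch.rand(1000, 2, device="cuda") * 50], dim=1)  # valid x1, y1, x2, y2
scores = torch.rand(1000, device="cuda")

torch.cuda.synchronize()                     # finish any pending GPU work first
start = time.perf_counter()
keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.4)
torch.cuda.synchronize()                     # wait for the NMS kernel before reading the clock
print(time.perf_counter() - start)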

If you'd like to trace only the memory allocated by PyTorch, please try torch.cuda.memory_allocated(). Maybe updating to the latest version could reduce memory consumption. Here is my reproduction of what you describe, with torch 1.7.0 and torchvision 0.8.1:

!nvidia-smi -i 0
> 0  GeForce GTX 108...  On   | 00000000:02:00.0 Off
> 0%   31C    P8    11W / 280W |      4MiB / 11178MiB

import torch
import torchvision
torch.__version__, torchvision.__version__
> ('1.7.0', '0.8.1')

# See CUDA context
t = torch.rand(1).cuda()
!nvidia-smi -i 0
> 0%   32C    P2    58W / 280W |    599MiB / 11178MiB

# Torch CUDA mem allocated in bytes
torch.cuda.memory_allocated()
> 512

preds = torch.rand(12, 5)
bboxes = torch.Tensor(preds[:, :4]).cuda()
scores = torch.Tensor(preds[:, 4]).cuda()
indices = torchvision.ops.nms(bboxes, scores, iou_threshold=0.4).cpu()
# Torch CUDA mem allocated in bytes
torch.cuda.memory_allocated()
> 1536

!nvidia-smi -i 0
> 0%   33C    P2    58W / 280W |    599MiB / 11178MiB

I traced the memory PyTorch allocates during nms, which is only ~6000 bytes. But the overall GPU memory usage changes a lot, as I said.
It is very confusing; the only thing I changed is putting the data into CUDA memory.
At least this is not torchvision's problem.
Thank you very much~~

@ThomsonW could you please provide a fully executable code snippet, so that I can reproduce it on my side with the latest pytorch/torchvision and compare against the measurements and values you get? Thanks

@vfdev-5 sorry for the late reply.
The issue is located in RetinafaceTRT.post_process.
The libdecodeplugin.so and retina_mnet.engine are generated with the TensorRT C++ API.
The whole code is modified from https://github.com/wang-xinyu/tensorrtx
The code is here:

import ctypes
import os
import random
import sys
import threading
import time

import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import torchvision

INPUT_W = 640
INPUT_H = 480

IOU_THRESHOLD = 0.4
VIS_THRESHOLD = 0.6

class RetinafaceTRT(object):
    def __init__(self, engine_file_path):
        self.cfx = cuda.Device(0).make_context()
        stream = cuda.Stream()
        TRT_LOGGER = trt.Logger(trt.Logger.INFO)
        runtime = trt.Runtime(TRT_LOGGER)

        with open(engine_file_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        context = engine.create_execution_context()

        host_inputs = []
        cuda_inputs = []
        host_outputs = []
        cuda_outputs = []
        bindings = []

        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(cuda_mem))
            if engine.binding_is_input(binding):
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)

        self.stream = stream
        self.context = context
        self.engine = engine
        self.host_inputs = host_inputs
        self.cuda_inputs = cuda_inputs
        self.host_outputs = host_outputs
        self.cuda_outputs = cuda_outputs
        self.bindings = bindings

    def infer(self, input_image_path, is_imgSave=False):
        self.cfx.push()
        stream = self.stream
        context = self.context
        engine = self.engine
        host_inputs = self.host_inputs
        cuda_inputs = self.cuda_inputs
        host_outputs = self.host_outputs
        cuda_outputs = self.cuda_outputs
        bindings = self.bindings

        input_image, image_raw, origin_h, origin_w = self.preprocess_image(input_image_path)
        np.copyto(host_inputs[0], input_image.ravel())
        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)

        context.execute_async(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
        stream.synchronize()
        self.cfx.pop()

        output = host_outputs[0]
        result_boxes, result_scores, result_landm = self.post_process(output, origin_h, origin_w)
        if is_imgSave:
            for i in range(len(result_boxes)):
                box = result_boxes[i]
                landm = result_landm[i]
                score = result_scores[i]
                if score < VIS_THRESHOLD:
                    continue
                plot_one_box(box, landm, image_raw, label="{}:{:.2f}".format("score", result_scores[i]))

            parent, filename = os.path.split(input_image_path)
            save_name = os.path.join(parent, "output_" + filename)
            cv2.imwrite(save_name, image_raw)
        else:
            det = { "bbox": [], "score": [], "landmark": []}
            for idx, score in enumerate(result_scores):
                if score >= VIS_THRESHOLD:
                    det["score"].append(score.numpy().tolist())
                    det["bbox"].append(result_boxes[idx,:].numpy().tolist())
                    det["landmark"].append(reslut_landm[idx,:].tolist())
            return det

    def destroy(self):
        self.cfx.pop()

    def preprocess_image(self, input_image_path):
        image_raw = cv2.imread(input_image_path)
        h, w, c = image_raw.shape
        # image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)
        image = image_raw
        r_w = INPUT_W / w
        r_h = INPUT_H / h
        if r_h > r_w:
            tw = INPUT_W
            th = int(r_w * h)
            tx1 = tx2 = 0
            ty1 = int((INPUT_H - th) / 2)
            ty2 = INPUT_H - th - ty1
        else:
            tw = int(r_h * w)
            th = INPUT_H
            tx1 = int((INPUT_W - tw) / 2)
            tx2 = INPUT_W - tw - tx1
            ty1 = ty2 = 0
        image = cv2.resize(image, (tw, th))
        image = cv2.copyMakeBorder(image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, (128, 128, 128))
        image = image.astype(np.float32)
        image -= (104, 117, 123)
        image = np.transpose(image, [2, 0, 1])
        image = np.expand_dims(image, axis=0)
        image = np.ascontiguousarray(image)
        return image, image_raw, h, w

    def scaleback(self, origin_h, origin_w, boxes, landms):
        r_w = INPUT_W / origin_w
        r_h = INPUT_H / origin_h
        if r_h > r_w:
            boxes[:, 1] -= (INPUT_H - r_w * origin_h) / 2
            boxes[:, 3] -= (INPUT_H - r_w * origin_h) / 2
            boxes /= r_w

            landms[:, 1] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 3] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 5] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 7] -= (INPUT_H - r_w * origin_h) / 2
            landms[:, 9] -= (INPUT_H - r_w * origin_h) / 2
            landms /= r_w
        else:
            boxes[:, 0] -= (INPUT_W - r_h * origin_w) / 2
            boxes[:, 2] -= (INPUT_W - r_h * origin_w) / 2
            boxes /= r_h

            landms[:, 0] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 2] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 4] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 6] -= (INPUT_W - r_h * origin_w) / 2
            landms[:, 8] -= (INPUT_W - r_h * origin_w) / 2
            landms /= r_h
        return boxes, landms

    def post_process(self, output, origin_h, origin_w):
        num = int(output[0])
        pred = np.reshape(output[1:num*15+1], (-1, 15))
        pred = pred[np.where(pred[:,4]>0.5)[0], :]

        boxes = pred[:, :4]
        scores = pred[:, 4]
        landms = pred[:, 5:]

        boxes = torch.Tensor(boxes)
        scores = torch.Tensor(scores)
        indices = torchvision.ops.nms(boxes, scores, iou_threshold=IOU_THRESHOLD)


        result_boxes = boxes[indices, :]
        result_scores = scores[indices]
        result_landms = landms[indices,:]

        result_boxes, result_landms = self.scaleback(origin_h, origin_w, result_boxes, result_landms)
        return result_boxes, result_scores, result_landms


if __name__ == "__main__":
    PLUGIN_LIBRARY = "lib/libdecodeplugin.so"
    ctypes.CDLL(PLUGIN_LIBRARY)
    engine_file_path = "lib/retina_mnet.engine"
    retinaface_wrapper = RetinafaceTRT(engine_file_path)
    retinaface_wrapper.infer("/root/Desktop/worlds-largest-selfie.jpg")
    retinaface_wrapper.destroy()

@ThomsonW OK, now I understand better what you are doing. However, I think that pycuda and torch instantiate two different CUDA contexts, and that is why you see the jump when you pass a torch tensor to CUDA. I checked quickly locally and it looks like this:

!nvidia-smi -i 0
>  0%   29C    P8    14W / 280W |      4MiB / 11178MiB

import pycuda.driver as cuda
import pycuda.autoinit

!nvidia-smi -i 0
> 0%   31C    P2    58W / 280W |    139MiB / 11178MiB

import numpy

a = numpy.random.randn(16, 4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

!nvidia-smi -i 0
> 0%   31C    P2    58W / 280W |    141MiB / 11178MiB

import torch

preds = torch.rand(12, 5)
bboxes = torch.Tensor(preds[:, :4]).cuda()
scores = torch.Tensor(preds[:, 4]).cuda()

!nvidia-smi -i 0
> 0%   33C    P2    58W / 280W |    736MiB / 11178MiB

I'm not sure if we can reuse the pycuda CUDA context for torch. @ptrblck any thoughts?

Anyway, let me close this issue as unrelated to the torchvision library. Feel free to ask more questions if needed.
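
To underline that conclusion, a small sketch (with hypothetical random boxes) showing that repeated nms calls barely change torch.cuda.memory_allocated(); the large figure reported by nvidia-smi is the one-time torch CUDA context, not memory requested by nms:

import torch
import torchvision

xy = torch.rand(12, 2, device="cuda") * 100
boxes = torch.cat([xy, xy + torch.rand(12, 2, device="cuda") * 50], dim=1)
scores = torch.rand(12, device="cuda")

before = torch.cuda.memory_allocated()
for _ in range(100):
    keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.4)
print(torch.cuda.memory_allocated() - before)   # stays tiny; nms itself allocates almost nothing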
