Hello!
I use Darknet YOLO for object detection and it works very well. Unfortunately it's very slow on the CPU! I can make Darknet.exe run on the GPU, but not from Python.
import cv2
import numpy as np

net = cv2.dnn.readNet("dark/yolov3.weights", "dark/yolov3.cfg")
classes = []
with open("dark/coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
colors = np.random.uniform(0, 255, size=(len(classes), 3))
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_OPENCL_FP16)
Output:
OpenCV(ocl4dnn): consider to specify kernel configuration cache directory
via OPENCV_OCL4DNN_CONFIG_PATH parameter.
OpenCL program build log: dnn/dummy
Status -11: CL_BUILD_PROGRAM_FAILURE
-cl-no-subgroup-ifp
Error in processing command line: Don't understand command line argument "-cl-no-subgroup-ifp"!
The execution doesn't crash, but it's the CPU that ends up doing the calculations.
Can you help? Thanks!
@lucaspojo what is your hardware configuration?
It's important because if you are using an NVIDIA card, it's better to use the CUDA backend/target.
@BadMachine Sorry, my config:
CPU: i7 9700K
GPU: GeForce RTX 2060
RAM: 16GB DDR4
CUDA version: v10.2
@lucaspojo then make sure that your OpenCV is built with CUDA support and try using
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
@BadMachine I compiled OpenCV with CUDA and it works, thank you!
On the other hand, the performance gain is very small:
-> CPU -> 4 FPS -> 80% usage
-> GPU -> 11 FPS -> 11% usage
@lucaspojo
My config:
GPU: GeForce RTX 2060 SUPER
RAM: 16GB DDR4
CUDA version: v10.1
First of all, for monitoring your GPU use specialized software (e.g. ASUS GPU Tweak).
Are you sure you are compiling your project in release mode?
My FPS:
debug: 12-15
release: 20-25
Compile my project? I use Python, so there is no compiling; I must be misunderstanding.
I made you a video: https://www.youtube.com/watch?v=ea13TxSe3rc
I don't understand: with Darknet.exe I can get up to 100 FPS with yolov3.weights.
@lucaspojo sorry, I forgot you are using Python.
Can you share a bit more of the code with the DNN forwarding and the FPS calculation?
I do not know what "Darknet.exe" is, but my bet is on pauses and a lack of parallel threads.
GitHub of darknetYolo: https://github.com/lucaspojo/darknet
Demo of darknetYolo with my current config: https://www.youtube.com/watch?v=2qyGW-fyxV0
The video demo file is 1280x720 with ~60FPS and with my Python test it's 200x200 with 15FPS
My code : https://pastebin.com/vkcZxrPS
Thank you for taking the time to answer me :)
My guess is that the original code is optimized with multithreading.
@lucaspojo could you replace
cv2.waitKey(1)
with
cv2.waitKey(0)
and test?
@lucaspojo can you try running this code and report what FPS you get for 416x416:
import cv2
import numpy as np
import time

confidence_threshold = 0.5
nms_threshold = 0.4
num_classes = 80

net = cv2.dnn.readNet("yolov3.cfg", "yolov3.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

frame = np.random.randint(255, size=(416, 416, 3), dtype=np.uint8)  # put your image here!
blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), [0, 0, 0], True, False)

# warmup
for i in range(3):
    net.setInput(blob)
    detections = net.forward(net.getUnconnectedOutLayersNames())

# benchmark
start = time.time()
for i in range(100):
    net.setInput(blob)
    detections = net.forward(net.getUnconnectedOutLayersNames())
end = time.time()

ms_per_image = (end - start) * 1000 / 100
print("Time per inference: %f ms" % (ms_per_image))
print("FPS: ", 1000.0 / ms_per_image)
I get 20FPS on GTX 1050. RTX GPUs should be capable of hitting 100FPS in FP16.
@YashasSamaga I get ~45 FPS with your code
@BadMachine cv2.waitKey(0) does nothing =/
@YashasSamaga, I think it's because of NMS. @lucaspojo, try to set nms threshold to zero in the .cfg file.
@lucaspojo How are you measuring darknet fps? What is the build configuration (CUDNN, CUDNN_HALF) you used?
How did you build OpenCV? Can you upload your CMakeCache.txt?
Set nms_threshold=0 in all yolo blocks in network configuration.
You can get some more speed by using DNN_TARGET_CUDA_FP16.
I get better performance with DNN_TARGET_CUDA_FP16.
@YashasSamaga I can't find nms_threshold. Is it in yolo.cfg?
@lucaspojo You need to add nms_threshold=0 entry at the end of every [yolo] block in the cfg file. You can expect 45FPS to hit ~50-60FPS in FP32. You can find more details about this here.
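For reference, the tail of a [yolo] block would then look something like this (the other entries shown are from a stock yolov3.cfg; only the last line is added):

```ini
[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
nms_threshold=0
```

The same line goes at the end of every [yolo] block in the file, not just the first one.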
A quick look at the darknet code reveals that it uses separate threads for IO and detection. The Python code you presented does IO and inference serially, and hence the two codes are not comparable.
Since you get 45FPS (and higher with the nms_threshold setting) with just inference, the reduced FPS is because of the code which is not related to the DNN performance. I don't know why you have such a huge drop in FPS. I get 90FPS on RTX 2080 Ti using a single thread with IO and detection done serially.
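To illustrate the threading point, here is a minimal producer/consumer sketch: the dummy capture and inference stages stand in for a cv2.VideoCapture read loop and net.forward, and the bounded queue lets IO overlap with inference instead of running serially.

```python
import queue
import threading
import time

frames = queue.Queue(maxsize=4)  # bounded so capture can't run far ahead of inference
SENTINEL = None                  # end-of-stream marker

def capture_loop(n_frames):
    """Producer thread: stands in for a cv2.VideoCapture read loop."""
    for i in range(n_frames):
        time.sleep(0.001)        # simulated camera/decode latency
        frames.put(i)            # a real app would put the decoded frame here
    frames.put(SENTINEL)         # tell the consumer the stream has ended

def inference_loop():
    """Consumer: stands in for net.setInput / net.forward on each frame."""
    results = []
    while True:
        frame = frames.get()
        if frame is SENTINEL:
            break
        time.sleep(0.002)        # simulated inference latency
        results.append(frame)
    return results

t = threading.Thread(target=capture_loop, args=(50,))
t.start()
processed = inference_loop()
t.join()
print("processed", len(processed), "frames")  # -> processed 50 frames
```

With this structure, the next frame is being captured and decoded while the current one is in the network, which is roughly what darknet's demo does.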
DNN_TARGET_CUDA_FP16 is less accurate than DNN_TARGET_CUDA but the outputs are still good enough for most tasks.
@YashasSamaga
" ...is less accurate..." means less decimal precision?
@BadMachine Yes, but the errors from each layer can accumulate and give a big error at the end. It depends on the model. YOLO does quite well in FP16.
@YashasSamaga gotcha.
Just as I thought... I figured it was because the tails of the weights at each layer are discarded, which affects the output at the last layer.
PS: Thank you so much for your CUDA DNN backend implementation! Cheers!
Let me close this issue as it seems resolved.
Hello everyone,
I'm sorry to open this issue again, but I have exactly the same setup and also the same problem: my RTX 2060 doesn't use all its VRAM with cuDNN. I compiled OpenCV with CUDA 10.1 and cuDNN 8.0.2, and I also use the yolov3 model with darknet; I hit around 13 FPS with a 400x400 image on Windows 10. I also have an i7 9700K and 16GB DDR4.
I use the cvlib package, and here is the part that should be interesting in my case:
if initialize:
    classes = populate_class_labels()
    net = cv2.dnn.readNet(weights_file_abs_path, config_file_abs_path)
    initialize = False

# enables opencv dnn module to use CUDA on Nvidia card instead of cpu
if enable_gpu:
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    try:
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
    except:
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

net.setInput(blob)
outs = net.forward(get_output_layers(net))
I tried using
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
which gave me somewhat better results, and I now hit 16-17 FPS. I also tried adding nms_threshold=0 at the end of each [yolo] block in yolov3.cfg, as recommended by @YashasSamaga, but it doesn't seem to affect the framerate (it doesn't look like it changes anything!).
So, could this be a hardware problem? How did you solve this issue @lucaspojo?
I can clearly see with nvidia-smi that CUDA is only using a small amount of VRAM:

@QBarbeAusy
I compiled OpenCV with CUDA 10.1 and cuDNN 8.0.2, and I also use the yolov3 model with darknet; I hit around 13 FPS with a 400x400 image.
There is a performance regression in cuDNN 8 that affects OpenCV and Darknet. I am working on an update that might get around the regression.
Related discussion at NVIDIA's developer forums: https://forums.developer.nvidia.com/t/cudnn8-regression-in-algorithm-selection-heuristics/153667/3
I can clearly see with nvidia-smi that CUDA is only using a small amount of VRAM:
There is no reason why all the GPU memory needs to be consumed. More memory consumption does not imply it's faster.
If downgrading to an older version of cuDNN does not fix it, check if the comments in https://github.com/opencv/opencv/issues/17422 answer your question.
Thanks a lot for your quick (and precise) answer @YashasSamaga!
So, you mean that I have to downgrade to cuDNN 7.6.5 to make it work faster? Is it compatible with OpenCV 4.4.0?
Anyway, I'll try to build OpenCV with this older cuDNN version and report back whether it is better, so this issue can be resolved.
EDIT: I built with cuDNN 7.6.5 and now hit 31 FPS. From my point of view it's not 'incredible' for an RTX 2060, but it's 2 times better than my previous result, so we can say that @YashasSamaga solved it! Thanks a lot.
Hello everyone,
I am having the same issue as stated above by @QBarbeAusy.
I'm getting around 18 FPS with OpenCV DNN and 28 FPS with Darknet.
I am using the system configuration below:
i7 9th gen, 32GB RAM and an RTX 2060 GPU.
OpenCV 4.4.0, CUDA 11.0, cuDNN 7.6.4.
Detection runs on a live camera (capable of 50 FPS), but the output is delayed.
Let me know if you have any suggestions to improve the FPS.
Thank you.
@QBarbeAusy
So, you mean that I have to downgrade to cuDNN 7.6.5 to make it work faster? Is it compatible with OpenCV 4.4.0?
Yes and yes.
I built with cuDNN 7.6.5 and now hit 31 FPS. From my point of view it's not 'incredible' for an RTX 2060, but it's 2 times better than my previous result, so we can say that @YashasSamaga solved it! Thanks a lot.
How are you measuring the FPS? YOLOv3 hits ~140 FPS for 608 x 608 input on RTX 2080 Ti if you measure the time taken for net.forward(all output blobs) to complete.
@utsembedded
Getting around 18 FPS with OpenCV DNN and 28 FPS with Darknet.
Set target to DNN_TARGET_CUDA_FP16 if you haven't already.
I suspect you are measuring Darknet and OpenCV DNN FPS differently. How did you measure darknet FPS and OpenCV FPS?
@QBarbeAusy @utsembedded Please try with this script.
@YashasSamaga
Using your script, DNN_TARGET_CUDA_FP16 and yolov4, I hit ~60 FPS with inputParams = (416,416).
I measured the computation time using time.time().
Using your script, DNN_TARGET_CUDA_FP16 and yolov4, I hit ~60 FPS with inputParams = (416,416).
60 FPS includes preprocessing, inference and postprocessing if you used the python script. The C++ version measures the inference time only.
That's your maximum FPS if you use a single Net object. You can make a pipeline to achieve the peak performance (details can be found in the comments here https://github.com/opencv/opencv/issues/17422). You can then group images into a batch or use multiple Net objects to further increase the FPS (maybe 80 or 100).
Thanks @YashasSamaga. My problem is the following: you said that I could achieve 140 FPS with a 608x608 input image on my GPU. As you can see here, I only get 60 FPS with yolov4 (using Python).
To be clear and to understand my performance: when you speak about 140 FPS with a 608x608 input image, is that with C++? Is it achievable on Windows 10? With Python? With only one Net object?
140 FPS with a 608x608 input image with my GPU
That was for RTX 2080 Ti for YOLOv3 from actual measurements (using this code). More detailed FPS stats for YOLOv4.
Ok, sorry! I switched the 8 into a 6! At least we can see that a 2060 is ~1.75 times worse than a 2080 Ti under the same conditions; I'm quite impressed by this huge difference.
If someone gets something similar: I think I understand why cvlib (for example) takes much, much more time. It looks like it creates a new Net object for each frame:
def detect_common_objects(image, confidence=0.5, nms_thresh=0.3, model='yolov3', enable_gpu=True):
    Height, Width = image.shape[:2]
    scale = 0.00390625

    global classes
    global dest_dir

    if model == 'yolov3-tiny':
        config_file_name = 'yolov3-tiny.cfg'
        cfg_url = "https://github.com/pjreddie/darknet/raw/master/cfg/yolov3-tiny.cfg"
        weights_file_name = 'yolov3-tiny.weights'
        weights_url = 'https://pjreddie.com/media/files/yolov3-tiny.weights'
        blob = cv2.dnn.blobFromImage(image, scale, (416, 416), (0, 0, 0), True, crop=False)
    elif model == 'yolov4':
        config_file_name = 'yolov4.cfg'
        cfg_url = ""
        weights_file_name = 'yolov4.weights'
        weights_url = ''
        blob = cv2.dnn.blobFromImage(image, scale, (416, 416), (0, 0, 0), True, crop=False)
        # 224x224 or 320x320 or 416x416
    else:
        config_file_name = 'yolov3.cfg'
        cfg_url = 'https://github.com/arunponnusamy/object-detection-opencv/raw/master/yolov3.cfg'
        weights_file_name = 'yolov3.weights'
        weights_url = 'https://pjreddie.com/media/files/yolov3.weights'
        blob = cv2.dnn.blobFromImage(image, scale, (416, 416), (0, 0, 0), True, crop=False)

    config_file_abs_path = dest_dir + os.path.sep + config_file_name
    weights_file_abs_path = dest_dir + os.path.sep + weights_file_name

    if not os.path.exists(config_file_abs_path):
        download_file(url=cfg_url, file_name=config_file_name, dest_dir=dest_dir)
    if not os.path.exists(weights_file_abs_path):
        download_file(url=weights_url, file_name=weights_file_name, dest_dir=dest_dir)

    global initialize
    global net
    if initialize:
        classes = populate_class_labels()
        net = cv2.dnn.readNet(weights_file_abs_path, config_file_abs_path)
        initialize = False

    # enables opencv dnn module to use CUDA on Nvidia card instead of cpu
    if enable_gpu:
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
        try:
            net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
        except:
            net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

    net.setInput(blob)
    outs = net.forward(get_output_layers(net))

    class_ids = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            class_id = np.argmax(scores)
            max_conf = scores[class_id]
            if max_conf > confidence:
                center_x = int(detection[0] * Width)
                center_y = int(detection[1] * Height)
                w = int(detection[2] * Width)
                h = int(detection[3] * Height)
                x = center_x - (w / 2)
                y = center_y - (h / 2)
                class_ids.append(class_id)
                confidences.append(float(max_conf))
                boxes.append([x, y, w, h])

    indices = cv2.dnn.NMSBoxes(boxes, confidences, confidence, nms_thresh)

    bbox = []
    label = []
    conf = []
    for i in indices:
        i = i[0]
        box = boxes[i]
        x, y, w, h = box
        bbox.append([int(x), int(y), int(x + w), int(y + h)])
        label.append(str(classes[class_ids[i]]))
        conf.append(confidences[i])

    return bbox, label, conf
That's maybe where we lose a lot of time.
@QBarbeAusy, try to avoid array operations in Python - they're quite inefficient. Take a look at cv::dnn::DetectionModel.
@lucaspojo You need to add nms_threshold=0 entry at the end of every [yolo] block in the cfg file. You can expect 45FPS to hit ~50-60FPS in FP32. You can find more details about this here.
@YashasSamaga I have run tests after adding nms_threshold=0 in the latest OpenCV 4.4 build.
I am running the yolov4-tiny models on the Jetson AGX Xavier with cv2.dnn.DNN_TARGET_CUDA_FP16. I have confirmed that there is no performance increase from adding nms_threshold=0 to the .cfg file.
FPS without nms_threshold=0 in .cfg
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.109808 ms
FPS: 195.7020523896643
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.106299 ms
FPS: 195.8365569628764
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 4.993160 ms
FPS: 200.27398398401743
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.095820 ms
FPS: 196.23925417644907
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.115252 ms
FPS: 195.49380748098804
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 4.921196 ms
FPS: 203.2026564617298
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.092177 ms
FPS: 196.37964729138398
FPS with nms_threshold=0 in .cfg
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 4.994926 ms
FPS: 200.2031480307624
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.111146 ms
FPS: 195.6508394090784
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 4.972160 ms
FPS: 201.11984080365383
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 5.107520 ms
FPS: 195.789751798227
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 4.865999 ms
FPS: 205.5076366472835
nvidia@nvidia:~/ai/yolo-inference$ python3 test_fps.py
Time per inference: 4.900401 ms
FPS: 204.06494769572686
Let me know if there is something wrong here. If you want me to run any other tests, please let me know.
I have run tests after adding nms_threshold=0 in the latest OpenCV 4.4 build.
This was fixed in OpenCV 4.4.0. It's required for OpenCV 4.3 and below.
@YashasSamaga great news. Thanks!