Darknet: Very low FPS in Jetson TX2 with OpenCV in Python

Created on 21 Nov 2018 · 27 comments · Source: AlexeyAB/darknet

Hi, I'm working on the Jetson TX2.
I have trained a YOLOv3 Tiny model with my own dataset successfully, and I have tested it with the ./darknet commands. Using the on-board camera I get between 25 and 30 FPS. Now that I have the model and the weights, I want to use my model in Python with OpenCV, because that lets me get the prediction box coordinates easily.
I want the coordinates to perform some calculations, like approximating distances. But when I run my code it is very slow, less than 1 FPS. I'm using OpenCV 3.4.2 and Python 2.7. I have compiled darknet with GPU, CUDNN, and CUDNN_HALF support, and I have compiled OpenCV with GPU, CUDNN and GStreamer support.
This is my code:

import numpy as np
import cv2
import time
import sys
import os
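# net, layerNames and WINDOW_NAME are used below but their setup was not
# included in the post -- presumably something like the standard OpenCV 3.x
# DNN idiom (the cfg/weights paths here are placeholders):
#   net = cv2.dnn.readNetFromDarknet('yolov3-tiny.cfg', 'yolov3-tiny.weights')
#   layerNames = [net.getLayerNames()[i[0] - 1] for i in net.getUnconnectedOutLayers()]
#   WINDOW_NAME = 'video'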

gst_str = ('nvcamerasrc ! '
               'video/x-raw(memory:NVMM), '
               'width=(int)1280, height=(int)720, '
               'format=(string)I420, framerate=(fraction)120/1 ! '
               'nvvidconv ! '
               'video/x-raw, width=(int)1280, height=(int)720, '
               'format=(string)BGRx ! '
               'videoconvert ! appsink')

cap = cv2.VideoCapture(gst_str, cv2.CAP_GSTREAMER)

if not cap.isOpened(): 
    sys.exit('Failed to open camera')
else:
    #Read a frame repeatedly
    while True:
        img_ok, img = cap.read()
        if not img_ok:
            break
        else:   
            #Process the frame
            blobVideo = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB = True, crop = False)
            net.setInput(blobVideo)
            startVideo = time.time()
            layerOutputsVideo = net.forward(layerNames)
            endVideo = time.time()
            cv2.imshow(WINDOW_NAME, img)
            elap = (endVideo - startVideo)
            print('[INFO] TIME TO PROCESS A FRAME: {:.4f} SEC'.format(elap))
            key = cv2.waitKey(1000 / 30)
            if key >= 0:  # quit on any key press
                break

    cap.release()

Why is there such a big difference between using my model with my own code and using it in darknet? Is OpenCV's net.forward function not optimized?
Is there an easy way to get the coordinates of the prediction boxes in darknet, like there is in OpenCV?

Thanks for the help!

EDIT
Actually, the exact figure I'm getting is 5 FPS. Another problem is that I have a huge delay on the image.

All 27 comments

You can use the darknet Python wrapper with CUDA. OpenCV DNN runs without CUDA.

@PabloDIGITS

Compile Darknet with GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=1 LIBSO=1

And try to use this Python code: darknet_video.zip

Change line lib = CDLL("yolo_cpp_dll.dll", RTLD_GLOBAL)
to the lib = CDLL("./darknet.so", RTLD_GLOBAL)
in the darknet_video.py

And set your cfg, weights and video file in darknet_video.py
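A minimal sketch of that one-line change, assuming darknet.so has been built with LIBSO=1 and sits next to the script:

    from ctypes import CDLL, RTLD_GLOBAL

    # darknet_video.py loads the Windows DLL like this:
    #   lib = CDLL("yolo_cpp_dll.dll", RTLD_GLOBAL)
    # On the Jetson (Linux), point it at the shared object instead:
    lib = CDLL("./darknet.so", RTLD_GLOBAL)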

The huge delay is not a Darknet issue; it's an issue with OpenCV's capture buffer.
You can try the approach described here (sketched below):
https://www.pyimagesearch.com/2015/12/21/increasing-webcam-fps-with-python-and-opencv/
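The idea from that article, roughly: keep a background thread calling cap.read() continuously and only hand the most recent frame to the detector, so frames never pile up in OpenCV's internal buffer. A minimal sketch (the class and method names below are illustrative, not the article's code):

    import threading
    import cv2

    class LatestFrameReader(object):
        """Reads frames in a background thread, keeping only the newest one."""
        def __init__(self, src):
            self.cap = cv2.VideoCapture(src)
            self.frame = None
            self.stopped = False
            t = threading.Thread(target=self._update)
            t.daemon = True
            t.start()

        def _update(self):
            while not self.stopped:
                ok, frame = self.cap.read()
                if not ok:
                    break
                self.frame = frame  # overwrite: older frames are simply dropped

        def read(self):
            return self.frame

        def stop(self):
            self.stopped = True
            self.cap.release()

The detection loop then calls reader.read() instead of cap.read() and always gets the latest frame, at the cost of skipping frames the detector was too slow to process.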

@AlexeyAB Hello! Can you explain, please: how can I use a GStreamer video stream for detection? I have a script for the stream: gst-launch-1.0.exe udpsrc port=5000 caps="application/x-rtp.." .. ! autovideosink .

I tried running it like this:
./darknet detector demo ... gst-launch udpsrc port=5000 caps="application/x-rtp.." .. ! autovideosink

It does not work. (PS: I work on Ubuntu.)

Regards!

@AshleyRoth
I work on Ubuntu too. The exact command I use from the root directory of darknet is the following, and it works for me, but I'm using a Jetson TX2:
./darknet detector demo cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights "nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)1280, height=(int)720,format=(string)I420, framerate=(fraction)30/1 ! nvvidconv flip-method=0 ! video/x-raw, format=(string)BGRx ! videoconvert ! video/x-raw, format=(string)BGR ! appsink"

@AlexeyAB
I have tried what you recommended. I modified these lines to make it work for me:

cap = cv2.VideoCapture('nvcamerasrc ! '
                       'video/x-raw(memory:NVMM), '
                       'width=(int)1280, height=(int)720, '
                       'format=(string)I420, framerate=(fraction)120/1 ! '
                       'nvvidconv ! '
                       'video/x-raw, width=(int)1280, height=(int)720, '
                       'format=(string)BGRx ! '
                       'videoconvert ! appsink')
#cap = cv2.VideoCapture("test.mp4")
#cap.set(3, 1280)
#cap.set(4, 720)
out = cv2.VideoWriter(
    "output.avi", cv2.VideoWriter_fourcc(*"MJPG"), 10.0,
    (lib.network_width(netMain), lib.network_height(netMain)))
print("Starting the YOLO loop...")
while True:
    prev_time = time.time()
    ret, frame_read = cap.read()
    frame_rgb = cv2.cvtColor(frame_read, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_rgb,
                               (lib.network_width(netMain),
                                lib.network_height(netMain)),
                               interpolation=cv2.INTER_LINEAR)
    detections = detect(netMain, metaMain, frame_resized, thresh=0.25)
    image = cvDrawBoxes(detections, frame_resized)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    print(1/(time.time()-prev_time))
cap.release()
out.release()

I'm getting this output (I assume the time is in seconds, right?):

Starting the YOLO loop...
2.67668169133
3.08765689299
3.24111231487
3.27261463269

I need to work in real time, so I will try the YOLOv3 Tiny version and report how it goes.

Thank you all for the help!
EDIT
I added the following code to show the video stream:

cv2.imshow('video', image)
print(1/(time.time()-prev_time))
key = cv2.waitKey(1000 / 30)
if key >= 0:  # quit on any key press
    break

But I have a delay of 5 seconds in the video

@PyreMiz
Thanks for the info, but in that example he does not use a GStreamer pipeline; he uses the v4l2 interface on /dev/video0. The onboard camera is a Bayer sensor, so its frame format does not match what OpenCV expects, which means I can't use cv2.VideoCapture(0); I need to use GStreamer.

@AlexeyAB
With the TINY YOLOv3 version I get this output:

Starting the YOLO loop...
1.83460369921
10.2824754602
10.9070445976
11.7825802374
11.6740628584

And now the delay is 2 seconds. I think the delay could be caused by the GStreamer pipeline, but when I use the same pipeline in other OpenCV applications it works well, so I am confused.

The FPS with the Tiny version has improved, but it is still quite far from the FPS I get when I use the darknet command:
./darknet detector demo cfg/cinta.data cfg/cintatest.cfg backup/cintatrain_final.weights "nvcamerasrc ! video/x-raw(memory:NVMM), width=(int)2592, height=(int)1458, format=(string)I420, framerate=(fraction)30/1 ! nvvidconv ! video/x-raw, width=(int)2592, height=(int)1458, format=(string)BGRx ! videoconvert ! appsink"

Thanks for the help.

I added this to the end of my pipeline and it improves the delay problem:
appsink drop=true sync=false
Now the delay is less than a second. I will keep trying to improve the latency, but for now it should be good enough for my real-time application. The full pipeline is shown below.
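For reference, this is the earlier capture pipeline with those two appsink options appended (the caps are unchanged; only the final appsink element differs):

    gst_str = ('nvcamerasrc ! '
               'video/x-raw(memory:NVMM), '
               'width=(int)1280, height=(int)720, '
               'format=(string)I420, framerate=(fraction)120/1 ! '
               'nvvidconv ! '
               'video/x-raw, width=(int)1280, height=(int)720, '
               'format=(string)BGRx ! '
               'videoconvert ! appsink drop=true sync=false')

drop=true lets appsink discard old buffers instead of queueing them, and sync=false stops it from synchronizing output to the stream clock, which helps when the consumer (the detector) is slower than the source.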

@PabloDIGITS, thanks! Unfortunately I mean another video source (not the Jetson TX onboard camera). I can receive the video with GStreamer, but I can't use that video with Darknet :(

@PabloDIGITS

@AlexeyAB
With the TINY YOLOv3 version I get this output:

Starting the YOLO loop...
1.83460369921
10.2824754602
10.9070445976
11.7825802374
11.6740628584

These numbers are FPS, i.e. you are getting ~10 FPS.


Now the CPU is the bottleneck.
So you should execute these 3 parts of the code in 3 CPU threads:
1.

        ret, frame_read = cap.read()
        frame_rgb = cv2.cvtColor(frame_read, cv2.COLOR_BGR2RGB)
        frame_resized = cv2.resize(frame_rgb,
                                   (lib.network_width(netMain),
                                    lib.network_height(netMain)),
                                   interpolation=cv2.INTER_LINEAR)
2.

        detections = detect(netMain, metaMain, frame_resized, thresh=0.25)
        image = cvDrawBoxes(detections, frame_resized)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
3. Display the result (the cv2.imshow / cv2.waitKey part of the loop).

So you will get about 3x the FPS, i.e. roughly 25-30 FPS.

@AlexeyAB
Hi, there is something I don't understand:
Part 3 of your suggestion above needs the detections to start executing, and part 2 needs the frame_resized object to start executing, so I think the 3 parts will run sequentially. How will running them in threads improve the speed if the 3 parts can't run in parallel? What am I missing?

You need to implement three threads and two shared data buffers: the first thread reads images from the video and puts them in buffer 1, thread 2 gets image data from buffer 1, runs detection, and puts the results into buffer 2, and thread 3 takes the data generated by thread 2 and displays it on the screen. See the sketch below.
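A minimal sketch of that structure using Python's threading and Queue modules; run_detection() is a placeholder stub standing in for the detect() / cvDrawBoxes() calls from darknet_video.py, and (as discussed further down) the GIL may limit how much real speed-up pure-Python threads give here:

    import threading
    import Queue as queue  # Python 2.7; on Python 3 use "import queue"
    import cv2

    frames = queue.Queue(maxsize=2)   # buffer 1: raw frames from the camera
    results = queue.Queue(maxsize=2)  # buffer 2: frames with boxes drawn

    def run_detection(frame):
        # Placeholder: call detect() and cvDrawBoxes() here
        return frame

    def capture_worker(cap):
        """Thread 1: read frames and push them into buffer 1."""
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.put(frame)

    def detect_worker():
        """Thread 2: take frames from buffer 1, detect, push into buffer 2."""
        while True:
            results.put(run_detection(frames.get()))

    cap = cv2.VideoCapture('test.mp4')
    for target, args in ((capture_worker, (cap,)), (detect_worker, ())):
        t = threading.Thread(target=target, args=args)
        t.daemon = True
        t.start()

    # Thread 3 (here simply the main thread): display results from buffer 2
    while True:
        cv2.imshow('video', results.get())
        if cv2.waitKey(1) >= 0:
            break
    cap.release()

Bounding the queues with maxsize makes the capture thread block instead of piling up frames, which also avoids the out-of-memory problem mentioned later in this thread.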

By the way, I have encountered the same problem as you. I use the same device, a Jetson TX2, with CUDA 9.0, cuDNN 7.1 and OpenCV 3.3. I will try to see whether I can improve the speed with AlexeyAB's method.

@Sephirot1st
OK. I'm new to Python, so I have to learn how to use threads first. If you succeed in improving the speed, could you tell me how you used your threads? I will try on my own too.
Thanks

Oh, I am also new to this language. I will try this method as soon as I learn how to implement it in Python. Please keep this issue open for a while to see if someone else has new ideas.

It seems that the drop in performance is not due to a CPU bottleneck. I timed the detection function as shown below; the measurement does not include the time OpenCV spends reading the video stream:

    timeofstart=time.time()
    detections = detect(netMain, metaMain, frame_rgb, thresh=0.5)
    print(1.0/(time.time()-timeofstart))

The time it takes is twice as long as the C implementation in your repo. I used three processes to handle the data and do detection, but it did not help. So is there a better way to improve the FPS, or is this the best result I can get using Python? Waiting for your response, and thank you in advance. @AlexeyAB

@Sephirot1st

Try to measure seconds with print(time.time()-timeofstart)
instead of FPS with print(1.0/(time.time()-timeofstart)).

Try to use this code:

    timeofstart=time.time()
    detections = detect(netMain, metaMain, frame_rgb, thresh=0.5)
    print(time.time()-timeofstart)

What times do you get using Python and using C?

Thank you for the prompt reply, but where am I supposed to make modifications in order to measure the running time of the C version? I am currently outside the laboratory; I will try it ASAP.

For reference:
When using the same configuration (yolov3-tiny) and weights file on my computer, the C version achieves 220 FPS and the Python version achieves 110 FPS.
When running on the Jetson TX2, the C version achieves 37 FPS and the Python version achieves 18 FPS.
The input used in both experiments is the same 480x360 video file.

@Sephirot1st
Can you compress and attach your current version of darknet_video.py file?

Because multi-threading in Python cannot run in parallel due to the GIL, I actually wrote a Python version that uses multiple processes to read the video, detect, and then display on the screen, but it is much slower. I use multiprocessing.Pipe to exchange data between the processes, which turned out to be the most time-consuming operation. With this multi-process version I get 9 FPS on the Jetson TX2, which is awful.
So when I talk about the Python version, I mean the original single-threaded one that you provided above.
Here are my cfg and Python code:

python.zip

PS:
When using

cap = cv2.VideoCapture(0)
ret, frame_read = cap.read()

to read from a webcam, it takes 0.075 s on average to read each frame. Is there any way to speed up the read operation?


PS2:
Oh, that's probably due to the hardware limit of my webcam; it can only take about 13 shots per second.

@Sephirot1st Try testing on a video file, for example cap = cv2.VideoCapture('test.mp4')

Yes, all the earlier tests ran on a local video file. When I connected the webcam, I just wanted to see whether it works; it seems the camera hardware itself causes the low FPS, and it has nothing to do with Python. When I switch the TX2 to MAXN mode, it achieves 19-20 FPS on a local video file.

So the conclusion is that parallel execution with Python is not possible and doesn't give better performance, is that right?

It does execute in parallel when using multiple processes instead of multiple threads. I use three processes to read video frames, detect, and display. I timed the three processes separately: the first process loads frames at about 200 FPS and the third process displays at about 100 FPS, while the second process can only detect about 20 frames per second. At one point the first process loaded frames so fast that it ate up all the memory of the TX2 and crashed, so I had to make it sleep like this:

time.sleep(1.0/20)

and then it runs smoothly.

That proves the processes are running in parallel, but it makes little difference because the detection process slows everything down.
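If it helps, the sleep throttle can be avoided by giving the frame buffer a maximum size, so the reader process blocks when the detector falls behind. A minimal sketch of that idea, assuming the three-process layout described above (only the reader side is shown; handing frames to the detection process is left as a placeholder):

    import multiprocessing as mp
    import cv2

    def reader(frame_buffer, source):
        """Reader process: pushes frames into a bounded queue."""
        cap = cv2.VideoCapture(source)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frame_buffer.put(frame)  # blocks once the queue is full, so memory stays bounded
        frame_buffer.put(None)       # sentinel: no more frames
        cap.release()

    if __name__ == '__main__':
        # maxsize bounds how many frames can pile up between reader and detector
        frame_buffer = mp.Queue(maxsize=4)
        p = mp.Process(target=reader, args=(frame_buffer, 'test.mp4'))
        p.start()
        while True:
            frame = frame_buffer.get()
            if frame is None:
                break
            # ... hand 'frame' to the detection process / function here ...
        p.join()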

Thanks everyone for the help; I learned several things from this. I will take a look at your code and see how you implemented the multiple processes. I'll close this issue for now; if I have any problems I will re-open it.

@AlexeyAB Are those instructions for your repo, or for this one: https://pjreddie.com/darknet/yolo/ ?

@AlexeyAB Hello! What are your current recommendations for using the yolov3 weights to get a stable 10 frames per second, and is it possible to get that many frames with the regular (not tiny) weights? The input image size is 1280x1024.
Thanks!
