Darknet: No performance improvement with CUDNN_HALF=1 on Jetson Xavier AGX

Created on 15 Apr 2020 · 13 comments · Source: AlexeyAB/darknet

Hi @AlexeyAB, I am having the same issue on the Jetson Xavier AGX as reported in issue #4691.

Jetson Xavier AGX

  • Jetpack 4.3 (latest)

    • CUDA 10.0 & cuDNN 7.6.3

    • OpenCV 4.2 (Also tried 4.1)

    • Built using the latest darknet repo cloned 14.04.2020

    • Unmodified yolov3.cfg from the repo

    • Pre-trained yolov3.weights

    • Test video

  1. Building with CUDNN_HALF=0 or CUDNN_HALF=1 via the Makefile gives the same AVG_FPS of 14.8 when running the demo with -benchmark. See the details below.

  2. If I build with CMake, it compiles with CUDNN_HALF=0. I am not sure whether this is expected behaviour, a clue to the issue, or an environment problem (see the sketch below).
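For reference, the kind of CMake build I mean is roughly the following; a minimal sketch, and the ENABLE_* option names are assumptions based on the repo's CMakeLists.txt (check cmake -LAH for the exact flags in your checkout):

cd darknet
mkdir -p build_release && cd build_release
# option names below are assumed; verify them against CMakeLists.txt
cmake .. -DENABLE_CUDA=ON -DENABLE_CUDNN=ON -DENABLE_CUDNN_HALF=ON -DENABLE_OPENCV=ON
cmake --build . -- -j$(nproc)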

I have deleted the repo and recompiled multiple times with make / make clean, adjusting the Makefile as shown below. I have also reflashed the device and installed OpenCV 4.2 with CUDA and cuDNN.

Any ideas on how to fix this would be greatly appreciated. The FPS is exactly the same as @vitotsai's CUDNN_HALF=0 result.

Make with:

GPU=1
CUDNN=1
CUDNN_HALF=0 or 1
OPENCV=1
AVX=0
OPENMP=1
LIBSO=0
ZED_CAMERA=0 # ZED SDK 3.0 and above
ZED_CAMERA_v2_8=0 # ZED SDK 2.X

ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]
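After editing the Makefile I rebuild with a full clean so the changed flag actually takes effect:

make clean
make -j$(nproc)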

-benchmark with CUDNN_HALF=0:
FPS: 14.8  AVG_FPS: 14.8

./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights cartest.mp4 -benchmark
CUDA-version: 10000 (10000), cuDNN: 7.6.3, GPU count: 1
OpenCV version: 4.1.1
Demo
compute_capability = 720, cudnn_half = 0
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
.....
Total BFLOPS 65.879
avg_outputs = 532444
Allocate additional workspace_size = 52.43 MB
Loading weights from yolov3.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
video file: cartest.mp4
Video stream: 1280 x 720

-benchmark with CUDNN_HALF=1:
FPS: 14.8  AVG_FPS: 14.7

./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights cartest.mp4 -benchmark
CUDA-version: 10000 (10000), cuDNN: 7.6.3, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.1.1
Demo
compute_capability = 720, cudnn_half = 1
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
....
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 65.879
avg_outputs = 532444
Allocate additional workspace_size = 52.43 MB
Loading weights from yolov3.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
video file: cartest.mp4
Video stream: 1280 x 720

Bug fixed

Most helpful comment

@pullmyleg Download the latest Darknet code, the new code is +10% faster.

All 13 comments

Thanks @AlexeyAB, fixed with the latest check-in.

@pullmyleg What FPS can you get now?

@AlexeyAB from 14 FPS to 20.9 FPS using the standard YOLOv3 config at 416x416.

This is a huge improvement and allows me to run 1080p detection from the UAV for very small objects (dolphins) using YOLOv3-tiny at 19 FPS.

Thanks!

@pullmyleg Do you use yolov3-tiny.cfg with width=1088 height=1088 in cfg-file and get 19 FPS?

@pullmyleg Download the latest Darknet code, the new code is +10% faster.

@AlexeyAB thanks. On the standard yolov3 config, the latest benchmark is 22.4 FPS on the Jetson Xavier AGX, a ~10% improvement. This is great, thanks!

YOLOv3-tiny runs detection at 19 FPS at 1920 x 1088 (21 FPS now with the latest change). Please let me know if this doesn't make sense.

We are a not-for-profit (MAUI63), and what we are doing is looking for the world's rarest dolphin (Maui) using object detection and a large UAV that flies at 120 km/h. If you are interested, see the fundraising video here :). The higher we can fly and the smaller the objects (dolphins) we can detect, the more area we can cover per flight. Once we spot a dolphin, the UAV will circle and follow the pod until the pilot tells it to continue surveying.

The goal is to find the model that performs most accurately on the smallest objects possible from 1080p 30 fps footage, using a Jetson Xavier AGX on board. We need a minimum of 12 FPS to be able to spot dolphins at 120 km/h, but I think 20 FPS+ is preferable and will work better.
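As a rough back-of-the-envelope check of those numbers: 120 km/h is about 33.3 m/s, so at 12 FPS consecutive frames are spaced roughly 33.3 / 12 ≈ 2.8 m of ground track apart, while at 20 FPS the spacing drops to about 1.7 m.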

I am currently training and benchmarking a range of models and configurations for this project, compiled by reading through issues and suggestions in this repo. You can see the list below; it is not complete and is a work in progress. I am still gathering results and training the different models for comparison.

If you or anyone else have any suggestions on other models/configurations I should use please let me know.

Tomorrow I will have access to some new hardware (1x Tesla V100 32 GB, 48 GB RAM, 12 CPUs), and soon I will have access to Azure NCv3 VMs where I will do some large-batch training trialling GPU and CPU memory. The models so far have been trained on a 1080 Ti (Beast) / 2070 (GS65).

Training Results

Name | Source | Model | Machine | Dataset | CPU Memory | Batch | Subdivisions | Random | Width | Height | Calc Anchors | Iterations | mAP | Xavier FPS 1920x1080 | Video Result (Manual) | Notes
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
yolov3-tiny-maui-1536.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | N | 7200 | 89.7% | 21 | Good @1920x1088 detection |
yolov3-tiny-maui-20161152.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 2016 | 1152 | N | 6900 | 84.0% | | | Accuracy dropped with width/height changed (not square, no anchor recalc)
yolov3-tiny-maui-544544.txt | Darknet | Yolov3-Tiny | GS65 | Complete | N | 64 | 4 | Y | 544 | 544 | N | 19000 | 83.8% | | | Large batch = better result, even at low resolutions; higher iterations also help
yolov3-tiny-maui20161152-anc | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 2016 | 1152 | Y | | | | |
yolov3-maui-576.txt | Darknet | Yolov3 | Beast | Complete | N | 64 | 16 | Y | 576 | 576 | N | 8800 | 88.4% | | |
yolov3-tiny-maui-1536-anc.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | Y | 6900 | 83.9% | | | Worst mAP performance with calculated anchors
yolov3-tiny-maui-small-1536 | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 1504 | 1504 | N | | | | |
yolov3-tiny-maui-small-1536-anc | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 1504 | 1504 | Y | | | | |
yolov3-tiny-maui-small-20161152 | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 2016 | 1152 | N | | | | |
yolov3-tiny-maui-small-20161152-anc | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 2016 | 1152 | Y | | | | |
yolov3-tiny-maui-bb | Darknet | Yolov3-Tiny | UoA | Complete | N | 64 | 4 | Y | 1536 | 1536 | N | | | | |
yolov3-tiny-maui-bb-memory | Darknet | Yolov3-Tiny | Azure | Complete | Y | 64 | 2 | N | 1920 | 1920 | N | | | | |
yolov3-19201080 | Darknet | yolov3 | UoA | | | | | | 1920 | 1080 | ? | | | | |
Yolov3-SPP | Darknet | yolov3-spp | | | | | | | | | | | | | |
Yolov3-SPP-tiny | Darknet | yolov3-spp-tiny | | | | | | | | | | | | | |
Yolov3-LSTM | Darknet | yolov3-spp | | Complete - ordered frames | | | | | | | | | | | |
Yolov3-LSTM-tiny | Darknet | yolov3-spp-tiny | | Complete - ordered frames | | | | | | | | | | | |
Yolo-v3-tp3 | Darknet | yolo_v3_tiny_pan3.cfg | | | | | | | | | | | | | |
Yolo_tiny-prn | Darknet | yolo_tiny - prn | | | | | | | | | | | | | |
yolov3-tiny-maui-1536-ten | Tensorflow | Yolov3-Tiny | | | | | | | | | | | | | |
yolov3-light-maui-1536 | Tensorflow | Yolov3-Tiny-Light | | | | | | | | | | | | | |
yolov3-Nano-maui-1536 | Tensorflow | Yolov3-nano | | | | | | | | | | | | | |
yolov3-light-maui-1536-spp | Tensorflow | Yolov3-Tiny-Light-SPP | | | | | | | | | | | | | |
yolov3-tiny-maui-1536 | Pytorch | Yolov3-Tiny | | | | | | | | | | | | | |
yolov3-maui-1536 | Pytorch | Yolov3 | | | | | | | | | | | | | |
small network model | Darknet | Network-Model | | | | | | | | | | | | | |

Training Results (quoted row)

Name | Source | Model | Machine | Dataset | CPU Memory | Batch | Subdivisions | Random | Width | Height | Calc Anchors | Iterations | mAP | Xavier FPS 1920x1080 | Video Result (Manual) | Notes
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
yolov3-tiny-maui-1536.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | N | 7200 | 89.7% | 21 | Good @1920x1088 detection |

  • Does this mean that you trained the model with width=1536 height=1536 in the cfg, and after training changed it to width=1920 height=1088 in the cfg for detection? Don't do that if you train on images from the same camera.

  • Do you use a separate validation dataset for the mAP calculation?


Try to train these 3 yolov3-tiny models - these models are implemented for aerial detection: https://github.com/AlexeyAB/darknet/issues/4495#issuecomment-578538967

Train with width=1920 height=1088 in the cfg and use the same width=1920 height=1088 for detection, and train using the pre-trained weights file yolov3-tiny.conv.15 (see the command sketch after the list below): https://github.com/AlexeyAB/darknet#how-to-train-tiny-yolo-to-detect-your-custom-objects

  1. Tiny_3l_rotate_whole_maxout - https://github.com/AlexeyAB/darknet/files/3995740/yolov3-tiny_3l_rotate_whole_maxout.cfg.txt

  2. Tiny_3l_stretch_sway_whole_concat_maxout - https://github.com/AlexeyAB/darknet/files/4003688/yolov3-tiny_3l_stretch_sway_whole_concat_maxout.cfg.txt

  3. Tiny_3l_resize - https://github.com/AlexeyAB/darknet/files/3995772/yolov3-tiny_3l_resize.cfg.txt
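The usual sequence from the README is roughly as follows (a sketch; data/obj.data and the cfg name are placeholders for your own files):

# extract the first 15 layers from the pre-trained yolov3-tiny weights
./darknet partial cfg/yolov3-tiny.cfg yolov3-tiny.weights yolov3-tiny.conv.15 15
# then train one of the cfgs above from that starting point
./darknet detector train data/obj.data yolov3-tiny_3l_rotate_whole_maxout.cfg yolov3-tiny.conv.15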

Thanks @AlexeyAB

Does it mean that you trained model with width=1536 height=1536 in cfg, and after training changed width=1920 height=1088 in cfg for detection?

Yes, in that example I trained at 1536x1536 and detect at 1920x1088. Why should I not change it? I assume it's because I should train at the same aspect ratio that I intend to detect at?

Don't do it, if you train on images from the same camera.

I am training on images from a different camera than the one that will be in the final UAV. The images are frames (6 per second) extracted from 4K video at 3840 x 2160 px. I do not yet have footage of the dolphins from the final UAV camera; it is still being built.

Do you use a separate validation dataset for mAP calculation?

Yes, the training set is different from the validation set. 1 of the 8 videos is used in the testing set. The final video for manual testing is one of the videos in the validation set.

Complete data set

  • Training set ~9000 images.
  • Validation set ~ 300 images

Small-objects-only dataset (from high-altitude footage, very small dolphins only)

  • Training set ~3200 images.
  • Validation set ~ 700 images

Try to train these 3 yolov3-tiny models - these models are implemented for aerial detection: #4495

Ok, thank you. I will train these next and post results when finished.

Yes in that example I trained at 1536x1536 and detect at 1920x1088. Why should I not change? Assuming it's because I should train in the same aspect ratio that I would like to detect?

Yes, the aspect ratio should be the same, so use the same network resolution for training and detection.

Also try to train a 4th yolov3-tiny model with width=1920 height=1088 in the cfg: yolo_v3_tiny_pan3_scale_giou.cfg.txt

Hi @AlexeyAB, I know this question has been answered many times, but I just want to confirm that what I am doing with calculated anchors is correct. I understand that the anchors are the widths and heights of the closest object sizes for that layer, but what I don't understand is why they are split by size between the layers, e.g. why anchors greater than 60x60 go in the first layer.

My understanding from the readme is:

  • Anchors greater than 60x60: layer 1.
  • Anchors greater than 30x30 but smaller than 60x60: layer 2.
  • Anchors smaller than 30x30: layer 3.

Note that I have two datasets (complete and small). Small is from 40 m+ altitude footage only (very small objects), and complete is from 10-40 m (very small and medium-small objects).

This is for the small dataset. I am using the small-object dataset because the smaller the objects we can detect, the higher we can fly and the more area we can cover in one flight.

num_of_clusters = 9, width = 1920, height = 1080 
 read labels from 3232 images 
 loaded      image: 3232     box: 3019
 all loaded. 

 calculating k-means++ ...

 iterations = 16 

counters_per_class = 3019

 avg IoU = 79.59 % 

Saving anchors to the file: anchors.txt 
anchors =  29, 23,  21, 39,  50, 26,  38, 36,  30, 50,  49, 49,  70, 38,  42, 73,  70, 69
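For reference, the output above came from the anchor-clustering command (data/obj.data is a placeholder for my own data file):

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 1920 -height 1080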

Option 1 - based on one dimension of each anchor fitting the size threshold:

Layer 1 mask = 6,7,8
Layer 2 mask = 3,4,5
Layer 3 mask = 0,1,2
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69.

Mask 6 is actually smaller than 60x60 and mask 2 is greater than 30x30, but I noticed a similar approach being used in the original config when an anchor was close to the threshold or one of its values was >= 60, e.g. mask 5 in layer 2 of the original config is 59,119, which is larger than 60x60.

Option 2 - based on total object size, e.g. 60x60:

Layer 1 mask = 8
Layer 2 mask = 2,3,4,5,6,7
Layer 3 mask = 0,1
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69

I will adjust the filters according to the masks used in each layer (see the sketch below).
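As a sketch of what I mean by adjusting the filters (assuming my single class, so filters = (classes + 5) * <number of masks in that layer>), the [convolutional] layer just before a 3-mask [yolo] layer would look something like:

[convolutional]
size=1
stride=1
pad=1
activation=linear
# filters = (1 class + 5) * 3 masks = 18
filters=18

[yolo]
mask = 6,7,8
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69
classes=1
num=9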

Thanks again for your help!

There is no strict rule. There is just an empirical recommendation:

  • Anchors greater than 64x64 for the layer with 5 subsampling layers (stride=2), because it has a receptive field >= 32 = pow(2,5) (actually it is higher than 32x32, because conv3x3 layers also increase the receptive field, not only the layers with stride=2)
  • Anchors greater than 32x32 but smaller than 64x64 for the layer with 4 subsampling layers (stride=2)
  • Anchors smaller than 32x32 for the layer with 3 subsampling layers (stride=2)

You can add show_receptive_field=1 in the [net] section of the cfg to print the receptive field size of each layer to the console during network initialization.
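For example, at the top of the cfg:

[net]
# print the receptive field of each layer at network initialization
show_receptive_field=1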


This is a more complex issue: you should take into account the number of objects per image for each size, the number of overlapping objects for each size, and so on.

I would recommend that you:

  • either use the default anchors,
  • or use Option 2, but add default anchors: 2 anchors to layer-1 and 1 anchor to layer-3.

Ok, thank you @AlexeyAB. I will try training with both and compare the results.

To confirm, option 2 with the additional default anchors should look like this:

All bold entries are new.

Layer 1 mask = 9,10,11
Layer 2 mask = 3,4,5,6,7,8
Layer 3 mask = 0,1,2
anchors = **10,13,** 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69, **116,90, 156,198**
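If that is right, a rough sketch of the three [yolo] blocks in the cfg (single class assumed, so the preceding [convolutional] has filters = (1 + 5) * masks-in-that-layer, i.e. 18, 36 and 18, and num=12 for the total anchor count):

# layer 1 (coarsest grid)
[yolo]
mask = 9,10,11
anchors = 10,13, 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69, 116,90, 156,198
classes=1
num=12

# layer 2: same anchors, classes=1, num=12, but mask = 3,4,5,6,7,8 (preceding [convolutional] filters=36)
# layer 3: same anchors, classes=1, num=12, but mask = 0,1,2 (preceding [convolutional] filters=18)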

@pullmyleg Yes.

