Darknet: Darknet is almost 10 times slower on Nvidia JetPack 4.4 compared to JetPack 4.3 on Jetson Nano

Created on 5 Jul 2020 · 5 Comments · Source: AlexeyAB/darknet

Hi,
Using a customized yolov4-tiny cfg file, the average prediction time of darknet detector test is almost 10 times longer on Nvidia JetPack 4.4 than on JetPack 4.3. I got similar results with another cfg file based on yolov3-tiny.

Hardware: Nvidia Jetson Nano
Makefile Variables:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=0
AVX=0
OPENMP=1
LIBSO=0
ZED_CAMERA=0

Sample output for JetPack 4.4:

 CUDA-version: 10020 (10020), cuDNN: 8.0.0, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV isn't used - data augmentation will be slow 
 0 : compute_capability = 530, cudnn_half = 0, GPU: NVIDIA Tegra X1 
net.optimized_memory = 0 
mini_batch = 1, batch = 16, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 2    416 x 416 x   1 ->  208 x 208 x  32 0.025 BF
   1 conv     64       3 x 3/ 2    208 x 208 x  32 ->  104 x 104 x  64 0.399 BF
   2 conv     64       3 x 3/ 1    104 x 104 x  64 ->  104 x 104 x  64 0.797 BF
   3 route  2                              1/2 ->  104 x 104 x  32 
   4 conv     32       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  32 0.199 BF
   5 conv     32       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  32 0.199 BF
   6 route  5 4                                ->  104 x 104 x  64 
   7 conv     64       1 x 1/ 1    104 x 104 x  64 ->  104 x 104 x  64 0.089 BF
   8 route  2 7                                ->  104 x 104 x 128 
   9 max                2x 2/ 2    104 x 104 x 128 ->   52 x  52 x 128 0.001 BF
  10 conv    128       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 128 0.797 BF
  11 route  10                             1/2 ->   52 x  52 x  64 
  12 conv     64       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x  64 0.199 BF
  13 conv     64       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x  64 0.199 BF
  14 route  13 12                              ->   52 x  52 x 128 
  15 conv    128       1 x 1/ 1     52 x  52 x 128 ->   52 x  52 x 128 0.089 BF
  16 route  10 15                              ->   52 x  52 x 256 
  17 max                2x 2/ 2     52 x  52 x 256 ->   26 x  26 x 256 0.001 BF
  18 conv    256       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 256 0.797 BF
  19 route  18                             1/2 ->   26 x  26 x 128 
  20 conv    128       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 128 0.199 BF
  21 conv    128       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 128 0.199 BF
  22 route  21 20                              ->   26 x  26 x 256 
  23 conv    256       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x 256 0.089 BF
  24 route  18 23                              ->   26 x  26 x 512 
  25 max                2x 2/ 2     26 x  26 x 512 ->   13 x  13 x 512 0.000 BF
  26 conv    512       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.797 BF
  27 conv    256       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x 256 0.044 BF
  28 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  29 conv     18       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x  18 0.003 BF
  30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000 
  31 route  27                                 ->   13 x  13 x 256 
  32 conv    128       1 x 1/ 1     13 x  13 x 256 ->   13 x  13 x 128 0.011 BF
  33 upsample                 2x    13 x  13 x 128 ->   26 x  26 x 128
  34 route  33 23                              ->   26 x  26 x 384 
  35 conv    256       3 x 3/ 1     26 x  26 x 384 ->   26 x  26 x 256 1.196 BF
  36 conv     18       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x  18 0.006 BF
  37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000 
Total BFLOPS 6.737 
avg_outputs = 299663 
 Allocate additional workspace_size = 24.92 MB 
Loading weights from yolov4-tiny_6200.weights...
 seen 64, trained: 793 K-images (12 Kilo-batches_64) 
Done! Loaded 38 layers from weights-file 
test.jpg: Predicted in 769.621000 milli-seconds.

Sample output for JetPack 4.3:

 CUDA-version: 10000 (10000), cuDNN: 7.6.3, CUDNN_HALF=1, GPU count: 1  
 CUDNN_HALF=1 
 OpenCV isn't used - data augmentation will be slow 
 0 : compute_capability = 530, cudnn_half = 0, GPU: NVIDIA Tegra X1 
net.optimized_memory = 0 
mini_batch = 1, batch = 16, time_steps = 1, train = 0 
   layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 2    416 x 416 x   1 ->  208 x 208 x  32 0.025 BF
   1 conv     64       3 x 3/ 2    208 x 208 x  32 ->  104 x 104 x  64 0.399 BF
   2 conv     64       3 x 3/ 1    104 x 104 x  64 ->  104 x 104 x  64 0.797 BF
   3 route  2                              1/2 ->  104 x 104 x  32 
   4 conv     32       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  32 0.199 BF
   5 conv     32       3 x 3/ 1    104 x 104 x  32 ->  104 x 104 x  32 0.199 BF
   6 route  5 4                                ->  104 x 104 x  64 
   7 conv     64       1 x 1/ 1    104 x 104 x  64 ->  104 x 104 x  64 0.089 BF
   8 route  2 7                                ->  104 x 104 x 128 
   9 max                2x 2/ 2    104 x 104 x 128 ->   52 x  52 x 128 0.001 BF
  10 conv    128       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 128 0.797 BF
  11 route  10                             1/2 ->   52 x  52 x  64 
  12 conv     64       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x  64 0.199 BF
  13 conv     64       3 x 3/ 1     52 x  52 x  64 ->   52 x  52 x  64 0.199 BF
  14 route  13 12                              ->   52 x  52 x 128 
  15 conv    128       1 x 1/ 1     52 x  52 x 128 ->   52 x  52 x 128 0.089 BF
  16 route  10 15                              ->   52 x  52 x 256 
  17 max                2x 2/ 2     52 x  52 x 256 ->   26 x  26 x 256 0.001 BF
  18 conv    256       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 256 0.797 BF
  19 route  18                             1/2 ->   26 x  26 x 128 
  20 conv    128       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 128 0.199 BF
  21 conv    128       3 x 3/ 1     26 x  26 x 128 ->   26 x  26 x 128 0.199 BF
  22 route  21 20                              ->   26 x  26 x 256 
  23 conv    256       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x 256 0.089 BF
  24 route  18 23                              ->   26 x  26 x 512 
  25 max                2x 2/ 2     26 x  26 x 512 ->   13 x  13 x 512 0.000 BF
  26 conv    512       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.797 BF
  27 conv    256       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x 256 0.044 BF
  28 conv    512       3 x 3/ 1     13 x  13 x 256 ->   13 x  13 x 512 0.399 BF
  29 conv     18       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x  18 0.003 BF
  30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000 
  31 route  27                                 ->   13 x  13 x 256 
  32 conv    128       1 x 1/ 1     13 x  13 x 256 ->   13 x  13 x 128 0.011 BF
  33 upsample                 2x    13 x  13 x 128 ->   26 x  26 x 128
  34 route  33 23                              ->   26 x  26 x 384 
  35 conv    256       3 x 3/ 1     26 x  26 x 384 ->   26 x  26 x 256 1.196 BF
  36 conv     18       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x  18 0.006 BF
  37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000 
Total BFLOPS 6.737 
avg_outputs = 299663 
 Allocate additional workspace_size = 26.22 MB 
Loading weights from yolov4-tiny_6200.weights...
 seen 64, trained: 793 K-images (12 Kilo-batches_64) 
Done! Loaded 38 layers from weights-file 
test.jpg: Predicted in 59.075000 milli-seconds.

Is this problem caused by the different CUDA and cuDNN versions, or is there a bug in darknet?
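To put a number on the gap between the two sample outputs, here is a minimal Python sketch that pulls the "Predicted in" value out of each log and computes the ratio. The two strings are copied from the logs above; the helper name `predicted_ms` is made up for illustration and is not part of darknet.

```python
import re

# Final timing lines copied from the two sample outputs in this report.
JETPACK_44_LINE = "test.jpg: Predicted in 769.621000 milli-seconds."
JETPACK_43_LINE = "test.jpg: Predicted in 59.075000 milli-seconds."

PREDICTED_RE = re.compile(r"Predicted in ([0-9.]+) milli-seconds")


def predicted_ms(log_text):
    """Return the first 'Predicted in ... milli-seconds' value found in log_text."""
    match = PREDICTED_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Predicted in' line found")
    return float(match.group(1))


slow_ms = predicted_ms(JETPACK_44_LINE)
fast_ms = predicted_ms(JETPACK_43_LINE)
slowdown = slow_ms / fast_ms
print(f"JetPack 4.4: {slow_ms:.1f} ms, JetPack 4.3: {fast_ms:.1f} ms, slowdown: {slowdown:.1f}x")
```

For this single image the ratio comes out at roughly 13x. The same helper can be applied to a whole saved log, since it only looks for the first matching line.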

Bug in 3rd party library

mrhosseini

Most helpful comment

Nvidia still marks JetPack 4.4 as a developer preview, so it could be an issue on their side as well.

I think this is the issue.

Here is confirmation from Nvidia that they are aware of it: https://forums.developer.nvidia.com/t/darknet-slower-using-jetpack-4-4-cudnn-8-0-0-cuda-10-2-than-jetpack-4-3-cudnn-7-6-3-cuda-10-0/121579/7

hlacik on 6 Jul 2020

All 5 comments

Compile with OPENMP=0 DEBUG=0.
Do you use jetson_clocks in both cases?
Do you use one Jetson Nano for both cases, or two different Jetson Nanos?
Run detection with the -benchmark_layers flag at the end of the command for both cases, and post the per-layer timings.

AlexeyAB on 5 Jul 2020
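The last request above could be scripted roughly as follows, so that the same command is run on both JetPack versions and the full output is saved for posting. This is only a sketch: the helper names are hypothetical, and `obj.data` and `yolov4-tiny-custom.cfg` are placeholder file names (only `yolov4-tiny_6200.weights` and `test.jpg` appear in the report). It assumes the usual `darknet detector test <data> <cfg> <weights> <image>` argument order.

```python
import subprocess


def build_benchmark_command(darknet_bin, data_file, cfg_file, weights_file, image_file):
    """Build a `darknet detector test` command with -benchmark_layers as the last argument."""
    return [
        darknet_bin, "detector", "test",
        data_file, cfg_file, weights_file, image_file,
        "-benchmark_layers",
    ]


def run_and_save(command, log_path):
    """Run the command, save combined stdout/stderr to log_path, return the exit code."""
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    with open(log_path, "w") as log_file:
        log_file.write(result.stdout)
    return result.returncode


# Hypothetical file names; only the weights and image names appear in the report.
command = build_benchmark_command(
    "./darknet", "obj.data", "yolov4-tiny-custom.cfg",
    "yolov4-tiny_6200.weights", "test.jpg",
)
print(" ".join(command))
```

On each setup one would then call something like `run_and_save(command, "jetpack44.log")` (the log name is arbitrary) and attach the resulting files to the thread.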

I can confirm the same experience: my custom yolov4_tiny went from 18 fps to 2 fps! I rolled back to JetPack 4.3 for now. Also, please note that Nvidia still marks JetPack 4.4 as a developer preview, so it could be an issue on their side as well.

hlacik on 5 Jul 2020
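The frame rates in this comment can be compared with the per-image timings in the original report by converting them to time per frame. A small sketch, with `frame_time_ms` as an illustrative helper name:

```python
def frame_time_ms(fps):
    """Convert a frame rate in frames per second to time per frame in milliseconds."""
    return 1000.0 / fps


before_ms = frame_time_ms(18)  # rate reported before the upgrade (JetPack 4.3)
after_ms = frame_time_ms(2)    # rate reported on JetPack 4.4
ratio = after_ms / before_ms
print(f"{before_ms:.1f} ms -> {after_ms:.1f} ms per frame ({ratio:.0f}x slower)")
```

Going from 18 fps to 2 fps is a 9x slowdown, which is in line with the "almost 10 times" figure in the report.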