Hi,
Using a customized yolov4-tiny cfg file, average prediction time of darknet detector test is almost 10 times slower on Nvidia JetPack 4.4 compared to JetPack 4.3. I got similar results for another cfg file based on yolov3-tiny.
Hardware: Nvidia Jetson Nano
Makefile Variables:
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=0
AVX=0
OPENMP=1
LIBSO=0
ZED_CAMERA=0
Sample Output for Jetpack 4.4:
CUDA-version: 10020 (10020), cuDNN: 8.0.0, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV isn't used - data augmentation will be slow
0 : compute_capability = 530, cudnn_half = 0, GPU: NVIDIA Tegra X1
net.optimized_memory = 0
mini_batch = 1, batch = 16, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 2 416 x 416 x 1 -> 208 x 208 x 32 0.025 BF
1 conv 64 3 x 3/ 2 208 x 208 x 32 -> 104 x 104 x 64 0.399 BF
2 conv 64 3 x 3/ 1 104 x 104 x 64 -> 104 x 104 x 64 0.797 BF
3 route 2 1/2 -> 104 x 104 x 32
4 conv 32 3 x 3/ 1 104 x 104 x 32 -> 104 x 104 x 32 0.199 BF
5 conv 32 3 x 3/ 1 104 x 104 x 32 -> 104 x 104 x 32 0.199 BF
6 route 5 4 -> 104 x 104 x 64
7 conv 64 1 x 1/ 1 104 x 104 x 64 -> 104 x 104 x 64 0.089 BF
8 route 2 7 -> 104 x 104 x 128
9 max 2x 2/ 2 104 x 104 x 128 -> 52 x 52 x 128 0.001 BF
10 conv 128 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 128 0.797 BF
11 route 10 1/2 -> 52 x 52 x 64
12 conv 64 3 x 3/ 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF
13 conv 64 3 x 3/ 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF
14 route 13 12 -> 52 x 52 x 128
15 conv 128 1 x 1/ 1 52 x 52 x 128 -> 52 x 52 x 128 0.089 BF
16 route 10 15 -> 52 x 52 x 256
17 max 2x 2/ 2 52 x 52 x 256 -> 26 x 26 x 256 0.001 BF
18 conv 256 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 256 0.797 BF
19 route 18 1/2 -> 26 x 26 x 128
20 conv 128 3 x 3/ 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF
21 conv 128 3 x 3/ 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF
22 route 21 20 -> 26 x 26 x 256
23 conv 256 1 x 1/ 1 26 x 26 x 256 -> 26 x 26 x 256 0.089 BF
24 route 18 23 -> 26 x 26 x 512
25 max 2x 2/ 2 26 x 26 x 512 -> 13 x 13 x 512 0.000 BF
26 conv 512 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x 512 0.797 BF
27 conv 256 1 x 1/ 1 13 x 13 x 512 -> 13 x 13 x 256 0.044 BF
28 conv 512 3 x 3/ 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BF
29 conv 18 1 x 1/ 1 13 x 13 x 512 -> 13 x 13 x 18 0.003 BF
30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
31 route 27 -> 13 x 13 x 256
32 conv 128 1 x 1/ 1 13 x 13 x 256 -> 13 x 13 x 128 0.011 BF
33 upsample 2x 13 x 13 x 128 -> 26 x 26 x 128
34 route 33 23 -> 26 x 26 x 384
35 conv 256 3 x 3/ 1 26 x 26 x 384 -> 26 x 26 x 256 1.196 BF
36 conv 18 1 x 1/ 1 26 x 26 x 256 -> 26 x 26 x 18 0.006 BF
37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 6.737
avg_outputs = 299663
Allocate additional workspace_size = 24.92 MB
Loading weights from yolov4-tiny_6200.weights...
seen 64, trained: 793 K-images (12 Kilo-batches_64)
Done! Loaded 38 layers from weights-file
test.jpg: Predicted in 769.621000 milli-seconds.
Sample Output for Jetpack 4.3:
CUDA-version: 10000 (10000), cuDNN: 7.6.3, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV isn't used - data augmentation will be slow
0 : compute_capability = 530, cudnn_half = 0, GPU: NVIDIA Tegra X1
net.optimized_memory = 0
mini_batch = 1, batch = 16, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 2 416 x 416 x 1 -> 208 x 208 x 32 0.025 BF
1 conv 64 3 x 3/ 2 208 x 208 x 32 -> 104 x 104 x 64 0.399 BF
2 conv 64 3 x 3/ 1 104 x 104 x 64 -> 104 x 104 x 64 0.797 BF
3 route 2 1/2 -> 104 x 104 x 32
4 conv 32 3 x 3/ 1 104 x 104 x 32 -> 104 x 104 x 32 0.199 BF
5 conv 32 3 x 3/ 1 104 x 104 x 32 -> 104 x 104 x 32 0.199 BF
6 route 5 4 -> 104 x 104 x 64
7 conv 64 1 x 1/ 1 104 x 104 x 64 -> 104 x 104 x 64 0.089 BF
8 route 2 7 -> 104 x 104 x 128
9 max 2x 2/ 2 104 x 104 x 128 -> 52 x 52 x 128 0.001 BF
10 conv 128 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 128 0.797 BF
11 route 10 1/2 -> 52 x 52 x 64
12 conv 64 3 x 3/ 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF
13 conv 64 3 x 3/ 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF
14 route 13 12 -> 52 x 52 x 128
15 conv 128 1 x 1/ 1 52 x 52 x 128 -> 52 x 52 x 128 0.089 BF
16 route 10 15 -> 52 x 52 x 256
17 max 2x 2/ 2 52 x 52 x 256 -> 26 x 26 x 256 0.001 BF
18 conv 256 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 256 0.797 BF
19 route 18 1/2 -> 26 x 26 x 128
20 conv 128 3 x 3/ 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF
21 conv 128 3 x 3/ 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF
22 route 21 20 -> 26 x 26 x 256
23 conv 256 1 x 1/ 1 26 x 26 x 256 -> 26 x 26 x 256 0.089 BF
24 route 18 23 -> 26 x 26 x 512
25 max 2x 2/ 2 26 x 26 x 512 -> 13 x 13 x 512 0.000 BF
26 conv 512 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x 512 0.797 BF
27 conv 256 1 x 1/ 1 13 x 13 x 512 -> 13 x 13 x 256 0.044 BF
28 conv 512 3 x 3/ 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BF
29 conv 18 1 x 1/ 1 13 x 13 x 512 -> 13 x 13 x 18 0.003 BF
30 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
31 route 27 -> 13 x 13 x 256
32 conv 128 1 x 1/ 1 13 x 13 x 256 -> 13 x 13 x 128 0.011 BF
33 upsample 2x 13 x 13 x 128 -> 26 x 26 x 128
34 route 33 23 -> 26 x 26 x 384
35 conv 256 3 x 3/ 1 26 x 26 x 384 -> 26 x 26 x 256 1.196 BF
36 conv 18 1 x 1/ 1 26 x 26 x 256 -> 26 x 26 x 18 0.006 BF
37 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 6.737
avg_outputs = 299663
Allocate additional workspace_size = 26.22 MB
Loading weights from yolov4-tiny_6200.weights...
seen 64, trained: 793 K-images (12 Kilo-batches_64)
Done! Loaded 38 layers from weights-file
test.jpg: Predicted in 59.075000 milli-seconds.
Is this problem because of different CUDA and CUDNN versions? Or is there a bug with darknet?
Compile with OPENMP=0 DEBUG=0
Do you use jetson_clocks in both cases?
Do you use 1 Jetson nano or 2 jetsons nano?
Run detection with flag -benchmark_layers at the end of command for both cases and post results with timings for each layer.
i can confirm same experience / my custom yolov4_tiny went from 18fps to 2fps.! rolled back to JetPack 4.3 for now. Also please note that nvidia still marks JetPack 4.4 as developer preview! so it could be issue on their side as well.
nvidia still marks JetPack 4.4 as developer preview! so it could be issue on their side as well.
I think this is the issue.
nvidia still marks JetPack 4.4 as developer preview! so it could be issue on their side as well.
I think this is the issue.
Here, confirmation from nvidia that they are aware https://forums.developer.nvidia.com/t/darknet-slower-using-jetpack-4-4-cudnn-8-0-0-cuda-10-2-than-jetpack-4-3-cudnn-7-6-3-cuda-10-0/121579/7
Thank you @AlexeyAB and @hlacik
Most helpful comment
Here, confirmation from nvidia that they are aware https://forums.developer.nvidia.com/t/darknet-slower-using-jetpack-4-4-cudnn-8-0-0-cuda-10-2-than-jetpack-4-3-cudnn-7-6-3-cuda-10-0/121579/7