avg loss = nan (CUDNN=1)

$ docker run -v ~/darknet/mydata:/app/darknet/data -it --gpus device=0 myname/darknet:1.0.3 detector train -dont_show ./data/obj.data ./data/yolo4.cfg ./data/backup/yolov4.conv.137
CUDA-version: 10020 (11010), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 3.2.0
valid: Using default 'data/train.txt'
yolo4
0 : compute_capability = 860, cudnn_half = 1, GPU: GeForce RTX 3090
net.optimized_memory = 0
mini_batch = 2, batch = 64, time_steps = 1, train = 1
layer filters size/strd(dil) input output
0
### HANGS HERE FOR 30 MINUTES ###
conv 32 3 x 3/ 1 768 x 768 x 3 -> 768 x 768 x 32 1.019 BF
1 conv 64 3 x 3/ 2 768 x 768 x 32 -> 384 x 384 x 64 5.436 BF
2 conv 64 1 x 1/ 1 384 x 384 x 64 -> 384 x 384 x 64 1.208 BF
3 route 1 -> 384 x 384 x 64
4 conv 64 1 x 1/ 1 384 x 384 x 64 -> 384 x 384 x 64 1.208 BF
5 conv 32 1 x 1/ 1 384 x 384 x 64 -> 384 x 384 x 32 0.604 BF
6 conv 64 3 x 3/ 1 384 x 384 x 32 -> 384 x 384 x 64 5.436 BF
7 Shortcut Layer: 4, wt = 0, wn = 0, outputs: 384 x 384 x 64 0.009 BF
8 conv 64 1 x 1/ 1 384 x 384 x 64 -> 384 x 384 x 64 1.208 BF
9 route 8 2 -> 384 x 384 x 128
### ... ###
Total BFLOPS 203.057
avg_outputs = 1670200
Allocate additional workspace_size = 52.43 MB
Loading weights from ./data/backup/yolov4.conv.137...
seen 64, trained: 0 K-images (0 Kilo-batches_64)
Done! Loaded 137 layers from weights-file
Learning Rate: 0.001, Momentum: 0.949, Decay: 0.0005
Detection layer: 139 - type = 28
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
Resizing, random_coef = 1.40
1120 x 1120
Create 6 permanent cpu-threads
try to allocate additional workspace_size = 52.43 MB
CUDA allocate done!
Loaded: 0.000061 seconds
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 139 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.509620, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 15572.158203, iou_loss = 0.000000, total_loss = 15572.158203
### ... ###
1: -nan, -nan avg loss, 0.000000 rate, 5.492178 seconds, 64 images, -1.000000 hours left
### ... ###
2: -nan, -nan avg loss, 0.000000 rate, 5.540169 seconds, 128 images, 45.767176 hours left
Apart from the workstation, I installed darknet using vcpkg on a Windows client and trained the same dataset with GPU. It worked well and the trained model seems fine, so I think there is nothing wrong with the dataset.
I changed the compile options and tested:
- CUDNN=0 CUDNN_HALF=0: fine (of course it's slow, since cuDNN is disabled)
- CUDNN=1 CUDNN_HALF=0: bug occurs
- CUDNN=1 CUDNN_HALF=1: bug occurs
Thus, I concluded this bug occurs only if darknet is built with CUDNN=1.
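For reference, these three configurations can be rebuilt from the darknet source tree roughly like this (a sketch assuming the standard AlexeyAB Makefile, whose GPU/CUDNN/CUDNN_HALF variables can be overridden on the make command line):

    $ make clean && make GPU=1 CUDNN=0 CUDNN_HALF=0    # fine, but slow
    $ make clean && make GPU=1 CUDNN=1 CUDNN_HALF=0    # bug occurs
    $ make clean && make GPU=1 CUDNN=1 CUDNN_HALF=1    # bug occurs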
Also, I tested several CUDA versions:
- CUDA 10.0: bug occurs
- CUDA 10.1: bug occurs
- CUDA 10.2: bug occurs
- CUDA 11.0: compile fails with "Unsupported gpu architecture 'compute_30'"
So the CUDA version does not matter. The compile failure with CUDA 11 seems to be a separate issue.
I then tested some older darknet versions. Sorry, I forget exactly which versions I tried, but it still failed with a version from around March 2020, so I don't think this is caused by a recent change. It might be a compatibility issue with the RTX 3090, Docker, or cuDNN 7.6.5.
I'm facing the exact same issue, also running on an RTX 3090. It shouldn't be a Docker-related issue, since cuDNN works fine through Docker on AWS GPU-accelerated instances (using Tesla cards).
I tested cuDNN with a TITAN X and it worked well. Maybe darknet does not fully support the RTX 3090...
I tried:
- CUDNN=1 CUDNN_HALF=1
- cuDNN 8.0.4
- CUDA 11.1
- OpenCV 4.5
- commented out the compute_30 gencode line in the Makefile
- added: -gencode arch=compute_86,code=[sm_86,compute_86]
Build is OK!
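For anyone applying the same fix, the Makefile edit looks roughly like this (a sketch; the exact default ARCH lines vary between darknet revisions, and compute_86 corresponds to the RTX 3090's Ampere compute capability 8.6):

    # In darknet's Makefile, comment out the Kepler entry that CUDA 11 rejects:
    #   -gencode arch=compute_30,code=sm_30
    # and add an Ampere entry instead:
    ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]

Then rebuild with make clean && make.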
@takashide Thanks for the solution. Works great on CUDA 11.1 CUDNN Docker Image, with your suggested modifications
Thank you @takashide 👍
Do you have inference performance metrics with the RTX 3090?
In my experience, it's more than twice as fast as a 1080 Ti.
@AlgirdasKartavicius I recently compared an RTX 3090 vs an MSI GS65 laptop running an RTX 2070.
Inference:
RTX 3090 - 0.047 seconds
RTX 2070 Laptop card - 0.11 seconds
Planning to compare it also with the 3070, 3080, and any other NVIDIA cards I can get my hands on, since it's hard to find good comparisons for deep learning, and YOLO specifically.