I have an RTX 2080 Ti and am training with YOLOv2. I have CUDA 10.1 installed. I am also using the latest published release of darknet (Feb 18) rather than compiling my own.
When I train, I get normal loss values for 5-50 iterations, then the output consistently becomes 'nan'. Once a 'nan' appears, every subsequent value is 'nan' and training never recovers. I have seen numerous other issues reported where 'nan' appears occasionally, but not on every iteration and not this consistently.
Any idea what is going on or what I'm doing wrong?
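For what it's worth, I can see why the average never recovers once a single 'nan' appears: darknet's reported avg loss is (roughly) an exponential moving average, avg_loss = avg_loss*0.9 + loss*0.1, and any arithmetic involving NaN stays NaN. A tiny sketch of that behaviour (not darknet code, just an illustration):

```c
#include <math.h>
#include <stdio.h>

/* Illustrates why the "avg loss" column stays NaN forever once a single
 * batch produces a NaN loss: the moving average carries the poison forward. */
int main(void) {
    float avg_loss = 40.0f;
    float losses[] = {38.5f, NAN, 37.0f, 36.2f};  /* one bad batch */
    for (int i = 0; i < 4; ++i) {
        avg_loss = avg_loss * 0.9f + losses[i] * 0.1f;
        printf("iter %d: loss %f, avg loss %f\n", i, losses[i], avg_loss);
    }
    return 0;
}
```

So the question is really what makes that first batch blow up.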
compute_capability = 750, cudnn_half = 1
layer filters size input output
0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BF
1 max 2 x 2 / 2 416 x 416 x 32 -> 208 x 208 x 32 0.006 BF
2 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BF
3 max 2 x 2 / 2 208 x 208 x 64 -> 104 x 104 x 64 0.003 BF
4 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF
5 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BF
6 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF
7 max 2 x 2 / 2 104 x 104 x 128 -> 52 x 52 x 128 0.001 BF
8 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
9 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
10 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
11 max 2 x 2 / 2 52 x 52 x 256 -> 26 x 26 x 256 0.001 BF
12 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
13 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
14 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
15 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
16 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
17 max 2 x 2 / 2 26 x 26 x 512 -> 13 x 13 x 512 0.000 BF
18 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
19 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
20 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
21 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
22 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
23 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BF
24 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BF
25 route 16
26 conv 64 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 64 0.044 BF
27 reorg / 2 26 x 26 x 64 -> 13 x 13 x 256
28 route 27 24
29 conv 1024 3 x 3 / 1 13 x 13 x1280 -> 13 x 13 x1024 3.987 BF
30 conv 40 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 40 0.014 BF
31 detection
mask_scale: Using default '1.000000'
Total BFLOPS 29.342
Allocate additional workspace_size = 131.08 MB
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005.....
26: 37.931534, 40.051895 avg loss, 0.000000 rate, 3.514118 seconds, 1664 images
Loaded: 0.000000 seconds
Region Avg IOU: 0.233431, Class: 0.427117, Obj: 0.549900, No Obj: 0.499038, Avg Recall: 0.043478, count: 23
Region Avg IOU: 0.242383, Class: 0.389269, Obj: 0.573786, No Obj: 0.500438, Avg Recall: 0.041667, count: 48
Region Avg IOU: 0.224121, Class: 0.360028, Obj: 0.568328, No Obj: 0.499204, Avg Recall: 0.076923, count: 26
Region Avg IOU: 0.222602, Class: 0.407702, Obj: 0.547704, No Obj: 0.499739, Avg Recall: 0.000000, count: 45
Region Avg IOU: 0.229694, Class: 0.357403, Obj: 0.553654, No Obj: 0.499204, Avg Recall: 0.060606, count: 33
Region Avg IOU: 0.269491, Class: 0.401898, Obj: 0.529683, No Obj: 0.499720, Avg Recall: 0.095238, count: 42
Region Avg IOU: 0.264497, Class: 0.402740, Obj: 0.569932, No Obj: 0.498378, Avg Recall: 0.000000, count: 36
Region Avg IOU: 0.230749, Class: 0.374505, Obj: 0.553974, No Obj: 0.498808, Avg Recall: 0.030303, count: 33
Region Avg IOU: 0.188117, Class: 0.392308, Obj: 0.476783, No Obj: 0.498699, Avg Recall: 0.025000, count: 40
Region Avg IOU: 0.194585, Class: 0.306415, Obj: 0.496396, No Obj: 0.499812, Avg Recall: 0.032258, count: 31
Region Avg IOU: 0.230667, Class: 0.374361, Obj: 0.546179, No Obj: 0.498302, Avg Recall: 0.068182, count: 44
Region Avg IOU: 0.263723, Class: 0.342670, Obj: 0.541158, No Obj: 0.500428, Avg Recall: 0.117647, count: 34
Region Avg IOU: 0.193961, Class: 0.295183, Obj: 0.551441, No Obj: 0.499098, Avg Recall: 0.040000, count: 25
Region Avg IOU: 0.231925, Class: 0.329111, Obj: 0.581167, No Obj: 0.499581, Avg Recall: 0.000000, count: 27
Region Avg IOU: 0.229697, Class: 0.363912, Obj: 0.553527, No Obj: 0.498122, Avg Recall: 0.055556, count: 36
Region Avg IOU: 0.243271, Class: 0.340143, Obj: 0.572841, No Obj: 0.500377, Avg Recall: 0.032258, count: 31
Tensor Cores are disabled until the first 3000 iterations are reached.
27: 38.511658, 39.897873 avg loss, 0.000000 rate, 0.818893 seconds, 1728 images
Loaded: 0.000000 seconds
Region Avg IOU: 0.245377, Class: 0.506618, Obj: 0.551534, No Obj: 0.499389, Avg Recall: 0.000000, count: 26
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 35
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 32
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 27
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 29
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 34
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 44
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 42
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 41
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 29
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 32
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 23
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 39
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 38
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 46
Tensor Cores are disabled until the first 3000 iterations are reached.
28: -nan, -nan avg loss, 0.000000 rate, 0.797576 seconds, 1792 images
Loaded: 0.000000 seconds
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 46
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 41
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 35
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 63
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 38
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 37
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 30
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 31
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 34
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 37
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 50
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 26
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 30
Tensor Cores are disabled until the first 3000 iterations are reached.
29: -nan, -nan avg loss, 0.000000 rate, 0.782177 seconds, 1856 images
I should also note that the exact same setup works on an Azure GPU Compute instance (NC12) without issue. The only notable difference in configuration is that it has a dual K80 setup (CUDA 3.7) while the RTX 2080 Ti uses CUDA 7.5.
Hi @jklemmack,
I also have the same issue with my RTX 2060, although in my case it happens when Tensor Cores are activated (at iteration 3000). However, maybe you can try the suggestions @AlexeyAB gave me in issue #2783.
Btw, here:
> dual K80 setup (CUDA 3.7) while the RTX 2080 Ti uses CUDA 7.5

You mean CUDA 10 and cuDNN 7.5, right? If not, you should update to CUDA 10 in order to use that 2080 Ti.
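If it helps, a quick way to double-check what the build actually sees is a small standalone program against the CUDA runtime and cuDNN APIs. This is just a sketch, compiled outside darknet (roughly `nvcc version_check.c -lcudnn -o version_check`):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cudnn.h>

/* Prints the CUDA driver/runtime versions, the cuDNN version, and the
 * compute capability of GPU 0 (7.5 for an RTX 2080 Ti, 3.7 for a K80). */
int main(void) {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    /* e.g. 10010 for CUDA 10.1 */
    cudaRuntimeGetVersion(&runtime);
    printf("CUDA driver %d, runtime %d\n", driver, runtime);
    printf("cuDNN %zu\n", cudnnGetVersion());  /* e.g. 7500 for cuDNN 7.5.0 */

    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        printf("GPU 0: %s, compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);
    }
    return 0;
}
```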
@jklemmack
What CUDA and cuDNN versions do you use? Check with `nvidia-smi` and `nvcc --version`.
Did you use `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile before running `make`?
Also try to uncomment this line and recompile: https://github.com/AlexeyAB/darknet/blob/6231b748c44e2007b5c3cbf765a50b122782c5a2/Makefile#L28
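For reference, the settings in question look roughly like this (a sketch; this assumes the linked line is the commented ARCH entry for compute capability 7.5, so check the actual Makefile in your checkout):

```makefile
# set these at the top of the Makefile (or pass them on the make command line)
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1

# uncommented for the RTX 2080 Ti (Turing, compute capability 7.5)
ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]
```

Then rebuild from a clean state with `make clean && make`.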
@drapado Yes - you're right. I have CUDA 10.1 + cudnn 7.5.0.56.
@AlexeyAB I'm way out of my depth with C++, so I was hoping to use the pre-built executables in the Releases section. Sounds like I may need to learn more about the build system :(
However:
nvcc:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:26_Pacific_Standard_Time_2019
Cuda compilation tools, release 10.1, V10.1.105
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.67 Driver Version: 419.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... WDDM | 00000000:01:00.0 Off | N/A |
| 25% 32C P8 26W / 260W | 464MiB / 11264MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1520 C+G C:\Windows\System32\LogonUI.exe N/A |
| 0 2024 C+G ...6)\Google\Chrome\Application\chrome.exe N/A |
| 0 2672 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 5004 C+G C:\Windows\explorer.exe N/A |
| 0 6508 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 7112 C+G C:\Windows\System32\dwm.exe N/A |
| 0 7524 C+G C:\Windows\System32\dwm.exe N/A |
+-----------------------------------------------------------------------------+
@jklemmack
Try to check your Dataset by using Yolo_mark: https://github.com/AlexeyAB/Yolo_mark
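Bad labels (class ids out of range, box coordinates outside (0, 1), non-finite numbers) are a common cause of 'nan' losses, so besides checking visually with Yolo_mark it can help to scan the label .txt files programmatically. A minimal sketch (standalone, not part of darknet; the tool name and invocation are hypothetical, and the class count of 3 is inferred from the 40 filters in layer 30):

```c
/* label_check.c - scan YOLO-format label files for values that commonly
 * produce NaN losses. Hypothetical usage: ./label_check 3 data/obj/*.txt   */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int check_label_file(const char *path, int num_classes) {
    FILE *f = fopen(path, "r");
    if (!f) { fprintf(stderr, "cannot open %s\n", path); return 1; }
    int bad = 0, cls, line = 0;
    float x, y, w, h;
    while (fscanf(f, "%d %f %f %f %f", &cls, &x, &y, &w, &h) == 5) {
        ++line;
        if (cls < 0 || cls >= num_classes) {
            printf("%s:%d suspicious class id %d\n", path, line, cls);
            bad = 1;
        }
        if (!isfinite(x) || !isfinite(y) || !isfinite(w) || !isfinite(h) ||
            x <= 0 || x >= 1 || y <= 0 || y >= 1 ||
            w <= 0 || w > 1  || h <= 0 || h > 1) {
            printf("%s:%d suspicious box %f %f %f %f\n", path, line, x, y, w, h);
            bad = 1;
        }
    }
    fclose(f);
    return bad;
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <classes> <label.txt> ...\n", argv[0]);
        return 1;
    }
    int num_classes = atoi(argv[1]);
    int bad_files = 0;
    for (int i = 2; i < argc; ++i)
        bad_files += check_label_file(argv[i], num_classes);
    printf("%d file(s) with suspicious labels\n", bad_files);
    return bad_files ? 1 : 0;
}
```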
@AlexeyAB I've struggled to get everything built, but it all may have been a massive misdirection. Turns out that a batch of GeForce RTX 2080 Ti cards are [known to be defective](https://www.tomshardware.com/news/rtx-2080-ti-gpu-defects-launch,37995.html).
You can close this for now, and I'll re-open if I experience the issue(s) again after verifying my data set and local build.
Closed b/c of hardware defect!
@jklemmack Did you try the same CUDA/cuDNN/darknet code/model/dataset on another GPU, and did it work well?
Yes. I tried it on both an Azure NC-series VM (per this comment) and on a woefully under-powered mobile GTX 960M. Both ran without issue.
@jklemmack
Are you using a video card from an early production run, from 2018? As far as I know, this bug (with overheating memory) has been fixed in newer GPUs.
The card was purchased ~Feb 2019, but it exhibited all the signs of being a defective RTX 2080 Ti: random characters displayed on screen, then (as I now know) system restarts after it's under load for a bit. It didn't smoke out, but it did hit 38+C according to the NVIDIA tools. It is currently RMAed at MSI.
Got the replacement card yesterday. I dropped it in and ran the exact same training (same .exe, same commands, etc., on a machine that hasn't been touched since this issue was opened), and I've run over 5000 training iterations without issue.
Root problem was likely a defective board.
It is very interesting that even the February cards are buggy.
Thanks for the information!