I have an RTX 2080 Ti and am training with YOLOv2. I have CUDA 10.1 installed. I am also using the latest published release of darknet (Feb 18) rather than compiling my own.
When I train, I get normal loss values for 5-50 iterations, then the output consistently becomes 'nan'. Once a 'nan' appears, every subsequent value is 'nan' and training never recovers. I have seen numerous other issues reported where 'nan' appears occasionally, but not on every iteration and not this consistently.
Any idea what is going on or what I'm doing wrong?
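For what it's worth, I can see why the average never recovers once a single 'nan' appears: darknet's reported avg loss is (roughly) an exponential moving average, avg_loss = avg_loss*0.9 + loss*0.1, and any arithmetic involving NaN stays NaN. A tiny sketch of that behaviour (not darknet code, just an illustration):

```c
#include <math.h>
#include <stdio.h>

/* Illustrates why the "avg loss" column stays NaN forever once a single
 * batch produces a NaN loss: the moving average carries the poison forward. */
int main(void) {
    float avg_loss = 40.0f;
    float losses[] = {38.5f, NAN, 37.0f, 36.2f};  /* one bad batch */
    for (int i = 0; i < 4; ++i) {
        avg_loss = avg_loss * 0.9f + losses[i] * 0.1f;
        printf("iter %d: loss %f, avg loss %f\n", i, losses[i], avg_loss);
    }
    return 0;
}
```

So the question is really what makes that first batch blow up.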
compute_capability = 750, cudnn_half = 1
layer filters size input output
0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BF
1 max 2 x 2 / 2 416 x 416 x 32 -> 208 x 208 x 32 0.006 BF
2 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BF
3 max 2 x 2 / 2 208 x 208 x 64 -> 104 x 104 x 64 0.003 BF
4 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF
5 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BF
6 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF
7 max 2 x 2 / 2 104 x 104 x 128 -> 52 x 52 x 128 0.001 BF
8 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
9 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
10 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
11 max 2 x 2 / 2 52 x 52 x 256 -> 26 x 26 x 256 0.001 BF
12 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
13 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
14 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
15 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
16 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
17 max 2 x 2 / 2 26 x 26 x 512 -> 13 x 13 x 512 0.000 BF
18 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
19 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
20 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
21 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
22 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
23 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BF
24 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BF
25 route 16
26 conv 64 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 64 0.044 BF
27 reorg / 2 26 x 26 x 64 -> 13 x 13 x 256
28 route 27 24
29 conv 1024 3 x 3 / 1 13 x 13 x1280 -> 13 x 13 x1024 3.987 BF
30 conv 40 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 40 0.014 BF
31 detection
mask_scale: Using default '1.000000'
Total BFLOPS 29.342
Allocate additional workspace_size = 131.08 MB
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005.....
26: 37.931534, 40.051895 avg loss, 0.000000 rate, 3.514118 seconds, 1664 images
Loaded: 0.000000 seconds
Region Avg IOU: 0.233431, Class: 0.427117, Obj: 0.549900, No Obj: 0.499038, Avg Recall: 0.043478, count: 23
Region Avg IOU: 0.242383, Class: 0.389269, Obj: 0.573786, No Obj: 0.500438, Avg Recall: 0.041667, count: 48
Region Avg IOU: 0.224121, Class: 0.360028, Obj: 0.568328, No Obj: 0.499204, Avg Recall: 0.076923, count: 26
Region Avg IOU: 0.222602, Class: 0.407702, Obj: 0.547704, No Obj: 0.499739, Avg Recall: 0.000000, count: 45
Region Avg IOU: 0.229694, Class: 0.357403, Obj: 0.553654, No Obj: 0.499204, Avg Recall: 0.060606, count: 33
Region Avg IOU: 0.269491, Class: 0.401898, Obj: 0.529683, No Obj: 0.499720, Avg Recall: 0.095238, count: 42
Region Avg IOU: 0.264497, Class: 0.402740, Obj: 0.569932, No Obj: 0.498378, Avg Recall: 0.000000, count: 36
Region Avg IOU: 0.230749, Class: 0.374505, Obj: 0.553974, No Obj: 0.498808, Avg Recall: 0.030303, count: 33
Region Avg IOU: 0.188117, Class: 0.392308, Obj: 0.476783, No Obj: 0.498699, Avg Recall: 0.025000, count: 40
Region Avg IOU: 0.194585, Class: 0.306415, Obj: 0.496396, No Obj: 0.499812, Avg Recall: 0.032258, count: 31
Region Avg IOU: 0.230667, Class: 0.374361, Obj: 0.546179, No Obj: 0.498302, Avg Recall: 0.068182, count: 44
Region Avg IOU: 0.263723, Class: 0.342670, Obj: 0.541158, No Obj: 0.500428, Avg Recall: 0.117647, count: 34
Region Avg IOU: 0.193961, Class: 0.295183, Obj: 0.551441, No Obj: 0.499098, Avg Recall: 0.040000, count: 25
Region Avg IOU: 0.231925, Class: 0.329111, Obj: 0.581167, No Obj: 0.499581, Avg Recall: 0.000000, count: 27
Region Avg IOU: 0.229697, Class: 0.363912, Obj: 0.553527, No Obj: 0.498122, Avg Recall: 0.055556, count: 36
Region Avg IOU: 0.243271, Class: 0.340143, Obj: 0.572841, No Obj: 0.500377, Avg Recall: 0.032258, count: 31
Tensor Cores are disabled until the first 3000 iterations are reached.
27: 38.511658, 39.897873 avg loss, 0.000000 rate, 0.818893 seconds, 1728 images
Loaded: 0.000000 seconds
Region Avg IOU: 0.245377, Class: 0.506618, Obj: 0.551534, No Obj: 0.499389, Avg Recall: 0.000000, count: 26
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 35
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 32
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 27
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 29
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 34
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 44
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 42
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 41
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 29
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 32
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 23
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 39
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 38
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 46
Tensor Cores are disabled until the first 3000 iterations are reached.
28: -nan, -nan avg loss, 0.000000 rate, 0.797576 seconds, 1792 images
Loaded: 0.000000 seconds
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 46
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 41
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 35
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 63
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 38
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 37
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 30
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 31
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 34
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 37
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 36
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 50
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 26
Region Avg IOU: nan, Class: nan, Obj: -nan, No Obj: -nan, Avg Recall: 0.000000, count: 30
Tensor Cores are disabled until the first 3000 iterations are reached.
29: -nan, -nan avg loss, 0.000000 rate, 0.782177 seconds, 1856 images
I should also note that the exact same setup works on an Azure GPU Compute instance (NC12) without issue. The only notable difference in configuration is that it has a dual K80 setup (CUDA 3.7) while the RTX 2080 Ti uses CUDA 7.5.
Hi @jklemmack,
I also have the same issue with my RTX 2060, although in my case it happens when Tensor Cores are activated (at iteration 3000). However, maybe you can try the suggestions @AlexeyAB gave me in issue #2783.
Btw, here:
> dual K80 setup (CUDA 3.7) while the RTX 2080 Ti uses CUDA 7.5

You mean CUDA 10 and cuDNN 7.5, right? If not, you should update to CUDA 10 in order to use that 2080 Ti.
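If it helps, a quick way to double-check what the build actually sees is a small standalone program against the CUDA runtime and cuDNN APIs. This is just a sketch, compiled outside darknet (roughly `nvcc version_check.c -lcudnn -o version_check`):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cudnn.h>

/* Prints the CUDA driver/runtime versions, the cuDNN version, and the
 * compute capability of GPU 0 (7.5 for an RTX 2080 Ti, 3.7 for a K80). */
int main(void) {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    /* e.g. 10010 for CUDA 10.1 */
    cudaRuntimeGetVersion(&runtime);
    printf("CUDA driver %d, runtime %d\n", driver, runtime);
    printf("cuDNN %zu\n", cudnnGetVersion());  /* e.g. 7500 for cuDNN 7.5.0 */

    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        printf("GPU 0: %s, compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);
    }
    return 0;
}
```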
@jklemmack
What CUDA and cuDNN versions do you use? Check with `nvidia-smi` and `nvcc --version`.
Did you use `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile before running `make`?
Also try to uncomment this line and recompile: https://github.com/AlexeyAB/darknet/blob/6231b748c44e2007b5c3cbf765a50b122782c5a2/Makefile#L28
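For reference, the settings in question look roughly like this (a sketch; this assumes the linked line is the commented ARCH entry for compute capability 7.5, so check the actual Makefile in your checkout):

```makefile
# set these at the top of the Makefile (or pass them on the make command line)
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1

# uncommented for the RTX 2080 Ti (Turing, compute capability 7.5)
ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]
```

Then rebuild from a clean state with `make clean && make`.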
@drapado Yes - you're right. I have CUDA 10.1 + cudnn 7.5.0.56.
@AlexeyAB I'm way out of my depth with C++, so I was hoping to use the pre-built executables in the Releases section. Sounds like I may need to learn more about the build system :(
However:
nvcc:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:26_Pacific_Standard_Time_2019
Cuda compilation tools, release 10.1, V10.1.105
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.67 Driver Version: 419.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... WDDM | 00000000:01:00.0 Off | N/A |
| 25% 32C P8 26W / 260W | 464MiB / 11264MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1520 C+G C:\Windows\System32\LogonUI.exe N/A |
| 0 2024 C+G ...6)\Google\Chrome\Application\chrome.exe N/A |
| 0 2672 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 0 5004 C+G C:\Windows\explorer.exe N/A |
| 0 6508 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 7112 C+G C:\Windows\System32\dwm.exe N/A |
| 0 7524 C+G C:\Windows\System32\dwm.exe N/A |
+-----------------------------------------------------------------------------+
@jklemmack
Try to check your Dataset by using Yolo_mark: https://github.com/AlexeyAB/Yolo_mark
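Bad labels (class ids out of range, box coordinates outside (0, 1), non-finite numbers) are a common cause of 'nan' losses, so besides checking visually with Yolo_mark it can help to scan the label .txt files programmatically. A minimal sketch (standalone, not part of darknet; the tool name and invocation are hypothetical, and the class count of 3 is inferred from the 40 filters in layer 30):

```c
/* label_check.c - scan YOLO-format label files for values that commonly
 * produce NaN losses. Hypothetical usage: ./label_check 3 data/obj/*.txt   */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int check_label_file(const char *path, int num_classes) {
    FILE *f = fopen(path, "r");
    if (!f) { fprintf(stderr, "cannot open %s\n", path); return 1; }
    int bad = 0, cls, line = 0;
    float x, y, w, h;
    while (fscanf(f, "%d %f %f %f %f", &cls, &x, &y, &w, &h) == 5) {
        ++line;
        if (cls < 0 || cls >= num_classes) {
            printf("%s:%d suspicious class id %d\n", path, line, cls);
            bad = 1;
        }
        if (!isfinite(x) || !isfinite(y) || !isfinite(w) || !isfinite(h) ||
            x <= 0 || x >= 1 || y <= 0 || y >= 1 ||
            w <= 0 || w > 1  || h <= 0 || h > 1) {
            printf("%s:%d suspicious box %f %f %f %f\n", path, line, x, y, w, h);
            bad = 1;
        }
    }
    fclose(f);
    return bad;
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <classes> <label.txt> ...\n", argv[0]);
        return 1;
    }
    int num_classes = atoi(argv[1]);
    int bad_files = 0;
    for (int i = 2; i < argc; ++i)
        bad_files += check_label_file(argv[i], num_classes);
    printf("%d file(s) with suspicious labels\n", bad_files);
    return bad_files ? 1 : 0;
}
```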
@AlexeyAB I've struggled to get everything built, but it all may have been a massive misdirection. Turns out that a batch of GeForce RTX 2080 Ti cards are [known to be defective](https://www.tomshardware.com/news/rtx-2080-ti-gpu-defects-launch,37995.html).
You can close this for now, and I'll re-open if I experience the issue(s) again after verifying my data set and local build.
Closed b/c of hardware defect!
@jklemmack Did you try the same CUDA/cuDNN/darknet code/model/dataset on another GPU, and did it work well?
Yes. I tried it on both an Azure NC-series VM (per this comment) and on a woefully under-powered mobile GTX 960M. Both ran without issue.
@jklemmack
Are you using a video card from an early production run, from 2018? As far as I know, this bug (with overheating memory) has been fixed in newer GPUs.
The card was purchased ~Feb 2019, but it exhibited all the signs of being a defective RTX 2080 Ti: random characters displayed on screen, then (as I now know) system restarts after it's under load for a bit. It didn't smoke out, but it did hit 38+C according to the NVIDIA tools. It is currently RMAed at MSI.
Got the replacement card yesterday. I dropped it in and ran the exact same training (same .exe, same commands, etc., on a machine that hasn't been touched since this issue was opened), and I've run over 5000 training iterations without issue.
Root problem was likely a defective board.
It is very interesting that even the February cards are buggy.
Thanks for the information!