I am training Darknet YOLO-V3 on a cat-dog dataset. During the training step, the following error occurs. Can someone help me? The error is:
layer filters size input output
0 conv 16 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 16 0.150 BFLOPs
1 max 2 x 2 / 2 416 x 416 x 16 -> 208 x 208 x 16
2 conv 32 3 x 3 / 1 208 x 208 x 16 -> 208 x 208 x 32 0.399 BFLOPs
3 max 2 x 2 / 2 208 x 208 x 32 -> 104 x 104 x 32
4 conv 64 3 x 3 / 1 104 x 104 x 32 -> 104 x 104 x 64 0.399 BFLOPs
5 max 2 x 2 / 2 104 x 104 x 64 -> 52 x 52 x 64
6 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128 0.399 BFLOPs
7 max 2 x 2 / 2 52 x 52 x 128 -> 26 x 26 x 128
8 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256 0.399 BFLOPs
9 max 2 x 2 / 2 26 x 26 x 256 -> 13 x 13 x 256
10 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BFLOPs
11 max 2 x 2 / 1 13 x 13 x 512 -> 13 x 13 x 512
12 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs
13 conv 256 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 256 0.089 BFLOPs
14 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BFLOPs
15 conv 21 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 21 0.004 BFLOPs
16 yolo
17 route 13
18 conv 128 1 x 1 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.011 BFLOPs
19 upsample 2x 13 x 13 x 128 -> 26 x 26 x 128
20 route 19 8
21 conv 256 3 x 3 / 1 26 x 26 x 384 -> 26 x 26 x 256 1.196 BFLOPs
22 conv 21 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 21 0.007 BFLOPs
23 yolo
Loading weights from darknet53.conv.74...Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
576
Floating point exception (core dumped)
How can I solve this?
I am having the exact same problem, but I cannot even get to the first Learning Rate line.
I have gone through my .cfg file line by line too.
Same problem here, trying to train darknet on CPU.
Same here. I migrated from CPU to a GPU GCloud instance but am still seeing the floating point issue. Wondering if the annotation text file conversion from BBox to YOLO format got messed up somewhere.
I have the same problem and I followed this.
I believe the floating point exception occurs because batch/subdivisions in the .cfg file does not divide to an integer. I changed the values so the division yields an integer and it started working.
@ashnaeldho could you verify this, if this helps?
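For anyone who wants to check this quickly: a rough Python sketch (the function name is mine, not part of Darknet) that parses batch and subdivisions from a .cfg in the standard key=value format and tests whether they divide evenly:

```python
import re

def check_batch_subdivisions(cfg_text):
    """Parse batch and subdivisions from a Darknet .cfg and report
    whether batch divides evenly by subdivisions."""
    values = {}
    for line in cfg_text.splitlines():
        m = re.match(r"\s*(batch|subdivisions)\s*=\s*(\d+)", line)
        if m:
            # keep only the first occurrence of each key
            values.setdefault(m.group(1), int(m.group(2)))
    batch, subs = values["batch"], values["subdivisions"]
    return batch, subs, batch % subs == 0

cfg = """[net]
batch=64
subdivisions=16
width=416
height=416
"""
print(check_batch_subdivisions(cfg))  # (64, 16, True)
```

If the last value printed is False, the floating point exception described above is a plausible cause.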
@harshthakkar01 that is the tutorial I followed too.
After fixing an incorrect path in my training set per my comment here, I was still having issues with Darknet defaulting to the CPU instead of the GPU.
I ended up needing to stop using CMake because it was improperly configuring my Makefile, which needs to include:
GPU=1
CUDNN=1
OPENCV=1
DEBUG=1
I've also seen this comment, which has helped other people, suggesting that you add
PATH=/usr/local/cuda-<YOUR_VERSION>/bin${PATH:+:${PATH}}
LD_LIBRARY_PATH=/usr/local/cuda-<YOUR_VERSION>/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
NVCC = /usr/local/cuda/bin/nvcc
to your .bashrc.
I came across the same problem.
It was caused by a small mistake in the .data file.
It was supposed to point to the train.txt file as shown below, but mine didn't:
classes = 20
train = <path-to-voc>/train.txt
valid = <path-to-voc>/2007_test.txt
Hope it helps.
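If it helps, here's a small sketch (my own helper, not part of Darknet) that flags entries in a .data file whose referenced files don't exist, assuming the standard key = value format shown above:

```python
import os

def check_data_file(path):
    """Return a list of (key, value) pairs from a Darknet .data file
    whose referenced files are missing on disk."""
    missing = []
    with open(path) as f:
        for line in f:
            if "=" not in line:
                continue
            key, value = (s.strip() for s in line.split("=", 1))
            # only these keys are expected to name existing files
            if key in ("train", "valid", "names") and not os.path.isfile(value):
                missing.append((key, value))
    return missing
```

Running it before training would have caught my mistake immediately.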
In my case the problem occurred when compiling Darknet with CMake; I switched to plain make and then it worked.
In my case, it worked after setting the subdivisions to a ~lower~ bigger number (4)
In my case, it worked after setting the subdivisions to a lower number (4)
If batch_size is also a low number, 4 for example, it still doesn't work. However, a higher batch_size with lower subdivisions often leads to the error "GPU out of memory".
@cloudy-sfu subdivisions is basically how many mini-batches the batch is split into before being passed to the model. Meaning, a batch_size of 64 and subdivisions of 64 would mean only 1 image is passed per forward pass. I have corrected my previous message: what I meant is increase the number of subdivisions, not decrease it.
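To make the arithmetic above concrete, a tiny sketch (the helper name is mine) of the mini-batch size Darknet loads per forward pass:

```python
def minibatch_size(batch, subdivisions):
    """Images processed per forward pass in Darknet: the batch is
    split into `subdivisions` mini-batches of equal size."""
    if batch % subdivisions != 0:
        raise ValueError("batch must be divisible by subdivisions")
    return batch // subdivisions

print(minibatch_size(64, 16))  # 4 images per pass
print(minibatch_size(64, 64))  # 1 image per pass
```

Raising subdivisions shrinks the per-pass memory footprint, which is why it helps with "GPU out of memory" while keeping the effective batch size at 64.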
The error occurred due to an empty train.txt file created by an external script. Since I have encountered this script several times on the web, I'll add the workaround here. While populating the test and train files, the script searches the current directory, but the path to the data isn't added to the glob pattern:
for pathAndFilename in glob.iglob(os.path.join(current_dir, "*.jpg")):
=>
for pathAndFilename in glob.iglob(os.path.join(current_dir, path_data, "*.jpg")):
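For reference, a self-contained sketch of that train/test split with the path fix applied; the function and parameter names are mine and the split ratio is illustrative, not from the original script:

```python
import glob
import os

def write_train_test(image_dir, out_dir, test_ratio=0.1):
    """Write absolute .jpg paths into train.txt / test.txt.
    Joining image_dir into the glob pattern is the fix: without it,
    only the script's own directory is searched and train.txt ends
    up empty, which triggers the floating point exception."""
    images = sorted(glob.glob(os.path.join(image_dir, "*.jpg")))
    split = int(len(images) * (1 - test_ratio))
    with open(os.path.join(out_dir, "train.txt"), "w") as f:
        f.write("\n".join(images[:split]) + "\n")
    with open(os.path.join(out_dir, "test.txt"), "w") as f:
        f.write("\n".join(images[split:]) + "\n")
```

After running it, a non-empty train.txt is a quick sanity check before launching training.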