Hello,
I am using the original repo on my Linux system. I am trying to train a tiny YOLO model, but it gives the following error after some steps. The step number is not consistent: sometimes it throws the error at the 2nd step, sometimes at the 40th. Can you please help me with it?
41: 107.253304, 415.541656 avg, 0.001000 rate, 4.394401 seconds, 1312 images
Loaded: 0.000124 seconds
Region Avg IOU: 0.018014, Class: 1.000000, Obj: 0.002887, No Obj: 0.003492, Avg Recall: 0.000000, count: 18
Region Avg IOU: 0.401208, Class: 1.000000, Obj: 0.007435, No Obj: 0.007720, Avg Recall: 0.500000, count: 4
Region Avg IOU: 0.107763, Class: 1.000000, Obj: 0.003388, No Obj: 0.004642, Avg Recall: 0.142857, count: 7
Segmentation fault (core dumped)
I am using the following command:
./darknet detector train data/obj.data cfg/tiny-yolo-voc-obj.cfg tiny-yolo-voc.conv.13
Hi, maybe something is wrong with your dataset.
Hello @AlexeyAB,
Thank you for your quick response.
Do you use OpenCV, CUDA and cuDNN?
Yes, I do. I am using it on an NVIDIA Jetson TX2 board.
What OpenCV version do you use?
3.4.0
Do you get this error if you use this repo for training on Linux?
I am running on the Jetson TX2 (Linux) only.
It starts training with these steps:
1: 206.234924, 206.234924 avg, 0.001000 rate, 0.910286 seconds, 12 images
Loaded: 0.000072 seconds
Region Avg IOU: 0.002696, Class: 1.000000, Obj: 0.091525, No Obj: 0.225260, Avg Recall: 0.000000, count: 17
Region Avg IOU: 0.002618, Class: 1.000000, Obj: 0.264026, No Obj: 0.211167, Avg Recall: 0.000000, count: 33
Region Avg IOU: 0.000733, Class: 1.000000, Obj: 0.263204, No Obj: 0.214708, Avg Recall: 0.000000, count: 32
Region Avg IOU: 0.000000, Class: 1.000000, Obj: 0.147834, No Obj: 0.221345, Avg Recall: 0.000000, count: 11
but in the last steps before it fails, it shows data like this:
10: nan, nan avg, 0.001000 rate, 1.093313 seconds, 120 images
Resizing
544
Loaded: 0.000286 seconds
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: nan, Avg Recall: 0.000000, count: 30
Region Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, Avg Recall: 0.000000, count: 20
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: nan, Avg Recall: 0.000000, count: 20
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: nan, Avg Recall: 0.000000, count: 23
Try to train with OPENCV=0 in the Makefile and with random=0 in your cfg file.
Try to use this repo on Linux.
I tried setting both of these, but the error is the same.
This is strange. Try to use CUDNN=0, or use this repo. Sometimes people face a problem when using cuDNN on the TX2: https://github.com/AlexeyAB/darknet/issues/436
If the error still occurs, then the problem might be in your dataset.
It's not working either. Can you tell me how I can change batch and subdivisions? Is there any relation between these two?
In terms of the dataset, I have done something like this:
obj.data
classes=0
train = data/train.txt
valid = data/test.txt
names = data/obj.names
backup = backup/
The train.txt file contains:
data/obj/img1.jpg
data/obj/img2.jpg
data/obj/img3.jpg
.
.
data/obj/img500.jpg
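A list like this can be generated with a short script (a sketch; the helper name and glob pattern are my own, not from the thread):

```python
# Hypothetical helper for building train.txt: list every image under
# data/obj/ with paths relative to the darknet working directory.
import glob


def write_train_list(pattern="data/obj/*.jpg", out_path="data/train.txt"):
    """Write one image path per line and return how many were found."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w") as f:
        f.write("\n".join(paths) + "\n")
    return len(paths)
```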
cfg file:
[convolutional]
filters=30
[region]
classes=1
img1.txt
0 0.66 0.5 0.3 0.2
obj.names
face
Everything looks correct, except for
obj.data
classes=0
It should be classes=1.
Did you get your img1.txt using the Yolo_mark tool? https://github.com/AlexeyAB/Yolo_mark
You can try to change batch and subdivisions in your cfg file: https://github.com/AlexeyAB/darknet/blob/880cf187d87c904f5fe574802ecff99118643f2d/cfg/yolo-voc.2.0.cfg#L2-L3
Memory consumption is roughly proportional to batch/subdivisions.
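For context, those are the first two parameters of the [net] section; the yolo-voc.2.0.cfg defaults linked above look like this:

```
[net]
batch=64
subdivisions=8
```

Each iteration processes batch images, fed to the GPU in subdivisions chunks of batch/subdivisions images each, so raising subdivisions (or lowering batch) reduces GPU memory use.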
I have not used the Yolo_mark tool. I already had all of my coordinates, so I just converted them to text files.
You should check that each of your imgX.txt files contains correct values: 0 < x, y, w, h < 1, and also x + w/2 < 1 and y + h/2 < 1.
Also check that no imgX.txt file contains extra empty lines.
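These checks can be applied with a short script (a hypothetical helper, not part of darknet; the function name is my own):

```python
# Hypothetical validation script: flags label lines that violate the
# ranges described above, plus empty lines and malformed rows.
def check_label_file(path):
    """Return a list of human-readable problems found in one imgX.txt file."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                problems.append(f"{path}:{lineno}: empty line")
                continue
            parts = line.split()
            if len(parts) != 5:
                problems.append(f"{path}:{lineno}: expected 5 fields, got {len(parts)}")
                continue
            x, y, w, h = map(float, parts[1:])
            if not (0 < x < 1 and 0 < y < 1 and 0 < w < 1 and 0 < h < 1):
                problems.append(f"{path}:{lineno}: x/y/w/h must be in (0, 1)")
            if x + w / 2 > 1 or y + h / 2 > 1:
                problems.append(f"{path}:{lineno}: box extends past the image edge")
    return problems
```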
Just to make sure
Are X and Y here the top-left coordinates relative to width and height, OR the rectangle's center point relative to width and height?
I mean is it
"top left x"/width , "top left y"/height
OR
"rectangle center x"/width , "rectangle center y"/height?
and W and H are rectangle size relative to width and height.
"rectangle center x"/width , "rectangle center y"/height?
Yes, x, y is the center of the object relative to the image size. I fixed my answer.
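The conversion described above can be sketched like this (function name and example numbers are my own, chosen so the output matches the img1.txt line shown earlier in the thread):

```python
# Illustrative helper: convert a box given as a top-left pixel corner plus
# pixel size into the normalized center-based format YOLO labels expect.
def to_yolo(left, top, box_w, box_h, img_w, img_h):
    x_center = (left + box_w / 2) / img_w
    y_center = (top + box_h / 2) / img_h
    return x_center, y_center, box_w / img_w, box_h / img_h


# A 300x200 box whose top-left corner is at (510, 400) in a 1000x1000 image:
print(to_yolo(510, 400, 300, 200, 1000, 1000))  # → (0.66, 0.5, 0.3, 0.2)
```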
Hello,
I managed to run training (on a MacBook with a 1080 Ti, after adding a 2nd class and reducing my dataset). But now it only saves weights up to iteration 900. Can you please let me know how to change that?
Thank you
@smitshilu Change this line: https://github.com/pjreddie/darknet/blob/80d9bec20f0a44ab07616215c6eadb2d633492fe/examples/detector.c#L136
to this: if (i % 100 == 0) {
Or use this fork: https://github.com/AlexeyAB/darknet
I think the reason behind the segmentation fault or bus error is a large dataset. I tried using almost 3000, 2000, 1000, and 500 images, but I got the error each time. When I tried with 50 images it worked and ran for around 6000 steps. I am now using a MacBook with a 12 GB NVIDIA GTX 1080 Ti and facing the same problem.
Is there any solution to this problem? What should I do if I want to train on 3000 images?
I train with 10,000 and 40,000 images, and it works well.
Try to run ./darknet detector recall data/obj.data yolo-obj.cfg backup/yolo-obj_7000.weights for your data, cfg, and weights files, with a train.txt that contains all 3000 images. Does it run to the end successfully?
What CUDA and cuDNN versions do you use?
Try to use my repository: https://github.com/AlexeyAB/darknet
Is there a segmentation error during training?
Yes on Linux I am getting Segmentation Fault
This is very strange. Probably something is wrong with CUDA, cuDNN, or the NVIDIA GPU driver.
I used your repo and am getting the same error. But it reads the images like this:
small w = 0.000000, h = 0.011283
small w = 0.000000, h = 0.021157
small w = 0.000000, h = 0.004231
small w = 0.000000, h = 0.033632
small w = 0.061381, h = 0.000000
small w = 0.057971, h = 0.000000
small w = 0.056266, h = 0.000000
[... roughly 100 more "small w" lines like these, each with w = 0 or h = 0 ...]
try to allocate workspace = 16777216 * sizeof(float), CUDA allocate done!
That issue is still there, but when I trained with your repo I got results in only 400 epochs. Thank you very much for this repo and all the help. Could you please help me with training on more data, as this model is not very accurate?
Thank you for all the help. I found the problem: there were 3 photos with bounding boxes of 0 0 0 0, which was causing the problem.
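For anyone hitting the same crash, a check that would catch such degenerate boxes could look like this (a hypothetical script, not part of darknet; the function name is my own):

```python
# Scan a list of YOLO label files for all-zero "0 0 0 0" boxes, the kind
# that caused the segmentation fault in this thread.
def find_zero_boxes(label_paths):
    """Return (path, line_number) pairs for every degenerate box found."""
    bad = []
    for path in label_paths:
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                parts = line.split()
                # class id plus four all-zero coordinates
                if len(parts) == 5 and all(float(v) == 0 for v in parts[1:]):
                    bad.append((path, lineno))
    return bad
```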