Hello,
I am using the original repo on my Linux system. I am trying to train a tiny YOLO model, but it gives the following error after some steps. The step number is not consistent: sometimes it throws the error at the 2nd step, sometimes at the 40th. Can you please help me with it?
41: 107.253304, 415.541656 avg, 0.001000 rate, 4.394401 seconds, 1312 images
Loaded: 0.000124 seconds
Region Avg IOU: 0.018014, Class: 1.000000, Obj: 0.002887, No Obj: 0.003492, Avg Recall: 0.000000, count: 18
Region Avg IOU: 0.401208, Class: 1.000000, Obj: 0.007435, No Obj: 0.007720, Avg Recall: 0.500000, count: 4
Region Avg IOU: 0.107763, Class: 1.000000, Obj: 0.003388, No Obj: 0.004642, Avg Recall: 0.142857, count: 7
Segmentation fault (core dumped)
I am using the following command:
./darknet detector train data/obj.data cfg/tiny-yolo-voc-obj.cfg tiny-yolo-voc.conv.13
Hi, maybe something is wrong with your dataset.
Hello @AlexeyAB,
Thank you for your quick response.
Do you use OpenCV, CUDA and cuDNN?
Yes, I do. I am using it on an NVIDIA Jetson TX2 board.
What OpenCV version do you use?
3.4.0
Do you get this error if you use this repo for training on Linux?
I am running on the Jetson TX2 (Linux) only.
It starts training with these steps:
1: 206.234924, 206.234924 avg, 0.001000 rate, 0.910286 seconds, 12 images
Loaded: 0.000072 seconds
Region Avg IOU: 0.002696, Class: 1.000000, Obj: 0.091525, No Obj: 0.225260, Avg Recall: 0.000000, count: 17
Region Avg IOU: 0.002618, Class: 1.000000, Obj: 0.264026, No Obj: 0.211167, Avg Recall: 0.000000, count: 33
Region Avg IOU: 0.000733, Class: 1.000000, Obj: 0.263204, No Obj: 0.214708, Avg Recall: 0.000000, count: 32
Region Avg IOU: 0.000000, Class: 1.000000, Obj: 0.147834, No Obj: 0.221345, Avg Recall: 0.000000, count: 11
but in the last steps before it fails, it shows data like this:
10: nan, nan avg, 0.001000 rate, 1.093313 seconds, 120 images
Resizing
544
Loaded: 0.000286 seconds
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: nan, Avg Recall: 0.000000, count: 30
Region Avg IOU: 0.000000, Class: nan, Obj: nan, No Obj: nan, Avg Recall: 0.000000, count: 20
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: nan, Avg Recall: 0.000000, count: 20
Region Avg IOU: 0.000000, Class: nan, Obj: 0.000000, No Obj: nan, Avg Recall: 0.000000, count: 23
Try to train with OPENCV=0 in the Makefile and with random=0 in your cfg file.
Try to use this repo on Linux.
I tried setting both of these, but the error is the same.
This is strange. Try to use CUDNN=0, or use this repo. Sometimes people face a problem when using cuDNN on the TX2: https://github.com/AlexeyAB/darknet/issues/436
If the error still occurs, then the problem might be in your dataset.
It's not working either. Can you tell me how I can change batch and subdivisions? Is there any relation between these two?
In terms of the dataset, I have done something like this:
obj.data
classes=0
train = data/train.txt
valid = data/test.txt
names = data/obj.names
backup = backup/
The train.txt file contains:
data/obj/img1.jpg
data/obj/img2.jpg
data/obj/img3.jpg
.
.
data/obj/img500.jpg
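A list like this can be generated with a short script (a sketch; the helper name and glob pattern are my own, not from the thread):

```python
# Hypothetical helper for building train.txt: list every image under
# data/obj/ with paths relative to the darknet working directory.
import glob


def write_train_list(pattern="data/obj/*.jpg", out_path="data/train.txt"):
    """Write one image path per line and return how many were found."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w") as f:
        f.write("\n".join(paths) + "\n")
    return len(paths)
```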
cfg file:
[convolutional]
filters=30
[region]
classes=1
img1.txt
0 0.66 0.5 0.3 0.2
obj.names
face
Everything looks correct, except for
obj.data
classes=0
It should be classes=1.
Did you get your img1.txt using the Yolo_mark tool? https://github.com/AlexeyAB/Yolo_mark
You can try to change batch and subdivisions in your cfg file: https://github.com/AlexeyAB/darknet/blob/880cf187d87c904f5fe574802ecff99118643f2d/cfg/yolo-voc.2.0.cfg#L2-L3
Memory consumption is roughly proportional to batch/subdivisions.
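For context, those are the first two parameters of the [net] section; the yolo-voc.2.0.cfg defaults linked above look like this:

```
[net]
batch=64
subdivisions=8
```

Each iteration processes batch images, fed to the GPU in subdivisions chunks of batch/subdivisions images each, so raising subdivisions (or lowering batch) reduces GPU memory use.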
I have not used the Yolo_mark tool. I already had all of my coordinates, so I just converted them to text files.
You should check that each of your imgX.txt files contains correct values: 0 < x, y, w, h < 1, and also x + w/2 < 1 and y + h/2 < 1.
Also check that no imgX.txt file contains extra empty lines.
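These checks can be applied with a short script (a hypothetical helper, not part of darknet; the function name is my own):

```python
# Hypothetical validation script: flags label lines that violate the
# ranges described above, plus empty lines and malformed rows.
def check_label_file(path):
    """Return a list of human-readable problems found in one imgX.txt file."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                problems.append(f"{path}:{lineno}: empty line")
                continue
            parts = line.split()
            if len(parts) != 5:
                problems.append(f"{path}:{lineno}: expected 5 fields, got {len(parts)}")
                continue
            x, y, w, h = map(float, parts[1:])
            if not (0 < x < 1 and 0 < y < 1 and 0 < w < 1 and 0 < h < 1):
                problems.append(f"{path}:{lineno}: x/y/w/h must be in (0, 1)")
            if x + w / 2 > 1 or y + h / 2 > 1:
                problems.append(f"{path}:{lineno}: box extends past the image edge")
    return problems
```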
Just to make sure
Are X and Y here the top-left coordinates relative to width and height, OR the rectangle's center point relative to width and height?
I mean is it
"top left x"/width , "top left y"/height
OR
"rectangle center x"/width , "rectangle center y"/height?
and W and H are rectangle size relative to width and height.
"rectangle center x"/width , "rectangle center y"/height?
Yes, x, y is the center of the object relative to the image size. I fixed my answer.
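The conversion described above can be sketched like this (function name and example numbers are my own, chosen so the output matches the img1.txt line shown earlier in the thread):

```python
# Illustrative helper: convert a box given as a top-left pixel corner plus
# pixel size into the normalized center-based format YOLO labels expect.
def to_yolo(left, top, box_w, box_h, img_w, img_h):
    x_center = (left + box_w / 2) / img_w
    y_center = (top + box_h / 2) / img_h
    return x_center, y_center, box_w / img_w, box_h / img_h


# A 300x200 box whose top-left corner is at (510, 400) in a 1000x1000 image:
print(to_yolo(510, 400, 300, 200, 1000, 1000))  # → (0.66, 0.5, 0.3, 0.2)
```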
Hello,
I managed to run training (on a MacBook with a 1080 Ti, after adding a 2nd class and reducing my dataset). But now it only saves weights up to iteration 900. Can you please let me know how to change that?
Thank you
@smitshilu Change this line: https://github.com/pjreddie/darknet/blob/80d9bec20f0a44ab07616215c6eadb2d633492fe/examples/detector.c#L136
to this: if (i % 100 == 0) {
Or use this fork: https://github.com/AlexeyAB/darknet
I think the reason behind the segmentation fault or bus error is a large dataset. I tried using almost 3000, 2000, 1000, and 500 images, but I got the error each time. When I tried with 50 images it worked and ran for around 6000 steps. I am now using a MacBook with a 12 GB NVIDIA GTX 1080 Ti and facing the same problem.
Is there any solution to this problem? What should I do if I want to train on 3000 images?
I train with 10,000 and 40,000 images, and it works well.
Try to run ./darknet detector recall data/obj.data yolo-obj.cfg backup/yolo-obj_7000.weights for your data, cfg, and weights files, with a train.txt that contains all 3000 images. Does it run to the end successfully?
What CUDA and cuDNN versions do you use?
Try to use my repository: https://github.com/AlexeyAB/darknet
Is there a segmentation error during training?
Yes on Linux I am getting Segmentation Fault
This is very strange. Probably something is wrong with CUDA, cuDNN, or the NVIDIA GPU driver.
I used your repo and am getting the same error. But it reads the images like this:
small w = 0.000000, h = 0.011283
small w = 0.000000, h = 0.021157
small w = 0.000000, h = 0.004231
small w = 0.000000, h = 0.033632
small w = 0.061381, h = 0.000000
small w = 0.057971, h = 0.000000
small w = 0.056266, h = 0.000000
[... roughly 100 more "small w" lines like these, each with w = 0 or h = 0 ...]
try to allocate workspace = 16777216 * sizeof(float), CUDA allocate done!
That issue is still there, but when I trained with your repo I got results in only 400 epochs. Thank you very much for this repo and all the help. Could you please help me with training on more data, as this model is not very accurate?
Thank you for all the help. I found the problem: there were 3 photos with bounding boxes of 0 0 0 0, which was causing the problem.
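For anyone hitting the same crash, a check that would catch such degenerate boxes could look like this (a hypothetical script, not part of darknet; the function name is my own):

```python
# Scan a list of YOLO label files for all-zero "0 0 0 0" boxes, the kind
# that caused the segmentation fault in this thread.
def find_zero_boxes(label_paths):
    """Return (path, line_number) pairs for every degenerate box found."""
    bad = []
    for path in label_paths:
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                parts = line.split()
                # class id plus four all-zero coordinates
                if len(parts) == 5 and all(float(v) == 0 for v in parts[1:]):
                    bad.append((path, lineno))
    return bad
```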