Darknet: Fluctuating mAP for custom dataset!

Created on 18 Apr 2019 · 25 comments · Source: AlexeyAB/darknet

[chart: training loss and fluctuating mAP]

Hi all,

I am trying to use yolov3-tiny_3l.cfg for my custom dataset with 2 classes.
I changed classes and filters in the .cfg file, updated the obj.data file, generated anchors for my custom dataset, and put them into the .cfg file.

no_train_images = 5400

no_test_images = 1200

I can see the loss going down, but the mAP fluctuates very much (see the chart above).
How can I solve this problem? Any suggestions?
Thanks

Most helpful comment

Hi @aditbhrgv - I found this explanation helpful for determining custom anchors.

All 25 comments

@aditbhrgv Hi,

  • How many classes do you have?
  • Did you split your dataset into Training and Validation randomly, without intersections?
  • Can you attach your cfg-file?

@AlexeyAB Thanks for your reply!

  1. 2 classes
  2. The training and validation datasets are separate; there are no intersections between them.
  3. Attached is the .cfg file:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0005
burn_in=2000
max_batches = 35000
policy=steps
steps=360000,380000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

#

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 6,7,8
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
classes=2
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 8

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 3,4,5
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42

classes=2
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -3

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 6

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 0,1,2
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
classes=2
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

The training and validation datasets are separate; there are no intersections between them.

Did you divide it uniformly at random, or not?

Did you check your dataset by using Yolo_mark?

Can you show cloud.png image after this command?
./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608 -show

No.
I checked using Yolo_mark; it shows correct bounding boxes on the images.
Attached is cloud.png:
[image: cloud.png]

@aditbhrgv

Try to train from the beginning by using these mask and filters values:

For the 1st [yolo] layer (and the [convolutional] layer before it):

filters=7

[yolo]
mask = 8
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
.....

For the 2nd [yolo] layer:

filters=14

[yolo]
mask = 6,7
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
...

For the 3rd [yolo] layer:

filters=42

[yolo]
mask = 0,1,2,3,4,5
anchors = 8, 10,   11, 12,   14, 11,   18, 14,   25, 15,   36, 18,   49, 23,   71, 25,   93, 42
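
A note on where these filters values come from: they appear to follow the darknet rule filters = (classes + 5) * <number of mask> for the [convolutional] layer before each [yolo] layer (see the README excerpt quoted further below). With classes=2, a quick check of the arithmetic:

# filters = (classes + 5) * number_of_masks, with classes=2:
echo $(( (2 + 5) * 1 ))   # mask = 8            -> filters=7
echo $(( (2 + 5) * 2 ))   # mask = 6,7          -> filters=14
echo $(( (2 + 5) * 6 ))   # mask = 0,1,2,3,4,5  -> filters=42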

Thanks! I'll try that.
Can you please tell me the reasoning behind doing this?
It would be really helpful!
Thanks

After training, show your Loss & mAP chart.

https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

recalculate anchors for your dataset for width and height from cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 then set the same 9 anchors in each of 3 [yolo]-layers in your cfg-file. But you should change indexes of anchors masks= for each [yolo]-layer, so that 1st-[yolo]-layer has anchors larger than 60x60, 2nd larger than 30x30, 3rd remaining. Also you should change the filters=(classes + 5)*<number of mask> before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers - then just try using all the default anchors.
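
Note that the quoted README example uses 416x416; for the cfg above (width=608, height=608), the matching command is the one already used earlier in this thread:

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608 -show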

@AlexeyAB Can you please let me know the possible reasons for this fluctuating mAP?
I have currently set random=0 in the .cfg file and started training. This led to less fluctuating behavior (than in the previously attached graph).
I have started training with the changed anchors you described before and will share the results once it's done.
Also, could you please give me a bit more interpretation of cloud.png?
And I tried to train the same dataset on a PyTorch implementation, and my mAP converged after 23 epochs. My initial LR was 0.01 and was decreased by a factor of 10 after 20, 50, and 100 epochs.
Can I set the same LR schedule in the .cfg file?
Thanks

Can you please let me know the possible reasons for this fluctuating mAP?

There can be many reasons.

And I tried to train the same dataset on a PyTorch implementation, and my mAP converged after 23 epochs. My initial LR was 0.01 and was decreased by a factor of 10 after 20, 50, and 100 epochs.
Can I set the same LR schedule in the .cfg file?

If you have 5400 training images and set batch=64, then 1 epoch = 5400/64 ≈ 84 iterations.
So
20 epochs = 1680 iterations
50 epochs = 4200 iterations
100 epochs = 8400 iterations

Set

 steps=1680, 4200, 8400 
 scales=0.1, 0.1, 0.1

instead of
https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/cfg/yolov3.cfg#L22-L23

and learning_rate=0.01 instead of https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/cfg/yolov3.cfg#L18
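
As a quick check of the epoch-to-iteration arithmetic (integer division, matching the numbers above):

echo $(( 5400 / 64 ))   # ~84 iterations per epoch
echo $(( 20 * 84 ))     # 20 epochs  -> 1680 iterations
echo $(( 50 * 84 ))     # 50 epochs  -> 4200 iterations
echo $(( 100 * 84 ))    # 100 epochs -> 8400 iterations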

Hi @AlexeyAB,
I got the result below after following the above LR schedule:

learning_rate=0.01
steps=1680, 4200, 8400
scales=0.1, 0.1, 0.1

[chart: loss and mAP with the new LR schedule]

But I trained this without the random option in the .cfg file. I can train with the random option again and obtain new results.
Looking at the mAP graph, I think I reduced the LR too quickly, as it finally converged to 75% mAP, which could have been better, around 82% (as seen from the graph).
I will try setting "scales=0.05, 0.05, 0.05" in the .cfg file and see the results. Do you have any other suggestions?

Also, can I generate a video of the predictions on the validation set using my trained model? I can use the "./build/darknet detector test" option to see the visualizations, but it gives one image at a time. I want to feed in the whole validation set and save the output.

Also, can I generate a video of the predictions on the validation set using my trained model? I can use the "./build/darknet detector test" option to see the visualizations, but it gives one image at a time. I want to feed in the whole validation set and save the output.

Are your validation images frames from a video?
If so, just run detection on that video.

Also, you can download http://mplayerwin.sourceforge.net/downloads.html and run this command in the folder containing only the validation images:
mencoder mf://*.jpg -mf w=1280:h=720:fps=15:type=jpg -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=4000:mbd=2:trell -oac copy -o conveyor_valid.avi
so the video file conveyor_valid.avi will be generated.

Then run:
./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi


Also, you can try:

./darknet detector test data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights < data/conveyor_valid.txt

Are your validation images frames from a video?

No, they are .jpg files located in a folder.

Also, you can download http://mplayerwin.sourceforge.net/downloads.html and run this command in the folder containing only the validation images

Is there a similar tool for Ubuntu?
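
For what it's worth, mencoder is part of the MPlayer suite and is packaged for most Linux distributions, so on Ubuntu it can typically be installed with:

sudo apt install mencoder

after which the same mencoder command should work unchanged.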

./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi

@AlexeyAB I used this command to draw the bounding boxes on the .avi, but I see a bit of an offset on the detected objects. What could be the problem?

Maybe wrong annotations; check your dataset by using https://github.com/AlexeyAB/Yolo_mark

[image: annotations]

I tested on a single image, and the bounding box is perfectly overlaid on the image using the "./darknet detector test" command.
It seems to be a problem only when I give an input .avi video. I see the offsets when the objects are relatively close, and not when they are some distance away.
Maybe I can try ./darknet detector test data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights < data/conveyor_valid.txt
instead of

./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi

@AlexeyAB How can I reduce the fps of the generated output video? It's too fast at the moment.
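
One option, based on the commands above: since conveyor_valid.avi was generated by mencoder with fps=15, regenerating it with a lower value (e.g. fps=5) should slow down the resulting output video as well, assuming the demo output keeps the input frame rate:

mencoder mf://*.jpg -mf w=1280:h=720:fps=5:type=jpg -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=4000:mbd=2:trell -oac copy -o conveyor_valid.avi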

[chart: loss and mAP, converging around 81%]

@AlexeyAB Now I get a new mAP which converged around 81%: Precision = 84%, Recall = 71%, F1 = 77%. However, I got these results without using the "random" flag. I think the results could be better with the multi-scale option.

Yes, try to train with random=1

[chart: loss and mAP with random=1]
@AlexeyAB I tried with the random=1 option, but mAP, precision, recall, and F1 decreased instead of increasing.
Could you please suggest something?
Thanks

[image: cloud.png for the new dataset]

@AlexeyAB I have a new dataset, for which cloud.png is shown above. How can I set the anchor masks according to this distribution? Is there any link where I can better understand how to interpret cloud.png?

Hi @aditbhrgv - I found this explanation helpful for determining custom anchors.

Hi @DarylWM,
Thank you!
Can you please explain the significance of cloud.png?
I can see the anchors and the training data points distributed along them. Is my understanding correct?
If yes, how will the training samples lying outside these anchors be detected?
Thanks again!


