Darknet: Fluctuating mAP for custom dataset!

Created on 18 Apr 2019 · 25 comments · Source: AlexeyAB/darknet

[chart: training loss and fluctuating mAP]

Hi all,

I am trying to use yolov3-tiny_3l.cfg for my custom dataset with 2 classes.
I changed classes and filters in the .cfg file, updated the obj.data file, generated anchors for my custom dataset, and put them into the .cfg file.

no_train_images = 5400

no_test_images = 1200

I can see the loss going down, but the mAP fluctuates very much (see the chart above).
How can I solve this problem? Any suggestions?
Thanks

Most helpful comment

Hi @aditbhrgv - I found this explanation helpful for determining custom anchors.

All 25 comments

@aditbhrgv Hi,

  • How many classes do you have?
  • Did you split your dataset into Training and Validation randomly, without intersections?
  • Can you attach your cfg-file?

@AlexeyAB Thanks for your reply!

  1. 2 classes
  2. The training and validation datasets are separate; there are no intersections between them.
  3. Attached is the .cfg file:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0005
burn_in=2000
max_batches = 35000
policy=steps
steps=360000,380000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

#

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 6,7,8
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
classes=2
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 8

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 3,4,5
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42

classes=2
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -3

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 6

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 0,1,2
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
classes=2
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

The training and validation datasets are separate; there are no intersections between them.

Did you divide it uniformly at random, or not?

Did you check your dataset by using Yolo_mark?

Can you show cloud.png image after this command?
./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608 -show

No.
I checked using Yolo_mark; it shows correct bounding boxes on the images.
Attached is cloud.png:
[image: cloud.png]

@aditbhrgv

Try to train from the beginning by using these mask and filters values:

For the 1st [yolo] layer (and the [convolutional] layer before it):

filters=7

[yolo]
mask = 8
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
.....

For the 2nd [yolo] layer:

filters=14

[yolo]
mask = 6,7
anchors = 8, 10, 11, 12, 14, 11, 18, 14, 25, 15, 36, 18, 49, 23, 71, 25, 93, 42
...

For the 3rd [yolo] layer:

filters=42

[yolo]
mask = 0,1,2,3,4,5
anchors = 8, 10,   11, 12,   14, 11,   18, 14,   25, 15,   36, 18,   49, 23,   71, 25,   93, 42
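
A note on where these filters values come from: they appear to follow the darknet rule filters = (classes + 5) * <number of mask> for the [convolutional] layer before each [yolo] layer (see the README excerpt quoted further below). With classes=2, a quick check of the arithmetic:

# filters = (classes + 5) * number_of_masks, with classes=2:
echo $(( (2 + 5) * 1 ))   # mask = 8            -> filters=7
echo $(( (2 + 5) * 2 ))   # mask = 6,7          -> filters=14
echo $(( (2 + 5) * 6 ))   # mask = 0,1,2,3,4,5  -> filters=42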

Thanks! I'll try that.
Can you please tell me the reasoning behind doing this?
It would be really helpful!
Thanks

After training, show your Loss & mAP chart.

https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

recalculate anchors for your dataset for width and height from cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 then set the same 9 anchors in each of 3 [yolo]-layers in your cfg-file. But you should change indexes of anchors masks= for each [yolo]-layer, so that 1st-[yolo]-layer has anchors larger than 60x60, 2nd larger than 30x30, 3rd remaining. Also you should change the filters=(classes + 5)*<number of mask> before each [yolo]-layer. If many of the calculated anchors do not fit under the appropriate layers - then just try using all the default anchors.
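
Note that the quoted README example uses 416x416; for the cfg above (width=608, height=608), the matching command is the one already used earlier in this thread:

./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 608 -height 608 -show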

@AlexeyAB Can you please let me know the possible reasons for this fluctuating mAP?
I have currently set random=0 in the .cfg file and started training. This led to less fluctuating behavior (than in the previously attached graph).
I have started training with the changed anchors you described before and will share the results once it's done.
Also, could you please give me a bit more interpretation of cloud.png?
And I tried to train the same dataset on a PyTorch implementation, and my mAP converged after 23 epochs. My initial LR was 0.01 and was decreased by a factor of 10 after 20, 50, and 100 epochs.
Can I set the same LR schedule in the .cfg file?
Thanks

Can you please let me know the possible reasons for this fluctuating mAP?

There can be many reasons.

And I tried to train the same dataset on a PyTorch implementation, and my mAP converged after 23 epochs. My initial LR was 0.01 and was decreased by a factor of 10 after 20, 50, and 100 epochs.
Can I set the same LR schedule in the .cfg file?

If you have 5400 training images and set batch=64, then 1 epoch = 5400/64 ≈ 84 iterations.
So
20 epochs = 1680 iterations
50 epochs = 4200 iterations
100 epochs = 8400 iterations

Set

 steps=1680, 4200, 8400 
 scales=0.1, 0.1, 0.1

instead of
https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/cfg/yolov3.cfg#L22-L23

and learning_rate=0.01 instead of https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/cfg/yolov3.cfg#L18
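
As a quick check of the epoch-to-iteration arithmetic (integer division, matching the numbers above):

echo $(( 5400 / 64 ))   # ~84 iterations per epoch
echo $(( 20 * 84 ))     # 20 epochs  -> 1680 iterations
echo $(( 50 * 84 ))     # 50 epochs  -> 4200 iterations
echo $(( 100 * 84 ))    # 100 epochs -> 8400 iterations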

Hi @AlexeyAB,
I got the result below after following the above LR schedule:

learning_rate=0.01
steps=1680, 4200, 8400
scales=0.1, 0.1, 0.1

[chart: loss and mAP with the new LR schedule]

But I trained this without the random option in the .cfg file. I can train with the random option again and obtain new results.
Looking at the mAP graph, I think I reduced the LR too quickly, as it finally converged to 75% mAP, which could have been better, around 82% (as seen from the graph).
I will try setting "scales=0.05, 0.05, 0.05" in the .cfg file and see the results. Do you have any other suggestions?

Also, can I generate a video of the predictions on the validation set using my trained model? I can use the "./build/darknet detector test" option to see the visualizations, but it gives one image at a time. I want to feed in the whole validation set and save the output.

Also, can I generate a video of the predictions on the validation set using my trained model? I can use the "./build/darknet detector test" option to see the visualizations, but it gives one image at a time. I want to feed in the whole validation set and save the output.

Are your validation images frames from a video?
If so, just run detection on that video.

Also, you can download http://mplayerwin.sourceforge.net/downloads.html and run this command in the folder containing only the validation images:
mencoder mf://*.jpg -mf w=1280:h=720:fps=15:type=jpg -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=4000:mbd=2:trell -oac copy -o conveyor_valid.avi
so the video file conveyor_valid.avi will be generated.

Then run:
./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi


Also, you can try:

./darknet detector test data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights < data/conveyor_valid.txt

Are your validation images frames from a video?

No, they are .jpg files located in a folder.

Also, you can download http://mplayerwin.sourceforge.net/downloads.html and run this command in the folder containing only the validation images

Is there a similar tool for Ubuntu?
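
For what it's worth, mencoder is part of the MPlayer suite and is packaged for most Linux distributions, so on Ubuntu it can typically be installed with:

sudo apt install mencoder

after which the same mencoder command should work unchanged.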

./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi

@AlexeyAB I used this command to draw the bounding boxes on the .avi, but I see a bit of an offset on the detected objects. What could be the problem?

Maybe wrong annotations; check your dataset by using https://github.com/AlexeyAB/Yolo_mark

[image: annotations]

I tested on a single image, and the bounding box is perfectly overlaid on the image using the "./darknet detector test" command.
It seems to be a problem only when I give an input .avi video. I see the offsets when the objects are relatively close, and not when they are some distance away.
Maybe I can try ./darknet detector test data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights < data/conveyor_valid.txt
instead of

./darknet detector demo data/conveyor.data yolov3-tiny_occlusion_track.cfg backup/yolov3-tiny_occlusion_track_last.weights conveyor_valid.avi -out_filename out_conveyor_valid.avi

@AlexeyAB How can I reduce the fps of the generated output video? It's too fast at the moment.
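
One option, based on the commands above: since conveyor_valid.avi was generated by mencoder with fps=15, regenerating it with a lower value (e.g. fps=5) should slow down the resulting output video as well, assuming the demo output keeps the input frame rate:

mencoder mf://*.jpg -mf w=1280:h=720:fps=5:type=jpg -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=4000:mbd=2:trell -oac copy -o conveyor_valid.avi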

[chart: loss and mAP, converging around 81%]

@AlexeyAB Now I get a new mAP which converged around 81%: Precision = 84%, Recall = 71%, F1 = 77%. However, I got these results without using the "random" flag. I think the results could be better with the multi-scale option.

Yes, try to train with random=1

[chart: loss and mAP with random=1]
@AlexeyAB I tried with the random=1 option, but mAP, precision, recall, and F1 decreased instead of increasing.
Could you please suggest something?
Thanks

[image: cloud.png for the new dataset]

@AlexeyAB I have a new dataset, for which cloud.png is shown above. How can I set the anchor masks according to this distribution? Is there any link where I can better understand how to interpret cloud.png?

Hi @aditbhrgv - I found this explanation helpful for determining custom anchors.

Hi @DarylWM,
Thank you!
Can you please explain the significance of cloud.png?
I can see the anchors and the training data points distributed along them. Is my understanding correct?
If yes, how will the training samples lying outside these anchors be detected?
Thanks again!


