yolov3 🚀 - Train YOLOv3-SPP from scratch to 62.6 mAP@0.5

@Aurora33 the mAPs reported in https://github.com/ultralytics/yolov3#map are using the original darknet weights files. We are still trying to determine the correct loss function and optimal hyperparameters for training in pytorch. There are a few issues open on this, such as https://github.com/ultralytics/yolov3/issues/205 and https://github.com/ultralytics/yolov3/issues/12. A couple things of note:

The plotted mAPs are at 0.1 conf_thres (for speed during training). If you run test.py directly it will run mAP at 0.001 conf_thres, which will produce a higher mAP.
Your LR scheduler may or may not have applied here, depending on how you set your number of epochs argument in the argparser --epochs.
Darknet training uses multi_scale by default, with scaling from 50% to 150% of your default size.
Darknet training also involves several steps I believe, including training on other datasets and altering layers. You can read about this more in the YOLOv2 and YOLOv3 papers: https://pjreddie.com/publications/
This implementation lacks the 0.7 ignore threshold in the original darknet, which is on our TODO list but not yet implemented.

glenn-jocher on 1 Jun 2019

I also get 38% mAP after 170 epochs on the COCO dataset, and the mAP doesn't really change much after that.

majuncai on 12 Jun 2019

@majuncai can you post your results? Did you use --multi-scale? We have made quite a few updates recently, in particular to multi-scale, which is required to achieve the best results, as well as training to the last epoch specified in order for the LR scheduler to take effect.

glenn-jocher on 12 Jun 2019

@glenn-jocher
QQ image 20190613142623
I didn't change any parameters, I didn't use multi-scale.

majuncai on 13 Jun 2019

@glenn-jocher
WeChat image 20190613153553

majuncai on 13 Jun 2019

@majuncai I see. The main thing I noticed is that your LR scheduler has not taken effect, since it only kicks in at 80% and 90% of the epoch count. The mAP typically increases significantly after this. Multi-scale training also has a large effect. Lastly, we reinterpreted the darknet training settings, and we now believe you only need 68 epochs for full training.
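
As a minimal sketch of the step schedule described above: the function name `lr_at_epoch`, the base rate of 0.001 and the decay factor of 0.1 are assumptions chosen for illustration, not values taken from the repo. It only shows why a run stopped before 80% of `--epochs` never sees a reduced learning rate.

```python
def lr_at_epoch(epoch, epochs, lr0=0.001, gamma=0.1):
    """Step schedule: multiply the LR by gamma at 80% and again at 90% of training.

    lr0 and gamma are illustrative assumptions, not the repo's settings.
    """
    milestones = [round(epochs * 0.8), round(epochs * 0.9)]
    drops = sum(epoch >= m for m in milestones)  # number of milestones already passed
    return lr0 * gamma ** drops


if __name__ == "__main__":
    epochs = 68
    for e in (0, 53, 54, 60, 61, 67):
        print(e, lr_at_epoch(e, epochs))
```

With `epochs=68` the two drops land at epochs 54 and 61, so a run has to reach those epochs before the schedule has any effect.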

All of these changes have been applied in the last few days. I recommend you git pull and train from scratch to 68 epochs. Then you can plot your results and upload them again here using the following command:
from utils.utils import *; plot_results()

glenn-jocher on 13 Jun 2019

Hi Glenn, thanks for your great work. For classification loss, the YOLOv3 paper says they don't use softmax, but nn.CrossEntropyLoss() actually contains a softmax. The paper also uses an ignore_threshold. Could these affect the overall mAP?

XinjieInformatik on 13 Jun 2019

@XinjieInformatik yes these could affect the mAP greatly, in particular the ignore threshold. For some reason we've gotten reduced mAP with BCELoss for the classification than with CELoss. I don't know why. This may be a PyTorch phenomenon, as simpler tasks like MNIST also train better with CELoss than BCELoss in PyTorch.

glenn-jocher on 13 Jun 2019

@glenn-jocher Hi, I am curious about how the provided yolov3.pt was obtained. Was it converted from yolov3.weights or trained with the code you shared?

fereenwong on 16 Jun 2019

@fereenwong yolov3.pt is exported from yolov3.weights.

glenn-jocher on 18 Jun 2019

@glenn-jocher Just finished an experiment training on the full COCO dataset from scratch, using the default hyperparameter values. My model was YOLOv3-320, and I trained to 200 epochs with multi-scale training on and rectangular training off. After running test.py, I managed to get 47.4 mAP, which unfortunately is not the 51.5 corresponding to pjreddie's experiment.

I can try training again but this time to 273 epochs, although it seems that each stage before the learning rate decreased per the scheduler had already plateaued, so I don't think it would benefit much. Is there a comprehensive TODO list that you think will improve mAP? I notice you mentioned the 0.7 ignore threshold. What do you mean by this? I searched through the darknet repo and didn't get any relevant 0.7 hits.

ktian08 on 25 Jul 2019

@ktian08 ah, thanks for the update! This is actually a bit better than I was expecting at this point, since we are still tuning hyperparameters. The last time we trained fully we also trained 320, to 100 epochs, and ended up at 0.46 mAP (plot below; no multi-scale, no rect training), using older hyps from about a month ago. Remember the plots are at conf_thres 0.1, while test.py runs natively at conf_thres 0.001, which adds a few percent mAP compared to the training plots. Can you post a plot of your training results using from utils import utils; utils.plot_results() and a copy of your hyp dictionary from train.py?

0.70 is a threshold darknet uses to not punish anchors which aren't the best but still have an iou > 0.7. We use a slightly different method in this repo, which is the hyp['iou_t'] parameter.
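
A rough sketch of the darknet-style rule described above, for a single target. The function name and the reduction to one ground-truth box are assumptions made for illustration; this is not code from darknet or from this repo.

```python
def objectness_targets(ious, ignore_thresh=0.7):
    """Label each prediction for the objectness loss, given its IoU with one target.

    ious must be non-empty. Returns a list of 1 (positive), 0 (negative)
    or None (ignored, contributes no objectness loss).
    """
    best = max(range(len(ious)), key=lambda i: ious[i])
    labels = []
    for i, iou in enumerate(ious):
        if i == best:
            labels.append(1)      # best match: trained toward objectness 1
        elif iou > ignore_thresh:
            labels.append(None)   # good but not best: not punished
        else:
            labels.append(0)      # poor overlap: trained toward objectness 0
    return labels


if __name__ == "__main__":
    print(objectness_targets([0.9, 0.75, 0.3]))
```

The second prediction overlaps the target well but is not the best match, so it is left out of the loss rather than pushed toward zero.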

Yes I also agree that training seems to be plateauing too quickly. This could be because our hyp tuning is based on epoch 0 results only, so it may be favoring aspects that aggressively increase mAP early on, which may not be best for training later epochs. Our hyp evolution code is:

python3 train.py --data data/coco.data --img-size 320 --epochs 1 --batch-size 64 --accumulate 1 --evolve

results_320

glenn-jocher on 25 Jul 2019

hey
Screen Shot 2019-07-24 at 4 42 14 PM

Ah, I see. My repo currently does have the reject boolean set to True, so it is thresholding by iou_t, just by a different value. Are you saying darknet uses 0.7 for this value?

I have not begun evolving hyperparameters yet, as the ones I've used were the default ones for yolov3-spp I believe. However, I've modified my train script to evolve every opt.epochs because that's how I interpreted the script rather than evolving based on the first epoch. To accomplish this, I've also changed train to output the best_result (based on 0.5 * mAP + 0.5 * f1) rather than the result from the last epoch so print_mutations has the correct value. I'll try evolving the hyperparameters based on a smaller number of epochs > 1 and let you know if I get better results.

ktian08 on 25 Jul 2019

@ktian08 ah excellent. Hmm, your results are very different than the ones I posted. The more recent results should see almost 0.15 mAP starting at epoch 0, whereas yours start around 0.01 at epoch 0 and increase slowly from there.

Clearly my plots show faster short-term results, but I don't know if they are plateauing lower or higher than yours; it's hard to tell.

No, the 0.7 value corresponds to a different type of iou thresholding in darknet. In this repo if iou < hyp['iou_t'] then no match is made. This prevents large anchors from attempting to match with small targets and vice versa. This parameter seems to evolve to 0.20-0.35 typically. In your version it's at 0.3689, whereas now we have 0.194, though the latest unpublished hyperparameters show a best value of 0.292.
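
The matching rule above can be sketched as follows, assuming the IoU is computed on width and height only (boxes sharing a centre). The function names and the example anchor sizes are illustrative assumptions, not the repo's code.

```python
def wh_iou(wh1, wh2):
    """IoU of two boxes that share the same centre, given (width, height)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union


def match_anchors(target_wh, anchors, iou_t=0.2):
    """Return indices of anchors whose shape IoU with the target is at least iou_t."""
    return [i for i, a in enumerate(anchors) if wh_iou(a, target_wh) >= iou_t]


if __name__ == "__main__":
    anchors = [(10, 13), (116, 90), (373, 326)]  # example sizes in pixels
    print(match_anchors((12, 14), anchors))      # small target
    print(match_anchors((300, 300), anchors))    # large target
```

A small target only matches the small anchor and a large target only matches the large one, which is the separation the threshold is meant to enforce.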

Unfortunately we are resource constrained so we can't evolve as much as we'd like. Ideally you'd probably want to run the evolution off of the results of, say, the first 10 or 20 epochs, but we are running it off of epoch 0 results, which allows us to evolve many more generations, even though it's unclear if epoch 0 success correlates 100% with epoch 273 success.

Also beware that we have added the augmentation parameters to the hyp dictionary, so you may want to git pull to get the latest. You can also evolve your own hyperparameters using the same code I posted before, or if you want you could contribute to our hyp study as well by evolving to a cloud bucket we have.

glenn-jocher on 25 Jul 2019

Right, I think you start at 0.15 mAP because you load in the darknet weights as your default setting. I modified the code so I'm truly training COCO from scratch.

I'll pull the new hyperparameters and try evolving within 10-20 epochs then. Thanks!

ktian08 on 25 Jul 2019

@ktian08 ah yes this makes sense then, we are looking at apples to oranges.

Regarding fitness, I set it as the average of mAP and F1, because I saw that when I set it as only mAP, the evolution would favor high R and low P to reach the highest mAP, so I added the F1 in an attempt to balance it.
https://github.com/ultralytics/yolov3/blob/df4f25e610bc31af3ba458dce4e569bb49174745/train.py#L342-L343
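
A small sketch of that fitness definition, to show how the F1 term penalizes a high-recall, low-precision solution. The function names and the example numbers are assumptions for illustration; the linked lines contain the repo's actual code.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def fitness(p, r, mAP):
    """Average of mAP and F1, as described in the thread."""
    return 0.5 * mAP + 0.5 * f1_score(p, r)


if __name__ == "__main__":
    # Two hypothetical candidates with the same mAP but different P/R balance.
    print(fitness(p=0.2, r=0.8, mAP=0.40))
    print(fitness(p=0.5, r=0.5, mAP=0.40))
```

With mAP held equal, the balanced candidate scores higher because its F1 is higher.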

If you are doing lots of training BTW, you should install Nvidia Apex for mixed precision if you haven't already. This repo will automatically use it if it detects it.

glenn-jocher on 25 Jul 2019

@ktian08 I've added a new issue https://github.com/ultralytics/yolov3/issues/392 which illustrates our hyperparameter evolution efforts in greater detail.

As I mentioned, with unlimited resources you would ideally evolve the full training results:

python3 train.py --data data/coco.data --img-size 320 --epochs 273 --batch-size 64 --accumulate 1 --evolve

But since we are resource constrained we evolve epoch 0 results instead, under the assumption that what's good for epoch 0 is good for full training. This may or may not be true, we simply are not sure at this point.

python3 train.py --data data/coco.data --img-size 320 --epochs 1 --batch-size 64 --accumulate 1 --evolve

glenn-jocher on 25 Jul 2019

OK, I'll try installing Apex and evolving to as many epochs as I can. Earlier, I made a mistake calculating the mAP for my experiment, as I didn't pass in the --img-size parameter to test.py and thus my model tested on size 416 images. My newly calculated mAP is 47.4.

ktian08 on 25 Jul 2019

@ktian08 ah I see. I forgot to mention that you should use the --save-json flag with test.py, as the official COCO mAP is usually about 1% higher than what the repo mAP code reports. You could try best.pt also instead of last.pt:

python3 test.py --weights weights/best.pt --img-size 320 --save-json

glenn-jocher on 26 Jul 2019

Yep, already using --save-json and best.pt!

ktian08 on 26 Jul 2019

@ktian08 I updated the code a bit to add a --img-weights option to train.py. When this is set the dataloader selects images randomly weighted by their value, which is defined as the type of objects they have and how well the mAP is evolving on those exact objects. If mAP is low on hair dryers for example, and there are few hair dryers in the dataset, then many more images of hairdryers will be selected than say images of people.
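
One way such a weighted sampler could look is sketched below. The weighting formula (inverse class frequency times the square of one minus the per-class AP) and all function names are assumptions made for illustration, not the repo's implementation.

```python
import random


def class_weights(class_counts, class_maps):
    """Weight each class higher when it is rare and when its AP is low."""
    raw = [(1.0 - ap) ** 2 / max(n, 1) for n, ap in zip(class_counts, class_maps)]
    total = sum(raw)
    if total == 0:  # every class already at AP 1.0: fall back to uniform weights
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def image_weights(image_labels, cls_w):
    """Weight of an image = sum of the class weights of the objects it contains."""
    return [sum(cls_w[c] for c in labels) for labels in image_labels]


def sample_epoch(image_labels, cls_w, seed=0):
    """Draw one epoch of image indices with replacement, biased by image weight."""
    rng = random.Random(seed)
    w = image_weights(image_labels, cls_w)
    return rng.choices(range(len(image_labels)), weights=w, k=len(image_labels))


if __name__ == "__main__":
    # Class 0: common and well learned. Class 1: rare and poorly learned.
    w = class_weights(class_counts=[1000, 10], class_maps=[0.6, 0.1])
    labels = [[0], [1], [0, 1]]  # class ids of the objects in three images
    print(w, image_weights(labels, w), sample_epoch(labels, w))
```

Images containing the rare, poorly detected class receive far more weight, so they are drawn much more often than images of the common class.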

This seems to show better mAP, at least during the first few epochs, both when training from darknet53 as well as when training with no backbone (0.020 to 0.025 mAP first epoch at 416 without backbone). I don't know what effect it will have long term however. I am currently training a 416 model to 273 epochs using all the default settings with the --img-weights flag. I just started this, so I should have results out in about a week, and then I'll share here.

glenn-jocher on 30 Jul 2019

@glenn-jocher Does training seem to improve using --img-weights based on your experiments so far? I am currently retraining on new hyperparameters I got from evolving, but despite the promising mAPs gotten during evolution, I see that the mAP for my new experiment is pretty much the same as my control experiment ~60 epochs in.

ktian08 on 2 Aug 2019

@ktian08 I might be seeing a similar effect. It's possible that the first few epochs are much more sensitive to the hyperparameters, and small changes in them eventually converge to the same result after 50-100 epochs.

I'm not sure what conclusion to draw from this, other than that hyperparameter searches based on quick results (epoch 0, epoch 1 results etc.) may not be as useful as they appear. Oddly enough I also saw almost no change in mAP at the baseline img-size when using --multi-scale, even after 30-40 epochs.

glenn-jocher on 2 Aug 2019

@glenn-jocher Hmm... I trained to 20 epochs while tuning hyperparameters, but even when I compare at the 20th epoch I don't see the 7 mAP increase that I should have seen.

Maybe --img-weights reduces AP for the classes doing well during training, so that the mAP in the end is the same regardless.

I noticed that pjreddie describes multi-scale training much differently in the YOLO9000 paper than what is being implemented here (scaling from /1.5 to * 1.5). He says every 10 batches he chose a new dimension from 320 to 608 as long as it was divisible by 32, allowing for a much larger range for YOLOv3-320, which might help. Does his YOLOv3 repo also implement it like this, or is it your way?

ktian08 on 2 Aug 2019

@ktian08 here is the current comparison using all default settings (416, no multi-scale, etc). I'll update daily (about 40 epochs/day). Training both the full 273.

Orange is baseline python3 train.py
Blue is experiment python3 train.py --img-weights

results

glenn-jocher on 2 Aug 2019

@ktian08 this should implement it as closely as possible to darknet. Every 10 batches (i.e. every 640 images) the img-size is randomly rescaled from 1/1.5 to 1*1.5, rounded to the nearest 32-multiple, so from 288-640 for img-size 416 or from 224-480 for img-size 320.

I've run AlexeyAB/darknet to verify and it does the same.
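
The size ranges quoted above can be reproduced with a few lines. This is a sketch of the arithmetic only; the function names are assumptions and the repo's dataloader may pick sizes differently in detail.

```python
import math
import random


def multi_scale_range(img_size, ratio=1.5, stride=32):
    """Smallest and largest training sizes, rounded to the nearest stride multiple."""
    lo = math.floor(img_size / ratio / stride + 0.5) * stride
    hi = math.floor(img_size * ratio / stride + 0.5) * stride
    return lo, hi


def pick_size(img_size, ratio=1.5, stride=32, rng=random):
    """Pick a random stride-multiple size inside the multi-scale range (inclusive)."""
    lo, hi = multi_scale_range(img_size, ratio, stride)
    return rng.randrange(lo, hi + stride, stride)


if __name__ == "__main__":
    print(multi_scale_range(416))  # (288, 640)
    print(multi_scale_range(320))  # (224, 480)
    print(pick_size(416))
```

In training, a call like `pick_size` would be made once every 10 batches, as described above.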

glenn-jocher on 2 Aug 2019

How should one set the hyperparameters for custom single-class data? Since the data are different from COCO, we should make some changes to the hyperparameters. Any hints on that? Thank you.

sanazss on 2 Aug 2019

@sanazss You can try tuning the hyperparameters using the --evolve flag, which will automatically search for the best hyperparameters by maximizing a metric (fitness, set to 0.5 * f1 + 0.5 * mAP). You could also try manually tuning some hyperparameters, like the learning rate, based on the graphs generated from results.txt.

ktian08 on 2 Aug 2019

Ideally you want your loss terms to have similar magnitudes, so you could manually see if any of your loss terms (GIoU, Confidence, Classification) is different than the others and adjust the weights a bit to get started, and then if you get some decent results from there (nonzero mAP) you can do python3 train.py --evolve on your custom data. This will evolve your hyperparameters looking for the best fitness based on random mutations. The results are recorded in evolve.txt. You probably want to run your evolve loop a few hundred times (200-300 evolutions seems to produce stable results).
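
As a starting heuristic for the manual balancing step, the sketch below computes gains that bring each loss component to the same magnitude. The function names, the dictionary keys and the example loss values are hypothetical; the repo stores its gains in the hyp dictionary in train.py.

```python
def weighted_loss(components, gains):
    """Total loss as a gain-weighted sum of the individual components."""
    return sum(gains[k] * components[k] for k in components)


def balancing_gains(components):
    """Gains that scale every (non-zero) component to the mean component magnitude."""
    mean = sum(components.values()) / len(components)
    return {k: mean / v for k, v in components.items()}


if __name__ == "__main__":
    # Hypothetical raw loss values observed early in training.
    raw = {"giou": 2.0, "obj": 8.0, "cls": 2.0}
    gains = balancing_gains(raw)
    print(gains)
    print({k: gains[k] * raw[k] for k in raw})  # every term now has the same size
```

These gains are only a rough starting point; the evolution loop described above is what refines them.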

glenn-jocher on 2 Aug 2019

Thank you for your prompt reply. Should I set opt.evolve = True? I am running train.py and I think it runs in the normal way and doesn't save hyperparameters. Could you guide me on this? Thank you.

sanazss on 2 Aug 2019

It's important to keep in mind that evolution is a pretty advanced topic that you get into when you are at the end of the road in terms of what you can achieve with the default setup. Since you are just getting started, you should simply train your custom data to 300 epochs. Only if you are unsatisfied with the results, and further training (i.e. to 400 or 500 epochs) is not improving them, should you start exploring the more advanced options.

glenn-jocher on 2 Aug 2019

Also, if you are in a position to collect more data, this should be higher on your list than tuning the hyperparameters. This is all very experimental, and it's not yet proven to have a significant effect on the final mAP.

glenn-jocher on 2 Aug 2019

Thank you for your advice

sanazss on 2 Aug 2019

The latest. --img-weights is blue. It seems to produce better results at first but trends worse than the baseline (green) at higher epochs. Orange is using INTER_AREA cv2 resize when loading the images rather than the baseline's INTER_LINEAR. Orange may seem a tiny bit better, but the INTER_AREA function is much slower than INTER_LINEAR, it adds about 10% to each epoch time.

In general --img-weights seems to be a fail, but I'll let it run another day to make sure.

results_epoch70

glenn-jocher on 3 Aug 2019

I've updated the code to save each component (GIoU, Conf, Class) of the validation losses, and I've added a new plotting function which helps see when overtraining starts to occur in each loss component. I got a very interesting result on our baseline model, which is at 100 epochs so far. It appears GIoU loss is doing fine, as both train and val GIoU losses are still trending downward, however the other two loss components show diverging fortunes, with clear overtraining in the Classification loss, and overtraining starting to gradually occur as well in Objectness loss (Confidence).

I'm open to suggestions about what to do with this information. My instinct is to lower the loss gains on Class significantly (maybe cut them in half), and also on Conf a bit. Conf/Class losses can be switched to BCE/CE from the current BCE/BCE setup also. Any ideas?

from utils.utils import *; plot_results_overlay()
results_gcp1

glenn-jocher on 5 Aug 2019

@glenn-jocher Managed to get 50.5 mAP on YOLOv3-320, just 1 mAP off (using --save-json)! These were my hyperparameters and results:
Screen Shot 2019-08-05 at 5 14 02 PM
Screen Shot 2019-08-05 at 5 02 12 PM
I trained using 273 epochs, batch size=32, accumulate=2, multiscale on, and GIoU loss. I did not use --img-weights. These hyperparameters were gotten by evolving to 20 epochs for ~15 or so generations.

ktian08 on 6 Aug 2019

@glenn-jocher I also think your instinct to lower the hyperparameters for conf, obj makes sense, since they seem to be dominating the loss (I think mine are x2 yours currently). So is the current loss implemented with BCE for class, i.e. all the incorrect classes are grouped as one "incorrect class," rather than calculating cross entropy for all classes? I think switching to CE would make it harder to overfit these loss terms. What was the reasoning behind BCE or CE? It seems that the original YOLO paper uses something like an MSE on the class score difference.

ktian08 on 6 Aug 2019

@ktian08 this is fantastic!!! I think this is the first time the repo has produced >50 mAP from scratch!!

50.5 mAP is only -1.9 from the 52.4 darknet mAP obtained using python3 test.py --img-size 320 --save-json, so this is extremely close. The gap used to be around -10 mAP a few months back. A few things I noticed:

Your test loss is constantly decreasing, this is exactly what we want, there is no overfitting!
Your mAP jump on the first LR scheduler step is huge, almost 10%, much larger than before. I wonder what part of this is due to the new hyps and how much is due to training 73 epochs more.
Our evolution suspicions were correct: maximizing final mAP requires evolving hyperparams on a longer timeline (i.e. 5-10% of the total epochs seems to be a good guideline).
The evolved LR and momentum are very aggressive, and your hyp['momentum'] value has actually hit the artificial ceiling I placed there of 0.97. I put a ceiling on the value because I saw that when this approached 0.99 the epoch 0 mAPs all fell to near 0. You should raise the ceiling to 0.98, though probably not 0.99. I'll raise this to 0.98 in my next commit.
https://github.com/ultralytics/yolov3/blob/68a5f8e2078a853623d17ca6a11bfa2ce3ea4aba/train.py#L387-L392

UPDATE: python3 test.py --img-size 320 --save-json is returning 52.3 mAP now, so the correct comparison is 50.5 - 52.3 = -1.8.

glenn-jocher on 6 Aug 2019

@ktian08 about your CE vs BCE question: In the past I've always observed PyTorch CE loss outperforming BCE loss, even on pure classification problems like MNIST. I don't know if this is due to PyTorch's specific implementation of the two (I have not tested this in TensorFlow for example), or whether there is some basis in theory/mathematics for this result.
https://github.com/ultralytics/mnist

For most of the past year I've used CE for classification loss (due to the better MNIST results I saw firsthand), but a few months ago I realized that balancing the classification CE loss and the objectness BCE loss might be causing problems (since our object detection problem has a multi-component loss function), so I switched classification to BCE under the assumption that the obj and cls loss would play better together over the course of training if they were the same loss type.

So it's definitely a worthwhile experiment putting CE loss back for cls. The main issue though is that hyp['cls'] is then a complete unknown, so you'd need to search for a good value for it, possibly starting from the current value. Or maybe a better solution would be to add a new hyp called hyp['cls_ce'], and freeze all the other hyperparameters and search for this one. That would make it easier for people to use the bce hyps if they have a multilabel task. hyp['cls_pw'] would go unused, so you can leave as is and forget about it. The change itself is very simple, you would just comment the 3 BCE lines and uncomment the 1 CE line in the loss function:
https://github.com/ultralytics/yolov3/blob/68a5f8e2078a853623d17ca6a11bfa2ce3ea4aba/utils/utils.py#L317-L323
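
To make the difference between the two options concrete, here is a framework-free sketch of both losses for a single sample. It is meant to mirror what nn.CrossEntropyLoss and nn.BCEWithLogitsLoss compute with their default mean reduction; the function names are assumptions, and this is not the repo's loss code.

```python
import math


def softmax_ce(logits, target):
    """Cross-entropy over a softmax: classes compete and probabilities sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def sigmoid_bce(logits, target):
    """Mean binary cross-entropy over independent per-class sigmoids."""
    total = 0.0
    for i, z in enumerate(logits):
        y = 1.0 if i == target else 0.0
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z)
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    print(softmax_ce(logits, target=0))
    print(sigmoid_bce(logits, target=0))
```

Because the two losses have different scales for the same logits, a gain tuned for one (such as hyp['cls']) does not carry over directly to the other, which is why a separate search is suggested above.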

glenn-jocher on 6 Aug 2019

Hello,
Nice work. Thanks for sharing. I would like to know if you used the
--evolve option?

Thanks in advance for your reply.

Zhe

Aurora33 on 6 Aug 2019

@glenn-jocher To clarify, my experiment was actually run on YOLOv3-320, not YOLOv3-320spp. I just ran test.py and got 51.7 mAP for pjreddie's pretrained weights, so the difference is -1.2 currently for me. We can try the suggested momentum and loss changes. Did you ever use MSE for the class and object scores?

@Aurora33 Yes, I used --evolve to arrive at my current hyperparameters. Evolving from a previous (in July) commit's hyperparameters allowed my mAP to increase by ~3 mAP.

ktian08 on 6 Aug 2019

@ktian08 oh wow even better. The README is showing 51.8 mAP there, but if I fire up a colab instance today, yes, I also currently see 51.7 mAP, so we are only -1.2 mAP with your hyperparameters, incredible.

I've always used -spp since it gives you an almost free +1-2 mAP, but I'm pretty sure what helps yolov3 will also help yolov3-spp.

I think we should probably commit your hyperparameters to the master branch, and then I was thinking of using them as a starting point for another search, perhaps to 27 epochs this time (10% of the total), using python3 train.py --data data/coco.data --img-size 320 --epochs 27 --batch-size 32 --accumulate 2 --multi-scale --evolve.

Oh, BTW, if you do --batch-size 64 --accumulate 1 this should produce the same results as --batch-size 32 --accumulate 2, but will train faster... though --multi-scale at --batch-size 64 might also cause a CUDA out of memory error.
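
The equivalence rests on gradients being summed across accumulated mini-batches before a single optimizer step. A toy sketch on a one-parameter linear model (all names and data are illustrative assumptions) shows the two settings producing the same gradient:

```python
def grad_sse(w, xs, ys):
    """Gradient of the summed squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def accumulated_grad(w, xs, ys, batch_size):
    """Sum gradients over consecutive mini-batches before one optimizer step."""
    g = 0.0
    for i in range(0, len(xs), batch_size):
        g += grad_sse(w, xs[i:i + batch_size], ys[i:i + batch_size])
    return g


if __name__ == "__main__":
    xs = list(range(64))
    ys = [3 * x + 1 for x in xs]
    print(accumulated_grad(2.0, xs, ys, batch_size=64))  # batch 64, accumulate 1
    print(accumulated_grad(2.0, xs, ys, batch_size=32))  # batch 32, accumulate 2
```

In a real network the match is only approximate, since layers such as batch normalization compute their statistics per mini-batch.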

Also forget about the --img-weights, it seems to hurt more than it helps.

glenn-jocher on 7 Aug 2019

@ktian08 BTW, since the momentum parameter is so close to its ceiling, we should lower the amount it varies from its parent each mutation. I've called this inter-generation variation 'sigma' in the code. Most of the hyps have sigmas of 0.15-0.20, allowing them to vary greatly from one generation to the next, but momentum is especially sensitive, so I had lowered it previously to 0.05 sigma, and I've lowered it again to 0.02 sigma in the latest commit. This plot gives an example of how an offspring's hyp values will vary given a parent with a hyp value of 1.0, assuming 0.02 sigma (orange), and 0.01 sigma (blue).
Figure_1

This sigma value is tricky to set, because when you just start evolving you want great variation from one generation to the next (to ideally scan the vast n-d parameter space efficiently), but as you evolve to a more mature solution (like what you have now), it may be wise to reduce the sigmas, in much the same way as the LR is reduced in the later stages of training a neural network.
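
The role of sigma can be sketched with a simple multiplicative Gaussian mutation. This assumes the child is the parent scaled by a factor with mean 1 and standard deviation sigma, then clipped to optional limits; the repo's exact mutation formula may differ, and the function name is hypothetical.

```python
import random


def mutate(parent, sigma, lo=None, hi=None, rng=random):
    """Scale parent by a Gaussian factor (mean 1, std sigma), then clip to [lo, hi]."""
    child = parent * (1.0 + sigma * rng.gauss(0.0, 1.0))
    if lo is not None:
        child = max(child, lo)
    if hi is not None:
        child = min(child, hi)
    return child


if __name__ == "__main__":
    rng = random.Random(0)
    for sigma in (0.20, 0.02):
        kids = [mutate(1.0, sigma, rng=rng) for _ in range(10000)]
        print(sigma, min(kids), max(kids))  # spread shrinks with sigma
    # A momentum-like value near its ceiling stays at or below the ceiling.
    print(mutate(0.97, 0.02, hi=0.98, rng=rng))
```

A large sigma explores widely, while a small sigma keeps offspring close to an already good parent.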

glenn-jocher on 7 Aug 2019

@glenn-jocher Sounds good! My teammates may look more into training to get the desired remaining mAP. Could I get collaborator access and push my hyperparameters/some of the training code that generated these hyperparameters? I pulled like in mid-July, so our codebases are quite different, but I implemented by hand some of the changes you've made.

ktian08 on 7 Aug 2019

@ktian08 @glenn-jocher I just finished a training run like @ktian08's and got 50.2% mAP. I trained the model exactly like @ktian08 but with the latest code in master.
Is the confidence loss overfitting at the end?
results

There are strange detections in sample images:
bus
zidane
Detections with original yoloV3:
bus
zidane
I found there are more false positives in our model than in YOLOv3.

Aurora33 on 14 Aug 2019

@Aurora33 oh very interesting. @ktian08 trained with --multi-scale and did not use the darknet53.conv.74 backbone to get his results.

Since this model is trained with different hyperparameters it will have a different --conf-thres that you'll want to apply to it. As you can see all of the confidences are higher than with the default weights, so you may want to raise your conf-thres above the default setting in detect.py.