Yolov3: Darknet Training Comparison

Created on 4 Dec 2018 · 15 Comments · Source: ultralytics/yolov3

All, I've started training using the official darknet repo to compare. The first three things I noticed are:

Darknet training appears quite slow. In darknet's yolov3.cfg, max_batches = 500200 sets the total number of training batches and batch=64 sets the images per batch, so training will take about 28 days on a GCP P100 at about 18,000 batches per day (all training settings left at their defaults).
Darknet appears set to train for 267 epochs. This is 500200 batches times 64 images per batch divided by 120,000 images in the training set. Can this be right? This seems like a lot.
Darknet is using multi_scale training, changing the image size every 10 batches. I've set this behavior as well in this repo if -multi_scale = True in train.py (though currently this changes the size every batch).
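
A minimal sketch of the multi-scale schedule described above: a new image size is drawn every 10 batches and reused until the next draw. The size range (320 to 608 in steps of 32) and the function name `multiscale_sizes` are assumptions for illustration; the thread itself only states the 10-batch interval.

```python
import random


def multiscale_sizes(n_batches, every=10, min_size=320, max_size=608,
                     stride=32, seed=0):
    """Return the image size used for each batch.

    A new size is sampled every `every` batches. The size range and
    stride are assumed values, not taken from the thread.
    """
    rng = random.Random(seed)
    choices = list(range(min_size, max_size + 1, stride))
    sizes = []
    size = None
    for i in range(n_batches):
        if i % every == 0:  # resample at batches 0, 10, 20, ...
            size = rng.choice(choices)
        sizes.append(size)
    return sizes


sizes = multiscale_sizes(30)
print(sizes[0], sizes[10], sizes[20])
```

Setting `every=1` gives the per-batch resizing that the post says this repo currently uses.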

question


glenn-jocher

👍1

All 15 comments

Hi,
max_batches = 500200 counts batches, one each time a batch of images is processed. Is that the same batch as the batch in the cfg? It seems it may need to be multiplied by the subdivisions.

Hello526 on 9 Dec 2018

I am also doing a comparison of darknet and pytorch versions.

Hello526 on 9 Dec 2018

@Hello526 I don't understand your question. In darknet the yolov3.cfg default is batch = 64 and subdivisions = 16. I think this means that the batch is divided into 4 groups of 16.

My epochs calculation is simply 64 * 500200 / 120000 = 267 epochs of darknet training.
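
The arithmetic behind the epoch and training-time estimates can be checked with a few lines. All numbers are the ones quoted in this thread; the variable names are only illustrative.

```python
max_batches = 500200       # from darknet yolov3.cfg
images_per_batch = 64      # batch=64 in the cfg
train_images = 120_000     # approximate training set size quoted above
batches_per_day = 18_000   # observed rate on a GCP P100, per the first post

epochs = max_batches * images_per_batch / train_images
days = max_batches / batches_per_day

print(f"epochs: {epochs:.1f}")  # epochs: 266.8
print(f"days:   {days:.1f}")    # days:   27.8
```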

glenn-jocher on 11 Dec 2018

@glenn-jocher I think it should be that the network processes 4 images at a time, 16 times, that is, net.input is 4×3×416×416.

Hello526 on 12 Dec 2018

@glenn-jocher @Hello526 Subdivisions are similar to iter_size, for those familiar with Caffe. The technique is also sometimes referred to as _virtual batch size_ or _accumulated gradients_ in the literature.
What this means in practice is that, to emulate an effective batch size of 64 (the value hard-coded among the hyperparameters chosen for training in the original paper), we use a subdivision of 16, since we can only fit 64/16=4 actual images when training on a P100 (or whatever was used in the original paper implementation).

On GPUs with more memory this parameter can be reduced to 8, and for higher input resolutions it should be raised to 32. Implementing support for this in PyTorch is quite easy. Note that this is very important: training with a batch size of 4 and no accumulation will probably not work as well, because the gradient signal is noisier than when 64 images are used for every gradient step.
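
To illustrate why accumulation emulates the larger batch, here is a small sketch in plain Python (a toy one-parameter least-squares model, not Darknet or PyTorch code). It checks that averaging the gradients of 16 micro-batches of 4 samples gives the same gradient as one batch of 64. The model, data, and names are all made up for the example.

```python
import random


def mean_grad(w, xs, ys):
    """Gradient of the mean of 0.5 * (w*x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
w = 0.5

# One gradient over the full batch of 64 samples.
full = mean_grad(w, xs, ys)

# The same 64 samples processed as 16 subdivisions of 4 samples each.
subdivisions = 16
micro = len(xs) // subdivisions
accum = 0.0
for k in range(subdivisions):
    sl = slice(k * micro, (k + 1) * micro)
    # Divide by the number of subdivisions so the sum is a mean over all 64.
    accum += mean_grad(w, xs[sl], ys[sl]) / subdivisions

print(abs(full - accum))  # difference is at floating-point rounding level
```

The equality relies on each sample's loss being independent of the other samples in the micro-batch, which is not the case for layers that compute statistics across the batch.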

Also note that different frameworks sometimes vary slightly in which optimization hyperparameters work best. I experienced this when moving from Caffe to TF a couple of years ago, and I wouldn't be surprised if the same is true for Darknet->PyTorch.

nirbenz on 12 Dec 2018

👍1

@nirbenz Thank you very much !

Hello526 on 12 Dec 2018

@nirbenz yes you are correct. I just checked and darknet is only computing 4 actual images at a time, but accumulating the gradient 16 times to arrive at an effective gradient for the entire batch of 64 images.

To compare, this repo only computes the gradient on 16 images at a time, and optimizes immediately afterward. I did an experiment in the past accumulating for 64 images before optimizing, but did not observe improvement after 1 epoch when resuming from yolov3.pt https://github.com/ultralytics/yolov3/issues/22#issuecomment-427602217.

I can try again starting from darknet weights. To accumulate 64 images one would use --batch-size 16 and uncomment this if statement, with accumulated_batches = 4:
https://github.com/ultralytics/yolov3/blob/c59193644620886becc9a3cfd7518ad74e6a7986/train.py#L156-L159
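
As a rough, framework-free sketch of the loop structure this implies (not the code at the link above), the optimizer step fires only on every fourth micro-batch of 16, so each update sees 64 images. The toy model and the function `train_one_epoch` are hypothetical; only the name `accumulated_batches` and the values 16 and 4 come from the comment.

```python
def train_one_epoch(batches, w, lr=0.1, accumulated_batches=4):
    """Toy SGD on 0.5 * (w*x - y)**2 with gradient accumulation.

    Gradients are summed over `accumulated_batches` micro-batches and
    the weight is updated once per group.
    """
    grad_sum = 0.0
    steps = 0
    for i, (xs, ys) in enumerate(batches, start=1):
        # Mean gradient over this micro-batch.
        g = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Scale so the accumulated value is a mean over the whole group.
        grad_sum += g / accumulated_batches
        if i % accumulated_batches == 0:
            w -= lr * grad_sum  # one optimizer step per 4 micro-batches
            grad_sum = 0.0
            steps += 1
    return w, steps


# 8 micro-batches of 16 samples each, with targets y = 2*x.
xs = [0.1 * (j + 1) for j in range(16)]
ys = [2.0 * x for x in xs]
data = [(xs, ys) for _ in range(8)]

w, steps = train_one_epoch(data, w=0.0)
print(w, steps)
```

Here 8 micro-batches produce 2 weight updates, each based on 4 x 16 = 64 samples.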

glenn-jocher on 12 Dec 2018

Does anyone know how batch normalization works if only 4 images are used for the batch statistics? Isn't that extremely low, or do they calculate batch statistics over all 64 images somehow?

hello-hi1 on 26 Dec 2018

This is indeed a problem for the way batch normalisation is computed. I'm actually not sure how Darknet performs gradient accumulation inside BN layers. One possibility is to create a subclass of the original PyTorch implementation which saves statistics from past forward passes.
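
A toy illustration of the concern, assuming one scalar activation per image (real convolutional BN also pools over spatial positions, so the effect there is milder): statistics computed on groups of 4 fluctuate from group to group, and the average within-group variance underestimates the variance of all 64 values. This says nothing about what Darknet actually does; it only shows the statistical effect.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical activations: one value per image for a batch of 64.
activations = [rng.gauss(0.0, 1.0) for _ in range(64)]

# Statistics over the full batch of 64.
full_mean = statistics.fmean(activations)
full_var = statistics.pvariance(activations)

# Statistics over each micro-batch of 4, as seen with 16 subdivisions.
groups = [activations[i:i + 4] for i in range(0, 64, 4)]
micro_means = [statistics.fmean(g) for g in groups]
micro_vars = [statistics.pvariance(g) for g in groups]

spread = max(micro_means) - min(micro_means)
print(f"full-batch mean/var: {full_mean:.3f} / {full_var:.3f}")
print(f"micro-batch mean spread: {spread:.3f}")
print(f"average micro-batch var: {statistics.fmean(micro_vars):.3f}")
```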

nirbenz on 8 Jan 2019

Hey, I am training a model on 4 classes, and according to the darknet repo:

change line max_batches to (classes*2000 but not less than 4000), f.e. max_batches=6000 if you train for 3 classes
darknet

The default value for max_batches is 500200 in the cfg file. Do we need to change this value to our number of classes?
Also, for training on negative images, they mention that we need to add negative samples without bounding boxes (empty .txt files). Is this true for this repo as well?

shahidammer on 15 Nov 2019

@shahidammer for negative examples you can simply add images to train.txt and test.txt without needing to add empty label files.

glenn-jocher on 15 Nov 2019

@glenn-jocher
I'm training for 3 classes and my dataset is about 190,000 images, with batch size 32.
So if I use [2000 * (no. of classes) * batch_size] / images = epochs, then
do I have to train for 2 epochs? Should I only use a part of my dataset, since I only need to detect 3 classes (vehicles, pedestrians and cyclists)?
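
The formula above can be evaluated directly. This sketch combines the darknet rule quoted earlier in the thread (classes*2000, but not less than 4000) with the dataset figures given here; the function names are illustrative only.

```python
def suggested_max_batches(num_classes):
    """Rule quoted from the darknet repo: classes * 2000, at least 4000."""
    return max(num_classes * 2000, 4000)


def implied_epochs(num_classes, batch_size, num_images):
    """Epochs covered when training for suggested_max_batches batches."""
    return suggested_max_batches(num_classes) * batch_size / num_images


print(f"{implied_epochs(3, 32, 190_000):.2f}")
print(f"{implied_epochs(3, 64, 190_000):.2f}")
```

With the quoted numbers the rule works out to roughly 1 epoch at batch size 32 and roughly 2 epochs at batch size 64.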

akramscarfs on 18 Feb 2020

@akramscarfs you should train until you begin to observe overtraining. COCO trains to almost 300 epochs with about half as many images for reference.

glenn-jocher on 19 Feb 2020

@akramscarfs you should train until you begin to observe overtraining. COCO trains to almost 300 epochs with about half as many images for reference.

Doesn't that take a lot of time? :(