Darknet: EfficientDet: Scalable and Efficient Object Detection - 51.0% AP on COCO

Created on 21 Nov 2019 · 164 comments · Source: AlexeyAB/darknet

EfficientDet: Scalable and Efficient Object Detection

First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion;


All 164 comments

Looks really promising! The GPU latencies given are very low but it uses efficientnet as the backbone - how could that be?

So this could be implemented in this darknet repository??? I'm a little confused.

What is the best way to get a high mAP@50 right now? Can I use EfficientDet-D0 - D6? I use yolov3-voc.cfg to train my own dataset and get mAP@50 = 80 on my own test set. I just add three lines:
flip = 1
letter_box=1
mixup = 1
Thanks a lot! @AlexeyAB

@AlexeyAB @WongKinYiu guys I might have an interesting clue about increasing mAP.

Efficientdet has 5 outputs (P3-P7) compared to 3 (P3-P5) for yolov3, but these extra 2 are for larger objects, not smaller objects. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but this was for smaller objects.

On the same topic, I recently added test-time augmentation to my repo https://github.com/ultralytics/yolov3/issues/931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image. The highest mAP increase was coming from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist).

So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think?
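A rough sketch of that kind of test-time augmentation (not the ultralytics implementation; model here is assumed to be a callable that returns detections for a 1x3xHxW tensor, and merging/NMS plus mapping boxes back to the original image are left out):

import torch
import torch.nn.functional as F

def tta_inference(model, img):
    """Sketch of test-time augmentation: original, left-right flip, and a 0.70-scale copy."""
    outs = [model(img)]                                        # original image
    outs.append(model(torch.flip(img, dims=[3])))              # left-right flip (boxes must be un-flipped afterwards)
    small = F.interpolate(img, scale_factor=0.70, mode='bilinear', align_corners=False)
    outs.append(model(small))                                  # 0.70 scale (boxes must be rescaled afterwards)
    return outs  # merge the three detection sets (e.g. with NMS) after mapping boxes back to the original image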

@glenn-jocher Hello,

Yes, from my previous analysis, I think we need at least one more scale (P6).
I already started training a P3-P6 model several days ago.
But there is also an issue I ignored at first: the input size may need to be a multiple of 64 instead of 32.

I saw your modification yesterday, and I have already integrated YOLOv3-SPP into mmdetection.
Two different tricks are used in ultralytics and mmdetection.
In ultralytics: train from scratch + prebias.
In mmdetection: start from a pretrained model, but only the BatchNorm layers are updated.
I would like to examine the performance of these two tricks.
Also, I will apply the tricks from ATSS to YOLOv3-SPP if it does not require modifying much code.

By the way, CSPDarkNet53 has also been integrated with TTFNet (anchor-free object detection), MSRCNN (instance segmentation), and JDE (simultaneous detection and tracking).

OK, I am training YOLOv3-SPP with almost the same settings as CSPResNeXt50-PANet-SPP (optimal) using ultralytics. I think it can serve as the baseline for your new model which integrates P3-P7.

@glenn-jocher @WongKinYiu

Yes, it seems that https://arxiv.org/abs/1909.00700v3 and https://github.com/ZJULearning/ttfnet (Training-Time-Friendly Network for Real-Time Object Detection) is a good way.

While I think we must move on:

  • use triplet loss to train in ~10 seconds per object
  • use LSTM to achieve +10-30% AP for detection on video while training on static images

On the same topic, I recently added test-time augmentation to my repo ultralytics/yolov3#931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image.

Yes the same result as for CenterNet, they achieve 40.3% AP / 14 FPS, then with Flip - they achieve 42.2% AP / 7.8 FPS, and with Multi-scale they achieve 45.1% AP / 1.4 FPS. https://github.com/xingyizhou/CenterNet#object-detection-on-coco-validation
But then it is no longer a single-pass detector and the model is not real-time anymore.
Meanwhile, I am thinking about flip-invariant/rotation-invariant/scale-invariant weights with a significant increase in accuracy and a small drop in FPS: https://github.com/AlexeyAB/darknet/issues/4495#issuecomment-578538967

The highest mAP increase was coming from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist).

Maybe yes.

  • Maybe we should just add P6.
  • But maybe we should also increase the network resolution and the model size for optimal AP/FPS, as stated in the EfficientNet/Det papers.

Efficientdet has 5 outputs (P3-P7) compared to 3 (P3-P5) for yolov3, but these extra 2 are for larger objects, not smaller objects. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but this was for smaller objects.

Yes, it is because they increased the input network resolution from 512x512 for D0 (where P3-P5 cover big objects) to 1536x1536 for D7 (where P3-P5 cover small objects), so we should add P6-P7 for big objects. The NxN receptive field of P5 doesn't depend on the network resolution and stays the same in pixels (see the yolo cfg files below), so the same NxN is large relative to a 512x512 input but small relative to a 1536x1536 input.

So maybe we should:

  • increase the network resolution to width=896 height=896
  • add a P6 yolo-layer with a higher receptive field, so we have 4 [yolo]-layers
  • increase the model size ~1.35x (filters= for each conv layer)
  • use 12 or 15 anchors for the 4 yolo-layers

So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think?

Yes, try 4 ways:

  1. yolov3-spp + P6
  2. yolov3-spp + P6 + network resolution 896x896
  3. yolov3-bifpn + P6-P7
  4. yolov3-bifpn + P6-P7 + network resolution 896x896

I added receptive field calculation - usage: [net] show_receptive_field=1 in cfg-file:

  • yolov3-tiny.cfg

    1. yolo-layer 13x13: 318x318

    2. yolo-layer 26x26: 286x286


  • yolov3.cfg

    1. yolo-layer 13x13: 917x917

    2. yolo-layer 26x26: 949x949

    3. yolo-layer 26x26: 965x965


  • yolov3-spp.cfg

    1. yolo-layer 13x13: 1301x1301

    2. yolo-layer 26x26: 1333x1333

    3. yolo-layer 26x26: 1349x1349

While the input network size is just 608x608.

  • So for [net] width=608 height=608 the 1st yolo-layer 13x13: 1301x1301 is for big objects

  • But for [net] width=3200 height=3200 the 1st yolo-layer 13x13: 1301x1301 is for small objects
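For reference, the standard recurrence behind this kind of receptive-field calculation is sketched below (a simplified illustration, not Darknet's show_receptive_field code, which also counts every stride-1 conv and maxpool, hence the much larger numbers above):

def receptive_field(layers):
    """Receptive field of a plain conv stack; layers is a list of (kernel_size, stride) in forward order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) times the current pixel jump
        jump *= s             # stride multiplies the jump between adjacent output cells
    return rf

# rough sanity check: five 3x3 stride-2 convs (downsample to 1/32) give a 63-pixel field
print(receptive_field([(3, 2)] * 5))  # 63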

@WongKinYiu ah great! Lots of integrations going on. I have not looked at ATSS yet, I will check it out. TTFNet looks refreshingly simple.

Yes that's a good chart you have there. How do you calculate the receptive field exactly? I saw Efficientdet updated their anchor ratios to (1.0, 1.0), (1.4, 0.7), (0.7, 1.4). I'm not sure exactly how these work. Do you think these create anchors based on multiplying grid cells or the receptive field?

Yes, a P6 layer requires image sizes that are multiples of 64, and a P7 layer would require multiples of 128, but it's not a huge problem.

There is a 3x3 convolutional layer just before the prediction layer, so I simply multiply the size of the grid cell by 3.

@AlexeyAB

The JDE paper provides results for different embedding losses, but unfortunately they only released the code for the cross-entropy loss.

Also, there are some issues to be solved; for example, it only supports single-class tracking, and different anchors at the same scale share the same embedded feature.

The code is mainly based on ultralytics, so I think it can be a starting point for developing a triplet-loss based tracker: https://github.com/Zhongdao/Towards-Realtime-MOT

@WongKinYiu so you simply take a 3x3 grid as the receptive field. Ok.

Do you think it might be beneficial to have the anchors be fixed in units of gridspace instead of image space? Maybe this is what EfficientDet is doing with their (1,1), (1.4, 0.7), (0.7, 1.4) anchor multiples (I don't know what they do with these multiples).

Right now the anchors are fixed/defined in image space (pixels) rather than grid space, so the same anchor would span a varying number of gridpoints depending on the output layer (if it were applied to different layers).

What do you think of the idea of defining the anchors as (1,1), (1.4, 0.7), (0.7, 1.4) local gridpoints, and then maybe testing out, say, a 2x and 3x multiple of that?

In other news, I implemented my yolov3-spp-p6 and I'm training it now. I trimmed some of the convolutions to keep the size manageable; it's 81M params now and trains about 25% slower than normal. Early mAP was lower, but it seems to be crossing yolov3-spp and going higher at around 50 epochs. I'll keep my fingers crossed.


@glenn-jocher

From my previous analysis, I think {0.7, 1.4} comes from requiring IoU >= 0.5:
sqrt(0.5) ≈ 0.7, sqrt(2) ≈ 1.4, and sqrt(0.5)*sqrt(2) = 1, so (0.7, 1.4), (1.4, 0.7) and (1, 1) all have almost the same area.

@WongKinYiu ah yes that makes sense! Also (1.4, 0.7) IOU is about 0.55, close to 0.5. From your earlier plots though it looks like the current anchors correspond much better to about 3x3 gridpoints than 1x1 gridpoints.

At 512x512, the P3 grid is 64x64, P4 is 32x32, P5 is 16x16, and P6 is 8x8. If we had a P7 that would be 4x4, and 3 gridpoints at that scale would take up almost the entire image (which sounds about right). At the smaller scale though the P3 stride is 8, and we currently have anchors about that size (smallest is 10x13 pixels).
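As a quick numeric sanity check on both points above (the near-equal anchor areas and the per-level grid sizes), plain arithmetic:

import math

# EfficientDet's three aspect-ratio multiples cover (almost) the same area
for w, h in [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]:
    print(f'({w}, {h}) -> area {w * h:.2f}')               # 1.00, 0.98, 0.98
print(round(math.sqrt(0.5), 2), round(math.sqrt(2), 2))    # 0.71, 1.41, i.e. ~0.7 and ~1.4

# grid size per pyramid level at a 512x512 input (stride doubles per level)
for level, stride in [('P3', 8), ('P4', 16), ('P5', 32), ('P6', 64), ('P7', 128)]:
    g = 512 // stride
    print(f'{level}: stride {stride:3d} -> {g}x{g} grid, 3 gridpoints ~ {3 * stride} px')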

I'm worried my existing anchors are causing tension in the P6 model, as the GIoU loss is higher than normal. I simply spread out the 12 anchors I was using for yolov4.cfg (which has P3-P5, 4 at each level) to yolov4-p6 (which has P3-P6, 3 anchors at each level).

@AlexeyAB @WongKinYiu ok, my P6 experiment was tracking worse than yolov3-spp after about 150 epochs so I cancelled it. I'm not sure why exactly.

If I look at the yolov3-spp receptive field, at P5, stride 32, the largest anchor is (373,326), or 10 grids, which would be 3X the receptive field according to @WongKinYiu

P6 has stride 64, so only 1.5X receptive field for the largest anchor, yet overall mAP is worse. I did trim some convolution operations to keep the parameter count reasonable, so this could be the cause. Back to the drawing board I guess. @WongKinYiu how did your P6 experiment go?

@glenn-jocher

Currently at 140k iterations; it needs several weeks to finish training.

For yolov3-spp, the receptive field becomes very large because the SPP module is added
(a 13x13 max-pool at stride 32 covers (32 * 13)x(32 * 13) = 416x416 of the input).

@WongKinYiu @glenn-jocher

(a 13x13 max-pool at stride 32 covers (32 * 13)x(32 * 13) = 416x416 of the input)

Also, you should take into account that conv3x3 stride=1 increases the receptive field too, not only conv3x3 stride=2.

You can see receptive field in the Darknet by using:

[net]
show_receptive_field=1

@glenn-jocher

Could you provide your cfg file and training command?
I will modify it and train on ultralytics.
(I get an error in test.py if I add a P6 yolo layer to the cfg.)

By the way, do you train/val on coco2014 or coco2017?

@WongKinYiu yes here is the p6 cfg with 12 anchors, and a modified version of yolov3-spp called yolov4 that has the same 12 anchors, which trains to slightly above yolov3-spp (+0.1mAP).

I had to add a lot of convolutions to p6, so it has 81M params. I doubled the width of the stem convolutions (which use few params), but reduced the width of the largest head convolutions (i.e. 1024 -> 640 channels). Overall the result was slightly negative though, so you may want to adjust the cfg.

python3 train.py --data coco2014.data --img-size 416 608 --epochs 300 --batch 16 --accum 4 --weights '' --device 0 --cfg yolov4-81M-p6.cfg --name p6 --multi

yolov4-81M-p6.cfg.txt

@glenn-jocher Try to train and test this model with network resolution 832x832 (with random shapes).
Also, why didn't you use an SPP block?

@AlexeyAB yes maybe I should put the SPP block back in on the P6 layer, and return the dn53 stem convolutions to their original sizes.

When I changed dn53 I saw that there were 8, 8 and 4 blocks in the last 3 downsamples. For p6 I changed this to 8, 8, 8 and 8 (no spp). Maybe I should update to 8, 8, 8, 4+spp, which would more closely mimic yolov3-spp.

@glenn-jocher

I started training yolov3-spp and yolov3-spp-p6.
The loss of yolov3-spp-p6 is very large at the 1st epoch compared to yolov3-spp.

@WongKinYiu yes, the loss is larger, in part because the total loss is the sum of the layer losses, i.e.:
total_obj_loss = obj_layer1_loss.mean() + obj_layer2_loss.mean() + obj_layer3_loss.mean()

whereas p6 will have an additional + obj_layer4_loss.mean(). But it may also simply be larger because the model is poorly designed.

@glenn-jocher @AlexeyAB

A good PyTorch implementation of EfficientDet, 26x faster than the official TensorFlow implementation.
https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch

@WongKinYiu Yes, @zylo117 achieved +3 (+10%) FPS and -1.2 (-4%) AP compared to stated results in the paper. But it is 25x faster than public TF-code: https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/issues/77

@AlexeyAB yes it looks like a very good pytorch addition!

I'm pretty suspicious about the capability of the repo to train from scratch currently though (he mentions the pytorch weights are transferred from tf and finetuned slightly), but the author seems to be a very in depth expert on efficientdet at least.

It seems efficientdet training is very slow in general with large memory requirements.

I really don't think EfficientDet works well for real-time operation. On an Nvidia P100 GPU, the inference speed (including pre- and post-processing) is about 10 FPS. In contrast, YOLO runs at 17 ms (~58 FPS).

But I got 30+ FPS at batch size 1 when evaluating D0. Since the evaluation includes pre- and post-processing, I think you need to optimize your implementation.
10 FPS is way too slow.

my environment is:
torch1.4
torchvision0.5
Python3.7(not anaconda)
Ubuntu19.10
i5 8400
rtx2080ti

@zylo117 How many FPS do you get for yolov3-spp? While yolov3-spp has higher accuracy than EfficientDet D0.

@zylo117 How many FPS do you get for yolov3-spp? While yolov3-spp has higher accuracy than EfficientDet D0.

I haven't tried yolov3-spp yet, but I do know yolov3-spp is faster and more accurate than D0. However, 8 FPS seems strange to me.

I think the real advantage of EfficientDet is that it consumes less memory and has fewer ops, so more models can be deployed on the same device, or it can run with a larger batch size for offline tasks. As I mentioned in my repo's readme, it gets 163 FPS at batch size 32.

@zylo117 does efficientdet use many grouped convolutions, or depthwise convolutions? Can you link to the basic bottleneck module in your repo? I think grouped convolutions may be a cause of the slower speed. I know they can cause much slower training in pytorch, and actually slow down inference as well, even if the model has fewer parameters and fewer FLOPS.

PyTorch (or Nvidia) lacks a native cudnn backend kernel for grouped convolutions I think, so PyTorch falls back on its default method, which is not cuda-optimized as I understand it. I made a small notebook to test these timing effects:


EDIT1: to be clear this would be a general issue with modern 'efficient' object detectors like efficientdet, and backbones like resnext, not specific to only @zylo117's pytorch implementation. Indeed there may be no real solution other than to wait for nvidia and pytorch to implement a cuda kernel for optimizing grouped convolution operations on gpu.

EDIT2: the timing effects shown simulate training, where a forward and backward pass are run on the convolution, but I believe inference shows similar but less severe slowdowns.

@zylo117
Yes, EfficientDet is more suitable for batch-inference and inference on TPU-edge.

  • What mini-batch size do you use for training?
  • What GPU and how many GPUs do you use for training EfficientDet D3 ?
  • How long do you train EfficientDet D3 (is it about ~1 month)?

@zylo117 does efficientdet use many grouped convolutions, or depthwise convolutions? Can you link to the basic bottleneck module in your repo? I think grouped convolutions may be a cause of slower speed. I know this can cause much slower training in pytorch, and actually slow down inference as well, even if the model has less parameters and fewer FLOPS.


Can you share the notebook? I'd like to run it in my environment. The link you provided is not accessible.

@zylo117 does efficientdet use many grouped convolutions, or depthwise convolutions? Can you link to the basic bottleneck module in your repo? I think grouped convolutions may be a cause of slower speed. I know this can cause much slower training in pytorch, and actually slow down inference as well, even if the model has less parameters and fewer FLOPS.


Never mind, I tried to implement it myself. This is the code:

import time

import torch
from torch import nn

# time 1000 forward passes of a 128->256 3x3 conv for different group counts
k = 3
x = torch.randn((1, 128, 512, 512)).cuda()
print('%10s%10s%10s %-20s' % ('groups', 'time(ms)', 'params', 'shape m'))
for g in [1, 2, 4, 8, 16, 32, 64, 128]:
    m = nn.Conv2d(128, 256, k, stride=1, groups=g, padding=k // 2, bias=False).cuda()
    t1 = time.time()
    for _ in range(1000):
        m(x)
    t2 = time.time()
    t = t2 - t1  # seconds for 1000 runs, numerically equal to ms per run
    p = list(m.parameters())[0]
    print('%10g%10.1f%10g %-20s' % (g, t, p.numel(), list(p.shape)))

And this is what I got,

      groups  time(ms)    params shape m             
         1       4.5    294912 [256, 128, 3, 3]    
         2       2.6    147456 [256, 64, 3, 3]     
         4       2.5     73728 [256, 32, 3, 3]     
         8       1.9     36864 [256, 16, 3, 3]     
        16       2.9     18432 [256, 8, 3, 3]      
        32       3.1      9216 [256, 4, 3, 3]      
        64       0.0      4608 [256, 2, 3, 3]      
       128       0.0      2304 [256, 1, 3, 3]  

The result is not so bad when dealing with group convs that have larger group counts.

I guess either torch has optimized group conv by version 1.4, or the RTX 2080 Ti is capable of dealing with large-group group convs and provides a speedup.

env:
i5 8400
ubuntu 19.10 x64
rtx2080ti
official python 3.7
torch 1.4
torchvision 0.5

@zylo117 ah sorry, I've made the notebook public now:
https://colab.research.google.com/drive/1tBkFOSLl3V1DguDgtlm6bNErPVDBCD7z?authuser=1#scrollTo=cjpQb9AsbfGR

Yes your code looks good except that Pytorch timing is kind of tricky in that you need to run synchronize() every time right before time.time() to get the true time:

def tsync():
    torch.cuda.synchronize() if torch.cuda.is_available() else None
    return time.time()
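For illustration, a self-contained version of the corrected timing loop might look like this (a sketch reusing the tsync() helper above; the conv shape and group count are arbitrary examples and a CUDA device is assumed):

import time

import torch
from torch import nn

def tsync():
    torch.cuda.synchronize() if torch.cuda.is_available() else None
    return time.time()

m = nn.Conv2d(128, 256, 3, padding=1, groups=16, bias=False).cuda()  # example shape/groups
x = torch.randn(1, 128, 512, 512).cuda()

t1 = tsync()
for _ in range(1000):
    m(x)
t2 = tsync()  # synchronize before reading the clock, so queued CUDA kernels are counted
print('%.1f ms per forward pass' % (t2 - t1))  # seconds per 1000 runs == ms per run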

I timed training and inference operations. Inference is also slower for grouped conv to a lesser degree:

@glenn-jocher after changing my code to use your tsync() and run() and x, this is what I got. Group convs are almost as fast as normal convs. Also, if there are many groups, group convs are even faster.

training:

    groups  time(ms)    params shape m             
         1       8.9    294912 [256, 128, 3, 3]    
         2       8.5    147456 [256, 64, 3, 3]     
         4       8.7     73728 [256, 32, 3, 3]     
         8       9.3     36864 [256, 16, 3, 3]     
        16      10.2     18432 [256, 8, 3, 3]      
        32       9.5      9216 [256, 4, 3, 3]      
        64       7.8      4608 [256, 2, 3, 3]      
       128       7.5      2304 [256, 1, 3, 3]      

inference:

    groups  time(ms)    params shape m             
         1       0.7    294912 [256, 128, 3, 3]    
         2       0.5    147456 [256, 64, 3, 3]     
         4       0.5     73728 [256, 32, 3, 3]     
         8       0.5     36864 [256, 16, 3, 3]     
        16       0.6     18432 [256, 8, 3, 3]      
        32       0.4      9216 [256, 4, 3, 3]      
        64       0.2      4608 [256, 2, 3, 3]      
       128       0.2      2304 [256, 1, 3, 3]      

@zylo117 yes, exactly. My main discovery was not that the operations take longer, it is that size and FLOPS savings do not translate to faster speed. For example, given your results, a model composed of groups=16 convolutions would be 16X smaller and use 16X less FLOPS than a model made up entirely of comparable groups=1 convolutions, but it would not be any faster.

If it is only 8X smaller than a normal model, then it would be twice as slow...

@glenn-jocher

The total loss of the new 12-anchor model is very low, but the AP is very poor.
Currently at 116 epochs with 6.89 total loss.
[cfg] [weights]

@WongKinYiu ah sorry bud, did not realize you were training it currently. Could you create a results.png to show? It's plot_results() in utils/utils.py.

I gave up on P6; it seems SPP is doing a good job of increasing the receptive field on its own.

BTW I saw you guys published YOLOv4! Congratulations. I've been cooking up a few changes of my own over here, it looks like I'll need a new name now 😄

@glenn-jocher

Oh, I use --notest to speed up training, so there are no results to show.

OK, if I get any good results for the P6 model, I will share them for your reference.
Since PyTorch 1.5 includes a significant update to the C++ front-end, I would like to develop some new functions with PyTorch.

Thank you, now I have time to read the details of your code and start designing a good head for an object detector.
I borrowed two 2080 Tis for developing a new object detector head based on ultralytics.

@WongKinYiu ah, --notest! But then how do you know what the AP is?

It's actually a bad time to start working on the ultralytics/yolov3 repo. I've been working on a new repo which folds in all of my lessons learned over the last year from people trying to train their custom datasets. The new repo is simpler and cleaner, a step closer to AutoML-style training, and produces better results on the new architectures I've explored. I've redefined model architectures based on simple yaml files as well, which makes it easy to test new models. I'm aiming to release it in early May, but I'll send you and Alexey invitations tomorrow. It's called ultralytics/yolov4, ironically.

BTW the new pytorch code I wrote is super efficient space-wise. The model yaml files that define the layers/anchors etc. are only about 50 lines, and the actual model.py file that contains the model classes is only about 200 lines long, including the yaml parser, detection module, forward method, etc. It's really minimalist and easy to understand.

@glenn-jocher

I run the test with last.pt on another GPU.

Thanks! I am glad to be invited to your new repository.
Would you provide a Dockerfile for quick installation?
I have some experiments on tracking and instance segmentation based on combining ultralytics/yolov3 and mmdetection. If possible, I would like to merge segmentation and tracking into the new ultralytics repository.

@glenn-jocher Hi,

I gave up on P6; it seems SPP is doing a good job of increasing the receptive field on its own.

Did you add an SPP block after P6?
How much did you increase: resolution, depth (number of layers), weights (filters)?
Did you use alpha=1.2, beta=1.1, gamma=1.15 as stated in the EfficientNet/Det paper?

BTW I saw you guys published YOLOv4! Congratulations. I've been cooking up a few changes of my own over here, it looks like I'll need a new name now smiley

Thanks! We included an acknowledgement and a link to your repository in the paper.
Currently YOLOv4 is the top-1 real-time detector on RTX 2050 - Titan V/V100 video cards for both AP and AP50.

I've redefined model architectures based on simple yaml files as well, which makes it easy to test new models. I'm aiming to release it in early May, but I'll send you and Alexey invitations tomorrow. It's called ultralytics/yolov4, ironically.

Yes, it makes sense to port YOLOv4 to your PyTorch implementation and use it for faster research and future developments.
Maybe we should implement a detector with conv-LSTM plus tracking with re-identification of arbitrary objects.

@WongKinYiu actually yes, maybe I should just post you guys a docker image for now, as I actually haven't made any commits, I've just been developing locally and testing in GCP with docker images myself. Tracking is a very interesting feature that would add a lot of value, I've had several people inquire about this, but I would not underestimate the difficulty of implementing it well. I used to do kalman filter design in the past, as well as implement KLT trackers. The KLT tracker naturally uses a type of 2d correlation between a recent template and a small future search area. A feature vector from yolov3 would be much more information rich than that, and would not suffer the same drift problems over long time spans. There is a very big opportunity in the space.

@AlexeyAB I'm convinced now that focal loss only applies to detectors that combine the classification and objectness losses into one, like SSD and EfficientDet. These have a huge imbalance between foreground and background classes, unlike YOLOv3, which has a medium level of imbalance. At first I tried to combine obj and cls into one also, as it's simpler to build, but I found it's also a bit slower to run, because every inference has to compare thresholds across all classes for all anchors, rather than comparing one threshold per anchor.
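For context, the focal loss being discussed down-weights easy background examples on top of BCE, roughly like this (a sketch with commonly used gamma/alpha values, not the exact loss any of these repos implement):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples, which matters most when
    background anchors vastly outnumber foreground ones (SSD/EfficientDet-style heads)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()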

I saw the acknowledgements section in the paper, thanks! I think I'll explore P6 a bit later on, but for now I'm simply trying to get my new repo out. It's designed to be easier to use and harder to mess up.

@AlexeyAB wow I didn't know about the acquisition. Was Ali a professor or advisor of Redmon's when Redmon was at university working on YOLO?

It's funny because Apple and Google are the yin and the yang of AI. Apple is, naturally as a hardware company, intensely focused on AI at the edge. I'm super excited for the 5nm A14 chip in the 2020 iPhones coming out later in the year, especially to see what TOPS they push the neural engine to. Google, meanwhile, is focused on the opposite, drawing everyone's dollars and euros to the cloud, where they can sell their GCP services and TPU hours.

As for XNOR I don't actually have any experience with them, but I've seen very impressive quantization with coreml, where the models I export from pytorch to coreml (through onnx) can be quantized to FP8 without any noticeable loss in precision. If XNOR ultimately wants to push that to single-bit quantization I'm not so sure, there would obviously have to be precision tradeoffs. I suppose the real question is whether you could export a much larger, much higher performing model, i.e. 500M parameters, into a tiny xnor model that performs equally well to today's FP32 models at 60M params like YOLOv3.

@glenn-jocher @WongKinYiu

There is a little regret that I did not go to work for them (with stock options), although there was such an opportunity )


I implemented XNOR inference for Yolo about 1.5 years ago, but I transfer data between layers in float, so the speed advantage is lost.
And accuracy is bad, since we should use another approach for training.

nVidia GPU supports XNOR GEMM for CC >= 7.5 by using wmma::bmma_sync(c2_frag, a_frag, b_frag, c2_frag); // XOR-GEMM
https://github.com/AlexeyAB/darknet/blame/2fc7fbbc0ea001170b12d39b840b9f4d34905dd4/src/im2col_kernels.cu#L1224-L1419

@AlexeyAB yes I saw that before. Quantization seems to have different effects depending on the platform. In CoreML model speed is completely unaffected going from FP32 to FP16 to FP8, the only difference is the app bundle decreases in size if the model is prepackaged with it. So unless I'm doing something wrong there they see zero speedup.

In PyTorch I haven't tried quantization yet, but apparently the blog claims significant speedup.
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/

@AlexeyAB maybe the pytorch guys are getting their speedup from blatantly assuming larger batch sizes? This could make sense: iDetection only runs one image at a time, since it's realtime on the iPhone, which perhaps explains the lack of speedup.

There are two networks I am interested in:

  1. xnor-net: it can be applied to in-memory computing
  2. addernet: CNN without multiplication

@glenn-jocher @WongKinYiu
Since batching always increases latency, it is not always applicable.

Quantization seems to have different effects depending on the platform. In CoreML model speed is completely unaffected going from FP32 to FP16 to FP8, the only difference is the app bundle decreases in size if the model is prepackaged with it. So unless I'm doing something wrong there they see zero speedup.

Could it be that quantization to FP8 occurred automatically in all three cases?

Did you try such BIN/XNOR models?


xnor-net: it can be applied to in-memory computing

What does it mean? In-memory for ASIC/ULA/FPGA? Or in-memory for CPU-cache?

@WongKinYiu ah no, it's not possible that FP8 was used for all 3, because it takes a long time to quantize from FP32 down to FP8, so I don't think the iPhone was doing this operation on the fly, or even on first run. Also the app sizes mirrored the model sizes well.

I haven't tried any of the models, since my main use case is realtime detection on iOS, and it seems we are just about there already without xnor. What is the difference between xnor and binary?

@AlexeyAB @glenn-jocher

I have trained xnor-net and abc-net, but the accuracy is not promising enough for my task.
The results of a mixed floating-point and binary network are OK, but it is hard to accelerate in a general framework.

In-memory computing (IMC) stores data in RAM rather than in databases hosted on disks. This eliminates the I/O and ACID transaction requirements of OLTP applications and exponentially speeds data access because RAM-stored data is available instantaneously, while data stored on disks is limited by network and disk speeds. IMC can cache massive amounts of data, enabling extremely fast response times, and store session data, which can help achieve optimum performance.

Only binary operations can easily be applied to in-memory computing,
for example shift, or, and, xnor.

And for normal computing, the computational unit usually uses NAND gates, so xnor-net or nand-net are a better choice than binary networks which use and/or/not.

@WongKinYiu ah, then the --cache command on ultralytics/yolov3 is the same as IMC? --cache stores all of the training data in RAM, which speeds up training significantly, especially for smaller datasets. This is a challenge for COCO though, where you need about 150GB of RAM to store all images at 640 resolution.

@glenn-jocher

--cache is for reducing data loader time, not for computing.
IMC does the network's calculations in memory.

So for digital, in-memory computing is a trend;
and for analog, maybe spiking networks.

@WongKinYiu ah ok. I just had an idea BTW to maybe save memory when caching. Currently --cache loads an image, resizes it to the training --img-size (i.e. 640), and leaves it in RAM for all the dataloader workers to access when they want.

For datasets with large images i.e. 1920x1080, caching at 640 will save RAM, but for datasets like COCO, resizing all of the images to 640 would actually use up RAM unnecessarily, as many images are smaller than 640. So perhaps I could move the resize operation out of the caching function. This would reduce RAM requirements, but then the image would need to be resized every epoch (4 times per epoch when using mosaic loader), so there would be a slight hit to speed, small compared to loading time from the hard disk though.
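A minimal sketch of that idea (a hypothetical loader, not the repo's actual --cache code), caching decoded images at native resolution and resizing on access:

import cv2  # assumes OpenCV is available

class NativeCache:
    """Cache decoded images at native resolution and resize on access,
    instead of resizing once at cache time."""
    def __init__(self, paths, img_size=640):
        self.img_size = img_size
        self.imgs = [cv2.imread(p) for p in paths]  # BGR uint8, native resolution

    def load(self, i):
        img = self.imgs[i]
        h, w = img.shape[:2]
        r = self.img_size / max(h, w)  # ratio to fit the long side to img_size
        if r != 1:  # small per-access resize cost, but images under img_size no longer waste RAM
            img = cv2.resize(img, (int(round(w * r)), int(round(h * r))))
        return img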

@AlexeyAB just saw your message about rejecting the job offer from xnor.ai. haha, that's a big shame. It might not be too late though, maybe Ali could vouch for you if you applied to them through Apple now. Did you do an in-person interview with them before?

@glenn-jocher Via Zoom.
A lot more significant things have happened in my life, so no, not a big shame )
But overall, the company is very successful.

@WongKinYiu I ported the yolov4.cfg from this repo into mine, and made sure it's trainable. Not all of the attributes of each class are used, especially in the yolo layers, but the correct feature fusion and mish activations are used, and training seems to proceed with no errors now. I may need to implement a more memory-friendly version of Mish though: https://github.com/ultralytics/yolov3/issues/1098

Do you have any pytorch mish implementations you'd recommend?
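For reference, the straightforward (not memory-optimized) PyTorch version is just this (a sketch):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # mish(x) = x * tanh(softplus(x)); simple, but autograd keeps the intermediate tensors,
    # which is where the extra training memory goes compared to e.g. LeakyReLU
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))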

@WongKinYiu you mentioned you had some free gpus. Could you use one to train yolov4.cfg on ultralytics? Then we can get an apples to apples comparison with the latest yolov3-spp results. I can also add it to the training plots as well. You can swap smaller --batch as necessary, the repo handles the loss scaling accordingly now, so you can change --batch without worrying about --accum, which is deprecated. This does multi-scale 320 to 640, and tests at 640.

python train.py --img 320 640 --batch 8 --weights '' --cfg yolov4.cfg --data coco2014.data

@glenn-jocher

OK, I will stop P6 experiments and train it first.

But I have trained the CSPDarknet53s model, which is the same as the one in https://github.com/WongKinYiu/CrossStagePartialNetworks/tree/pytorch, using the January and April repos. The early-April repo seems to fit YOLOv3-SPP very well.
Using the same model, I get +2% AP50 on YOLOv3-SPP and -3% AP50 on CSPDarknet53s-PANet-SPP when training with the January and April repos.

And when I use docker to install the late-April repo, it shows that some layers are not registered.
I will try to update to the latest repo today; if I meet any problems, I will give you feedback.

@WongKinYiu actually wait, you are right, there are unimplemented layers still. The basic cfg file works, and I've removed the error, but many of the yolo layer attributes are not used. OK just hold on for a bit then.

@glenn-jocher Thank you very much!

@AlexeyAB @WongKinYiu have you guys tried training yolov4.cfg without mish? Mish seems to use massive amounts of gpu ram in pytorch at least. I did a benchmark and found almost 3X the GPU RAM requirements for yolov4.cfg vs yolov3-spp.cfg.

I'm worried this may put a lot of people off and hinder wider adoption. I'm also worried about exportability through onnx to tflite, coreml etc with a custom activation function like this.

@glenn-jocher hello,

Yes, we did. It drops about 0.6% AP.
Maybe you can use this cfg: https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620276493

@WongKinYiu ah ok, good to know. Let me play around with it a bit, maybe I'll have some ideas.

@glenn-jocher Mish (mish in the backbone) improves AP for CSPDarknet-PANet but degrades AP for CSPResNeXt-PANet.

@AlexeyAB @WongKinYiu I think I finally discovered why none of these PANet models were working well on ultralytics. I started training a relu version of yolov4, with poor initial results, and I realized I have a hard-coded stride array in place in my repo: 32, 16, 8, the yolov3-spp output order. PANet is reversed, so I think I've been scaling my P3 and P5 anchors incorrectly in all of these latest models. I will reverse the order and train again.

My new repo includes strides in the model yaml files, so they are parameterized along with the model. I might try and take it a step further and solve for them automatically during a preliminary forward pass. Anyway, my mistake!

Should be fixed now, more or less, in https://github.com/ultralytics/yolov3/commit/9cc4951d4fb8df0cf1c9fed5e60c01c150e78a0c

            stride = [32, 16, 8]  # P5, P4, P3 strides
            if 'panet' in cfg or 'yolov4' in cfg:  # stride order reversed
                stride = list(reversed(stride))

@glenn-jocher

OK, thank you.
So maybe the reason my P6 model gets low loss but also low AP is that one stride is missing.

@WongKinYiu ahh, yes, this could affect your P6 also. I think when I ran my P6 I manually swapped the strides, but of course the master branch does not have this, so if you clone and try to run P6, you will also be affected. I'm sorry! I don't have a 100% foolproof solution, my commit above is a bit of a band-aid unfortunately.

My new repo has the stride info included as part of the model yaml. Here is the entire yolov3-spp.yaml for example. The modules are either custom, like Conv(), or pytorch modules, like nn.Conv2d().

I'm thinking a more automl-like solution would do 1 forward pass on a fixed image shape, i.e. 1x3x640x640, and automatically compute strides at each Detect() module. This would completely remove the human from the loop, causing less problems. But for now I have this.

# parameters
nc: 80  # number of classes
strides: [8, 16, 32]  # strides P3, P4, P5

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3
  - [30,61, 62,45, 59,119]  # P4
  - [116,90, 156,198, 373,326]  # P5

# darknet53 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [32, 3, 1]],  # 0
   [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2
   [-1, 1, Bottleneck, [64]],
   [-1, 1, Conv, [128, 3, 2]],  # 3-P2/4
   [-1, 2, Bottleneck, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 5-P3/8
   [-1, 8, Bottleneck, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 7-P4/16
   [-1, 8, Bottleneck, [512]],
   [-1, 1, Conv, [1024, 3, 2]], # 9-P5/32
   [-1, 4, Bottleneck, [1024]],  # 10
  ]

# yolov3-spp head
# na = len(anchors[0])
head:
  [[-1, 1, Bottleneck, [1024, False]],
   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, SPP, [512, [5, 9, 13]]],
   [-1, 1, Conv, [1024, 3, 1]],
   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, Conv, [1024, 3, 1]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]],  # 17 (P5-large)

   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 8], 1, Concat, [1]],  # cat backbone P4
   [-1, 1, Bottleneck, [512, False]],
   [-1, 1, Bottleneck, [512, False]],
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [512, 3, 1]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]],  # 25 (P4-medium)

   [-3, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P3
   [-1, 1, Bottleneck, [256, False]],
   [-1, 2, Bottleneck, [256, False]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]],  # 31 (P3-small)
   [[-1, 25, 17], 1, Detect, [nc, strides, anchors]],   # Detect(P3, P4, P5)
  ]
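Such a yaml can be read with standard tooling, e.g. (a sketch; the filename is hypothetical and the module names load as plain strings):

import yaml  # PyYAML

with open('yolov3-spp.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg['nc'], cfg['strides'])               # 80, [8, 16, 32]
print(len(cfg['backbone']), len(cfg['head']))  # number of backbone / head entries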

@WongKinYiu @AlexeyAB ok, I've updated my new repo with auto-strides now, computed during a single forward pass during model init, so there is no more human error possible. The only likely remaining error source is that the anchor order may be accidentally reversed. The class counts in the detection layers are automatically compared to the classes in the data when building the model also, so specifying an incorrect class count in the yaml will not break the training either (which happens all the time to users now).
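A minimal sketch of that auto-stride idea (not the actual repo code; the model is assumed to return one feature map per detection layer during this dummy pass):

import torch

def infer_strides(model, img_size=640):
    # one dummy forward pass; stride = input size / output grid size for each detection layer
    x = torch.zeros(1, 3, img_size, img_size)
    with torch.no_grad():
        feature_maps = model(x)  # assumed to return one NxCxHxW map per Detect() output
    return [img_size // fm.shape[-1] for fm in feature_maps]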

I will also add anchor-order error checking to the error check list that runs before training starts. This is really exciting, I'm slowly removing every possible route that users could use to 'break' the training, and to reduce the hyperparameters they can modify to an absolute minimum. I think this will really help a lot more people train custom datasets successfully. I'm going to add a kmeans step also that runs before training starts, so you can specify your own anchors if you want, or leave them empty for the training algorithm to create its own.

I should have the repo out soon, I'm still cleaning it up and working out the kinks.
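The kmeans anchor step mentioned above could look roughly like this (a sketch using scipy, not the repo's implementation; wh is assumed to be an Nx2 array of label widths and heights in pixels):

import numpy as np
from scipy.cluster.vq import kmeans

def kmeans_anchors(wh, n=9):
    """Cluster label (width, height) pairs into n anchors, sorted small to large by area."""
    wh = np.asarray(wh, dtype=np.float64)
    centroids, _ = kmeans(wh, n)  # plain k-means on the (w, h) points
    return centroids[np.argsort(centroids.prod(axis=1))]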

@glenn-jocher

Thank you. Can I start training https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620270472 using the latest repo?

@WongKinYiu yes I think you can start training, but be aware that many of the attributes in the yolo class won't have the same effect as they do here (though these also have no effect when training yolov3-spp.cfg to 43mAP)

After fixing the stride issue, I started training a yolov4-relu.cfg, using the same anchors as yolov3-spp, to isolate the training changes caused by the new architecture (i.e. all else being equal). Unfortunately the new training (orange) is coming in below the yolov3-spp metrics (blue), at least at this early stage. Training will take about two weeks, so I'll leave it running. The only difference between this and what you would run @WongKinYiu is Mish and the updated anchors.

[results.png training plot]

@glenn-jocher Hello,

Do you train https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620270472 using a V100 GPU?
I cannot train yolov4 even with batch=4 due to OOM.

@WongKinYiu yes I know! The gpu memory usage in pytorch for yolov4 is off the charts. See https://github.com/ultralytics/yolov3/issues/1098#issuecomment-620194657

The above plot is training yolov4-relu.cfg, which just swaps the mish activations for relu. I can't really train normal yolov4 either, it's just not practical. In addition to the 3X gpu memory usage, the train speed is about 2X slower, due primarily to the smaller batches.

@WongKinYiu but to answer your question, I'm training on a T4. It's slower but more economical, and has 15GB of memory. One epoch with yolov4.cfg takes about 150 minutes. One epoch of yolov4-relu.cfg takes about 70 min, only slightly slower than yolov3-spp.cfg, and about the same memory.

@glenn-jocher

Thanks, currently I changed the backbone to CSPDarknet53s and train with batch=4.

One additional question: can P3-P6 and P3-P7 models be trained normally using the latest repo,
or do I need to modify parts of the code?

@WongKinYiu @AlexeyAB I was reviewing the EfficientDet paper again and was surprised by a training value I'd missed before. Their weight decay is 4e-5, batch size 128.

In ultralytics/yolov3 (and darknet and ASFF also), weight decay is about 10x larger, and also applied more frequently I believe, since we use batch size 64 (weight decay is applied once per optimizer update, every 64 images). This seems like quite a large discrepancy. I verified the value in the official repo as well:
https://github.com/google/automl/blob/17637b428d46b002f4586b9541f6b7bbf2fab4bf/efficientdet/hparams_config.py#L213
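For comparison, this is simply the value passed to the optimizer (an illustrative snippet; the lr/momentum numbers and the stand-in model are placeholders):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the detector
# darknet/ultralytics-style decay (0.0005) vs. the 4e-5 EfficientDet value quoted above
opt_yolo = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
opt_effdet = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=4e-5)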

@glenn-jocher Hi,
Do you want to try using
decay=0.00004
instead of this? https://github.com/AlexeyAB/darknet/blob/6cbb75d10b43a95f11326a2475d64500b11fa64e/cfg/yolov4.cfg#L11


Did you check whether correct anchors/masks for P3-P6 give better accuracy?

@AlexeyAB no, I have not tried training P6 yet. I'm busy trying to validate my new repo's training against my existing yolov3 repo. I made too many changes too fast and just discovered I wasn't doing weight decay properly, which caused a surge in mAP early on, in epochs 0-100, but then a peak at 100 and overtraining afterward, while the original ultralytics/yolov3 with weight decay trained nice and steady to a higher peak at 270ish out of 300. But in the process of double checking the weight decay I noticed efficientdet uses a much lower value. I don't have free GPUs now, but when I get the new repo launched I will train a new model at a lower weight decay to compare.

P6 looks like it would actually be easier to add to a PANet-style model like yolov4 than to an FPN-style model like yolov3, in any case.

The current yolov4-relu training looks like this. This is a special yolov4 training just to compare architecture change effects in the absence of all the additional changes. So far I'm not seeing an improvement in the yolov4 architecture (orange) vs yolov3-spp (blue). We will have to wait a long time for this result, training is poking along at about 20 epochs per day on a T4.
[results.png training plot]

@AlexeyAB @glenn-jocher

I am training a P6 model on a single 2080 Ti,
~20 epochs per day.

yolov4-mish: ~10 epochs per day due to the batch size.

@glenn-jocher

So far I'm not seeing an improvement in the yolov4 architecture (orange) vs yolov3-spp (blue).

Maybe some advantage of the yolov4 architecture (CSP + PAN instead of FPN) can be achieved only by using a pre-trained weights file that is trained with BoF+BoS+Mish on ImageNet?
Or a large model should be trained longer.

I don't have free GPUs now, but when I get the new repo launched I will train a new model at a lower weight decay to compare.

Do you use decay=0.0005 now?
And where do you get free GPUs?

@AlexeyAB well, all of my gpus at the moment are from a GCP credit that Ultralytics received when we participated in an accelerator last year called Decelera, in Mayakoba Mexico. I'm not sure if it's $20k or possibly $100k, but to make the most efficient use of the credits, i.e. the most epochs/$ I'm training on T4's at about $400 each per month. Unfortunately they are quite slow, about 2-3X slower than a 2080 ti, but they do come with 15G RAM, which is nice.

Day to day tests I run on Colab, since I don't actually have any usable local GPUs. I do all my work on a macbook pro, which does not support cuda egpus due to some ridiculous fight between apple and nvidia.

I'm waiting for the 3080 ti's to come out later on this year and then I think I might finally buy a box for myself, probably a 4-gpu box from lambda labs for about $8k.

Yes the pretraining might be the missing link. I do all of my training from scratch actually, after I saw better results this way in a side by side comparison last year. Unfortunately I usually see earlier overfitting on coco when using pretrained weights.

@AlexeyAB yes I'm using the same weight decay as here, 5E-4

@glenn-jocher @AlexeyAB

I just finished training CSPDarknet53s-PANet-SPP with anchors optimized for 512x512 using ultralytics.

Using python3 test.py --cfg cd53s.cfg --weights last.pt --img 512 --iou-thr 0.7 to test.
I get 43.2% AP_0.50:0.95.

Using python3 test.py --cfg cd53s.cfg --weights last.pt --img 608 --iou-thr 0.7 to test.
I get 44.4% AP_0.50:0.95.

@WongKinYiu @glenn-jocher
Nice!

  • Do you get this result on valid or test-dev eval server?
  • Does --iou-thr 0.7 use IOU_thresh=0.7 or 0.5...0.95 for AP calculation?
  • What was fixed to get good results?
  • So you don't use --augment ?

@AlexeyAB

Do you get this result on valid or the test-dev eval server?

  • It is the 5k set; I would like to evaluate the test-dev set tomorrow.

Does --iou-thr 0.7 use IOU_thresh=0.7 or 0.5...0.95 for AP calculation?

  • 0.7 is the IoU threshold of NMS.

What was fixed to get good results?

  • I just fixed the stride order of the yolo layers.

So you don't use --augment?

  • No, I do not use --augment.

By the way, if I use best.pt, it gets 43.4%/44.5% with input resolution 512x512/608x608.

@WongKinYiu

So you get 44.4% AP50...95 on the 5k-valid dataset, which gives +1.3% AP compared to Yolov3-spp's 43.1% AP50...95. https://github.com/ultralytics/yolov3#map

What mini-batch size did you use?

Yolov4 416x416 gives 47.1% AP50...95 on 5k-valid eval server.

@WongKinYiu ah great! 44.5 is a great result, and yes 0.7 --iou is best for mAP@0.5:0.95.

Can you post your results.png here? Also can you link to this cfg?

Yes, in general you should probably always use best.pt after training is complete. last.pt can be used to --resume for example, but in most/all? cases best.pt should provide the best results.

@AlexeyAB yes this should be an apples to apples comparison, current mAP for yolov3-spp on coco2014 is 43.1 vs 44.5, so +1.4.

How do you get 47.1mAP?? That sounds extremely good, what's the catch?

@glenn-jocher This is for 5k-val set, not for test-dev.


YOLOv4 416x416:

  • 5k-val: 47.1% AP50...95
  • testdev: 41.2% AP50...95

@glenn-jocher @AlexeyAB

I finished testing on the 2017 test-dev set using best.pt.

cfg
best.pt
last.pt

512x512 gets similar performance compared with yolov4.

608x608 gets 44.1% AP!

@WongKinYiu thanks, I compared the cfg with https://github.com/ultralytics/yolov3/blob/master/cfg/yolov4-relu.cfg, and the only difference is that 4 convolutions and 2 routes have been commented out. I'm surprised that you got such a good result then, because when I trained yolov4-relu.cfg up to 100 epochs it was still trailing yolov3-spp (see https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-622516664), so I cancelled the training. Can you post your results.txt so I can plot it side by side with yolov3-spp and my yolov4-relu training?


@WongKinYiu also, it's interesting that your best.pt is 500 MB; this means the optimizer is packaged with the weights. I committed a change about a month ago to strip the optimizers from best.pt and last.pt once training is complete, so this means you have a slightly dated version of the repo. Do you know which git hash your repo has? It would be good for me to compare to make sure I haven't made any large changes since then, since your version seems to be working well.

@glenn-jocher

I use the 1 May repo.

I always get an error at this line https://github.com/ultralytics/yolov3/blob/master/test.py#L211 when using python train.py ....
So I cannot evaluate the performance during training.
But I can run python test.py ... without any error...

And the strip function is at https://github.com/ultralytics/yolov3/blob/master/train.py#L365, which comes after the evaluation step.
I think that is why the step of "strip the optimizers from best.pt and last.pt once training is complete" is not executed.

By the way, the P6 model gets only 40% AP.

@WongKinYiu oh, I understand. Training completes 300 epochs, and then tries to use pycocotools for final mAP, then crashes, so strip function is not run.

Great, that is very recent, there are essentially zero changes that should affect training and testing in that time compared to the current repo!

Could you post your results.txt file here for cd53s?

@WongKinYiu btw, this is a numpy-pycocotools bug. If you install numpy == 1.17 it resolves the issue.

@glenn-jocher

Could you post your results.txt file here for cd53s?

Due to the numpy-pycocotools bug, the strip function and rename steps are not executed.
My results.txt has already been overwritten by a new training run.

@WongKinYiu hmm ok. Maybe I can try and get it directly from the model, since the training info was never stripped from it it may still be there. I'll try and do that and plot it against https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-622516664

I was able to retrieve the training results from the model file. I plotted against yolov3-spp (43.1mAP) and yolov4-relu (training cancelled after 100 epochs). Results are overall very similar, though overtraining seems to be a bit less, and objectness in particular looks a bit different. What was your training command for this training?

[results.png training plot]

@WongKinYiu one specific question I have is whether you commented out the 4 conv and 2 shortcut layers in cd53s.cfg for any specific reason? (differences shown in https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-625643578) Are you going to try and train yolov4-relu.cfg now under the same settings to compare directly to cd53s.cfg?

@WongKinYiu @glenn-jocher

I finish test the results on 2017 test-dev set using best.pt.
608x608 gets 44.1% AP!

Do you think this improvement is due to:

  1. letterbox+mosaic+(jitter=0)
  2. good MSCOCO-2017 annotations without crowds
  3. and due to optimized cd53s.cfg model compared to cd53.cfg (without s)?

Also, this is strange for models trained at the same network resolution (512x512):

  • Darknet-yolov4.cfg (512x512): 49.7% AP (5k-val) - 43.0% AP (test-dev)
  • Ultralytics-cd53s.cfg (512x512): 43.4% AP (5k-val) - 43.0% AP (test-dev)
    5k-val is higher for yolov4, while test-dev is the same.

@glenn-jocher

What was your training command for this training?

I modified the training command in https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-609062427.
I used --img-size 448 768 512 for training, since as I remember darknet uses a random coefficient of 1.4 and ultralytics uses 1.5. And I changed 416 to 448 because the image size in the new repo seems to have to be a multiple of 64.

one specific question I have is whether you commented out the 4 conv and 2 shortcut layers in cd53s.cfg for any specific reason?

cd53s was designed very early; at that time I was just trying to double the batch size to speed up training on a 2080 Ti GPU (cd53 uses 6.x GB of GPU RAM...).
In my experiments, to achieve this goal, cd53s reduces some layers, and cd53m reduces some channels.

Are you going to try and train yolov4-relu.cfg now under the same settings to compare directly to cd53s.cfg?

I am training cd53s-mish now.
However, it will take a very long time to train because it needs a lot of GPU RAM.

I was able to retrieve the training results from the model file.

Could you tell me how to retrieve the information from the model file? I find that when I train more than one model, all the information gets mixed into the same results.txt file. Thanks.

@AlexeyAB

  1. letterbox+mosaic+(jitter=0)

I think letterbox is more reasonable for training an object detector; also, I think gaussian_yolo only works well with letterbox.
Previously we used plain resize so that small objects become a little bigger than with letterbox, making them easier to detect. Now, mosaic enhances the data augmentation of small objects, so it can work well even with letterbox.

  2. good MSCOCO-2017 annotations without crowds

As I remember, @glenn-jocher fixed iscrowd for both MSCOCO-2014 and MSCOCO-2017?
However, I use the same version of MSCOCO-2014 that I used to train yolov4.

  3. and due to optimized cd53s.cfg model compared to cd53.cfg (without s)?

No, cd53s was designed very early for ultralytics. Since the use of apex saves a large amount of GPU RAM, I tried to cut out some layers or channels so the batch size could be doubled when training on my GPU. In my experiments, cd53s reduces some layers, and cd53m reduces some channels.

Also, this is strange for models trained at the same network resolution (512x512):

  • Darknet-yolov4.cfg (512x512): 49.7% AP (5k-val) - 43.0% AP (test-dev)
  • Ultralytics-cd53s.cfg (512x512): 43.4% AP (5k-val) - 43.0% AP (test-dev)
    5k-val is higher for yolov4, while test-dev is the same.

Hmm... I do not know what is happening either.

@WongKinYiu Hi,

Previously we used plain resize so that small objects become a little bigger than with letterbox, making them easier to detect. Now, mosaic enhances the data augmentation of small objects, so it can work well even with letterbox.

How does mosaic help to detect small objects? Mosaic doesn't resize the image and doesn't shrink the image and objects, while Stitcher (= mosaic + scale down) does: https://arxiv.org/abs/2004.12432

@glenn-jocher Hi,
Did you try to convert yolov4.cfg/weights (Darknet > PyTorch > ONNX > CoreML > iOS) and run it on iOS?
Or did you try to run yolov4 without Mish on iOS?

@AlexeyAB ha, actually no, this is a prototype version of yolov3-spp that I deployed about a month ago to iOS, before your paper came out!

I figured it was different enough that I needed to call it something new, so I called it yolov4 since the name was not taken back then. Same with the tiny version. Conversion should work in theory, but I've not tried it.

https://apps.apple.com/us/app/idetection/id1452689527

@glenn-jocher

A13 iOS devices perform >30 FPS at 192 x 320 default inference size.

Just interesting: how many FPS would it be for yolov4.cfg/weights, especially with mish activation? Did you optimize mish?

@AlexeyAB no I have not tried. I seriously doubt it would be a good idea to try and use mish for mobile devices. The export pipeline probably does not support it out of the box, so someone would likely have to code up custom layers in onnx/coreml/tflite, not to mention the memory usage issues.

Mobile devices have much less memory than what we train on; even at batch-size 1, yolov3-spp in iDetection will saturate the RAM at inference sizes beyond 640, for example.

@glenn-jocher Mish/Swish requires additional memory only for training, not for inference.
There are also two mish implementations that are several times faster than the original mish: https://github.com/AlexeyAB/darknet/blob/bef28445e57cd560fa3d0a24af98a562d289135b/src/activation_kernels.cu#L226-L246

@AlexeyAB hmm interesting. I just haven't had time to look into it further, but the mish results seem to me to be of more use in a laboratory setting than in real-world products unfortunately.

I think for training, part of the problem is that the activation cannot be used in-place (which is true for most activations), and secondly the additional operations require extra memory compared to, say, swish.

Another known problem is that the export pipelines are a nightmare to use even for everyday operations, let alone custom operations. As an example, the 2x upsample operations used in yolov3 are a complete disaster when going from PyTorch to ONNX (ONNX converts one PyTorch upsample op into something like 5-10 separate ops with connections everywhere), and the results change depending on the versions of PyTorch and ONNX, the opset version you select, etc., before you even get to the CoreML export step. So I can't imagine the maintainability issues with completely custom operations, unfortunately.

@AlexeyAB @glenn-jocher

yolov4 needs less memory than yolov3-spp at inference time.
When using a Jetson Nano, yolov3-spp usually gets stuck at the second layer, while yolov4 can run smoothly.

I agree with @glenn-jocher: in my experiments I always avoid exponential functions when designing models that run on a mobile device or an embedded system. And whether it is TensorRT or another acceleration framework, they usually cannot support custom-defined operations well.

@glenn-jocher

I think for training, part of the problem is that the activation cannot be used in-place (which is true for most activations), and secondly the additional operations require extra memory compared to, say, swish.

Mish doesn't require more memory than swish for Training.
Neither mish nor swish can be used in-place for Training.
Neither mish nor swish uses additional memory for Detection.
In general, Mish is better than Swish.

Another known problem is really that the export pipelines are a nightmare to use even for everyday operations, let alone custom operations.

Yes, this is a big problem until it is implemented in the target software.

@WongKinYiu

I agree with @glenn-jocher: in my experiments I always avoid exponential functions when designing models that run on a mobile device or an embedded system.

It seems that yolov4.cfg (416x416, batch=1, with mish) can run on an embedded Jetson AGX Xavier at 30 FPS: https://github.com/AlexeyAB/darknet/issues/5354#issuecomment-621115435
And it may be faster with further optimizations.

On GPU the difference is only ~0.5% (RTX 2070, 512x512):

  • yolov4 (mish) - 50.0 FPS
  • yolov4 (leaky) - 50.3 FPS

darknet.exe detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark

darknet.exe detector demo cfg/coco.data cfg/cd53paspp-gamma.cfg cfg/cd53paspp-gamma_final.weights test.mp4 -ext_output -dont_show -benchmark

@glenn-jocher @AlexeyAB

608x608 5k-set:
YOLOv3 (ultralytics): 43.1% AP
CD53s-YOLOv3(leaky, ultralytics): 43.7% AP
CD53s-YOLOv4(leaky, ultralytics): 44.5% AP

@WongKinYiu great numbers! Should we update the readme?
https://github.com/ultralytics/yolov3#map

Though to be complete we should also run yolov3-spp.cfg with the same training command. I think the larger training sizes are positively affecting the mAP!

@glenn-jocher

OK.

CD53s-YOLOv3(leaky): cfg best.pt
CD53s-YOLOv4(leaky): cfg best.pt

And the code for running on the test-dev set is here: [eval.py] [dataset.py].
But my evaluation code is a little out of date, could you help update it?

@WongKinYiu yes sure, I'll take a look at the two python files tomorrow.

The funny thing is I was doing all the training with settings --img 320 640 640 (the default settings), but it looks like training at higher sizes like you are doing works better for best mAP. The 3 --img sizes by the way are multiscale min, multiscale max, test size.

So perhaps I should update the defaults, or update the command to reproduce here as well:
https://github.com/ultralytics/yolov3#reproduce-our-results

@WongKinYiu hey, good news. I tested your *.pt and *.cfg files in a fresh git clone, and everything works correctly for cd53s-yo.cfg. For cd53s.cfg, testing runs with no error, but mAP is very low.

This is what I see on a P100 colab instance. Speed, mAP and model size of cd53s-yo.cfg is very good, particularly good for the model size (49M params). What do you think went wrong with the other model?

python test.py --cfg yolov3-spp.cfg --img 608 --iou 0.7

Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Speed: 11.8/2.3/14.1 ms inference/NMS/total per 608x608 image at batch-size 16
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.431
python test.py  --cfg cd53s-yo.cfg --weights best_cd53s-yo.pt --img 608 --iou 0.7

Model Summary: 273 layers, 4.901e+07 parameters, 4.901e+07 gradients
Speed: 10.6/2.4/13.0 ms inference/NMS/total per 608x608 image at batch-size 16
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.437
python test.py --cfg cd53s.cfg --weights cd53s_best.pt --img 608 --iou 0.7 

Model Summary: 315 layers, 6.43421e+07 parameters, 6.43421e+07 gradients
Speed: 11.8/2.3/14.1 ms inference/NMS/total per 608x608 image at batch-size 16
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.131

Ok I discovered the problem, it was the strides again. I added the yolov4 cfg to the list of cfgs that use reverse stride order here, but I totally admit I need a better solution for this, I'm sorry.
https://github.com/ultralytics/yolov3/blob/b2fcfc573e5418c0b2ef0c0357bf51bc5cb027b6/models.py#L96-L97
if 'panet' in cfg or 'yolov4' in cfg or 'cd53' in cfg: # stride order reversed

Anyway, the updated yolov4-leaky results look great now!! Make sure you save the exact training command you used to train these two models, and then if you can also do a training of yolov3-spp.cfg with the same exact command. This is important to fully isolate the changes in mAP to changes in architecture. I can train one as well, though it will take me some time as I'm crawling along on T4 gpu's these days, let me know.

python test.py --cfg cd53s.cfg --weights cd53s_best.pt --img 608 --iou 0.7 

Model Summary: 315 layers, 6.43421e+07 parameters, 6.43421e+07 gradients
Speed: 11.8/2.2/14.0 ms inference/NMS/total per 608x608 image at batch-size 16
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.445

Here is the first batch of yolov4-leaky ground-truth and prediction jpgs that test.py now makes automatically. Looks good!

test_batch0_gt

test_batch0_pred

@glenn-jocher

if 'panet' in cfg or 'yolov4' in cfg or 'cd53' in cfg: # stride order reversed
Yes, currently I manually change the stride of each model.
In your older repo (maybe Jan 2020), you used the size of the feature map to calculate the stride.
At that time everything ran correctly.

OK, I will train yolov3-spp with the same settings.
I think it will need about 1~2 weeks of training.

I have trained cd53s-mish and cd53s-yo-mish, and in the first 140 epochs both of them get a better curve than the cd53s model.
But unfortunately, my server crashed yesterday, and I found that if I use --resume to resume training, the performance drops at the resume epoch.
So I have to re-train from scratch if I want to make a fair comparison.
cd53s-yo also suffered from the same situation; however, it crashed at around 280 epochs, so I just used the best.pt from before epoch 280 for generating results.

@WongKinYiu ah great! Yes unfortunately training longer increases risk that something may happen, and --resume never works perfectly either as you mentioned. On GCP the machines also periodically restart for 'host maintenance', making it more risky the more days it takes to train your model.

@WongKinYiu to understand these cfgs better, what you are essentially testing in these two runs are leaky versions of cd53s backbone with both yolov3 (FPN) head and yolov4 (PANet) head right?

In that case it appears the new backbone is lighter weight (14M fewer params) and a bit faster than the darknet53 backbone, right?

Then once we get your new yolov3-spp results, we will have a perfect ablation study of the impact of step1: replacing the backbone with cd53s, and then step2: replacing the head with the new yolov4 head. And then I can update the mAP section with these 3 apples-to-apples metrics. I really like this, this will give us a very clear picture of the architecture changes.

I should also update my table to include gpu/cpu latency values, in keeping with the efficientdet table format.

@glenn-jocher

and --resume never works perfectly either as you mentioned.

Is there any reason why this happens?

On GCP the machines also periodically restart for 'host maintenance', making it more risky the more days it takes to train your model.

Is this behavior specific only for preemptible instances?

@AlexeyAB the --resume issue has been ongoing for some time. I think it relates to the LR scheduler, but I'm not really sure, I gave up on it, now instead I just completely avoid resuming models these days. If training stops for whatever reason, I restart from scratch as @WongKinYiu did.

The GCP host maintenance restarts are on the regular VMs. The preemptable VMs additionally shut down randomly within 24 hours, which makes them useful only for limited testing.

It's hard to say exactly how often the restarts happen. It's pretty rare, but it does happen occasionally. I'd say (very rough guess) maybe a 0.5% chance your GCP VM will restart for maintenance on any given day.

@glenn-jocher So we can use cheap ~$0.8 preemptable GCP VMs only for hyperparameter search over the first 10 epochs, but not for training.
While for training we should use ~$2.0 regular GCP VMs, without --resume.

@AlexeyAB yes, that's a good summary. It may make more sense also to use V100's on preemptable instances, as they will get much more work done in those 0-24 hours, and their hourly rate is about 1/3 normal.

In practice though, the preemptable instances can be very frustrating, sometimes they run a full 24 hours before shutdown, other times they may only run a few minutes. You never know.

@AlexeyAB @WongKinYiu yo guys, I had an idea. Someone asked me if I could modify inference output into text files with the same labels format as the coco labels recently, and it got me thinking.

scale.ai makes all their money via hybrid human-machine labelling supposedly, at a $B valuation, and labelbox recently raised $10M, so labeling is a very serious business, unsexy as it sounds. I remember there was an effort here to manually review coco labels, which probably did not get very far due to the sheer volume of images.

The current problem statement is that if I want to label 2x the images, it will cost me 2x in time/labor. It does not scale. So the main idea I had is to make labelling an iterative process. Say I have 100k images. I would do these steps:

  1. Human label: Label a minimum viable quantity by hand, say 1k images.
  2. Train: Train a model on the 1k images.
  3. Relabel: Run augmented inference 3 times on the remaining 99k, and retain high confidence detections that have at least 2 overlapping boxes.

Iterate steps 2 and 3 several more times until the results settle. Now you've reduced your manual labor by 99% and shifted the labelling burden onto scalable resources (i.e. VMs, GPUs), so your labelling service suddenly scales quite well as a business. In terms of mAP, your model may also eventually train better, as it may pick up missing objects that the human labelers missed.

I thought an easy way to get started experimenting on this would be to run inference on the coco train set and manually review instances of newly detected objects. This step might even be included in training, i.e. every 5-10 epochs run a Relabel step on the train set.
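To make the Relabel step concrete, here is a minimal sketch (my own illustration, not code from either repo) of the consensus filter: keep only high-confidence detections that agree across at least two of the augmented inference runs. The detection format, helper names and thresholds are assumptions, not the actual implementation.

import numpy as np

def box_iou(a, b):
    # IoU between one box a and an array of boxes b, both in [x1, y1, x2, y2] format
    ix1 = np.maximum(a[0], b[:, 0]); iy1 = np.maximum(a[1], b[:, 1])
    ix2 = np.minimum(a[2], b[:, 2]); iy2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def consensus_labels(runs, conf_thres=0.8, iou_thres=0.6, min_votes=2):
    # runs: one array per augmented inference run, rows of [x1, y1, x2, y2, conf, cls]
    nonempty = [r for r in runs if len(r)]
    candidates = np.concatenate(nonempty) if nonempty else np.empty((0, 6))
    candidates = candidates[candidates[:, 4] >= conf_thres]
    keep = []
    for det in candidates:
        votes = sum(bool(((r[:, 5] == det[5]) & (box_iou(det[:4], r[:, :4]) >= iou_thres)).any())
                    for r in nonempty)
        if votes >= min_votes:  # the box reappears in at least min_votes runs (including its own)
            keep.append(det)
    return np.array(keep)  # survivors still need NMS before being written out as new labels

In practice you would then write these surviving boxes back out in darknet label format and merge them into the next training round.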

@glenn-jocher Hello,

You can take a look at something like https://arxiv.org/abs/1911.04252.

I have applied a label refinement technique on my own dataset; it increased mIoU by 3%~11%.
Darknet can generate pseudo labels using darknet.exe detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights -thresh 0.25 -dont_show -save_labels < data/new_train.txt.

@WongKinYiu yes I know about noisy student. You are correct, they are attacking the problem on the training side for better mAP, which is great from a research perspective. Is there a problem with it, or a reason we do not currently integrate it by default into existing training pipelines like darknet or ultralytics/yolov3?

But I am also interested in the business side. Imagine if we could start a labelling competitor to https://labelbox.com/, but we propose that the clients only need to label 10% of their data, or we could label the data ourselves, but our cost base is only 10% of theirs, as we are only human-labelling the first 10%. This might attract investor interest.

EDIT: Well look at this. Just a couple weeks ago they posted the same exact idea, haha.
https://labelbox.com/blog/using-labelboxs-model-assisted-labeling-to-speed-up-annotation-efficiency-advancing-ai-in-cutting-edge-research/

@WongKinYiu also what do you mean +3-11% mIoU? Do you mean mAP? This would be a huge mAP gain, why aren't we using it on coco? Is it because the coco labels are already much more professional than most custom datasets, and thus have less room for improvement?

@glenn-jocher

  • Pseudo-labeling and Reinforcement(hard-examples-mining)-labeling was added to Darknet more than a year ago

  • For a fair comparison we can't use additional non-MSCOCO pseudo-labeled datasets

  • We can try to relabel MSCOCO by using pseudo-labeling, but it seems that there is a problem only with person-labels in MSCOCO

  • Yes, that might be a good idea if you can train models on most of the available datasets, and then organize an automatic detection run on the user's dataset through several models at once, because the user may need only 1 class from MSCOCO + only 1 class from OpenImages + only 1 class from another dataset...

@glenn-jocher @AlexeyAB

Yes, for a fair comparison, we cannot use additional training data.
But if you just need a stronger detector, you can do semi/weakly-supervised learning with your model.

I find that Google Research uses a very similar strategy to what I do with darknet pseudo-labeling.
They released their method this month.
https://github.com/google-research/ssl_detection

@WongKinYiu

I find that Google Research uses a very similar strategy to what I do with darknet pseudo-labeling.
They released their method this month.
https://github.com/google-research/ssl_detection

Did you use pseudo-labeling to label a new dataset, and then continue training with the old+new datasets?

It seems that Google just uses pseudo-labeling + CutOut data augmentation.

@AlexeyAB

The steps of my training are:

  1. train a yolov4 model on my labeled dataset.
  2. do pseudo-labeling on my unlabeled dataset using trained yolov4.
  3. do semi-supervised learning on yolov4-tiny with labeled dataset and pseudo-labeled dataset.

It increased mAP by 3~7% and mIoU by 3~11% on my validation dataset.
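For step 3, since darknet's -save_labels writes a label .txt next to each pseudo-labeled image (as far as I understand), the semi-supervised pass can simply train on a concatenated image list. A minimal sketch, with placeholder file names:

# build a combined train list: original labeled images + pseudo-labeled images
with open('data/train.txt') as f_labeled, open('data/new_train.txt') as f_pseudo:
    combined = f_labeled.read().splitlines() + f_pseudo.read().splitlines()

with open('data/combined_train.txt', 'w') as f_out:
    f_out.write('\n'.join(line for line in combined if line.strip()) + '\n')

# then point train= in obj.data at data/combined_train.txt and run ./darknet detector train ...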

@WongKinYiu

do semi-supervised learning on yolov4-tiny with labeled dataset and pseudo-labeled dataset.

Is this just regular training with ./darknet detector train ... on the old+new labeled datasets, or is there something else?

@AlexeyAB

Yes, currently I just use ./darknet detector train ....
The development of un/semi/weakly-supervised learning methods is still in progress.

@glenn-jocher

Something like that is integrated with CVAT
https://www.youtube.com/watch?v=U3MYDhESHo4&feature=youtu.be

It seems to only support TensorFlow models and is oriented towards labelling video frames (interpolation); I haven't actually tried it myself yet. I want to do something along these lines soon but haven't quite decided what approach to take. I would prefer to script the interpolation myself so that I'm not restricted to TensorFlow and have more flexibility over the process. I do like CVAT though, because it has a REST API, and the fact that it runs in a web browser makes it easy to manage centrally and distribute batch jobs amongst many people.

@glenn-jocher So we can use cheap ~0.8$ preemptable-GCP-VM only for hyperparameters search for the first 10 epochs, but not for training.
While for training we should use ~2.0$ regular GCP-VM without --resume.

Have you ever tried direct GPU rig renting like vast.ai? https://vast.ai/console/create/
It seems much cheaper than GCP for on-demand / uninterruptible instances.

@LukeAI

Something like that is integrated with CVAT
https://www.youtube.com/watch?v=U3MYDhESHo4&feature=youtu.be

Is it something like the o button in Yolo_mark, which tracks objects across a sequence of video frames during labeling by using optical-flow tracking: https://github.com/AlexeyAB/Yolo_mark ?

Have you ever tried direct GPU rig renting like on vast.ai ? https://vast.ai/console/create/

Is it an aggregator of different clouds: Vast, GCP, AWS, Paperspace, ...?
Only $0.6 per hour for a Tesla V100, compared with $2.6 on GCP and $3.0 on AWS.

It's individual people with big rigs of GPUs (probably mostly people who used to use them for mining cryptocurrencies, which is now not profitable on GPUs). The site vets them and requires certain standards for hardware, uptime, internet speed etc. and charges a commission for connecting them to buyers.

Not sure exactly how the CVAT interpolation mode works; it requires a TensorFlow model for detection. I don't know if it also tries to leverage optical flow or some other tracking method.

Maybe the best automated video labeler would be some dedicated integrated video object-detector / tracker? I know some video object detectors exist that try to leverage recent frames to refine predictions for the current frame, but maybe better results could be had with a CNN that also used future frames? So you manually label every 10th frame, train a model and try to interpolate the rest?

@LukeAI @AlexeyAB

I also use CVAT.
CVAT's interpolation mode just uses linear interpolation between two key annotations.
The auto-labeling tools they provide are for detection and segmentation and use a TensorFlow model.
It is easy to replace them yourself if you are familiar with OpenCV.

ok thanks, good to know.

* Pseudo-labeling and Reinforcement(hard-examples-mining)-labeling was added to Darknet more than a year ago

  * https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132
  * https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1647-L1670

How can I use hard-examples-mining? So it returns a list of images with only low confidence detections?

@LukeAI

How can I use hard-examples-mining? So it returns a list of images with only low confidence detections?

Un-comment these lines: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132
It returns only images with fp+fn > 0 if you run ./darknet detector map ...

If someone needs it, I can add a flag to enable this feature without changing the source code.

Ah, so you are saying that refining the coco labels and then getting an mAP improvement is not a publishable result, even though you start from the same dataset and don't add any images to it? I'd be a little surprised if that were the case, as we currently augment the images themselves, which also modifies the original dataset.

In any case, if it provided repeatable improvements on custom datasets, it would add value to the product your company might provide in an automl-type solution. I suppose the best implementation would be to allow this functionality with a simple argument during training, i.e. --refine_labels.

@AlexeyAB the label refinement you have creates new labels, but are these fed back into the training set during training, or is that a manual step one would do?

@LukeAI about video labelling, I've never heard of cvat, but I've done a lot of object tracking, both visual and radar with kalman filters, and yes, if you can detect an object in one video frame, you can track it using a KLT tracker easily for a few more frames before drift becomes an issue.

So in this sense you could use yolov3/4 to detect an object in a video frame, then a KLT tracker to update your boxes in the remaining 29 frames, and repeat.
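As a rough illustration of that detect-then-track idea (my own sketch, not code from any repo mentioned here), a detected box can be propagated between detector key frames with sparse KLT optical flow using OpenCV:

import cv2
import numpy as np

def propagate_box(prev_gray, gray, box):
    # shift a [x1, y1, x2, y2] box from the previous grayscale frame to the current one using KLT flow
    x1, y1, x2, y2 = [int(v) for v in box]
    mask = np.zeros_like(prev_gray)
    mask[y1:y2, x1:x2] = 255  # only track features inside the box
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50, qualityLevel=0.01, minDistance=5, mask=mask)
    if p0 is None:
        return box  # nothing to track, keep the old box
    p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
    good = st.reshape(-1) == 1
    if good.sum() < 3:
        return box
    dx, dy = np.median((p1 - p0).reshape(-1, 2)[good], axis=0)  # median flow of the surviving points
    return [box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy]

Re-running the detector every N frames then resets the boxes, so drift only accumulates over a handful of frames.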

Google has a demo app that's supposed to do this called ODT, which is pretty bad in real life, and Swift also has track object requests (you supply the initial box):
https://developer.apple.com/documentation/vision/vntrackobjectrequest

I haven't used vast.ai, but yes the prices are much better!

@AlexeyAB @WongKinYiu also guys, are you sure that using refined labels for training is not fair? The noisy student paper uses 300M unlabelled images (!), so they are going out and finding new images no one else has, which is a step beyond what I was proposing, which uses the same images and no more.

I would tend to view it as fair, because you are starting with the same dataset and labels as everyone else, and not adding new images. I suppose the rules of the game depend on the game you are playing...

@glenn-jocher

For MSCOCO, there is an unlabeled split for developing semi-supervised learning approaches.
image
If you use those data, you should compare your method to those semi-supervised learning methods.

Usually, we split experiments into two different kinds: using extra training data or not.
image
By the way, as I remember, noisy student uses 2048 TPUs and trains for 3.5 days.

@WongKinYiu ah, so you are saying that refining the labels would count as 'extra training data', and then be classed together with methods which use unlabelled data. In this case then yes, you are competing with a much wider range of possible competitors.

Perhaps an alternate method then, leaving the labels alone, would be to update the objectness loss to account for the unequal probability of human-labelled FP (false positive) and FN (false negative) mistakes. I'd suspect a human labeller would cause many more FNs than FPs, so we may want to apply an a priori distribution to the obj loss to reflect this.

In practice this might be accomplished by reducing the FP losses, i.e. I think multiplying them by a gain of 0.9 for example would imply that human labelers are only labelling 90% of the objects correctly. Does this make sense? It might be a more universal way to account for human labeling errors without having to check the 'extra training data' box.

EDIT: I realized I've just described class label smoothing. The only differences are that I proposed it for objectness and I proposed a positive-negative imbalance. Unfortunately I tried smoothing objectness before with poor results, and came to the conclusion that it is best applied only to classification loss, if at all.

@glenn-jocher @WongKinYiu

are you sure that using refined labels for training is not fair,

No.

  1. We can't use additional images/labels:

For a fair comparison we can't use additional non-MSCOCO pseudo-labeled datasets


  2. But our algorithm/network can change images/labels by itself, without explicit a priori knowledge that only a person can calculate

We can try to relabel MSCOCO by using pseudo-labeling, but it seems that there is a problem only with person-labels in MSCOCO

If we try to refine the MSCOCO labels, then person-labels will be improved, but the labels of all the other 79 classes will be degraded.


Since our goal is to find or create the best model rather than to win some challenge by using tricks, we can use this model in a real product, e.g. auto-labeling as you suggested.


@AlexeyAB the label refinement you have creates new labels, but are these fed back into the training set during training, or is that a manual step one would do?

This is a manual step currently.

How can I use hard-examples-mining? So it returns a list of images with only low confidence detections?

Un-comment these lines: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1125-L1132
It returns only images with fp+fn > 0 if you run ./darknet detector map ...

So you would run the detector like darknet detector map obj.data yolov4.cfg yolov4_weights.weights using a _valid_ list of already labelled images and it would return the list of all images with false positives or false negatives?

And the purpose would be to understand blind spots in your detector so you know what areas of improvement are needed?

If there were a CLI flag for that, I would use it. I already do something similar with a Python script, so it wouldn't be a priority for me, but my approach feels a bit hacky.

So you would run the detector like darknet detector map obj.data yolov4.cfg yolov4_weights.weights using a valid list of already labelled images and it would return the list of all images with false positives or false negatives?

Yes, set valid=train.txt in the obj.data file, then run ./darknet detector map ..., then set train=reinforcement.txt and run training again.
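So, as I read the steps above, the obj.data for this workflow would look roughly like this for the two stages (paths are illustrative placeholders, not the actual files in the repo):

# stage 1: mine hard examples - run ./darknet detector map with valid pointing at the train list
classes = 80
train = data/train.txt
valid = data/train.txt
names = data/coco.names
backup = backup/
# stage 2: change train to data/reinforcement.txt in this file and run ./darknet detector train again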

@WongKinYiu I've been studying cd53s-yo.cfg. The new bottleneck strategy used there seems to be a good improvement over darknet53. Compared to yolov3-spp, cd53s-yolov3 has more layers and convolutions, but more importantly a 20-25% reduction in parameters and FLOPS, 10% reduction in inference time, and roughly similar mAP (or potentially slightly better).

If I'm understanding correctly the primary difference is that each bottleneck has a 1.0 expansion factor between the first and second conv in cd53s instead of a 0.5 factor, but also that there are leading and trailing convolutions (into the series of bottlenecks and exiting it) that reduce the channel count by half going in, and then double it back to normal going out (via concat with an additional residual). So taken as a unit, one of these cd53s series of bottlenecks can be a drop-in replacement for the more traditional darknet 53 series of bottlenecks.

This is all very good! This naturally leads me to wonder though if the expansion back to the original channel count is still necessary. For example, the first series of bottlenecks is a series of 2 bottlenecks with 128ch going in, then 64x64 convolutions for two bottlenecks, then additional convs to bring the channel count back to 128, before it gets passed to the same 3x3 stride 2 convolution to downsample it.

So my main question is, if we experimented with simply using 64ch throughout this series, and doing the same with the other bottlenecks, do you think the performance would be reduced significantly?

My second question is did you experiment with trying to reinvest the FLOPS/parameter 'savings' back into the network? For example perhaps you could increase the depth or width of the cd53s backbone by adding a few convolutions or increasing their channel count to bring it back up to 60M parameters, and thus capture a better performance increase compared to yolov3-spp?

I've written a pytorch module that should be able to reproduce the bottleneck series in cd53s-yo.cfg:

class BottleneckSeriesCSP(nn.Module):
    def __init__(self, c1, c2, n=2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(BottleneckSeriesCSP, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(c_, c_, 1, 1)
        self.cv4 = Conv(c2, c2, 1, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        y1 = self.cv2(x)
        y2 = self.cv3(self.m(self.cv1(x)))
        return self.cv4(torch.cat((y1, y2), dim=1))

This uses instances of the Bottleneck() class, which is the normal bottleneck used in darknet53 that I'm using for my new repo.

class Bottleneck(nn.Module):
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

The Conv() class, in turn, is simply a Conv2d() + bn + leaky sequence. To create the first bottleneck series in cd53s-yo.cfg, for example, you would create an instance of the module as BottleneckSeriesCSP(c1=128, c2=128, n=2). I've scanned my notes here so you can see which conv is which. It's basically 4 convolutions plus a series of normal bottlenecks. Does that sound right?

Scan
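For completeness, the Conv() helper referenced above is assumed to be just a Conv2d + BatchNorm2d + LeakyReLU block, roughly like the sketch below (the exact padding and activation arguments are my guess, not necessarily what the repo uses), followed by a quick shape check of the first cd53s-yo series using the BottleneckSeriesCSP class defined above:

import torch
import torch.nn as nn

class Conv(nn.Module):
    # standard convolution block: Conv2d + BatchNorm2d + LeakyReLU
    def __init__(self, c1, c2, k=1, s=1, g=1):  # ch_in, ch_out, kernel, stride, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, groups=g, bias=False)  # 'same' padding
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# quick shape check: 128 channels in, 128 out, spatial size unchanged
m = BottleneckSeriesCSP(c1=128, c2=128, n=2)
print(m(torch.zeros(1, 128, 76, 76)).shape)  # torch.Size([1, 128, 76, 76])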

@glenn-jocher

yes, you are right.

Darknet stage:

            x = down_layer(x) # can be included in darknet_layer
            x = darknet_layer(x) # with bottleneck
            x = tran_layer(x) # can be included in darknet_layer

CSPDarknet stage:

            x = down_layer(x)
            x1, x2 = x.chunk(2, dim=1)
            x2 = darknet_layer(x2) # without bottleneck
            x = torch.cat([x1,x2], 1)
            x = tran_layer(x)
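Written out as a standalone module (reusing the Bottleneck and Conv classes from the snippets above; the layer sizes are only an example, not taken from a specific cfg), the chunk-based CSPDarknet stage in the pseudocode would look roughly like this:

import torch
import torch.nn as nn

class CSPStage(nn.Module):
    # one CSPDarknet stage per the pseudocode above: downsample, split channels, bottlenecks on one branch, merge
    def __init__(self, c1, c2, n=2):  # ch_in, ch_out, number of bottlenecks
        super(CSPStage, self).__init__()
        c_ = c2 // 2  # channels per branch after the split
        self.down = Conv(c1, c2, 3, 2)  # down_layer: stride-2 3x3 conv
        self.blocks = nn.Sequential(*[Bottleneck(c_, c_, e=1.0) for _ in range(n)])  # darknet_layer
        self.tran = Conv(c2, c2, 1, 1)  # tran_layer: 1x1 transition conv

    def forward(self, x):
        x = self.down(x)
        x1, x2 = x.chunk(2, dim=1)  # split the channels into the two CSP branches
        x2 = self.blocks(x2)        # only one branch passes through the bottlenecks
        return self.tran(torch.cat([x1, x2], 1))

For example, CSPStage(64, 128, n=2) maps a 64-channel input to a 128-channel output at half the spatial resolution.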

My English is not good enough to understand long paragraphs in real time; I will give you feedback as soon as possible. https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-631226862 https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-631245174

@glenn-jocher

640x640 5k-set:
YOLOv3 (ultralytics): 43.1% AP
YOLOv3 (same setting as below): 43.6% AP
CD53s-YOLOv3(leaky, ultralytics): 43.7% AP
CD53s-YOLOv4(leaky, ultralytics): 44.5% AP

so the comparison would be as follows:

python test.py --cfg yolov3-spp.cfg --weights best_yolov3-spp.pt --img 608 --iou 0.7

Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Speed: 11.8/2.3/14.1 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
python test.py  --cfg cd53s-yo.cfg --weights best_cd53s-yo.pt --img 608 --iou 0.7

Model Summary: 273 layers, 4.901e+07 parameters, 4.901e+07 gradients
Speed: 10.6/2.4/13.0 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.437
python test.py --cfg cd53s.cfg --weights best_cd53s.pt --img 608 --iou 0.7 

Model Summary: 315 layers, 6.43421e+07 parameters, 6.43421e+07 gradients
Speed: 11.8/2.2/14.0 ms inference/NMS/total per 608x608 image at batch-size 16
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.445

CD53s-YOLOv3 gets comparable AP to D53-YOLOv3, but is lighter and faster.
CD53s-YOLOv4 gets comparable FPS to D53-YOLOv3, but has higher AP.

@WongKinYiu awesome! Ok I'm sold on the changes :). I will need to study the yolov4 head, it looks like it had quite a big impact on the results. Do you have a simple block diagram of what the new head does?

I'm training some new models with BottleneckSeriesCSP() modules. Do you use these new bottlenecks in the backbone only, or do you also use them in the head?

@WongKinYiu about these new results, do you want to submit a PR for https://github.com/ultralytics/yolov3/blob/master/README.md? Else I could add these new results myself if you send me your training command. I will also upload the corresponding *.cfg files so they are all available.

@glenn-jocher Hello,

I use
python train.py --img 448 768 512 --weights '' --cfg xxx.cfg --data coco2014.data --name xxx
for training.

Here are all of the cfg/weights files; you could add them to your repo.

Currently I only use BottleneckSeriesCSP in the backbone; I will design a new head soon.

I am on a business trip, will give you feedback about https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-631226862 https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-631245174 https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-632401295 soon.

@glenn-jocher

I find there is a bug when testing with 608x608.
https://github.com/ultralytics/yolov3/blob/master/utils/datasets.py#L321
It forces 640x640 for testing, so the 608x608 results above are actually 640x640.

@WongKinYiu ah yes, this is an interesting point. test.py with rectangular inference will round to the nearest 64 size now, so you are correct that if you pass --img 608 it will actually run inference at 640. Do you get the same test results at --img 640 and --img 608?

EDIT: The purpose here was to prepare testing for a P6 model, which is now never used. A more robust approach would be to save the model strides as an attribute and then set the rounding to the max stride.

EDIT2: In the new repo I have code which handles this for train.py, but not in test.py:

    # Image sizes
    gs = int(max(model.stride))  # grid size (max stride)
    if any(x % gs != 0 for x in opt.img_size):
        print('WARNING: --img-size %g,%g must be multiple of %s max stride %g' % (*opt.img_size, opt.cfg, gs))
    imgsz, imgsz_test = [make_divisible(x, gs) for x in opt.img_size]  # image sizes (train, test)

EDIT3: I suppose a similar error check should be run on all 3 main files (train, test, detect.py)
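(Here make_divisible is assumed to just round the image size up to the nearest multiple of the grid size, something like:)

import math

def make_divisible(x, divisor):
    # round x up to the nearest multiple of divisor, e.g. make_divisible(736, 64) -> 768
    return math.ceil(x / divisor) * divisor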

hmm... test results at --img 640 and --img 608 are different, will check the code.

@glenn-jocher

5k-set:
YOLOv3 (ultralytics): 43.1% AP
YOLOv3 (same setting as below): 43.6% AP
CD53s-YOLOv3(leaky, ultralytics): 43.7% AP
CD53s-YOLOv3(mish, ultralytics): 44.3% AP
CD53s-YOLOv4(leaky, ultralytics): 44.5% AP
CD53s-YOLOv4(mish, ultralytics): 45.0% AP (~YOLOv4)

python test.py --cfg yolov3-spp.cfg --weights best_yolov3-spp.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
python test.py  --cfg cd53s-yo.cfg --weights best_cd53s-yo.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.437
python test.py  --cfg cd53s-yo-csptb.cfg --weights best_cd53s-yo-csptb.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.439
python test.py  --cfg cd53s-yo-mish.cfg --weights best_cd53s-yo-mish.pt --img 608 --iou 0.7
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.443
python test.py --cfg cd53s.cfg --weights best_cd53s.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.445
python test.py --cfg cd53s-mish.cfg --weights best_cd53s-mish.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.450
python test.py --cfg cd53s-cspt.cfg --weights best_cd53s-cspt.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.451
python test.py --cfg cd53s-csptb.cfg --weights best_cd53s-csptb.pt --img 608 --iou 0.7 
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.450

@glenn-jocher

Do you get a better result after replacing wh = torch.exp(p[:, 2:4]) * anchor_wh with y[..., 2:4] = (y[..., 2:4].sigmoid() * 2) ** 2 * self.anchor_grid[i]?

And I think your multi-scale training has a bug: images will be resized to 640x640 no matter what the input size is. (Or maybe it works with the new scale hyper-parameter?)

            if True:
                imgsz = random.randrange(640, 640 + gs) // gs * gs
                sf = imgsz / max(imgs.shape[2:])  # scale factor
                if sf != 1:
                    ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new shape (stretched to gs-multiple)
                    imgs = F.interpolate(imgs, size=ns, mode='bilinear', align_corners=False)

And test.py --img-size 736 actually uses 768 due to the bug in dataset.py.
