Darknet: ASFF - Learning Spatial Fusion for Single-Shot Object Detection - 63% [email protected] with 45.5 FPS

Created on 26 Nov 2019 · 140 comments · Source: AlexeyAB/darknet

Learning Spatial Fusion for Single-Shot Object Detection

@AlexeyAB it seems worth taking a look

enhancement

Most helpful comment

@Kyuuki93 @WongKinYiu I added a new version of the [shortcut] layer for BiFPN from EfficientDet: https://github.com/AlexeyAB/darknet/issues/4662

So you can try to make Detector with 1 or several BiFPN blocks.
And with 1 ASFF + several BiFPN blocks (yolov3-spp-asff-bifpn-db-it.cfg)

All 140 comments

ASFF significantly improves the box AP from 38.8% to 40.6% as shown in Table 3.

Also, the following are used:

  1. BoF (MixUp, ...) - +4.2 [email protected]:0.95, but +0 [email protected] and +5.6% AP@70: https://github.com/AlexeyAB/darknet/issues/3272

  2. MegDet: A Large Mini-Batch Object Detector (synchronized batch normalization technique) - mAP 52.5% on COCO 2017 Challenge, where they won 1st place in the Detection task: https://arxiv.org/abs/1711.07240v4 - issue: https://github.com/AlexeyAB/darknet/issues/4386

  3. DropBlock + Receptive Field Block gives +1.7% [email protected]:0.95

  4. So ASFF itself gives only +1.8% [email protected]:0.95, +1.5% [email protected], and +2.5% [email protected]

  5. cosine learning rate: https://github.com/AlexeyAB/darknet/pull/2651

This paper is a bit confusing, so I took a look at his code: it uses conv_bn_leakyReLU to produce the level_weights instead of this formula before the Softmax
[formula image: alpha = exp(lambda_alpha) / (exp(lambda_alpha) + exp(lambda_beta) + exp(lambda_gamma))]

In short, ASFF maps the inputs x0, x1, x2 of yolo0, yolo1, yolo2 to each other to enhance detection, but I still wonder: which layers' outputs correspond to x0, x1, x2?

@Kyuuki93

his code uses conv_bn_leakyReLU to produce the level_weights instead of this formula before the Softmax

Can you provide link to these lines of code?

@Kyuuki93

This paper is a bit confusing, so I took a look at his code: it uses conv_bn_leakyReLU to produce the level_weights instead of this formula before the Softmax
[formula image: alpha = exp(lambda_alpha) / (exp(lambda_alpha) + exp(lambda_beta) + exp(lambda_gamma))]

This formula seems to be softmax a = exp(x1) / (exp(x1) + exp(x2) + exp(x3)) https://en.wikipedia.org/wiki/Softmax_function

I added fixes to implement ASFF and BiFPN (from EfficientDet): https://github.com/AlexeyAB/darknet/issues/3772#issuecomment-559592123


In short, ASFF maps the inputs x0, x1, x2 of yolo0, yolo1, yolo2 to each other to enhance detection, but I still wonder: which layers' outputs correspond to x0, x1, x2?

It seems layers: 17, 24, 32

https://github.com/ruinmessi/ASFF/blob/c74e08591b2756e5f773892628dd9a6d605f4b77/models/yolov3_asff.py#L142

https://github.com/ruinmessi/ASFF/blob/c74e08591b2756e5f773892628dd9a6d605f4b77/models/yolov3_asff.py#L129

waiting for improvements, good things happening here

This formula seems to be softmax a = exp(x1) / (exp(x1) + exp(x2) + exp(x3)) https://en.wikipedia.org/wiki/Softmax_function

Yeah, I got it: his fusion is done by a 1x1 conv, softmax, and a weighted sum.

I added fixes to implement ASFF and BiFPN (from EfficientDet): #3772 (comment)

I will try to implement ASFF, BiFPN module and run some tests

For up-sampling, we first apply a 1x1 convolution layer to compress the number of channels of the features to that in level _l_, and then upscale the resolutions respectively with interpolation.

@AlexeyAB How to implement this upscale in .cfg file?

@Kyuuki93 Use an [upsample] layer with stride=2 or stride=4.
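For intuition, here is a hedged PyTorch sketch of that up-sampling path (a 1x1 conv to compress channels, then nearest-neighbor upscaling, which is what darknet's [upsample] performs); the channel sizes are illustrative:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1x1 conv compresses channels to those of the target level (512 -> 256 here)
compress = nn.Conv2d(512, 256, kernel_size=1)

x = torch.randn(1, 512, 13, 13)                 # e.g. the 13x13 level
y = F.interpolate(compress(x), scale_factor=2,  # like [upsample] stride=2
                  mode="nearest")
print(y.shape)                                  # torch.Size([1, 256, 26, 26])
```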

   layer   filters  size/strd(dil)      input                output
   0 conv     32       3 x 3/ 1    416 x 416 x   3 ->  416 x 416 x  32 0.299 BF
   1 conv     64       3 x 3/ 2    416 x 416 x  32 ->  208 x 208 x  64 1.595 BF
   2 conv     32       1 x 1/ 1    208 x 208 x  64 ->  208 x 208 x  32 0.177 BF
   3 conv     64       3 x 3/ 1    208 x 208 x  32 ->  208 x 208 x  64 1.595 BF
   4 Shortcut Layer: 1
   5 conv    128       3 x 3/ 2    208 x 208 x  64 ->  104 x 104 x 128 1.595 BF
   6 conv     64       1 x 1/ 1    104 x 104 x 128 ->  104 x 104 x  64 0.177 BF
   7 conv    128       3 x 3/ 1    104 x 104 x  64 ->  104 x 104 x 128 1.595 BF
   8 Shortcut Layer: 5
   9 conv     64       1 x 1/ 1    104 x 104 x 128 ->  104 x 104 x  64 0.177 BF
  10 conv    128       3 x 3/ 1    104 x 104 x  64 ->  104 x 104 x 128 1.595 BF
  11 Shortcut Layer: 8
  12 conv    256       3 x 3/ 2    104 x 104 x 128 ->   52 x  52 x 256 1.595 BF
  13 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  14 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  15 Shortcut Layer: 12
  16 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  17 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  18 Shortcut Layer: 15
  19 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  20 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  21 Shortcut Layer: 18
  22 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  23 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  24 Shortcut Layer: 21
  25 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  26 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  27 Shortcut Layer: 24
  28 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  29 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  30 Shortcut Layer: 27
  31 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  32 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  33 Shortcut Layer: 30
  34 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
  35 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
  36 Shortcut Layer: 33
  37 conv    512       3 x 3/ 2     52 x  52 x 256 ->   26 x  26 x 512 1.595 BF
  38 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  39 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  40 Shortcut Layer: 37
  41 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  42 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  43 Shortcut Layer: 40
  44 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  45 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  46 Shortcut Layer: 43
  47 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  48 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  49 Shortcut Layer: 46
  50 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  51 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  52 Shortcut Layer: 49
  53 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  54 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  55 Shortcut Layer: 52
  56 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  57 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  58 Shortcut Layer: 55
  59 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  60 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  61 Shortcut Layer: 58
  62 conv   1024       3 x 3/ 2     26 x  26 x 512 ->   13 x  13 x1024 1.595 BF
  63 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
  64 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  65 Shortcut Layer: 62
  66 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
  67 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  68 Shortcut Layer: 65
  69 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
  70 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  71 Shortcut Layer: 68
  72 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
  73 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  74 Shortcut Layer: 71
  75 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
  76 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  77 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
  78 max                5x 5/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.002 BF
  79 route  77                                 ->   13 x  13 x 512 
  80 max                9x 9/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.007 BF
  81 route  77                                 ->   13 x  13 x 512 
  82 max               13x13/ 1     13 x  13 x 512 ->   13 x  13 x 512 0.015 BF
  83 route  82 80 78 77                        ->   13 x  13 x2048 
# END SPP #
  84 conv    512       1 x 1/ 1     13 x  13 x2048 ->   13 x  13 x 512 0.354 BF
  85 conv   1024       3 x 3/ 1     13 x  13 x 512 ->   13 x  13 x1024 1.595 BF
  86 conv    512       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x 512 0.177 BF
# A(/32 Feature Map) #
  87 conv    256       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x 256 0.044 BF
  88 upsample                 2x    13 x  13 x 256 ->   26 x  26 x 256
# A -> B # 
  89 route  86                                 ->   13 x  13 x 512 
  90 conv    128       1 x 1/ 1     13 x  13 x 512 ->   13 x  13 x 128 0.022 BF
  91 upsample                 4x    13 x  13 x 128 ->   52 x  52 x 128
# A -> C #
  92 route  86                                 ->   13 x  13 x512
  93 conv    256       1 x 1/ 1     13 x  13 x512 ->   13 x  13 x 256 0.044 BF
  94 upsample                 2x    13 x  13 x 256 ->   26 x  26 x 256
  95 route  94 61                              ->   26 x  26 x 768 
  96 conv    256       1 x 1/ 1     26 x  26 x 768 ->   26 x  26 x 256 0.266 BF
  97 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
  98 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
  99 conv    512       3 x 3/ 1     26 x  26 x 256 ->   26 x  26 x 512 1.595 BF
 100 conv    256       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x 256 0.177 BF
# B(/16 Feature Map) #
 101 conv    512       3 x 3/ 2     26 x  26 x 256 ->   13 x  13 x 512 0.399 BF
# B -> A #
 102 route  100                                    ->   26 x  26 x 256 
 103 conv    128       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x 128 0.044 BF
 104 upsample                 2x    26 x  26 x 128 ->   52 x  52 x 128
# B -> C #
 105 route  100                                    ->   26 x  26 x 256 
 106 conv    128       1 x 1/ 1     26 x  26 x 256 ->   26 x  26 x 128 0.044 BF
 107 upsample                 2x    26 x  26 x 128 ->   52 x  52 x 128
 108 route  107 36                             ->   52 x  52 x 384 
 109 conv    128       1 x 1/ 1     52 x  52 x 384 ->   52 x  52 x 128 0.266 BF
 110 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
 111 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
 112 conv    256       3 x 3/ 1     52 x  52 x 128 ->   52 x  52 x 256 1.595 BF
 113 conv    128       1 x 1/ 1     52 x  52 x 256 ->   52 x  52 x 128 0.177 BF
# C(/8 Feature Map) #
 114 max                2x 2/ 2     52 x  52 x 128 ->   26 x  26 x 128 0.000 BF
 115 conv    512       3 x 3/ 2     26 x  26 x 128 ->   13 x  13 x 512 0.199 BF
# C -> A #
 116 route  113                                    ->   52 x  52 x 128 
 117 conv    256       3 x 3/ 2     52 x  52 x 128 ->   26 x  26 x 256 0.399 BF
# C -> B #
 118 route  86 101 115                             ->   13 x  13 x1536 
 119 conv      3       1 x 1/ 1     13 x  13 x1536 ->   13 x  13 x   3 0.002 BF
 120 route  119                                0/3 ->   13 x  13 x   1 
 121 scale Layer: 86
darknet: ./src/scale_channels_layer.c:23: make_scale_channels_layer: Assertion `l.out_c == l.c' failed.
Aborted (core dumped)

@AlexeyAB I created an asff.cfg based on yolov3-spp.cfg, but there is an error: it seems layer 86 is 13x13x512 while layer 119 (e.g. alpha) is 13x13x1. In [scale_channels], should those layers' outputs be the same size?

@Kyuuki93 It seems I fixed it: https://github.com/AlexeyAB/darknet/commit/5ddf9c74a58ce61d2aa82b806b8d0912ab6cf8f3#diff-35a105a0ce468de87dbd554c901a45eeR23

[route]
layers=22,33,44 # 3-layers which are already resized to the same WxHxC

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels # ReLU is integrated to activation=normalize_channels

[route]
layers=-1
group_id=0
groups=3

[scale_channels]
from=22
scale_wh=1

[route]
layers=-3
group_id=1
groups=3

[scale_channels]
from=33
scale_wh=1

[route]
layers=-5
group_id=2
groups=3

[scale_channels]
from=44
scale_wh=1

[shortcut]
from=-3
activation=linear

[shortcut]
from=-6
activation=linear

@AlexeyAB
In your ASFF-like module, what exactly does activation = normalize_channels do?

If activation = normalize_channels uses ReLU to calculate gradients, I think it should be activation = linear plus a separate softmax over (x1, x2, x3) to match the formula alpha = exp(x1) / (exp(x1) + exp(x2) + exp(x3)). Or activation = softmax for SoftmaxBackward?

https://github.com/ruinmessi/ASFF/blob/f7814211b1fd1e6cde5e144503796f4676933667/models/network_blocks.py#L242

levels_weight = F.softmax(levels_weight, dim=1)
levels_weight.shape was torch.Size([1,3,13,13])

Is 'activation = normalize_channels' the same as this F.softmax?

If activation = normalize_channels actually executes this code (normalize_channels with a ReLU function, so negative values are removed),
https://github.com/AlexeyAB/darknet/blob/9bb3c53698963f2a495be2dd9877d6ff523fe2ad/src/activations.c#L151-L177

maybe this result has an explanation:

|Model| chart|cfg|
|---|---|---|
|spp,mse| yolov3-spp-chart|yolov3-spp.cfg.txt|
|spp,mse,asff|chart|yolov3-spp-asff.cfg.txt|

I think the normalization with the constraint channels_sum() = 1 is crucial, since it indicates which ASFF feature an object belongs to.

And this ASFF module is a little different from your example: instead of

[route]
layers = 22,33,44 # 3-layers which are already resized to the same WxHxC

...

use

[route]
layers = 22

[convolutional]
batch_normalize=1
size=1
stride=1
filters=8
activation=leaky

[route]
layers = 33

[convolutional]
batch_normalize=1
size=1
stride=1
filters=8
activation=leaky

[route]
layers = 44

[convolutional]
batch_normalize=1
size=1
stride=1
filters=8
activation=leaky

[route]
layers = -1,-3,-5

[convolutional]
stride=1
size=1
filters=3
activation= normalize_channels

...

@Kyuuki93

I think the normalization with the constraint channels_sum() = 1 is crucial, since it indicates which ASFF feature an object belongs to.

What do you mean?

And this ASFF module is a little different from your example: instead of

Why?


In your ASFF-like module, what exactly does activation = normalize_channels do?

If activation = normalize_channels uses ReLU to calculate gradients, I think it should be activation = linear plus a separate softmax over (x1, x2, x3) to match the formula alpha = exp(x1) / (exp(x1) + exp(x2) + exp(x3)). Or activation = softmax for SoftmaxBackward?

In normalize_channels there is implemented Fast Normalized Fusion, which should have the same accuracy but faster speed than SoftMax across channels; it is used in BiFPN for EfficientDet: https://github.com/AlexeyAB/darknet/issues/4346

Later I will add activation=normalize_channels_softmax

[formula image: Fast Normalized Fusion, w_i = relu(w_i) / (epsilon + sum_j relu(w_j))]
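For intuition, a small hedged sketch contrasting softmax normalization with the ReLU-based Fast Normalized Fusion (the epsilon is an assumed small constant to avoid division by zero):

```
import torch
import torch.nn.functional as F

x = torch.tensor([[1.5, -0.5, 0.25]])   # raw per-level weights

softmax_w = F.softmax(x, dim=1)          # all weights > 0, sum = 1

relu = F.relu(x)                         # Fast Normalized Fusion: negatives -> 0
fast_w = relu / (relu.sum(dim=1, keepdim=True) + 1e-4)

print(softmax_w)   # ~[[0.7033, 0.0952, 0.2015]]
print(fast_w)      # ~[[0.8571, 0.0000, 0.1428]]
```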

I think the normalization with the constraint channels_sum() = 1 is crucial, since it indicates which ASFF feature an object belongs to.

What do you mean?

Sorry, let me be clear:

alpha(i,j) + beta(i,j) + gamma(i,j) = 1,
 and alpha(i,j)> 0, beta(i,j)>0, gamma(i,j)>0

In normalize_channels, this may result from this code:

if (val > 0) val = val / sum; 
else val = 0;

many alpha, beta, gamma values are set to 0, so the ReLU gradients are 0 too; the gradients vanish at the very beginning, and in this way training doesn't work properly. E.g. after 25k iterations the best [email protected] is just 10.41%, and during training the Obj: values are very hard to increase.

And this ASFF module is a little different from your example: instead of

Why?

I checked the author's model: layers 22,33,44 are never concatenated, I just implemented his network structure. In his model the coefficients are calculated from layers 22,33,44 separately, and the channels change like

512 -> 8
512 -> 8  (cat to) 24 -> 3 
512 -> 8

instead of
512 -> 3
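For clarity, a hedged PyTorch sketch of this weight branch (the BN and leaky-ReLU of the author's conv_bn_leakyReLU are omitted for brevity; the class and variable names are illustrative, not the author's code):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    def __init__(self, channels=512, compress=8):
        super().__init__()
        # per-level 1x1 convs: 512 -> 8 each
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, compress, 1) for _ in range(3))
        # concatenated 24 -> 3 fusion weights
        self.weight_levels = nn.Conv2d(3 * compress, 3, 1)

    def forward(self, x0, x1, x2):  # all already resized to the same WxHxC
        lw = torch.cat([conv(x) for conv, x in
                        zip(self.weight_convs, (x0, x1, x2))], dim=1)
        w = F.softmax(self.weight_levels(lw), dim=1)  # alpha+beta+gamma = 1
        return w[:, 0:1] * x0 + w[:, 1:2] * x1 + w[:, 2:3] * x2

xs = [torch.randn(1, 512, 13, 13) for _ in range(3)]
print(ASFFFusion()(*xs).shape)  # torch.Size([1, 512, 13, 13])
```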

There is in the normalize_channels implemented Fast Normalized Fusion that should have the same Accuracy but faster Speed than SoftMax across channels, that is used in BiFPN for EfficientDet: #4346

I will try to find out why BiFPN can work with the ReLU-style normalize_channels but ASFF cannot. I have a thought, just let me check it out

Later I will add activation=normalize_channels_softmax

I will take another test then

@Kyuuki93

I checked the author's model: layers 22,33,44 are never concatenated, I just implemented his network structure.

You have done right. I have not yet verified the entire cfg file as a whole.

Here we are not talking about layers with indices exactly 22, 33, 44. This is just an example.
This means that some layers with indices XX,YY,ZZ are already resized to the same WxHx8. It is assumed here that these layers are already applied: conv_stride_2, maxpool_stride_2, upsample_stride_2 and 4. And then a conv-layer with filters=8 is applied.
And these 3 layers with size WxHx8 will be concatenated: https://github.com/ruinmessi/ASFF/blob/master/models/network_blocks.py#L240

That's how you did it.


In normalize_channels, this may result from this code:

if (val > 0) val = val / sum;
else val = 0;
many alpha, beta, gamma values are set to 0, so the ReLU gradients are 0 too; the gradients vanish at the very beginning, and in this way training doesn't work properly. E.g. after 25k iterations the best [email protected] is just 10.41%, and during training the Obj: values are very hard to increase.

Yes, for one image - some outputs( alpha or beta, gamma) will have zeros, and for another image - other outputs( alpha or beta, gamma) will have zeros. There will not be dead neurons in Yolo, since all other layers use leaky-ReLU rather than ReLU.

This is a common problem for ReLU, called dead neurons. https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks
This applies to all modern neural networks that use RELU: MobileNet v1, ResNet-101, ...
The Leaky-ReLU, Swish or Mish solves this problem.

There will be a dead-neuron problem only if at least 2 conv-layers with ReLU go in a row, one after another. The output of conv-1 will always be >=0, so both the input and output of conv-2 will always be >=0. In this case, since the input of conv-2 is always >=0, if weights[i] < 0 then the output of ReLU will always be 0 and the gradient will always be 0 - so there will be dead neurons, and this weights[i]<0 will never be changed.

But if the conv-1 layer has leaky-ReLU (as in Yolo) or Swish or Mish activation, then the input of conv-2 can be >0 or <0; then regardless of weights[i] (if weights[i] != 0) the gradient will not always be == 0, and **this weights[i]<0 will be changed sometime**.
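A minimal hedged demonstration of the dead-neuron case (a hypothetical 1x1 conv-2 with a negative weight, fed by non-negative ReLU outputs):

```
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 4, 4)      # conv-1 output after ReLU: always >= 0
w = torch.full((1, 1, 1, 1), -0.5, requires_grad=True)  # negative conv-2 weight

y = F.relu(F.conv2d(x, w))      # -0.5 * x <= 0 everywhere, so ReLU outputs 0
y.sum().backward()
print(w.grad)                   # all zeros: this weight can never recover
```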

@Kyuuki93

Also you can try to use

[convolutional]
stride=1
size=1
filters=3
activation=logistic

instead of

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels

Here we are not talking about layers with indices exactly 22, 33, 44. This is just an example.

Yes, I am aware that layer 22 there corresponds to layer 86 in darknet's yolov3-spp.cfg, and so on.

There will be a dead-neuron problem only if at least 2 conv-layers with ReLU go in a row, one after another. The output of conv-1 will always be >=0, so both the input and output of conv-2 will always be >=0. In this case, since the input of conv-2 is always >=0, if weights[i] < 0 then the output of ReLU will always be 0 and the gradient will always be 0 - so there will be dead neurons, and this weights[i]<0 will never be changed.

But if the conv-1 layer has leaky-ReLU (as in Yolo) or Swish or Mish activation, then the input of conv-2 can be >0 or <0; then regardless of weights[i] (if weights[i] != 0) the gradient will not always be == 0, and **this weights[i]<0 will be changed sometime**.

I see, so there is a little influence, but it should work.
Will try activation=logistic and activation=normalize_channels_softmax; results will be updated later

@AlexeyAB
I created a new asff.cfg: yolov3-spp-asff-giou-logistic.cfg.txt.
With normalize_channels_softmax the training loss goes to NaN in ~100 iterations;
with logistic I got this result, while yolov3-spp.cfg with mse loss achieved 89% [email protected]:
yolov3-spp-asff-giou-logistic-chart

If you have time, could you tell me what mistake I made there?

If you have time, could you tell me what mistake I made there?

@AlexeyAB
Sorry, that's my fault, the previous .cfg file had wrong layers connected.
This one should be right for the ASFF module: yolov3-spp-asff.cfg.txt,
viewed by Netron yolov3-spp-asff.png.zip

But unfortunately this net was untrainable. The official repo mentioned that the ASFF module needs a long warm-up to avoid NaN loss, but in darknet the NaN loss shows up as soon as lr > 0, no matter whether activation = logistic or norm_channels or norm_channels_softmax, so I am wondering which part goes wrong.

Following ASFF's original idea (every yolo-layer uses feature maps of all scale sizes), I created a simplified-ASFF module: it just adds the feature maps from layers 22,33,44 (by shortcut) instead of multiplying them with alpha, beta, gamma

and this is the simplified one: yolov3-spp-asff-simplified.cfg.txt
viewed by Netron yolov3-spp-asff-simplified.png.zip

yolov3-spp + Gaussian_yolo(iou_n,uc_n = 0.5) + iou_thresh=0.213
spp,giou,gs,it
yolov3-spp + Gaussian_yolo(iou_n,uc_n = 0.5) + iou_thresh=0.213 + asff-sim
chart

The complete training results will be updated several hours later, but it seems the simplified-ASFF module could boost AP, or at least increase training speed.

And about the ASFF module: if the cfg file is not wrong, maybe these layers, e.g. scale_channels and activation = norm_channels_*, do not work as I expect?

This one should be right for the ASFF module: yolov3-spp-asff.cfg.txt,

Did you try to use your new ASFF with default [yolo] without Gaussian and without GIoU and without iou_thresh and normalizers?

like

[yolo]
mask = 0,1,2
#anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
anchors =  57, 64,  87,113, 146,110, 116,181, 184,157, 175,230, 270,196, 236,282, 322,319
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

This one should be right for the ASFF module: yolov3-spp-asff.cfg.txt,

Did you try to use your new ASFF with default [yolo] without Gaussian and without GIoU and without iou_thresh and normalizers?

like

[yolo]
mask = 0,1,2
#anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
anchors =  57, 64,  87,113, 146,110, 116,181, 184,157, 175,230, 270,196, 236,282, 322,319
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

I will try. And asff-sim results with gs, giou, iou_thresh:

| Model | [email protected] | [email protected] |
|---|---|---|
|baseline | 91.89% | 63.53% |
|+asff-sim | 91.62% | 63.28% |

results with mse loss will be reported tomorrow

@Kyuuki93
So asff-simplified doesn't improve accuracy.

Try with default [yolo]+mse without normalizers and if it doesn't work then try with default anchors.

Did you try to use your new ASFF with default [yolo] without Gaussian and without GIoU and without iou_thresh and normalizers?

Yes, ASFF-SIM with the default [yolo] decreases [email protected] by 0.48%, and [email protected] similarly:

| Model | [email protected] | [email protected] |
|---|---|---|
|baseline | 89.52% | 51.72% |
|+asff-sim | 89.04% | 51.24% |

@Kyuuki93

Try norm_channels or norm_channels_softmax with default [yolo] layers.
Maybe only [Gaussian_yolo] produces NaN with ASFF.

I tried; it’s the same

@Kyuuki93

And about the ASFF module: if the cfg file is not wrong, maybe these layers, e.g. scale_channels and activation = norm_channels_*, do not work as I expect?

I checked the implementation of activation = norm_channels_* and didn't find bugs.

Also the cfg file https://github.com/AlexeyAB/darknet/files/3939064/yolov3-spp-asff.cfg.txt is almost correct, except for a very low learning_rate and burn_in=0; you should use burn_in=4000 and a higher learning_rate. Also use the default [yolo] without normalizers.


  1. Do you get Nan or low mAP with activation=norm_channels ?

  2. set [net] burn_in=4000 in cfg-file

  3. Change this line: https://github.com/AlexeyAB/darknet/blob/dbe34d78658746fcfc9548ebab759895ea05a70c/src/blas_kernels.cu#L1153
    to this
    atomicAdd(&out_state_delta[osd_index], in_w_h_c_delta[index] * in_from_output[index] / channel_size);

  4. Check that grad=0 there https://github.com/AlexeyAB/darknet/blob/dbe34d78658746fcfc9548ebab759895ea05a70c/src/activation_kernels.cu#L513

  5. use default [yolo] and iou_normalizer = 1.0 iou_loss = mse

  6. Recompile and train ASFF with default [yolo] and activation=norm_channels and activation=norm_channels_softmax and

learning_rate=0.001
burn_in=4000

Show output chart.png with Loss and mAP for both activation=norm_channels and activation=norm_channels_softmax

The very low LR and burn_in=0 were set to test whether NaN shows up when LR > 0. I have tried the usual LR and burn_in=2000; it’s the same, no matter yolo or gs_yolo, with any loss

I will try your suggestion tomorrow, it’s midnight here

For 4., grad = 0 already holds; I set the baseline as

learning_rate = 0.0005 # for 2 GPUs 
burn_in = 4000
...
activation = normalize_channels_softmax # in the end of ASFF
...
[yolo]
iou_loss = mse
iou_normalizer = 1.0
cls_normalizer = 1.0
ignore_thresh = 0.213
...

Also, the pre-trained weights used were yolov3-spp.conv.86 instead of yolov3-spp.conv.88

| Settings | Got NaN | Iters got NaN| Chart |
| --- | --- | --- | --- |
|baseline |y | 363 | - |
|activation -> normalize_channels| n |- |chart|

After adding /channel_size to darknet/src/blas_kernels.cu#L1153

| Settings | Got NaN | Iters got NaN| Chart |
| --- | --- | --- | --- |
|baseline | n | - | chart |
|activation -> normalize_channels| | |

@AlexeyAB It seems to work fine for now; full results will be updated later. It will be delayed a few days because of a business trip

@Kyuuki93 Fine. What is the baseline in your table?

@Kyuuki93 Fine. What is the baseline in your table?

yolov3-spp.cfg with mse loss, only add iou_thresh = 0.213

@Kyuuki93

This is strange:

yolov3-spp.cfg with mse loss, only add iou_thresh = 0.213

But why does the default yolov3-spp.cfg go to NaN, while there is no ASFF, no [scale_channels] layer and no activation=normalize_channels? https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-564437176

And why doesn't it go to NaN after fixing the [scale_channels] layer, while yolov3-spp.cfg doesn't have a [scale_channels] layer?

This is strange:

yolov3-spp.cfg with mse loss, only add iou_thresh = 0.213

But why does the default yolov3-spp.cfg go to NaN, while there is no ASFF, no [scale_channels] layer and no activation=normalize_channels? #4382 (comment)

And why doesn't it go to NaN after fixing the [scale_channels] layer, while yolov3-spp.cfg doesn't have a [scale_channels] layer?

Sorry, all the tests were with the ASFF module, so the baseline is yolov3-spp.cfg with mse loss, it=0.213, asff

Here is the baseline .cfg file:
yolov3-spp-asff.cfg.txt

Thanks for the explanation. So the baseline is yolov3-spp + iou_loss = mse + iou_thresh = 0.213 + ASFF with activation=normalize_channels_softmax

How about guided anchoring? In the ASFF paper, the yolo head consists of GA + RFB + Deform-conv

@Kyuuki93 Didn't look yet. Let's test these blocks first, to make sure they work and increase accuracy.

(A lot of features increase accuracy only in rare cases or if tricks and cheats are used.)

@Kyuuki93 Didn't look yet. Let's test these blocks first, to make sure they work and increase accuracy.

(A lot of features increase accuracy only in rare cases or if tricks and cheats are used.)

Of course, I will test dropblock these days or after I'm back at the office. By the way, training on my 4xGPU machine works well over 80k iterations, but on the 2xGPU machine it still gets killed after 10k iterations, so the updated chart will be divided

@Kyuuki93 What GPU, OS and OpenCV versions do you use for?

  • 4xGPUs
  • 2xGPUs

How to use RFB-block: https://github.com/AlexeyAB/darknet/issues/4507

@Kyuuki93 What GPU, OS and OpenCV versions do you use for?

  • 4xGPUs
  • 2xGPUs

4 2080 Ti GPUs, Ubuntu 18.04, OpenCV 3.4.7
2 1080 Ti GPUs, Ubuntu 18.04, OpenCV 3.4.7

| Model | Chart | [email protected] | [email protected] | Inference Time (416x416) |
| --- | ---| --- | --- | --- |
|spp,mse,it=0.213|spp,mse,it | 92.01% | 60.49%| 13.75ms|
|spp,mse,it=0.213,asff(softmax)| spp,mse,it,asff| 92.45%| 61.83%| 15.44ms|
|spp,mse,it=0.213,asff(relu)| | 91.83%| 59.60% | 15.34ms |
|spp,mse,it=0.213,asff(logistic )| | 91.18%| 60.79%| 15.40ms |

Will complete this table soon. So far the ASFF module seems to work well; its AP is already higher than spp,giou,gs,it, as can be seen in https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-561064425
@AlexeyAB

Will fill it in later

I added this fix: https://github.com/AlexeyAB/darknet/commit/d137d304c1410253894dbfb7abaadfe6f4f867e7

Compare which of the ASFFs is better: logistic, normalize_channels or normalize_channels_softmax

Now that we know it works well, you can also try to test it with [Gaussian_yolo]

I added this fix: d137d30

Compare which of the ASFFs is better: logistic, normalize_channels or normalize_channels_softmax

Now that we know it works well, you can also try to test it with [Gaussian_yolo]

I filled in the previous results table; the ASFF module gives +0.44% [email protected] and +1.34% [email protected]. I have added this to https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-561064425

@Kyuuki93 Nice!
Is it normalize_channels_softmax-ASFF?

What improvement in accuracy does the normalize_channels (avg_ReLU) ASFF give?

@Kyuuki93 Nice!
Is it normalize_channels_softmax-ASFF?

What improvement in accuracy does the normalize_channels (avg_ReLU) ASFF give?

Yes, it is normalize_channels_softmax-ASFF; the other normalize_channels variants weren't tested yet. I will add dropblock before that

@Kyuuki93

This is weird that they use mlist.append(DropBlock(block_size=1, keep_prob=1)), https://github.com/ruinmessi/ASFF/blob/master/models/yolov3_asff.py#L39-L56


So I recommend to use

  • either DropOut + ASFF ([dropout] probability=0.1)
  • or DropBlock + ASFF + RFB-block
[dropout]
dropblock=1
dropblock_size=0.1  # 10% of width and height
probability=0.1     # this is drop probability = (1 - keep_probability)

May be

  • dropblock_size=0.15 for 13x13
  • dropblock_size=0.1 for 26x26
  • dropblock_size=0.05 for 52x52

I think I should split dropblock_size= to 2 params:

  • dropblock_size_rel= in percent - for training Classifier (or for using in DarkNet53-backbone for Detector)
  • dropblock_size_abs= abs size - for training Detector

This is weird that they use mlist.append(DropBlock(block_size=1, keep_prob=1)), https://github.com/ruinmessi/ASFF/blob/master/models/yolov3_asff.py#L39-L56

So I recommend to use

  • either DropOut + ASFF ([dropout] probability=0.1)
  • or DropBlock + ASFF + RFB-block
[dropout]
dropblock=1
dropblock_size=0.1  # 10% of width and height
probability=0.1     # this is drop probability = (1 - keep_probability)

Ok then, I will check that, but I just found it not so convenient to add DropBlock while I'm away from the office. So I will test normalize_channels_* first; some results will come out tomorrow

And check here https://github.com/ruinmessi/ASFF/blob/538459e8a948c9cd70dbd8b66ee6017d20af77cc/main.py#L317-L337:
block_size=1, keep_prob=1 should be just for the baseline, which means the original yolo-head

@Kyuuki93 Thanks, I missed this code.

I fixed DropBlock: https://github.com/AlexeyAB/darknet/issues/4498
Just use (use dropblock_size_abs=7 if RFB is used, otherwise use dropblock_size_abs=2):

[dropout]
dropblock=1
dropblock_size_abs=7  # block size 7x7
probability=0.1       # this is drop probability = (1 - keep_probability)

It will work mostly the same as in ASFF implementation (gradually increasing the block size from 1x1 to 7x7 in the first half of the training time):

Also probability will increase from 0.0 to 0.1 in the first half of the training time - as in the original DropBlock paper.
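A minimal sketch of that schedule, assuming a linear ramp over the first half of training (an illustration, not the exact darknet code):

```
def dropblock_schedule(cur_iter, max_batches, max_size=7, max_prob=0.1):
    ramp = min(1.0, cur_iter / (max_batches / 2))  # 0 -> 1 over first half
    block_size = max(1, round(max_size * ramp))    # 1x1 gradually -> 7x7
    drop_prob = max_prob * ramp                    # 0.0 gradually -> 0.1
    return block_size, drop_prob

print(dropblock_schedule(1000, 25200))   # early: small block, low probability
print(dropblock_schedule(12600, 25200))  # at max_batches/2: (7, 0.1)
```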

@AlexeyAB I have filled in the Table

@Kyuuki93 Nice!

It seems that only ASFF-softmax works well.

Now you can try to add DropBlock https://github.com/AlexeyAB/darknet/issues/4498 + RFB-block https://github.com/AlexeyAB/darknet/issues/4507


I fixed the gradient of activation=normalize_channels in the same way as it is done for normalize_channels_softmax, so you can try the new version of activation=normalize_channels too: https://github.com/AlexeyAB/darknet/commit/7ae1ae5641b549ebaa5c816701c4b9ca73247a65

Here is another Table:

| Model | [email protected] | [email protected] |
| --- | --- | --- |
| spp,mse,it=0.213,asff(softmax) | 92.45%| 61.83%|
| spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74% |
| spp,mse,it=0.213,asff(softmax),gs | 92.08%| 62.19% |
| spp,giou,it=0.213,asff(softmax),gs | 91.13%| 60.82%|

@AlexeyAB It seems Gaussian_yolo is not suited for ASFF

I got an error after adding dropblock

480 x 480
...
realloc(): invalid next size
Aborted (core dumped)

@AlexeyAB

@Kyuuki93

It seems Gaussian_yolo is not suited for ASFF

Try Gaussian_yolo with MSE.

@Kyuuki93

I got an error after adding dropblock

Can you share your cfg-file?

Sorry, wrong click,

There is yolov3-spp-asff-db-it.cfg.txt

Try Gaussian_yolo with MSE.

Running it now

@Kyuuki93 I fixed it: https://github.com/AlexeyAB/darknet/commit/f831835125d3181b03aa28134869099c82ca846e#diff-aad5e97a835cccda531d59ffcdcee9f1R542

I got an error after adding dropblock

Resizing
480 x 480

Try Gaussian_yolo with MSE.

Updated in the previous table

@Kyuuki93 Nice! Does DropBlock work well now?

@Kyuuki93 Nice! Does DropBlock work well now?

For now,

| Model | [email protected] | [email protected] |
|---|---|---|
|spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74%|
|spp,giou,it=0.213,asff(softmax),dropblock|91.66% | 62.42%|

But this chart suggests this net needs longer training,
chart

so I changed

max_batches = 25200 -> 45200
steps = 10000,15000,20000 -> 20000,30000,40000

to take another run.

Also, I added rfb after dropblock; results will come in a few days

I've got another question here. I have 2 custom datasets:
1st: 30k images for training, 4.5k images for validation, one class; this is the one used to produce the previous results;
2nd: 100k images for training, 20k images for validation, also one class, which contains many small objects.

For example, using the same network, e.g. yolov3-spp.cfg, I can achieve

88.48% [email protected] on the 1st dataset and
91.50% [email protected] on the 2nd dataset

After this I merged those datasets into a two-class dataset (1st to 1st class, 2nd to 2nd class), also using the same network (of course with changed filters before the yolo layers), and then I got this result:

[email protected]:
71.30% for class 0
90.11% for class 1

So, from the results, this merge decreases AP for all classes, but I don't know what leads to this;
what's your suggestion about this?

@Kyuuki93

But this chart suggests this net needs longer training,
so I changed

max_batches = 25200 -> 45200
steps = 10000,15000,20000 -> 20000,30000,40000

If you want a smooth decrease in the learning rate, then just use SGDR (cosine lr schedule) from Bag of Freebies: https://github.com/AlexeyAB/darknet/issues/3272#issuecomment-497149618

Just use

learning_rate=0.0005
burn_in=4000
max_batches = 25200
policy=sgdr

instead of

learning_rate=0.0005
burn_in=4000
max_batches = 25200
policy=steps
steps=10000,15000,20000
scales=.1,.1,.1
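For reference, a hedged sketch of the plain cosine-annealing curve that an sgdr-style policy follows (warm restarts and burn_in are omitted; an illustration, not the darknet implementation):

```
import math

def cosine_lr(cur_iter, max_batches, base_lr=0.0005, min_lr=0.0):
    cos = 0.5 * (1 + math.cos(math.pi * cur_iter / max_batches))
    return min_lr + (base_lr - min_lr) * cos

for it in (0, 12600, 25200):
    print(it, cosine_lr(it, 25200))  # 0.0005 -> 0.00025 -> 0.0
```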

[email protected]:
71.30% for class 0
90.11% for class 1
So, from the results, this merge decreases AP for all classes, but I don't know what leads to this;
what's your suggestion about this?

So dataset-0 has class-0, and dataset-1 has class-1.
Are there unlabeled objects of class-1 in dataset-0?
Are there unlabeled objects of class-0 in dataset-1?
All objects must be mandatory labeled.

Also in general, more classes - worse accuracy.

So dataset-0 has class-0, and dataset-1 has class-1.

Yes

Are there unlabeled objects of class-1 in dataset-0?
Are there unlabeled objects of class-0 in dataset-1?

No, this I can ensure

Also in general, more classes - worse accuracy.

From results in many fields, yes, but what leads to that?

From results in many fields, yes, but what leads to that?

Any model has a limited capacity, and the more classes, the fewer features specific for each class can fit in the model. Classes compete for model capacity.

Any model has a limited capacity, and the more classes, the fewer features specific for each class can fit in the model. Classes compete for model capacity.

class-0 has 20k instances and class-1 has more than 80k instances, so class-1 gets average accuracy because there is more data to grab model capacity, while class-0 gets worse

@Kyuuki93 May be yes.

I added fix: https://github.com/AlexeyAB/darknet/commit/e33ecb785ee288fca1fe50326f5c7b039a9f5a11

So now you can try to set in each [yolo] / [Gaussian_yolo] layer parameter:
counters_per_class=20000, 80000
And train.
It will use multipliers for delta_class during training

  • 4 x for class-0
  • 1 x for class-1
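A hedged sketch of how these multipliers could be derived from the counters (assuming each class is scaled by max_count / class_count, which matches the 4x / 1x example above; not the exact darknet code):

```
def class_multipliers(counters_per_class):
    m = max(counters_per_class)
    return [m / c for c in counters_per_class]

print(class_multipliers([20000, 80000]))  # [4.0, 1.0]
```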

@Kyuuki93

I found a bug and fixed it in counters_per_class= https://github.com/AlexeyAB/darknet/commit/e43a1c424d9a20b8425d8dd8f240867f2522df3f

So now you can try to set in each [yolo] / [Gaussian_yolo] layer parameter:
counters_per_class=20000, 80000

Do 20000 and 80000 mean the numbers of class instances?

It will use multipliers for delta_class during training

* `4` x for class-0

* `1` x for class-1

Are 4 and 1 ratios calculated from 20000 and 80000?
If yes, is setting counters_per_class=1, 4 the same as counters_per_class=20000, 80000?

Btw, dropblock with a fixed size = 7 did not get a higher AP, even with longer training; maybe gradually increasing the size as 1, 5, 7 will get a better result, I will try this later

@AlexeyAB does counters_per_class work outside of [Gaussian_yolo]? Is it a way to solve the class imbalance problem? Would you explain that commit and its use case?

@Kyuuki93

Do 20000 and 80000 mean the numbers of class instances?
Are 4 and 1 ratios calculated from 20000 and 80000?
If yes, is setting counters_per_class=1, 4 the same as counters_per_class=20000, 80000?

Yes.

Btw, dropblock with a fixed size = 7 did not get a higher AP, even with longer training,

DropBlock with size=7 should be used only with RFB-block as I described above.

  • With the RFB, the DropBlock will obscure most of the RFB receptive field, and therefore will force the rest of the RFB to learn - this will improve accuracy.
  • Without the RFB, the DropBlock completely obscures several final activations in which there are objects - this will interfere with learning.

maybe gradually increasing the size as 1, 5, 7 will get a better result, I will try this later

The current implementation of DropBlock will gradually increase the size from 1 to maxsize=7.
In the implementation from the paper, maxsize=7 is used for all 3 DropBlocks. But you can try different maxsizes 1,5,7 for the different DropBlocks.

@isgursoy

does counters_per_class work outside of [Gaussian_yolo]? Is it a way to solve the class imbalance problem? Would you explain that commit and its use case?

Yes. It is an experimental feature. Just set the number of objects in the training dataset for each class.

@AlexeyAB
So dropblock with size=7 should not be used without rfb.

Current implementation of DropBlock will gradually increase size from 1 to maxsize=7.

When will the size increase? Is it based on current_iters / max_batches?

And here is another table; some results were copied from the previous table for convenient comparison

| Model | [email protected] | [email protected] |
| --- | --- | --- |
|spp,mse,it=0.213,asff(softmax)|92.45% | 61.83%|
|spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74%|
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) | 91.66% | 62.42%|
|spp,giou,it=0.213,asff(softmax),rfb(bn=0)|91.57%|64.95% |
|spp,giou,it=0.213,asff(softmax),rfb(bn=1)|92.32%|60.05% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=0)| 91.65%|61.35% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=1)|92.12% |63.88% |
|spp,giou,it=0.213,asff(logistic),dropblock(size=7) ,rfb(bn=1)|91.78% |60.10% |
|spp,giou,it=0.213,asff(relulike),dropblock(size=7) ,rfb(bn=1)| 91.43%|61.11%|
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=1),sgdr|91.54% |62.67% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=3,5,7) ,rfb(bn=1)|90.93% | 61.20%|

(?) Here is the rfb cfg file; I think it's right, you can check it to be sure: yolov3-spp-asff-it-rfb.cfg.txt

Btw, YOLACT was updated these days to YOLACT++; this discussion may move to its reference issue

AugFPN: Improving Multi-scale Feature Learning for Object Detection

paper https://arxiv.org/pdf/1912.05384.pdf

[AugFPN architecture figure]

very similar to ASFF

@Kyuuki93

(?) Here is the rfb cfg file; I think it's right, you can check it to be sure: yolov3-spp-asff-it-rfb.cfg.txt

It seems this is correct.
Also you can try to train RFB-block with batch_normalize=1, it may have higher accuracy: https://github.com/ruinmessi/ASFF/issues/46


When will the size increase? Is it based on current_iters / max_batches?

Yes multiplier = current_iters / (max_batches / 2) as in paper: https://github.com/AlexeyAB/darknet/blob/3004ee851c49e28a32fd60f2ae4a1ddf95b8b391/src/dropout_layer_kernels.cu#L31
https://github.com/AlexeyAB/darknet/blob/3004ee851c49e28a32fd60f2ae4a1ddf95b8b391/src/dropout_layer_kernels.cu#L39-L40


Btw, YOLACT was updated these days to YOLACT++; this discussion may move to its reference issue

Thanks! I will look at it: https://github.com/AlexeyAB/darknet/issues/3048#issuecomment-567017091

Also you can try to train RFB-block with batch_normalize=1, it may have higher accuracy: ruinmessi/ASFF#46

I have seen your discussion with ASFF's author before; a comparison of rfb(bn=0) with rfb(bn=1) is on the schedule.

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

Another work that manipulates FPN connections; it seems researchers have recently paid more attention to the connection method of FPN

It is not clear what is better: ASFF, BiFPN, or these AugFPN, SpineNet...

AugFPN: Improving Multi-scale Feature Learning for Object Detection

paper https://arxiv.org/pdf/1912.05384.pdf

They compare their network with Yolov2 in 2019, and don't compare Speed/BFLOPS.
But maybe we should read it:

By replacing FPN with AugFPN in Faster R-CNN, our models achieve 2.3 and 1.6 points higher Average Precision (AP) when using ResNet50 and MobileNet-v2 as backbone respectively. Furthermore, AugFPN improves RetinaNet by 1.6 points AP and FCOS by 0.9 points AP when using ResNet50 as backbone.


SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

paper https://arxiv.org/pdf/1912.05027.pdf
Another work that manipulates FPN connections; it seems researchers have recently paid more attention to the connection method of FPN

They compare AP / Bflops, but don't compare AP / FPS. This is usually done when the network is slow. But maybe we should read the article.

SpineNet achieves state-of-the-art performance of a one-stage object detector on COCO with 60% less computation, and outperforms ResNet-FPN counterparts by 6% AP. The SpineNet architecture can transfer to classification tasks, achieving a 6% top-1 accuracy improvement on the challenging iNaturalist fine-grained dataset.

It is not clear what is better: ASFF, BiFPN, or these AugFPN, SpineNet

Yes, we need to read more implementation details, then it will be possible to compare them in this framework. In my understanding, their high AP at slow speed results from a heavy backbone; all their ideas are about how to connect the FPN, which only requires changing the network to implement, and that may be useful to yolo.

Will take a deep look then

@AlexeyAB updated the Table here https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-567010280
It seems rfb(bn=1) is better than rfb(bn=0), and dropblock (7,7,7) better than dropblock (3,5,7)

Also, maybe this one-class dataset is too easy and makes it hard to show the validity of these improvements; perhaps I should run all the experiments again on the two-class dataset mentioned before

@Kyuuki93 Well.
Current implementation of dropblock (7,7,7) in the Darknet will use dropblock from (1,1,1) initially to (7,7,7) at the max_batches/2 iterations - so this is much closer to the article.

It is very strange that dropblock drops [email protected]. Maybe with dropblock the model should be trained for more iterations.

Show the cfg-file and chart.png of spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1)
Do you use sgdr-lr-policy or step-lr-policy, and how many iterations did you train?

It is very strange that dropblock drops [email protected]. Maybe with dropblock the model should be trained for more iterations.

Doubling the training iterations did not give a higher AP.

Show the cfg-file and chart.png of spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1)

Sorry, I checked it but did not keep it; it just looked like https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-566509160

Do you use sgdr-lr-policy or step-lr-policy, and how many iterations did you train?

It is the step-lr-policy.

@AlexeyAB I filled in the previous table:

| Model | [email protected] | [email protected] |
|---|---|---|
| spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74% |
| spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1) | 92.12% | 63.88% |

maybe the improvement from the rfb block depends on guided anchoring

@Kyuuki93 Thanks!

  • Did you try to test spp,giou,it=0.213,asff(softmax),rfb(bn=1) (RFB with batch-norm, but without dropblock)?

  • In sgdr - did you use policy=sgdr or policy=sgdr sgdr_cycle=1000 ?

  • Did you try to test spp,giou,it=0.213,asff(softmax),rfb(bn=1) (RFB with batch-norm, but without dropblock)?

Yes, it is still training; this is the last experiment on the class-0 dataset, the following experiments will use the two-class dataset

  • In sgdr - did you use policy=sgdr or policy=sgdr sgdr_cycle=1000 ?

Only policy=sgdr. How to choose the number for sgdr_cycle = ? This is the chart of training using sgdr; is the part in the blue circle normal for sgdr, or does it just need longer training?
chart

Only policy=sgdr.

Thats right!

How to choose the number for sgdr_cycle = ?

sgdr_cycle = num_images_in_train_txt / batch - but you should choose max_batches so that training ends at the end of a cycle; otherwise don't use it.

this is the chart of training using sgdr; is the part in the blue circle normal for sgdr, or does it just need longer training?

It seems it should be trained longer.

@AlexeyAB finished training spp,giou,it=0.213,asff(softmax),rfb(bn=1); the chart is nearly flat after 8k iterations,
chart

@Kyuuki93

Yes, first [email protected] stops improving, and later [email protected] stops.

It is weird that rfb(bn=0) is better than rfb(bn=1) for [email protected].

In general I think spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1) is the best model. Try it on the multiclass dataset.

On the two-class dataset: all models are based on yolov3-spp.cfg with iou_thresh=0.213 and giou with iou_normalizer=0.5; all training runs use the same batch_size and step-lr-policy. C0/C1 mean class-0 and class-1, which include 20k and 80k instances respectively. This table may take two weeks; results will be updated gradually.

| Model | [email protected] (C0/C1) | [email protected] (C0/C1) |
|---|---|---|
|mse|
|giou|79.53%(69.24%/89.83%)|59.65%(42.96%/76.34%)|
|giou,cpc| 79.51% (69.07%/89.96%)| 59.52%(42.17%/76.87%)|
|giou,asff,rfb|
|giou,asff,dropblock,rfb|

cpc means counters_per_class
asff default with softmax
rfb default with bn=1
dropblock default with size=7

asff default with softmax

Also try asff without softmax, at least for 1 class, just to know whether it is implemented correctly and whether we should use it for the BiFPN implementation.

asff default with softmax

Also try asff without softmax, at least for 1 class, just to know whether it is implemented correctly and whether we should use it for the BiFPN implementation.

Ok, I will do it for 1 class which would be faster

@AlexeyAB The previous table is complete. I think ASFF normalizes channels once while BiFPN can normalize channels many times; maybe that's why BiFPN can use a simple norm function

@Kyuuki93 Yes, maybe many fusions compensate for the disadvantages of norm_channels with ReLU, since there are many BiFPN blocks with many norm_channels in each BiFPN-block in EfficientDet.

So the best models from: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-567010280

| Model | [email protected] | [email protected] |
| --- | --- | --- |
|spp,mse,it=0.213,asff(softmax)|92.45% | 61.83%|
|spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74%|
|spp,giou,it=0.213,asff(softmax),rfb(bn=0)|91.57%|64.95% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=1)|92.12% |63.88% |

@Kyuuki93
I would suggest using rfb(bn=1) instead of rfb(bn=0) in your new experiments with 2 classes: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-568184405

@Kyuuki93

Also, since we don't use Deformable-conv, you can try the RFB-block with a flexible receptive field from 1x1 to 11x11: https://github.com/AlexeyAB/darknet/issues/4507#issuecomment-568296011

You can test with one class too.

@AlexeyAB It seems counters_per_class does not work out the data imbalance issue. Compared with class weights in a classification problem: there the classification loss is produced by class only, but in a detection problem the loss is produced by class, location and even objectness, and the loss from loc and obj is not relevant to the class label, so the loss multipliers cannot work this out. Actually in my dataset the class of each box gets high accuracy; it seems the model just can't find class-0's objects, which lack data in the training dataset

Also, since we don't use Deformable-conv, you can try the RFB-block with a flexible receptive field from 1x1 to 11x11: #4507 (comment)

By changing activation = linear to activation = leaky?

@Kyuuki93

Also, since we don't use Deformable-conv, you can try the RFB-block with a flexible receptive field from 1x1 to 11x11: #4507 (comment)

By changing activation = linear to activation = leaky?

You can use activation = linear for conv-layers.

Generally by adding [maxpool] maxpool_depth=1:

[route]
layers = -1,-5,-9,-12

[maxpool]
maxpool_depth=1
out_channels=64
size=1
stride=1

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=linear

[shortcut]
from=-16
activation=leaky
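For intuition, a hedged PyTorch sketch of what [maxpool] maxpool_depth=1 with out_channels=64 computes (assumption: output channel k is the max over input channels k, k+64, k+128, ..., following darknet's maxpool-over-depth indexing):

```
import torch

def maxpool_depth(x, out_channels):
    b, c, h, w = x.shape
    assert c % out_channels == 0
    # input channel g*out_channels + k contributes to output channel k
    return x.view(b, c // out_channels, out_channels, h, w).max(dim=1).values

x = torch.randn(1, 256, 13, 13)      # e.g. the concatenated branches above
print(maxpool_depth(x, 64).shape)    # torch.Size([1, 64, 13, 13])
```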

@Kyuuki93 @AlexeyAB Hi,

When can we use the ASFF? please share the final cfg.
have you tried to compare ASFF and csresnext50-panet-spp?

@zpmmehrdad
yolov3-spp + ASFF: yolov3-spp-asff-it.cfg.txt
yolov3-spp + ASFF + Dropblock + RFB(bn=0): yolov3-spp-asff-db-it-rfb.cfg.txt

you can set bn=1 in RFB blocks to get better results.

have you tried to compare ASFF and csresnext50-panet-spp?

Not yet; you can compare them on your data. Also, this ASFF is based on darknet-53 while csresnext50-panet-spp is based on resnext, so maybe we should implement ASFF in resnext50 for a fair comparison

@Kyuuki93 Thanks for your reply,

I saw the table that you shared, and in it "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" has a good result at [email protected]. I'm going to use this for ~200 classes, some of which are almost the same, and [email protected] is important for me. Do you think "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" is a good option for me or not?

Thanks

I'm going to use this for ~200 classes, some of which are almost the same, and [email protected] is important for me. Do you think "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" is a good option for me or not?

Try to compare spp,giou,it=0.213,asff(softmax),rfb(bn=0) and spp,giou,it=0.213,asff(softmax),rfb(bn=1); this ASFF module hasn't had enough testing yet, and I'm not sure it can improve [email protected] on every dataset. I used it on a one-class dataset, so if you get results please share them with us

@Kyuuki93

Not yet; you can compare them on your data. Also, this ASFF is based on darknet-53 while csresnext50-panet-spp is based on resnext, so maybe we should implement ASFF in resnext50 for a fair comparison

Look at this comparison: https://github.com/AlexeyAB/darknet/issues/4406#issuecomment-567919052

  • For GPUs without Tensor Cores, ResNext50 was better than Darknet53, but for Volta/Turing (RTX) GPUs and newer it seems that Darknet53 is better.
    So maybe we should use the CSPDarkNet-53 backbone rather than CSPResNeXt-50 https://github.com/WongKinYiu/CrossStagePartialNetworks#big-models

  • Also may be 1 block of BiFPN (based on NORM_CHAN_SOFTMAX) can be better than ASFF

So maybe we should use the CSPDarkNet-53 backbone rather than CSPResNeXt-50 https://github.com/WongKinYiu/CrossStagePartialNetworks#big-models

It seems worth trying

  • Also may be 1 block of BiFPN (based on NORM_CHAN_SOFTMAX) can be better than ASFF

I have a question about BiFPN: with 3 yolo layers, should BiFPN just keep P3-P5 and ignore P6-P7?
[BiFPN architecture figure, P3-P7]

Btw, I have made a SpineNet-49 with 3 yolo layers: spinenet.cfg.txt. You can check it to be sure, or give it a test.
[SpineNet-49 architecture figure]

Training from scratch is a little bit slow..

@AlexeyAB Also, take a look at this: https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-568696075; it seems gaussian_yolo hurts recall heavily, and a low iou_thresh can significantly improve it.

And next I want to find out the relation between precision, recall and ignore_thresh, truth_thresh

@Kyuuki93

  • [Gaussian_yolo] introduces bbox_confidence_score = (0 - 1), so confidence_score = class_conf * bbox_conf will be lower than confidence_score = class_conf in [yolo] - it decreases the number of bboxes with thresh > conf_thresh - it increases Precision and decreases Recall for the same conf_threshold (see the small numeric illustration after this list)

  • iou_thresh=0.213 allows Yolo to use many not the most suitable anchors for one object - it increases the number of bboxes (but additional bboxes are not accurate) - it increases Recall and decreases Precision for the same conf_threshold
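A tiny numeric illustration of the first point (hypothetical values):

```
class_conf, bbox_conf, conf_thresh = 0.9, 0.8, 0.85

yolo_score = class_conf                  # [yolo]: 0.9 -> kept (0.9 > 0.85)
gaussian_score = class_conf * bbox_conf  # [Gaussian_yolo]: 0.72 -> dropped

print(yolo_score > conf_thresh, gaussian_score > conf_thresh)  # True False
```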

| Model | [email protected] | precision (th=0.85) | recall (th=0.85) | precision (th=0.7) | recall (th=0.7) |
| --- | --- | --- | --- | ---| ---|
|spp,mse |89.50% | 0.98 | 0.20| 0.97| 0.36|
|spp,giou |90.09% | 0.98 | 0.25| 0.97| 0.40|
|spp,ciou |89.88% | 0.99 | 0.22| 0.97| 0.38|
|spp,giou,gs | 91.39% |0.99| 0.05| 0.97 | 0.47|
|spp,giou,gs,it|91.87%|0.99 |0.16 |0.97|0.52|

I have a question about BiFPN: with 3 yolo layers, should BiFPN just keep P3-P5 and ignore P6-P7?
[BiFPN architecture figure, P3-P7]

  1. Yes. You can just get features from these 3 points (P3, P4, P5). And use NORM_CHAN_SOFTMAX

  2. Or you can get features from earlier points (figure below)

    • or from 4 points P2 - P5 instead of P3 - P7 (as it is done in PANnet https://github.com/AlexeyAB/darknet/issues/3175 )
    • or from 5 points P1 - P5 instead of P3 - P7.

And you can duplicate BiFPN-block many times (from 2 to 8 BiFPN blocks) - page 5, table 1: https://arxiv.org/pdf/1911.09070v1.pdf

[BiFPN blocks figure]

Btw, I have made a SpineNet-49 with 3 yolo layers: spinenet.cfg.txt. You can check it to be sure, or give it a test.

I did not go into the details of how the SpineNet should look. Several questions:

  • Can you show a link to the code (Pytorch/TF/...) from which you copied the SpineNet?

  • masks= of [yolo] layers should be fixed for P3, P2, P4 sizes

  • Why did you remove top 2 blocks (P5, P6)?



  • 2 shortcut layers are pointed to the same layer-5



  • Is it normal that some of your layers have 19 BFLOPS?


  • Can you show a link to the code (Pytorch/TF/...) from which you copied the SpineNet?

There is no public SpineNet in any framework yet; this cfg was based on the figures in the paper.

  • masks= of [yolo] layers should be fixed for P3, P2, P4 sizes

I did not notice that, I will change it then

  • Why did you remove top 2 blocks (P5, P6)?

The feature maps in P5, P6 were too small, I think

  • 2 shortcut layers are pointed to the same layer-5

layer-9 is the shortcut of the 2nd bottleneck block, with activation=leaky
layer-10 is the input of the 3rd bottleneck, with activation=linear

  • Is it normal that some of your layers have 19 BFLOPS?


I checked the model and compared the ratio here:
[BFLOPS ratio figure]
I think this 19 BFLOPS is right; it is the result of using a residual block instead of a bottleneck block, which is the diamond block in the previous figure

2 shortcut layers are pointed to the same layer-5

layer-9 is the shortcut of the 2nd bottleneck block, with activation=leaky
layer-10 is the input of the 3rd bottleneck, with activation=linear

What do you mean? Do you mean this is correct?

#               # 9 b2
[shortcut]
from=-4
activation=leaky
#               # 10 b3 3rd gray rectangle block
#               # from b1,b2
[shortcut]
from=-5
activation=leaky


What do you mean? Do you mean this is correct?

#               # 9 b2
[shortcut]
from=-4
activation=leaky
#               # 10 b3 3rd gray rectangle block
#               # from b1,b2
[shortcut]
from=-5
activation=leaky -> should be linear 

My mistake, but two shortcuts from layer -5 are correct

@AlexeyAB this spinenet, which has one [shortcut] layer with the wrong activation function and uses 3 yolo layers on P2, P2, P3, got 88.80% [email protected] and 52.26% [email protected] on the previous one-class dataset, training from scratch with these settings:

width,height = 384,384
random=0
iou_loss=giou
iou_thresh=0.213

but it is 1.5x slower than yolov3-spp; I will run more tests then

@Kyuuki93 Try to train both yolov3-spp and fixed-spinenet without pre-trained weights and with the same other settings.

It seems counters_per_class does not work out the data-imbalance issue. Compared with class weights in a classification problem, where the classification loss is produced by the class only, in a detection problem the loss is produced by class, location and even objectness, and the loss from location and objectness is not related to the class label, so a loss multiplier cannot work this out. Actually, in my dataset the class of each box gets high accuracy; it seems the model just can't find objects of class 0, which lacks data in the training dataset

I added 2 fixes, so now counters_per_class affects the objectness and bbox too.
https://github.com/AlexeyAB/darknet/commit/35a3870979e0d819208a3de7a24c39cc0539651d
https://github.com/AlexeyAB/darknet/commit/b8fe630119fea81200f6ca4641ce2514d893df04
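
Conceptually (a toy illustration with made-up counts, not necessarily darknet's exact formula), counters_per_class turns per-class label counts into loss multipliers:

```
# Toy illustration, hypothetical counts: inverse-frequency multipliers
# give the rare class a larger delta per sample. After the two commits
# above, such a multiplier is applied to the objectness and bbox deltas
# too, not only to the class delta.
counters_per_class = [500, 5000]                     # class-0 is rare
multipliers = [max(counters_per_class) / c for c in counters_per_class]
print(multipliers)                                   # [10.0, 1.0]
```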

For a comparison of spinenet (fixed, 5 yolo-layers) and yolov3-spp (3 yolo-layers), training from scratch with the same settings:

width = 384
height = 384
batch = 96
subdivisions = 16
learning_rate = 0.00025
burn_in = 1000
max_batches = 30200
policy = steps
steps = 15000, 20000, 25000
scales = .1,.1,.1
...
random = 0
iou_loss = giou
iou_normalizer = 0.5
iou_thresh = 0.213

| Network | [email protected] | AP.75 |precision(.7) | recall(.7)|Inference time |
| --- | --- | --- | --- | --- | --- |
| spinenet49-5l|90.46%|53.80%|0.93|0.71|32.17ms|
| yolov3-spp|89.98%|54.47%|0.96|0.53|11.77ms|

@AlexeyAB
image

Is there any op like nn.Parameter() in this repo for implementing this wi in BiFPN?

@Kyuuki93

Is there any op like nn.Parameter() in this repo for implementing this wi in BiFPN?

What do you mean?

If you want Unbounded fusion, then just use activation=linear instead of activation=NORM_CHAN_SOFTMAX

@AlexeyAB

For example, wi is a scalar:
P4_mid = Conv( ( w1*P4_in + w2*Resize(P5_in) ) / ( w1 + w2 ) ),
this wi should be trainable but not computed from any feature map

In ASFF, w was calculated from the feature map through a conv-layer
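
What is being asked for could look like this PyTorch sketch (hypothetical module, not code from the ASFF/EfficientDet repos; the ReLU and small epsilon follow the EfficientDet paper's fast normalized fusion):

```
import torch
import torch.nn as nn

# w1, w2 are nn.Parameter scalars: trained by backprop but not computed
# from any feature map (unlike ASFF's conv-based weights).
class ScalarFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))    # trainable scalars w1, w2

    def forward(self, p4_in, p5_up):            # p5_up: P5 already resized to P4 size
        w = torch.relu(self.w)                  # keep the weights non-negative
        return (w[0] * p4_in + w[1] * p5_up) / (w[0] + w[1] + 1e-4)

fuse = ScalarFusion()
p4, p5_up = torch.randn(2, 1, 256, 52, 52)
print(fuse(p4, p5_up).shape)                    # torch.Size([1, 256, 52, 52])
```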

@Kyuuki93

In ASFF, w was calculated from the feature map through a conv-layer

Do you mean that is not so in BiFPN? https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py

If you want w constant during inference, then you can do something like this:
```
[route]
layers = P4

[convolutional]
batch_normalize=1
filters=256
groups=256
size=1
stride=1
pad=1
activation=linear

[route]
layers = P5

[convolutional]
batch_normalize=1
filters=256
groups=256
size=1
stride=1
pad=1
activation=linear

[shortcut]
from = -3
```

For comparison of spinenet(fixed, 5 yolo-layers) and yolov3-spp(3 yolo-layers), training from scratch with same settings

Also try to compare with spinenet (fixed, 3 yolo-layers) + spp, where an SPP-block is added to the P5 or P6 block: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-568950286

https://github.com/AlexeyAB/darknet/blob/35a3870979e0d819208a3de7a24c39cc0539651d/cfg/yolov3-spp.cfg#L575-L597

image
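
For reference, a minimal PyTorch sketch of such an SPP-block (parallel stride-1 maxpools concatenated with their input, as in the yolov3-spp.cfg lines linked above):

```
import torch
import torch.nn as nn

# Minimal SPP-block sketch: parallel stride-1 maxpools (kernel 5/9/13
# with 'same' padding) concatenated with the input, so the channel count
# grows 4x; a following 1x1 conv usually compresses it back.
class SPP(nn.Module):
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels]
        )

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)
print(SPP()(x).shape)   # torch.Size([1, 2048, 13, 13])
```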

@AlexeyAB

Do you mean that is not so in BiFPN? https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py

def build_BiFPN() here is not like that, it is without w
https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py#L40-L93

def build_wBiFPN() here is BiFPN with w
https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py#L96-L149
w is defined here; actually, we need a layer like this one
https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/layers.py#L33-L60

Maybe adding weights to the [shortcut] layer is an option; also, [shortcut] could take more than 2 inputs - something like the snippet below (a sketch of the semantics follows it):

[shortcut]
from=P4, P5_up
weights_type = feature (or channel or pixel)
weights_normalization = relu (or softmax or linear)
activation = linear
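
A sketch of what such a weighted [shortcut] could compute (hypothetical semantics, not an implemented darknet layer): weights_type=feature would mean one scalar per input, weights_type=channel one weight per input per channel, and weights_normalization=relu the EfficientDet-style fast normalized fusion.

```
import torch
import torch.nn as nn

# Hypothetical weighted multi-input shortcut, not an implemented darknet layer.
class WeightedShortcut(nn.Module):
    def __init__(self, n_inputs, channels, per_channel=False, eps=1e-4):
        super().__init__()
        shape = (n_inputs, channels, 1, 1) if per_channel else (n_inputs, 1, 1, 1)
        self.w = nn.Parameter(torch.ones(shape))
        self.eps = eps

    def forward(self, inputs):                            # same-shape NCHW tensors
        w = torch.relu(self.w)
        w = w / (w.sum(dim=0, keepdim=True) + self.eps)   # normalize across inputs
        return sum(w[i] * x for i, x in enumerate(inputs))

p4, p5_up = torch.randn(2, 1, 256, 52, 52)
print(WeightedShortcut(2, 256, per_channel=True)([p4, p5_up]).shape)
```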

image

The feature map on P6 is only 4x4 - could it be too small to get useful features?

Normally, SPP sits in the middle and connects the Backbone and the FPN, like Backbone -> SPP -> FPN?

But in Spinenet49, it seems the whole network is an FPN

@AlexeyAB I moved the spinenet-related comments to their own issue

@Kyuuki93

Feature map on P6 only 4x4, could be too small to get useful feature?

Yes, then spp should be placed in P5 (especially if you use a small initial network resolution)


[shortcut]
from=P4, P5_up
weights_type = feature (or channel or pixel)
weights_normalization = relu (or softmax or linear)
activation = linear

Yes, or maybe feature alone is enough, without channel or pixel

Interestingly, would a fusion from BiFPN be more effective than such a fusion?

  • this is the same as w - a vector (per-channel) in BiFPN with ReLU
  • batch_normalize=1 - will do normalization to solve the training-instability issue
  • leaky in this block and in the conv-layers L1, L2, L3 ensures that the weights will be mostly positive too (a rough PyTorch equivalent is sketched after the cfg block below)
[route] 
layers= L1, L2, L3    # output: W x H x 3*C

[convolutional]
batch_normalize=1
filters=3*C
groups=3*C
size=1
stride=1
pad=1
activation=leaky

[local_avgpool]
avgpool_depth = 1  # isn't implemented yet
                   #avg across C instead of WxH - Same meaning as maxpool_depth=1 in [maxpool]
out_channels = C
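
A rough PyTorch equivalent of this cfg block (a sketch; L1, L2, L3 are assumed to have C channels each, and batch_normalize is omitted for brevity):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

# A 1x1 conv with groups=3*C is exactly one trainable weight per channel,
# and averaging over the 3 groups of C channels sums the weighted inputs.
C = 256
l1, l2, l3 = (torch.randn(1, C, 52, 52) for _ in range(3))

cat = torch.cat([l1, l2, l3], dim=1)                  # W x H x 3*C
dw = nn.Conv2d(3 * C, 3 * C, kernel_size=1, groups=3 * C, bias=False)
weighted = F.leaky_relu(dw(cat), 0.1)                 # activation=leaky

# [local_avgpool] with avgpool_depth=1: average across the 3 copies of C channels
fused = weighted.view(1, 3, C, 52, 52).mean(dim=1)    # W x H x C
print(fused.shape)                                    # torch.Size([1, 256, 52, 52])
```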

@Kyuuki93
It seems that the higher ignore_thresh=0.85 is better than ignore_thresh=0.7 for your dataset. https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-568696075
Also truth_thresh=1.0 is good.
So for your dataset it is better to use iou_thresh=1.0 (or not use it at all).

@AlexeyAB

It seems that the higher ignore_thresh=0.85 is better than ignore_thresh=0.7 for your dataset.

ignore_thresh = 0.85 got higher [email protected] but much lower recall than ignore_thresh = 0.7

Also truth_thresh=1.0 is good.

Actually,

  • truth_thresh only worked at 1.0
  • when both truth_thresh and ignore_thresh are set to 0.7, the network becomes untrainable
  • keeping ignore_thresh = 0.7, truth_thresh = 0.85 decreases performance

So for your dataset it is better to use iou_thresh=1.0 (or not use it at all).

What do you mean? For now, all training is done with iou_thresh = 0.213; do you mean setting iou_thresh=1.0 when changing truth_thresh or ignore_thresh?

Other one-stage methods work with dual thresholds, such as ignore_thresh = 0.3 and truth_thresh = 0.5, but yolo works with a single threshold, ignore_thresh = 0.7; this is also mentioned in yolov3's paper but not explained. I just wonder why.

@Kyuuki93

Happy New Year! :fireworks: :sparkler:

What do you mean? For now, all training is done with iou_thresh = 0.213; do you mean setting iou_thresh=1.0 when changing truth_thresh or ignore_thresh?

I mean it may be better to use for your dataset:

ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh=1.0

While for MS COCO it may be better to use:

ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh=0.213

Other one-stage methods work with dual thresholds, such as ignore_thresh = 0.3 and truth_thresh = 0.5, but yolo works with a single threshold, ignore_thresh = 0.7; this is also mentioned in yolov3's paper but not explained. I just wonder why.

What methods do you mean?

In the original Darknet there are several issues which may degrade accuracy when using low values of ignore_thresh or truth_thresh

Initially in the original Darknet there were several wrong places which I fixed:

  1. It used if (best_iou > l.ignore_thresh) {
    instead of if (best_match_iou > l.ignore_thresh) { https://github.com/AlexeyAB/darknet/blame/dcfeea30f195e0ca1210d580cac8b91b6beaf3f7/src/yolo_layer.c#L355
    Thus, it didn't decrease objectness even if there was an incorrect class_id.
    Now it decreases objectness if detection_class_id != truth_class_id - this improves accuracy if ignore_thresh < 1.0.

  2. When truth_thresh < 1.0, the probability that many objects will correspond to one anchor increases. But in the original Darknet, only the last truth-bbox (from the label txt-file) affected the anchor. I fixed it - now it averages the deltas of all truths which correspond to this one anchor, so truth_thresh < 1.0 and iou_thresh < 1.0 may have a better effect (a simplified sketch follows this list).

  3. Also, a possible bug with MSE isn't tested and isn't fixed: https://github.com/AlexeyAB/darknet/issues/4594#issuecomment-569927386
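
A simplified illustration of fix 2 (toy Python, not the actual C code):

```
# When several truths match one anchor, average their deltas instead of
# letting the last truth from the label file overwrite the previous ones.
def averaged_delta(truths, predicted):
    deltas = [t - predicted for t in truths]   # one delta per matching truth
    return sum(deltas) / len(deltas)           # averaged, not last-one-wins

# e.g. two ground-truth x-centers matched to the same anchor cell:
print(averaged_delta([0.4, 0.6], predicted=0.5))   # 0.0 instead of 0.1 (last-one-wins)
```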

@AlexeyAB Happy New Year!

Here are the old cpc and new cpc results; it seems that using a loss multiplier on all loss parts can balance the classes' AP slightly, but not improve it

| Model| [email protected](C0/C1) | [email protected](C0/C1)|
|---|---|---|
|giou|79.53%(69.24%/89.83%)|59.65%(42.96%/76.34%)|
|giou,cpc| 79.51% (69.07%/89.96%)| 59.52%(42.17%/76.87%)|
|giou,cpc(new)|79.44%(70.03%/88.84%)|59.61%(44.95%/74.27%)|

I mean it may be better to use for your dataset: iou_thresh=1.0
While for MS COCO it may be better to use: iou_thresh=0.213

Actually, in my dataset iou_thresh = 0.213 always gets better results. I think a lower iou_thresh allows several anchors to predict the same object, while the original darknet uses only the nearest anchor to predict an object, which limits yolo's ability. So setting a lower iou_thresh will always get better results; you just need to search for a suitable value for a certain dataset.

What methods do you mean?

Some methods use something like ignore_thresh = 0.5 & truth_thresh = 0.7, which means:
iou < 0.5 - negative sample
0.5 < iou < 0.7 - ignored
iou > 0.7 - positive sample

I'm not sure this is exactly how yolo's ignore_thresh and truth_thresh work; a sketch of the dual-threshold scheme follows.
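
```
# Generic dual-threshold assignment (a sketch of the convention described
# above, not necessarily yolo's exact logic):
def assign(iou, ignore_thresh=0.5, truth_thresh=0.7):
    if iou < ignore_thresh:
        return "negative"   # objectness pushed toward 0
    if iou > truth_thresh:
        return "positive"   # objectness pushed toward 1, bbox/class loss applied
    return "ignored"        # no loss contribution

for iou in (0.3, 0.6, 0.8):
    print(iou, assign(iou))   # negative, ignored, positive
```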

@Kyuuki93

seems use loss multiplier on all loss parts could balance classes AP slightly but not improve it

Yes.

Some methods use something like ignore_thresh = 0.5 & truth_thresh = 0.7, which means:
iou < 0.5 - negative sample
0.5 < iou < 0.7 - ignored
iou > 0.7 - positive sample

Yes.

truth_thresh is very similar (but not the same) to iou_thresh, so it is strange that you get better results with a higher truth_thresh and a lower iou_thresh.

For MS COCO iou_thresh=0.213 greatly increases accuracy.

@WongKinYiu @Kyuuki93
I am adding a new version of [shortcut]; right now I am re-making the [shortcut] layer for fast BiFPN: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-569197177

so be careful when using commits from Jan 7, 2020 - they may have bugs in the [shortcut] layer.

Before using them, try to train a small model with a [shortcut] layer

@AlexeyAB

Okay, thanks.

@AlexeyAB ok, thanks

@Kyuuki93 @WongKinYiu I added new version of [shortcut] layer for BiFPN from EfficientDet: https://github.com/AlexeyAB/darknet/issues/4662

So you can try to make Detector with 1 or several BiFPN blocks.
And with 1 ASFF + several BiFPN blocks (yolov3-spp-asff-bifpn-db-it.cfg)

@nyj-ocean

[convolutional]
stride=1
size=1
filters=4
activation=normalize_channels_softmax

[route]
layers=-1
group_id=0
groups=4

...


[route]
layers=-1
group_id=3
groups=4
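
A rough PyTorch equivalent of this cfg pattern (a sketch, assuming the 4 weight maps multiply 4 same-size branches which are then summed):

```
import torch
import torch.nn as nn

# The 1x1 conv produces 4 weight maps, normalize_channels_softmax makes
# them sum to 1 per pixel, and each [route] groups=4 / group_id=i picks
# out one weight map to scale branch i before the branches are summed.
n_branches, C, H, W = 4, 128, 52, 52
branches = [torch.randn(1, C, H, W) for _ in range(n_branches)]
x = torch.randn(1, C, H, W)                    # input to the weight conv

weight_conv = nn.Conv2d(C, n_branches, kernel_size=1)
w = torch.softmax(weight_conv(x), dim=1)       # per-pixel weights, sum to 1

fused = sum(w[:, i:i + 1] * b for i, b in enumerate(branches))
print(fused.shape)                             # torch.Size([1, 128, 52, 52])
```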

@nyj-ocean It is because the 4th branch has 4x (= 2x2) more outputs. So you should use 2x fewer filters in the conv-layers.

@AlexeyAB
I reduced the number of filters in some [convolutional] layers.
But the FPS of yolov3-4l+ASFF.cfg is still slower than yolov3-4l.cfg.
I am waiting to see whether the final mAP of yolov3-4l+ASFF.cfg increases or not compared with yolov3-4l.cfg.

By the way, I want to try ASFF + several BiFPN blocks. Where could I download the yolov3-spp-asff-bifpn-db-it.cfg from https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-572760285?
