Learning Spatial Fusion for Single-Shot Object Detection



@AlexeyAB it seems worth taking a look
ASFF significantly improves the box AP from 38.8% to 40.6% as shown in Table 3.

Also used there:
- BoF (MixUp, ...): +4.2 AP@0.5:0.95, but +0 AP@0.5 and +5.6% AP@70: https://github.com/AlexeyAB/darknet/issues/3272
- MegDet: A Large Mini-Batch Object Detector (synchronized batch normalization; mAP 52.5% on the COCO 2017 Challenge, where it won 1st place in the Detection task): https://arxiv.org/abs/1711.07240v4 - issue: https://github.com/AlexeyAB/darknet/issues/4386
- DropBlock + Receptive Field Block gives +1.7% AP@0.5:0.95
- cosine learning rate: https://github.com/AlexeyAB/darknet/pull/2651

So ASFF itself gives only +1.8% AP@0.5:0.95, +1.5% AP@0.5 and +2.5% AP@0.75.
This paper is a bit confusing, so I took a look at the code: it uses conv_bn_leakyReLU for the level_weights instead of this formula before the Softmax.

In short, ASFF maps the inputs x0, x1, x2 of yolo0, yolo1, yolo2 to each other to enhance detection, but I still wonder: which layers' outputs correspond to x0, x1, x2?
@Kyuuki93
it uses conv_bn_leakyReLU for the level_weights instead of this formula before the Softmax
Can you provide link to these lines of code?
https://github.com/ruinmessi/ASFF/blob/master/models/network_blocks.py
calc weights:

weights func:

add_conv func:

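Since the code screenshots above did not survive, here is a minimal sketch (PyTorch, illustrative; names and shapes are assumed, see the linked network_blocks.py for the real code) of what the calc-weights / weights-func / add_conv parts appear to do: compress each level with conv+BN+leakyReLU, concatenate, map to 3 weight channels with a 1x1 conv, softmax across channels, then take the weighted sum.

```python
# Minimal ASFF-fusion sketch (PyTorch, illustrative; names/shapes assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

def add_conv(in_ch, out_ch, k=1, s=1):
    # conv + batch norm + leaky ReLU, like the repo's add_conv helper
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1))

C, compress_c = 512, 8
weight_convs = [add_conv(C, compress_c) for _ in range(3)]  # per-level compression
levels_weight_conv = nn.Conv2d(compress_c * 3, 3, 1)        # 24 -> 3 weight maps

x0, x1, x2 = (torch.randn(1, C, 13, 13) for _ in range(3))  # already resized to same WxH
w = torch.cat([f(x) for f, x in zip(weight_convs, (x0, x1, x2))], dim=1)
levels_weight = F.softmax(levels_weight_conv(w), dim=1)     # alpha, beta, gamma
fused = (x0 * levels_weight[:, 0:1] + x1 * levels_weight[:, 1:2]
         + x2 * levels_weight[:, 2:3])                      # weighted sum
print(levels_weight.shape, fused.shape)  # [1, 3, 13, 13], [1, 512, 13, 13]
```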
@Kyuuki93
This paper is a bit confusing, so I took a look at the code: it uses
conv_bn_leakyReLU for the level_weights instead of this formula before the Softmax
This formula seems to be softmax: a = exp(x1) / (exp(x1) + exp(x2) + exp(x3)) - https://en.wikipedia.org/wiki/Softmax_function
I added fixes to implement ASFF and BiFPN (from EfficientDet): https://github.com/AlexeyAB/darknet/issues/3772#issuecomment-559592123
In short, ASFF maps the inputs x0, x1, x2 of yolo0, yolo1, yolo2 to each other to enhance detection, but I still wonder: which layers' outputs correspond to x0, x1, x2?
It seems layers: 17, 24, 32
waiting for improvements, good things happening here
This formula seems to be softmax:
a = exp(x1) / (exp(x1) + exp(x2) + exp(x3)) - https://en.wikipedia.org/wiki/Softmax_function
Yeah, I got it: the fusion is done by a 1x1 conv, softmax, and a weighted sum.
I added fixes to implement ASFF and BiFPN (from EfficientDet): #3772 (comment)
I will try to implement the ASFF and BiFPN modules and run some tests
For up-sampling, we first apply a 1x1 convolution layer to compress the number of channels of the features to that in level _l_, and then upscale the resolutions respectively with interpolation.
@AlexeyAB How to implement this upscale in .cfg file?
@Kyuuki93 [upsample] layer with stride=2 or stride=4
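For reference, a minimal sketch (PyTorch, illustrative; the 512->256 channels and 13x13 size are taken from layers 87-88 in the dump below, everything else is assumed) of the up-sampling described in the quote: a 1x1 conv compresses the channels, then interpolation upscales the resolution, matching [upsample] with stride=2 or stride=4.

```python
# 1x1 conv to compress channels, then nearest-neighbor upscale (like [upsample] stride=2).
import torch
import torch.nn as nn
import torch.nn.functional as F

compress = nn.Conv2d(512, 256, kernel_size=1)   # compress channels to those of level l
x = torch.randn(1, 512, 13, 13)                 # /32 feature map
y = F.interpolate(compress(x), scale_factor=2, mode='nearest')
print(y.shape)                                  # torch.Size([1, 256, 26, 26])
```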
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BF
1 conv 64 3 x 3/ 2 416 x 416 x 32 -> 208 x 208 x 64 1.595 BF
2 conv 32 1 x 1/ 1 208 x 208 x 64 -> 208 x 208 x 32 0.177 BF
3 conv 64 3 x 3/ 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BF
4 Shortcut Layer: 1
5 conv 128 3 x 3/ 2 208 x 208 x 64 -> 104 x 104 x 128 1.595 BF
6 conv 64 1 x 1/ 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BF
7 conv 128 3 x 3/ 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF
8 Shortcut Layer: 5
9 conv 64 1 x 1/ 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BF
10 conv 128 3 x 3/ 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BF
11 Shortcut Layer: 8
12 conv 256 3 x 3/ 2 104 x 104 x 128 -> 52 x 52 x 256 1.595 BF
13 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
14 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
15 Shortcut Layer: 12
16 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
17 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
18 Shortcut Layer: 15
19 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
20 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
21 Shortcut Layer: 18
22 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
23 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
24 Shortcut Layer: 21
25 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
26 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
27 Shortcut Layer: 24
28 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
29 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
30 Shortcut Layer: 27
31 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
32 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
33 Shortcut Layer: 30
34 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
35 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
36 Shortcut Layer: 33
37 conv 512 3 x 3/ 2 52 x 52 x 256 -> 26 x 26 x 512 1.595 BF
38 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
39 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
40 Shortcut Layer: 37
41 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
42 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
43 Shortcut Layer: 40
44 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
45 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
46 Shortcut Layer: 43
47 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
48 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
49 Shortcut Layer: 46
50 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
51 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
52 Shortcut Layer: 49
53 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
54 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
55 Shortcut Layer: 52
56 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
57 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
58 Shortcut Layer: 55
59 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
60 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
61 Shortcut Layer: 58
62 conv 1024 3 x 3/ 2 26 x 26 x 512 -> 13 x 13 x1024 1.595 BF
63 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
64 conv 1024 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
65 Shortcut Layer: 62
66 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
67 conv 1024 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
68 Shortcut Layer: 65
69 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
70 conv 1024 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
71 Shortcut Layer: 68
72 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
73 conv 1024 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
74 Shortcut Layer: 71
75 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
76 conv 1024 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
77 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
78 max 5x 5/ 1 13 x 13 x 512 -> 13 x 13 x 512 0.002 BF
79 route 77 -> 13 x 13 x 512
80 max 9x 9/ 1 13 x 13 x 512 -> 13 x 13 x 512 0.007 BF
81 route 77 -> 13 x 13 x 512
82 max 13x13/ 1 13 x 13 x 512 -> 13 x 13 x 512 0.015 BF
83 route 82 80 78 77 -> 13 x 13 x2048
# END SPP #
84 conv 512 1 x 1/ 1 13 x 13 x2048 -> 13 x 13 x 512 0.354 BF
85 conv 1024 3 x 3/ 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF
86 conv 512 1 x 1/ 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BF
# A(/32 Feature Map) #
87 conv 256 1 x 1/ 1 13 x 13 x 512 -> 13 x 13 x 256 0.044 BF
88 upsample 2x 13 x 13 x 256 -> 26 x 26 x 256
# A -> B #
89 route 86 -> 13 x 13 x 512
90 conv 128 1 x 1/ 1 13 x 13 x 512 -> 13 x 13 x 128 0.022 BF
91 upsample 4x 13 x 13 x 128 -> 52 x 52 x 128
# A -> C #
92 route 86 -> 13 x 13 x512
93 conv 256 1 x 1/ 1 13 x 13 x512 -> 13 x 13 x 256 0.044 BF
94 upsample 2x 13 x 13 x 256 -> 26 x 26 x 256
95 route 94 61 -> 26 x 26 x 768
96 conv 256 1 x 1/ 1 26 x 26 x 768 -> 26 x 26 x 256 0.266 BF
97 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
98 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
99 conv 512 3 x 3/ 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BF
100 conv 256 1 x 1/ 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BF
# B(/16 Feature Map) #
101 conv 512 3 x 3/ 2 26 x 26 x 256 -> 13 x 13 x 512 0.399 BF
# B -> A #
102 route 100 -> 26 x 26 x 256
103 conv 128 1 x 1/ 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BF
104 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128
# B -> C #
105 route 100 -> 26 x 26 x 256
106 conv 128 1 x 1/ 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BF
107 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128
108 route 107 36 -> 52 x 52 x 384
109 conv 128 1 x 1/ 1 52 x 52 x 384 -> 52 x 52 x 128 0.266 BF
110 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
111 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
112 conv 256 3 x 3/ 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BF
113 conv 128 1 x 1/ 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BF
# C(/8 Feature Map) #
114 max 2x 2/ 2 52 x 52 x 128 -> 26 x 26 x 128 0.000 BF
115 conv 512 3 x 3/ 2 26 x 26 x 128 -> 13 x 13 x 512 0.199 BF
# C -> A #
116 route 113 -> 52 x 52 x 128
117 conv 256 3 x 3/ 2 52 x 52 x 128 -> 26 x 26 x 256 0.399 BF
# C -> B #
118 route 86 101 115 -> 13 x 13 x1536
119 conv 3 1 x 1/ 1 13 x 13 x1536 -> 13 x 13 x 3 0.002 BF
120 route 119 0/3 -> 13 x 13 x 1
121 scale Layer: 86
darknet: ./src/scale_channels_layer.c:23: make_scale_channels_layer: Assertion `l.out_c == l.c' failed.
Aborted (core dumped)
@AlexeyAB I created an asff.cfg based on yolov3-spp.cfg; there is an error: it seems layer-86 is 13x13x512 and layer-119 (e.g. alpha) is 13x13x1. In [scale_channels], should those layers' outputs be the same?
@Kyuuki93 It seems I fixed it: https://github.com/AlexeyAB/darknet/commit/5ddf9c74a58ce61d2aa82b806b8d0912ab6cf8f3#diff-35a105a0ce468de87dbd554c901a45eeR23
[route]
layers=22,33,44  # 3 layers which are already resized to the same WxHxC

[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels  # ReLU is integrated into activation=normalize_channels

[route]
layers=-1
group_id=0
groups=3

[scale_channels]
from=22
scale_wh=1

[route]
layers=-3
group_id=1
groups=3

[scale_channels]
from=33
scale_wh=1

[route]
layers=-5
group_id=2
groups=3

[scale_channels]
from=44
scale_wh=1

[shortcut]
from=-3
activation=linear

[shortcut]
from=-6
activation=linear
@AlexeyAB
In your ASFF-like module, what exactly does activation = normalize_channels do?
If activation = normalize_channels uses ReLU to calculate gradients,
I think it should be activation = linear plus a separate softmax over (x1, x2, x3), to match this formula: alpha = exp(x1) / (exp(x1) + exp(x2) + exp(x3)). Or activation = softmax for SoftmaxBackward?
levels_weight = F.softmax(levels_weight, dim=1)
levels_weight.shape was torch.Size([1,3,13,13])
Is activation = normalize_channels the same as this F.softmax?
If activation = normalize_channels actually executes this code - normalize_channels with the ReLU function - negative values are removed:
https://github.com/AlexeyAB/darknet/blob/9bb3c53698963f2a495be2dd9877d6ff523fe2ad/src/activations.c#L151-L177
maybe that explains this result:
| Model | chart | cfg |
| --- | --- | --- |
| spp,mse | (chart image) | yolov3-spp.cfg.txt |
| spp,mse,asff | (chart image) | yolov3-spp-asff.cfg.txt |
I think the normalization with the constraint channels_sum() = 1 is crucial; it indicates which ASFF feature an object belongs to.
And this ASFF module differs a little from your example: instead of
[route]
layers = 22,33,44  # 3 layers which are already resized to the same WxHxC
...
use
[route]
layers = 22
[convolutional]
batch_normalize=1
size=1
stride=1
filters=8
activation=leaky
[route]
layers = 33
[convolutional]
batch_normalize=1
size=1
stride=1
filters=8
activation=leaky
[route]
layers = 44
[convolutional]
batch_normalize=1
size=1
stride=1
filters=8
activation=leaky
[route]
layers = -1,-3,-5
[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels
...
@Kyuuki93
I think the normalization with the constraint channels_sum() = 1 is crucial; it indicates which ASFF feature an object belongs to.
What do you mean?
And this ASFF module differs a little from your example: instead of
Why?
In your ASFF-like module, what exactly does activation = normalize_channels do?
If activation = normalize_channels uses ReLU to calculate gradients,
I think it should be activation = linear plus a separate softmax over (x1, x2, x3), to match this formula: alpha = exp(x1) / (exp(x1) + exp(x2) + exp(x3)). Or activation = softmax for SoftmaxBackward?
normalize_channels implements Fast Normalized Fusion, which should have the same accuracy but faster speed than SoftMax across channels; it is used in BiFPN for EfficientDet: https://github.com/AlexeyAB/darknet/issues/4346
Later I will add activation=normalize_channels_softmax
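For clarity, a small sketch (PyTorch, illustrative) of the difference between the two normalizations discussed here: Fast Normalized Fusion clamps with ReLU and divides by the channel sum, so it can output exact zeros (and zero gradients), while the softmax variant is always strictly positive.

```python
# normalize_channels (Fast Normalized Fusion, ReLU-style) vs normalize_channels_softmax,
# sketched over the 3 per-level weight channels.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 13, 13)                        # raw per-level weight maps

r = F.relu(x)
fast_norm = r / (r.sum(dim=1, keepdim=True) + 1e-4)  # zeros stay zero -> zero gradient
softmax_norm = F.softmax(x, dim=1)                   # strictly positive weights

# both sum to ~1 over the 3 channels (fast_norm only where some input is > 0)
print((fast_norm.sum(dim=1) > 0.99).float().mean().item(),
      softmax_norm.sum(dim=1).mean().item())
```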

I think the normalization with the constraint channels_sum() = 1 is crucial; it indicates which ASFF feature an object belongs to.
What do you mean?
Sorry, let me be clear:
alpha(i,j) + beta(i,j) + gamma(i,j) = 1,
and alpha(i,j)> 0, beta(i,j)>0, gamma(i,j)>0
In normalize_channels, maybe the result comes from this code:
if (val > 0) val = val / sum;
else val = 0;
many alpha, beta, gamma values are set to 0, so the ReLU gradients are 0 too; the gradients vanish at the very beginning, and in this way training doesn't work properly. E.g. after 25k iterations the best mAP@0.5 is just 10.41%, and during training the Obj: value is very hard to increase.
And this ASFF module differs a little from your example: instead of
Why?
I checked the author's model: layers 22,33,44 are never concatenated; I just implemented his network structure. In his model the coefficients are calculated from layers 22,33,44 separately, and the channels change like
512 -> 8
512 -> 8 (concatenated to) 24 -> 3
512 -> 8
instead of
512 -> 3
normalize_channels implements Fast Normalized Fusion, which should have the same accuracy but faster speed than SoftMax across channels; it is used in BiFPN for EfficientDet: #4346
I will try to find out why BiFPN can work with the ReLU-style normalize_channels but ASFF cannot; I have a thought, just let me check it out
Later I will add
activation=normalize_channels_softmax
I will take another test then
@Kyuuki93
I checked the author's model: layers 22,33,44 are never concatenated; I just implemented his network structure.
You have done it right. I have not yet verified the entire cfg file as a whole.
Here we are not talking about layers with indices exactly 22, 33, 44. This is just an example.
This means that some layers with indices XX,YY,ZZ have already been resized to the same WxHx8. It is assumed here that conv_stride_2, maxpool_stride_2, upsample_stride_2 and 4 have already been applied, and then a conv-layer with filters=8.
And these 3 layers with size WxHx8 will be concatenated: https://github.com/ruinmessi/ASFF/blob/master/models/network_blocks.py#L240
That's how you did it.
In normalize_channels, maybe the result comes from this code:
if (val > 0) val = val / sum;
else val = 0;
many alpha, beta, gamma values are set to 0, so the ReLU gradients are 0 too; the gradients vanish at the very beginning, and in this way training doesn't work properly. E.g. after 25k iterations the best mAP@0.5 is just 10.41%, and during training the Obj: value is very hard to increase.
Yes, for one image some outputs (alpha, beta or gamma) will have zeros, and for another image other outputs (alpha, beta or gamma) will have zeros. There will not be dead neurons in Yolo, since all other layers use leaky-ReLU rather than ReLU.
This is a common problem for ReLU, called dead neurons: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks
This applies to all modern neural networks that use ReLU: MobileNet v1, ResNet-101, ...
Leaky-ReLU, Swish or Mish solve this problem.
There will be a dead-neurons problem only if at least 2 conv-layers with ReLU go in a row, one after another. Then the output of conv-1 will always be >=0, so both the input and output of conv-2 will always be >=0. In this case, since the input of conv-2 is always >=0, if weights[i] < 0 then the output of ReLU will always be 0 and the gradient will always be 0 - so there will be dead neurons, and this weights[i]<0 will never be changed.
But if the conv-1 layer has leaky-ReLU (as in Yolo) or Swish or Mish activation, then the input of conv-2 can be >0 or <0, so regardless of weights[i] (if weights[i] != 0) the gradient will not always be == 0, and **this weights[i]<0 will be changed sometime**.
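A tiny numeric sketch (PyTorch, illustrative) of this point: behind a plain ReLU, a negative weight followed by another ReLU gets zero gradient forever, while a leaky-ReLU input keeps the gradient alive.

```python
# Dead-neuron sketch: ReLU -> (w * x) -> ReLU gives grad 0 for w < 0;
# LeakyReLU -> (w * x) -> ReLU does not, so w < 0 can still be updated.
import torch
import torch.nn.functional as F

x = torch.randn(10000)
for act in (torch.relu, F.leaky_relu):
    w = torch.tensor(-0.5, requires_grad=True)        # a "bad" negative weight
    y = torch.relu(w * act(x))                        # second conv's ReLU
    y.sum().backward()
    print(act.__name__, "grad on w:", w.grad.item())  # relu: 0.0, leaky_relu: != 0
```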
@Kyuuki93
Also you can try to use
[convolutional]
stride=1
size=1
filters=3
activation=logistic
instead of
[convolutional]
stride=1
size=1
filters=3
activation=normalize_channels
@Kyuuki93
I added [convolutional] activation=normalize_channels_softmax
Check whether there are bugs: https://github.com/AlexeyAB/darknet/commit/c9c745ccf1de97d01cc3c69f81e83011f6439f1a and https://github.com/AlexeyAB/darknet/commit/4f52ba1a25ade35119cefc3840ef65a509851809
Page 4: https://arxiv.org/pdf/1911.09516v2.pdf

Here we are not talking about layers with indices exactly 22, 33, 44. This is just an example.
Yes, I am aware that layer 22 here corresponds to layer 86 in darknet's yolov3-spp.cfg, and so on.
There will be a dead-neurons problem only if at least 2 conv-layers with ReLU go in a row, one after another. Then the output of conv-1 will always be >=0, so both the input and output of conv-2 will always be >=0. In this case, since the input of conv-2 is always >=0, if weights[i] < 0 then the output of ReLU will always be 0 and the gradient will always be 0 - so there will be dead neurons, and this weights[i]<0 will never be changed. But if the conv-1 layer has leaky-ReLU (as in Yolo) or Swish or Mish activation, then the input of conv-2 can be >0 or <0, so regardless of weights[i] (if weights[i] != 0) the gradient will not always be == 0, and **this weights[i]<0 will be changed sometime**.
I see, so there is a little influence but it should work.
Will try activation=logistic and activation=normalize_channels_softmax; results to be updated later
@AlexeyAB
I created a new asff.cfg, yolov3-spp-asff-giou-logistic.cfg.txt,
but with normalize_channels_softmax the training loss goes to NaN in ~100 iterations;
with logistic I got this result, while yolov3-spp.cfg with mse loss achieved 89% at mAP@0.5

If you have time, could you tell me what mistake I made there?
If you have time, could you tell me what mistake I made there?
@AlexeyAB
Sorry, that's my fault: the previous .cfg file had wrong layer connections.
This one should be right for ASFF module, yolov3-spp-asff.cfg.txt,
viewed with Netron: yolov3-spp-asff.png.zip
But unfortunately this net was untrainable. The official repo mentions the ASFF module needs a long warm-up to avoid NaN loss, but in darknet the NaN loss shows up as soon as lr > 0, no matter whether activation = logistic, norm_channels or norm_channels_softmax, so I am wondering which part went wrong.
Following ASFF's core idea (every yolo-layer uses feature maps of all scales), I created a simplified-ASFF module: it just adds the feature maps from layers 22,33,44 (by shortcut) instead of multiplying them by alpha, beta, gamma
and this was simplified one, yolov3-spp-asff-simplified.cfg.txt
viewed by Netron yolov3-spp-asff-simplified.png.zip
yolov3-spp + Gaussian_yolo(iou_n,uc_n = 0.5) + iou_thresh=0.213

yolov3-spp + Gaussian_yolo(iou_n,uc_n = 0.5) + iou_thresh=0.213 + asff-sim

The complete training results will be updated several hours later, but it seems the simplified-ASFF module could boost AP or at least increase training speed.
And about the ASFF module: if the cfg file is not wrong, maybe these layers (e.g. scale_channels, activation = norm_channels_*) don't work as I expect?
This one should be right for ASFF module, yolov3-spp-asff.cfg.txt,
Did you try to use your new ASFF with default [yolo] without Gaussian and without GIoU and without iou_thresh and normalizers?
like
[yolo]
mask = 0,1,2
#anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
anchors = 57, 64, 87,113, 146,110, 116,181, 184,157, 175,230, 270,196, 236,282, 322,319
classes=1
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
This one should be right for ASFF module, yolov3-spp-asff.cfg.txt,
Did you try to use your new ASFF with default [yolo] without Gaussian and without GIoU and without iou_thresh and normalizers?
I will try. And asff-sim results with gs, giou, iou_thresh:

| Model | mAP@0.5 | mAP@0.75 |
| --- | --- | --- |
| baseline | 91.89% | 63.53% |
| +asffsim | 91.62% | 63.28% |

Results with mse loss will be reported tomorrow.
@Kyuuki93
So asff-simplified doesn't improve accuracy.
Try with default [yolo]+mse without normalizers and if it doesn't work then try with default anchors.
Did you try to use your new ASFF with default [yolo] without Gaussian and without GIoU and without iou_thresh and normalizers?
Yes, ASFF-SIM with default [yolo] decreases mAP@0.5 by 0.48%, and mAP@0.75 similarly:

| Model | mAP@0.5 | mAP@0.75 |
| --- | --- | --- |
| baseline | 89.52% | 51.72% |
| +asffsim | 89.04% | 51.24% |
@Kyuuki93
Try norm_channels or norm_channels_softmax with default [yolo] layers.
Maybe only [Gaussian_yolo] produces NaN with ASFF.
I tried, it's the same
@Kyuuki93
And about the ASFF module: if the cfg file is not wrong, maybe these layers (e.g. scale_channels, activation = norm_channels_*) don't work as I expect?
I checked the implementation of activation = norm_channels_* and didn't find bugs.
Also the cfg file https://github.com/AlexeyAB/darknet/files/3939064/yolov3-spp-asff.cfg.txt is almost correct, except for the very low learning_rate and burn_in=0: you should use burn_in=4000 and a higher learning_rate. Also use default [yolo] without normalizers.
Do you get NaN or low mAP with activation=norm_channels?
- Set [net] burn_in=4000 in the cfg-file.
- Change this line: https://github.com/AlexeyAB/darknet/blob/dbe34d78658746fcfc9548ebab759895ea05a70c/src/blas_kernels.cu#L1153
  to this: atomicAdd(&out_state_delta[osd_index], in_w_h_c_delta[index] * in_from_output[index] / channel_size);
- Check that grad=0 there: https://github.com/AlexeyAB/darknet/blob/dbe34d78658746fcfc9548ebab759895ea05a70c/src/activation_kernels.cu#L513
- Use default [yolo] with iou_normalizer = 1.0 and iou_loss = mse.
- Recompile and train ASFF with default [yolo] and activation=norm_channels and activation=norm_channels_softmax, with learning_rate=0.001 and burn_in=4000.
- Show the output chart.png with Loss and mAP for both activation=norm_channels and activation=norm_channels_softmax.
Very low LR and burn_in=0 were set to test whether NaN shows up when LR > 0. I have tried the usual LR and burn_in=2000; it's the same, no matter whether yolo or gs_yolo, with any loss
I will try your suggestion tomorrow, it’s midnight here
For the grad=0 check: grad is already 0 there. I set the baseline as
learning_rate = 0.0005 # for 2 GPUs
burn_in = 4000
...
activation = normalize_channels_softmax # in the end of ASFF
...
[yolo]
iou_loss = mse
iou_normalizer = 1.0
cls_normalizer = 1.0
ignore_thresh = 0.213
...
Also, the pre-trained weights used were yolov3-spp.conv.86 instead of yolov3-spp.conv.88
| Settings | Got NaN | Iters until NaN | Chart |
| --- | --- | --- | --- |
| baseline | y | 363 | - |
| activation -> normalize_channels | n | - | (chart image) |

After adding /channel_size to darknet/src/blas_kernels.cu#L1153:

| Settings | Got NaN | Iters until NaN | Chart |
| --- | --- | --- | --- |
| baseline | n | - | (chart image) |
| activation -> normalize_channels | | | (chart image) |
@AlexeyAB It seems to work fine for now; full results will be updated later. It will be delayed a few days because of a business trip
@Kyuuki93 Fine. What is the baseline in your table?
@Kyuuki93 Fine. What is the baseline in your table?
yolov3-spp.cfg with mse loss, only adding iou_thresh = 0.213
@Kyuuki93
This is strange:
yolov3-spp.cfg with mse loss, only adding iou_thresh = 0.213
But why does the default yolov3-spp.cfg go to NaN, while there is no ASFF, [scale_channels]-layer or activation=normalize_channels? https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-564437176
And why does it not go to NaN after fixing the [scale_channels]-layer, while yolov3-spp.cfg doesn't have a [scale_channels]-layer?
This is strange:
yolov3-spp.cfg with mse loss, only adding iou_thresh = 0.213
But why does the default yolov3-spp.cfg go to NaN, while there is no ASFF, [scale_channels]-layer or activation=normalize_channels? #4382 (comment)
And why does it not go to NaN after fixing the [scale_channels]-layer, while yolov3-spp.cfg doesn't have a [scale_channels]-layer?
Sorry, all the tests are with the ASFF module, so the baseline is yolov3-spp.cfg with mse loss, it=0.213, asff.
Here is the baseline .cfg file:
yolov3-spp-asff.cfg.txt
Thanks for the explanation. So the baseline is yolov3-spp + iou_loss=mse + iou_thresh=0.213 + ASFF with activation=normalize_channels_softmax
How about guided anchoring? In the ASFF paper, the yolo head consists of GA + RFB + Deform-conv
@Kyuuki93 Didn't look yet. Let's test these blocks first, to make sure they work and increase accuracy.
(A lot of features increase accuracy only in rare cases or if tricks and cheats are used.)
@Kyuuki93 Didn't look yet. Let's test these blocks first, to make sure they work and increase accuracy.
(A lot of features increase accuracy only in rare cases or if tricks and cheats are used.)
Of course. I will test dropblock these days or after I'm back at the office. By the way, training on my 4xGPU machine works well over 80k iterations, but on the 2xGPU machine it still gets killed after 10k iterations, so the updated chart will be divided
@Kyuuki93 What GPU, OS and OpenCV versions do you use?
How to use RFB-block: https://github.com/AlexeyAB/darknet/issues/4507
@Kyuuki93 What GPU, OS and OpenCV versions do you use?
- 4xGPUs: 4x 2080 Ti, Ubuntu 18.04, OpenCV 3.4.7
- 2xGPUs: 2x 1080 Ti, Ubuntu 18.04, OpenCV 3.4.7
| Model | Chart | mAP@0.5 | mAP@0.75 | Inference Time (416x416) |
| --- | --- | --- | --- | --- |
| spp,mse,it=0.213 | (chart image) | 92.01% | 60.49% | 13.75ms |
| spp,mse,it=0.213,asff(softmax) | (chart image) | 92.45% | 61.83% | 15.44ms |
| spp,mse,it=0.213,asff(relu) | | 91.83% | 59.60% | 15.34ms |
| spp,mse,it=0.213,asff(logistic) | | 91.18% | 60.79% | 15.40ms |
Will complete this table soon. So far the ASFF module seems to work well; its AP is already higher than spp,giou,gs,it, as can be seen in https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-561064425
@AlexeyAB
Will fill it in later
I added this fix: https://github.com/AlexeyAB/darknet/commit/d137d304c1410253894dbfb7abaadfe6f4f867e7
Compare which of the ASFFs is better: logistic, normalize_channels or normalize_channels_softmax.
Now that we know it works well, you can also try to test it with [Gaussian_yolo]
I added this fix: d137d30
Compare which of the ASFFs is better: logistic, normalize_channels or normalize_channels_softmax.
Now that we know it works well, you can also try to test it with [Gaussian_yolo]
I filled in the previous results table; the ASFF module gives +0.44% at mAP@0.5 and +1.34% at mAP@0.75. I have added this to https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-561064425
@Kyuuki93 Nice!
Is it normalize_channels_softmax-ASFF?
What improvement in accuracy does the normalize_channels (avg_ReLU)-ASFF give?
@Kyuuki93 Nice!
Is it normalize_channels_softmax-ASFF? What improvement in accuracy does the normalize_channels (avg_ReLU)-ASFF give?
Yes, it is normalize_channels_softmax-ASFF; the other normalize_channels wasn't tested yet. I will add dropblock before it
@Kyuuki93
It is weird that they use mlist.append(DropBlock(block_size=1, keep_prob=1)): https://github.com/ruinmessi/ASFF/blob/master/models/yolov3_asff.py#L39-L56
- near the Yolo-head
- block_size=1 - means that it is just a simple DropOut instead of DropBlock
- keep_prob=1 - means that it is disabled https://github.com/ruinmessi/ASFF/blob/f7814211b1fd1e6cde5e144503796f4676933667/models/network_blocks.py#L84
So I recommend using
- either DropOut + ASFF ([dropout] probability=0.1)
- or DropBlock + ASFF + RFB-block:
[dropout]
dropblock=1
dropblock_size=0.1  # 10% of width and height
probability=0.1  # this is drop probability = (1 - keep_probability)
Maybe I should split dropblock_size= into 2 params.
It is weird that they use mlist.append(DropBlock(block_size=1, keep_prob=1)): https://github.com/ruinmessi/ASFF/blob/master/models/yolov3_asff.py#L39-L56
- near the Yolo-head
- block_size=1 - means that it is just a simple DropOut instead of DropBlock
- keep_prob=1 - means that it is disabled https://github.com/ruinmessi/ASFF/blob/f7814211b1fd1e6cde5e144503796f4676933667/models/network_blocks.py#L84
So I recommend using
- either DropOut + ASFF ([dropout] probability=0.1)
- or DropBlock + ASFF + RFB-block:
[dropout]
dropblock=1
dropblock_size=0.1  # 10% of width and height
probability=0.1  # this is drop probability = (1 - keep_probability)
Ok then, I will check that, but I just found it inconvenient to add DropBlock while I'm away from the office. So I will test normalize_channels_* first; some results will come out tomorrow
And check here https://github.com/ruinmessi/ASFF/blob/538459e8a948c9cd70dbd8b66ee6017d20af77cc/main.py#L317-L337:
block_size=1, keep_prob=1 seem to be just for the baseline, which means the original yolo-head
@Kyuuki93 Thanks, I missed this code.
I fixed DropBlock: https://github.com/AlexeyAB/darknet/issues/4498
Just use (use dropblock_size_abs=7 if RFB is used, otherwise use dropblock_size_abs=2):
[dropout]
dropblock=1
dropblock_size_abs=7 # block size 7x7
probability=0.1 # this is drop probability = (1 - keep_probability)
It will work mostly the same as in the ASFF implementation (gradually increasing the block size from 1x1 to 7x7 in the first half of the training time).
Also probability will increase from 0.0 to 0.1 in the first half of the training time - as in the original DropBlock paper.
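A small sketch (plain Python; the function name and exact rounding are assumptions) of the schedule just described: both the block size and the drop probability ramp linearly over the first half of training, then stay at their maxima.

```python
# DropBlock schedule sketch: size 1x1 -> 7x7 and probability 0.0 -> 0.1
# during the first max_batches/2 iterations, constant afterwards.
def dropblock_schedule(cur_iter, max_batches, max_size=7, max_prob=0.1):
    scale = min(1.0, cur_iter / (max_batches / 2.0))
    return max(1, round(max_size * scale)), max_prob * scale

for it in (0, 6300, 12600, 25200):            # e.g. max_batches = 25200
    print(it, dropblock_schedule(it, 25200))  # (1, 0.0) ... (7, 0.1), (7, 0.1)
```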
@AlexeyAB I have filled in the table
@Kyuuki93 Nice!
It seems that only ASFF-softmax works well.
Now you can try to add DropBlock https://github.com/AlexeyAB/darknet/issues/4498 + RFB-block https://github.com/AlexeyAB/darknet/issues/4507
I fixed the gradient of activation=normalize_channels in the same way as it is done for normalize_channels_softmax; you can also try the new version of activation=normalize_channels: https://github.com/AlexeyAB/darknet/commit/7ae1ae5641b549ebaa5c816701c4b9ca73247a65
Here is another table:
| Model | mAP@0.5 | mAP@0.75 |
| --- | --- | --- |
| spp,mse,it=0.213,asff(softmax) | 92.45%| 61.83%|
| spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74% |
| spp,mse,it=0.213,asff(softmax),gs | 92.08%| 62.19% |
| spp,giou,it=0.213,asff(softmax),gs | 91.13%| 60.82%|
@AlexeyAB It seems Gaussian_yolo is not suited for ASFF
I got an error after adding dropblock
480 x 480
...
realloc(): invalid next size
Aborted (core dumped)
@AlexeyAB
@Kyuuki93
It seems Gaussian_yolo is not suited for ASFF
Try Gaussian_yolo with MSE.
@Kyuuki93
I got an error after adding dropblock
Can you share your cfg-file?
Sorry, wrong click.
Here is yolov3-spp-asff-db-it.cfg.txt
Try Gaussian_yolo with MSE.
Running it now
@Kyuuki93 I fixed it: https://github.com/AlexeyAB/darknet/commit/f831835125d3181b03aa28134869099c82ca846e#diff-aad5e97a835cccda531d59ffcdcee9f1R542
I got an error after adding dropblock
Resizing
480 x 480
Try Gaussian_yolo with MSE.
Updated in the previous table.
@Kyuuki93 Nice! Does DropBlock work well now?
@Kyuuki93 Nice! Does DropBlock work well now?
For now,
| Model | mAP@0.5 | mAP@0.75 |
|---|---|---|
|spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74%|
|spp,giou,it=0.213,asff(softmax),dropblock|91.66% | 62.42%|
But this chart suggests the net needs longer training,

so I changed
max_batches = 25200 -> 45200
steps = 10000,15000,20000 -> 20000,30000,40000
to take another run.
Also, I added rfb after dropblock; results will come in the coming days
I got another question here: I have 2 custom datasets.
1st: 30k images for training, 4.5k images for validation, one class; this is the one used to produce the previous results.
2nd: 100k images for training, 20k images for validation, also one class, containing many small objects.
For example, using the same network, e.g. yolov3-spp.cfg, I can achieve
88.48% mAP@0.5 on the 1st dataset and
91.50% mAP@0.5 on the 2nd dataset.
After this I merged those datasets into a two-class dataset (1st to 1st class, 2nd to 2nd class) and used the same network (of course changing filters before the yolo layers); then I got this result:
mAP@0.5
71.30% for class 0
90.11% for class 1
So, from the results, this merge decreased AP for all classes, but I don't know what led to this;
what's your suggestion about this?
@Kyuuki93
But this chart suggests the net needs longer training,
so I changed max_batches = 25200 -> 45200
steps = 10000,15000,20000 -> 20000,30000,40000
If you want a smooth decrease instead of steps, then just use SGDR (cosine lr schedule) from Bag of Freebies: https://github.com/AlexeyAB/darknet/issues/3272#issuecomment-497149618
Just use
learning_rate=0.0005
burn_in=4000
max_batches = 25200
policy=sgdr
instead of
learning_rate=0.0005
burn_in=4000
max_batches = 25200
policy=steps
steps=10000,15000,20000
scales=.1,.1,.1
mAP@0.5
71.30% for class 0
90.11% for class 1
So, from the results, this merge decreased AP for all classes, but I don't know what led to this;
what's your suggestion about this?
So dataset-0 has class-0, and dataset-1 has class-1.
Are there unlabeled objects of class-1 in dataset-0?
Are there unlabeled objects of class-0 in dataset-1?
All objects must be labeled - this is mandatory.
Also in general, more classes - worse accuracy.
So dataset-0 has class-0, and dataset-1 has class-1.
Yes
Are there unlabeled objects of class-1 in dataset-0?
Are there unlabeled objects of class-0 in dataset-1?
No, I can ensure that.
Also in general, more classes - worse accuracy.
From results in many fields, yes, but what leads to that?
From results in many fields, yes, but what leads to that?
Any model has a limited capacity, and the more classes, the fewer features specific for each class can fit in the model. Classes compete for model capacity.
Any model has a limited capacity, and the more classes, the fewer features specific for each class can fit in the model. Classes compete for model capacity.
class-0 has 20k instances and class-1 has more than 80k instances, so class-1 gets average accuracy because there is more data to grab model capacity; conversely, class-0 gets worse
@Kyuuki93 Maybe yes.
I added fix: https://github.com/AlexeyAB/darknet/commit/e33ecb785ee288fca1fe50326f5c7b039a9f5a11
So now you can try to set in each [yolo] / [Gaussian_yolo] layer parameter:
counters_per_class=20000, 80000
And train.
It will use multipliers for delta_class during training:
- 4x for class-0
- 1x for class-1

@Kyuuki93
I found a bug in counters_per_class= and fixed it: https://github.com/AlexeyAB/darknet/commit/e43a1c424d9a20b8425d8dd8f240867f2522df3f
So now you can try to set in each [yolo] / [Gaussian_yolo] layer the parameter:
counters_per_class=20000, 80000
Do 20000 and 80000 mean the class instance counts?
It will use multipliers for delta_class during training: 4x for class-0, 1x for class-1
Are 4 and 1 ratios calculated from 20000 and 80000?
If yes, is setting counters_per_class=1, 4 the same as counters_per_class=20000, 80000?
Btw, dropblock with a fixed size = 7 did not get higher AP, even with longer training; maybe gradually increasing the size as 1, 5, 7 will get a better result. I will try this later
@AlexeyAB does counters_per_class work outside of Gaussian_yolo? Is it a way to solve the class imbalance problem? Would you explain that commit and its use case?
@Kyuuki93
Do 20000 and 80000 mean the class instance counts?
Are 4 and 1 ratios calculated from 20000 and 80000?
If yes, is setting counters_per_class=1, 4 the same as counters_per_class=20000, 80000?
Yes.
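A sketch (plain Python; inferred from the 20000,80000 -> 4x,1x example above, not darknet's actual code) of how the multipliers follow from the counters:

```python
# counters_per_class sketch: each class delta is scaled by max_count / count,
# so the rarer class gets a proportionally larger delta.
def class_multipliers(counters):
    m = max(counters)
    return [m / c for c in counters]

print(class_multipliers([20000, 80000]))  # [4.0, 1.0]
print(class_multipliers([1, 4]))          # [4.0, 1.0] - same ratios, same effect
```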
Btw, dropblock with a fixed size = 7 did not get higher AP, even with longer training;
DropBlock with size=7 should be used only with RFB-block as I described above.
maybe gradually increasing the size as 1, 5, 7 will get a better result. I will try this later
Current implementation of DropBlock will gradually increase size from 1 to maxsize=7.
In the implementation from the paper, maxsize=7 is used for all 3 DropBlocks. But you can try using different maxsizes (1, 5, 7) for the different DropBlocks.
@isgursoy
does counters_per_class work outside of Gaussian_yolo? Is it a way to solve the class imbalance problem? Would you explain that commit and its use case?
Yes. It is an experimental feature. Just set the number of objects in the training dataset for each class.
@AlexeyAB
So dropblock with size=7 should not be used without rfb.
Current implementation of DropBlock will gradually increase size from 1 to maxsize=7.
When will the size increase? Is it based on current_iters / max_batches?
And here is another table; some results are copied from the previous table for convenient comparison.
| Model | mAP@0.5 | mAP@0.75 |
| --- | --- | --- |
|spp,mse,it=0.213,asff(softmax)|92.45% | 61.83%|
|spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74%|
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) | 91.66% | 62.42%|
|spp,giou,it=0.213,asff(softmax),rfb(bn=0)|91.57%|64.95% |
|spp,giou,it=0.213,asff(softmax),rfb(bn=1)|92.32%|60.05% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=0)| 91.65%|61.35% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=1)|92.12% |63.88% |
|spp,giou,it=0.213,asff(logistic),dropblock(size=7) ,rfb(bn=1)|91.78% |60.10% |
|spp,giou,it=0.213,asff(relulike),dropblock(size=7) ,rfb(bn=1)| 91.43%|61.11%|
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=1),sgdr|91.54% |62.67% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=3,5,7) ,rfb(bn=1)|90.93% | 61.20%|
This is the rfb cfg file; I think it's right, but you can check to be sure: yolov3-spp-asff-it-rfb.cfg.txt
Btw, yolact was updated these days to yolact++; this discussion may move to its reference issue
AugFPN: Improving Multi-scale Feature Learning for Object Detection
paper https://arxiv.org/pdf/1912.05384.pdf

very similar to ASFF
@Kyuuki93
This is the rfb cfg file; I think it's right, but you can check to be sure: yolov3-spp-asff-it-rfb.cfg.txt
It seems this is correct.
Also you can try to train RFB-block with batch_normalize=1, it may have higher accuracy: https://github.com/ruinmessi/ASFF/issues/46
When will the size increase? Is it based on current_iters / max_batches?
Yes multiplier = current_iters / (max_batches / 2) as in paper: https://github.com/AlexeyAB/darknet/blob/3004ee851c49e28a32fd60f2ae4a1ddf95b8b391/src/dropout_layer_kernels.cu#L31
https://github.com/AlexeyAB/darknet/blob/3004ee851c49e28a32fd60f2ae4a1ddf95b8b391/src/dropout_layer_kernels.cu#L39-L40
Btw, yolact was updated these days to yolact++; this discussion may move to its reference issue
Thanks! I will look at it: https://github.com/AlexeyAB/darknet/issues/3048#issuecomment-567017091
Also you can try to train RFB-block with batch_normalize=1, it may have higher accuracy: ruinmessi/ASFF#46
I have seen your discussion with ASFF's author before; a comparison of rfb(bn=0) with rfb(bn=1) is on the schedule.
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
Another work manipulating FPN connections; it seems researchers have recently paid more attention to the FPN connection method
It is not clear what is better: ASFF, BiFPN, or these AugFPN, SpineNet...
AugFPN: Improving Multi-scale Feature Learning for Object Detection
They compare their network with Yolov2 in 2019, and don't compare Speed/BFLOPS.
But maybe we should read it:
By replacing FPN with AugFPN in Faster R-CNN, our models achieve 2.3 and 1.6 points higher Average Precision (AP) when using ResNet50 and MobileNet-v2 as backbone respectively. Furthermore, AugFPN improves RetinaNet by 1.6 points AP and FCOS by 0.9 points AP when using ResNet50 as backbone.
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
paper https://arxiv.org/pdf/1912.05027.pdf
Another work manipulating FPN connections; it seems researchers have recently paid more attention to the FPN connection method
They compare AP / Bflops, but don't compare AP / FPS. This is usually done when the network is slow. But maybe we should read the article.
SpineNet achieves state-of-the-art performance of a one-stage object detector on COCO with 60% less computation, and outperforms ResNet-FPN counterparts by 6% AP. The SpineNet architecture can transfer to classification tasks, achieving 6% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset.
It is not clear what is better: ASFF, BiFPN, or these AugFPN, SpineNet
Yes, we need to read more implementation details; then we will be able to compare them in this framework. In my understanding, their high AP with slow speed results from a heavy backbone; all their ideas are about how to connect the FPN, which only requires a network change to implement, so they may be useful to yolo.
Will take a deep look then
@AlexeyAB updated Table here https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-567010280
It seems rfb bn=1 is better than rfb bn=0, and dropblock (7,7,7) is better than dropblock (3,5,7).
Also, maybe this one-class dataset is too easy and it is hard to show the validity of these improvements; perhaps I should run all experiments again on the two-class dataset mentioned before
@Kyuuki93 Well.
The current implementation of dropblock (7,7,7) in Darknet will grow the dropblock from (1,1,1) initially to (7,7,7) at max_batches/2 iterations - so this is much closer to the article.
It is very strange that dropblock drops AP@75. Maybe with dropblock the model should be trained for more iterations.
Show the cfg-file and chart.png file of spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1)
Do you use sgdr-lr-policy or step-lr-policy, and how many iterations did you train?
It is very strange that dropblock drops AP@75. Maybe with dropblock the model should be trained for more iterations.
Doubling the training iterations did not get higher AP.
Show the cfg-file and chart.png file of
spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1)
Sorry, I checked it but did not keep it; it just looked like https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-566509160
Do you use sgdr-lr-policy or step-lr-policy, and how many iterations did you train?
It is step-lr-policy.
@AlexeyAB I filled in the previous table:
spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74% |
spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1) | 92.12% | 63.88% |
Maybe the improvement of the rfb block was based on guided anchoring
@Kyuuki93 Thanks!
Did you try to test spp,giou,it=0.213,asff(softmax),rfb(bn=1) (RFB with batch-norm, but without dropblock)?
In sgdr - did you use policy=sgdr or policy=sgdr sgdr_cycle=1000 ?
- Did you try to test spp,giou,it=0.213,asff(softmax),rfb(bn=1) (RFB with batch-norm, but without dropblock)?
Yes, it is still training; this is the last experiment on the class-0 dataset, the following experiments will use the two-class dataset
- In sgdr - did you use policy=sgdr or policy=sgdr sgdr_cycle=1000?
Only policy=sgdr. How do I choose the number for sgdr_cycle=? This is the chart of training with sgdr; is the part in the blue circle normal for sgdr, or does it just need longer training?

Only policy=sgdr.
That's right!
How do I choose the number for sgdr_cycle=?
sgdr_cycle = num_images_in_train_txt / batch - but you should choose max_batches so that it falls on the end of a cycle, so don't use it.
This is the chart of training with sgdr; is the part in the blue circle normal for sgdr, or does it just need longer training?
It seems it should be trained longer.
@AlexeyAB finished spp,giou,it=0.213,asff(softmax),rfb(bn=1) training; the chart is nearly flat after 8k iterations,

@Kyuuki93
Yes, first AP50 stops improving, and later AP75 stops.
It is weird that rfb(bn=0) is better than rfb(bn=1) for AP75.
In general I think spp,giou,it=0.213,asff(softmax),dropblock(size=7),rfb(bn=1) is the best model. Try it on the multiclass dataset.
On the two-class dataset: all models are based on yolov3-spp.cfg with iou_thresh=0.213 and giou with iou_normalizer=0.5; all trainings use the same batch_size and step-lr-policy. C0/C1 means class-0 and class-1, which include 20k and 80k instances respectively. This table may take two weeks; results will be updated gradually.

| Model | mAP@0.5 (C0/C1) | mAP@0.75 (C0/C1) |
| --- | --- | --- |
| mse | | |
| giou | 79.53% (69.24%/89.83%) | 59.65% (42.96%/76.34%) |
| giou,cpc | 79.51% (69.07%/89.96%) | 59.52% (42.17%/76.87%) |
| giou,asff,rfb | | |
| giou,asff,dropblock,rfb | | |

- cpc means counters_per_class
- asff is by default with softmax
- rfb is by default with bn=1
- dropblock is by default with size=7
asff is by default with softmax
Also try asff without softmax, at least for 1 class, just to know whether it is implemented correctly and whether we should use it for the BiFPN implementation.
asff is by default with softmax
Also try asff without softmax, at least for 1 class, just to know whether it is implemented correctly and whether we should use it for the BiFPN implementation.
Ok, I will do it for 1 class, which will be faster
@AlexeyAB The previous table is completed. But I think ASFF normalizes channels once, while BiFPN can normalize channels many times; maybe that's why BiFPN can use a simple norm function
@Kyuuki93 Yes, maybe many fusions compensate for the disadvantages of norm_channels with ReLU, since there are many BiFPN blocks with many norm_channels in each BiFPN-block in EfficientDet.
So the best models from: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-567010280
| Model | mAP@0.5 | mAP@0.75 |
| --- | --- | --- |
|spp,mse,it=0.213,asff(softmax)|92.45% | 61.83%|
|spp,giou,it=0.213,asff(softmax) | 92.33% | 63.74%|
|spp,giou,it=0.213,asff(softmax),rfb(bn=0)|91.57%|64.95% |
|spp,giou,it=0.213,asff(softmax),dropblock(size=7) ,rfb(bn=1)|92.12% |63.88% |
@Kyuuki93
I would suggest using rfb(bn=1) instead of rfb(bn=0) in your new experiments with 2 classes: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-568184405
@Kyuuki93
Also since we don't use Deformable-conv, then you can try to use RFB-block with flexible receptive field from 1x1 to 11x11: https://github.com/AlexeyAB/darknet/issues/4507#issuecomment-568296011
You can test with one class too.
@AlexeyAB It seems counters_per_class does not work out the data imbalance issue. Compared with class weights in a classification problem, where the loss is produced by the class only, in a detection problem the loss is produced by class, location and even objectness, and the loc and obj losses are not related to the class label, so a class-loss multiplier cannot work this out. Actually, in my dataset the class of a box gets high accuracy; it seems the model just can't find the class-0 objects, which lack data in the training dataset
Also since we don't use Deformable-conv, then you can try to use RFB-block with flexible receptive field from 1x1 to 11x11: #4507 (comment)
By changing activation = linear to activation = leaky?
@Kyuuki93
Also since we don't use Deformable-conv, then you can try to use RFB-block with flexible receptive field from 1x1 to 11x11: #4507 (comment)
By changing activation = linear to activation = leaky?
You can use activation = linear for conv-layers.
Generally by adding [maxpool] maxpool_depth=1:
[route]
layers = -1,-5,-9,-12
[maxpool]
maxpool_depth=1
out_channels=64
size=1
stride=1
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=linear
[shortcut]
from=-16
activation=leaky
@Kyuuki93 @AlexeyAB Hi,
When can we use the ASFF? Please share the final cfg.
Have you tried to compare ASFF and csresnext50-panet-spp?
@zpmmehrdad
yolov3-spp + ASFF: yolov3-spp-asff-it.cfg.txt
yolov3-spp + ASFF + Dropblock + RFB(bn=0): yolov3-spp-asff-db-it-rfb.cfg.txt
You can set bn=1 in the RFB blocks to get better results.
have you tried to compare ASFF and csresnext50-panet-spp?
Not yet; you can compare them on your data. Also, this ASFF is based on darknet-53 but csresnext50-panet-spp is based on resnext; maybe we should implement ASFF on resnext50 for a fair comparison
@Kyuuki93 Thanks for your reply,
I saw the table that you shared, and in it "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" has a good result at mAP@0.75. I'm going to use it for ~200 classes; some classes are almost the same, and mAP@0.75 is important for me. Do you think "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" is a good option for me or not?
Thanks
I'm going to use it for ~200 classes; some classes are almost the same, and mAP@0.75 is important for me. Do you think "spp,giou,it=0.213,asff(softmax),rfb(bn=0)" is a good option for me or not?
Try to compare spp,giou,it=0.213,asff(softmax),rfb(bn=0) and spp,giou,it=0.213,asff(softmax),rfb(bn=1); this ASFF module hasn't been tested enough, and I'm not sure it can improve mAP@0.75 on every dataset. I used it on a one-class dataset, so if you get results please share them with us
@Kyuuki93
Not yet; you can compare them on your data. Also, this ASFF is based on darknet-53 but csresnext50-panet-spp is based on resnext; maybe we should implement ASFF on resnext50 for a fair comparison
Look at this comparison: https://github.com/AlexeyAB/darknet/issues/4406#issuecomment-567919052
For GPUs without Tensor Cores ResNext50 was better than Darknet53, but for Volta/Turing (RTX) GPUs and newer, it seems that Darknet53 is better.
So maybe we should use the CSPDarkNet-53 backbone rather than CSPResNeXt-50: https://github.com/WongKinYiu/CrossStagePartialNetworks#big-models
Also maybe 1 block of BiFPN (based on NORM_CHAN_SOFTMAX) can be better than ASFF
So maybe we should use the CSPDarkNet-53 backbone rather than CSPResNeXt-50: https://github.com/WongKinYiu/CrossStagePartialNetworks#big-models
It seems worth trying
- Also maybe 1 block of BiFPN (based on NORM_CHAN_SOFTMAX) can be better than ASFF
I have a question about BiFPN: with 3 yolo layers, should BiFPN just keep P3-P5 and ignore P6-P7?

Btw, I have made a Spinenet-49 with 3 yolo layers, spinenet.cfg.txt; you can check it or take a test.

Training from scratch is a little bit slow...
@AlexeyAB Also, take a look at this: https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-568696075 - it seems gaussian_yolo hurts recall heavily, and a low iou_thresh can significantly improve it.
Next I want to find out the relation between precision, recall and ignore_thresh, truth_thresh
@Kyuuki93
[Gaussian_yolo] introduces bbox_confidence_score = (0 - 1), so confidence_score = class_conf * bbox_conf will be lower than confidence_score = class_conf in [yolo]. It decreases the number of bboxes with score > conf_thresh, so it increases Precision and decreases Recall for the same conf_threshold.
iou_thresh=0.213 allows Yolo to use many anchors that are not the most suitable for one object. It increases the number of bboxes (but the additional bboxes are not accurate), so it increases Recall and decreases Precision for the same conf_threshold.
| Model | mAP@0.5 | precision (th=0.85) | recall (th=0.85) | precision (th=0.7) | recall (th=0.7) |
| --- | --- | --- | --- | ---| ---|
|spp,mse |89.50% | 0.98 | 0.20| 0.97| 0.36|
|spp,giou |90.09% | 0.98 | 0.25| 0.97| 0.40|
|spp,ciou |89.88% | 0.99 | 0.22| 0.97| 0.38|
|spp,giou,gs | 91.39% |0.99| 0.05| 0.97 | 0.47|
|spp,giou,gs,it|91.87%|0.99 |0.16 |0.97|0.52|
I have a question about BiFPN which is with 3 yolo layers, BiFPN should just keep P3-5 and ignore P6-7?
Yes. You can just get features from these 3 points (P3, P4, P5). And use NORM_CHAN_SOFTMAX
Or you can get features from earlier points (figure below)
And you can duplicate BiFPN-block many times (from 2 to 8 BiFPN blocks) - page 5, table 1: https://arxiv.org/pdf/1911.09070v1.pdf

Btw, I have made a Spinenet-49 with 3 yolo layers, spinenet.cfg.txt; you can check it or take a test.
I did not go into the details of how the SpineNet should look. Several questions:
- Can you show a link to the code (Pytorch/TF/...) from which you copied the SpineNet?
- masks= of [yolo] layers should be fixed for P3, P2, P4 sizes
- Why did you remove top 2 blocks (P5, P6)?



- Can you show a link to the code (Pytorch/TF/...) from which you copied the SpineNet?
There is no public SpineNet implementation in any framework yet; this cfg was based on

- masks= of [yolo] layers should be fixed for P3, P2, P4 sizes
I did not notice that; I will change it then
- Why did you remove top 2 blocks (P5, P6)?
The feature maps in P5, P6 were too small, I think
- 2 shortcut layers are pointed to the same layer-5
layer-9 is the shortcut in the 2nd bottleneck block, with activation=leaky
layer-10 is the input of the 3rd bottleneck, with activation=linear
- Is it normal that some of your layers have 19 BFLOPS?
I checked the model and compared the ratios here

I think the 19 BFLOPS is right; it is the result of using a residual block instead of a bottleneck block (the diamond block in the previous figure)
2 shortcut layers are pointed to the same layer-5
layer-9 is the shortcut in the 2nd bottleneck block, with activation=leaky
layer-10 is the input of the 3rd bottleneck, with activation=linear
What do you mean? Do you mean this is correct?
# # 9 b2
[shortcut]
from=-4
activation=leaky
# # 10 b3 3rd gray rectangle block
# # from b1,b2
[shortcut]
from=-5
activation=leaky

What do you mean? Do you mean this is correct?
# # 9 b2
[shortcut]
from=-4
activation=leaky
# # 10 b3 3rd gray rectangle block
# # from b1,b2
[shortcut]
from=-5
activation=leaky -> should be linear
My mistake, but the two shortcuts from layer-5 are correct
@AlexeyAB this spinenet, which had one shortcut layer with the wrong activation function and used 3 yolo layers on P2, P2, P3, got 88.80% mAP@0.5 and 52.26% mAP@0.75 on the previous one-class dataset, training from scratch with the settings
width,height = 384,384
random=0
iou_loss=giou
iou_thresh=0.213
but it is 1.5x slower than yolov3-spp; I will run more tests
@Kyuuki93 Try to train both yolov3-spp and fixed-spinenet without pre-trained weights and with the same other settings.
It seems counters_per_class does not work out the data imbalance issue. Compared with class weights in a classification problem, where the loss is produced by the class only, in a detection problem the loss is produced by class, location and even objectness, and the loc and obj losses are not related to the class label, so a class-loss multiplier cannot work this out. Actually, in my dataset the class of a box gets high accuracy; it seems the model just can't find the class-0 objects, which lack data in the training dataset
I added 2 fixes, so now counters_per_class affects the objectness and bbox too.
https://github.com/AlexeyAB/darknet/commit/35a3870979e0d819208a3de7a24c39cc0539651d
https://github.com/AlexeyAB/darknet/commit/b8fe630119fea81200f6ca4641ce2514d893df04
For comparison of spinenet (fixed, 5 yolo-layers) and yolov3-spp (3 yolo-layers), training from scratch with the same settings:
width = 384
height = 384
batch = 96
subdivisions = 16
learning_rate = 0.00025
burn_in = 1000
max_batches = 30200
policy = steps
steps = 15000, 20000, 25000
scales = .1,.1,.1
...
random = 0
iou_loss = giou
iou_normalizer = 0.5
iou_thresh = 0.213
| Network | mAP@0.5 | mAP@0.75 | precision (.7) | recall (.7) | Inference time |
| --- | --- | --- | --- | --- | --- |
| spinenet49-5l|90.46%|53.80%|0.93|0.71|32.17ms|
| yolov3-spp|89.98%|54.47%|0.96|0.53|11.77ms|
@AlexeyAB

Is there an op like nn.Parameter() in this repo for implementing this w_i in BiFPN?
@Kyuuki93
Is there an op like nn.Parameter() in this repo for implementing this w_i in BiFPN?
What do you mean?
If you want Unbounded fusion, then just use activation=linear instead of activation=NORM_CHAN_SOFTMAX
@AlexeyAB
For example, w_i is a scalar:
P4_mid = Conv( (w1*P4_in + w2*Resize(P5_in)) / (w1 + w2) )
This w_i should be trainable but not dependent on any feature map.
In ASFF, w is calculated from the feature map through a conv layer.
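For illustration, a minimal sketch (PyTorch; all names are assumptions, not from any repo) of exactly this scalar fusion: w1, w2 are trainable nn.Parameter scalars, independent of the feature maps, unlike ASFF where the weights come from a conv over the features.

```python
# BiFPN-style weighted fusion with trainable scalar weights (fast normalized fusion).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFuse(nn.Module):
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # one trainable scalar per input
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.eps = eps

    def forward(self, p4_in, p5_in):
        p5_up = F.interpolate(p5_in, size=p4_in.shape[-2:], mode='nearest')
        w = F.relu(self.w)                     # keep w_i >= 0
        fused = (w[0] * p4_in + w[1] * p5_up) / (w.sum() + self.eps)
        return self.conv(fused)                # P4_mid = Conv((w1*P4 + w2*Resize(P5)) / (w1+w2))

p4, p5 = torch.randn(1, 256, 26, 26), torch.randn(1, 256, 13, 13)
print(WeightedFuse(256)(p4, p5).shape)         # torch.Size([1, 256, 26, 26])
```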
@Kyuuki93
In ASFF, w is calculated from the feature map through a conv layer.
Do you mean that is not so in BiFPN? https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py
If you want w constant during inference, then you can do something like this:
```
[route]
layers = P4
[convolutional]
batch_normalize=1
filters=256
groups=256
size=1
stride=1
pad=1
activation=linear
[route]
layers = P5
[convolutional]
batch_normalize=1
filters=256
groups=256
size=1
stride=1
pad=1
activation=linear
[shortcut]
from = -3
```
For comparison of spinenet (fixed, 5 yolo-layers) and yolov3-spp (3 yolo-layers), training from scratch with the same settings:
Also try to compare with spinenet (fixed, 3 yolo-layers) + spp, where an SPP-block is added to the P5 or P6 block: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-568950286

@AlexeyAB
Do you mean that is not so in BiFPN? https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py
def build_BiFPN() here is not like that; it is without w
https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py#L40-L93
def build_wBiFPN() here is BiFPN with w
https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py#L96-L149
w is defined here; actually, we need a layer like this one
https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/layers.py#L33-L60
Maybe adding weights to the [shortcut] layer is an option; also [shortcut] could take more than 2 inputs, something like
[shortcut]
from=P4, P5_up
weights_type = feature (or channel or pixel)
weights_normalization = relu (or softmax or linear)
activation = linear
The feature map at P6 is only 4x4; could it be too small to give useful features?
Normally, SPP sits in the middle, connecting the Backbone and the FPN, like Backbone -> SPP -> FPN?
But in Spinenet49, it seems the whole network is an FPN.
@AlexeyAB I moved the spinenet-related comment to its issue
@Kyuuki93
The feature map at P6 is only 4x4; could it be too small to give useful features?
Yes, then spp should be placed in P5 (especially if you use a small initial network resolution).
[shortcut] from=P4, P5_up weights_type = feature (or channel or pixel) weights_normalization = relu (or softmax or linear) activation = linear
Yes, or maybe just feature is enough, without channel or pixel.
Interestingly, would a fusion from BiFPN be more effective than such a fusion?
w - a vector (per-channel) in BiFPN with ReLU
batch_normalize=1 - will do normalization to solve the training-instability issue
```
[route]
layers = L1, L2, L3   # output: W x H x 3*C

[convolutional]
batch_normalize=1
filters=3*C
groups=3*C
size=1
stride=1
pad=1
activation=leaky

[local_avgpool]
avgpool_depth = 1   # isn't implemented yet
# avg across C instead of WxH - same meaning as maxpool_depth=1 in [maxpool]
out_channels = C
```
@Kyuuki93
It seems that higher ignore_thresh=0.85 is better than ignore_thresh=0.7 for your dataset. https://github.com/AlexeyAB/darknet/issues/3874#issuecomment-568696075
Also truth_thresh=1.0 is good.
So for your dataset it is better to use iou_thresh=1.0 (or not use it at all).
@AlexeyAB
It seems that higher ignore_thresh=0.85 is better than ignore_thresh=0.7 for your dataset.
ignore_thresh = 0.85 got a higher AP@0.5 but a much lower recall than ignore_thresh = 0.7
Also truth_thresh=1.0 is good.
Actually:
- truth_thresh only worked at 1.0
- with both truth_thresh and ignore_thresh set to 0.7, the network became untrainable
- ignore_thresh = 0.7, truth_thresh = 0.85 decreased performance
So for your dataset it is better to use iou_thresh=1.0 (or not use it at all).
What do you mean? For now, all training is with iou_thresh = 0.213; do you mean set iou_thresh=1.0 when changing truth_thresh or ignore_thresh?
Other one-stage methods work with dual thresholds, such as ignore_thresh = 0.3 and truth_thresh = 0.5, but yolo works with a single threshold, ignore_thresh = 0.7; this is also mentioned in yolov3's paper but without explanation. I just wonder why.
@Kyuuki93
Happy New Year! :fireworks: :sparkler:
What do you mean? For now, all training is with iou_thresh = 0.213; do you mean set iou_thresh=1.0 when changing truth_thresh or ignore_thresh?
I mean it may be better to use for your dataset:
ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh=1.0
While for MS COCO it may be better to use:
ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh=0.213
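A minimal sketch of where these keys sit, assuming standard [yolo]-section syntax (other keys omitted):
```
[yolo]
ignore_thresh = 0.7
truth_thresh = 1.0
iou_thresh = 1.0    # for MS COCO, try iou_thresh = 0.213 instead
```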
Other one-stage methods work with dual thresholds, such as ignore_thresh = 0.3 and truth_thresh = 0.5, but yolo works with a single threshold, ignore_thresh = 0.7; this is also mentioned in yolov3's paper but without explanation. I just wonder why.
What methods do you mean?
In the original Darknet there are several issues which may degrade accuracy when using low values of ignore_thresh or truth_thresh
Initially in the original Darknet there were several wrong places which I fixed:
There was used if (best_iou > l.ignore_thresh) {
instead of if (best_match_iou > l.ignore_thresh) { https://github.com/AlexeyAB/darknet/blame/dcfeea30f195e0ca1210d580cac8b91b6beaf3f7/src/yolo_layer.c#L355
Thus, it didn't decrease objectness even if there was an incorrect class_id.
Now it decreases objectness if detection_class_id != truth_class_id - this improves accuracy if ignore_thresh < 1.0.
When truth_thresh < 1.0, the probability that many objects will correspond to one anchor increases. But in the original Darknet, only the last truth-bbox (from the label txt-file) affected the anchor. I fixed it - now it averages the deltas of all truths that correspond to this one anchor - so truth_thresh < 1.0 and iou_thresh < 1.0 may have a better effect.
Also, a possible bug with MSE isn't tested and isn't fixed: https://github.com/AlexeyAB/darknet/issues/4594#issuecomment-569927386
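To illustrate the first fix, a minimal C sketch (not the actual yolo_layer.c code; the 0.25 class-probability threshold is an assumption for this sketch) of best_iou versus best_match_iou:
```
#include <math.h>

typedef struct { float x, y, w, h; } box;  /* center-x, center-y, width, height */

static float overlap_1d(float c1, float w1, float c2, float w2) {
    float l = fmaxf(c1 - w1 / 2, c2 - w2 / 2);
    float r = fminf(c1 + w1 / 2, c2 + w2 / 2);
    return fmaxf(0.f, r - l);
}

static float box_iou(box a, box b) {
    float inter = overlap_1d(a.x, a.w, b.x, b.w) * overlap_1d(a.y, a.h, b.y, b.h);
    float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0 ? inter / uni : 0.f;
}

/* best_iou:       highest IoU with ANY truth box
   best_match_iou: highest IoU with a truth box whose class the prediction
                   also assigns a high probability to */
static void find_best_ious(box pred, const float *pred_class_prob,
                           const box *truths, const int *truth_class, int n,
                           float *best_iou, float *best_match_iou) {
    *best_iou = 0; *best_match_iou = 0;
    for (int t = 0; t < n; ++t) {
        float iou = box_iou(pred, truths[t]);
        if (iou > *best_iou) *best_iou = iou;
        if (pred_class_prob[truth_class[t]] > 0.25f && iou > *best_match_iou)
            *best_match_iou = iou;
    }
    /* with the fix, objectness is left unpenalized only when
       best_match_iou > ignore_thresh, so a high-IoU detection with the
       WRONG class still gets its objectness decreased */
}
```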
@AlexeyAB Happy New Year!
Here are the old cpc and new cpc results; it seems using a loss multiplier on all loss parts can balance the classes' AP slightly, but not improve it:
| Model | AP@0.5 (C0/C1) | AP@0.75 (C0/C1) |
|---|---|---|
| giou | 79.53% (69.24%/89.83%) | 59.65% (42.96%/76.34%) |
| giou,cpc | 79.51% (69.07%/89.96%) | 59.52% (42.17%/76.87%) |
| giou,cpc(new) | 79.44% (70.03%/88.84%) | 59.61% (44.95%/74.27%) |
I mean it may be better to use for your dataset: iou_thresh=1.0
While for MS COCO it may be better to use: iou_thresh=0.213
Actually, on my dataset iou_thresh = 0.213 always gets better results. I think a lower iou_thresh allows several anchors to predict the same object, while the original darknet uses only the nearest anchor to predict an object, which limits yolo's ability; so a lower iou_thresh will always get better results, you just need to search for a suitable value for a given dataset.
What methods do you mean?
Some methods use e.g. ignore_thresh = 0.5 & truth_thresh = 0.7, which means:
iou < 0.5: negative sample
0.5 < iou < 0.7: ignored
iou > 0.7: positive sample
I'm not sure whether this is exactly how yolo's ignore_thresh and truth_thresh work.
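For illustration, the dual-threshold assignment above as a small C sketch (generic one-stage logic, not darknet's actual code):
```
/* Illustrative dual-threshold sample assignment. */
enum sample_kind { NEGATIVE, IGNORED, POSITIVE };

static enum sample_kind assign_sample(float iou,
                                      float lower,  /* e.g. 0.5 */
                                      float upper)  /* e.g. 0.7 */
{
    if (iou > upper) return POSITIVE;  /* contributes to obj/class/box loss */
    if (iou > lower) return IGNORED;   /* no loss at all */
    return NEGATIVE;                   /* objectness pushed toward 0 */
}
```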
@Kyuuki93
seems using a loss multiplier on all loss parts can balance the classes' AP slightly, but not improve it
Yes.
Some methods use e.g. ignore_thresh = 0.5 & truth_thresh = 0.7, which means:
iou < 0.5: negative sample
0.5 < iou < 0.7: ignored
iou > 0.7: positive sample
Yes.
truth_thresh is very similar (but not the same) to iou_thresh, so it is strange that you get better results with a higher truth_thresh but a lower iou_thresh.
For MS COCO iou_thresh=0.213 greatly increases accuracy.
@WongKinYiu @Kyuuki93
I am adding a new version of [shortcut]; right now I am re-making the [shortcut] layer for fast BiFPN: https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-569197177
So be careful when using commits from Jan 7, 2020: they may have bugs in the [shortcut] layer.
Before using them, try to train a small model with a [shortcut] layer.
@AlexeyAB
Okay, thanks.
@AlexeyAB ok, thanks
@Kyuuki93 @WongKinYiu I added new version of [shortcut] layer for BiFPN from EfficientDet: https://github.com/AlexeyAB/darknet/issues/4662
So you can try to make Detector with 1 or several BiFPN blocks.
And with 1 ASFF + several BiFPN blocks (yolov3-spp-asff-bifpn-db-it.cfg)
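If I read issue #4662 correctly, a weighted fusion node would then look something like this (a sketch; the relative index is a placeholder, and I am assuming the implemented keys are weights_type/weights_normalization as proposed above):
```
[shortcut]
from = -3                      # placeholder: the other feature map to fuse
weights_type = per_feature     # one learned weight per input
weights_normalization = relu   # BiFPN-style fast normalized fusion
activation = linear
```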
@nyj-ocean
```
[convolutional]
stride=1
size=1
filters=4
# per-pixel softmax across the 4 weight channels
activation=normalize_channels_softmax

[route]       # take weight map 0 (W x H x 1)
layers=-1
group_id=0
groups=4

...

[route]       # take weight map 3 (W x H x 1)
layers=-1
group_id=3
groups=4
```
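Each routed group is then a W x H x 1 weight map; presumably it is multiplied with the corresponding (resized) input branch, e.g. via [scale_channels] (scale_wh=1 for pixel-wise scaling is my assumption about this fork):
```
[scale_channels]
from = -9       # placeholder: the resized branch this weight map scales
scale_wh = 1    # assumption: broadcast the 1-channel map over all channels
```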
@nyj-ocean It is because the 4th branch has 4x (2x2) more outputs, so you should use half as many filters in the conv-layers.
@AlexeyAB
I reduced the value of filters in some [convolutional] layers.
But the FPS of yolov3-4l+ASFF.cfg is still slower than yolov3-4l.cfg.
I am waiting to see whether the final mAP of yolov3-4l+ASFF.cfg increases or not compared with yolov3-4l.cfg.
By the way, I want to try ASFF + several BiFPN; where can I download the yolov3-spp-asff-bifpn-db-it.cfg in https://github.com/AlexeyAB/darknet/issues/4382#issuecomment-572760285?