Yolov3: CSPResNeXt50-PANet-SPP

Created on 9 Dec 2019 · 109 Comments · Source: ultralytics/yolov3

Does this repo support CSPResNeXt50-PANet-SPP? (https://github.com/WongKinYiu/CrossStagePartialNetworks/)

AlexeyAB's support: https://github.com/AlexeyAB/darknet/issues/4406

My tests have found it to be a clear winner over yolov3-spp in terms of mAP and speed.

Stale enhancement

All 109 comments

@LukeAI hi, thanks for the feedback! Off the top of my head I think we may not support some of the layers there (https://github.com/ultralytics/yolov3/issues/631#issuecomment-563224735). Do you have an exact *.cfg file that you saw improvements with?

Is this complementary to Gaussian YOLO? Can they both be used together? So this would be a replacement of the darknet53 backbone with a ResNeXt50 backbone?

It's too bad @WongKinYiu didn't do the modifications directly in this repo :)

@WongKinYiu I'd like to implement this cfg in ultralytics/yolov3:
https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/cfg/csresnext50-panet-spp.cfg

The only new field I see is 'groups' in the convolution layers. Are there other new fields I didn't see?
https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/ff762e58750a2261d64855ac9c3a3ea1a993a24a/cfg/csresnext50-panet-spp.cfg#L383-L390

Do you know where I would slot groups into the PyTorch nn.Conv2d() module?
https://github.com/ultralytics/yolov3/blob/07c1fafba832ef83fca70576e04cef48686f72a1/models.py#L22-L33

@LukeAI @WongKinYiu I've added import of 'groups' into the Conv2d() definition in https://github.com/ultralytics/yolov3/commit/3bfbab7afd5850b4f21b73dd3184374f47eb1d98. Is this sufficient to run CSPResNeXt50-PANet-SPP? @LukeAI can you git pull this repo and try with the cfg?

https://github.com/ultralytics/yolov3/blob/3bfbab7afd5850b4f21b73dd3184374f47eb1d98/models.py#L22-L34

yolov3-spp.cfg has 17 unique fields in its cfg:

17 ['type', 'batch_normalize', 'filters', 'size', 'stride', 'pad', 'activation', 'from', 'layers', 'mask', 'anchors', 'classes', 'num', 'jitter', 'ignore_thresh', 'truth_thresh', 'random']

csresnext50-panet-spp.cfg has 18 unique fields. It seems groups is the only newcomer. Ok, so this repo should now fully support csresnext50-panet-spp.cfg @LukeAI.

18 ['type', 'batch_normalize', 'filters', 'size', 'stride', 'pad', 'activation', 'layers', 'groups', 'from', 'mask', 'anchors', 'classes', 'num', 'jitter', 'ignore_thresh', 'truth_thresh', 'random']
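
For anyone who wants to repeat the field comparison on another cfg, here is a minimal sketch of how the unique keys could be collected. The helper name `unique_cfg_fields` and the inline sample cfg are made up for illustration, and skipping the `[net]` block is an assumption (the lists above contain no `[net]` keys such as `batch`). This is not the repo's own parser.

```python
def unique_cfg_fields(cfg_text):
    """Collect the distinct keys of a Darknet-style cfg, in order of first appearance.

    Each block header contributes a 'type' entry; keys inside [net] are skipped
    (assumption: the lists in this thread exclude the [net] hyper-parameters).
    """
    fields = []
    in_net = False
    for raw in cfg_text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        if line.startswith('['):
            in_net = line == '[net]'
            key = 'type'
        elif '=' in line:
            key = line.split('=', 1)[0].strip()
        else:
            continue
        if not in_net and key not in fields:
            fields.append(key)
    return fields


# Hypothetical three-block cfg, only for demonstration.
sample_cfg = """
[net]
batch=64

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
groups=32
activation=leaky

[shortcut]
from=-4
activation=linear
"""

fields = unique_cfg_fields(sample_cfg)
print(len(fields), fields)
```

Running the same function on the text of two cfg files and taking the set difference of the results would surface a newcomer such as `groups`.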

@WongKinYiu I am getting an error in 3 of the shortcut layers when running csresnext50-panet-spp.cfg. They are trying to add tensors of torch.Size([1, 64, 104, 104]) from -4 layers previous with incorrectly sized tensors of torch.Size([1, 128, 104, 104]).

# models.py line 260:
            elif mtype == 'shortcut':
                try:
                    x = x + layer_outputs[int(mdef['from'])]
                except:
                    print(i, x.shape, layer_outputs[int(mdef['from'])].shape)
                    x = layer_outputs[int(mdef['from'])]

# excepted layers:
# 8 torch.Size([1, 128, 104, 104]) torch.Size([1, 64, 104, 104])
# 12 torch.Size([1, 128, 104, 104]) torch.Size([1, 64, 104, 104])
# 16 torch.Size([1, 128, 104, 104]) torch.Size([1, 64, 104, 104])

Possible Fix

Change filters=128 to filters=64 on csresnext50-panet-spp.cfg lines 79, 110, 141. Then all the shapes combine correctly in this repo. Implemented in https://github.com/ultralytics/yolov3/commit/86588f15796a47faa2deb572da0bc62d15fe6c9c. Not sure if this is a correct modification according to the original cfg designer.

@glenn-jocher Hello,

In PyTorch, zero-pad to the same size, then add.
For example, in PyTorch 0.4:

        if residual_channel != shortcut_channel:
            padding = torch.autograd.Variable(torch.cuda.FloatTensor(batch_size, residual_channel - shortcut_channel, featuremap_size[0], featuremap_size[1]).fill_(0)) 
            out += torch.cat((shortcut, padding), 1)
        else:
            out += shortcut 

@glenn-jocher

For convenience, I changed line 57 to filters=128 instead of filters=64 so that the filter count is consistent. Here are csresnext50c.cfg and csresnext50c.conv.80.

@WongKinYiu @LukeAI @AlexeyAB I trained csresnext50-panet-spp.cfg https://github.com/ultralytics/yolov3/commit/86588f15796a47faa2deb572da0bc62d15fe6c9c against default yolov3-spp.cfg for 27 COCO epochs at 416 (10% of full training), but got worse results at a slower speed. I ran yolov3-spp3.cfg (see https://github.com/ultralytics/yolov3/issues/694) with slightly worse results as well. Commands to reproduce:

git clone https://github.com/ultralytics/yolov3
bash yolov3/data/get_coco_dataset_gdrive.sh
cd yolov3
python3 train.py --epochs 27 --weights '' --cfg yolov3-spp.cfg --name 113
python3 train.py --epochs 27 --weights '' --cfg yolov3-spp3.cfg --name 115
python3 train.py --epochs 27 --weights '' --cfg csresnext50-panet-spp.cfg --name 121

| | mAP@0.5...0.95 | mAP@0.5 | time (hrs) to 27 epochs |
| --- | --- | --- | --- |
| yolov3-spp.cfg | 29.7 | 49.5 | 12.7 |
| yolov3-spp3.cfg | 29.1 | 49.0 | 13.5 |
| csresnext50-panet-spp.cfg https://github.com/ultralytics/yolov3/commit/86588f15796a47faa2deb572da0bc62d15fe6c9c | 25.9 | 44.2 | 28.3 |
| csresnext50-panet-spp.cfg zero-pad TODO? | | | |
| csresnext50c.cfg TODO? | | | |

[image: results]

If you guys have time and are good with PyTorch please feel free to clone this repo and try the https://github.com/WongKinYiu/CrossStagePartialNetworks/ implementations yourself. I'd really like to exploit some of the research there but I don't have time. We are getting excellent results with our baseline yolov3-spp.cfg from scratch (see https://github.com/ultralytics/yolov3#map for the mAP figures), so if the improvements are relative, then I assume they should help here as well.

OK, I'll try to install this repo.

So none of your training uses an ImageNet pre-trained model?

@WongKinYiu ok great! No I don't use any pre-trained model for the initial weights. In an earlier test I found that starting from darknet53.conv.74 produced worse mAP after 273 epochs than starting from randomly initialized weights. For quick results (a day or less of training) yes, the imagenet trained weights will help, but for longer training I found they hurt.

To reproduce:

git clone https://github.com/ultralytics/yolov3
bash yolov3/data/get_coco_dataset_gdrive.sh
cd yolov3
python3 train.py --epochs 273 --weights darknet53.conv.74 --cfg yolov3-spp.cfg --name 41
python3 train.py --epochs 273 --weights '' --cfg yolov3-spp.cfg --name 42

| | mAP@0.5 | mAP@0.5...0.95 |
| -- | -- | -- |
| results41: 416 multiscale to 273 epochs (darknet53.conv.74 start) | 56.8 | 36.2 |
| results42: 416 multiscale to 273 epochs (random start) | 57.5 | 37.1 |

[image: results]

@glenn-jocher

Thanks for your reply.
PANet needs more training epochs to converge compared with YOLOv3.

Are your models trained on a single GPU?

@WongKinYiu yes I typically train them on one 2080Ti or V100, which usually do about 50 epochs per day with the default settings (5 days to train COCO). See https://github.com/ultralytics/yolov3#speed for training speeds. Multi-GPU can also be used.

To get the best mAPs though --multi-scale must be used, which adds about 50% more training time (7-8 days on 1 GPU). This is why I usually test changes on 27 epochs.

Should I try csresnext50c.cfg?

UPDATE: I put it in, but there are new layers again :)

python3 train.py --epochs 27 --weights '' --cfg csresnext50c.cfg --name 122

Warning: Unrecognized Layer Type: avgpool
Warning: Unrecognized Layer Type: softmax

@glenn-jocher

No, if you train from scratch, I think you will get similar results.
PANet has an additional path compared with FPN, so it needs more epochs.

Oh, that is because csresnext50c.cfg is for the ImageNet classifier.

Oh, haha, ok I'll leave csresnext50c.cfg alone then.

@glenn-jocher

Starting training...
Do you use python3 train.py --epochs 273 --weights '' --cfg yolov3-spp.cfg --name 42 to get 40.9 AP?

@WongKinYiu the exact training command to get to 40.9 AP with one GPU is:

python3 train.py --weights '' --epochs 273 --batch 16 --accumulate 4 --multi --pre

If you use multi-GPU though you will have more memory available, so you can use a larger --batch --accumulate combination to get to 64 like 32x2, or even 64x1:

python3 train.py --weights '' --epochs 273 --batch 32 --accumulate 2 --multi --pre

yolov3-spp.cfg is the default cfg, so you don't need to supply it above (but you can). The --pre argument performs one epoch of biasing the yolo output neurons before training starts. See https://github.com/ultralytics/yolov3/issues/460
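
To make the 16x4 / 32x2 / 64x1 arithmetic explicit, here is a small sketch. It assumes the usual meaning of gradient accumulation (gradients from `accumulate` consecutive batches are summed before one optimizer step); the helper names are hypothetical and not part of train.py.

```python
def effective_batch_size(batch, accumulate):
    """Images contributing to each optimizer step when gradients are accumulated."""
    return batch * accumulate


def combos_for(target, max_batch):
    """All (batch, accumulate) pairs whose product is `target`, with batch <= max_batch."""
    return [(b, target // b) for b in range(1, max_batch + 1) if target % b == 0]


print(effective_batch_size(16, 4))  # single-GPU setting from the command above
print(combos_for(64, 64)[-3:])      # the three largest per-step batch sizes
```

With less GPU memory, `max_batch` shrinks and the remaining pairs simply trade a smaller `--batch` for a larger `--accumulate`.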

My GPU RAM is not enough even when I set --batch 16 --accumulate 4 --multi --pre.
I will borrow another GPU for training.

@glenn-jocher

I trained csresnext50-panet-spp.cfg 86588f1 against default yolov3-spp.cfg for 27 COCO epochs at 416 (10% of full training), but got worse results at a slower speed.

This is weird, did you measure speed on GPU? And what FPS/ms did you get for SPP vs CSP?

Have you tried converting an already trained on Darknet model CSPResNeXt50-PANet-SPP (cfg / weights) to ultralytics (pytorch), and did you get better mAP and better speed?

Or does this inconsistency interfere with this conversion? https://github.com/ultralytics/yolov3/issues/698#issuecomment-563466452

@AlexeyAB Hello,

I think the slower speed refers to training speed.
Training group convolutions is slower than training conventional convolutions.

@AlexeyAB @WongKinYiu if I run test.py on the two trained models, this applies inference (and NMS) on the 5000 images in 5k.txt. This takes 138 seconds with yolov3-spp.cfg on a P4 GPU, and 139 seconds with csresnext50-panet-spp.cfg. Ah interesting, so the inference speed is nearly identical, but the training speed takes twice as long.

So is the CSPResNeXt50-PANet-SPP operational? And does it provide better results? I am looking more into it right now. And reading the article.

@WongKinYiu I forgot to mention, you should install Nvidia Apex for mixed precision training with this repo. It increases speed substantially and reduces memory requirements substantially. Once installed correctly you should see this:
[screenshot]

See https://github.com/NVIDIA/apex#quick-start

@glenn-jocher

Yes, I have installed Apex.
I am now training with --multi at scales 320~608.

@glenn-jocher Hello,

I would like to know why --pre needs a little more GPU memory than training without it.

@WongKinYiu ahhh this is interesting, I had not realized that. There is a memory leak when invoking train.py repeatedly, which is very obvious when running hyperparameter evolution as train.py is called repeatedly in a for loop https://github.com/ultralytics/yolov3/issues/392#issuecomment-565475680, but I did not realize --pre also causes this. This makes sense though, as it is calling train.py once to train the output biases for one epoch, then calling it again for actual training. How much extra memory is this using?

Is this complementary to Gaussian YOLO, can they both be used together?

I independently found good improvements ~+3mAP with Gaussian-Yolo and also cspresnext50-pan-spp vs. yolov3-spp - but I got pretty bad results when I tried combining them (-10mAP) - this may be because:
(1) I made a mistake
(2) The features that are useful to gaussian-yolo are quite different to the features that are useful for yolo so training a network with a gaussian-yolo head from pretrained weights from a non-gaussian head gives poor results
(3) I need to tune the hyper-parameters more (I tried 3 different learning rates but no dice)
(4) for some subtle reason these features just don't play well together - seems unlikely to me though.

@WongKinYiu have you tried Gaussian with cspresnext-pan-spp? Do you have any thoughts or results?

I think you need more iterations for warmup when combine cspresnext50-pan-spp with gaussian-yolo (I have no gpus to test it currently).

In my experiments, when combining cspresnext50-pan-spp with gaussian-yolo, the precision drops and recall improves. The strange thing is that the loss becomes larger after 200k epochs.

@glenn-jocher @AlexeyAB

I did some hyper-parameter optimization, including iou_thresh, ciou, ....
The new results from training with darknet (AlexeyAB) are as follows:

| model | size | AP | AP50 | AP75 |
| :-- | :-: | :-: | :-: | :-: |
| CSPResNeXt50-PANet-SPP | 512x512 | 42.4 | 64.4 | 45.9 |
| CSPResNeXt50-PANet-SPP | 608x608 | 43.2 | 65.4 | 47.0 |

@WongKinYiu ah those are very very good!!

  1. Did you train on COCO for 500k iterations?
  2. How does the training and inference speed compare to yolov3-spp?
  3. Can you pass the cfg for those so I can try to train on ultralytics?

@glenn-jocher Hello,

@WongKinYiu @AlexeyAB I tried to run detections with the trained weights and csresnext50-panet-spp-original-optimal.cfg as below, but I get the same issue regarding padding as before https://github.com/ultralytics/yolov3/issues/698#issuecomment-563521134. I tried to pad these shortcut layers with zeros at dimension 1, but then another error appears which seems to do with the groupings. Do you have any idea what this might be?

Namespace(cfg='cfg/csresnext50-panet-spp-original-optimal.cfg', classes=None, conf_thres=0.3, device='', fourcc='mp4v', half=False, img_size=416, iou_thres=0.5, names='data/coco.names', output='output', save_txt=False, source='data/samples', view_img=False, weights='csresnext50-panet-spp-original-optimal_final.weights')
Using CPU

image 1/2 data/samples/bus.jpg: 8 torch.Size([1, 128, 104, 80]) torch.Size([1, 64, 104, 80])
Traceback (most recent call last):
  File "/Users/glennjocher/PycharmProjects/yolov3/detect.py", line 176, in <module>
    detect()
  File "/Users/glennjocher/PycharmProjects/yolov3/detect.py", line 83, in detect
    pred = model(img)[0]
  File "/Users/glennjocher/opt/anaconda3/envs/pn1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/glennjocher/PycharmProjects/yolov3/models.py", line 248, in forward
    x = module(x)
  File "/Users/glennjocher/opt/anaconda3/envs/pn1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/glennjocher/opt/anaconda3/envs/pn1/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/Users/glennjocher/opt/anaconda3/envs/pn1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/glennjocher/opt/anaconda3/envs/pn1/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/Users/glennjocher/opt/anaconda3/envs/pn1/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 128 64 1 1, expected input[1, 128, 104, 80] to have 64 channels, but got 128 channels instead

@glenn-jocher Hello,

I think it is because the output shape of the shortcut layer in ultralytics is the same as the from layer, whereas the output shape of the shortcut layer in darknet is the same as the -1 layer.
I have already fixed the zero padding problem in ultralytics; I will share the code with you after I finish my breakfast.
Some of the CSPNe(X)t-based models being trained with ultralytics will finish next week.

But maybe the group convolutions of ultralytics and darknet are different.
I have tried converting cspdensenet to .pt, and it works.
However, when I convert cspresnext to .pt, it detects nothing.

@WongKinYiu the error is being produced in the first conv2d layer after the first shortcut layer. For a batch-size 1 image of size 416x320 the shapes look like this going into the error:

0 convolutional torch.Size([1, 3, 416, 320])
1 maxpool torch.Size([1, 64, 208, 160])
2 convolutional torch.Size([1, 64, 104, 80])
3 route torch.Size([1, 128, 104, 80])
4 convolutional torch.Size([1, 64, 104, 80])
5 convolutional torch.Size([1, 64, 104, 80])
6 convolutional torch.Size([1, 128, 104, 80])
7 convolutional torch.Size([1, 128, 104, 80])
8 shortcut torch.Size([1, 128, 104, 80])
9 convolutional torch.Size([1, 128, 104, 80])

update: I used this code to temporarily patch the shape issue in models.py L261:

            elif mtype == 'shortcut':
                b = layer_outputs[int(mdef['from'])]
                if b.shape == x.shape:
                    x = x + b
                else:
                    pad = max(x.shape[i] - b.shape[i] for i in range(len(x.shape))) // 2
                    x = x + F.pad(b, pad=[0, 0, 0, 0, pad, pad])
                    print(i, x.shape, b.shape)

@glenn-jocher

Yes, because output_filters tells the next convolution that the input has 64 channels in this case.
https://github.com/ultralytics/yolov3/blob/master/models.py#L64

Ah, yes, I understand now. Ok, this is going to need a bit of cleanup before csresnext50-panet-spp-original-optimal.cfg runs correctly, but I think I can do this in the next couple days. One important question though is, when I am shortcutting a size [1, 64, 104, 80] to a size [1, 128, 104, 80] as in the example, do I pad dimension 1 with 64 zeros at the end, 64 zeros at the beginning, or 32 zeros before and after (reflect pad 32) to bring the smaller size up to [1, 128, 104, 80]?

@glenn-jocher

in darknet implementation, the equivalent is padding at the end.
https://github.com/AlexeyAB/darknet/blob/master/src/blas.c#L83-L94
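
To illustrate what "padding at the end" means for a shortcut between layers with different channel counts, here is a dependency-free sketch on tiny `[C][H][W]` nested lists. The function names are made up for this example; it only mirrors the behaviour described above (the extra channels of the -1 layer pass through unchanged) and is not the darknet or ultralytics code.

```python
def zero_pad_channels_end(feat, target_channels):
    """Append all-zero channels to a [C][H][W] nested list until it has target_channels."""
    height, width = len(feat[0]), len(feat[0][0])
    missing = target_channels - len(feat)
    zeros = [[[0.0] * width for _ in range(height)] for _ in range(missing)]
    return feat + zeros


def shortcut_add(x, src):
    """Add `src` (the 'from' layer) to `x` (the -1 layer), zero-padding src's channels at the end."""
    if len(src) > len(x):
        raise NotImplementedError("'from' layer wider than the -1 layer is not handled here")
    src = zero_pad_channels_end(src, len(x))
    return [
        [[a + b for a, b in zip(x_row, s_row)] for x_row, s_row in zip(x_ch, s_ch)]
        for x_ch, s_ch in zip(x, src)
    ]


# Toy example: x has 4 channels, the 'from' layer has 2, both 1x2 spatially.
x = [[[1.0, 1.0]] for _ in range(4)]
src = [[[10.0, 10.0]] for _ in range(2)]
out = shortcut_add(x, src)
print(out)
```

On real NCHW tensors the same thing should be expressible as `x + F.pad(src, [0, 0, 0, 0, 0, x.shape[1] - src.shape[1]])`, since `F.pad` takes its pad pairs starting from the last dimension (width, then height, then channels).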

@glenn-jocher

Your code is cleaner than mine; I am just putting my code here for your reference.

Change filters from the from layer to the -1 layer.

        elif mdef['type'] == 'shortcut':  # nn.Sequential() placeholder for 'shortcut' layer
            filters = output_filters[i]
            layer = int(mdef['from'])
            routs.extend([i + layer if layer < 0 else layer])

Currently I just use 2x2 pooling when w and h are different; it should be
https://github.com/AlexeyAB/darknet/blob/master/src/blas.c#L73-L74
I have not implemented the case where the from layer has more channels than the -1 layer.

            elif mtype == 'shortcut':
                src = layer_outputs[int(mdef['from'])]  # output of the 'from' layer
                s1, s2 = x.size(1), src.size(1)  # channels of the -1 layer and the 'from' layer
                s3, s4 = x.size(2), src.size(2)  # heights of the -1 layer and the 'from' layer

                if s3 == s4:
                    ad = src
                else:
                    # only 2 by 2 currently
                    ad = F.max_pool2d(src, kernel_size=2, stride=2)
                if s1 == s2:
                    x = x + ad  # use the (possibly pooled) tensor, not the raw 'from' output
                elif s1 > s2:
                    # zero-pad the missing channels at the end, on the same device and dtype as ad
                    padding = torch.zeros(ad.size(0), s1 - s2, ad.size(2), ad.size(3), dtype=ad.dtype, device=ad.device)
                    x = x + torch.cat((ad, padding), 1)
                else:
                    # not yet implemented
                    pass

@WongKinYiu would you kindly consider sharing the trained weights for the hyper-parameter optimised config? I'd love to try it out! Can post my ablation results here.

@LukeAI Hello,

The cfg/weights are already on GitHub.
https://github.com/WongKinYiu/CrossStagePartialNetworks
Just download and try them.

Hi. What about csresnext50-elastic.cfg? As far as I can see, there are no classes, filters, or anchors inside. Can you make a cfg file for yolov3? Thanks!

@hwijune did it work with the standard yolov3 model? I tried, but I am getting an error when I try to use the pruned model:
= torch.cat([layer_outputs[i] for i in layers], 1) RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 94 and 190 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

It works normally.

https://github.com/erikguo/yolov3 <== this repo
I checked the -3 mAP in size 608 ( 0.7 overall_ratio)

previous Yolov3 running test.

https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/cfg/csresnet50-panet-spp.cfg << I will run a pruning test on this cfg, which has no groups param.

@WongKinYiu What is the current situation for now? Can this repo use CSP backbone?
@glenn-jocher I am looking for PyTorch implementation because I don't want to port debugging/visualization code to CPP.
Like GradCam, like these: https://github.com/utkuozbulak/pytorch-cnn-visualizations

pytorch results are posted on https://github.com/WongKinYiu/CrossStagePartialNetworks/tree/pytorch

Thank you for your hard and great work! I have a question: why can the CSPResNeXt50-PANet-SPP in this project only get 59.5 mAP while AlexeyAB's version gets 64.4 mAP? Maybe there are still some small flaws in this project? @WongKinYiu

64.4 is obtained by CSPResNeXt50-PANet-SPP-optimal, please check https://github.com/WongKinYiu/CrossStagePartialNetworks

CSPResNeXt50-PANet-SPP and CSPResNeXt50-PANet-SPP-optimal are the same model with different hyper-parameters.
We have not implemented all of the optimizations in PyTorch.

So what if I convert weights trained with Alexey's darknet and use PyTorch YOLO for the forward pass?

The group convolutions of darknet and PyTorch seem different.

@clw5180 I'm not sure what the cause of the discrepancy is. It could be differences in the group convolutions as @WongKinYiu mentioned. Note that yolov3-spp.cfg trains to much higher mAP with this repo than with darknet, so actual technical problems are very unlikely. See https://github.com/ultralytics/yolov3#map

@isgursoy @WongKinYiu @clw5180 @hwijune @Spectra456 the current status as far as I know is that there is a slight difference in implementing some operations in the csresnext50-panet-spp.cfg file in this repo compared to darknet, such that simply running the training command below fails:

python3 train.py --cfg csresnext50-panet-spp.cfg

The fix is essentially described here: https://github.com/ultralytics/yolov3/issues/698#issuecomment-570441779, I just need to implement and push it. I'll try to get this done in the next couple days, and then the next step would be to verify the cfg functionality by comparing mAP here using test.py.

Once that's done we can try to train from scratch and perhaps look at balancing the 3 losses or evolving the hyperparameters for this particular cfg. But yes it's a bit frustrating and a mystery why the cfg trains so much higher on darknet at the moment.

Darknet uses grouped-convolutional in the same way as nVidia cuDNN library, so it should be the same as in Pytorch.

hi @WongKinYiu

origin yolov3 mask order [yolo] 6,7,8 [yolo] 3,4,5 [yolo] 0,1,2
cspnet mask order [yolo] 0,1,2 [yolo] 3,4,5 [yolo] 6,7,8

Is there any difference?

No, there is no difference.
It is because the order of the pyramid scales in FPN and PANet is different.
[image]

can't change the order, right?

[yolo] 0,1,2 [yolo] 3,4,5 [yolo] 6,7,8 >>>>> [yolo] 6,7,8 [yolo] 3,4,5 [yolo] 0,1,2

Yes, because the anchor size should match the grid size.
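
To see why the mask order has to follow the head order, here is a small sketch that pairs each [yolo] head with its stride and grid size. The anchors are the nine default YOLOv3 COCO anchors; the stride order per head (FPN emits the coarsest scale first, PANet's final bottom-up path emits the finest first) is stated here as an assumption, and `describe_heads` is a hypothetical helper.

```python
# Default YOLOv3 COCO anchors (width, height) in pixels, smallest to largest.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]


def describe_heads(mask_order, strides, img_size=416):
    """Pair each [yolo] head's anchor mask with its stride and grid size."""
    heads = []
    for mask, stride in zip(mask_order, strides):
        heads.append({
            'mask': mask,
            'stride': stride,
            'grid': img_size // stride,
            'anchors': [ANCHORS[i] for i in mask],
        })
    return heads


# Assumed head order: YOLOv3 (FPN) goes coarse to fine, PANet goes fine to coarse.
fpn = describe_heads([(6, 7, 8), (3, 4, 5), (0, 1, 2)], strides=[32, 16, 8])
panet = describe_heads([(0, 1, 2), (3, 4, 5), (6, 7, 8)], strides=[8, 16, 32])

for name, heads in (('FPN', fpn), ('PANet', panet)):
    for h in heads:
        print(name, h['mask'], 'stride', h['stride'], 'grid', h['grid'], h['anchors'])
```

In both layouts the largest anchors end up on the coarsest grid; swapping the masks without swapping the heads would put the 373x326 anchor on the stride-8 grid.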

@WongKinYiu I see in the https://github.com/ultralytics/yolov3/issues/698#issuecomment-585209887 image YOLOv3 corresponds to the FPN architecture (with 4 output layers), with the last output for the smallest objects. There are basically two steps: downsample, then upsample (with crosslinks).

In the PANet example, are there 3 steps? downsample, upsample, downsample (with crosslinks from step 2 to 3)? Does this improve the mAP typically at the expense of more weights/computation?

@glenn-jocher Hello,

typically yes.

But there are many different methods that can be used to avoid that, for example, BiFPN.
[image]
[image]

@WongKinYiu ah very interesting! Figure 2 shows a good summary of the differences. Have you tried to create a *.cfg for efficientnet, or for a BiFPN type network? The results on COCO seem to show substantial improvement over what we are doing.

[screenshot]

@glenn-jocher Hello,

I do not build such cfg file, but someone does. https://github.com/AlexeyAB/darknet/issues/4662

@WongKinYiu I see. Have you tried the 'Simplified PANet' that they show with CSPResNeXt50-PANet-SPP?

I did a brief search online for EfficientDet implementations but I could not find any good ones. The paper does not supply code, and 3rd party implementations don't show very good or reliable mAPs.

Would you be interested in trying to implement a BiFPN network?

@WongKinYiu ah I had another question. Why are the group convolutions necessary in CSPResNeXt50-PANet-SPP?

Have you tried using the basic Conv2d() instead, and were you able to determine performance improvements when moving from the basic convolutions to the group convolutions?

CSPResNeXt50 has too many filters (outputs), so without groups it will take a very large amount of memory, and you would have to decrease the mini_batch size significantly. So it is better to use groups=4...16 https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-584406057

If there is no group convolution, it is a CSPResNet50-PANet-SPP.
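
As a rough illustration of what groups buys, the sketch below counts the weights of a single convolution with and without grouping. It uses the standard grouped-convolution weight shape (out_channels, in_channels / groups, k, k); the 512-channel layer is a made-up example, and weights are only one part of the memory cost mentioned above.

```python
def conv_weight_count(c_in, c_out, kernel_size, groups=1):
    """Weights in a 2D convolution (no bias): each filter sees c_in / groups input channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return (c_in // groups) * c_out * kernel_size * kernel_size


dense = conv_weight_count(512, 512, 3)               # ordinary convolution
grouped = conv_weight_count(512, 512, 3, groups=32)  # ResNeXt-style grouped convolution
print(dense, grouped, dense // grouped)
```

The weight count (and the multiply-adds per output position) drops by a factor equal to `groups`, which is why a wide network like this stays affordable with grouped layers.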

@AlexeyAB @glenn-jocher

Hello, I think the BiFPN implemented in darknet is good enough.
csdarknet53-panet-spp-bifpn.txt

| model | size | ap | ap50 | ap75 |
| :-: | :-: | :-: | :-: | :-: |
| CSPDarknet53-BiFPN | 512x512 | 38.4 | 62.3 | 41.3 |

@WongKinYiu Hi,

But even BiFPN-optimal is worse than PANet-not-optimal, while optimal should give ~+4.4% extra AP: https://github.com/WongKinYiu/CrossStagePartialNetworks#gpu-real-time-models

model | size | ap | ap50 | ap75
-- | -- | -- | -- | --
CSPDarknet53-BiFPN (optimal) | 512x512 | 38.4 | 62.3 | 41.3
CSPDarknet53-PANet-SPP (not optimal) | 512x512 | 38.7 | 61.3 | 41.7

@AlexeyAB Hello,

The anchor sizes of CSPDarknet53-BiFPN are not optimized because my GPU RAM is insufficient to train with the same settings as CSPResNeXt50-PANet-SPP (optimal).

@WongKinYiu

What do you mean?
Memory consumption doesn't depend on anchor size.

Do you mean that you trained?

  • CSPResNeXt50-PANet-SPP (optimal) - with width=512 height=512 subdivisions=8 mosaic=1 learning_rate=0.00261
  • CSPDarknet53-BiFPN (optimal) - with width=416 height=416 subdivisions=16 mosaic=1 learning_rate=0.001

Or did you train CSPResNeXt50-PANet-SPP (optimal) - with width=416 height=416 ?

the anchor size of CSPResNeXt50-PANet-SPP is designed for 416x416.
(trained with width=416 height=416)

the anchor size of CSPResNeXt50-PANet-SPP (optimal) is optimized for 512x512.
(trained with width=512 height=512)

https://github.com/ultralytics/yolov3/issues/698#issuecomment-586271292.
(trained with width=416 height=416 because there is not enough memory to train with width=512 height=512)

@WongKinYiu Thanks! So you trained CSPDarknet53 with lower network resolution than CSPResNext50.

But the table compares two CSPDarknet53 models, not CSPResNext50:

model | size | ap | ap50 | ap75
-- | -- | -- | -- | --
CSPDarknet53-BiFPN (optimal) | 512x512 | 38.4 | 62.3 | 41.3
CSPDarknet53-PANet-SPP (not optimal) | 512x512 | 38.7 | 61.3 | 41.7

Are both these models trained with width=416 height=416 subdivisions=16 ?

Or as I see:

both of these two models are trained with width=416 height=416.
the setting of CSPDarknet53-BiFPN (optimal) is as you see.
i am not sure about the subdivision of CSPDarknet53-PANet-SPP (not optimal), but yes mosaic=0.

in https://github.com/WongKinYiu/CrossStagePartialNetworks#gpu-real-time-models
CSPDarknet53-PANet-SPP (not optimal) and CSPResNet50-PANet-SPP (not optimal) were not trained by me.

@WongKinYiu

both of these two models are trained with width=416 height=416.

So from this table we can't say which is better, BiFPN or PAN?

model | size | ap | ap50 | ap75
-- | -- | -- | -- | --
CSPDarknet53 BiFPN (optimal) trained 416x416 subdivisions=16 | 512x512 | 38.4 | 62.3 | 41.3
CSPDarknet53 PANet-SPP (not optimal) trained 416x416 subdivisions=4 or 8 or 16 | 512x512 | 38.7 | 61.3 | 41.7


  • So we can say that BiFPN at least works.

  • What is the current avg-loss of ASFF, can we say that it at least works?

currently 245k epoch, 10.5 loss.

@glenn-jocher @AlexeyAB update

| Model | Size | AP | AP50 | AP75 |
| :-: | :-: | :-: | :-: | :-: |
| CSPDarknet53 BiFPN (optimal) trained 416x416 subdivisions=16 | 512x512 | 38.4 | 62.3 | 41.3 |
| CSPDarknet53 PANet-SPP (optimal) trained 416x416 subdivisions=16 | 512x512 | 41.6 | 64.1 | 45.0 |

@WongKinYiu @glenn-jocher So the previous version of BiFPN is bad. Try the new BiFPN version: https://github.com/AlexeyAB/darknet/issues/4662#issuecomment-587490873

@WongKinYiu wow great! What's the difference between the not-optimal and optimal versions of CSPDarknet53 PANet-SPP? The optimal version shows +3 mAP improvement, what differences did you make to get this?

not-optimal: all hyper-parameters are same as default yolov3.
optimal: with ciou and your genetic algorithm, mosaic augmentation, scale sensitivity, iou threshold. (see [net] and [yolo] in cfg file https://github.com/ultralytics/yolov3/issues/698#issuecomment-586271292)

@glenn-jocher @WongKinYiu

Why does CSPDarknet53s-PANet-SPP Ultralytics have a lower AP than CSPDarknet53 PANet-SPP Darknet?

Model | Size | AP | AP50 | AP75 | URL | cfg
-- | -- | -- | -- | -- | -- | --
YOLOv3-SPP (baseline) Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 39.7 | 60.5 | 42.2 | url | cfg
CSPDarknet53s-PANet-SPP Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 40.0 | 60.4 | 42.9 | url | cfg
CSPDarknet53 PANet-SPP Darknet (optimal) trained 416x416 subdivisions=16 | 512x512 | 41.6 | 64.1 | 45.0 | url | cfg

Both use:

  • optimal hyper parameters
  • activation=leaky
  • mosaic=1
  • CSPDarknet53-backbone (without grouped-conv as in Yolov3-spp) (CSPDarknet53s and CSPDarknet53 are very similar)

The difference is only -

  1. Darknet uses pre-trained classifier weights, while Ultralytics doesn't
  2. Darknet uses CIoU-loss while Ultralytics uses GIoU-loss?

What am I missing?

@AlexeyAB I don't know, this is a very good question. The gap is very large in mAP. I think what I should do is try to test mAP with CSPDarknet53 PANet-SPP Darknet first, to establish that the cfg loads the model correctly. I'll do that today.

Yes it is true I don't use any pretrained weights (I saw slightly worse results with darknet53.conv.74). I tried CIoU loss and did not see any added benefit compared to GIoU.

I used the linked urls and weights, and tested at 512 on my own with the following commands. Results are slightly higher than the earlier table. I was not able to test the last one, as there were new cfg entries it did not recognize. I will comment these and try again.

git clone https://github.com/ultralytics/yolov3
cd yolov3
python3 test.py --img 512 --weights ... --cfg ...

Model | Size | AP | AP50 | AP75 | URL | cfg
-- | -- | -- | -- | -- | -- | --
YOLOv3-SPP (baseline) Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 40.2 | 61.3 | - | url | cfg
CSPDarknet53s-PANet-SPP Ultralytics (optimal) trained 416x416 -batch=16 | 512x512 | 40.7 | 60.7 | - | url | cfg
CSPDarknet53 PANet-SPP Darknet (optimal) trained 416x416 subdivisions=16 | 512x512 | - | - | - | url | cfg

I am on a business trip; I will provide some training info for YOLOv3-SPP (baseline) Ultralytics and CSPDarknet53s-PANet-SPP Ultralytics once I am back in the office.

@WongKinYiu ok great! I got the last darknet model to run, but mAPs came back as 0.0. Note that I modified my default test nms --iou-thres from 0.5 to 0.6, as this produces a better balance of mAP@0.5:0.95 (best at --iou-thres 0.7) and mAP@0.5 (best at --iou-thres 0.5).

Also note the latest yolov3-spp.cfg baseline trains to 41.9/61.8 at 608 with the default settings. The training commands to reproduce this are here. The two separate --img-size values are the train img-size and the test img-size. Multi-scale train img sizes using this command will be 288 - 640.

python3 train.py --data coco2014.data --img-size 416 608 --epochs 273 --batch 16 --accum 4 --weights '' --device 0 --cfg yolov3-spp.cfg --multi
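The 288 - 640 range quoted for `--img-size 416` with `--multi` can be reproduced with a small sketch. The rule used here (scale the training size by 1/1.5 and 1.5, then round to a multiple of 32) is an assumption chosen because it matches those two numbers; the function names and the `factor` parameter are hypothetical and not taken from train.py.

```python
import math
import random

GRID = 32  # assumed maximum stride; training sizes are kept as multiples of it


def multiscale_range(img_size, factor=1.5, grid=GRID):
    """Return (min, max) training sizes under the assumed 1/factor..factor rule."""
    cells = img_size / grid
    lo = math.floor(cells / factor + 0.5) * grid  # round half up, then back to pixels
    hi = math.floor(cells * factor + 0.5) * grid
    return lo, hi


def sample_size(img_size, rng=random):
    """Draw one training resolution, e.g. once per batch."""
    lo, hi = multiscale_range(img_size)
    return rng.randrange(lo // GRID, hi // GRID + 1) * GRID


print(multiscale_range(416))  # (288, 640)
print([sample_size(416) for _ in range(5)])  # five random multiples of 32
```

Every sampled size stays divisible by 32, so each output grid keeps an integer width and height.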

@glenn-jocher

> Note that I modified my default test nms --iou-thres from 0.5 to 0.6, as this produces a better balance of mAP@0.5:0.95 (best at --iou-thres 0.7) and mAP@0.5 (best at --iou-thres 0.5).

Yes, I know.
However, for the competition, we should use the same IoU threshold for both mAP@0.5:0.95 and mAP@0.5.

> Also note the latest yolov3-spp.cfg baseline trains to 41.9/61.8 with the default settings. The training commands to reproduce this are here. The two separate --img-size values are the train img-size and the test img-size. Multi-scale train img sizes using this command will be 288 - 640.

Thanks, I just used the default settings of the repo version that I trained the model with. As I remember, that version gets about 40.9 mAP@0.5:0.95 in your report. By the way, all of my results are obtained on the test-dev set and your results are obtained on the min-val set.

@WongKinYiu ah test-dev set could be a difference too then!

Well it seems some differences remain as the ultralytics repo can't load the best performing darknet CSPDarknet53s-PANet-SPP model then. These differences must be the source of the problem I think.

@glenn-jocher

> Also note the latest yolov3-spp.cfg baseline trains to 41.9/61.8 at 608 with the default settings.

What is the difference between your training and this yolov3-spp.cfg https://github.com/WongKinYiu/CrossStagePartialNetworks/tree/pytorch#ms-coco ?
Why such difference?

@AlexeyAB

I use this repo to train: https://github.com/ultralytics/yolov3/tree/a6f87a28e7595e71752583fb41340f9d1105d75f
There have been many improvements to ultralytics since then.

@WongKinYiu @glenn-jocher So, I want to know what improvements have been made?

Hmmm well lots of small day to day changes. If I use the github /compare it doesn't show the date of that commit, but it shows that there are 400 commits since then, with many modifications:
https://github.com/ultralytics/yolov3/compare/a6f87a28e7595e71752583fb41340f9d1105d75f...master#diff-04c6e90faac2675aa89e2176d2eec7d8

The README from then was showing 40.0/60.9 mAP, which is similar to what @WongKinYiu was seeing, vs today's README which shows 41.9/61.8.

The improvements are over many different parts, such as the NMS, which now uses multi-label, the augmentation, which has been set to zero, the loss function reduction, which I returned to mean() instead of sum(), the cosine scheduler implementation, the increase in the LR to 0.01 after cos was implemented, and maybe a few other tiny things. The architecture itself is the same (yolov3-spp.cfg).
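One of the changes listed above is the cosine scheduler with a starting LR of 0.01. The following is a minimal sketch of generic cosine annealing, not the repo's exact schedule: the `final_fraction` floor is a hypothetical parameter added for illustration, while the 0.01 starting LR and the 273 epochs come from this thread.

```python
import math


def cosine_lr(epoch, epochs, lr0=0.01, final_fraction=0.05):
    """Cosine-annealed LR that decays from lr0 to lr0 * final_fraction.

    final_fraction is an illustrative assumption, not a value from the repo.
    """
    cos_term = (1 + math.cos(math.pi * epoch / epochs)) / 2  # goes from 1 to 0
    return lr0 * (final_fraction + (1 - final_fraction) * cos_term)


epochs = 273
for e in (0, 68, 136, 204, 273):
    print(f"epoch {e:>3}: lr = {cosine_lr(e, epochs):.5f}")
```

The schedule starts at lr0, falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.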

Actually this is an important point. A lot of papers today are showing very outdated comparisons to YOLOv3, i.e. showing 33 mAP@0.5:0.95 like the EfficientDet paper, with a GPU latency of 51ms. The reality is the most recent YOLOv3-SPP model I trained is at 42.1 mAP@0.5:0.95, with a GPU latency of 12.8ms https://github.com/ultralytics/yolov3/issues/679#issuecomment-597219021, which puts it far better than their own D0-D2 models in both speed and mAP. I'm not sure how best to get that message out.

@glenn-jocher
So the main difference:

  1. NMS uses multi-label
  2. the augmentation, which has been set to zero - what does it mean, did you disable data augmentation?
  3. the loss function reduction, which I returned to mean() instead of sum() - are all the true-positive loss values averaged new_loss = sum_for_i( loss_obj, loss_cls, loss_bbox) / count ?

@AlexeyAB

Yes NMS uses multi-label now, which bumped up mAP about +0.3. Yes spatial augmentation seemed to hurt training, so I set it to zero, but left HSV augmentation on:

       'hsv_h': 0.0138,  # image HSV-Hue augmentation (fraction)
       'hsv_s': 0.678,  # image HSV-Saturation augmentation (fraction)
       'hsv_v': 0.36,  # image HSV-Value augmentation (fraction)
       'degrees': 1.98 * 0,  # image rotation (+/- deg)
       'translate': 0.05 * 0,  # image translation (+/- fraction)
       'scale': 0.05 * 0,  # image scale (+/- gain)
       'shear': 0.641 * 0}  # image shear (+/- deg)
  3. The loss is back to its original form, using the PyTorch defaults, which is for example for the 3 yolo layers: loss_giou = (giou_1.mean() + giou_2.mean() + giou_3.mean()).sum()

I'm really hoping we might be able to merge the YOLO outputs some day so I can do away with this uncertainty in how to combine the losses from the different layers. ASFF seems to be an interesting step in that direction.
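The uncertainty about combining per-layer losses can be made concrete with a small sketch. It compares the gradient weight each matched target receives under three reductions: mean per layer then sum across layers (the form quoted above), one global mean, and a plain sum. The per-layer target counts are hypothetical and the function names are illustrative.

```python
def weights_mean_per_layer(counts):
    """d(loss)/d(l_i) when loss = sum over layers of mean(l) within each layer."""
    return [1.0 / n for n in counts]


def weights_global_mean(counts):
    """d(loss)/d(l_i) when loss = mean(l) over all targets of all layers."""
    total = sum(counts)
    return [1.0 / total for _ in counts]


def weights_sum(counts):
    """d(loss)/d(l_i) when loss = sum(l) over all targets."""
    return [1.0 for _ in counts]


counts = [5, 20, 75]  # hypothetical number of matched targets in each yolo layer
print("mean per layer:", weights_mean_per_layer(counts))
print("global mean:   ", weights_global_mean(counts))
print("sum:           ", weights_sum(counts))
```

With the per-layer mean, a target in a sparsely populated layer gets a larger gradient than one in a crowded layer, while the other two reductions treat every target equally. Merging the YOLO outputs into a single tensor would remove that layer-dependent weighting.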

@AlexeyAB ah also another change I forgot to mention was I changed multi-scale to change the resolution every batch now, instead of every 10 batches before. This seemed to smooth the results a bit, epoch to epoch.

@WongKinYiu yes they look super similar to each other unfortunately. I'm not sure why we aren't seeing the same gains as the darknet training. It must have to do with the grouped convolutions I think.

@glenn-jocher

> Yes NMS uses multi-label now, which bumped up mAP about +0.3.

Does it currently work in such a way?
if there are 2 bboxes with IoU > iou_nms

  1. class1_prob = 0.5, class2_prob = 0.7
  2. class1_prob = 0.7, class2_prob = 0.5

Then it will remove class1_prob = 0.5 and class2_prob = 0.5, and will leave:

  1. class2_prob = 0.7
  2. class1_prob = 0.7
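The scenario in this question can be sketched in plain Python. The sketch assumes multi-label NMS means every (box, class) pair above a confidence threshold becomes its own candidate and suppression runs per class; whether the repo behaves exactly this way is what the question asks, so treat this as an illustration only. The box coordinates and thresholds are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def multi_label_nms(boxes, class_probs, conf_thres=0.1, iou_thres=0.6):
    """Class-wise greedy NMS over (prob, box_index, class_index) candidates."""
    candidates = [(p, i, c)
                  for i, probs in enumerate(class_probs)
                  for c, p in enumerate(probs) if p > conf_thres]
    candidates.sort(reverse=True)  # highest probability first
    kept = []
    for p, i, c in candidates:
        # Suppress only if a kept candidate of the SAME class overlaps too much.
        if all(kc != c or iou(boxes[i], boxes[ki]) <= iou_thres
               for _, ki, kc in kept):
            kept.append((p, i, c))
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10)]  # heavily overlapping boxes
class_probs = [(0.5, 0.7), (0.7, 0.5)]    # (class1_prob, class2_prob) per box
for p, i, c in multi_label_nms(boxes, class_probs):
    print(f"box {i + 1}: class{c + 1}_prob = {p}")
```

Under these assumptions the first box survives only as class2 and the second only as class1, which matches the outcome described in the question.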

> The loss is back to its original form, using the PyTorch defaults, which is for example for the 3 yolo layers: loss_giou = (giou_1.mean() + giou_2.mean() + giou_3.mean()).sum()

Do you know how this changes the delta during auto-differentiation in PyTorch?
Do you apply it only for x,y,w,h and not for probs and obj?


> Yes spatial augmentation seemed to hurt training, so I set it to zero, but left HSV augmentation on:

Yes, it may help to win the competition, but it may also hurt cross-domain accuracy when the test images/videos are not similar to MS COCO.

It seems to work well because Ultralytics uses letter_box image resizing by default, so it keeps the aspect ratio and doesn't require large spatial image transformations.
In Darknet we can try to use jitter=0.1 letter_box=1 instead of jitter=0.3 letter_box=0.
I think the higher the network resolution, the more preferable it is to use jitter=0.1 letter_box=1.

> I'm really hoping we might be able to merge the YOLO outputs some day so I can do away with this uncertainty in how to combine the losses from the different layers.

What do you mean?

> I changed multi-scale to change the resolution every batch now, instead of every 10 batches before. This seemed to smooth the results a bit, epoch to epoch.

Does it decrease training speed, because changing the network size takes time?

If we use dynamic_minibatch=1 in Darknet, where width, height, and mini_batch change dynamically and the GPU arrays must be reallocated for each layer, training can become 2x-3x slower if we do it after every iteration.

@WongKinYiu

Have you checked if scale_x_y=1.1 increases AP95 accuracy, while it decreases AP50 and AP75 but keeps the same AP50...95? https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/coco/results.md#mscoco


EfficientNetB0-Yolo was added to the OpenCV-dnn module

So only scale_x_y=1.1 still needs to be implemented in order to use csresnext50-panet-spp-original-optimal.cfg with OpenCV-dnn.

I have only run experiments with scale_x_y=1.05, scale_x_y=1.1, and scale_x_y=1.2 on different feature pyramids.
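For readers unfamiliar with scale_x_y, here is a minimal sketch of what the parameter does as I understand Darknet's [yolo] layer: the sigmoid output for the box centre is stretched around 0.5 so that predictions can reach the cell border, and slightly beyond it, without needing extreme logits. The formula is my reading of Darknet's behaviour and should be checked against its source; the function name is hypothetical.

```python
import math


def scaled_xy_offset(t, scale_x_y=1.0):
    """Box-centre offset within a grid cell from the raw logit t.

    Assumed form: sigmoid(t) * scale - 0.5 * (scale - 1).
    With scale_x_y = 1.0 this reduces to the plain YOLOv3 sigmoid.
    """
    s = 1.0 / (1.0 + math.exp(-t))
    return s * scale_x_y - 0.5 * (scale_x_y - 1.0)


for scale in (1.0, 1.05, 1.1, 1.2):
    lo = scaled_xy_offset(-20.0, scale)  # sigmoid close to 0
    hi = scaled_xy_offset(20.0, scale)   # sigmoid close to 1
    print(f"scale_x_y={scale}: offset range is about [{lo:.3f}, {hi:.3f}]")
```

The centre of the range stays at 0.5 for every scale; only the reachable extent grows.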

Have you tested the inference speed of enetb0-yolo using opencv-dnn?

> Have you tested the inference speed of enetb0-yolo using opencv-dnn?

Not yet. I will test it on an Intel CPU and the Intel Myriad X neurochip.

@AlexeyAB @WongKinYiu I made a simple Colab notebook to see the time effects of group/mix convolutions.

It times a tensor passing forward and backward (to mimic training) through a Conv2d() op. The speeds stay about the same even as the parameter count drops by >10X. So similar sized models using these ops may be much slower.

b=m(x), x=[16, 128, 38, 38], b=[16, 256, 38, 38]

    groups  time(ms)    params  shape m             
         1       5.1    294912  [256, 128, 3, 3]    
         2       4.2    147456  [256, 64, 3, 3]     
         4       4.2     73728  [256, 32, 3, 3]     
         8       4.9     36864  [256, 16, 3, 3]     
        16       6.9     18432  [256, 8, 3, 3]      
        32       6.1      9216  [256, 4, 3, 3]      
        64       2.6      4608  [256, 2, 3, 3]      
       128       2.0      2304  [256, 1, 3, 3]   
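The params and shape columns of the table follow directly from how grouped convolution splits the input channels. The sketch below reproduces them in plain Python and adds the theoretical multiply-accumulate count for the x=[16, 128, 38, 38] input, assuming stride 1 and padding that preserves the 38x38 size. The function names are illustrative.

```python
from math import prod


def grouped_conv_weight_shape(c_in, c_out, kernel, groups):
    """Weight shape of a bias-free 2D convolution with the given group count."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    return (c_out, c_in // groups, kernel, kernel)


def macs(weight_shape, batch, height, width):
    """Theoretical multiply-accumulates per forward pass (output size = H x W)."""
    return prod(weight_shape) * batch * height * width


print(f"{'groups':>6} {'params':>8} {'GMACs':>7}  shape")
for groups in (1, 2, 4, 8, 16, 32, 64, 128):
    shape = grouped_conv_weight_shape(128, 256, 3, groups)
    gmacs = macs(shape, 16, 38, 38) / 1e9
    print(f"{groups:>6} {prod(shape):>8} {gmacs:>7.3f}  {list(shape)}")
```

Parameters and theoretical compute both fall by the group count, yet the measured times in the table stay in the same range until groups=64, so the theoretical saving does not show up as wall-clock speed in this test.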

@glenn-jocher
Yes, NVIDIA cuDNN works the same way.
Also, the Google Coral Edge TPU neurochip doesn't use grouped conv, despite the fact that they advertise EfficientDet/EfficientNet with grouped convolutions. https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html

> Have you tested the inference speed of enetb0-yolo using opencv-dnn?

> Not yet. I will test it on an Intel CPU and the Intel Myriad X neurochip.

Hi @AlexeyAB,

I ran the speed test of this network on the Intel CPU. It looks like it is almost 5 times slower than the Tiny Yolov3 PRN network on CPU as well. Below are the results,

OpenCV: 3.4.10-pre (https://github.com/opencv/opencv/tree/377dd04224630e835cce8c7d67e651cae73fd3b3)
CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz
Hard Drive Type: HDD
Display: Off
Yolov3-Tiny-PRN: 21.62 FPS
EfficientNetB0-Yolov3: 4.72 FPS

It looks like depthwise convolutions are slow on CPU as well. Any thoughts?

Thanks

@glenn-jocher @mmaaz60 Take a look at the comparison: https://github.com/AlexeyAB/darknet/issues/5079

@glenn-jocher Hi, Did you successfully train ASFF model?

@AlexeyAB yes I trained ASFF on COCO (results99 in orange), but got slightly worse results in the end compared to default (blue). Performance in the first 5% of epochs was much better, probably because the summation of outputs reduced a lot of that early noise in the model, but did not help after that point.

Of course my implementation might be wrong!

[Image: results plot comparing ASFF (results99, orange) with the default (blue)]

@glenn-jocher
Is this a comparison of yolov3-spp.cfg and yolov3-asff.cfg? Show the asff-cfg file.
What is the network resolution?
So asff / bifpn don't increase accuracy?

@AlexeyAB yes, basically. I created a 12-anchor version of yolov3-spp.cfg called yolov4.cfg (for 4 anchors per yolo layer) which I used for my comparison (this 12 anchor model increases mAP a tiny bit, about +0.1). I compared yolov4.cfg against yolov4-asff.cfg. For asff I moved all of the yolo layers to the end, and added 3 features to the existing feature vector of length 340 to create the asff weights, so the input to each yolo layer is the same: (1,343,13,13), (1,343,26,26), (1,343,52,52).

I split the feature vectors into the traditional size (1,340,13,13) and the weights (1,3,13,13) for the weighted summations:

yolo1 = (1,340,13,13)*(1,1,13,13) + (1,340,13,13)*(1,1,13,13) + (1,340,13,13)*(1,1,13,13)

etc. using this extra ASFF code. I used sigmoid weights since softmax was much slower, and did a linear interpolation for the resizing.

        if ASFF:
            i, n = self.index, self.nl  # index in layers, number of layers
            p = out[self.layers[i]]
            bs, _, ny, nx = p.shape  # bs, 255, 13, 13
            if (self.nx, self.ny) != (nx, ny):
                create_grids(self, img_size, (nx, ny), p.device, p.dtype)

            # outputs and weights
            # w = F.softmax(p[:, -n:], 1)  # normalized weights
            w = torch.sigmoid(p[:, -n:]) * (2 / n)  # sigmoid weights (faster)
            # w = w / w.sum(1).unsqueeze(1)  # normalize across layer dimension

            # weighted ASFF sum
            p = out[self.layers[i]][:, :-n] * w[:, i:i + 1]
            for j in range(n):
                if j != i:
                    p += w[:, j:j + 1] * \
                         F.interpolate(out[self.layers[j]][:, :-n], size=[ny, nx], mode='bilinear', align_corners=False)

Training was multi-scale 288-640, with metrics plotted at 608 img-size. So no, so far I haven't been able to increase accuracy with BiFPN or ASFF. The only thing that improved a tiny bit was weighted feature fusion, but the gain was tiny (0.1 mAP).
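Two details of the ASFF variant above can be checked with plain arithmetic: the 340 + 3 channel split, and the effect of the `2 / n` factor on the sigmoid weights. The sketch assumes 80 COCO classes and 4 anchors per yolo layer (which gives 340), and uses made-up logits; it compares the sigmoid weighting with the commented-out softmax alternative.

```python
import math


def sigmoid_weights(logits):
    """Per-layer fusion weights as in the snippet: sigmoid(z) * (2 / n)."""
    n = len(logits)
    return [(2.0 / n) / (1.0 + math.exp(-z)) for z in logits]


def softmax_weights(logits):
    """Normalized alternative: weights always sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


anchors_per_layer, classes, n_layers = 4, 80, 3
features = anchors_per_layer * (classes + 5)  # 4 box values + objectness + classes
print(features, features + n_layers)          # 340 343

for logits in ([0.0, 0.0, 0.0], [2.0, 0.0, -2.0], [4.0, 4.0, 4.0]):
    sw = sigmoid_weights(logits)
    print(logits, "sigmoid sum:", round(sum(sw), 3),
          "softmax sum:", round(sum(softmax_weights(logits)), 3))
```

With zero logits the sigmoid weights are each 1/3 and sum to 1, so the fused output starts out as a plain average. Unlike softmax, the sum can drift between 0 and 2 as the logits move, which is the price of skipping the normalization.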

yolov4.cfg.txt
yolov4-asff.cfg.txt

@glenn-jocher So do you get AP50...95 higher than 40.6 - 42.4% for ASFF 608x608? https://github.com/ruinmessi/ASFF#coco

It seems that ASFF+RFB or multi-block-BiFPN should use higher network resolution for higher accuracy.

@AlexeyAB no, I actually saw worse results for my ASFF implementation, about -0.5 mAP at 608 vs the default yolov4.cfg.

Higher image size is definitely one of the ingredients in higher mAPs. EfficientDet uses 512@D0, 640@D1, all the way to 1280@D7:
https://github.com/google/automl/blob/3d88847cc18c69d194490f039279502ddcb536f2/efficientdet/hparams_config.py#L199

The official ASFF trains at 320-608 for 42.4@608 and 480-800 for 43.9@800. https://github.com/ruinmessi/ASFF#models

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.
