Darknet: What is yolo-spp? ~+2.7 AP@[.5, .95]

Created on 7 Apr 2019  路  10Comments  路  Source: AlexeyAB/darknet

Is there any paper/blog where I can find an explanation for yolo-spp?

Is it the same that the spp block used here DC-SPP-YOLO: Dense Connection and Spatial Pyramid Pooling Based YOLO for Object Detection?

question

Most helpful comment

@drapado @HamzahNizami Hi,

The difference between yolov3.cfg and yolov3-spp.cfg just this SPP-block: https://github.com/AlexeyAB/darknet/blob/b6a824df39d1a79f15916d2c11133ce27bc0ab06/cfg/yolov3-spp.cfg#L575-L597

  • yolov3-tiny.cfg uses downsampling (stride=2) in Max-Pooling layers
  • yolov3.cfg uses downsampling (stride=2) in Convolutional layers
  • yolov3-spp.cfg uses downsampling (stride=2) in Convolutional layers + gets the best features in Max-Pooling layers

Yes, this is described in this artice, just yolov3-spp.cfg uses 5x5 9x9 13x13, while in the artice 5x5 7x7 13x13: https://arxiv.org/abs/1903.08589

They got only mAP = 79.6% on Pascal VOC 2007 test with using Yolov3SPP-model.
But you can achive higher accuracy mAP = 82.1% even with yolov3.cfg model by using this repository: https://github.com/AlexeyAB/darknet/issues/2557#issuecomment-474187706
And even higher mAP with yolov3-spp.cfg and this repository.

image

All 10 comments

@AlexeyAB Hi there i would also like to ask the same question. Any explanation of what exactly the differences are between the different versions (YOLOv3-Tiny, YOLOv3 and yolov3-spp) would be really helpful. A description of the changes and what exactly those changes do would be very helpful.

Thanks in advance

@drapado @HamzahNizami Hi,

The difference between yolov3.cfg and yolov3-spp.cfg just this SPP-block: https://github.com/AlexeyAB/darknet/blob/b6a824df39d1a79f15916d2c11133ce27bc0ab06/cfg/yolov3-spp.cfg#L575-L597

  • yolov3-tiny.cfg uses downsampling (stride=2) in Max-Pooling layers
  • yolov3.cfg uses downsampling (stride=2) in Convolutional layers
  • yolov3-spp.cfg uses downsampling (stride=2) in Convolutional layers + gets the best features in Max-Pooling layers

Yes, this is described in this artice, just yolov3-spp.cfg uses 5x5 9x9 13x13, while in the artice 5x5 7x7 13x13: https://arxiv.org/abs/1903.08589

They got only mAP = 79.6% on Pascal VOC 2007 test with using Yolov3SPP-model.
But you can achive higher accuracy mAP = 82.1% even with yolov3.cfg model by using this repository: https://github.com/AlexeyAB/darknet/issues/2557#issuecomment-474187706
And even higher mAP with yolov3-spp.cfg and this repository.

image

@AlexeyAb thanks for your answers!

How the spp block makes yolov3-spp suitable for small and large object detection?

@AlexeyAB I was under the impression that YOLOv3 removed max pooling layers?

@AlexeyAB thanks for your answers!

How the spp block makes yolov3-spp suitable for small and large object detection?

My understanding -
Both convolution subsampling and max pooling have distinct advantages. Convolution subsampling can be better reversed probably in the subsequent upsampling layers. Max pooling somewhat acts to remove some high frequency noise from the image by selecting only max values from the adjacent regions.
By combining both, spp seems to be leveraging the benefits of both.

@AlexeyAB , Would be good to know your thoughts if the above is valid comment.

Thanks!

image

@AlexeyAB , Want to know if the concept of "Dense connection block" also implemented as part of yolov3-spp here in this repository? Or this was not needed?

@kmsravindra

Want to know if the concept of "Dense connection block" also implemented as part of yolov3-spp here in this repository?

No.

You can try to add it there: https://github.com/AlexeyAB/darknet/blob/8c970498a296ed129ffef7d872ccc25d42d1afda/cfg/yolov3-spp.cfg#L574

For example, you can use Dense-Connection block like this (just set filters=512 in [convolutional] layers) : https://github.com/AlexeyAB/darknet/blob/8c970498a296ed129ffef7d872ccc25d42d1afda/cfg/densenet201.cfg#L1224-L1317

@kmsravindra

My understanding -
Both convolution subsampling and max pooling have distinct advantages. Convolution subsampling can be better reversed probably in the subsequent upsampling layers. Max pooling somewhat acts to remove some high frequency noise from the image by selecting only max values from the adjacent regions.
By combining both, spp seems to be leveraging the benefits of both.

@AlexeyAB , Would be good to know your thoughts if the above is valid comment.

Thanks!


Convolution subsampling can be better reversed probably in the subsequent upsampling layers.

What do you mean?

  • Convolution subsampling (with kernel_size >= stride) uses all input data (weighted), while max-pooling uses only the maximum inputs.

  • Convolution subsampling can be end-to-end optimized for XNOR-model (bit1-input & bit1-output), while max-pooling requires INT16/32 or FP16/32 inputs even for XNOR-model (so conversion BIT1->FP32->BIT1(transpose) - it brings a big overhead) .

@AlexeyAB Can you please tell me how the concatenation of the max pooling layers is carried out if they are different dimensions? Are they simply padded to the correct size?

@mcguile

They have different sizes but the same stride=1, so output WxH == input WxH, therefore all these maxpool-layers have the same output.

Yes it automatically uses padding size-1: https://github.com/AlexeyAB/darknet/blob/a1abd07e23fc5b143a6197de9908fe4f33791c6a/src/parser.c#L533

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Greta-A picture Greta-A  路  3Comments

Jacky3213 picture Jacky3213  路  3Comments

HanSeYeong picture HanSeYeong  路  3Comments

shootingliu picture shootingliu  路  3Comments

Cipusha picture Cipusha  路  3Comments