In the config file for the darknet implementation of YOLOv3, it says `pad=1`. Does that mean there is always a padding of 1 pixel around the image, or is it a boolean toggling 'SAME'/'VALID' padding?
If there's always one pixel padding, is it true that the first few layers actually don't change in dimensions (except for the channels)?
Thanks in advance,
Jules
I'm also wondering. I read somewhere that top-and-left padding is used, but I don't quite understand how. 2 px at every 3x3 layer?
It is using constant padding. I hope I can clarify this. First, let's take a look at the first two layers of Darknet-53:
| Type | Filter | Size | Output |
|---------------|--------|-----------|-----------|
| Convolutional | 32 | 3 x 3 | 256 x 256 |
| Convolutional | 64 | 3 x 3 / 2 | 128 x 128 |
Notice how the stride is 1 in the first layer and 2 in the second. You can also find this information in the `yolov3.cfg` file.
Nice. So the input of the 2nd layer is a volume of 256x256, and the output is a volume of 128x128 (half the size). Now, let's recall the formula to compute the output size of a convolution:
O = ((W - F + 2P) / S) + 1
Where _O_ denotes the output size, _W_ the input size, _F_ the size of the kernel, _P_ the padding and _S_ the stride.
If you substitute the values from the .cfg file into this equation, you'll find the dimensions don't match:
((256 - 3 + 2x1) / 2) + 1 = 128.5
However, if you truncate (floor) the division, the dimensions match, and that is exactly what happens in practice. To simplify, let's consider a 4x4 input volume that we want to downsample to a 2x2 volume:
```
1 2 3 4
```

Now, let's add padding of 1:

```
0 1 2 3 4 0
```

Let's apply the convolution with a kernel of size 3 and stride 2:

```
|- - -|              (A)
    |- - -|          (B)
        |- - -|      <-- out of bounds (this window is dropped)
 0 1 2 3 4 0
```

You can see what the output will look like:

```
A B
```
Successfully downsampling the image. This example also applies to a volume of 256x256.
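The arithmetic above can be sketched in a few lines of Python. This is just an illustration of the output-size formula with floor division, not darknet's actual code:

```python
def conv_output_size(w, f, p, s):
    """Output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# Layer 1: 3x3 kernel, pad 1, stride 1 -> spatial size is unchanged
print(conv_output_size(256, 3, 1, 1))  # 256

# Layer 2: 3x3 kernel, pad 1, stride 2 -> spatial size is halved
print(conv_output_size(256, 3, 1, 2))  # 128

# The toy 1D example above: input 4, kernel 3, pad 1, stride 2 -> 2 outputs
print(conv_output_size(4, 3, 1, 2))  # 2
```

The stride-1 case also answers the earlier question: with `pad=1` and a 3x3 kernel, the first layers keep their spatial dimensions and only the channels change.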
TL;DR: check your convolutions!
Thanks a lot!
Hey @fediazgon,
I was reading your reply in https://github.com/pjreddie/darknet/issues/950#issuecomment-433494661 and I wanted to ask: instead of padding the input first and then applying the convolution with 'VALID' padding, why can't we use the unpadded input with 'SAME' padding, where the equations would be something like:
output_height = ceil(input_height / stride_height)
output_width = ceil(input_width / stride_width)
which will still give the same results, since with 'SAME' padding the output height and width do not depend on the kernel size.
Note: these equations are framework-specific; in TensorFlow, they are how 'SAME' padding computes the output size.
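A quick sketch (my own check, not from either framework's source) comparing the two conventions for a 3x3 kernel with pad 1, showing they produce the same output sizes at the shapes discussed in this thread:

```python
import math

def same_padding_output(w, s):
    """TensorFlow-style 'SAME' padding: output = ceil(input / stride)."""
    return math.ceil(w / s)

def explicit_pad_output(w, f, p, s):
    """Explicit padding then 'VALID' convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# For a 3x3 kernel with pad=1 the two conventions agree at these sizes:
for w, s in [(256, 1), (256, 2), (128, 2), (4, 2)]:
    assert same_padding_output(w, s) == explicit_pad_output(w, 3, 1, s)
```

So for these layer configurations the results coincide; the 'SAME' formulation simply hides the kernel size and padding inside the ceil.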
thanks