In the config file for the darknet implementation of YOLOv3, it says `pad=1`. Does that mean there is always a padding of 1 pixel around the image, or is it a boolean toggling 'SAME'/'VALID' padding?
If there's always one pixel padding, is it true that the first few layers actually don't change in dimensions (except for the channels)?
Thanks in advance,
Jules
I'm also wondering. I read somewhere that top-and-left padding is used, but I don't quite understand how. 2 px at every 3x3 layer?
It is using constant padding. I hope I can clarify this. First, let's take a look at the first two layers of Darknet-53:
| Type | Filter | Size | Output |
|---------------|--------|-----------|-----------|
| Convolutional | 32 | 3 x 3 | 256 x 256 |
| Convolutional | 64 | 3 x 3 / 2 | 128 x 128 |
Notice how the stride is 1 in the first layer and 2 in the second. You can also find this information in the `yolov3.cfg` file.
Nice. So the input of the 2nd layer is a volume of 256x256, and the output is a volume of 128x128 (half the size). Now, let's recall the formula to compute the output size of a convolution:
O = ((W - F + 2P) / S) + 1
Where _O_ denotes the output size, _W_ the input size, _F_ the size of the kernel, _P_ the padding and _S_ the stride.
If you substitute the values from the .cfg file into this equation, you'll find the dimensions don't match:
((256 - 3 + 2x1) / 2) + 1 = 128.5
However, if you truncate (floor) the division, the dimensions match, and that is exactly what happens in practice. To simplify, let's consider a 4x4 input volume that we want to downsample to a 2x2 volume:
```
1 2 3 4
```

Now, let's add padding of 1:

```
0 1 2 3 4 0
```

Let's apply the convolution with a kernel of size 3 and stride 2:

```
|- - -|              (A)
    |- - -|          (B)
        |- - -|      <-- out of bounds (this window is dropped)
 0 1 2 3 4 0
```

You can see what the output will look like:

```
A B
```
Successfully downsampling the image. This example also applies to a volume of 256x256.
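The arithmetic above can be sketched in a few lines of Python. This is just an illustration of the output-size formula with floor division, not darknet's actual code:

```python
def conv_output_size(w, f, p, s):
    """Output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# Layer 1: 3x3 kernel, pad 1, stride 1 -> spatial size is unchanged
print(conv_output_size(256, 3, 1, 1))  # 256

# Layer 2: 3x3 kernel, pad 1, stride 2 -> spatial size is halved
print(conv_output_size(256, 3, 1, 2))  # 128

# The toy 1D example above: input 4, kernel 3, pad 1, stride 2 -> 2 outputs
print(conv_output_size(4, 3, 1, 2))  # 2
```

The stride-1 case also answers the earlier question: with `pad=1` and a 3x3 kernel, the first layers keep their spatial dimensions and only the channels change.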
TL;DR: check your convolutions!
Thanks a lot!
Hey @fediazgon,
I was reading your reply in https://github.com/pjreddie/darknet/issues/950#issuecomment-433494661 and I wanted to ask: instead of padding the input first and then applying the convolution with 'VALID' padding, why can't we use the unpadded input with 'SAME' padding, where the equations would be something like:
output_height = ceil(input_height / stride_height)
output_width = ceil(input_width / stride_width)
which will still give the same results, since with 'SAME' padding the output height and width do not depend on the kernel size.
Note: these equations are framework-specific; in TensorFlow, they are how 'SAME' padding computes the output size.
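A quick sketch (my own check, not from either framework's source) comparing the two conventions for a 3x3 kernel with pad 1, showing they produce the same output sizes at the shapes discussed in this thread:

```python
import math

def same_padding_output(w, s):
    """TensorFlow-style 'SAME' padding: output = ceil(input / stride)."""
    return math.ceil(w / s)

def explicit_pad_output(w, f, p, s):
    """Explicit padding then 'VALID' convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# For a 3x3 kernel with pad=1 the two conventions agree at these sizes:
for w, s in [(256, 1), (256, 2), (128, 2), (4, 2)]:
    assert same_padding_output(w, s) == explicit_pad_output(w, 3, 1, s)
```

So for these layer configurations the results coincide; the 'SAME' formulation simply hides the kernel size and padding inside the ceil.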
thanks