Pooling and convolution both aggregate inputs over rectangular regions defined by kernel size, stride, and padding parameters. (In fact, average pooling is convolution with a fixed filter, and max pooling is a special case of "tropical convolution".)
However, Caffe currently uses different sets of regions (and consequently, produces different output sizes) for pooling and convolution. Convolution regions are never allowed outside of the padded input, while pooling is performed when a _strided_ pooling region extends off the ends of the input.
This inconsistency causes some annoyance when the exact sizes of things need to be computed and can vary (due to #594). (The issue doesn't normally come up when using networks like AlexNet that ensure their pooling regions don't encounter this edge case.)
Should the behavior be made consistent, or is there a good reason for its current state?
(See also #988, but note that the expression presented there is different from both the current behaviors.)
If I understand it correctly, pooling currently uses `top_size = ceil((bottom_size + 2*pad - kernel_size) / stride) + 1`, while convolution uses `top_size = floor((bottom_size + 2*pad - kernel_size) / stride) + 1`.
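For a concrete case where the two formulas diverge, here is a standalone illustration (not Caffe code; the numbers are just an example):

```cpp
#include <cmath>
#include <cstdio>

// With bottom_size = 6, kernel_size = 3, stride = 2, pad = 0, the pooling
// (ceil) rule yields 3 outputs while the convolution (floor) rule yields 2.
int main() {
  const int bottom_size = 6, kernel_size = 3, stride = 2, pad = 0;
  const float span = bottom_size + 2 * pad - kernel_size;  // = 3
  const int pool_top = static_cast<int>(std::ceil(span / stride)) + 1;   // 3
  const int conv_top = static_cast<int>(std::floor(span / stride)) + 1;  // 2
  std::printf("pooling: %d, convolution: %d\n", pool_top, conv_top);
  return 0;
}
```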
I also have this issue with my fully-convolutional model. I think it should be made consistent.
Yeah, I've run into problems with this as well -- there should be a `RectangularTileParameter` or some such that both `ConvolutionParameter` and `PoolingParameter` understand.
> while pooling is performed when a strided pooling region extends off the ends of the input.
Hmm, I don't think I understand; are you saying the behavior of `PoolingLayer::Reshape` (https://github.com/BVLC/caffe/blob/dev/src/caffe/layers/pooling_layer.cpp#L81) is different for the case of `stride_h_ == 1` vs. `stride_h_ > 1` (and/or for `stride_w_`)? If so, that's not obvious to me from the code, but maybe it's more subtle than an `if (stride > 1)` kind of statement.
I vote yes on standardizing the regions for convolution and pooling. This was raised earlier in https://github.com/BVLC/caffe/pull/473#issuecomment-45386156, and although @jeffdonahue suggested at the time that the "cover" mode be kept for compatibility, I'd like to hear if he's still into it.

Above all, I raise my glass to tropical pools.
@jeffdonahue, all I meant was that in the stride 1 case, the behavior of pooling and convolution is the same; the pooling and convolution regions will always cover the entire input without padding.
@shelhamer and I have discussed this and concluded that it's reasonable to change the behavior of pooling to match that of convolution, i.e., pooling regions would no longer extend past the padded input.
If a reason for keeping the "cover" mode still exists, comment here. Otherwise I'll implement the change eventually... others are welcome to implement it soon if desired.
I don't particularly care about having the "cover" mode for future training, but I'd think changing the default pooling behavior would cause backwards-compatibility issues -- e.g., if you have a saved net in which a pooling layer's output feeds an inner product layer, and the pooling layer now has a different output dimension, the inner product layer's weights will expect a different input dimension and thus be incompatible. And even if you go fully convolutional, it might cause undesired effects on the sampling behavior if the dimensions were carefully planned (e.g., to get exactly 1x1 outputs from some pooling layer).
Maybe we should keep the current behavior for the current set of proto parameters specifying the tiling behavior and deprecate them. Then add a new common proto message with the new behavior used by convolution and pooling.
@jeffdonahue, yeah, that's a good point. I doubt many (any?) are relying on edge-pooling in networks that include inner product layers, but it's desirable to have a common message anyway. So your suggestion makes sense as a nice way to avoid some potential backwards compatibility issues, while getting the new behavior _and_ a common message that can simplify code at the same time.
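For instance, both layers could then call a single shared output-size helper backed by that common message (hypothetical sketch, not existing Caffe code):

```cpp
// Hypothetical shared helper using the convolution-style floor formula
// (integer division floors for non-negative operands).
inline int tile_output_size(int bottom_size, int kernel_size,
                            int stride, int pad) {
  return (bottom_size + 2 * pad - kernel_size) / stride + 1;
}
```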
The padding logic is also one of the trickiest parts of #560. Mixing the padding logic with everything else is what caused so much confusion. Padding arguably belongs to the data transformation family rather than being an inherent part of any layer.
The padding of MATLAB conv2 has three modes: full, same, and valid. All of them have use cases. With an independent class to manage padding, it would not be hard to support any of these modes.
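For stride 1, those modes correspond to the following output sizes (an illustrative sketch of the conv2 conventions, not Caffe code):

```cpp
#include <string>

// "full"  -- every position where the kernel overlaps the input at all
// "same"  -- output matches the input size
// "valid" -- only positions where the kernel lies entirely inside the input
int conv2_output_size(int input_size, int kernel_size, const std::string& mode) {
  if (mode == "full") return input_size + kernel_size - 1;
  if (mode == "same") return input_size;
  return input_size - kernel_size + 1;  // "valid"
}
```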
As a side note, the size-compatibility validation between trained models and network definitions is currently mixed with many other things in the layer setup methods. It would be much easier to reason about and maintain if separated out.
@futurely, it might be nice to factor out padding (thus allowing it to be used by itself, even when not followed by a conv or pooling layer), but I don't see a way to do that without a non-negligible cost. The data transformation stuff works because the data is first read into host memory, and only the transformed version goes to the GPU. For an intermediate layer, one can't pad in place, so a naive approach has a memory cost (and I don't see a less naive approach).
```cpp
// Sketch: share padding code via composition rather than inheritance.
template <typename Dtype>
class InPlacePadding {
 public:
  void pad(const PaddingParameter& param, const vector<Blob<Dtype>*>& bottom,
           vector<Blob<Dtype>*>* top);

 protected:
  void pad_mode1(vector<Blob<Dtype>*>* top);
  void pad_mode2(vector<Blob<Dtype>*>* top);
  void pad_mode3(vector<Blob<Dtype>*>* top);
};

template <typename Dtype>
class PoolingLayer {
 public:
  void Method(const vector<Blob<Dtype>*>& bottom, vector<Blob<Dtype>*>* top) {
    // ...
    padding_.pad(this->layer_param_.padding_param(), bottom, top);
    // ...
  }

 protected:
  InPlacePadding<Dtype> padding_;
};
```
By using composition instead of adding a separate layer above the conv or pooling layer, no extra memory has to be allocated.
@futurely I can't tell what you're trying to get at above... what is `padding_.pad` supposed to do? Presumably it puts the padded version of its bottom in top, but then one has to allocate extra memory for the padded version (note that `PoolingLayer` can't use its _own_ top to store the padded bottom; it's too small). But your claim is that no extra memory has to be allocated, so one of us is confused.
There _is_ a way to abstract away the padding, which is to have an indexing routine that indirects direct access to blob memory, performing bounds checking and sometimes returning zero rather than actual blob contents. But I worry that would be overengineering.
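Roughly, I'm imagining something like this (just a sketch of the idea, not anything that exists in Caffe):

```cpp
// A read-only "padded view" of a 2D plane of blob data: accesses outside
// the real extent return zero instead of reading memory, so no padded
// copy of the blob is ever allocated.
template <typename Dtype>
class PaddedView {
 public:
  PaddedView(const Dtype* data, int height, int width, int pad_h, int pad_w)
      : data_(data), height_(height), width_(width),
        pad_h_(pad_h), pad_w_(pad_w) {}

  // h and w are coordinates in the padded plane,
  // in [0, height + 2*pad_h) x [0, width + 2*pad_w).
  Dtype at(int h, int w) const {
    const int h_in = h - pad_h_;
    const int w_in = w - pad_w_;
    if (h_in < 0 || h_in >= height_ || w_in < 0 || w_in >= width_) {
      return Dtype(0);  // virtual zero padding
    }
    return data_[h_in * width_ + w_in];
  }

 private:
  const Dtype* data_;
  int height_, width_;
  int pad_h_, pad_w_;
};
```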
The `InPlacePadding` sketch above tried to share padding code without using inheritance. I thought the Blob
Thanks for the details. The weird output dimensions of pooling are confusing.
Hey guys, it has been a very long time since the issue was opened.
When can we solve this problem?