Maskrcnn-benchmark: W and H computation in `crop` method of several target structures seems wrong

Created on 10 May 2019  路  12Comments  路  Source: facebookresearch/maskrcnn-benchmark

馃悰 Bug

It seems that the crop() methods present in many target structure classes, e.g. BoxList, BinaryMaskList, and Polygons, calculate the height and width wrongly.

BoxList class

In its crop(self, box) method, L173-L178:

        w, h = box[2] - box[0], box[3] - box[1]
        cropped_xmin = (xmin - box[0]).clamp(min=0, max=w)
        cropped_ymin = (ymin - box[1]).clamp(min=0, max=h)
        cropped_xmax = (xmax - box[0]).clamp(min=0, max=w)
        cropped_ymax = (ymax - box[1]).clamp(min=0, max=h)

Since box defines the corner coordinates of the desired crop region, its width and height should be box[2] - box[0] + 1 and box[3] - box[1] + 1 respectively. The correct clamp max bound in the following four lines of code should be w - 1 and h - 1.

Therefore, the above code block should be:

        w, h = box[2] - box[0] + 1, box[3] - box[1] + 1
        cropped_xmin = (xmin - box[0]).clamp(min=0, max=w - 1)
        cropped_ymin = (ymin - box[1]).clamp(min=0, max=h - 1)
        cropped_xmax = (xmax - box[0]).clamp(min=0, max=w - 1)
        cropped_ymax = (ymax - box[1]).clamp(min=0, max=h - 1)

While at first glance it seems that the wrong code cancels itself out, and doesn't impact the resultant cropped_xs and cropped_ys, it does cause a wrong bounding box size later, in L187 of the same method:

       bbox = BoxList(cropped_box, (w, h), mode="xyxy")

The cropped BoxList would be smaller than the desired crop box, by 1px in both x and y direction.

BinaryMaskList, PolygonInstance, PolygonList class

Similarly, in the crop() method of these classes, the computed w and h are 1px smaller than the correct values.

BinaryMaskList.crop(self, box) L92-L111:

    def crop(self, box):
        ...
        width, height = xmax - xmin, ymax - ymin
        cropped_masks = self.masks[:, ymin:ymax, xmin:xmax]
        cropped_size = width, height
        return BinaryMaskList(cropped_masks, cropped_size)

PolygonInstance.crop(self, box) L242-L268:

    def crop(self, box):
        ...
        w, h = xmax - xmin, ymax - ymin

        cropped_polygons = []
        for poly in self.polygons:
            p = poly.clone()
            p[0::2] = p[0::2] - xmin  # .clamp(min=0, max=w)
            p[1::2] = p[1::2] - ymin  # .clamp(min=0, max=h)
            cropped_polygons.append(p)

        return PolygonInstance(cropped_polygons, size=(w, h))

PolygonList.crop(self, box) L381-L388:

    def crop(self, box):
        w, h = box[2] - box[0], box[3] - box[1]
        cropped_polygons = []
        for polygon in self.polygons:
            cropped_polygons.append(polygon.crop(box))

        cropped_size = w, h
        return PolygonList(cropped_polygons, cropped_size)

Impact

The impact is usually unfelt if one just runs the code base as is, since these mistakes seem quite _self-consistent_. However, besides the implementation being fundamentally incorrect, it also adds unnecessary annoyances to people developing new functionalities based on this code base.

wontfix

Most helpful comment

So again it is just an annotation convention of your data format. If the right bottom point is excluded, there is no "+1", but if the right bottom point is included, there should be "+1".

It is a matter of convention, but there are good convention and bad convention.

"+1" is a bad convention because when x2 = x1 + w - 1, x2 is not scale-invariant, meaning that all code written to scale the boxes are not precise.
Without the "+1", a box that's half of the image is simply (0.0, 0.0, 128.0, 128.0).

All 12 comments

I agree with this mistake, which seems to be an old mistake in the original Detectron. The implementation in mmdetection has corrected this. (E.g. https://github.com/open-mmlab/mmdetection/blob/master/mmdet/core/bbox/transforms.py)

Yes, this is a mistake that has been kept for backwards-compatibility with Detectron. :-/

I agree with @qizhuli, this mistake appeared in several places in this repo and Detectron does bring me some annoyances. It makes me wonder why there should be "+1" when converting a box from "xyxy" to "xywh" if the coordinates of this box are 0-indexed absolute coordinates (floats).

why there should be "+1" when converting a box from "xyxy" to "xywh" if the coordinates of this box are 0-indexed absolute coordinates (floats)

You're right and there should not be. This is just a historical issue and we're slowly correcting it (e.g., https://github.com/pytorch/pytorch/commit/373e6a78bf6afaf66d2749740036b077b70312c8 )

Why there should not be "+1" when converting "xyxy" to "xywh"? Are you assuming the bottom right point is not on the object?

Why there should not be "+1" when converting "xyxy" to "xywh"? Are you assuming the bottom right point is not on the object?

No. "The x and y coordinates are 0-indexed absolute coordinates (floats) obtained from user mouse clicks, not integer pixel locations." (referred from pdollar). So, computing the width and height of a box is different from counting the number of pixel locations. As explained by @ppwwyyxx, this is a historical issue.

What is the difference with floats? Do you mean the width between 0.0 and 1.0 is 1? Adn if the image size is 256x256, the width and height are actually 255.0 - 0.0 = 255.0 and 255.0 - 0.0 = 255.0?

@wangg12 For an image, its width and height are always integers rather than floats because it consists of pixels. For a bounding box, its coordinates represent the precise locations (or points but not pixels) of an object on the image.

What is the bounding box coordinates convention in your mind? If an image is filled with one object (eg. 256x256), how will you represent the xyxy and xywh?

What is the bounding box coordinates convention in your mind? If an image is filled with one object (eg. 256x256), how will you represent the xyxy and xywh?

For your case, both xyxy and xywh are (0.0, 0.0, 256.0, 256.0).

So again it is just an annotation convention of your data format. If the right bottom point is excluded, there is no "+1", but if the right bottom point is included, there should be "+1". There is no difference if the annotation is in "xywh" and the code is self-consistent (However it is apparently not self-consistent in this repo). If the annotation is in "xyxy" format, more attention should be paid.

So again it is just an annotation convention of your data format. If the right bottom point is excluded, there is no "+1", but if the right bottom point is included, there should be "+1".

It is a matter of convention, but there are good convention and bad convention.

"+1" is a bad convention because when x2 = x1 + w - 1, x2 is not scale-invariant, meaning that all code written to scale the boxes are not precise.
Without the "+1", a box that's half of the image is simply (0.0, 0.0, 128.0, 128.0).

Was this page helpful?
0 / 5 - 0 ratings