Vision: Possible inversion of x-axis and y-axis in crop transform

Created on 27 Sep 2019 · 8Comments · Source: pytorch/vision

I just realized that the crop transform in transform/functional.py with arguments i, j, w and h (using PIL.crop function) actually crops a rectangular portion of w width and h height, having the top left coordinate as (j, i). Wouldn't it be intuitive to have the top left corner coordinates as (i, j)?

documentation transforms

Source

ekagra-ranjan

All 8 comments

Related to #1323

pmeier on 27 Sep 2019

👍1

Hi @pmeier ! As pointed out by @fmassa, Pillow has different ordering for width and height but I think the position of i and j are still reversed. Please check out: https://github.com/python-pillow/Pillow/issues/2729#issuecomment-425388850 for the syntax for crop in Pillow. The first 2 arguments are for the (x, y) coordinates for the top left corner.

If the user input (i, j) is in the common format of (x, y), then the current implementation of crop in Pytorch is setting (y, x) as coordinates for top left corner. I think the correct implementation should be:

img.crop((i, j, i+w, j+h)

Am I missing something?

ekagra-ranjan on 27 Sep 2019

IMO the current implementation is correct. I did the following to verify it:

from PIL import Image
import numpy as np
from torchvision.transforms import functional as F

np_image = np.random.rand(100, 100)
pil_image = Image.fromarray(np_image * 255.0)

left = 10
top = 20
width = 30
height = 40
right = left + width
bottom = top + height

crop1 = pil_image.crop((left, top, right, bottom))
crop2 = F.crop(pil_image, top, left, height, width)

abs_diff = np.sum(np.abs(np.array(crop1) - np.array(crop2)))
implementation_correct = abs_diff < 1e-6

However, we should definitely change the signature and docstring to avoid further confusion. I propose the following:

def crop(img, top, left, height, width):
    """Crop the given PIL Image.
    Args:
        img (PIL Image): Image to be cropped. (0,0) denotes the top left corner of the image.
        top (int): Vertical component of the top left corner of the crop box.
        left (int): Horizontal component of the top left corner of the crop box.
        height (int): Height of the crop box.
        width (int): Width of the crop box.
    Returns:
        PIL Image: Cropped image.
    """

pmeier on 27 Sep 2019

👍1

If we follow the convention of (top, left, height, width) then the current implementation is indeed fine. However, we have 2 choices for the naming conventions:
1) (top, left, height, width)
2) (x, y, height, width)

Since top and left correspond to coordinates in y-axis and x-axis respectively, both of them are equivalent. So the only question remaining is what convention do we follow? IMO (x, y, height, width) would be more clear than (top, left, height, width) as even to explain what top and left mean, we need to use (x, y) coordinates, e.g., (0, 0) is used in the docs.

Is it a standard convention to use (top, left, height, width) in Pytorch? Have we used this convention anywhere else?

ekagra-ranjan on 27 Sep 2019

👍1

IMO (x, y, height, width) would be more clear than (top, left, height, width) as even to explain what top and left mean, we need to use (x, y) coordinates, e.g., (0, 0) is used in the docs.

If we use x and y we need to state in which direction the box is expanding. This is clear for top and left. Thus, even for the amount of documentation both are equivalent. I think this comes down to preference.

If we go with option 2. the signature should be

(y, x, height, width)

to keep BC.

pmeier on 29 Sep 2019

I guess we can leave it as it. You may proceed to change the documentation as you mentioned earlier or if you want I can do it in case you are busy at the moment.

ekagra-ranjan on 29 Sep 2019

I can send a PR, if the documentation change is OK for @fmassa

pmeier on 30 Sep 2019

@pmeier the doc improvements that you proposed in https://github.com/pytorch/vision/issues/1380#issuecomment-535871094 are great, it would be great if you could send a PR improving it