This is a proposal. I'm not sure yet it's the best way to achieve this, so I'm putting this up for discussion.
Have specific tensor subclasses for BoxList / SegmentationMask, CombinedObjects, etc., which inherit from torch.Tensor and override methods to properly dispatch to the relevant implementations. Depends on __torch_function__ from https://github.com/pytorch/pytorch/issues/22402 (implemented in https://github.com/pytorch/pytorch/pull/27064).
For more than 2 years now, users have asked for ways of transforming multiple inputs at the same time, for example for semantic segmentation or object detection (see https://github.com/pytorch/vision/issues/9).
The recommended solution in this case is to use the functional transforms (https://github.com/pytorch/vision/issues/230, https://github.com/pytorch/vision/issues/1169), but for simple cases this is a bit verbose.
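For reference, a rough sketch of what that functional-style joint transform usually looks like today (the names img / mask and the sizes are only illustrative):

import random
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as F

def joint_transform(img, mask):
    # same random crop parameters applied to both inputs
    i, j, h, w = T.RandomCrop.get_params(img, output_size=(224, 224))
    img = F.crop(img, i, j, h, w)
    mask = F.crop(mask, i, j, h, w)
    # same flip decision for both inputs
    if random.random() > 0.5:
        img = F.hflip(img)
        mask = F.hflip(mask)
    # different interpolation per input
    img = F.resize(img, 300, interpolation=Image.BILINEAR)
    mask = F.resize(mask, 300, interpolation=Image.NEAREST)
    return img, mask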
Ideally, we would want the following to be possible:
- Compose style interface for simple cases

We define new classes for each type of object, which should all inherit from torch.Tensor, and implement / override a few specific methods. It might depend on __torch_function__ from https://github.com/pytorch/pytorch/issues/22402 (implemented in https://github.com/pytorch/pytorch/pull/27064).
Compose-style

We propose to define a CombinedObjects (better names welcome), which is a collection of arbitrary objects (potentially named, but that's not a requirement).
Calling any of the methods in it should dispatch to the corresponding methods of its constituents. A basic example is below (I'll mention a few more points about it afterwards):
class CombinedObjects(object):
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def hflip(self):
        result = {}
        for name, value in self.kwargs.items():
            result[name] = value.hflip()
        return type(self)(**result)
In this way, if the underlying objects follow the same protocol (i.e., implement the required functions), then this should allow combining an arbitrary number of objects with a Compose API via:
# example for flip
class RandomFlip(object):
    def __call__(self, x):
        # implementation stays almost the same
        # but now, `x` can be an Image, or a CombinedObject
        if random.random() > 0.5:
            x = x.hflip()
        return x

transforms = Compose([
    Resize(300),
    RandomFlip(),
    RandomColorAugment(),
])

inputs = CombinedObjects(img1=x, img2=y, mask=z)
output = transforms(inputs)
which satisfies points 1 and 2 above, and part of point 3 (except for the different transformation hyperparameters for image / segmentation mask, which I'll cover next).
In the same vein as the CombinedObject approach from the previous section, we would have subclasses of torch.Tensor for BoxList / SegmentationMask / etc which would override the behavior of specific functions so that they work as expected.
For example (and using code snippets from https://github.com/pytorch/pytorch/pull/25629), we can define a class for segmentation masks where rotation / interpolation / grid sample always behave with nearest interpolation:
HANDLED_FUNCTIONS = {}

def implements(torch_function):
    "Register an implementation of a torch function for a Tensor-like object."
    def decorator(func):
        HANDLED_FUNCTIONS[torch_function] = func
        return func
    return decorator

class SegmentationMask(torch.Tensor):
    def __torch_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        # Note: this allows subclasses that don't override
        # __torch_function__ to handle SegmentationMask objects.
        if not all(issubclass(t, self.__class__) for t in types):
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)

@implements(torch.nn.functional.interpolate)
def interpolate(...):
    # force nearest interpolation
    return torch.nn.functional.interpolate(..., mode='nearest')
and we can also give custom implementations for bounding boxes:
class BoxList(torch.Tensor):
    # need to define height and width somewhere as an attribute
    ...

@implements(torch.nn.functional.interpolate)
def interpolate(...):
    return box * scale
This would allow us to cover the rest of point 3. Because they are subclasses of torch.Tensor, they behave as Tensors except in the particular cases where we override the behavior.
This would be used as follows:
boxes = BoxList(torch.rand(10, 4), ...)
masks = SegmentationMask(torch.rand(1, 100, 200))
image = torch.rand(3, 100, 200)

# basically a dict of inputs
x = CombinedObject(image=image, boxes=boxes, masks=masks)

transforms = Compose([
    RandomResizeCrop(224),
    RandomFlip(),
    ColorAugment(),
])
out = transforms(x)

# have an API for getting the elements back
image = out.get('image')
# or something like that
This is up for discussion. The fact that we are implementing subclasses that do not behave exactly like tensors can be confusing and misleading. But it does seem to simplify a number of things, and makes it possible for users to leverage the same abstractions in torchvision for their own custom types, without having to modify anything in torchvision, which is nice.
Some other proposals have been discussed in https://github.com/pytorch/vision/issues/230 https://github.com/pytorch/vision/issues/1169 and many other places.
cc @Noiredd @SebastienEske @pmeier for discussion
Hello,
If I understand correctly, this proposal implies to
- wrap the inputs in a CombinedObject
- dispatch the transforms through the CombinedObject class
- add new Tensor subclasses (SegmentationMask, etc)

I am not sure that combining transforms and the object being transformed is a good idea. I understand the necessity of applying the same transform with different interpolation methods, but while in some generic cases it may be easy to predict, in some corner cases it may not work.
For example, say we assume bilinear interpolation for an image, and then someone comes along and says they prefer cubic interpolation. How does that work? Do we add yet another parameter for that? But this is a transform parameter in a Tensor-like object...
I am not sure that you saw my last (heavily edited) comment in #1315. I think it solves the issue at hand.
Here is a more detailed version of it, adapted to meet all the above requirements. It only requires updating the existing transforms by adding:
- a master (transform) argument
- params (a dict of the transform params) and async_params (a list of the params in params that do not follow master) attributes
- a get_params() method

The example below uses a dictionary of parameters for the constructor to be generic, but it could be implemented with the normal "list" of parameters, which can then be put in an internal dict anyway. This maintains backward compatibility.
We could have a master argument when creating a transform, telling the transform object to reuse the state from the parent.
affine_augment = tv.transforms.RandomAffine({'degrees':90, 'translate':(-50, 50), 'scale':(0.5, 2)})
affine_augment_label = tv.transforms.RandomAffine({'interpolation':'nearest'}, master=affine_augment)
image = affine_augment(image)
label = affine_augment_label(label)
And inside the transform we could have something like this:
def __init__(self, params, master=None):
    self.master = master
    self.params = {}
    self.async_params = []  # parameters that do not follow the master transform
    if master is not None:
        master_params = master.get_params()
        for param_name in master_params.keys():
            self.params[param_name] = master_params[param_name]
    for param_name in params.keys():
        self.params[param_name] = params[param_name]
        self.async_params.append(param_name)

def __call__(self, input, params=None):
    if self.master is not None:
        master_params = self.master.get_params()
        for param_name in master_params.keys():
            if param_name not in self.async_params:
                self.params[param_name] = master_params[param_name]
    self.update_params()  # update self.async_params
    temp_params = dict(self.params)  # copy so one-off params don't overwrite the stored state
    if params is not None:
        for param_name in params.keys():
            temp_params[param_name] = params[param_name]
    return apply_params(input, temp_params)

def get_params(self):
    return self.params
This also allows customized transforms in the training loop:
affine_augment = tv.transforms.RandomAffine({'degrees': 90, 'translate': (-50, 50), 'scale': (0.5, 2)})
# update params and apply transform
image = affine_augment(image)
# get current params
params = affine_augment.get_params()
params['interpolation'] = 'nearest'
# do not use the updated params, apply the provided ones instead
label = affine_augment(label, params)
PS: if you absolutely want to have one transform taking several inputs, we could have a BundleTransform with the above proposed transforms (or any other transform for that matter)
class BundleTransform(object):
    def __init__(self, *transforms):
        self.transforms = transforms

    def __call__(self, *inputs):
        if len(inputs) == len(self.transforms):
            return [self.transforms[i](inputs[i]) for i in range(len(self.transforms))]

affine_augment = tv.transforms.RandomAffine({'degrees': 90, 'translate': (-50, 50), 'scale': (0.5, 2)})
affine_augment_label = tv.transforms.RandomAffine({'interpolation': 'nearest'}, master=affine_augment)
affine_augment_bundle = BundleTransform(affine_augment, affine_augment_label)
image, label = affine_augment_bundle(image, label)
Previously the transform objects were responsible for actually performing the transformation. You propose to move this logic to the objects that hold data. In your pipeline, Compose executes a Transform on some object representing data; Transform, instead of directly executing a function from tv.transforms.functional, now calls a member function of the received object. If that object happens to be a CombinedObject, it calls a member function on each of its members. Finally, a member function of the data-holding object is executed. This function performs the actual transformation; it may use things from functional or elsewhere.
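To make that last link concrete, here is a minimal, hypothetical sketch of the data-side method ending the chain (using the hflip protocol from the CombinedObjects example above; whether the result keeps the subclass type is glossed over):

import torch

class SegmentationMask(torch.Tensor):
    def hflip(self):
        # the actual pixel work happens here: flip the width (last) dimension;
        # a functional helper could be used instead of a raw tensor op
        return self.flip(-1)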
This means that the transformation logic would be attached to the data objects. In the end, I think, there is no other way to solve this in a general way. After all, it's the user who knows what their data means and what it means to "transform" it. I think that makes sense - the last code snippet in your post made me smile, this is exactly what I need!
What I'm worried with, however, is the resulting complexity on the back end of this approach. Two questions here:
1. How difficult would it be to implement my own transform, one that handles something other than a plain Tensor? How complex is that path? Currently I just write a functor. In the segmentation case, I merge #1315 and add my functor to e.g. SegmentationCompose._interpolate_transforms, depending on what it does.
2. How difficult would it be to implement my own data type? It sounds like it would have to hook into functional (in addition to implementing the transformation logic, of course).

Honestly, the concern stems from this whole __torch_function__ thing that I can't say I understand exactly. In order to write my own transforms, will I have to?
PS: Are BoxList and SegmentationMask classes that would be provided by torchvision, or would the user have to write them themselves?
@SebastienEske
For example, if we assume bilinear interpolation for an image and then someone comes up and says that he prefers cubic interpolation.
If the object-oriented transformation API stays the same, you can just construct your transform object with cubic interpolation. As I understand the above, nothing changes about that.
Your master-follower scheme would allow constructing two separate Compose objects that would consist of linked transforms. Synchronization is indeed stronger than in the RNG case, but it's still implicit - once the transform objects are created, the link becomes hidden in the lower layer of abstraction (and thus harder to track, either in case something goes wrong, or if we want to alter it on the fly).
There seem to be two ways to go: either a small patch to existing transforms, or a total makeover as proposed here. Of the two most reasonable proposals for the former, param-based (#1315) and master-follower (comment), the only difference is basically who passes the params to whom: is it the Compose object, or do transforms do it themselves. I've been pondering this for a while, trying to name the thing that just doesn't feel right to me about this, and I come to the conclusion that this operation is not good OOP. Exchange of parameters belongs to a higher level of abstraction; transforms should just do what they're explicitly told (especially that you allow this path in your latest proposal), not talk among each other without any managing entity knowing. Of course this is just my opinion.
To sum up, I don't dislike this proposal - I'm just cautious about it. The question that's important to me is: what will be simpler for an average PyTorch user: writing my own Compose and doing get_params on existing transforms, #1315-style, or implementing a transform and going through all that __torch_function__ stuff? Including the mental exercise of understanding either approach.
Thanks @SebastienEske and @Noiredd for your feedback!
I agree with @Noiredd's points about the master-worker approach proposed by @SebastienEske (and the param-based approach), in particular:
I think this operation is not good OOP. Exchange of parameters belongs to a higher level of abstraction; transforms should just do what they're explicitly told (especially that you allow this path in your latest proposal), not talk among each other without any managing entity knowing.
To answer @Noiredd's questions:
How difficult would it be to implement my own transform?
Great question, and I see two different cases here.
If you live out-of-tree (i.e., not modifying torchvision), I would imagine that the user would implement a new transform by handling the different input types themselves:
class MyNewTransform(object):
    def __call__(self, x):
        if isinstance(x, CombinedObject):
            out = []
            for y in x:
                out.append(self(y))
            return out
        elif isinstance(x, BoxList):
            # do something here
            ...
        elif isinstance(x, SegmentationMask):
            # something else
            ...
        else:
            # finally
            ...
This doesn't require monkey-patching the torchvision objects to support your new transform, and the user controls what they want to support in a very explicit way.
If your code lives in torchvision, you could add methods to each class you care about so that this new transform is transparently handled.
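For illustration only, a rough sketch of what "transparently handled" could look like in-tree, assuming every supported class (plain image tensors, BoxList, SegmentationMask, CombinedObject) exposes a hypothetical rotate method:

import random

class RandomRotate(object):
    def __init__(self, degrees):
        self.degrees = degrees

    def __call__(self, x):
        angle = random.uniform(-self.degrees, self.degrees)
        # each supported class implements rotate() with its own semantics
        # (nearest interpolation for masks, coordinate math for boxes, ...)
        return x.rotate(angle)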
How difficult would it be to implement my own data type?
I think this should be fairly simple. A cousin approach to what I proposed is what I did in maskrcnn-benchmark in the structures folder, where we have different classes for each object, and each one follows a common API (which was inspired by the PIL API; see for example resize for boxes, for masks and for keypoints).
So if we follow that approach for now (to simplify things), it would involve doing something like
class Mesh(object):
    def resize(self, ...):
        pass

    def transpose(self, ...):
        pass

    ...
Now, why don't we just do this and try not to mess with __torch_function__ (which doesn't exist yet btw)?
Well, in many cases, it is useful to have everything be a torch.Tensor. You can perform many operations on them, without having to unwrap / rewrap everything all the time. The __torch_function__ approach would simplify some cases for dispatching to specific functions (like interpolate or grid_sample), but I think we can try to keep things simpler as of now (especially since __torch_function__ isn't available yet and I'd need to play a bit with it).
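As a rough illustration of the "no unwrap / rewrap" point (assuming the subclass can be built from an existing tensor, for instance via Tensor.as_subclass in recent PyTorch versions):

import torch

class SegmentationMask(torch.Tensor):
    pass

mask = torch.zeros(1, 100, 200).as_subclass(SegmentationMask)
# ordinary tensor operations keep working directly on the mask
area = mask.sum()
crop = mask[:, :50, :50]
converted = mask.to(torch.uint8)  # dtype / device moves also work as on a plain tensor
# with a plain wrapper class, each of these would mean reaching into .tensor
# and re-wrapping the result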
Are BoxList and SegmentationMask classes that would be provided by torchvision, or would the user have to write them themselves?
torchvision would provide those classes
Sorry for the late reply, but at this point I don't have much to add. Also, I think my understanding of the bigger picture is not clear enough to give reasonable suggestions. However, if you need help implementing (even for experimenting) anything, feel free to CC me again.
Hello,
I am ok with whatever approach you choose to use but, considering the amount of change, it's probably good to be aware of the limitations that the one you are proposing has. Allow me to play the devil's advocate one more time ;-):
If you live out-of-tree (i.e., not modifying torchvision), I would imagine that the user would implement a new transform by handling the different input types themselves:
That is a lot more complicated (and bug prone) than just passing a parameter to an existing function.
If your code lives in torchvision, you could add methods to each class you care about so that this new transform is transparently handled.
So if someone wants to contribute a new transform, he/she will need to implement it for all the classes? Even with a single implementation in nn.functional plus reuse in other classes, I am not sure that this is very community friendly.
I also realized that part of the transform is implemented in the content class (SegmentationMask, for instance, for hflip) and another part is implemented in the transform class (RandomFlip). I guess that can be a design choice, but then we need a clear guideline saying what should be implemented in the content class and what should be implemented in the transform class.
This means that the transformation logic would be attached to the data objects. In the end, I think, there is no other way to solve this in a general way. After all, it's the user who knows what their data means and what does it mean to "transform it".
Users already do it with transforms and they are not attached to the data objects. Attaching the transforms to the data objects gives less flexibility and will inevitably be less generic.
@Noiredd mentioning bad/good OOP got me thinking about it ;-)
While I understand that having a supervisor for the synchronization could be nice, I don't understand what would make not having one bad. Do you have an example?
To me, having transforms manage their dependencies on their own is a lightweight approach. The only reason I can think of for having a supervisor would be in the case of distributed training. But even then, all the inputs would probably be loaded on the same GPU so it is not needed. If it were needed I would expect it to be because of memory constraints in which case the class CombinedObjects would be even less flexible as it just would not load in memory.
I am far from being an expert on that, but if I understand correctly, putting the transforms in the data container could violate the single responsibility principle. The new classes would be responsible for both storing/representing the data and transforming it...
The need to modify every class to introduce a new transform feels like breaking the open-closed principle (the closed part of it) as well.
PS: I'll try to think of something nicer for the master/slave approach by tomorrow.
Well... I may have a better solution than the master/slave one. I am only surprised that it was not proposed before; maybe it was and there is something wrong with it. I guess you'll tell me. Why not allow a transform to take lists of parameters and inputs?
# nbtrans is the number of transforms/inputs that the user wants to define
def __init__(self, param1, param2, nbtrans=None):  # ..., plus any other parameters
    # get value of nbtrans
    self.nbtrans = nbtrans
    # check the other inputs and infer nbtrans value from them if needed
    if isinstance(param1, list):
        if self.nbtrans is None:
            self.nbtrans = len(param1)
        elif self.nbtrans != len(param1):
            raise ValueError('inconsistent number of parameters')
    if isinstance(param2, list):
        if self.nbtrans is None:
            self.nbtrans = len(param2)
        elif self.nbtrans != len(param2):
            raise ValueError('inconsistent number of parameters')
    ...
    if self.nbtrans is None:
        self.nbtrans = 1
    # create the proper set of parameters
    if isinstance(param1, list):
        self.param1 = param1
    else:
        self.param1 = [param1 for i in range(self.nbtrans)]
    if isinstance(param2, list):
        self.param2 = param2
    else:
        self.param2 = [param2 for i in range(self.nbtrans)]
    ...

def __call__(self, input):
    if isinstance(input, list):
        if len(input) == self.nbtrans:
            # update internal state if needed
            ...
            # compute the transformation for each input/set of parameters
            return [apply_trans(input[i], i) for i in range(len(input))]
        elif self.nbtrans == 1:
            # update internal state if needed
            ...
            # compute the transformation for each input/set of parameters
            return [apply_trans(input[i], 0) for i in range(len(input))]
        else:
            raise ValueError('wrong number of inputs')
    else:
        # should we throw an exception if self.nbtrans > 1?
        # update internal state if needed
        ...
        return apply_trans(input, 0)
If a parameter was originally a list we can simply test if it has become a list of lists.
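A small sketch of that check, with a hypothetical helper name:

def is_per_input(value, originally_a_list):
    # a parameter that is already a list/tuple for a single transform (e.g. translate)
    # becomes a list of lists/tuples when specified per input
    if originally_a_list:
        return isinstance(value, (list, tuple)) and \
            all(isinstance(v, (list, tuple)) for v in value)
    # scalar parameters (e.g. degrees) simply become a plain list
    return isinstance(value, (list, tuple))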
And we don't need to rewrite transforms.Compose. The initial example becomes:
boxes = torch.rand(10, 4)
masks = torch.rand(1, 100, 200)
image = torch.rand(3, 100, 200)

transforms = Compose([
    RandomResizeCrop(224, interpolation=['coords', 'nearest', 'bilinear']),
    RandomFlip(input_type=['coords', 'image', 'image']),
    ColorAugment(input_type=['coords', 'image', 'image']),
])
boxes, masks, image = transforms([boxes, masks, image])
Handling the coordinates is a bit tricky, but the input_type parameter and the coords interpolation type solve this problem without using custom classes.
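To illustrate why coordinates need their own branch, a rough sketch of what a horizontal flip could do inside a hypothetical apply_trans, depending on input_type:

def hflip_one(x, input_type, image_width):
    if input_type == 'coords':
        # boxes stored as (xmin, ymin, xmax, ymax): mirror the x coordinates
        flipped = x.clone()
        flipped[:, 0] = image_width - x[:, 2]
        flipped[:, 2] = image_width - x[:, 0]
        return flipped
    # 'image' inputs (and masks): just flip the width dimension
    return x.flip(-1)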
And we don't touch the tensors; the modifications remain in the transforms which I think is really a key point to have. I believe this satisfies all seven criteria above plus the one from @Noiredd about several transforms communicating together without supervision. Now, there is only one transform.
The possible values for input_type should probably be defined somewhere in order to have a unified interface.
What do you think?
Doing some extra thinking. There are at least two ways to resize coordinates: return the exact float values and return the rounded integer values. The first one is akin to bilinear/bicubic interpolation and the second one to nearest neighbor.
So maybe we also want to have the input_type parameter for the RandomResizeCrop...
Hello, it's been a while and since this has not moved on I feel like maybe I voiced my concerns a bit too much. If that's the case, I'm sorry about it.
In any case, my point was to make sure that all concerns and alternatives are presented. The above is all I got so @fmassa whatever you decide to go with will be good for me.
Hi @SebastienEske
No worries at all, thanks for sharing your thoughts!
I've been slow to respond on this one because I've been busy working on other parts of torchvision. I'm going to be getting back to segmentation / detection transforms most probably in January
One issue I stumbled on long ago when reading PASCAL segmentation multiclass masks: https://github.com/pytorch/pytorch/issues/5436