Vision: Non-Maximum Suppression on the GPU

Created on 17 Jan 2018 · 60 Comments · Source: pytorch/vision

Is there any interest in an NMS layer that runs on the GPU for torchvision? I have one implemented; it gives a 1-2 order of magnitude speedup over a naive version composed from pytorch ops. Would be happy to contribute it if anyone's interested.
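For readers new to the operation, the greedy procedure that an NMS implementation is expected to reproduce can be sketched in plain Python. This is only an illustrative reference, not the code proposed in this issue: the helper names `iou` and `greedy_nms` are made up for the example, and boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if no already-kept box overlaps it
        # by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, 0.5))
```

The pairwise IoU checks do not depend on each other, which is why the operation is a plausible candidate for a dedicated GPU kernel.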

enhancement help wanted

All 60 comments

That's definitely interesting. We should add it to PyTorch until we figure out ATen extensions.
For example, the way we added ROIPooling: https://github.com/pytorch/pytorch/pull/3672

Great, I'll read over that PR for starters.

This is all ready now. Where in the Python API would you like it to go? I actually have two implementations ready. One is quicker in principle, but it requires allocating global memory to store intermediate results, and freeing that memory at the end of each call completely kills performance. Is there anything one can do about this? If not, the other implementation (which doesn't allocate any intermediate global memory) is still fine and gives the kind of speedup I mentioned anyway, so it's no big deal.

Hi, is this going to be added soon? (I'm asking since pytorch/pytorch/#5404 is closed now)

Yes, I think we should add efficient NMS in here, following the Cpp extensions that were recently added.
Because Cpp extensions are only present in master, I'd wait until 0.4.0 is released before merging any C++ code into torchvision.
How does that sound?

@fmassa That's fine anyway.
Btw, is 0.4 coming soon? :-)

Thanks!

What's the current status of this issue?

@ruotianluo I'm drafting the layers part of torchvision in https://github.com/pytorch/vision/tree/layers
I've added a CPU version of NMS, and I'll look into porting one of the following versions: from py-faster-rcnn, from this PR and a few others.
Have you tried any of those?

When I built my faster-rcnn project, I ported the NMS from py-faster-rcnn; the main reason is that I'm not familiar with CUDA code and it was simpler to just copy it.
With regard to https://github.com/pytorch/pytorch/pull/5404/files, it uses a full byte to store each boolean value of the mask, while py-faster-rcnn uses a single bit. (I can't comment on anything else because, as I said, I'm not familiar with CUDA.)

Thanks, I'll take that into account when porting NMS to GPU.

On the other hand, the py-faster-rcnn version throws data back and forth between the GPU and the CPU, which may not be ideal. It also requires a workspace of size O(n_boxes^2), and dynamically allocating and freeing this hurts performance.
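To give a feel for the O(n_boxes^2) workspace, here is a rough size estimate. It assumes a bit-packed mask laid out as n_boxes rows of ceil(n_boxes / 64) 64-bit words; that layout and the function name `bitmask_workspace_bytes` are assumptions made for this sketch rather than details confirmed in the thread.

```python
def bitmask_workspace_bytes(n_boxes, word_bits=64):
    """Bytes needed for an n_boxes x n_boxes overlap mask, one bit per pair."""
    words_per_row = (n_boxes + word_bits - 1) // word_bits  # ceil division
    return n_boxes * words_per_row * (word_bits // 8)


for n in (1000, 6000, 20000):
    print(n, "boxes ->", bitmask_workspace_bytes(n) / 1e6, "MB")
```

The quadratic growth is what makes allocating and freeing the buffer on every call noticeable.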

We can use the PyTorch caching allocator to handle memory allocation (which will save us from the sync points from freeing the memory).
I'll have a closer look at both implementations this weekend, but thanks for the feedback!

Cool, in that case you might also be interested in the implementation here, https://github.com/dssa56/projects/tree/master/nms under cuda_nms_workspace.cu. That uses the same technique as py-faster-rcnn, but keeps all data on the GPU. I didn't use it in the PR because of the workspace, but if that's a non-issue then it's potentially worth a look.

Does the current version of torchvision include the GPU-based NMS?

No, it doesn't.
I have it implemented locally (based on the implementations that were shared previously), but I haven't pushed it yet.

@fmassa,
OK. Thank you for your reply. I hope the PyTorch team pushes it as soon as possible.

@fmassa Can I help you with implementing the GPU-based NMS, if you need it? I'm also working on it, and I'm using the GPU-based RoI align and RoI pool from your torchvision branch.

I have a version of it that I adapted from https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/nms/nms_kernel.cu
But because PyTorch is evolving so rapidly, it has already been broken by PyTorch refactoring at least 3 times.
I'll hold off on pushing it to torchvision because I expect such breakage to keep happening.
I can push the current version I have to a gist though, if you want to try it out.

@fmassa Thank you. Actually, I prefer to try it out, then consider how to fix the breakage in my production env. Looking forward to the gist version.

@Sucran here is the implementation https://gist.github.com/fmassa/cf9ab87e4bd71e849655d5abb8cfd6c6
It was working, but I'm not sure anymore.
Also, it required PyTorch compiled from source.

FYI, we have released our implementation of {Faster, Mask} R-CNN in https://github.com/facebookresearch/maskrcnn-benchmark , which contains a CPU and CUDA implementation for Non Maximum suppression.

I suggest we move this discussion there for now.

@fmassa, Why close this issue though?
Many people use torchvision, but do not care about or need the maskrcnn repository.
I still think having a pure cuda implementation of NMS in torch/torchvision would be highly appreciated by lots of people!

I'm curious: in which cases do people use NMS outside of object detection?
But at some point we might indeed add layers to torchvision. It's just that packaging torchvision with those is a bit more involved, so I'll leave it for when the Cpp extensions API is more stable.

But I'm reopening it anyway as a reminder.

Similar use case, but I've been using it for detecting events in audio.

Thanks!
I indeed need it for object detection myself, but I have developed my own framework for single-pass detection networks (eg. Yolo), so I don't need all the extra things from the rcnn repo.
I'm hesitant to just copy your NMS implementation into my own repo, because at the moment my repo is pure Python and doesn't need any compiling, which is quite handy for deployment across multiple platforms.

I might just wait it out, as I have a PyTorch implementation that works quite well, although it still does a small part on the CPU.

@0phoff there are still several corner cases for users when compiling extension libraries, see for example https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/TROUBLESHOOTING.md for a gist.

So we will be adding C++ extensions to torchvision once we start shipping pre-compiled binaries for it, or else it will introduce a lot of friction to the users.

@fmassa Sorry for reopening this issue. If adding fast NMS code to torchvision is not suitable for now, how do I build a C++ extension myself?

I have some NMS C and CUDA code (I think most people share the same code). The question is how to expose it to Python and integrate it with torch. Here is what I have so far:

import os
import torch
from torch.utils.cpp_extension import BuildExtension, CppExtension, CUDAExtension
from setuptools import setup
from torch.utils.cpp_extension import CUDA_HOME


#
# ffi = cpp_extension(
#     '_ext.nms',
#     headers=headers,
#     sources=sources,
#     define_macros=defines,
#     relative_to=__file__,
#     with_cuda=with_cuda,
#     extra_objects=extra_objects,
#     extra_compile_args=['-std=c99']
# )


def get_extensions():
    sources = ['src/nms.c']
    headers = ['src/nms.h']
    defines = []
    with_cuda = False

    if torch.cuda.is_available():
        print('Including CUDA code.')
        sources += ['src/nms_cuda.c']
        headers += ['src/nms_cuda.h']
        defines += [('WITH_CUDA', None)]
        with_cuda = True

    this_file = os.path.dirname(os.path.realpath(__file__))
    print(this_file)
    extra_objects = ['src/cuda/nms_kernel.cu.o']
    extra_objects = [os.path.join(this_file, fname) for fname in extra_objects]

    extra_compile_args = {"cxx": []}
    define_macros = []
    extension = CppExtension

    if torch.cuda.is_available() and CUDA_HOME is not None:
        extension = CUDAExtension
        define_macros += [("WITH_CUDA", None)]
        extra_compile_args["nvcc"] = [
            "-DCUDA_HAS_FP16=1",
            "-D__CUDA_NO_HALF_OPERATORS__",
            "-D__CUDA_NO_HALF_CONVERSIONS__",
            "-D__CUDA_NO_HALF2_OPERATORS__",
        ]

    ext_modules = [
        extension(
            "_ext.nms",
            sources,
            include_dirs=headers,
            define_macros=define_macros,
            extra_compile_args=extra_compile_args,
            extra_objects=extra_objects
        )
    ]

    return ext_modules


setup(
    # install_requires=requirements,
    ext_modules=get_extensions(),
    cmdclass={"build_ext": torch.utils.cpp_extension.BuildExtension},
)

This doesn't seem to be the right way, because it fails with:

lib/nms/_ext/nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: state

@jinfagang the easiest to do for now is to ask the users to install maskrcnn-benchmark, and then you can just do

from maskrcnn_benchmark.layers import nms

@fmassa I added the GPU based NMS as a PR a while ago. :)

Apparently, NMS is also used in Keypoint detection such as in the OpenPose paper.

Hey @varunagrawal

Now that we've released torchvision 0.2.2, I'll be (slowly) integrating all the changes that I've been discussing for a while, including merging NMS (and other layers) in torchvision.

Sorry for the wait!

@fmassa that's great, thank you. I was just extracting NMS out of maskrcnn_benchmark for my own use (https://github.com/xvdp/torchvision_extra). As soon as you put it in, I'll use torchvision instead.

@fmassa sounds good. Let me know if you need me to do anything else. I can do all the rebases etc., whenever you're ready.

https://github.com/pytorch/vision/pull/826 has been merged and adds support for NMS on both CPU and GPU

FYI, this is located at torchvision.ops.nms.

It would be nice to mention the format for the boxes in the docstring, though. Is it [y1, x1, y2, x2] for some diagonal corners, as in TensorFlow? Or perhaps just top-left and bottom-right corners?

It should be [x1, y1, x2, y2] but adding this to the docstring makes a lot of sense.
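When boxes come from a source with a different convention, they have to be reordered before being passed to an op that expects [x1, y1, x2, y2]. The helpers below are a small plain-Python sketch; the names `yxyx_to_xyxy` and `corners_to_xyxy` are hypothetical and not part of torchvision.

```python
def yxyx_to_xyxy(box):
    """Reorder a [y1, x1, y2, x2] box into [x1, y1, x2, y2]."""
    y1, x1, y2, x2 = box
    return (x1, y1, x2, y2)


def corners_to_xyxy(corner_a, corner_b):
    """Build [x1, y1, x2, y2] from any two diagonally opposite (x, y) corners."""
    (xa, ya), (xb, yb) = corner_a, corner_b
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


print(yxyx_to_xyxy((1, 2, 3, 4)))
print(corners_to_xyxy((10, 2), (3, 8)))
```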

@fmassa @neighthan I've made the changes in #1110. It should be quick to review since it is mostly doc changes.

@neighthan torchvision.ops.nms seems to also support CUDA? The only issue is that it does not support a batch dimension, which would be another dimension to benefit from CUDA parallelisation.
I have this nms implementation that does and I use it to apply NMS on multiple images at once, each independently.

@varunagrawal @fmassa Can anyone tell me what should be used now to run NMS (CUDA) in batch mode during training? The NMS implementation provided by torchvision.ops does not work in batch mode and will be slow to use at training time.

@dishank-b what do you mean by batch mode?

Also note that we have a batched_nms implementation in torchvision.ops.batched_nms.

@fmassa By batch mode, I mean taking input in the form [batch_size, N, 4]. The current torchvision.ops.nms takes the boxes in the form [N, 4], where N is the number of boxes in a single image. Say that while training Faster R-CNN your batch_size is 16; then we would like NMS to run for those 16 images as a batch. There is a problem, though, in that N may not be the same for all 16 images.

Also, torchvision.ops.batched_nms is different: it is for the case where there is more than one class in the image.

The ONNX implementation of NMS is generic enough; I wish torchvision had something similar.
There is the batched NMS described by @dishank-b, and there is per-class NMS, which is what torchvision.ops.batched_nms seems to do. The ONNX implementation supports both cases through one API.

@dishank-b @dashesy torchvision.ops.batched_nms can be used to perform NMS in a batch, but it doesn't matter if it is per class or per image actually. It just performs NMS independently "per category", which can mean image, class, etc.

The limitation with having NMS taking [batch_size, N, 4] inputs is that you need to set a max number of boxes, and pad the images that do not have enough boxes with something (like zeros). This is actually a subset of what torchvision.ops.batched_nms can do. Let me show you an example of how to implement this NMS with torchvision.ops.batched_nms:

import torch
import torchvision


def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    # one category id per box: the index of the image it belongs to
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    indices = indices_flat
    return indices

Let me know if you have questions
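To make the "independently per category" behaviour concrete, here is a plain-Python sketch of one way to obtain it from a single NMS pass: shift the boxes of each category into their own disjoint coordinate region so that boxes from different categories can never overlap. The thread does not say how torchvision implements this, so treat the offset trick, the helper names, and the assumption of non-negative coordinates as illustrative only.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Plain greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def per_category_nms(boxes, scores, categories, iou_threshold):
    """NMS where boxes only suppress boxes with the same category id.

    Assumes non-negative coordinates and integer category ids >= 0.
    """
    if not boxes:
        return []
    max_coord = max(max(box) for box in boxes)
    shifted = []
    for box, cat in zip(boxes, categories):
        offset = cat * (max_coord + 1)  # disjoint region per category
        shifted.append(tuple(c + offset for c in box))
    return greedy_nms(shifted, scores, iou_threshold)


boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (1, 1, 11, 11)]
scores = [0.9, 0.8, 0.7]
# Boxes 0 and 2 share category 0; box 1 is identical to box 0 but in category 1.
print(per_category_nms(boxes, scores, [0, 1, 0], 0.5))
```

Using the image index as the category id gives per-image NMS, and using the class id gives per-class NMS, which matches the description above.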

@fmassa Thanks, this seems to work when batching over either images or classes. However, what if we have both cases, i.e. batching over images as well as classes?
One possible solution I can think of is to create indices = torch.arange(batch_size*classes), where each element of indices represents one class in one image of the batch.
But this seems quite naive to me. Do you have any other way of doing the same?

@dishank-b in that case you can squash the 2 dimensions together with view:

boxes = boxes.view(-1, N, 4)

Since they are independent, it does not matter whether they come from a different image in the batch or from a different class.

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

@dashesy I am not sure that would work. Say there are two boxes in two different images but of the same category. If they overlap, they would be merged by the above method, which should not happen because they are from different images.

Let me know if I am wrong somewhere.

I think they will not be merged.
N is the maximum number of boxes per image per class. Now let's say you have 2 classes and 3 images; then you have boxes with shape 3x2xNx4. After doing the view you will have 6xNx4, and each entry along the first dimension will be processed independently.

@dashesy Yeah, you are right, I get it. I was kind of doing the same but using the indices. Thanks, your way is easier.
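The two approaches discussed above amount to the same thing: each (image, class) pair gets its own category id. A minimal sketch in plain Python, assuming a [batch_size, num_classes, N] layout that is flattened row by row (the function name `category_ids` is made up for the example):

```python
def category_ids(batch_size, num_classes, boxes_per_slot):
    """One category id per box for a flattened [batch, classes, N] layout."""
    ids = []
    for image in range(batch_size):
        for cls in range(num_classes):
            # A unique id for every (image, class) pair.
            ids.extend([image * num_classes + cls] * boxes_per_slot)
    return ids


# 3 images, 2 classes, 2 boxes per (image, class) slot -> 6 distinct ids.
print(category_ids(3, 2, 2))
```

Because the ids are distinct for every pair, two overlapping boxes of the same class in different images fall into different categories and do not suppress each other.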

@dashesy

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

This makes a few assumptions though, which might not always be desired. We need to either remove some boxes via a criterion (low scores? small boxes? something else?) or pad with zeros to the maximum size, which is not very convenient.

I considered those cases when implementing batched_nms and Faster R-CNN, and I came up with what we currently have, which seems like a good compromise on flexibility and ease-of-use.

@fmassa Is or will there be any implementation of Soft-NMS in pytorch too?

@fortunex3000

Is or will there be any implementation of Soft-NMS in pytorch too?

It's not in the plans for torchvision in the near future.

Soft-NMS is pretty straightforward. I might add it once academic winter break starts.
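For context, the Gaussian variant of Soft-NMS decays the scores of overlapping boxes instead of discarding them outright. The plain-Python sketch below is an illustration only, not a planned torchvision API; the function names and the default values of `sigma` and `score_threshold` are assumptions chosen for the example.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS; returns (kept indices, their decayed scores)."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # everything left scores even lower
        keep.append(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            # Higher overlap with the selected box means a stronger decay.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep, [scores[i] for i in keep]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
keep, new_scores = soft_nms(boxes, [0.9, 0.8, 0.7])
print(keep, new_scores)
```

In this example the heavily overlapping second box is not removed; it is only ranked below the non-overlapping third box.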

Is the documentation/code for torchvision.ops.boxes.batched_nms correct? The documentation suggests that NMS discards boxes with IoU < iou_threshold (i.e. it only discards boxes that don't overlap, which would normally be considered unique). My own testing showed that the function returns fewer boxes with a smaller threshold (0.0 gave 1 box per image) and returns all boxes with a threshold of 1.0. Is there something I am not understanding about the implementation or the documentation? The documented behaviour would be a non-standard choice, since you want to keep boxes with a small IoU (i.e. less overlap and more likely to be unique) and not discard them. See Figure 2 of the Soft-NMS paper, which compares NMS and Soft-NMS, for how it seems to me it should behave: http://www.cs.umd.edu/~bharat/snms.pdf

@rmcavoy there is an error in the documentation.
https://github.com/pytorch/vision/blob/c05da2a84c1239a9d57431cec7e9ed83931d478b/torchvision/ops/boxes.py#L27
should read

        boxes with IoU > iou_threshold

I've sent a PR fixing the documentation in https://github.com/pytorch/vision/pull/1614, thanks!

@fmassa how can I unflatten indices_flat back to [batch_size, N] if each image has a different number of valid boxes after NMS?

@Edwardmark can you open a new issue describing what you are trying to do?

@fmassa, I think @Edwardmark wants to ask how to unflatten the indices in the reply above, so that you get the detections for each image, given that you don't know the number of final detections in each image.

@fmassa @dishank-b Yes, and I figured it out. We can unflatten the indices based on the range each returned index falls into. For example, indices within range(0, N) belong to img1, and indices within range(N, 2N) belong to img2. Thank you all very much.
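The range-based unflattening described above can be written with integer division. This is a minimal plain-Python sketch, assuming every image contributed exactly N (possibly padded) boxes before flattening; `unflatten_keep` is a made-up name.

```python
def unflatten_keep(keep_flat, batch_size, boxes_per_image):
    """Group flat kept indices into per-image lists of local box indices."""
    per_image = [[] for _ in range(batch_size)]
    for k in keep_flat:
        # Flat index k lives in image k // N at local position k % N.
        image, box = divmod(k, boxes_per_image)
        per_image[image].append(box)
    return per_image


# 2 images with N = 4 boxes each; flat indices 0..3 are image 0, 4..7 image 1.
print(unflatten_keep([0, 5, 2, 7, 4], batch_size=2, boxes_per_image=4))
```

Each per-image list keeps the order of the flat result and may have a different length, so no padding is needed on the output side.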
