Vision: Non-Maximum Supression on the GPU

Created on 17 Jan 2018 · 60Comments · Source: pytorch/vision

Is there any interest in an NMS layer that runs on the GPU for torchvision? I have one implemented; it gives a 1-2 order of magnitude speedup over a naive version composed from pytorch ops. Would be happy to contribute it if anyone's interested.

enhancement help wanted

Source

dssa56

👍13

Most helpful comment

Hey @varunagrawal

Now that we've released torchvision 0.2.2, I'll be (slowly) integrating all the changes that I've been discussing for a while, including merging NMS (and other layers) in torchvision.

Sorry for the wait!

fmassa on 9 Mar 2019

🎉7

All 60 comments

that's definitely interesting. We should add it into pytorch, until we figure out ATen extensions.
For example like how we added ROIPooling: https://github.com/pytorch/pytorch/pull/3672

soumith on 17 Jan 2018

👍4

Great, I'll read over that PR for starters.

dssa56 on 17 Jan 2018

This is all ready now. Where in the Python API would you like it go? I actually have two implementations ready; one is quicker in principle but it requires allocating global memory to store intermediate results, and freeing the memory at the end of each call completely kills performance. Is there anything one can do about this? If not, the other implementation (which doesn't allocate any intermediate global memory) is still fine and gives the kind of speedup I mentioned anyway, so it's no big deal.

dssa56 on 20 Feb 2018

🎉3

Hi, is this going to be added soon? (I'm asking since pytorch/pytorch/#5404 is closed now)

aosokin on 13 Mar 2018

Yes, I think we should add efficient NMS in here, following the Cpp extensions that were recently added.
Because Cpp extensions are only present in master, I'd wait until 0.4.0 is released before merging any C++ code into torchvision.
How does that sound?

fmassa on 14 Mar 2018

👍3 🎉1

@fmassa That's fine anyway.
Btw, is 0.4 coming soon? :-)

aosokin on 14 Mar 2018

Yes, once https://github.com/pytorch/pytorch/issues/5673 is addressed

fmassa on 14 Mar 2018

Thanks!

aosokin on 14 Mar 2018

Whats the current status of this issue?

ruotianluo on 26 Apr 2018

@ruotianluo I'm drafting the layers part of torchvision in https://github.com/pytorch/vision/tree/layers
I've added a CPU version of NMS, and I'll look into porting one of the following versions: from py-faster-rcnn, from this PR and a few others.
Have you tried any of those?

fmassa on 26 Apr 2018

When I build my faster-rcnn project, I ported the nms from py-faster-rcnn; the main reason is I'm not familiar with cuda code and it's simpler to just copy.
With regard to https://github.com/pytorch/pytorch/pull/5404/files, it's using a full byte to save a boolean value(mask), while py-faster-rcnn uses each bit. (I can't comment on anything else, 'cause as I said I'm not familiar.)

ruotianluo on 27 Apr 2018

👍2

Thanks, I'll take that into account when porting NMS to GPU.

fmassa on 27 Apr 2018

On the other hand, the py-faster-rcnn version throws data back and fourth between GPU and CPU, which may not be ideal. Also it requires a workspace of size O(n_boxes^2), and dynamically allocating/freeing this messes with performance.

dssa56 on 27 Apr 2018

We can use the PyTorch caching allocator to handle memory allocation (which will save us from the sync points from freeing the memory).
I'll have a closer look at both implementations this weekend, but thanks for the feedbacks!

fmassa on 27 Apr 2018

👍1

Cool, in that case you might also be interested in the implementation here, https://github.com/dssa56/projects/tree/master/nms under cuda_nms_workspace.cu. That uses the same technique as py-faster-rcnn, but keeps all data on the GPU. I didn't use it in the PR because of the workspace, but if that's a non-issue then it's potentially worth a look.

dssa56 on 27 Apr 2018

👍3

Has the current version of Torchvision the GPU-based NMS?

ahkarami on 1 Jul 2018

No, it doesn't.
I have it implemented locally (based on the implementations that were shared previously), but I haven't pushed it yet.

fmassa on 3 Jul 2018

@fmassa,
Ok. Thank you for your reply. I hope the PyTorch team push it as soon as possible.

ahkarami on 3 Jul 2018

@fmassa Can I do you a favor if you need for implementing the GPU-based NMS ? I'm also working on it. I'm also using the GPU-based RoI align and RoI pool from your torchvision branch.

Sucran on 27 Jul 2018

I have a version of it that I adapted from https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/nms/nms_kernel.cu
But because pytorch is so rapidly evolving, it has already been broken by pytorch refactoring at least 3 times.
I'll hold on before pushing it to torchvision because I expect those breakage to still happen.
I can push the current version I have in a gist though, if you want to try it out

fmassa on 27 Jul 2018

@fmassa Thank you. Actually, I prefer to try it out, then consider how to fix the breakage in my production env. Looking forward to the gist version.

Sucran on 27 Jul 2018

@Sucran here is the implementation https://gist.github.com/fmassa/cf9ab87e4bd71e849655d5abb8cfd6c6
It was working, but I'm not sure anymore.
Also, it required PyTorch compiled from source

fmassa on 27 Jul 2018

FYI, we have released our implementation of {Faster, Mask} R-CNN in https://github.com/facebookresearch/maskrcnn-benchmark , which contains a CPU and CUDA implementation for Non Maximum suppression.

I suggest we move this discussion there for now.

fmassa on 25 Oct 2018

@fmassa, Why close this issue though?
Many people use torchvision, but do not care about or need the maskrcnn repository.
I still think having a pure cuda implementation of NMS in torch/torchvision would be highly appreciated by lots of people!

0phoff on 25 Oct 2018

I'm curious in which cases do people use NMS outside of object detection?
But at some point we might indeed add layers to torchvision, it's just that packaging torchvision with those is a bit more involved so I'll let it for when the Cpp extensions API is more stabilized.

But I'm opening it anyway as a reminder

fmassa on 25 Oct 2018

Similar use case, but I've been using it for detecting events in audio.

macaodha on 25 Oct 2018

I'm curious in which cases do people use NMS outside of object detection?
But at some point we might indeed add layers to torchvision, it's just that packaging torchvision with those is a bit more involved so I'll let it for when the Cpp extensions API is more stabilized.

But I'm opening it anyway as a reminder

Thanks!
I indeed need it for object detection myself, but I have developed my own framework for single-pass detection networks (eg. Yolo), so I don't need all the extra things from the rcnn repo.
I am hesitating of just stealing your nms implementation for my own repo, but atm my repo is pure python and doesn't need any compiling, which is quite handy for deployment across multiple platforms.

I might just wait it out, as I have a pytorch implementation that works quite well, but is just doing a small part on CPU.

0phoff on 26 Oct 2018

@0phoff there are still several corner cases for users when compiling extension libraries, see for example https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/TROUBLESHOOTING.md for a gist.

So we will be adding C++ extensions to torchvision once we start shipping pre-compiled binaries for it, or else it will introduce a lot of friction to the users.

fmassa on 26 Oct 2018

👍4 🎉1

@fmassa Sorry for reopen this issue, if adding fast nms code into torchvision is not suitable for now. Then how to build a cpp extension though?

I just got some nms C and Cuda codes (I thought most people are sharing the same), question is how to adding it into python and integrate with torch? Here is what I got now:

import os
import torch
from torch.utils.cpp_extension import BuildExtension, CppExtension, CUDAExtension
from setuptools import setup
from torch.utils.cpp_extension import CUDA_HOME


#
# ffi = cpp_extension(
#     '_ext.nms',
#     headers=headers,
#     sources=sources,
#     define_macros=defines,
#     relative_to=__file__,
#     with_cuda=with_cuda,
#     extra_objects=extra_objects,
#     extra_compile_args=['-std=c99']
# )


def get_extensions():
    sources = ['src/nms.c']
    headers = ['src/nms.h']
    defines = []
    with_cuda = False

    if torch.cuda.is_available():
        print('Including CUDA code.')
        sources += ['src/nms_cuda.c']
        headers += ['src/nms_cuda.h']
        defines += [('WITH_CUDA', None)]
        with_cuda = True

    this_file = os.path.dirname(os.path.realpath(__file__))
    print(this_file)
    extra_objects = ['src/cuda/nms_kernel.cu.o']
    extra_objects = [os.path.join(this_file, fname) for fname in extra_objects]

    extra_compile_args = {"cxx": []}
    define_macros = []
    extension = CppExtension

    if torch.cuda.is_available() and CUDA_HOME is not None:
        extension = CUDAExtension
        define_macros += [("WITH_CUDA", None)]
        extra_compile_args["nvcc"] = [
            "-DCUDA_HAS_FP16=1",
            "-D__CUDA_NO_HALF_OPERATORS__",
            "-D__CUDA_NO_HALF_CONVERSIONS__",
            "-D__CUDA_NO_HALF2_OPERATORS__",
        ]

    ext_modules = [
        extension(
            "_ext.nms",
            sources,
            include_dirs=headers,
            define_macros=define_macros,
            extra_compile_args=extra_compile_args,
            extra_objects=extra_objects
        )
    ]

    return ext_modules


setup(
    # install_requires=requirements,
    ext_modules=get_extensions(),
    cmdclass={"build_ext": torch.utils.cpp_extension.BuildExtension},
)

Seems the ways is not correct beacuse it says:

lib/nms/_ext/nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: state

jinfagang on 8 Nov 2018

@jinfagang the easiest to do for now is to ask the users to install maskrcnn-benchmark, and then you can just do

from maskrcnn_benchmark.layers import nms

fmassa on 27 Nov 2018

@fmassa I added the GPU based NMS as a PR a while ago. :)

Apparently, NMS is also used in Keypoint detection such as in the OpenPose paper.

varunagrawal on 9 Mar 2019

Hey @varunagrawal

Now that we've released torchvision 0.2.2, I'll be (slowly) integrating all the changes that I've been discussing for a while, including merging NMS (and other layers) in torchvision.

Sorry for the wait!

fmassa on 9 Mar 2019

🎉7

@fmassa thats great thank you, I was just extracting NMS out of maksrcnn_benchmark for my use (https://github.com/xvdp/torchvision_extra) As soon as you put it in Ill use torchvision instead.

xvdp on 11 Mar 2019

👍2

@fmassa sounds good. Let me know if you need me to do anything else. I can do all the rebases etc., whenever you're ready.

varunagrawal on 11 Mar 2019

https://github.com/pytorch/vision/pull/826 has been merged and adds support for NMS on both CPU and GPU

fmassa on 7 May 2019

👍1

FYI, this is located at torchvision.ops.nms.

neighthan on 11 Jul 2019

👍3

It would be nice to mention the format for the boxes in the docstring, though. Is it [y1, x1, y2, x2] for some diagonal corners, as in TensorFlow? Or perhaps just top-left and bottom-right corners?

neighthan on 11 Jul 2019

It should be [x1, y1, x2, y2] but adding this to the docstring makes a lot of sense.

varunagrawal on 11 Jul 2019

@fmassa @neighthan I've made the changes in #1110. It should be quick to review since it is mostly doc changes.

varunagrawal on 11 Jul 2019

👍1

@neighthan torchvision.ops.nms seems to also support CUDA? the only issue is that it does not support batch numbers, that would be another dimension to benefit from CUDA parallelisation.
I have this nms implementation that does and I use it to apply NMS on multiple images at once, each independently.

dashesy on 1 Aug 2019

👍1

@varunagrawal @fmassa Can anyone help me what should be used now to run NMS (cuda) in batch mode while training? The NMS implementation provided by torchvision.ops is not in Batch mode and will be slow to use at training time.

dishank-b on 23 Oct 2019

@dishank-b what do you mean by batch mode?

Also note that we have a batched_nms implementation in torch.ops.batched_nms.

fmassa on 23 Oct 2019

@fmassa By batch mode, I mean taking input in the form of [batch_size, N, 4]. The current torchvision.ops.nms takes the boxes in form [N,4] where N is the number of boxes in single Image. Whereas let's say during training Faster RCNN if your batch_size is 16, then we would like the NMS to run for those 16 images in a batch. Although there is a problem that N may not be the same for all 16 images.

Also, the torch.ops.batched_nms is different, it for the case when there is more than 1 classes to in the image.

dishank-b on 24 Oct 2019

ONNX implementation of NMS is generic enough, I wish torchvision had something similar.
There is the batched-nms as described by @dishank-b and there is per-class NMS that is what torch.ops.batched_nms seems to do. ONNX implementation supports both cases using one API.

dashesy on 24 Oct 2019

@dishank-b @dashesy torchvision.ops.batched_nms can be used to perform NMS in a batch, but it doesn't matter if it is per class or per image actually. It just performs NMS independently "per category", which can mean image, class, etc.

The limitation with having NMS taking [batch_size, N, 4] inputs is that you need to set a max number of boxes, and pad the images that do not have enough boxes with something (like zeros). This is actually a subset of what torchvision.ops.batched_nms can do. Let me show you an example of how to implement this NMS with torchvision.ops.batched_nms:

def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    indices = indices_flat
    return indices

Let me know if you have questions

fmassa on 24 Oct 2019

👍2

@fmassa Thanks, this seems to work with batch either in images or class. However what if we have both the cases i.e do the image batching as well as classes?
I can think of a possible solution that we create indices = torch.arange(batch_size*classes) where each element of the indices will represent each class in each image of the batch.
But this seems quite naive to me, do you have any other way of doing the same?

dishank-b on 25 Oct 2019

@dishank-b in that case you can squash the 2 dimensions together with view

boxes = boxes.view(-1, N, 4)

Since they are independent, it would not matter if they are from different batch, or different class then.

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

dashesy on 25 Oct 2019

@dashesy I am not sure that would work because let's say there are two boxes in two different images but of the same category, now if they are overlapping then they would be merged using above method, which should not be the case as they are from different images.

Let me know if I am wrong somewhere.

dishank-b on 25 Oct 2019

I think they will not be merged.
N is the maximum number of boxes per-image-per-class. Now lets say you have 2 classes and 3 images then you have boxes with shape 3x2xNx4 after doing the view you will have 6xNx4 and first dimension will be done independently.

dashesy on 25 Oct 2019

@dashesy Yeah, you are right, I get it. I was kind of doing the same but using the indices. Thanks your's way is easier.

dishank-b on 25 Oct 2019

@dashesy

Still would be nice if nms had a max_boxes_per_batch parameter to avoid the zero padding.

This makes a few assumptions though, which might not always be desired. We need to either remove some boxes via a criterion (low scores? small boxes? something else?) or pad with zeros to the maximum size, which is not very convenient.

I considered those cases when implementing batched_nms and Faster R-CNN, and I came up with what we currently have, which seems like a good compromise on flexibility and ease-of-use.

fmassa on 25 Oct 2019

@fmassa Is or will there be any implementation of Soft-NMS in pytorch too?

MarkusAmann on 12 Nov 2019

@fortunex3000

Is or will there be any implementation of Soft-NMS in pytorch too?

it's not in the plans for torchvision in the near future

fmassa on 13 Nov 2019

Soft-NMS is pretty straightforward. I might add it once academic winter break starts.

varunagrawal on 13 Nov 2019

Is the documentation/code for torchvision.ops.boxes.batched_nms correct? The documentation suggests that the nms iou threshold discards boxes with iou < iou threshold (i.e. it only discards boxes that don't overlap which would normally be considered unique). My own testing of the function showed that the function does give fewer boxes with a smaller threshold (0.0 gave 1 box per image) and gives all boxes back with a threshold of 1.0. Is there something I am not understanding about the implementation or the documentation? This seems to be a non-standard choice as you should want to keep boxes with a small iou (i.e. less overlap and more likely to be non unique) and not discard them. See the soft-nms paper Figure 2 comparing NMS and soft-NMS for how it seems to me like it should behave. http://www.cs.umd.edu/~bharat/snms.pdf

rmcavoy on 24 Nov 2019

@rmcavoy there is an error in the documentation.
https://github.com/pytorch/vision/blob/c05da2a84c1239a9d57431cec7e9ed83931d478b/torchvision/ops/boxes.py#L27
should read

        boxes with IoU > iou_threshold

I've sent a PR fixing the documentation in https://github.com/pytorch/vision/pull/1614, thanks!

fmassa on 26 Nov 2019

@fmassa how can I flatten back the indices_flat to [batch_size, N] if each image has different number of valid boxes after nms.

Edwardmark on 7 Jan 2020

@Edwardmark can you open a new issue describing what you are trying to do?

fmassa on 8 Jan 2020

@dishank-b @dashesy torchvision.ops.batched_nms can be used to perform NMS in a batch, but it doesn't matter if it is per class or per image actually. It just performs NMS independently "per category", which can mean image, class, etc.

The limitation with having NMS taking [batch_size, N, 4] inputs is that you need to set a max number of boxes, and pad the images that do not have enough boxes with something (like zeros). This is actually a subset of what torchvision.ops.batched_nms can do. Let me show you an example of how to implement this NMS with torchvision.ops.batched_nms:
def batched_nms(boxes, scores, iou_threshold):
    # boxes is a [batch_size, N, 4] tensor, and scores a
    # [batch_size, N] tensor.
    batch_size, N, _ = boxes.shape
    indices = torch.arange(batch_size, device=boxes.device)
    indices = indices[:, None].expand(batch_size, N).flatten()
    boxes_flat = boxes.flatten(0, 1)
    scores_flat = scores.flatten()
    indices_flat = torchvision.ops.boxes.batched_nms(
        boxes_flat, scores_flat, indices, iou_threshold)
    # now reshape the indices as you want, maybe
    # projecting back to the [batch_size, N] space
    # I'm omitting this here
    indices = indices_flat
    return indices
Let me know if you have questions

@fmassa, I think @Edwardmark wants to ask how to unflatten the indices in above reply, so that you get the detections for each image, given you don't know the number of final detections in each image.

dishank-b on 8 Jan 2020

@fmassa @dishank-b Yes, and I figure it out. We can unflatten the indices based on the returned indices range. For example, indices within range(0, N) belongs to img1, and indices within range(N, 2N) belongs to img2. Thank you all very much.

Edwardmark on 9 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings