According to the inference speeds reported by maskrcnn-benchmark and Detectron, Mask R-CNN with an R-101-FPN backbone is about 29% slower here (0.15384 s vs 0.119 s per image). Moreover, the V100 is supposed to be ~20% faster than the P100. According to the results reported in TensorMask, Mask R-CNN with an R-101-FPN backbone runs at 90 ms on a V100.
What may be the reason?
One simple reason is that Detectron uses a fused kernel for AffineChannel, while we use a Python implementation (FrozenBatchNorm) that dispatches to two operations.
This op is used all over the backbone, so the overhead accumulates during inference. I suspect this is one of the biggest reasons.
I'm a little confused by the FrozenBatchNorm implementation here. I can't find where the running mean / running var are updated, and the weight/bias are not learnable parameters. During the forward pass it does nothing useful unless these buffers are loaded from pretrained weights. The AffineChannel op seems simple and only takes a few FLOPs; maybe it should be reimplemented if it's the bottleneck.
It's frozen because the stats are not learnable.
Sure, reimplementing it in C++/CUDA is possible, but it was not a priority, given that the op is so simple and would be a perfect fit for the JIT to optimise.
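For reference, a minimal sketch of what such a frozen BN looks like in PyTorch (modeled on the maskrcnn-benchmark style; details are illustrative, not the exact source):

```python
import torch
from torch import nn

class FrozenBatchNorm2d(nn.Module):
    """BatchNorm2d with fixed statistics and affine parameters.

    Everything is registered as a buffer, so nothing is learnable and the
    running stats are never updated; the values only change when pretrained
    weights are loaded into the buffers.
    """
    def __init__(self, num_channels):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_channels))
        self.register_buffer("bias", torch.zeros(num_channels))
        self.register_buffer("running_mean", torch.zeros(num_channels))
        self.register_buffer("running_var", torch.ones(num_channels))

    def forward(self, x):
        # scale and bias are recomputed on every forward call; the result
        # is applied as a multiply plus an add, i.e. the "two operations"
        # mentioned above, versus Detectron's single fused AffineChannel.
        scale = self.weight * self.running_var.rsqrt()
        bias = self.bias - self.running_mean * scale
        return x * scale.reshape(1, -1, 1, 1) + bias.reshape(1, -1, 1, 1)
```

Scripting the module, e.g. `torch.jit.script(FrozenBatchNorm2d(256))`, would let the fuser merge the pointwise multiply and add into a single kernel, which is the JIT optimisation referred to above.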
According to https://github.com/facebookresearch/maskrcnn-benchmark/issues/267#issuecomment-454039102
Why not precompute the scale and bias and register them as buffers after loading the weights?
That's what I initially did.
But in order to load models from torchvision (or other pre-trained classification models) without a pre-processing step to fuse those operations, I decided to keep some slight redundancy there, as it makes things overall simpler.
This way, there is no need for a separate pass to perform the BatchNorm replacement.
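For comparison, here is a sketch of the precompute variant discussed above (the class and method names are hypothetical): fold the frozen statistics into a single scale/bias pair once, after the weights are loaded, so the forward pass is a single affine transform.

```python
import torch
from torch import nn

class FusedAffine2d(nn.Module):
    """Affine channel op with scale/bias precomputed from frozen BN stats.

    Hypothetical sketch: fold weight/bias/running_mean/running_var into a
    single per-channel scale and shift once, instead of recomputing them
    on every forward pass.
    """
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.register_buffer("scale", torch.ones(num_channels))
        self.register_buffer("shift", torch.zeros(num_channels))
        self.eps = eps

    @torch.no_grad()
    def load_from_bn(self, weight, bias, running_mean, running_var):
        # Fold the BN statistics into a single affine transform:
        # y = (x - mean) / sqrt(var + eps) * weight + bias = x * scale + shift
        scale = weight * (running_var + self.eps).rsqrt()
        self.scale.copy_(scale)
        self.shift.copy_(bias - running_mean * scale)

    def forward(self, x):
        # A single multiply-add per element; no per-call rsqrt.
        return (x * self.scale.reshape(1, -1, 1, 1)
                + self.shift.reshape(1, -1, 1, 1))
```

The trade-off described above is that this variant needs an extra fusion pass whenever a stock torchvision checkpoint is loaded, whereas the slightly redundant two-op version can consume BatchNorm buffers directly.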
@fmassa,
I think the inference times in MODEL_ZOO are not accurate anymore either. The current maskrcnn-benchmark is actually 15~20% faster than it was; I think this is related to this update: https://github.com/pytorch/pytorch/pull/13420. For example:
| Model (Det) | MODEL_ZOO time | Re-evaluated on 1080 Ti |
| ------------- |:-------------:| -----:|
| R-50-FPN | 126 ms | 93 ms |
| R-101-FPN | 143 ms | 116 ms |
@chengyangfu definitely, the faster indexing brings quite some speedup to inference, and a bit to testing as well.
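As a rough illustration, masked selection of boxes is a typical indexing pattern in the detection inference path; a toy timing sketch (shapes, loop counts, and the assumption that this exact pattern benefits from the linked PR are mine, not from the issue):

```python
import time
import torch

# Boolean-mask indexing, as used when filtering boxes by score threshold
# or NMS keep masks during post-processing. Assumes a CUDA device.
boxes = torch.rand(10000, 4, device="cuda")
keep = torch.rand(10000, device="cuda") > 0.5

torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    kept = boxes[keep]  # advanced (masked) indexing on the GPU
torch.cuda.synchronize()
print(f"{time.time() - start:.3f}s for 1000 masked selections")
```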