According to the inference speeds reported by maskrcnn-benchmark and Detectron, Mask R-CNN with an R-101-FPN backbone is about 29% slower here (0.15384 s vs 0.119 s per image). Moreover, the V100 is supposed to be ~20% faster than the P100. According to the results reported in TensorMask, Mask R-CNN with an R-101-FPN backbone runs at 90 ms on a V100.
What may be the reason?
One simple reason is that Detectron uses a fused kernel for AffineChannel, while we use a Python implementation (FrozenBatchNorm) that dispatches to two operations.
This op is used all over the backbone, so the overhead accumulates during inference. I suspect this is one of the biggest reasons.
I'm a little confused by the FrozenBatchNorm implementation here. I can't find where the running mean / running var are updated, and the weight/bias are not learnable parameters. During the forward pass it does nothing useful unless these buffers are loaded from pretrained weights. The AffineChannel op seems simple and only takes a few FLOPs; maybe it should be reimplemented if it's the bottleneck.
It's frozen because the stats are not learnable.
Sure, reimplementing it in C++/CUDA is possible, but it was not a priority, given that the op is so simple and would be a perfect fit for the JIT to optimise.
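For reference, a minimal sketch of what such a frozen BN looks like in PyTorch (modeled on the maskrcnn-benchmark style; details are illustrative, not the exact source):

```python
import torch
from torch import nn

class FrozenBatchNorm2d(nn.Module):
    """BatchNorm2d with fixed statistics and affine parameters.

    Everything is registered as a buffer, so nothing is learnable and the
    running stats are never updated; the values only change when pretrained
    weights are loaded into the buffers.
    """
    def __init__(self, num_channels):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_channels))
        self.register_buffer("bias", torch.zeros(num_channels))
        self.register_buffer("running_mean", torch.zeros(num_channels))
        self.register_buffer("running_var", torch.ones(num_channels))

    def forward(self, x):
        # scale and bias are recomputed on every forward call; the result
        # is applied as a multiply plus an add, i.e. the "two operations"
        # mentioned above, versus Detectron's single fused AffineChannel.
        scale = self.weight * self.running_var.rsqrt()
        bias = self.bias - self.running_mean * scale
        return x * scale.reshape(1, -1, 1, 1) + bias.reshape(1, -1, 1, 1)
```

Scripting the module, e.g. `torch.jit.script(FrozenBatchNorm2d(256))`, would let the fuser merge the pointwise multiply and add into a single kernel, which is the JIT optimisation referred to above.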
According to https://github.com/facebookresearch/maskrcnn-benchmark/issues/267#issuecomment-454039102
Why not precompute the scale and bias and register them as buffers after loading the weights?
That's what I initially did.
But in order to load models from torchvision (or other pre-trained classification models) without a pre-processing step to fuse those operations, I decided to keep some slight redundancy there, as it makes things overall simpler.
This way, there is no need for a separate pass to perform the BatchNorm replacement.
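For comparison, here is a sketch of the precompute variant discussed above (the class and method names are hypothetical): fold the frozen statistics into a single scale/bias pair once, after the weights are loaded, so the forward pass is a single affine transform.

```python
import torch
from torch import nn

class FusedAffine2d(nn.Module):
    """Affine channel op with scale/bias precomputed from frozen BN stats.

    Hypothetical sketch: fold weight/bias/running_mean/running_var into a
    single per-channel scale and shift once, instead of recomputing them
    on every forward pass.
    """
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.register_buffer("scale", torch.ones(num_channels))
        self.register_buffer("shift", torch.zeros(num_channels))
        self.eps = eps

    @torch.no_grad()
    def load_from_bn(self, weight, bias, running_mean, running_var):
        # Fold the BN statistics into a single affine transform:
        # y = (x - mean) / sqrt(var + eps) * weight + bias = x * scale + shift
        scale = weight * (running_var + self.eps).rsqrt()
        self.scale.copy_(scale)
        self.shift.copy_(bias - running_mean * scale)

    def forward(self, x):
        # A single multiply-add per element; no per-call rsqrt.
        return (x * self.scale.reshape(1, -1, 1, 1)
                + self.shift.reshape(1, -1, 1, 1))
```

The trade-off described above is that this variant needs an extra fusion pass whenever a stock torchvision checkpoint is loaded, whereas the slightly redundant two-op version can consume BatchNorm buffers directly.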
@fmassa,
I think the inference times in MODEL_ZOO are not accurate anymore either. The current maskrcnn-benchmark is actually 15~20% faster than it was; I think this is related to this update: https://github.com/pytorch/pytorch/pull/13420. For example:
| Model (Det) | MODEL_ZOO time | Re-evaluated on 1080 Ti |
| ------------- |:-------------:| -----:|
| R-50-FPN | 126 ms | 93 ms |
| R-101-FPN | 143 ms | 116 ms |
@chengyangfu definitely, the faster indexing brings quite some speedup to inference, and a bit to testing as well.
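As a rough illustration, masked selection of boxes is a typical indexing pattern in the detection inference path; a toy timing sketch (shapes, loop counts, and the assumption that this exact pattern benefits from the linked PR are mine, not from the issue):

```python
import time
import torch

# Boolean-mask indexing, as used when filtering boxes by score threshold
# or NMS keep masks during post-processing. Assumes a CUDA device.
boxes = torch.rand(10000, 4, device="cuda")
keep = torch.rand(10000, device="cuda") > 0.5

torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    kept = boxes[keep]  # advanced (masked) indexing on the GPU
torch.cuda.synchronize()
print(f"{time.time() - start:.3f}s for 1000 masked selections")
```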