Vision: The exported Mask-RCNN ONNX model is not correct

Created on 13 Jan 2020 · 11 comments · Source: pytorch/vision

Bugs in the exported ONNX model

I was under the impression that the Mask-RCNN-to-ONNX path had already been pipe-cleaned, according to the comments, and I saw that torchvision has test code here that covers different parts of Mask-RCNN. But when I try exporting the model myself, the resulting ONNX model only "works" on the image that I used to export it. If I feed another image with a different size, onnxruntime fails with an error like this:

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/pytoch_onnx/tests/test_onnx_rpn_filter_proposals.py", line 156, in test_rpn_head_anchor_generator_filter_proposal
    "feat_pool"   : features_g["pool"]
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 142, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,179046} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991

----------------------------------------------------------------------

Environment

Reproduction

Code

The script I used to test the Mask-RCNN ONNX export: gist

Steps

1. Put these two files under the references/detection directory
2. Run python test_onnx_export.py

Error Message

You will see that when we validate the ONNX model on the COCO validation set with onnxruntime, only the first image passes. The second image fails with an error like the one below:

Score batch 0 start
Score batch 0 finish
Test:  [   0/5000]  eta: 9:22:21  model_time: 6.4258 (6.4258)  evaluator_time: 0.1232 (0.1232)  time: 6.7483  data: 0.1986
Score batch 1 start
2020-01-13 04:59:12.794733672 [E:onnxruntime:, sequential_executor.cc:183 Execute] Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,340176} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991
Traceback (most recent call last):
  ....
  File "/pytoch_onnx/engine.py", line 150, in evaluate_onnx
    ort_output = ort_session.run(None, ort_image)
  File "/opt/conda/lib/python3.7/site-packages/onnxruntime/capi/session.py", line 142, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Split node. Name:'' Status Message: Cannot split using values in 'split' attribute. Axis=1 Input shape={1,340176} NumOutputs=5 Num entries in 'split' (must equal number of outputs) was 5 Sum of sizes in 'split' (must equal size of selected axis) was 242991

More information

Given that Mask-RCNN's input images come in different sizes, I believe more errors like this are hiding in the exported model.
Here is what I found when I tried to separate out the RPN (region proposal network) module and debug the export process:

  1. The strides in the anchor generator's forward function are created as a list of Python ints, so when exporting they are traced as constants in the ONNX graph. But these values should vary from image to image.
  2. The image_size is used to clip the bboxes, but from the log, ONNX also treats these values as constants:
      %1106 : Float(),
      %1107 : Float(),
      %1108 : Float(),
      %1109 : Float()):
...
  %977 : Float(4741, 2) = onnx::Clip(%967, %1106, %1107) # /opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py:115:0
  %982 : Float(4741, 2) = onnx::Clip(%972, %1108, %1109) # /opt/conda/lib/python3.7/site-packages/torchvision/ops/boxes.py:116:0
  3. The num_anchors_per_level in the RPN forward is also traced as constants. The exported Split op uses num_anchors_per_level as its attribute; this is the root cause of the "Split Error" I mentioned above. num_anchors_per_level should not be a constant: different image sizes give different values (see the toy sketch after the log excerpt below).
  %ob.1 : Float(1, 182400), %ob.2 : Float(1, 45600), %ob.3 : Float(1, 11400), %ob.4 : Float(1, 2850), %ob : Float(1, 741) = onnx::Split[axis=1, split=[182400, 45600, 11400, 2850, 741]](%811)
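
A minimal toy sketch of the mechanism (an illustrative module, not the torchvision code): under tracing, shape-derived Python ints are evaluated eagerly, so they end up as fixed attributes in the exported graph.

import io
import torch

class SplitByShape(torch.nn.Module):
    def forward(self, x):
        # Under tracing, x.shape[1] is a plain Python int, so the exported
        # Split op records it as a fixed attribute for the example input.
        half = x.shape[1] // 2
        return torch.split(x, half, dim=1)

onnx_io = io.BytesIO()
torch.onnx.export(SplitByShape(), torch.rand(1, 8), onnx_io, opset_version=11)
# Running the exported model on, say, a (1, 10) input fails in onnxruntime
# with the same "Cannot split using values in 'split' attribute" message.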

It seems to me that we're still far from a working Mask-RCNN ONNX model; a few more bugs are hiding down there. Do we have plans to fix these issues?

Labels: bug, models, onnx, object detection

Most helpful comment

Hi, @fmassa

I have created an initial PR for the mrcnn's region proposal network. See #1749.

All 11 comments

Okay, the clip-bbox bug turns out to be even stranger. The min/max arguments of torch.clamp are always traced as constants, even if I pass a tensor to them.

The script I used to test the clamp function

import torch
import io
import onnxruntime

def onnx_print_io_meta(ort_session):

    print("\n==ort inputs==")
    for ort_inputs_meta in ort_session.get_inputs():
        # print(dir(ort_inputs_meta))
        print(ort_inputs_meta.name)
        print(ort_inputs_meta.shape)
        print(ort_inputs_meta.type)
        print()

    print("\n==ort outputs==")
    for ort_outputs_meta in ort_session.get_outputs():
        # print(dir(ort_outputs_meta))
        print(ort_outputs_meta.name)
        print(ort_outputs_meta.shape)
        print(ort_outputs_meta.type)
        print()

class ClipMod(torch.nn.Module):

    def __init__(self):

        super(ClipMod, self).__init__()

    def forward(self, box, upper_bound):
        print("forward, upper:", upper_bound)
        return box.clamp(max=upper_bound)
        # return torch.min(box, upper_bound)


@torch.no_grad()
def test():

    clip_mod = ClipMod()
    clip_mod.eval()

    i = torch.tensor([0.1, 1.0, 1.1, 1.2])
    u = torch.tensor(1.)
    o = clip_mod(i, u)

    print("==torch==")
    print("i", i)
    print("o", o)

    onnx_io = io.BytesIO()
    torch.onnx.export(clip_mod,
                      (i, u),
                      onnx_io,
                      do_constant_folding = True,
                      opset_version = 11,
                      verbose = True,
                      input_names  = ['input', 'upper_bound'],
                      output_names = ['output'])

    ort_session = onnxruntime.InferenceSession(onnx_io.getvalue())
    onnx_print_io_meta(ort_session)

    i_ort = i.numpy()
    u_ort = u.numpy()  # the scalar upper-bound tensor
    print("==onnxruntime upper_bound 1.==")
    o_ort = ort_session.run(None, {
        "input"       : i_ort,
        "upper_bound" : u_ort
    })
    print(o_ort)

    # o_ort = ort_session.run(None, {
    #     "input"       : i_ort,
    # })
    # print("==onnxruntime==")
    # print(o_ort)

    print("==onnxruntime upper_bound 1.1==")
    u_ort = torch.tensor(1.1).numpy()
    o_ort = ort_session.run(None, {
        "input"       : i_ort,
        "upper_bound" : u_ort
    })
    print(o_ort)

if __name__ == "__main__":
    test()

If we use torch.clamp, the upper_bound is not traced; the resulting ONNX session has only the single input input.
If we use torch.min, we get two inputs, input and upper_bound, and everything works as expected.
So is this a feature or a bug in the ONNX exporter? Did I miss something in the above script?
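
For what it's worth, one possible workaround in model code (a sketch under the assumption that the bound is available as a tensor at export time; clamp_dynamic is a made-up helper, not a torchvision API):

import torch

def clamp_dynamic(x, upper_bound):
    # torch.clamp traces its min/max arguments as constants, so during
    # ONNX export fall back to torch.min against a tensor bound; the
    # bound then stays part of the graph instead of being baked in.
    if torch.onnx.is_in_onnx_export():
        return torch.min(x, upper_bound)
    return x.clamp(max=upper_bound)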

Thanks a lot for the very detailed bug report!

Let me have a closer look at the issues you are pointing out and get back to you.

cc @lara-hdr for visibility

cc @fmassa
I've created the PR for ONNX to address the issue with split:
onnx/onnx#2544

Sorry, I'm not sure I understand the PR correctly; why would the "zero-length split" change fix the "constant-length split" issue I mentioned in this thread?

Hi, I found ways to work around (WAR) the issues in the region proposal network (rpn.py). Currently I only focus on the batch size = 1 case, so dynamic_axes is set to height and width, and batch_size is not considered. I also saw an unscripted for loop that depends on the number of images in the batch, so more effort is needed to enable dynamic_axes for batch_size as well.
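
For context, dynamic_axes tells the exporter which input/output dimensions may vary between runs; a toy sketch (the model and axis names here are illustrative, not the actual workaround):

import io
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return x * 2

onnx_io = io.BytesIO()
torch.onnx.export(
    Toy(), torch.rand(1, 3, 800, 800), onnx_io, opset_version=11,
    input_names=["image"], output_names=["out"],
    # Mark height/width as dynamic; the batch axis stays fixed at 1,
    # matching the batch size = 1 restriction above.
    dynamic_axes={"image": {2: "height", 3: "width"},
                  "out": {2: "height", 3: "width"}})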

These are just quick, ugly workarounds; I'm sure you have a better solution.

I have tested the exported backbone + RPN subnetwork with 5 batches of data.

For the WARs in rpn, see here

For the corresponding tests, see here

@shaoboc-lmf thanks for the patches!

They look generally good, but some work would still be needed to ensure that TorchScript is supported as well.
This would involve separating out the ONNX-specific code paths into functions and annotating them with @torch.jit.unused.
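
For illustration, the pattern would look roughly like this (a sketch; the helper name _onnx_num_anchors_per_level and its body are illustrative, not the actual torchvision code):

import torch
from torch.onnx import operators

@torch.jit.unused
def _onnx_num_anchors_per_level(objectness):
    # Traced-only path: derive each level's anchor count with tensor ops
    # (shape_as_tensor) so the sizes stay dynamic in the exported graph;
    # @torch.jit.unused tells the TorchScript compiler to skip this body.
    return [operators.shape_as_tensor(o)[1:].prod(dtype=torch.int64)
            for o in objectness]

def num_anchors_per_level(objectness):
    if torch.onnx.is_in_onnx_export():
        return _onnx_num_anchors_per_level(objectness)
    # Eager/TorchScript path: plain Python ints are fine here.
    return [o[0].numel() for o in objectness]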

Would you be able to send an initial PR? It would be very helpful!

> cc @fmassa
> I've created the PR for ONNX to address the issue with split:
> onnx/onnx#2544
>
> Sorry, I'm not sure I understand the PR correctly; why would the "zero-length split" change fix the "constant-length split" issue I mentioned in this thread?

Sorry, my bad. This is the wrong thread.

Hi, @fmassa

I have created an initial PR for the mrcnn's region proposal network. See #1749.

Any update on this issue? I have the same problem.

@jinh574 this is still being worked on

@fmassa @shaoboc-lmf Updates to Mask RCNN are merged. Can you verify if we can close this?

I'm closing this following @neginraoof's comment that this should be fixed in torchvision master, but please let us know if you still face issues.
