Yolov5: ONNX Export as an INT32 matrix

Created on 1 Jul 2020 · 31 comments · Source: ultralytics/yolov5

🚀 Feature


When exporting an ONNX model, add the option to use INT32 matrices instead of INT64, as INT64 is not supported by OpenCV.

Motivation


I exported the model as an ONNX model, then I tried to import it in OpenCV 4.2.0 using cv2.dnn.readNetFromONNX(model_path) and I got the following error:

self.model = cv2.dnn.readNetFromONNX(model_path)
cv2.error: OpenCV(4.2.0) /io/opencv/modules/dnn/src/onnx/onnx_importer.cpp:101: error: (-211:One of the arguments' values is out of range) Input is out of OpenCV 32S range in function 'convertInt64ToInt32'

Pitch


I want a parameter when exporting to be able to select INT32 matrices.

Alternatives


An alternative solution could be to always use INT32 instead of INT64.

Additional context


https://github.com/opencv/opencv/issues/14830

Stale enhancement


All 31 comments

Hello @edurenye, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

@edurenye thanks for the comment. There is no specific requirement for int64 anywhere in yolov5, so I don't know exactly where these variables originate or why their datatype is as such. If you debug this and find a solution then please let us know.

A hint I have is that it probably has nothing to do with ONNX; it's probably a pytorch variable that is int64 for some reason and gets carried over into ONNX.

Thanks @glenn-jocher. My guess is that it comes from this issue https://github.com/pytorch/pytorch/issues/7870; we have to find where the LongTensor is being used and force it to int32 using dtype=torch.int, as specified in that thread.

Hi @glenn-jocher I tried to add the dtype=torch.int without luck, see https://github.com/ultralytics/yolov5/compare/master...edurenye:remove_int64
I still get:

Model Summary: 140 layers, 7.26e+06 parameters, 6.61683e+06 gradients
graph torch-jit-export (
  %images[FLOAT, 1x3x416x416]
) initializers (
  %483[INT64, 1]
  %484[INT64, 1]
  %485[INT64, 1]
  %486[INT64, 1]
  %487[INT64, 1]
  %488[INT64, 1]
  %model.0.conv.conv.bias[FLOAT, 32]
...

As in #225, I wonder where those parameters come from.

Hi @glenn-jocher
Our issue is the same as here https://github.com/pytorch/pytorch/issues/16218

So I used this code:

layer_names = []
for name, param_tensor in model.state_dict().items():
    if param_tensor.dtype == torch.int64:
        new_param = param_tensor.int()  # cast the INT64 buffer down to INT32
        rsetattr(model, name, new_param)  # recursive setattr over the dotted name
        layer_names.append(name)
print(layer_names)

To find the Tensors that are INT64 and transform them to INT32 in the export.py file.
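For reference, the rsetattr helper used above is not defined anywhere in this thread; a common implementation (an assumption on my part, not code from the repo) resolves the dotted name with functools.reduce:

```python
import functools

def rsetattr(obj, attr, val):
    """Set a dotted attribute path such as 'model.0.conv.bn.num_batches_tracked'."""
    pre, _, post = attr.rpartition('.')
    # walk down to the parent object, then set the final attribute on it
    target = functools.reduce(getattr, pre.split('.'), obj) if pre else obj
    setattr(target, post, val)
```

This also works on nn.Module hierarchies because getattr on a Sequential resolves numeric child names like '0'.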

And it gave me the following list of Tensors:

['model.0.conv.bn.num_batches_tracked', 'model.1.bn.num_batches_tracked', 'model.2.cv1.bn.num_batches_tracked', 
'model.2.cv4.bn.num_batches_tracked', 'model.2.bn.num_batches_tracked', 'model.2.m.0.cv1.bn.num_batches_tracked', 
'model.2.m.0.cv2.bn.num_batches_tracked', 'model.3.bn.num_batches_tracked', 'model.4.cv1.bn.num_batches_tracked', 
'model.4.cv4.bn.num_batches_tracked', 'model.4.bn.num_batches_tracked', 'model.4.m.0.cv1.bn.num_batches_tracked', 
'model.4.m.0.cv2.bn.num_batches_tracked', 'model.4.m.1.cv1.bn.num_batches_tracked', 
'model.4.m.1.cv2.bn.num_batches_tracked', 'model.4.m.2.cv1.bn.num_batches_tracked', 
'model.4.m.2.cv2.bn.num_batches_tracked', 'model.5.bn.num_batches_tracked', 'model.6.cv1.bn.num_batches_tracked', 
'model.6.cv4.bn.num_batches_tracked', 'model.6.bn.num_batches_tracked', 'model.6.m.0.cv1.bn.num_batches_tracked', 
'model.6.m.0.cv2.bn.num_batches_tracked', 'model.6.m.1.cv1.bn.num_batches_tracked', 
'model.6.m.1.cv2.bn.num_batches_tracked', 'model.6.m.2.cv1.bn.num_batches_tracked', 
'model.6.m.2.cv2.bn.num_batches_tracked', 'model.7.bn.num_batches_tracked', 'model.8.cv1.bn.num_batches_tracked', 
'model.8.cv2.bn.num_batches_tracked', 'model.9.cv1.bn.num_batches_tracked', 'model.9.cv4.bn.num_batches_tracked', 
'model.9.bn.num_batches_tracked', 'model.9.m.0.cv1.bn.num_batches_tracked', 'model.9.m.0.cv2.bn.num_batches_tracked', 
'model.10.bn.num_batches_tracked', 'model.13.cv1.bn.num_batches_tracked', 'model.13.cv4.bn.num_batches_tracked', 
'model.13.bn.num_batches_tracked', 'model.13.m.0.cv1.bn.num_batches_tracked', 
'model.13.m.0.cv2.bn.num_batches_tracked', 'model.14.bn.num_batches_tracked', 'model.17.cv1.bn.num_batches_tracked', 
'model.17.cv4.bn.num_batches_tracked', 'model.17.bn.num_batches_tracked', 'model.17.m.0.cv1.bn.num_batches_tracked', 
'model.17.m.0.cv2.bn.num_batches_tracked', 'model.19.bn.num_batches_tracked', 'model.21.cv1.bn.num_batches_tracked', 
'model.21.cv4.bn.num_batches_tracked', 'model.21.bn.num_batches_tracked', 'model.21.m.0.cv1.bn.num_batches_tracked', 
'model.21.m.0.cv2.bn.num_batches_tracked', 'model.23.bn.num_batches_tracked', 'model.25.cv1.bn.num_batches_tracked', 
'model.25.cv4.bn.num_batches_tracked', 'model.25.bn.num_batches_tracked', 'model.25.m.0.cv1.bn.num_batches_tracked', 
'model.25.m.0.cv2.bn.num_batches_tracked']

As you can see, the problem is num_batches_tracked as well as in this issue: https://github.com/pytorch/pytorch/issues/16218
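This is easy to confirm in isolation: every BatchNorm layer carries an INT64 num_batches_tracked buffer by default. A minimal sketch, assuming a stock PyTorch install:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(32)
# pytorch stores this running counter as int64 (torch.long)
print(bn.num_batches_tracked.dtype)  # torch.int64
```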

Looks like the transformation worked, because afterwards I checked the Tensors again and there were no INT64 Tensors, but when I did the export to ONNX I got the following error:

ONNX export failed.

Is there a way to have a more verbose output?

I found that I had something later that was breaking the code; I fixed it and now it 'works'.
Well, it exports it as ONNX, but I still get the same output, as if I had not done the transformation:

Fusing layers...
Model Summary: 140 layers, 7.26e+06 parameters, 6.61683e+06 gradients
graph torch-jit-export (
  %images[FLOAT, 1x3x416x416]
) initializers (
  %483[INT64, 1]
  %484[INT64, 1]
  %485[INT64, 1]
  %486[INT64, 1]
  %487[INT64, 1]
  %488[INT64, 1]
  %model.0.conv.conv.bias[FLOAT, 32]

Any further ideas?

@edurenye ok. These are just batch-norm statistics. It makes sense then that pytorch maintains them as int64's to decrease the chance of them overflowing. I would say since there is no ONNX error (the export process runs a full suite of checks), then this is just between ONNX and opencv.

https://github.com/ultralytics/yolov5/blob/dfd63de20a92152e53ed13346322fe91369de240/models/export.py#L49-L53

Well according to https://github.com/opencv/opencv/issues/14830#issuecomment-503279466 OpenCV tries to convert it to INT32 but it fails because it is out of the range.

Lots of people like me will try to use it on the edge with OpenCV and OpenVINO because of how light your model is, which makes it ideal for those use cases.

Can't I just remove those tensors somehow when exporting to ONNX?

@edurenye yes, I agree export should be easier, but this is the current state of affairs. The ONNX guys will say they are doing their job correctly, as will the opencv guys and the pytorch guys, and they are all technically correct since their responsibilities don't really extend past their packages, and all the packages are working correctly in their standalone capacities.

By the way, you can get a verbose export by setting the verbose flag:
https://github.com/ultralytics/yolov5/blob/e74ccb2985ea747e1d4a2d92cad5f4f7738fb54f/models/export.py#L42-L48

@edurenye just realized that there are only 6 int64's in your onnx export, but there are many more batchnorm values. The 6 int64's must originate somewhere else.

Yes @glenn-jocher, because the code should turn them into int32, but they are still there.

I used the debugger and found their origin: they are used in the 'Concat' operations. The following is the end of the graph:

  %468 : Tensor = onnx::Unsqueeze[axes=[0]](%459)
  %471 : Tensor = onnx::Unsqueeze[axes=[0]](%462)
  %472 : Tensor = onnx::Unsqueeze[axes=[0]](%465)
  %473 : Tensor = onnx::Concat[axis=0](%468, %480, %481, %471, %472)
  %474 : Float(1:89232, 3:29744, 11:2704, 52:52, 52:1) = onnx::Reshape(%384, %473) # /usr/src/app/models/yolo.py:26:0
  %475 : Float(1:89232, 3:29744, 52:572, 52:11, 11:1) = onnx::Transpose[perm=[0, 1, 3, 4, 2]](%474) # /usr/src/app/models/yolo.py:26:0
  return (%output, %456, %475)

The 'Concat' has 5 inputs: 3 are the outputs from the 'Unsqueeze' ops and the other 2 are these INT64 values. There are 3 of these blocks of layers, so that makes the 6 INT64 parameters.

Here is an image using Netron:
[screenshot: end_of_graph]

I guess this is part of the 'Detect' block, right?

So basically, I was not seeing the INT64s, because it is the transformation of the model to ONNX that generates them, if I understand this correctly. So I should somehow turn the INT64s into INT32s after exporting the model to ONNX. But I don't really know how to do that...

BTW the 2 INT64 values are the same in all 3 cases; their values are 3 and 11:
[screenshot: details_of_concat]

@edurenye I pushed a few updates to export.py today for improved introspection and incremented to opset 12. The initial int64's no longer appear in the results, though I believe the batchnorm int64's remain. You might want to git pull and see where the updates put you.

Thanks @glenn-jocher, but same result. I'll try a different approach: I'll try to use something different from OpenCV, but I don't really know how to use OpenVINO with anything else; that is why I wanted to use the model with OpenCV.

Thanks for your help, I think the problem is more on the plate of the OpenCV guys.

@edurenye oh that's too bad. My output looks like this now; the original list of 6 int64's no longer appears:

cd yolov5
export PYTHONPATH="$PWD"  # add path
python models/export.py --weights yolov5s.pt --img 640 --batch 1  # export

Output is:

Namespace(batch_size=1, img_size=[640, 640], weights='yolov5s.pt')
TorchScript export success, saved as yolov5s.torchscript  # <-------- TorchScript exports first

Fusing layers...
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients
graph torch-jit-export (
  %images[FLOAT, 1x3x640x640]
) initializers (
  %model.0.conv.conv.bias[FLOAT, 32]
  %model.0.conv.conv.weight[FLOAT, 32x12x3x3]
  %model.1.conv.bias[FLOAT, 64]
  %model.1.conv.weight[FLOAT, 64x32x3x3]
  %model.10.conv.bias[FLOAT, 256]
  %model.10.conv.weight[FLOAT, 256x512x1x1]
...
  %650 = Gather[axis = 0](%648, %649)
  %653 = Unsqueeze[axes = [0]](%644)
  %656 = Unsqueeze[axes = [0]](%647)
  %657 = Unsqueeze[axes = [0]](%650)
  %658 = Concat[axis = 0](%653, %665, %666, %656, %657)
  %659 = Reshape(%569, %658)
  %660 = Transpose[perm = [0, 1, 3, 4, 2]](%659)
  return %output, %641, %660
}
ONNX export success, saved as yolov5s.onnx  # <-------- ONNX exports second
View with https://github.com/lutzroeder/netron

Strange, I tried again and I still have them :thinking:
What types do the Tensors %665 and %666 have now in your exported model?

I'm using docker, but inside I'm using my local code as a volume (so I can have the latest version from master). Might it have something to do with the container, the versions of something? I'm building the Dockerfile from this repo; I just uncommented this line:

#RUN pip install -r requirements.txt

Why is it commented in the Dockerfile?

I don't have much to add unfortunately, but I'm having the same issue: running in Google Colab and getting the same error while importing into OpenCV.

@TrInsanity are you seeing the same output as https://github.com/ultralytics/yolov5/issues/250#issuecomment-653102358 at least?


Namespace(batch_size=1, img_size=[640, 640], weights='./weights/last_weights.pt')
TorchScript export success, saved as ./weights/last_weights.torchscript
Fusing layers...
Model Summary: 236 layers, 4.74077e+07 parameters, 4.48868e+07 gradients
graph torch-jit-export (
  %images[FLOAT, 1x3x640x640]
) initializers (
  %model.0.conv.conv.bias[FLOAT, 64]
  %model.0.conv.conv.weight[FLOAT, 64x12x3x3]
  %model.1.conv.bias[FLOAT, 128]
  %model.1.conv.weight[FLOAT, 128x64x3x3]
  %model.10.conv.bias[FLOAT, 512]
  %model.10.conv.weight[FLOAT, 512x1024x1x1]
  ...
  %output = Transpose[perm = [0, 1, 3, 4, 2]](%649)
  %651 = Shape(%606)
  %652 = Constant[value = <Scalar Tensor []>]()
  %653 = Gather[axis = 0](%651, %652)
  %654 = Shape(%606)
  %655 = Constant[value = <Scalar Tensor []>]()
  %656 = Gather[axis = 0](%654, %655)
  %657 = Shape(%606)
  %658 = Constant[value = <Scalar Tensor []>]()
  %659 = Gather[axis = 0](%657, %658)
  %660 = Constant[value = <Scalar Tensor []>]()
  %661 = Constant[value = <Scalar Tensor []>]()
  %662 = Unsqueeze[axes = [0]](%653)
  %663 = Unsqueeze[axes = [0]](%660)
  %664 = Unsqueeze[axes = [0]](%661)
  %665 = Unsqueeze[axes = [0]](%656)
  %666 = Unsqueeze[axes = [0]](%659)
  %667 = Concat[axis = 0](%662, %663, %664, %665, %666)
  %668 = Reshape(%606, %667)
  %669 = Transpose[perm = [0, 1, 3, 4, 2]](%668)
  %670 = Shape(%581)
  %671 = Constant[value = <Scalar Tensor []>]()
  %672 = Gather[axis = 0](%670, %671)
  %673 = Shape(%581)
  %674 = Constant[value = <Scalar Tensor []>]()
  %675 = Gather[axis = 0](%673, %674)
  %676 = Shape(%581)
  %677 = Constant[value = <Scalar Tensor []>]()
  %678 = Gather[axis = 0](%676, %677)
  %679 = Constant[value = <Scalar Tensor []>]()
  %680 = Constant[value = <Scalar Tensor []>]()
  %681 = Unsqueeze[axes = [0]](%672)
  %682 = Unsqueeze[axes = [0]](%679)
  %683 = Unsqueeze[axes = [0]](%680)
  %684 = Unsqueeze[axes = [0]](%675)
  %685 = Unsqueeze[axes = [0]](%678)
  %686 = Concat[axis = 0](%681, %682, %683, %684, %685)
  %687 = Reshape(%581, %686)
  %688 = Transpose[perm = [0, 1, 3, 4, 2]](%687)
  return %output, %669, %688
}
ONNX export success, saved as ./weights/last_weights.onnx
View with https://github.com/lutzroeder/netron

This is the output from the export command. Sorry, I'm not sure what I'm looking for, but it looks similar to https://github.com/ultralytics/yolov5/issues/250#issuecomment-653102358

Edit: There are a few examples of lines with INT64, all num_batches_tracked:
%model.9.bn.num_batches_tracked[INT64, scalar]

@TrInsanity looks good, you are seeing the same thing then. The values you see there are batchnorm statistics which pytorch tracks in int64's to reduce the risk of overflow from very high numbers. @edurenye had a loop above to change these to int32 that may or may not address your problem.

After using export.py to get the .onnx file, I use

model = cv2.dnn.readNetFromONNX(….onnx)

and this error happens:

cv2.error: OpenCV(4.2.0) /io/opencv/modules/dnn/src/onnx/onnx_importer.cpp:134: error: (-215:Assertion failed) !field.empty() in function 'getMatFromTensor'

I'm having the same issues. @edurenye what did you have as the rsetattr function?

Hi @THINK989, that error looks completely unrelated to the one in this issue, as the INT64 error happens in the 'Concat' layer.

Hello @edurenye

Yes, sorry about that. I got confirmation from OpenVino that the model is unsupported. I have deleted the comment.

Any updates on this issue or how to resolve them? I'm having the exact same problem.

It will be great if this is resolved; this is the only good option to run Yolov5 on the edge.

@edurenye Did you manage to solve this issue?

No, I just avoided exporting in my project and had to use PyTorch on the edge, which was not nice, but it is what I had to do.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

