Onnxruntime: [ONNXRuntimeError] TensorRT EP could not build Engine for fused node

Created on 3 Apr 2020 · 25 Comments · Source: microsoft/onnxruntime

Describe the bug
After successfully converting the model into ONNX format and successfully running symbolic_shape_infer.py script after the fix #3353, TRT engine build starts. Unfortunately, it throws these errors:

2020-04-01 13:27:24.033733574 [W:onnxruntime:Default, tensorrt_execution_provider.h:35 log] [2020-04-01 12:27:24 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.

2020-04-01 13:27:24.033755356 [W:onnxruntime:Default, tensorrt_execution_provider.h:35 log] [2020-04-01 12:27:24 ERROR] Network validation failed.

The messages above appear in the Jupyter notebook terminal, while the notebook itself shows this error:
EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build Engine for fused node: TensorrtExecutionProvider_TRTKernel_6_6.

Can someone help us with resolving this error?

Urgency
Urgent

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: 1.2.0
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0 /
  • GPU model and memory: GeForce 940MX / 4GB

To Reproduce
Model that is optimized and shape inferred can be found here: https://drive.google.com/open?id=1Rc4nXmLGMDmWlx-X_KtIN07FkMuNYyJ_

Expected behavior
Expecting that after the successful conversion and shape inference, the TRT engine will be successfully built.

TensorRT stale

All 25 comments

Any update on this issue? Thanks!

@stevenlix can you help take a look?

Hi, @stevenlix and @jywu-msft! Can you provide any updates on this issue? We would really need to resolve this, but we don't have an idea what's wrong...
Thanks!

Sorry for the late response. the link you provided didn't work.
but I was able to get the model from the link in the previous issue you filed about Pads.
The error
[2020-04-01 12:27:24 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
comes from tensorrt parser (onnx-tensorrt) during network validation, not onnxruntime so we're still trying to figure out what it means.

@jywu-msft Thank you very much for the update! Please update us when you have any news on this, it's really important for us to figure it out...

I got the same error on
(https://github.com/tensorflow/models/blob/v1.13.0/research/object_detection/models/ssd_mobilenet_v2_feature_extractor.py)

onnx-runtime-trt was built from master.

Looking forward to any update on that.

Furthermore, I get a lot of warnings on startup, which was not the case with version 1.2:
e.g.:
2020-04-22 12:12:28.078811862 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__618'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078820662 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Preprocessor/mul/x:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078831417 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'ConvBnFusion_BN_B_BoxPredictor_5/BoxEncodingPredictor_depthwise/BatchNorm/beta/read/_72__cf__72:0_139'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078841055 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'ConvBnFusion_W_const_fold_opt__947_148'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081366735 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
2020-04-22 12:12:28.081385383 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-22 12:12:28.081399211 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-22 12:12:28.081573111 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_fold_opt__971'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081585333 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/sub_5/x:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081593714 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_fold_opt__928'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081602793 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'largest_int_val__809'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081610546 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const__737'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081619974 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__785'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081630101 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/Select_1/e:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081639948 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/zeros_6/_423__cf__423:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081646234 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__697'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081656081 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'FeatureExtractor/MobilenetV2/expanded_conv_2/depthwise/Relu6_min__79'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:47.796120281 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:47 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.192992803 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.277763158 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.695730598 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:49.084085549 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:49 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.605779798 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.668077020 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.774508204 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:51.121557838 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:51.122854830 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
2020-04-22 12:12:51.122938498 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
2020-04-22 12:12:51.122960637 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Network validation failed.
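
With this many warnings, the two TensorRT errors at the end are easy to miss. A minimal sketch for pulling them out of a captured log follows. It assumes the line layout shown above (`tensorrt_execution_provider.h:NN log] [date time LEVEL] message`); the function name `tensorrt_messages` is made up for illustration and is not part of onnxruntime.

```python
import re

# Matches the TensorRT lines relayed by onnxruntime, e.g.
# "... tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] <message>"
_TRT_LINE = re.compile(
    r"tensorrt_execution_provider\.h:\d+ log\] \[\S+ \S+\s+(ERROR|WARNING)\] (.*)"
)


def tensorrt_messages(lines, level="ERROR"):
    """Return the TensorRT messages of the given level, in log order."""
    messages = []
    for line in lines:
        match = _TRT_LINE.search(line)
        if match and match.group(1) == level:
            messages.append(match.group(2).strip())
    return messages
```

Called on the log above with the default level, this returns only the `[Select]` shape-tensor message and `Network validation failed.`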

This required some fixes from nvidia for onnx-tensorrt project.
You built onnxruntime + tensorrt EP from source, right?
can you update the reference to onnx-tensorrt submodule?
i.e. git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt
and rebuild?

it should have the fix for
2020-04-22 12:12:51.122938498 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.

note we have not had time to test the model end to end, but wanted to give you an update asap.

@jywu-msft I followed your instructions and rebuilt onnxruntime and tried to run the model, but I get the same error. Could you please test it and confirm that it runs for you?

I won't have time to test this right now as we're busy with a release.
after
git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt

when you rebuild, you cannot use the --update option to build.sh
leave it out of the build.sh invocation.
otherwise, it will reset the onnx-tensorrt submodule to the previous state.
that is probably why you see the same error.

I tested your model ssd_mobilenet_v2_fpn_coco_v19.26032020_tensorrt.onnx and it went through ( I don't have data set so can't verify the accuracy though). Like we mentioned before, go to cmake/external/ and run: git submodule update --remote onnx-tensorrt (which will get you the latest parser fixes), then compile ORT with flag --skip_submodule_sync
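
The rebuild steps from the last few comments can be collected in a small script. This is only a sketch: `--skip_submodule_sync` and the submodule command come from the comments above, while the remaining build.sh flags (`--config`, `--use_tensorrt`, `--tensorrt_home`, `--cuda_home`, `--cudnn_home`) are my assumption of a typical TensorRT build and should be checked against the Dockerfile.tensorrt you build from. The function names and all paths are placeholders.

```python
import subprocess


def rebuild_commands(tensorrt_home, cuda_home, cudnn_home):
    """Return (submodule update command, build command) as argument lists."""
    # Pulls the latest onnx-tensorrt parser; meant to run inside cmake/external.
    update_parser = ["git", "submodule", "update", "--remote", "onnx-tensorrt"]
    # --update is deliberately left out: per the thread it resets the
    # submodule to its previous state. --skip_submodule_sync keeps the new one.
    build = [
        "./build.sh",
        "--config", "Release",
        "--use_tensorrt",
        "--tensorrt_home", tensorrt_home,
        "--cuda_home", cuda_home,
        "--cudnn_home", cudnn_home,
        "--skip_submodule_sync",
    ]
    return update_parser, build


def rebuild(repo_root, **homes):
    """Run both commands from an onnxruntime checkout at repo_root."""
    update_parser, build = rebuild_commands(**homes)
    subprocess.run(update_parser, cwd=f"{repo_root}/cmake/external", check=True)
    subprocess.run(build, cwd=repo_root, check=True)
```

For example, `rebuild("/path/to/onnxruntime", tensorrt_home="/path/to/tensorrt", cuda_home="/path/to/cuda", cudnn_home="/path/to/cudnn")`.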

Thank you very much, after rebuilding with --skip_submodule_sync, now we manage to pass the creation of ORT session without error, but now it throws:

2020-04-23 08:37:28.802814564 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException' what(): /media/ivan/storage/Development/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:107 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /media/ivan/storage/Development/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:101 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 77: an illegal memory access was encountered ; GPU=0 ; hostname=hi-tech ; expr=cudaEventDestroy(read_event_);.

What could cause an illegal memory access? I'm feeding a single image in tensor form with dimensions: (1,720,1280,3)

Thank you @jywu-msft and @stevenlix for looking into it. I rebuilt from source with the changes you suggested, but with a similar outcome to @qraleq's. The model was loaded successfully, but I encountered an illegal memory access as well.

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException' what(): /code/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /code/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=c091e53a040c ; expr=cudaEventDestroy(read_event_);

build based on the docker image in the current master:

root@f84e58608ad4:/usr/local# dpkg -l | grep nvinfer
ii  libnvinfer-bin              7.0.0-1+cuda10.2                  amd64        TensorRT binaries
ii  libnvinfer-dev              7.0.0-1+cuda10.2                  amd64        TensorRT development libraries and headers
ii  libnvinfer-doc              7.0.0-1+cuda10.2                  all          TensorRT documentation
ii  libnvinfer-plugin-dev       7.0.0-1+cuda10.2                  amd64        TensorRT plugin libraries
ii  libnvinfer-plugin7          7.0.0-1+cuda10.2                  amd64        TensorRT plugin libraries
ii  libnvinfer-samples          7.0.0-1+cuda10.2                  all          TensorRT samples
ii  libnvinfer7                 7.0.0-1+cuda10.2                  amd64        TensorRT runtime libraries
ii  python3-libnvinfer          7.0.0-1+cuda10.2                  amd64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev      7.0.0-1+cuda10.2                  amd64        Python 3 development package for TensorRT

build process executed with commands from Dockerfile.tensorrt adjusted with your notes

  • tested on Ubuntu 16.04 within docker
  • System has 4 1080-TI; tests on Quadro 6000 and T4 are possible and planned once the model is able to run with TRT (on Linux and Windows)

if the core dump is of any use I can provide it to you

Probably the build log is good enough to look at. Or can you share your model?

@stevenlix, I sent you an email with the model, samples and a run script. The model itself is nothing too fancy, a Mobilenet SSD on grayscale, but the samples shouldn't be public.

Here the content, except the download link:

It would be great if we could switch to the onnxruntime-trt execution provider with your assistance. As a fallback, the CUDA provider would be OK until this is fixed, but there some asymmetric padding forces some convolutions to run on the CPU execution provider. If you have any idea how to fix that without changing the model and retraining, that would be very helpful (changing the asymmetric paddings in the ONNX graph to symmetric ones gives an invalid output, of course, but a speedup of 4).

Hi @stevenlix, @jywu-msft,

do you have any update with the TRT conversion?

TLDR;

  • I did more testing, see the numbers below. They are not looking good: if neither the asymmetric padding problem nor TRT gets fixed, we will not be able to use onnx-runtime
  • I found a workaround for the asymmetric padding problem that keeps the processing on the GPU with CUDA. This brings a speedup of x2.5 but is still two times slower than the Tensorflow GPU runtime (without TRT)

Long Version:

  • The system on which we are evaluating onnxruntime runs Windows. Tensorflow has no (official) support for TensorRT on Windows, so onnxruntime looks promising: we could stay on Windows if the TRT integration works. If onnxruntime had performance comparable to the Tensorflow runtime and there were an outlook that TRT will work in the near future, that would be fine too. But as stated above, Tensorflow is twice as fast with this Mobilenet SSD V2 model (four times faster without hacking the graph), because onnxruntime falls back to the CPU for asymmetrically padded (strided) convolutions under the CUDA provider.

tested on a 2080-TI
Runtime | Batch size 1 | Batch size 4 | Note
------- | ------------ | ------------ | ----
Tensorflow-GPU (CUDA) | 7 ms | 14 ms |
onnxruntime CUDA (asym padding) | 27 ms | 96 ms |
onnxruntime CUDA (no asym padding) | 6 ms | 13 ms | padding changed to symmetric in the graph; gives an invalid output, shown only to compare speed once this is fixed
onnxruntime CUDA (add pre-padding nodes) | 11 ms | 44 ms | added padding nodes; warnings are gone and output is correct, but a lot slower, especially with batches

I like your clean, well-designed and simple API and hope that we can use onnxruntime in production, but for that I would need an outlook on when either asymmetric padding with CUDA or the TRT runtime will work. The problem is that if I have to give a negative evaluation, as the numbers currently suggest, this runtime and Windows will be off the table for a long time. I would therefore appreciate your help on this a lot!

sorry for the delay. we're in the midst of a release so have been very busy.
will sync with @stevenlix to see if we can find some spare cycles to take a closer look.
the asymmetric padding/fall back to CPU with onnxruntime CUDA has been a longstanding problem.
my understanding was that it was a limitation with cuDNN. It would be good if we can address with TRT (or take a look at how Tensorflow-GPU is handling it and do something similar)

Regarding TRT, I ran your model and saw the same issue you posted. Thanks for raising it. I will dig further into it as soon as I can.

Thanks for checking it guys, beer is on me if you come to Austria ;)!

Maybe this workaround is better than a prepadding operator.
https://github.com/microsoft/MMdnn/issues/153
Expand the padding and slice afterwards. If it performs well, this could also be implemented in the CUDA runtime, I guess.
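
To make the idea concrete, here is a 1-D pure-Python sketch of "overpad and slice". It is my own illustration, not the code from the linked MMdnn issue or from any converter in this thread. It assumes zero padding and picks a symmetric pad `p` that covers both original pads and exceeds the begin pad by a multiple of the stride, so the strided windows stay aligned; the function names are hypothetical.

```python
def conv1d(x, w, stride, pad_begin, pad_end):
    """Plain strided 1-D cross-correlation with zero padding."""
    xp = [0] * pad_begin + list(x) + [0] * pad_end
    n_out = (len(xp) - len(w)) // stride + 1
    return [
        sum(w[k] * xp[i * stride + k] for k in range(len(w)))
        for i in range(n_out)
    ]


def conv1d_overpad_slice(x, w, stride, pad_begin, pad_end):
    """Same result as conv1d, but using only a symmetric pad plus a slice."""
    # Smallest p >= both pads with (p - pad_begin) divisible by stride.
    p = pad_begin
    while p < pad_end:
        p += stride
    full = conv1d(x, w, stride, p, p)
    # Extra leading windows introduced by the larger begin pad.
    skip = (p - pad_begin) // stride
    # Number of outputs the asymmetric convolution would have produced.
    n_out = (len(x) + pad_begin + pad_end - len(w)) // stride + 1
    return full[skip:skip + n_out]


x, w = [1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3]
print(conv1d(x, w, stride=2, pad_begin=0, pad_end=1))                # [14, 26, 38, 23]
print(conv1d_overpad_slice(x, w, stride=2, pad_begin=0, pad_end=1))  # [14, 26, 38, 23]
```

The price is a few extra windows that are computed and then thrown away, in exchange for a convolution with symmetric padding only.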

I implemented a converter for the 'overpad and slice' idea from above. Works a lot faster and gets rid of the asym padding problem:

Times:
Batchsize 1: 6ms (speedup of 4.5)
Batchsize 4: 17ms (speedup of 5.6)
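
The quoted speedups match the "onnxruntime CUDA (asym padding)" row of the earlier table if that row is taken as the baseline, which is an assumption on my part. A quick check of the arithmetic:

```python
# Latencies in ms, copied from the benchmark table and the list above.
baseline_asym = {1: 27, 4: 96}   # onnxruntime CUDA (asym padding), assumed baseline
overpad_slice = {1: 6, 4: 17}    # overpad-and-slice converter
tensorflow_gpu = {1: 7, 4: 14}   # Tensorflow-GPU (CUDA)


def speedup(before_ms, after_ms):
    """How many times faster 'after' is than 'before'."""
    return before_ms / after_ms


for batch in (1, 4):
    vs_baseline = speedup(baseline_asym[batch], overpad_slice[batch])
    vs_tf = overpad_slice[batch] / tensorflow_gpu[batch]
    print(f"batch {batch}: {vs_baseline:.1f}x vs baseline, {vs_tf:.2f}x the TF latency")
```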

Still losing on bigger batches compared to TF, but the results are reasonable.

This method could be implemented in the CUDA executor on loading, instead of the warning and the CPU switch.

Hi, guys! Were you able to get a closer look at this issue?

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

I retested this issue. After changing the input from uint8 to int8 and running symbolic_shape_infer it ran with Tensor-RT.
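
The comment does not say how the input type was changed. One possible way, sketched here under that caveat, is to rewrite the element type of the graph inputs with the onnx Python package and then re-run symbolic_shape_infer on the result. The function names are hypothetical, and the numeric codes are the ONNX TensorProto values for UINT8 and INT8. Keep in mind that int8 only holds -128..127, so the data fed to the model has to fit that range.

```python
# ONNX TensorProto element-type codes (TensorProto.UINT8 / TensorProto.INT8).
UINT8, INT8 = 2, 3


def retype_uint8_inputs(graph):
    """Switch every uint8 graph input to int8; return the names changed."""
    changed = []
    for graph_input in graph.input:
        tensor_type = graph_input.type.tensor_type
        if tensor_type.elem_type == UINT8:
            tensor_type.elem_type = INT8
            changed.append(graph_input.name)
    return changed


def convert_file(src_path, dst_path):
    """Load an ONNX model, retype its uint8 inputs, and save a copy."""
    import onnx  # third-party package, imported only when files are touched

    model = onnx.load(src_path)
    changed = retype_uint8_inputs(model.graph)
    onnx.save(model, dst_path)
    return changed
```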

Speedup with a Nvidia T4 is ~2.0
Speedup with a Nvidia Quadro 6000 is ~1.3, which is quite strange, but I will investigate that in more detail.

When activating FP16 on the SSD model I get the following error (I will open another issue for that, since TRT in general is working):
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build engine for fused node: TensorrtExecutionProvider_TRTKernel_graph_tf2onnx_1_1
