Onnxruntime: [ONNXRuntimeError] TensorRT EP could not build Engine for fused node

Created on 3 Apr 2020 · 25 Comments · Source: microsoft/onnxruntime

Describe the bug
After successfully converting the model to ONNX format and successfully running the symbolic_shape_infer.py script (with the fix from #3353), the TRT engine build starts. Unfortunately, it throws these errors:

2020-04-01 13:27:24.033733574 [W:onnxruntime:Default, tensorrt_execution_provider.h:35 log] [2020-04-01 12:27:24 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.

2020-04-01 13:27:24.033755356 [W:onnxruntime:Default, tensorrt_execution_provider.h:35 log] [2020-04-01 12:27:24 ERROR] Network validation failed.

The errors above are shown in the Jupyter notebook terminal, while the following error is shown in the notebook itself:
EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build Engine for fused node: TensorrtExecutionProvider_TRTKernel_6_6.

Can someone help us with resolving this error?

Urgency
Urgent

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
ONNX Runtime installed from (source or binary): binary
ONNX Runtime version: 1.2.0
Python version: 3.6
CUDA/cuDNN version: 10.0 /
GPU model and memory: GeForce 940MX / 4GB

To Reproduce
Model that is optimized and shape inferred can be found here: https://drive.google.com/open?id=1Rc4nXmLGMDmWlx-X_KtIN07FkMuNYyJ_

Expected behavior
Expecting that after the successful conversion and shape inference, the TRT engine will be successfully built.

TensorRT stale

Source

qraleq

All 25 comments

Any update on this issue? Thanks!

qraleq on 7 Apr 2020

@stevenlix can you help take a look?

jywu-msft on 14 Apr 2020

Hi, @stevenlix and @jywu-msft! Can you provide any updates on this issue? We really need to resolve this, but we have no idea what's wrong...
Thanks!

qraleq on 17 Apr 2020

Sorry for the late response. The link you provided didn't work,
but I was able to get the model from the link in the previous issue you filed about Pads.
The error
[2020-04-01 12:27:24 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
comes from the TensorRT parser (onnx-tensorrt) during network validation, not from onnxruntime, so we're still trying to figure out what it means.

jywu-msft on 20 Apr 2020

@jywu-msft Thank you very much for the update! Please update us when you have any news on this, it's really important for us to figure it out...

qraleq on 20 Apr 2020

I got the same error on
(https://github.com/tensorflow/models/blob/v1.13.0/research/object_detection/models/ssd_mobilenet_v2_feature_extractor.py)

onnx-runtime-trt was built from master.

Looking forward to any update on that.

Furthermore, I get a lot of warnings on startup, which was not the case with version 1.2,
e.g.:
2020-04-22 12:12:28.078811862 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__618'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078820662 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Preprocessor/mul/x:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078831417 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'ConvBnFusion_BN_B_BoxPredictor_5/BoxEncodingPredictor_depthwise/BatchNorm/beta/read/_72__cf__72:0_139'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078841055 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'ConvBnFusion_W_const_fold_opt__947_148'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081366735 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
2020-04-22 12:12:28.081385383 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-22 12:12:28.081399211 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-22 12:12:28.081573111 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_fold_opt__971'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081585333 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/sub_5/x:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081593714 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_fold_opt__928'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081602793 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'largest_int_val__809'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081610546 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const__737'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081619974 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__785'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081630101 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/Select_1/e:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081639948 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/zeros_6/_423__cf__423:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081646234 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__697'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081656081 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'FeatureExtractor/MobilenetV2/expanded_conv_2/depthwise/Relu6_min__79'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:47.796120281 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:47 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.192992803 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.277763158 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.695730598 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:49.084085549 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:49 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.605779798 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.668077020 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.774508204 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:51.121557838 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:51.122854830 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
2020-04-22 12:12:51.122938498 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
2020-04-22 12:12:51.122960637 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Network validation failed.

joba01 on 22 Apr 2020

This required some fixes from NVIDIA for the onnx-tensorrt project.
You built onnxruntime + TensorRT EP from source, right?
Can you update the reference to the onnx-tensorrt submodule,
i.e. git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt
and rebuild?

jywu-msft on 22 Apr 2020

This required some fixes from NVIDIA for the onnx-tensorrt project.
You built onnxruntime + TensorRT EP from source, right?
Can you update the reference to the onnx-tensorrt submodule,
i.e. git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt
and rebuild?

It should have the fix for
2020-04-22 12:12:51.122938498 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.

Note that we have not had time to test the model end to end, but we wanted to give you an update ASAP.

jywu-msft on 22 Apr 2020

@jywu-msft I followed your instructions and rebuilt onnxruntime and tried to run the model, but I get the same error. Could you please test it and confirm that it runs for you?

qraleq on 22 Apr 2020

I won't have time to test this right now as we're busy with a release.
After
git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt

when you rebuild, you cannot use the --update option to build.sh.
Leave it out of the build.sh invocation.
Otherwise, it will reset the onnx-tensorrt submodule to the previous state.
That is probably why you see the same error.

jywu-msft on 23 Apr 2020

I tested your model ssd_mobilenet_v2_fpn_coco_v19.26032020_tensorrt.onnx and it went through (I don't have the data set, so I can't verify the accuracy though). As we mentioned before, go to cmake/external/ and run: git submodule update --remote onnx-tensorrt (which will get you the latest parser fixes), then compile ORT with the flag --skip_submodule_sync
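
The two steps above can be written down as a small script so the order is explicit. This is only a sketch: `rebuild_steps` and all paths are hypothetical, and the build options other than `--skip_submodule_sync` (`--config`, `--use_tensorrt`, `--tensorrt_home`, `--cuda_home`, `--cudnn_home`) are assumptions about a typical TensorRT build that do not come from this thread.

```python
from pathlib import Path


def rebuild_steps(ort_root, tensorrt_home, cuda_home, cudnn_home):
    """Return (working_dir, argv) pairs for the rebuild described above."""
    root = Path(ort_root)
    return [
        # 1. Move the onnx-tensorrt submodule to its latest upstream commit.
        (root / "cmake" / "external",
         ["git", "submodule", "update", "--remote", "onnx-tensorrt"]),
        # 2. Rebuild with --skip_submodule_sync so the build does not reset
        #    the submodule. The other options are assumed, not from the thread.
        (root,
         ["./build.sh", "--config", "Release", "--use_tensorrt",
          "--tensorrt_home", tensorrt_home,
          "--cuda_home", cuda_home,
          "--cudnn_home", cudnn_home,
          "--skip_submodule_sync"]),
    ]


if __name__ == "__main__":
    # Hypothetical paths; adjust to the local installation.
    steps = rebuild_steps("/path/to/onnxruntime", "/path/to/tensorrt",
                          "/usr/local/cuda", "/usr/local/cuda")
    for cwd, argv in steps:
        print(f"(cd {cwd} && {' '.join(argv)})")
```

The script only prints the commands; to run them, each pair could be passed to `subprocess.run(argv, cwd=cwd, check=True)`.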

stevenlix on 23 Apr 2020

Thank you very much. After rebuilding with --skip_submodule_sync, we now get past the creation of the ORT session without error, but it then throws:

2020-04-23 08:37:28.802814564 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException' what(): /media/ivan/storage/Development/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:107 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /media/ivan/storage/Development/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:101 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 77: an illegal memory access was encountered ; GPU=0 ; hostname=hi-tech ; expr=cudaEventDestroy(read_event_);.

What could cause an illegal memory access? I'm feeding a single image in tensor form with dimensions: (1,720,1280,3)

qraleq on 23 Apr 2020

Thank you @jywu-msft and @stevenlix for looking into it. I rebuilt from source with the changes you suggested, but with a similar outcome to @qraleq's. The model was loaded successfully, but I encountered an illegal memory access as well.

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException' what(): /code/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /code/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=c091e53a040c ; expr=cudaEventDestroy(read_event_);

The build is based on the Docker image in the current master:

base container: nvcr.io/nvidia/tensorrt:20.01-py3

root@f84e58608ad4:/usr/local# dpkg -l | grep nvinfer
ii  libnvinfer-bin              7.0.0-1+cuda10.2                  amd64        TensorRT binaries
ii  libnvinfer-dev              7.0.0-1+cuda10.2                  amd64        TensorRT development libraries and headers
ii  libnvinfer-doc              7.0.0-1+cuda10.2                  all          TensorRT documentation
ii  libnvinfer-plugin-dev       7.0.0-1+cuda10.2                  amd64        TensorRT plugin libraries
ii  libnvinfer-plugin7          7.0.0-1+cuda10.2                  amd64        TensorRT plugin libraries
ii  libnvinfer-samples          7.0.0-1+cuda10.2                  all          TensorRT samples
ii  libnvinfer7                 7.0.0-1+cuda10.2                  amd64        TensorRT runtime libraries
ii  python3-libnvinfer          7.0.0-1+cuda10.2                  amd64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev      7.0.0-1+cuda10.2                  amd64        Python 3 development package for TensorRT

The build process was executed with the commands from Dockerfile.tensorrt, adjusted with your notes.

Tested on Ubuntu 16.04 within Docker.
The system has four 1080 Ti GPUs; tests on a Quadro 6000 and a T4 are possible and planned once the model is able to run with TRT (on Linux and Windows).

joba01 on 23 Apr 2020

If the core dump is of any use, I can provide it to you.

joba01 on 23 Apr 2020

Probably the build log is good enough to look at. Or can you share your model?

stevenlix on 24 Apr 2020

@stevenlix, I sent you an email with the model, samples and a run script. The model itself is nothing too fancy, a Mobilenet SSD on grayscale, but the samples shouldn't be public.

Here is the content, except for the download link:

It would be great if we could switch to the onnxruntime-trt execution provider with your assistance. As a fallback, the CUDA provider would be OK until this is fixed, but there some asymmetric padding forces some convolutions to run on the CPU execution provider. If you have any idea how to fix that without changing the model and retraining, that would be very helpful. (When I edit the ONNX graph and change the asymmetric paddings to symmetric ones, the model of course produces an invalid output, but it has a speedup of 4.)
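
The fallback order described above (TensorRT first, then CUDA, then CPU) can be sketched as follows. This is an illustration only: `pick_providers` and `make_session` are hypothetical helper names, and passing `providers=` to `InferenceSession` assumes a more recent onnxruntime release than the 1.2.0 used in this issue.

```python
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available):
    """Keep the preferred order, dropping providers this build lacks."""
    return [name for name in PREFERRED if name in available]


def make_session(model_path):
    """Create a session using whichever preferred providers are present.

    Assumes an onnxruntime build that accepts the `providers` argument.
    """
    # Imported lazily so pick_providers can be used without onnxruntime.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Note that this only sets the priority order. It does not change which nodes a provider accepts, so convolutions with asymmetric padding would still be assigned to the CPU provider as described above.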

joba01 on 24 Apr 2020

Hi @stevenlix, @jywu-msft,

do you have any update with the TRT conversion?

TL;DR:

I did more testing (see the numbers below) and it is not looking good: if neither the asymmetric padding problem nor the TRT issue gets fixed, we will not be able to use onnxruntime.
I found a workaround for the asymmetric padding problem that keeps the processing on the GPU with CUDA. This brings a speedup of 2.5x but is still two times slower than the TensorFlow GPU runtime (without TRT).

Long Version:

The system on which we are evaluating onnxruntime runs Windows. TensorFlow has no (official) support for TensorRT on Windows, so onnxruntime looks promising as a way to stay on Windows if the TRT integration works. If onnxruntime had performance comparable to the TensorFlow runtime and there were a prospect of TRT working in the near future, that would be fine too. But as stated above, TensorFlow is twice as fast with this Mobilenet SSD V2 model (four times faster without hacking the graph), because onnxruntime falls back to the CPU for asymmetrically padded (strided) convolutions under CUDA.

Tested on a 2080 Ti:
Runtime | Batch size 1 | Batch size 4 | Note
------- | ------------ | ------------ | ----
Tensorflow-GPU (CUDA) | 7 ms | 14 ms |
onnxruntime CUDA (asym padding) | 27 ms | 96 ms |
onnxruntime CUDA (no asym padding) | 6 ms | 13 ms | Padding in the graph changed to be symmetric; gives an invalid output, shown only as a speed comparison for when this is fixed
onnxruntime CUDA (add pre-padding nodes) | 11 ms | 44 ms | Added padding nodes; warnings are gone and the output is correct, but it is a lot slower, especially for batches

I like your clean, well-designed and simple API and hope that we can use onnxruntime in production, but for that I would need an outlook on when either asymmetric padding with CUDA or the TRT runtime will work. The problem is that if I have to give it a negative evaluation, as the numbers currently suggest, this runtime and Windows will be off the table for a long time. Therefore I would appreciate your help on this a lot!

joba01 on 29 Apr 2020

Sorry for the delay. We're in the midst of a release, so we have been very busy.
I will sync with @stevenlix to see if we can find some spare cycles to take a closer look.
The asymmetric padding/fallback to CPU with onnxruntime CUDA has been a longstanding problem.
My understanding was that it was a limitation of cuDNN. It would be good if we could address it with TRT (or take a look at how Tensorflow-GPU is handling it and do something similar).

jywu-msft on 30 Apr 2020

Regarding TRT, I ran your model and saw the same issue you posted. Thanks for raising it. I will dig further into it as soon as I can.

stevenlix on 30 Apr 2020

Thanks for checking it guys, beer is on me if you come to Austria ;)!

Maybe this workaround is better than a pre-padding operator:
https://github.com/microsoft/MMdnn/issues/153
Expand the padding and slice afterwards. If it performs well, this could also be rewritten in the CUDA runtime, I guess.
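
To make the "expand the padding and slice afterwards" idea concrete, here is a minimal 1-D sketch in plain Python. It is not the code from the linked issue; `conv1d` and `overpad_and_slice` are hypothetical helpers, and the rule for choosing the symmetric pad (large enough for both sides, and shifted by a whole number of strides) is one possible way to keep the output positions aligned.

```python
def conv1d(x, kernel, stride, pad_begin, pad_end):
    """Zero-padded 1-D cross-correlation over a list of numbers."""
    padded = [0.0] * pad_begin + list(x) + [0.0] * pad_end
    last_start = len(padded) - len(kernel)
    return [
        sum(w * padded[start + i] for i, w in enumerate(kernel))
        for start in range(0, last_start + 1, stride)
    ]


def overpad_and_slice(x, kernel, stride, pad_begin, pad_end):
    """Same result as conv1d, but using symmetric padding plus a slice."""
    # Smallest symmetric pad covering both sides whose extra leading zeros
    # shift the window grid by a whole number of strides.
    pad = max(pad_begin, pad_end)
    while (pad - pad_begin) % stride != 0:
        pad += 1
    full = conv1d(x, kernel, stride, pad, pad)
    # Number of outputs the asymmetric convolution would have produced.
    n_out = (len(x) + pad_begin + pad_end - len(kernel)) // stride + 1
    # Outputs computed purely from the extra leading padding are dropped.
    skip = (pad - pad_begin) // stride
    return full[skip:skip + n_out]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 0.5, 0.25]
# Asymmetric pads (0, 1) with stride 2, as in TensorFlow-style SAME padding.
assert conv1d(x, kernel, 2, 0, 1) == overpad_and_slice(x, kernel, 2, 0, 1)
```

The over-padded convolution computes a few outputs that are thrown away, so the trade-off is a little extra work in exchange for a padding layout that is symmetric.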

joba01 on 30 Apr 2020

I implemented a converter for the 'overpad and slice' idea from above. It works a lot faster and gets rid of the asymmetric padding problem:

Times:
Batch size 1: 6 ms (speedup of 4.5)
Batch size 4: 17 ms (speedup of 5.6)

Still losing on bigger batches compared to TF, but the results are reasonable.
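
The quoted speedups are consistent with the "onnxruntime CUDA (asym padding)" row of the earlier table (27 ms and 96 ms) being the baseline. A quick check under that assumption, with the TensorFlow timings added for the batch comparison:

```python
# Timings in milliseconds, copied from the figures quoted in this thread.
baseline_ms = {1: 27, 4: 96}       # onnxruntime CUDA (asym padding), assumed baseline
overpad_slice_ms = {1: 6, 4: 17}   # 'overpad and slice' converter
tensorflow_ms = {1: 7, 4: 14}      # Tensorflow-GPU (CUDA)

for batch in (1, 4):
    speedup = baseline_ms[batch] / overpad_slice_ms[batch]
    vs_tf = overpad_slice_ms[batch] / tensorflow_ms[batch]
    print(f"batch {batch}: {speedup:.1f}x faster than baseline, "
          f"{vs_tf:.2f}x the TensorFlow time")
```

A ratio above 1 in the last column means the converted model is slower than TensorFlow for that batch size.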

This method could be implemented in the CUDA executor on loading, instead of the warning and the CPU switch.

joba01 on 4 May 2020

Hi, guys! Were you able to get a closer look at this issue?

qraleq on 20 May 2020

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

stale[bot] on 13 Sep 2020

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

stale[bot] on 20 Sep 2020

I retested this issue. After changing the input from uint8 to int8 and running symbolic_shape_infer, the model ran with TensorRT.

Speedup with an Nvidia T4 is ~2.0.
Speedup with an Nvidia Quadro 6000 is ~1.3, which is quite strange, but I will investigate that in more detail.

When activating FP16 on the SSD model I get the following error (I will open another issue for that; TRT in general is working):
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build engine for fused node: TensorrtExecutionProvider_TRTKernel_graph_tf2onnx_1_1

joba01 on 4 Nov 2020
