Describe the bug
After successfully converting the model to ONNX format and running the symbolic_shape_infer.py script (with the fix from #3353), the TRT engine build starts. Unfortunately, it throws these errors:
2020-04-01 13:27:24.033733574 [W:onnxruntime:Default, tensorrt_execution_provider.h:35 log] [2020-04-01 12:27:24 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
2020-04-01 13:27:24.033755356 [W:onnxruntime:Default, tensorrt_execution_provider.h:35 log] [2020-04-01 12:27:24 ERROR] Network validation failed.
The error message above is shown in the jupyter-notebook terminal, while the following error is shown in the notebook:
EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build Engine for fused node: TensorrtExecutionProvider_TRTKernel_6_6.
Can someone help us with resolving this error?
Urgency
Urgent
System information
To Reproduce
Model that is optimized and shape inferred can be found here: https://drive.google.com/open?id=1Rc4nXmLGMDmWlx-X_KtIN07FkMuNYyJ_
Expected behavior
Expecting that after the successful conversion and shape inference, the TRT engine will be successfully built.
Any update on this issue? Thanks!
@stevenlix can you help take a look?
Hi, @stevenlix and @jywu-msft! Can you provide any updates on this issue? We really need to resolve this, but we have no idea what's wrong...
Thanks!
Sorry for the late response. The link you provided didn't work,
but I was able to get the model from the link in the previous issue you filed about Pads.
The error
[2020-04-01 12:27:24 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
comes from the TensorRT parser (onnx-tensorrt) during network validation, not from onnxruntime, so we're still trying to figure out what it means.
@jywu-msft Thank you very much for the update! Please update us when you have any news on this, it's really important for us to figure it out...
I got the same error on
(https://github.com/tensorflow/models/blob/v1.13.0/research/object_detection/models/ssd_mobilenet_v2_feature_extractor.py)
onnx-runtime-trt was built from master.
Looking forward to any update on that.
Furthermore, I get a lot of warnings on startup; this was not the case with version 1.2:
e.g.:
2020-04-22 12:12:28.078811862 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__618'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078820662 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Preprocessor/mul/x:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078831417 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'ConvBnFusion_BN_B_BoxPredictor_5/BoxEncodingPredictor_depthwise/BatchNorm/beta/read/_72__cf__72:0_139'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.078841055 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'ConvBnFusion_W_const_fold_opt__947_148'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081366735 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
2020-04-22 12:12:28.081385383 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-22 12:12:28.081399211 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:28 WARNING] Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.
2020-04-22 12:12:28.081573111 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_fold_opt__971'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081585333 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/sub_5/x:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081593714 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_fold_opt__928'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081602793 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'largest_int_val__809'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081610546 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const__737'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081619974 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__785'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081630101 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/Select_1/e:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081639948 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'Postprocessor/BatchMultiClassNonMaxSuppression/PadOrClipBoxList/zeros_6/_423__cf__423:0'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081646234 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'const_slice__697'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:28.081656081 [W:onnxruntime:, graph.cc:2422 CleanUnusedInitializers] Removing initializer 'FeatureExtractor/MobilenetV2/expanded_conv_2/depthwise/Relu6_min__79'. It is not used by any node and should be removed from the model.
2020-04-22 12:12:47.796120281 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:47 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.192992803 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.277763158 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:48.695730598 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:48 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:49.084085549 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:49 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.605779798 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.668077020 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:50.774508204 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:50 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:51.121557838 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 WARNING] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
2020-04-22 12:12:51.122854830 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 WARNING] Tensor DataType is determined at build time for tensors not marked as input or output.
2020-04-22 12:12:51.122938498 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
2020-04-22 12:12:51.122960637 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Network validation failed.
This required some fixes from nvidia for onnx-tensorrt project.
You built onnxruntime + tensorrt EP from source, right?
can you update the reference to onnx-tensorrt submodule?
i.e. git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt
and rebuild?
it should have the fix for
2020-04-22 12:12:51.122938498 [W:onnxruntime:Default, tensorrt_execution_provider.h:36 log] [2020-04-22 12:12:51 ERROR] Layer: (Unnamed Layer* 24)[Select]'s output can not be used as shape tensor.
note we have not had time to test the model end to end, but wanted to give you an update asap.
@jywu-msft I followed your instructions and rebuilt onnxruntime and tried to run the model, but I get the same error. Could you please test it and confirm that it runs for you?
I won't have time to test this right now as we're busy with a release.
after
git submodule update --remote /path/to/onnxruntime/cmake/external/onnx-tensorrt
when you rebuild, you cannot use the --update option to build.sh
leave it out of the build.sh invocation.
otherwise, it will reset the onnx-tensorrt submodule to the previous state.
that is probably why you see the same error.
I tested your model ssd_mobilenet_v2_fpn_coco_v19.26032020_tensorrt.onnx and it went through (I don't have a data set, so I can't verify the accuracy, though). Like we mentioned before, go to cmake/external/ and run: git submodule update --remote onnx-tensorrt (which will get you the latest parser fixes), then compile ORT with the flag --skip_submodule_sync
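For anyone repeating these steps, the sequence can be sketched as a small dry-run script. It only prints the two commands and does not execute them. The checkout path and the extra build flags (`--config Release --use_tensorrt`) are placeholders for illustration, not the exact invocation used in this thread, and `rebuild_commands` is a hypothetical helper name.

```python
import shlex


def rebuild_commands(ort_root, extra_build_args=()):
    """Return (working_dir, argv) pairs for the rebuild described above.

    ort_root is the path to the onnxruntime checkout (placeholder).
    extra_build_args are whatever build.sh options you normally pass.
    """
    # Step 1: pull the latest onnx-tensorrt parser into the submodule.
    update_parser = ["git", "submodule", "update", "--remote", "onnx-tensorrt"]
    # Step 2: rebuild without letting the build script re-sync submodules,
    # which would reset onnx-tensorrt to the pinned commit.
    build = ["./build.sh", "--skip_submodule_sync", *extra_build_args]
    return [
        (f"{ort_root}/cmake/external", update_parser),
        (ort_root, build),
    ]


# Dry run: print the commands instead of executing them.
for cwd, argv in rebuild_commands(
    "/path/to/onnxruntime", ["--config", "Release", "--use_tensorrt"]
):
    print(f"(cd {shlex.quote(cwd)} && {' '.join(shlex.quote(a) for a in argv)})")
```

Printing first makes it easy to check that `--skip_submodule_sync` is present before starting a long build.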
Thank you very much. After rebuilding with --skip_submodule_sync, we now get past the creation of the ORT session without error, but it now throws:
2020-04-23 08:37:28.802814564 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what(): /media/ivan/storage/Development/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:107 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /media/ivan/storage/Development/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:101 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 77: an illegal memory access was encountered ; GPU=0 ; hostname=hi-tech ; expr=cudaEventDestroy(read_event_);.
What could cause an illegal memory access? I'm feeding a single image in tensor form with dimensions: (1,720,1280,3)
Thank you @jywu-msft and @stevenlix for looking into it. I rebuilt from source with the changes you suggested, but with a similar outcome as @qraleq. The model was loaded successfully, but I encountered an illegal memory access as well.
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what(): /code/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /code/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=c091e53a040c ; expr=cudaEventDestroy(read_event_);
build based on the docker image in the current master:
root@f84e58608ad4:/usr/local# dpkg -l | grep nvinfer
ii libnvinfer-bin 7.0.0-1+cuda10.2 amd64 TensorRT binaries
ii libnvinfer-dev 7.0.0-1+cuda10.2 amd64 TensorRT development libraries and headers
ii libnvinfer-doc 7.0.0-1+cuda10.2 all TensorRT documentation
ii libnvinfer-plugin-dev 7.0.0-1+cuda10.2 amd64 TensorRT plugin libraries
ii libnvinfer-plugin7 7.0.0-1+cuda10.2 amd64 TensorRT plugin libraries
ii libnvinfer-samples 7.0.0-1+cuda10.2 all TensorRT samples
ii libnvinfer7 7.0.0-1+cuda10.2 amd64 TensorRT runtime libraries
ii python3-libnvinfer 7.0.0-1+cuda10.2 amd64 Python 3 bindings for TensorRT
ii python3-libnvinfer-dev 7.0.0-1+cuda10.2 amd64 Python 3 development package for TensorRT
build process executed with commands from Dockerfile.tensorrt adjusted with your notes
if the core dump is of any use I can provide it to you
Probably the build log is good enough to look at. Or can you share your model?
@stevenlix, I sent you an email with the model, samples, and a run script. The model itself is nothing too fancy, a MobileNet SSD on grayscale, but the samples shouldn't be public.
Here is the content, except for the download link:
It would be great if we could switch to the onnxruntime-trt execution provider with your assistance. As a fallback, the CUDA provider would be OK until this is fixed, but there some asymmetric padding forces some convolutions to run on the CPU execution provider. If you have any idea how to fix that without changing the model and retraining, that would be very helpful. (By changing the ONNX graph so that the asymmetric paddings become symmetric, the model of course has an invalid output, but has a speedup of 4.)
Hi @stevenlix, @jywu-msft,
do you have any update with the TRT conversion?
TLDR;
Long Version:
tested on a 2080-TI
Runtime | Batch size 1 | Batch size 4 | Note
------- | ------------ | ------------ | ----
Tensorflow-GPU (CUDA) | 7ms | 14ms |
onnxruntime CUDA (asym padding) | 27ms | 96ms |
onnxruntime CUDA (no asym padding) | 6ms | 13ms | padding in the graph changed to be symmetric; gives an invalid output, shown only as a speed comparison for when this is fixed
onnxruntime CUDA (added pre-padding nodes) | 11ms | 44ms | padding nodes added; warnings are gone and the output is correct, but it is a lot slower, especially for batches
I like your clean, well-designed, and simple API and hope that we can use onnxruntime in production, but for that I would need an outlook on when either asym padding with CUDA or the TRT runtime will work. The problem is that if I have to evaluate it negatively, as the numbers currently suggest, this runtime and Windows will be off the table for a long time, so I would appreciate your help on this a lot!
sorry for the delay. we're in the midst of a release so have been very busy.
will sync with @stevenlix to see if we can find some spare cycles to take a closer look.
the asymmetric padding/fall back to CPU with onnxruntime CUDA has been a longstanding problem.
my understanding was that it was a limitation with cuDNN. It would be good if we can address with TRT (or take a look at how Tensorflow-GPU is handling it and do something similar)
Regarding TRT, I ran your model and saw the same issue you posted. Thanks for raising it. I will dig further into it as soon as I can.
Thanks for checking it guys, beer is on me if you come to Austria ;)!
Maybe this workaround is better than a prepadding operator.
https://github.com/microsoft/MMdnn/issues/153
Expand the padding and slice afterwards. If it performs well, this could also be done as a rewrite in the CUDA runtime, I guess.
I implemented a converter for the 'overpad and slice' idea from above. Works a lot faster and gets rid of the asym padding problem:
Times:
Batchsize 1: 6ms (speedup of 4.5)
Batchsize 4: 17ms (speedup of 5.6)
still losing on bigger batches compared to TF, but the results are reasonable.
This method could be implemented in the CUDA executor on loading, instead of the warning and the CPU switch.
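The converter itself is not posted in this thread, so the following is only a minimal sketch of the 'overpad and slice' arithmetic for one spatial axis, assuming zero padding and dilation 1. The function names are hypothetical. The idea: pick a symmetric pad that is at least as large as both original pads and exceeds the begin pad by a multiple of the stride, so the window positions stay aligned, then slice the extra outputs away. A pure-Python 1-D convolution is included only to check the equivalence.

```python
def overpad_and_slice(n, k, stride, pad_begin, pad_end):
    """Return (sym_pad, out_start, out_len) for one spatial axis.

    n: input length, k: kernel size (dilation 1 assumed).
    sym_pad is applied on both sides; the original output is then
    overpadded_output[out_start : out_start + out_len].
    """
    # Grow from pad_begin in steps of `stride` so windows stay aligned.
    sym_pad = pad_begin
    while sym_pad < pad_end:
        sym_pad += stride
    out_start = (sym_pad - pad_begin) // stride
    out_len = (n + pad_begin + pad_end - k) // stride + 1
    return sym_pad, out_start, out_len


def conv1d(x, w, stride, pad_begin, pad_end):
    """Reference 1-D cross-correlation with zero padding."""
    padded = [0.0] * pad_begin + list(x) + [0.0] * pad_end
    k = len(w)
    return [
        sum(padded[i + j] * w[j] for j in range(k))
        for i in range(0, len(padded) - k + 1, stride)
    ]


# TF 'SAME' padding for n=8, k=3, stride=2 gives pads (0, 1).
x = [float(v) for v in range(1, 9)]
w = [1.0, 2.0, 3.0]
sym_pad, start, length = overpad_and_slice(len(x), len(w), 2, 0, 1)

asymmetric = conv1d(x, w, 2, 0, 1)
overpadded = conv1d(x, w, 2, sym_pad, sym_pad)
assert overpadded[start:start + length] == asymmetric
```

The overpadded convolution computes a few extra outputs per axis, which is the price paid for keeping the convolution on the GPU.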
Hi, guys! Were you able to get a closer look at this issue?
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.
I retested this issue. After changing the input from uint8 to int8 and running symbolic_shape_infer, it ran with TensorRT.
Speedup with a Nvidia T4 is ~2.0
Speedup with a Nvidia Quadro 6000 is ~1.3, which is quite strange, but I will investigate that in more detail.
When activating FP16 on the SSD model I get the following error (but I will open another issue for that; TRT in general is working)
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build engine for fused node: TensorrtExecutionProvider_TRTKernel_graph_tf2onnx_1_1