Is your feature request related to a problem? Please describe.
We build our software for x86, aarch64 and ppc64le, and our developers would like to use onnxruntime. Because it does not build for ppc64le, we cannot integrate it.
System information
Describe the solution you'd like
We would like to build and use onnxruntime on PPC64 archs.
Describe alternatives you've considered
Nothing yet
It's hard for us to make progress on this because our team doesn't have any ppc64le hardware that can be used for dev and testing.
It seems the new manylinux2014 docker images can help us solve this.
@snnn, we were able to build onnxruntime for ppc64le using the changes here https://github.com/cms-externals/onnxruntime/pull/4 but some of our tests failed to produce identical results. One of the onnxruntime tests also failed to run. @mrodozov, do you remember which test was failing?
Have you tried using proot and qemu to emulate powerpc? We use them to install ppc64le rpm packages on our x86_64 server.
@tracysh, could you please take a look at cms-externals#4?
Turn this on:
-Donnxruntime_BUILD_UNIT_TESTS=ON
and then the test is:
onnxruntime_mlas_test
which is a validation test, to my understanding.
We were trying to bring this implementation:
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/mlas/lib/arm/sgemmc.cpp#L22
to powerpc. The failing test is SGEMM,
with output like:
mismatch TransA=111, TransB=111, M=1, N=1, K=1, alpha=1.000000, beta=0.000000 0.000000 529.000000!
mismatch TransA=111, TransB=112, M=1, N=1, K=1, alpha=1.000000, beta=0.000000 0.000000 529.000000!
mismatch TransA=112, TransB=111, M=1, N=1, K=1, alpha=1.000000, beta=0.000000 0.000000 529.000000!
mismatch TransA=112, TransB=112, M=1, N=1, K=1, alpha=1.000000, beta=0.000000 0.000000 529.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 991.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 946.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 903.000000!
mismatch TransA=111, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 862.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 1013.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 923.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 923.000000!
mismatch TransA=111, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 841.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 970.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 926.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 926.000000!
mismatch TransA=112, TransB=111, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 884.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 991.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 903.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 946.000000!
mismatch TransA=112, TransB=112, M=2, N=2, K=2, alpha=1.000000, beta=0.000000 0.000000 862.000000!
mismatch TransA=111, TransB=111, M=3, N=3, K=3, alpha=1.000000, beta=0.000000 1194.000000 1326.000000!
the other tests:
Conv2D tests.
Pool2D tests.
Pool3D tests.
Activation tests.
are going fine (no mismatch prints at least)
That's strange, because the Conv2D tests build on the GEMM routine. MlasFgemmTest::ExecuteShort first loops over small GEMMs from 1-15 which stresses some of the partial vector stores. Do the tests after this, which are multiples of 16, work okay?
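To make the question above concrete, here is a minimal sketch of a small-size sweep: square GEMMs of size 1 to 15 are run through a kernel under test and compared element by element against a plain triple loop. This is not the MLAS test code; `SgemmFn`, `ReferenceSgemm` and `SweepSmallSizes` are hypothetical names, and the sketch assumes row-major, non-transposed inputs only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Hypothetical signature for a row-major, non-transposed SGEMM:
// C = alpha * A * B + beta * C, with A (M x K), B (K x N), C (M x N).
using SgemmFn = std::function<void(size_t M, size_t N, size_t K, float alpha,
                                   const float* A, const float* B,
                                   float beta, float* C)>;

// Straightforward triple loop used as ground truth.
void ReferenceSgemm(size_t M, size_t N, size_t K, float alpha,
                    const float* A, const float* B, float beta, float* C) {
    for (size_t m = 0; m < M; m++) {
        for (size_t n = 0; n < N; n++) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; k++) {
                sum += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = alpha * sum + beta * C[m * N + n];
        }
    }
}

// Runs square GEMMs for sizes 1..15 and returns the number of mismatching
// output elements. Sizes below 16 exercise partial-vector stores in kernels
// that work on wider blocks.
size_t SweepSmallSizes(const SgemmFn& kernel) {
    size_t mismatches = 0;
    for (size_t size = 1; size <= 15; size++) {
        std::vector<float> A(size * size), B(size * size);
        for (size_t i = 0; i < A.size(); i++) {
            // Small integer values keep every sum exactly representable.
            A[i] = static_cast<float>(i % 7) + 1.0f;
            B[i] = static_cast<float>(i % 5) + 1.0f;
        }
        std::vector<float> C(size * size, -0.5f), CRef(size * size, -0.5f);
        kernel(size, size, size, 1.0f, A.data(), B.data(), 0.0f, C.data());
        ReferenceSgemm(size, size, size, 1.0f, A.data(), B.data(), 0.0f, CRef.data());
        for (size_t i = 0; i < C.size(); i++) {
            if (C[i] != CRef[i]) {
                std::printf("mismatch M=N=K=%zu index=%zu %f %f\n",
                            size, i, C[i], CRef[i]);
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

Passing the kernel under test to `SweepSmallSizes` and checking for a zero return value separates "small sizes are broken" from "everything is broken", which is the distinction the question is after.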
this is the full unittest output
unit_test.txt
Update: I was curious about the latest state of Power ISA (I worked on Xbox 360, a PowerPC 2.02 implementation), so I updated MLAS to directly use VSX intrinsics. I verified the GEMM using gcc 7.4 to cross compile then run from qemu. I'll get my changes into a branch you can try on your end in a few days.
We will be happy to test it as soon as it is available. Many thanks for looking into this.
@tracysh, is there any update that we can test?
I'm going to need a few more days to clean this up. Just curious, which POWER versions are you using this on?
We are using power8
> lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 8
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (raw), altivec supported
CPU max MHz: 3857.0000
CPU min MHz: 2061.0000
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 64-127
NUMA node8 CPU(s): 0-63
> I'm going to need a few more days to clean this up.
I'm curious whether there has been an update on this issue.
Please let me know.
Thank you.
Apologies for the delay on this. I've put the changes into the branch tracysh/mlas_powerpc. With this, I was able to build with gcc 7.5 and run under qemu. I ran onnxruntime_mlas_test and was able to run the subset of the GEMM tests. There are more GEMM tests that I usually run for validation of big changes, but qemu was too slow to tackle that.
I was also able to run through onnxruntime_test_all (run as part of the build), but there was a MathSinFloat test that uses Eigen that was failing. I'm curious what happens on real hardware to know if this is worth investigating further.
I was also able to point onnx_test_runner at resnet50 and bertsquad from the onnx model zoo and both passed successfully.
I have no idea how performant the SGEMM might be. It may be possible to scale up the GEMM further, but I'll need some help from you to measure on real hardware. Also, I want to make a few changes to onnxruntime_mlas_test to test a few more things out.
Let me know how it goes.
Thanks @tracysh , we are testing your changes now and will let you know soon.
Hello again,
the code now builds on our powerpc machine,
which is different from the previous one:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 8
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Model: 1.0 (pvr 004c 0100)
Model name: POWER8NVL (raw), altivec supported
CPU max MHz: 4023.0000
CPU min MHz: 2061.0000
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
when I run:
SGEMM tests.
Conv2D tests.
Pool2D tests.
Pool3D tests.
Done.
SGEMM tests.
Conv2D tests.
Pool2D tests.
Pool3D tests.
Done.
Activation tests.
mismatch activation kind=3 i=2 value=bf800000 expected=7ff00002
mismatch activation kind=3 i=3 value=bf800000 expected=fff00002
mismatch activation kind=4 i=2 value=b3800000 expected=7ff00002
mismatch activation kind=4 i=3 value=b3800000 expected=fff00002
./onnxruntime_shared_lib_test
[==========] Running 21 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 16 tests from CApiTest
[ RUN ] CApiTest.dim_param
[ OK ] CApiTest.dim_param (19 ms)
[ RUN ] CApiTest.custom_op_handler
Running custom op inference
Running simple inference with default provider
[ OK ] CApiTest.custom_op_handler (16 ms)
[ RUN ] CApiTest.create_tensor
[ OK ] CApiTest.create_tensor (0 ms)
[ RUN ] CApiTest.create_tensor_with_data
[ OK ] CApiTest.create_tensor_with_data (0 ms)
[ RUN ] CApiTest.override_initializer
[ OK ] CApiTest.override_initializer (16 ms)
[ RUN ] CApiTest.end_profiling
[ OK ] CApiTest.end_profiling (31 ms)
[ RUN ] CApiTest.model_metadata
[ OK ] CApiTest.model_metadata (15 ms)
[ RUN ] CApiTest.session_options_graph_optimization_level
[ OK ] CApiTest.session_options_graph_optimization_level (0 ms)
[ RUN ] CApiTest.run_options
[ OK ] CApiTest.run_options (0 ms)
[ RUN ] CApiTest.allocation_info
[ OK ] CApiTest.allocation_info (0 ms)
[ RUN ] CApiTest.DefaultAllocator
[ OK ] CApiTest.DefaultAllocator (0 ms)
[ RUN ] CApiTest.CreateGetVectorOfMapsInt64Float
[ OK ] CApiTest.CreateGetVectorOfMapsInt64Float (0 ms)
[ RUN ] CApiTest.CreateGetVectorOfMapsStringFloat
[ OK ] CApiTest.CreateGetVectorOfMapsStringFloat (0 ms)
[ RUN ] CApiTest.CreateGetSeqTensors
[ OK ] CApiTest.CreateGetSeqTensors (0 ms)
[ RUN ] CApiTest.CreateGetSeqStringTensors
[ OK ] CApiTest.CreateGetSeqStringTensors (0 ms)
[ RUN ] CApiTest.model_from_array
[ OK ] CApiTest.model_from_array (16 ms)
[----------] 16 tests from CApiTest (113 ms total)
[----------] 5 tests from CApiTestWithProviders/CApiTestWithProvider
[ RUN ] CApiTestWithProviders/CApiTestWithProvider.simple/0
Running simple inference with default provider
[ OK ] CApiTestWithProviders/CApiTestWithProvider.simple/0 (15 ms)
[ RUN ] CApiTestWithProviders/CApiTestWithProvider.simple/1
[ OK ] CApiTestWithProviders/CApiTestWithProvider.simple/1 (0 ms)
[ RUN ] CApiTestWithProviders/CApiTestWithProvider.simple/2
[ OK ] CApiTestWithProviders/CApiTestWithProvider.simple/2 (0 ms)
[ RUN ] CApiTestWithProviders/CApiTestWithProvider.simple/3
[ OK ] CApiTestWithProviders/CApiTestWithProvider.simple/3 (0 ms)
[ RUN ] CApiTestWithProviders/CApiTestWithProvider.simple/4
Running simple inference with default provider
[ OK ] CApiTestWithProviders/CApiTestWithProvider.simple/4 (15 ms)
[----------] 5 tests from CApiTestWithProviders/CApiTestWithProvider (31 ms total)
[----------] Global test environment tear-down
[==========] 21 tests from 2 test suites ran. (144 ms total)
[ PASSED ] 21 tests.
YOU HAVE 1 DISABLED TEST
./onnxruntime_global_thread_pools_test
[==========] Running 15 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 15 tests from CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/0
Running simple inference with default provider
2020-04-09 12:10:09.920787225 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:09.923644854 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:09.923682633 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:09.926045591 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.927908098 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.928285204 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.930865047 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:09.930901035 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:09.930963711 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:09.931095141 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:09.931800325 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:09.932915858 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:09.933515767 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:09.933594683 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/0 (46 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/1
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/1 (0 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/2
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/2 (1 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/3
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/3 (0 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/4
Running simple inference with default provider
2020-04-09 12:10:09.967065780 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:09.969736207 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:09.969773430 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:09.970006107 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.971855079 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.972233963 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:09.974794212 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:09.974830396 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:09.974893681 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:09.974995939 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:09.975696547 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:09.977450182 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:09.977908551 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:09.977978610 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple/4 (39 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/0
Running simple inference with default provider
2020-04-09 12:10:10.006238624 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.008900539 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.008938569 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.009153048 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.011013542 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.011391594 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.013950101 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.013986316 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.014049687 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.014151584 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.014849628 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.016065118 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.016520071 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
Running simple inference with default provider
2020-04-09 12:10:10.019779584 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.022475948 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.022514226 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.022723799 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.024586024 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.024964934 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.027514511 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.027550402 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.027612250 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.027713176 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.028405672 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.029567600 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.030039855 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.030111410 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
2020-04-09 12:10:10.051622176 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/0 (76 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/1
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/1 (1 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/2
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/2 (0 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/3
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/3 (0 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/4
Running simple inference with default provider
2020-04-09 12:10:10.083698038 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.086331852 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.086368899 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.086582950 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.088431406 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.088810771 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.091389283 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.091425277 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.091488425 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.091611267 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.092309856 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.093717814 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.094172376 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
Running simple inference with default provider
2020-04-09 12:10:10.097515136 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.100225496 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.100262135 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.100469505 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.102329418 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.102707731 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.105260567 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.105296315 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.105358030 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.105459450 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.106151105 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.107641782 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.108095530 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.108166743 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
2020-04-09 12:10:10.134985230 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple2/4 (78 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/0
Running simple inference with default provider
2020-04-09 12:10:10.161779645 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.164416227 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.164454129 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.164667892 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.166516273 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.166895228 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.169455112 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.169490844 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.169555170 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.169671501 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.170383589 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.171827361 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.172278874 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.172346424 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
Running simple inference with default provider
2020-04-09 12:10:10.195078460 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.197688391 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.197725612 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.197937578 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.199797713 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.200216561 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.202773912 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.202809814 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.202872762 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.202974787 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.203672710 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.204626779 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.205075086 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.205141170 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/0 (67 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/1
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/1 (1 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/2
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/2 (0 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/3
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/3 (0 ms)
[ RUN ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/4
Running simple inference with default provider
2020-04-09 12:10:10.229769044 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.232439349 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.232477255 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.232690281 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.234536881 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.234915160 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.237472849 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.237509013 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.237572490 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.237674529 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.238372696 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.239761337 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.240230078 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.240297610 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
Running simple inference with default provider
2020-04-09 12:10:10.267648379 [I:onnxruntime:, inference_session.cc:208 ConstructorCommon] Using global/env threadpools since use_per_session_threads_ is false
2020-04-09 12:10:10.270301774 [I:onnxruntime:, inference_session.cc:829 Initialize] Initializing session.
2020-04-09 12:10:10.270338515 [I:onnxruntime:, inference_session.cc:847 Initialize] Adding default CPU execution provider.
2020-04-09 12:10:10.270549998 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.272409207 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.272785896 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2020-04-09 12:10:10.275340095 [V:onnxruntime:, inference_session.cc:675 TransformGraph] Node placements
2020-04-09 12:10:10.275375767 [V:onnxruntime:, inference_session.cc:677 TransformGraph] All nodes have been placed on [CPUExecutionProvider].
2020-04-09 12:10:10.275439329 [I:onnxruntime:, session_state.cc:22 SetGraph] SaveMLValueNameIndexMapping
2020-04-09 12:10:10.275562638 [I:onnxruntime:, session_state.cc:67 SetGraph] Done saving OrtValue mappings.
2020-04-09 12:10:10.276257615 [I:onnxruntime:, session_state_initializer.cc:179 SaveInitializedTensors] Saving initialized tensors.
2020-04-09 12:10:10.277270498 [I:onnxruntime:, session_state_initializer.cc:224 SaveInitializedTensors] Done saving initialized tensors
2020-04-09 12:10:10.277721931 [I:onnxruntime:, inference_session.cc:917 Initialize] Session successfully initialized.
2020-04-09 12:10:10.277787556 [I:onnxruntime:, sequential_executor.cc:67 Execute] Begin execution
[ OK ] CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider.simple3/4 (78 ms)
[----------] 15 tests from CApiTestGlobalThreadPoolsWithProviders/CApiTestGlobalThreadPoolsWithProvider (387 ms total)
[----------] Global test environment tear-down
[==========] 15 tests from 1 test suite ran. (387 ms total)
[ PASSED ] 15 tests.
And the result from
onnxruntime_test_all:
test_results_ppc_onnxruntime_test_all.txt
I pushed some new changes to clean up the GEMM kernel templating.
How is performance of the runtime? I'm curious what you see for resnet50 or other test models from the ONNX model zoo. If you download some models + test data from the zoo (https://github.com/onnx/models), you can use onnx_test_runner to verify that the models run. And you can use "onnxruntime_perf_test -e cpu -t 30 path/to/model_and_data" to get a reference time.
Once you have some timing data, can you try updating MlasSgemmKernel in SgemmKernelPower.cpp to see if doing 6 rows improves or degrades performance? GCC seemed to build this and keep everything in registers, but this isn't always faster.
if (CountM >= 6) {
RowsHandled = MlasSgemmProcessCount<6>(A, B, C, CountK, CountN, lda, ldc, AlphaBroadcast, ZeroMode);
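For readers following along, the experiment being asked for can be sketched as follows. This is only an illustration under assumptions, not the code in SgemmKernelPower.cpp: `ProcessCount`, `Kernel` and `Multiply` are hypothetical stand-ins, the kernel here is scalar rather than VSX, and the 4/2/1 fallback order is assumed. The point is that the row count is a template parameter, so adding a 6-row path is a one-line dispatch change whose output should not change.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a templated row kernel: computes RowCount rows of
// C = A * B (row-major, B is CountK x CountN) and reports the rows handled.
template <size_t RowCount>
size_t ProcessCount(const float* A, const float* B, float* C,
                    size_t CountK, size_t CountN, size_t lda, size_t ldc) {
    for (size_t n = 0; n < CountN; n++) {
        float acc[RowCount] = {};  // one accumulator per row
        for (size_t k = 0; k < CountK; k++) {
            const float b = B[k * CountN + n];
            for (size_t r = 0; r < RowCount; r++) {
                acc[r] += A[r * lda + k] * b;
            }
        }
        for (size_t r = 0; r < RowCount; r++) {
            C[r * ldc + n] = acc[r];
        }
    }
    return RowCount;
}

// Picks the widest row block that fits. The 6-row path is the experiment;
// the 4/2/1 fallbacks are an assumption for this sketch.
size_t Kernel(const float* A, const float* B, float* C, size_t CountK,
              size_t CountM, size_t CountN, size_t lda, size_t ldc,
              bool trySixRows) {
    if (trySixRows && CountM >= 6) {
        return ProcessCount<6>(A, B, C, CountK, CountN, lda, ldc);
    }
    if (CountM >= 4) {
        return ProcessCount<4>(A, B, C, CountK, CountN, lda, ldc);
    }
    if (CountM >= 2) {
        return ProcessCount<2>(A, B, C, CountK, CountN, lda, ldc);
    }
    return ProcessCount<1>(A, B, C, CountK, CountN, lda, ldc);
}

// Calls the kernel repeatedly until every row of C has been produced.
std::vector<float> Multiply(const std::vector<float>& A, const std::vector<float>& B,
                            size_t M, size_t N, size_t K, bool trySixRows) {
    std::vector<float> C(M * N, 0.0f);
    size_t row = 0;
    while (row < M) {
        row += Kernel(A.data() + row * K, B.data(), C.data() + row * N,
                      K, M - row, N, K, N, trySixRows);
    }
    return C;
}
```

Calling `Multiply` with `trySixRows` set to true and to false should give identical results, so the only thing left to compare between the two settings is the time reported by onnxruntime_perf_test.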
As for the onnxruntime_mlas_test errors, I see the same problem in the ARM64 build. The expected data is based on what is observed with x86/x64.
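The activation mismatches reported earlier in the thread print raw 32-bit patterns, which are easier to reason about once decoded. The sketch below is only a helper for reading those logs (the names `BitsToFloat` and `IsNaNPattern` are hypothetical, not MLAS functions). It shows that the "expected" patterns 7ff00002 and fff00002 are NaN encodings, while the "value" patterns bf800000 and b3800000 are the finite numbers -1.0 and -2^-24. One reading, not confirmed in the thread, is that the x86/x64 reference data carries a NaN through where the other builds return a finite result.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a raw 32-bit pattern as an IEEE 754 single-precision float.
float BitsToFloat(uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// True when the exponent field is all ones and the mantissa is nonzero,
// which is the IEEE 754 encoding of a NaN.
bool IsNaNPattern(uint32_t bits) {
    return (bits & 0x7f800000u) == 0x7f800000u && (bits & 0x007fffffu) != 0;
}
```

For example, `BitsToFloat(0xbf800000u)` is -1.0f, and `IsNaNPattern(0x7ff00002u)` is true.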
@smuzaffar Are there any additional comments on these changes (see my last comment for some questions)? Are you able to run your models successfully with these changes?
@tracysh, we are working on it in https://github.com/cms-sw/cmsdist/pull/5743 . We needed a few fixes on top of v1.2.0 to build it ( https://github.com/cms-externals/onnxruntime/commits/cms/v1.2.0_plus_ppc_update_pb31130 ). @mrodozov is working on it.
I merged all of the pending Power changes into master.
Thanks @tracysh, we have integrated this in our software and things look to be in a much better state.
Hi, @smuzaffar, just checking in: how does the performance of ONNX Runtime compare to the other runtimes you were using on Power? Do these systems have GPUs too that might benefit from using the CUDA support?
@tracysh, x86_64 is our production architecture, so when we migrated to onnxruntime we ran a performance test on x86_64. You can find the performance results here: https://github.com/cms-sw/cmssw/pull/28112 . In short, we noticed a 7x gain in modules where we have used onnxruntime.
Unfortunately we do not have the same exact comparison for Power (i.e. the exact same cmssw with and without onnxruntime). But comparing cmssw from Dec 2019 (which was without onnxruntime) with the latest nightlies, we see a much better gain (this could be due to both onnxruntime and improvements in our code):
CMSSW 2019-12-04 + without ONNXRuntime
TimeReport 0.101799 0.101799 0.101799 pfDeepFlavourJetTagsWithDeepInfo
TimeReport 0.001237 0.001237 0.001237 pfDeepFlavourTagInfosWithDeepInfo
TimeReport 0.009642 0.009642 0.009642 pfMassDecorrelatedDeepBoostedJetTagsAK8WithDeepInfo
CMSSW 2020-05-07 + ONNXRuntime
TimeReport 0.009132 0.009132 0.009132 pfDeepFlavourJetTagsWithDeepInfo
TimeReport 0.001243 0.001243 0.001243 pfDeepFlavourTagInfosWithDeepInfo
TimeReport 0.000803 0.000803 0.000803 pfMassDecorrelatedDeepBoostedJetTagsAK8WithDeepInfo
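As a quick check on the numbers above, the ratios can be computed directly. This is a small sketch with hypothetical names (`ModuleTiming`, `Speedup`, `ReportedTimings`); it only divides the TimeReport values as posted and, as noted, the two builds also differ in other code, so the ratios are not a measurement of onnxruntime alone.

```cpp
#include <cassert>
#include <map>
#include <string>

// Per-module time from the two TimeReport listings, in the units reported.
struct ModuleTiming {
    double before;  // CMSSW 2019-12-04, without ONNXRuntime
    double after;   // CMSSW 2020-05-07, with ONNXRuntime
};

// Ratio of old time to new time; values above 1 mean the newer build is faster.
double Speedup(const ModuleTiming& timing) {
    return timing.before / timing.after;
}

// The three modules and values exactly as listed in the TimeReport output.
std::map<std::string, ModuleTiming> ReportedTimings() {
    return {
        {"pfDeepFlavourJetTagsWithDeepInfo", {0.101799, 0.009132}},
        {"pfDeepFlavourTagInfosWithDeepInfo", {0.001237, 0.001243}},
        {"pfMassDecorrelatedDeepBoostedJetTagsAK8WithDeepInfo", {0.009642, 0.000803}},
    };
}
```

With these values the two JetTags modules come out at roughly 11x and 12x, while pfDeepFlavourTagInfosWithDeepInfo is essentially unchanged.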
Although our Power machines have GPUs, we are currently not building with CUDA support. Hopefully we will enable it in the near future and report back the results.
Hi @tracysh, there is a comparison between ONNX Runtime and another runtime that measures performance on x86_64; the results are available here:
https://github.com/cms-sw/cmssw/pull/28711
and in the comments the researchers conclude "it depends on the use case", IIRC.
We haven't run a performance comparison on Power yet, but we might, at least to know for ourselves, although we use x86_64 as our production arch and our Arm and Power builds are, let's call it, "a research interest".
Having ONNX Runtime for Power was needed to cover the external package requirements for the PPC build. Yes, we have GPU devices that can benefit from the CUDA support; you can read about that effort here: https://patatrack.web.cern.ch/patatrack/ . Because the direction is for any heavy computation to be executed on GPUs, we are interested in the CUDA support on any arch.