We have been trying to apply ARM-software/ComputeLibrary v20.11.
There are some changes in v20.08.
ARM CL also released v21.02 last week.
So I want to sync our ARM CL library with v21.02.
DRAFT PR : https://github.com/Samsung/ONE/pull/6241
[x] CL.
[ ] NEON.
[x] NDK build - Temporarily, the PR will build from the ARMCOMPUTE source code.
[x] Tizen gbs build: This requires 1) an SR for Tizen's armcl update and 2) pausing the gbs build (internal CI).
I am sharing two logs from running inception_V3 with armcl (v20.05 and v21.02).
ARMCL v21.02 adds some log lines, but the results seem to be the same.
You can check the details below.
In v21.02.
Removed
CLMemsetKernel
NEFlattenLayerKernel
Changed
The configuration parameters of CLScale and NEScale have been reduced.
@Hyunjun85 How did you handle this issue when running the IV3 model? (Maybe it is not resolved yet, right?)
@chunseoklee I copied the several files that are needed from v20.05 into the ArmcomputeEx folder.
And NEFlattenLayerKernel is replaced with NEFlattenLayer.
Could you please share a link to this draft?
544.857 ms -> 650.495 ms
Is there a regression on the IV3 model, or is it just a measurement issue?
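For reference, the size of the gap can be worked out from the two figures quoted above. This is a minimal sketch: the helper name is hypothetical, and only the two latencies come from this thread.

```python
def slowdown_percent(before_ms: float, after_ms: float) -> float:
    """Return how much slower `after_ms` is than `before_ms`, in percent."""
    return (after_ms - before_ms) / before_ms * 100.0


# Latencies quoted in this thread for the two runs being compared.
before_ms = 544.857
after_ms = 650.495

delta_ms = after_ms - before_ms
print(f"delta = {delta_ms:.3f} ms, slowdown = {slowdown_percent(before_ms, after_ms):.1f} %")
```

If repeated runs of each build vary by much less than this gap, a pure measurement issue becomes less likely.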
ARMCOMPUTE/src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp was added in v21.02.
The CLGEMM class gained logic to select the GEMM kernel type through CLGEMMAutoHeuristics, and I guess this code can affect performance.
AFAIK, the IV3 model has at most one FC layer (either 1 or 0). Could you please compare with the non-FC model (https://tfhub.dev/tensorflow/lite-model/inception_v3/1/default/1) again?
Also, CLGEMMAutoHeuristics is called by CLGEMMConvolutionLayer at configuration time.
Stack trace info:
#0 0xb5406d38 in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool)@plt ()
from /home/Product_V21/out/bin/../lib/../lib/../lib/../lib/libarm_compute.so
#1 0xb550c290 in arm_compute::(anonymous namespace)::auto_select_gemm_kernel (query=...,
reshape_b_only_on_first_run=true) at src/runtime/CL/functions/CLGEMM.cpp:122
#2 0xb5510f8c in arm_compute::CLGEMM::validate (a=0xbeffd304, b=0xbeffd4ac, c=0x6a8060,
output=0x6b5cc8, alpha=1, beta=1, gemm_info=...) at src/runtime/CL/functions/CLGEMM.cpp:741
#3 0xb5516116 in arm_compute::CLGEMMConvolutionLayer::validate_mm (input=0xbeffd304,
weights=0xbeffd4ac, biases=0x6a8060, output=0x6b5cc8, gemmlowp_output_stage=...,
gemm_3d_depth=149, skip_im2col=false, act_info=...)
at src/runtime/CL/functions/CLGEMMConvolutionLayer.cpp:202
Breakpoint info and commands:
breakpoint already hit 475 times
3 breakpoint keep y 0xb55dba7c in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool)
at src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp:48
breakpoint already hit 474 times
python import time
python starttime=time.time()
continue
4 breakpoint keep y 0xb55dbe58 in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool)
at src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp:64
breakpoint already hit 475 times
python count += 1
python print (count, (time.time()-starttime)*1000) # unit : ms.
continue
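The gdb commands above record a timestamp at one breakpoint and print the elapsed time at a second one. The same start/stop pattern can be sketched outside gdb as below; the class name and the workload are hypothetical stand-ins, not part of ONE or ACL. Keep in mind that timings taken under gdb also include the cost of stopping at each breakpoint.

```python
import time


class IntervalTimer:
    """Records the elapsed time of each start()/stop() pair."""

    def __init__(self) -> None:
        self.count = 0
        self.samples_ms = []
        self._start = None

    def start(self) -> None:
        # Plays the role of the breakpoint at the entry of the measured region.
        self._start = time.perf_counter()

    def stop(self) -> float:
        # Plays the role of the breakpoint at the exit of the measured region.
        if self._start is None:
            raise RuntimeError("stop() called before start()")
        elapsed_ms = (time.perf_counter() - self._start) * 1000.0
        self._start = None
        self.count += 1
        self.samples_ms.append(elapsed_ms)
        return elapsed_ms

    def total_ms(self) -> float:
        return sum(self.samples_ms)


timer = IntervalTimer()
for _ in range(3):
    timer.start()
    sum(range(10_000))  # stand-in for the measured call
    timer.stop()
print(timer.count, f"{timer.total_ms():.3f} ms")
```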
But we need to find out why the execution time increased.
6 test cases failed, and they seem to be affected by the changes in ARM Compute Library v21.02.
@Hyunjun85 Please read ONE/tests/nnapi/nnapi_gtest.skip.armv7l-linux.acl_cl.
GeneratedTests.fully_connected_dynamic_nnfw
GeneratedTests.fully_connected_float_2_weights_as_inputs
Those tests are not tested on CI.
Thanks. I checked that file.
[ FAILED ] GeneratedTests.pad_BHWC_nnfw --> failed in cl::Command::enqueueReadBuffer()
[ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()
Both test cases fail when cl::Command::enqueueReadBuffer() is called, which is used to read the operation's result from the OpenCL buffer.
NEON's side effect.
@Hyunjun85 Is the GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer() issue resolved?
Yes. It seems to be caused by changes to the internal pad implementation in v21.02, so I added part of the v20.05 pad implementation to armcomputeex.
@Hyunjun85 GeneratedTests.softmax_float_1 fails for me with both acl_cl and acl_neon at commit 51aac11ff1b240110e.
This test case passes. I updated the draft PR to share the latest code.
For the fully connected operation, the window information used for the calculation is different. That looks like a possible cause.
in v21.02
151 /home/one/ws/one-odroid/compute/ARMComputeEx/src/core/NEON/kernels/NEGEMMMatrixAccumulateBiasesKernel.cpp
(gdb) p window
$1 = (const arm_compute::Window &) @0xb218d88c: {static DimX = 0, static DimY = 1, static DimZ = 2, static DimW = 3,
_dims = {
_M_elems = {
{_start = 0, _end = 16, _step = 16},
{_start = 2, _end = 3, _step = 1},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1}}},
_is_broadcasted = {_M_elems = {false, false, false, false, false, false}}}
in v20.05
124 in src/core/NEON/kernels/NEGEMMMatrixAccumulateBiasesKernel.cpp
(gdb) p window
$1 = (const arm_compute::Window &) @0xb40a7cc4: {static DimX = 0, static DimY = 1, static DimZ = 2,
_dims = {
_M_elems = {
{_start = 0, _end = 16, _step = 16},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1},
{_start = 0, _end = 1, _step = 1}}},
_is_broadcasted = {_M_elems = {false, false, false, false, false, false}}}
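To make the difference between the two dumps easier to see, the sketch below treats each window as plain (start, end, step) ranges and lists the coordinates it visits. This is an assumption-laden illustration, not ACL's Window class, and the iteration order is not meant to mirror ACL; the numbers are copied from the gdb output above.

```python
from itertools import product

# (start, end, step) per dimension, copied from the two gdb dumps above.
WINDOW_V21_02 = [(0, 16, 16), (2, 3, 1), (0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 1)]
WINDOW_V20_05 = [(0, 16, 16), (0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 1)]


def window_coords(window):
    """List every coordinate visited by a window of (start, end, step) ranges."""
    ranges = [range(start, end, step) for start, end, step in window]
    return list(product(*ranges))


def differing_dims(a, b):
    """Return the indices of the dimensions whose ranges differ."""
    return [i for i, (da, db) in enumerate(zip(a, b)) if da != db]


print(window_coords(WINDOW_V20_05))
print(window_coords(WINDOW_V21_02))
print(differing_dims(WINDOW_V20_05, WINDOW_V21_02))
```

The only differing dimension is index 1 (DimY): the v21.02 window starts at 2 where the v20.05 window starts at 0.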
@Hyunjun85 I found a patch for the softmax failure (it needs a closer look):
➜ git diff
diff --git a/runtime/onert/backend/acl_neon/KernelGenerator.cc b/runtime/onert/backend/acl_neon/KernelGenerator.cc
index 8d597b3eb..2477818d6 100644
--- a/runtime/onert/backend/acl_neon/KernelGenerator.cc
+++ b/runtime/onert/backend/acl_neon/KernelGenerator.cc
@@ -991,7 +991,7 @@ void KernelGenerator::visit(const ir::operation::Softmax &node)
// NOTE NESoftmaxLayer's default axis is -1
auto fn = acl_common::generateLayer<arm_compute::NESoftmaxLayer>(
_tensor_builder->acl_tensor_manager()->internal_buffer_manager(), input_tensor->handle(),
- output_tensor->handle(), beta, 1);
+ output_tensor->handle(), beta);
// Revert disabling applied dim_correction
if (input_tensor->getShape().dim(0) == 1)
Thanks, I will test more.
I found several facts about softmax definition and implementation :
Thus, ONE (for now) implicitly assumes that the softmax layer takes -1 as its axis value.
Related PR : https://github.com/Samsung/ONE/pull/6328
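As a reminder of what "axis = -1" means in the usual framework convention, here is a small sketch that normalizes a negative axis and applies softmax over the last axis of a rank-2 input. The function names are hypothetical, and this does not model ACL's own dimension ordering or its axis parameter, which is exactly the part still under investigation here.

```python
import math


def normalize_axis(axis: int, rank: int) -> int:
    """Map a possibly negative axis to the range [0, rank)."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    return axis % rank


def softmax_last_axis(rows, beta=1.0):
    """Softmax over the last axis of a rank-2 nested list."""
    result = []
    for row in rows:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(beta * (v - peak)) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


print(normalize_axis(-1, 4))  # last dimension of a rank-4 tensor
print(softmax_last_axis([[1.0, 1.0], [0.0, 0.0]]))
```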
@Hyunjun85 The softmax issue confused me a lot (see https://github.com/Samsung/ONE/pull/6328#issuecomment-806389967). I need more time to investigate this.
It looks like the ACL internal implementation of softmax changed, so maybe the default axis changed.
After disableDimCorrection() and configure, we should call enableDimCorrection(), but we called disableDimCorrection() again. Fortunately, it still passed.
About GeneratedTests.fully_connected_float_2_weights_as_inputs:
_The FullyConnected operation for NEON doesn't update its weights when called multiple times._
_The weights are applied to the operation the first time, but after it runs they never change._ Perhaps it's a side effect of the updated matrix multiply operations.
This is only what I observed.
The weights are fixed when NEFullyConnectedLayerEx::config is called, and it is called only once.
| Test | v20.05 CL | v20.05 NEON | v21.02 CL | v21.02 NEON |
| --- | --- | --- | --- | --- |
| fully_connected_float_2_weights_as_inputs | Skip | Pass | Skip | Fail |
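The behaviour described above (weights fixed at configure time, later updates ignored) can be reproduced with a toy layer. This is a hypothetical sketch of the failure mode only, not the NEFullyConnectedLayerEx implementation; the class name and the copy standing in for a one-time weight reshape are assumptions.

```python
class CachingFullyConnected:
    """Toy FC layer that snapshots its weights at configure time (hypothetical)."""

    def __init__(self) -> None:
        self._cached = None

    def configure(self, weights) -> None:
        # Copy the weights once, as a stand-in for a one-time reshape.
        self._cached = [list(row) for row in weights]

    def run(self, inputs):
        # Each output is the dot product of the input with one cached weight row.
        return [sum(x * w for x, w in zip(inputs, row)) for row in self._cached]


weights = [[1.0, 2.0], [3.0, 4.0]]
layer = CachingFullyConnected()
layer.configure(weights)
first = layer.run([1.0, 1.0])

weights[0][0] = 100.0  # the caller updates the weights between runs
second = layer.run([1.0, 1.0])

print(first, second)  # both runs use the snapshot taken in configure()
```

A test that feeds new weights as inputs on every run would pass only if the layer re-read them, which this toy layer never does.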
The patch for the softmax issue looks like:
modified runtime/onert/backend/acl_neon/KernelGenerator.cc
@@ -980,24 +980,12 @@ void KernelGenerator::visit(const ir::operation::Softmax &node)
auto output_tensor = _tensor_reg->getAclTensor(output_index);
auto input_tensor = _tensor_reg->getAclTensor(input_index);
- // Disable applied dim_correction
- if (static_cast<size_t>(input_tensor->getShape().rank()) !=
- input_tensor->info()->num_dimensions())
- {
- // This means that high dimension's value is 1 and input tensor is applied dim_correction
- acl_common::disableDimCorrection(input_tensor);
- }
// NOTE NESoftmaxLayer's default axis is -1
auto fn = acl_common::generateLayer<arm_compute::NESoftmaxLayer>(
_tensor_builder->acl_tensor_manager()->internal_buffer_manager(), input_tensor->handle(),
- output_tensor->handle(), beta, 1);
+ output_tensor->handle(), beta);
- // Revert disabling applied dim_correction
- if (input_tensor->getShape().dim(0) == 1)
- {
- acl_common::disableDimCorrection(input_tensor);
- }
_return_fn = asAclFunction(std::move(fn));
}
I will apply this patch to the PR.
The 3 items below are resolved.
[ FAILED ] GeneratedTests.resize_bilinear_2
[ FAILED ] GeneratedTests.resize_bilinear
[ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw
Cause:
The resize_bilinear op does not support padding, but ScaleKernelInfo's use_padding is set to true by default.
So I changed the initialization code to set it to false.
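The fix amounts to overriding a default instead of relying on it. The sketch below uses a simplified, hypothetical stand-in for the configuration struct (it is not ACL's ScaleKernelInfo, and the `supports` check is an assumption made for illustration) to show why the default value was rejected and the explicit value is accepted.

```python
from dataclasses import dataclass


@dataclass
class ScaleInfoSketch:
    """Simplified, hypothetical stand-in for a scale configuration struct."""

    interpolation: str
    use_padding: bool = True  # default value as described in this thread


def supports(op_allows_padding: bool, info: ScaleInfoSketch) -> bool:
    """An op that cannot handle padding rejects a config that asks for it."""
    return op_allows_padding or not info.use_padding


default_info = ScaleInfoSketch(interpolation="bilinear")
explicit_info = ScaleInfoSketch(interpolation="bilinear", use_padding=False)

print(supports(False, default_info))   # rejected: default asks for padding
print(supports(False, explicit_info))  # accepted: padding explicitly disabled
```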
About the fully_connected_float_2_weights_as_inputs test: we'd better add it to the skip list since, for now, it does not seem easy to support this in the FC kernel (especially the arm_gemm::GemmInterleaved kernel).
TC failures for OneOp_Cast_*.
The Cast op inherited from INESimpleFunction, which has a border handler as a member variable. That member was changed to a unique pointer, and a NULL check was added before running the kernel, so it always returned an error.
Now NECast inherits from the INESimpleFunctionNoBorder class.
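The Cast failure and its fix can be pictured with two toy base classes. Both are hypothetical stand-ins written for illustration (they are not INESimpleFunction or INESimpleFunctionNoBorder): one refuses to run when no border handler is set, the other never needs one.

```python
class SimpleFunctionSketch:
    """Hypothetical base: run() requires both a kernel and a border handler."""

    def __init__(self, kernel, border_handler=None) -> None:
        self._kernel = kernel
        self._border_handler = border_handler

    def run(self, value):
        if self._border_handler is None:
            raise RuntimeError("border handler is not configured")
        return self._kernel(self._border_handler(value))


class SimpleFunctionNoBorderSketch:
    """Hypothetical base without a border handler: run() only needs the kernel."""

    def __init__(self, kernel) -> None:
        self._kernel = kernel

    def run(self, value):
        return self._kernel(value)


cast_to_int = int  # stand-in for a cast kernel

try:
    SimpleFunctionSketch(cast_to_int).run(3.7)
    outcome = "ok"
except RuntimeError as err:
    outcome = f"error: {err}"

print(outcome)
print(SimpleFunctionNoBorderSketch(cast_to_int).run(3.7))
```

An op such as Cast has no border to handle, so the variant without the handler fits it better.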
PR merged.
The remaining tasks are:
1) CI issue
2) Tizen SR