One: [onert] Apply latest ARM-software/ComputeLibrary(v21.02).

Created on 3 Mar 2021 · 30 Comments · Source: Samsung/ONE

We have been trying to apply ARM-software/ComputeLibrary v20.11.

There are some changes in v20.08.

CLGEMMMatrixAccumulateBiasesKernel and NEGEMMMatrixAccumulateBiasesKernel were removed.
But these classes are used by the CL/NE FullyConnectedLayer extensions.
ARM-software/ComputeLibrary/doc/00_introduction.dox#L440
_So we need to consider how to replace those classes._

ARM CL released v21.02 last week,
so I want to sync our ARM CL library with v21.02.

DRAFT PR : https://github.com/Samsung/ONE/pull/6241

- [x] CL.
  - [x] [ FAILED ] GeneratedTests.logical_and_broadcast_nnfw
  - [x] [ FAILED ] GeneratedTests.pad_BHWC_nnfw
  - [x] [ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()
  - [x] [ FAILED ] GeneratedTests.fully_connected_dynamic_nnfw
  - [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
  - [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
  - [x] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs --> failed with wrong result.
- [ ] NEON.
  - [x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs_relaxed
  - [x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs_relaxed
  - [x] [ FAILED ] GeneratedTests.softmax_float_1_relaxed
  - [x] [ FAILED ] GeneratedTests.softmax_float_2_relaxed
  - [x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs
  - [x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs
  - [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
  - [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
  - [x] [ FAILED ] GeneratedTests.softmax_float_1
  - [x] [ FAILED ] GeneratedTests.softmax_float_2
  - [x] [ FAILED ] GeneratedTests.softmax_quant8_1
  - [x] [ FAILED ] GeneratedTests.softmax_quant8_2
  - [x] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs
  - [x] [ FAILED ] GeneratedTests.resize_bilinear_2
  - [x] [ FAILED ] GeneratedTests.resize_bilinear
  - [x] [ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw
  - [x] [ FAILED ] GeneratedTests.OneOp_Cast_*
  - [x] [ FAILED ] GenModelTest/AveragePool2DVariation - Need to follow up as a known issue.
- [x] NDK build: temporarily, the PR will build from the ARMCOMPUTE source code.
  - [x] Need to update CI to use ARMCOMPUTE library v21.02 (currently it uses v20.05).
- [x] Tizen gbs build: this requires 1) an SR for Tizen's armcl update and 2) pausing the gbs build (internal CI).

Hyunjun85

👍4

All 30 comments

In v21.02.
Removed

CLMemsetKernel
NEFlattenLayerKernel

Changed

CLScale and NEScale's configuration parameters have been reduced.

Hyunjun85 on 4 Mar 2021

I am sharing two logs from running inception_V3 with armcl (v20.05 and v21.02).
Some log lines were added in ARMCL v21.02, but the results seem to be the same.

You can check the details below.

v21.02.log
v20.05.log

Hyunjun85 on 9 Mar 2021

> In v21.02.
> Removed
>
> CLMemsetKernel
> NEFlattenLayerKernel
> Changed
>
> CLScale and NEScale's configuration parameters have been reduced.

@Hyunjun85 How do you handle this issue when running the IV3 model? (Maybe it is not resolved yet, right?)

chunseoklee on 9 Mar 2021

@chunseoklee I copied the files that are needed from v20.05 into the ArmcomputeEx folder.
And NEFlattenLayerKernel is replaced with NEFlattenLayer.

Hyunjun85 on 9 Mar 2021

> @chunseoklee I copied the files that are needed from v20.05 into the ArmcomputeEx folder.
> And NEFlattenLayerKernel is replaced with NEFlattenLayer.

Could you please share a link to this draft?

chunseoklee on 9 Mar 2021

544.857 ms -> 650.495 ms

Is there a regression on the IV3 model, or is it just a measurement issue?

chunseoklee on 9 Mar 2021

> 544.857 ms -> 650.495 ms
>
> Is there a regression on the IV3 model, or is it just a measurement issue?

ARMCOMPUTE/src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp was added in v21.02.
The CLGEMM class gained logic that selects the GEMM kernel type through CLGEMMAutoHeuristics, and I guess this code may affect performance.

Hyunjun85 on 10 Mar 2021

AFAIK, the IV3 model has at most one FC layer (one or zero). Could you please compare again with the non-FC model (https://tfhub.dev/tensorflow/lite-model/inception_v3/1/default/1)?

chunseoklee on 10 Mar 2021

> AFAIK, the IV3 model has at most one FC layer (one or zero). Could you please compare again with the non-FC model (https://tfhub.dev/tensorflow/lite-model/inception_v3/1/default/1)?

Also, CLGEMMAutoHeuristics is called by CLGEMMConvolutionLayer at configuration time.
Stack trace info:

#0  0xb5406d38 in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool)@plt ()
   from /home/Product_V21/out/bin/../lib/../lib/../lib/../lib/libarm_compute.so
#1  0xb550c290 in arm_compute::(anonymous namespace)::auto_select_gemm_kernel (query=..., 
    reshape_b_only_on_first_run=true) at src/runtime/CL/functions/CLGEMM.cpp:122
#2  0xb5510f8c in arm_compute::CLGEMM::validate (a=0xbeffd304, b=0xbeffd4ac, c=0x6a8060, 
    output=0x6b5cc8, alpha=1, beta=1, gemm_info=...) at src/runtime/CL/functions/CLGEMM.cpp:741
#3  0xb5516116 in arm_compute::CLGEMMConvolutionLayer::validate_mm (input=0xbeffd304, 
    weights=0xbeffd4ac, biases=0x6a8060, output=0x6b5cc8, gemmlowp_output_stage=..., 
    gemm_3d_depth=149, skip_im2col=false, act_info=...)
    at src/runtime/CL/functions/CLGEMMConvolutionLayer.cpp:202

Breakpoint info and commands:

    breakpoint already hit 475 times
3       breakpoint     keep y   0xb55dba7c in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool) 
                                           at src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp:48
    breakpoint already hit 474 times
        python import time
        python starttime=time.time()
        continue
4       breakpoint     keep y   0xb55dbe58 in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool) 
                                           at src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp:64
    breakpoint already hit 475 times
        python count += 1
        python print (count, (time.time()-starttime)*1000) # unit : ms.
        continue

But we still need to find out why the execution time increased.
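
The breakpoint commands above time the span between lines 48 and 64 of select_mlgo_gemm_kernel. As a rough alternative without debugger overhead, the same kind of accumulation could be sketched with a scoped timer. `TimingStats`, `ScopedTimer`, `measured_region`, and `report` are hypothetical names used only for illustration; they are not part of ARM Compute Library or ONE.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical accumulator; not part of ARM Compute Library or ONE.
struct TimingStats
{
  long calls = 0;
  double total_ms = 0.0;
};

// Adds the lifetime of this object to the given stats (RAII).
class ScopedTimer
{
public:
  explicit ScopedTimer(TimingStats &stats)
    : _stats(stats), _start(std::chrono::steady_clock::now())
  {
  }
  ~ScopedTimer()
  {
    const auto end = std::chrono::steady_clock::now();
    _stats.total_ms += std::chrono::duration<double, std::milli>(end - _start).count();
    ++_stats.calls;
  }
  ScopedTimer(const ScopedTimer &) = delete;
  ScopedTimer &operator=(const ScopedTimer &) = delete;

private:
  TimingStats &_stats;
  std::chrono::steady_clock::time_point _start;
};

// Stand-in for the region being measured (for example a kernel-selection call).
inline int measured_region(TimingStats &stats, int x)
{
  ScopedTimer timer(stats); // timing stops when this function returns
  return x * 2;
}

// Prints the accumulated call count and total time in milliseconds.
inline void report(const TimingStats &stats)
{
  std::printf("calls=%ld total=%.3f ms\n", stats.calls, stats.total_ms);
}
```

Calling `measured_region` once per configuration step and `report` at the end would give the call count and the total time in one line, comparable to the per-hit output of the gdb commands.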

Hyunjun85 on 10 Mar 2021

6 test cases have failed, and it seems to be caused by changes in ARM Compute Library v21.02.

[x] [ FAILED ] GeneratedTests.logical_and_broadcast_nnfw
[x] [ FAILED ] GeneratedTests.pad_BHWC_nnfw
[x] [ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()
[x] [ FAILED ] GeneratedTests.fully_connected_dynamic_nnfw
[x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
[x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
[x] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs --> failed with wrong result.

Hyunjun85 on 15 Mar 2021

@Hyunjun85 Please read ONE/tests/nnapi/nnapi_gtest.skip.armv7l-linux.acl_cl.

GeneratedTests.fully_connected_dynamic_nnfw
GeneratedTests.fully_connected_float_2_weights_as_inputs

Those tests are not tested on CI.

chunseoklee on 17 Mar 2021

> @Hyunjun85 Please read ONE/tests/nnapi/nnapi_gtest.skip.armv7l-linux.acl_cl.

Thanks. I checked that file.

Hyunjun85 on 17 Mar 2021

[ FAILED ] GeneratedTests.pad_BHWC_nnfw --> failed in cl::Command::enqueueReadBuffer()
[ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()

Both test cases fail when cl::Command::enqueueReadBuffer() is called, which is used to read the result of the operation back from the OpenCL buffer.

Hyunjun85 on 18 Mar 2021

NEON's side effect.

[x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs_relaxed
[x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs_relaxed
[x] [ FAILED ] GeneratedTests.softmax_float_1_relaxed
[x] [ FAILED ] GeneratedTests.softmax_float_2_relaxed
[x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs
[x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs
[x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
[x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
[x] [ FAILED ] GeneratedTests.softmax_float_1
[x] [ FAILED ] GeneratedTests.softmax_float_2
[x] [ FAILED ] GeneratedTests.softmax_quant8_1
[x] [ FAILED ] GeneratedTests.softmax_quant8_2
[ ] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs
[x] [ FAILED ] GeneratedTests.resize_bilinear_2
[x] [ FAILED ] GeneratedTests.resize_bilinear
[x] [ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw

Hyunjun85 on 19 Mar 2021

👀1

> @Hyunjun85 Is the GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer() issue resolved?

Yes. It seems to be caused by changes to the internal implementation of pad in v21.02, so I added part of the v20.05 pad implementation to armcomputeex.

chunseoklee on 22 Mar 2021

@Hyunjun85 GeneratedTests.softmax_float_1 fails for me with both acl_cl and acl_neon at commit 51aac11ff1b240110e.

chunseoklee on 23 Mar 2021

> @Hyunjun85 GeneratedTests.softmax_float_1 fails for me with both acl_cl and acl_neon at commit 51aac11.

This test case passes. I updated the draft PR to share the latest code.

Hyunjun85 on 23 Mar 2021

~~For fully connected operation,~~
~~The window information to calculate is different. It looks like to be a cause.~~

~~in v21.02~~

151 /home/one/ws/one-odroid/compute/ARMComputeEx/src/core/NEON/kernels/NEGEMMMatrixAccumulateBiasesKernel.cpp
(gdb) p window
$1 = (const arm_compute::Window &) @0xb218d88c: {static DimX = 0, static DimY = 1, static DimZ = 2, static DimW = 3, 
  _dims = {
    _M_elems = {
        {_start = 0, _end = 16, _step = 16}, 
        {_start = 2, _end =  3, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}}}, 
        _is_broadcasted = {_M_elems = {false, false, false, false, false, false}}}

~~in v20.05~~

124 in src/core/NEON/kernels/NEGEMMMatrixAccumulateBiasesKernel.cpp
(gdb) p window
$1 = (const arm_compute::Window &) @0xb40a7cc4: {static DimX = 0, static DimY = 1, static DimZ = 2, 
  _dims = {
    _M_elems = {
        {_start = 0, _end = 16, _step = 16}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}}},
        _is_broadcasted = {_M_elems = {false, false, false, false, false, false}}}

Hyunjun85 on 24 Mar 2021

@Hyunjun85 I found a patch that fixes the softmax failure (I still need to look into why):

➜  git diff
diff --git a/runtime/onert/backend/acl_neon/KernelGenerator.cc b/runtime/onert/backend/acl_neon/KernelGenerator.cc
index 8d597b3eb..2477818d6 100644
--- a/runtime/onert/backend/acl_neon/KernelGenerator.cc
+++ b/runtime/onert/backend/acl_neon/KernelGenerator.cc
@@ -991,7 +991,7 @@ void KernelGenerator::visit(const ir::operation::Softmax &node)
   // NOTE NESoftmaxLayer's default axis is -1
   auto fn = acl_common::generateLayer<arm_compute::NESoftmaxLayer>(
     _tensor_builder->acl_tensor_manager()->internal_buffer_manager(), input_tensor->handle(),
-    output_tensor->handle(), beta, 1);
+    output_tensor->handle(), beta);

   // Revert disabling applied dim_correction
   if (input_tensor->getShape().dim(0) == 1)

chunseoklee on 24 Mar 2021

🚀1

Thanks, I will test more.

Hyunjun85 on 24 Mar 2021

I found several facts about the softmax definition and implementation:

TFLite has no parameter (or input) for axis, while TF and ComputeLibrary do.
ONE does not define axis in the softmax operation (like TFLite).
TOCO (tflite_converter) inserts TRANSPOSE both before and after the softmax layer during conversion if axis is not -1.

Thus, ONE (for now) implicitly assumes that the softmax layer takes -1 as the axis value.
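
To make the implicit axis of -1 concrete, here is a minimal sketch of softmax over the innermost axis of a row-major buffer. `softmax_last_axis` is a hypothetical helper written for illustration; it is not the ONE or ComputeLibrary implementation, and it assumes a simple [rows x cols] layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over the innermost (last) axis of a row-major [rows x cols] buffer.
// This mirrors the implicit axis = -1 described above. `beta` scales the logits.
std::vector<float> softmax_last_axis(const std::vector<float> &in, std::size_t rows,
                                     std::size_t cols, float beta)
{
  assert(cols > 0);
  assert(in.size() == rows * cols);
  std::vector<float> out(in.size());
  for (std::size_t r = 0; r < rows; ++r)
  {
    const float *src = in.data() + r * cols;
    float *dst = out.data() + r * cols;
    // Subtract the row maximum for numerical stability.
    const float max_v = *std::max_element(src, src + cols);
    float sum = 0.f;
    for (std::size_t c = 0; c < cols; ++c)
    {
      dst[c] = std::exp(beta * (src[c] - max_v));
      sum += dst[c];
    }
    // Normalize so that each row sums to 1.
    for (std::size_t c = 0; c < cols; ++c)
    {
      dst[c] /= sum;
    }
  }
  return out;
}
```

Each row is normalized independently, which is why a model that needs softmax over another axis has to be wrapped in TRANSPOSE operations by the converter.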

chunseoklee on 24 Mar 2021

@Hyunjun85 About the softmax issue: it confused me a lot (see https://github.com/Samsung/ONE/pull/6328#issuecomment-806389967). I need more time to investigate this.

chunseoklee on 26 Mar 2021

Default argument

It looks like the internal ACL implementation of softmax has changed, so the default axis may have changed as well.

disableDimCorrection() call

On 20.05, the validate function in ACL NEON checks the axis, but ACL CL does not. So, fortunately, the acl-cl backend passed without disableDimCorrection().
Bug: on the acl-neon backend, after disableDimCorrection() and configure, we should call enableDimCorrection(), but we called disableDimCorrection() again. Fortunately, it still passed.
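
One way to avoid this kind of unpaired disable/enable call is a scope guard that restores the previous state automatically. The sketch below uses a hypothetical `FakeTensor` stand-in and a hypothetical `DimCorrectionGuard`; neither exists in ONE or ACL, and the real tensor types are not used.

```cpp
#include <cassert>

// Hypothetical stand-in for a tensor whose shape may be auto-collapsed
// ("dim correction"); the real ONE/ACL tensor types are not used here.
struct FakeTensor
{
  bool dim_correction = true;
};

// Disables dim correction for the lifetime of the guard and restores the
// previous state on destruction, so disable and enable are always paired.
class DimCorrectionGuard
{
public:
  explicit DimCorrectionGuard(FakeTensor &tensor)
    : _tensor(tensor), _previous(tensor.dim_correction)
  {
    _tensor.dim_correction = false;
  }
  ~DimCorrectionGuard() { _tensor.dim_correction = _previous; }
  DimCorrectionGuard(const DimCorrectionGuard &) = delete;
  DimCorrectionGuard &operator=(const DimCorrectionGuard &) = delete;

private:
  FakeTensor &_tensor;
  bool _previous;
};

// Stand-in for "configure the layer while dim correction is off".
// Returns the state that the configure step observed.
bool configure_with_guard(FakeTensor &tensor)
{
  DimCorrectionGuard guard(tensor);
  return tensor.dim_correction; // false while the guard is alive
}
```

With a guard like this there is no second call to get wrong: the state is restored when the guard goes out of scope, including on early returns.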

hseok-oh on 29 Mar 2021

About GeneratedTests.fully_connected_float_2_weights_as_inputs:
_The FullyConnected operation for NEON does not pick up changed weights when it is called multiple times._
_The weights are applied to the operation the first time, but after the first run they never change._ Perhaps it is a side effect of the updated matrix multiply operations.

This is only what I observed.
The weights are fixed when NEFullyConnectedLayerEx::config is called, and it is called only once.

| Test | v20.05 CL | v20.05 NEON | v21.02 CL | v21.02 NEON |
| --- | --- | --- | --- | --- |
| fully_connected_float_2_weights_as_inputs | Skip | Pass | Skip | Fail |
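
The observed behaviour can be modelled with a toy layer that captures its weights once at configure time. `CachingFullyConnected` is a hypothetical class for illustration only; it is an assumption about the failure mode described above, not the actual NEFullyConnectedLayerEx or arm_gemm code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layer that snapshots its weights in configure(), loosely
// modelling a kernel that reshapes or packs weights once and then reuses them.
class CachingFullyConnected
{
public:
  // Captures a copy of the weights; later changes to the caller's buffer
  // are not seen unless configure() is called again.
  void configure(const std::vector<float> &weights) { _packed = weights; }

  // Computes a dot product using the weights captured at configure time.
  float run(const std::vector<float> &input) const
  {
    assert(input.size() == _packed.size());
    float acc = 0.f;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
      acc += input[i] * _packed[i];
    }
    return acc;
  }

private:
  std::vector<float> _packed;
};
```

In this model, a test that feeds new weights as inputs on every run only passes if the layer is reconfigured (or the packed copy is refreshed) each time, which matches the Pass to Fail change in the table above.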

Hyunjun85 on 30 Mar 2021

The patch for the softmax issue looks like:

modified   runtime/onert/backend/acl_neon/KernelGenerator.cc
@@ -980,24 +980,12 @@ void KernelGenerator::visit(const ir::operation::Softmax &node)
   auto output_tensor = _tensor_reg->getAclTensor(output_index);
   auto input_tensor = _tensor_reg->getAclTensor(input_index);

-  // Disable applied dim_correction
-  if (static_cast<size_t>(input_tensor->getShape().rank()) !=
-      input_tensor->info()->num_dimensions())
-  {
-    // This means that high dimension's value is 1 and input tensor is applied dim_correction
-    acl_common::disableDimCorrection(input_tensor);
-  }

   // NOTE NESoftmaxLayer's default axis is -1
   auto fn = acl_common::generateLayer<arm_compute::NESoftmaxLayer>(
     _tensor_builder->acl_tensor_manager()->internal_buffer_manager(), input_tensor->handle(),
-    output_tensor->handle(), beta, 1);
+    output_tensor->handle(), beta);

-  // Revert disabling applied dim_correction
-  if (input_tensor->getShape().dim(0) == 1)
-  {
-    acl_common::disableDimCorrection(input_tensor);
-  }

   _return_fn = asAclFunction(std::move(fn));
 }

chunseoklee on 30 Mar 2021

I will apply the patch above to the PR.

Hyunjun85 on 30 Mar 2021

Below 3 items are resolved.

[ FAILED ] GeneratedTests.resize_bilinear_2
[ FAILED ] GeneratedTests.resize_bilinear
[ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw

Cause:
The resize_bilinear op does not support padding, but ScaleKernelInfo's use_padding defaults to true.
So I changed the initialization code to set it to false.

Hyunjun85 on 30 Mar 2021

👍1

About the fully_connected_float_2_weights_as_inputs test: we'd better add it to the skip list since, for now, it does not seem easy to support this in the FC kernel (especially the arm_gemm::GemmInterleaved kernel).

chunseoklee on 30 Mar 2021

Test case failure for OneOp_Cast_*.

The Cast op inherited from INESimpleFunction, which has a border handler as a member variable. That variable was changed to a unique pointer, and a NULL check was added before running the kernel, so run always returned an error.

Now, NECast inherits from the INESimpleFunctionNoBorder class.
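
The failure mode described above can be sketched with two toy classes. `Kernel`, `SimpleFunction`, and `SimpleFunctionNoBorder` are hypothetical stand-ins that only model the description in this comment; they are not the ACL classes, and the exact checks in ACL are an assumption here.

```cpp
#include <cassert>
#include <memory>

// Hypothetical placeholder for a compute kernel.
struct Kernel
{
};

// Models a simple function that owns a kernel and a border handler and
// refuses to run unless both pointers are set.
class SimpleFunction
{
public:
  void set_kernel() { _kernel.reset(new Kernel()); }
  void set_border_handler() { _border_handler.reset(new Kernel()); }
  // Returns false when either pointer is null.
  bool run() const { return _kernel != nullptr && _border_handler != nullptr; }

private:
  std::unique_ptr<Kernel> _kernel;
  std::unique_ptr<Kernel> _border_handler;
};

// Models the no-border variant: only the kernel is required.
class SimpleFunctionNoBorder
{
public:
  void set_kernel() { _kernel.reset(new Kernel()); }
  bool run() const { return _kernel != nullptr; }

private:
  std::unique_ptr<Kernel> _kernel;
};
```

An op such as Cast that never sets a border handler would always fail the check in the first class, while the no-border variant has nothing extra to initialize.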

Hyunjun85 on 31 Mar 2021

PR merged.
The remaining jobs for this are :
1) CI issue
2) tizen SR

chunseoklee on 14 Apr 2021
