One: [onert] Apply latest ARM-software/ComputeLibrary(v21.02).

Created on 3 Mar 2021 · 30 Comments · Source: Samsung/ONE

We have been trying to apply ARM-software/ComputeLibrary v20.11.

There were some changes in v20.08.

ARM released Compute Library v21.02 last week,
so I want to try syncing our ARM CL library with v21.02.

DRAFT PR : https://github.com/Samsung/ONE/pull/6241

  • [x] CL.

    • [x] [ FAILED ] GeneratedTests.logical_and_broadcast_nnfw
    • [x] [ FAILED ] GeneratedTests.pad_BHWC_nnfw
    • [x] [ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()
    • [x] [ FAILED ] GeneratedTests.fully_connected_dynamic_nnfw
    • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
    • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
    • [x] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs --> failed with wrong result.
  • [ ] NEON.

    • [x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs_relaxed
    • [x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs_relaxed
    • [x] [ FAILED ] GeneratedTests.softmax_float_1_relaxed
    • [x] [ FAILED ] GeneratedTests.softmax_float_2_relaxed
    • [x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs
    • [x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs
    • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
    • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
    • [x] [ FAILED ] GeneratedTests.softmax_float_1
    • [x] [ FAILED ] GeneratedTests.softmax_float_2
    • [x] [ FAILED ] GeneratedTests.softmax_quant8_1
    • [x] [ FAILED ] GeneratedTests.softmax_quant8_2
    • [x] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs
    • [x] [ FAILED ] GeneratedTests.resize_bilinear_2
    • [x] [ FAILED ] GeneratedTests.resize_bilinear
    • [x] [ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw
    • [x] [ FAILED ] GeneratedTests.OneOp_Cast_*
    • [x] [ FAILED ] GenModelTest/AveragePool2DVariation - Needs follow-up as a known issue.
  • [x] NDK build - Temporarily, the PR will build from the ARMCOMPUTE source code.

    • [x] Need to update CI to use ARMCOMPUTE library v21.02 (it currently uses v20.05).
  • [x] Tizen gbs build: This requires 1) an SR for Tizen's armcl update and 2) pausing the gbs build (internal CI).

All 30 comments

In v21.02.
Removed

  • CLMemsetKernel
  • NEFlattenLayerKernel

Changed

  • The configuration parameters of CLScale and NEScale have been reduced.

I am sharing two logs from running inception_V3 with ARMCL (v20.05 and v21.02).
ARMCL v21.02 adds some log output, but the results seem to be the same.

You can check the details below.

v21.02.log
v20.05.log

@Hyunjun85 How did you handle this issue when running the IV3 model? (Maybe it is not resolved yet, right?)

@chunseoklee I copied the several files that were needed from v20.05 into the ArmcomputeEx folder,
and replaced NEFlattenLayerKernel with NEFlattenLayer.

Could you please share a link for this draft ?

544.857 ms -> 650.495 ms

Is there a regression on the IV3 model, or is it just a measurement issue?

ARMCOMPUTE/src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp was added in v21.02.
The CLGEMM class gained logic that selects the GEMM kernel type through CLGEMMAutoHeuristics, and I guess this code can affect performance.

AFAIK, an IV3 model has either 1 FC layer or 0 FC layers. Could you please compare again with the non-FC model (https://tfhub.dev/tensorflow/lite-model/inception_v3/1/default/1)?

Also, CLGEMMAutoHeuristics is called by CLGEMMConvolutionLayer at configuration time.
Stack trace:

#0  0xb5406d38 in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool)@plt ()
   from /home/Product_V21/out/bin/../lib/../lib/../lib/../lib/libarm_compute.so
#1  0xb550c290 in arm_compute::(anonymous namespace)::auto_select_gemm_kernel (query=..., 
    reshape_b_only_on_first_run=true) at src/runtime/CL/functions/CLGEMM.cpp:122
#2  0xb5510f8c in arm_compute::CLGEMM::validate (a=0xbeffd304, b=0xbeffd4ac, c=0x6a8060, 
    output=0x6b5cc8, alpha=1, beta=1, gemm_info=...) at src/runtime/CL/functions/CLGEMM.cpp:741
#3  0xb5516116 in arm_compute::CLGEMMConvolutionLayer::validate_mm (input=0xbeffd304, 
    weights=0xbeffd4ac, biases=0x6a8060, output=0x6b5cc8, gemmlowp_output_stage=..., 
    gemm_3d_depth=149, skip_im2col=false, act_info=...)
    at src/runtime/CL/functions/CLGEMMConvolutionLayer.cpp:202

Breakpoint info and commands:

    breakpoint already hit 475 times
3       breakpoint     keep y   0xb55dba7c in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool) 
                                           at src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp:48
    breakpoint already hit 474 times
        python import time
        python starttime=time.time()
        continue
4       breakpoint     keep y   0xb55dbe58 in arm_compute::cl_gemm::auto_heuristics::select_mlgo_gemm_kernel(arm_compute::cl_gemm::auto_heuristics::CommonQuery const&, bool) 
                                           at src/runtime/CL/gemm_auto_heuristics/CLGEMMAutoHeuristics.cpp:64
    breakpoint already hit 475 times
        python count += 1
        python print (count, (time.time()-starttime)*1000) # unit : ms.
        continue
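
The gdb commands above time each call by stamping two breakpoints inside select_mlgo_gemm_kernel. Breakpoints add their own overhead, so if rebuilding the library is an option, a similar measurement could be taken in-process. The following is only a minimal sketch: CallStats and ScopedTimer are hypothetical helpers, not part of ARM Compute Library or ONE.

```cpp
#include <chrono>
#include <cstdint>

// Accumulates wall-clock time and call count across many invocations.
struct CallStats
{
  std::uint64_t calls = 0;
  std::chrono::steady_clock::duration total{};

  // Total accumulated time in milliseconds.
  double total_ms() const
  {
    return std::chrono::duration<double, std::milli>(total).count();
  }
};

// RAII timer: starts on construction and adds the elapsed time to
// `stats` on destruction (hypothetical helper, not an ACL class).
class ScopedTimer
{
public:
  explicit ScopedTimer(CallStats &stats)
    : _stats(stats), _start(std::chrono::steady_clock::now())
  {
  }
  ~ScopedTimer()
  {
    _stats.total += std::chrono::steady_clock::now() - _start;
    ++_stats.calls;
  }
  ScopedTimer(const ScopedTimer &) = delete;
  ScopedTimer &operator=(const ScopedTimer &) = delete;

private:
  CallStats &_stats;
  std::chrono::steady_clock::time_point _start;
};
```

Placing a `ScopedTimer` at the top of the function under test and printing `calls` and `total_ms()` at exit would give the same two numbers the gdb script prints, without stopping the process on every call.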

But we still need to find out why the execution time increased.

The following test cases failed, and they seem to be affected by changes in ARM Compute Library v21.02.

  • [x] [ FAILED ] GeneratedTests.logical_and_broadcast_nnfw
  • [x] [ FAILED ] GeneratedTests.pad_BHWC_nnfw
  • [x] [ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()
  • [x] [ FAILED ] GeneratedTests.fully_connected_dynamic_nnfw
  • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
  • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
  • [x] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs --> failed with wrong result.

@Hyunjun85 Please read ONE/tests/nnapi/nnapi_gtest.skip.armv7l-linux.acl_cl.

GeneratedTests.fully_connected_dynamic_nnfw
GeneratedTests.fully_connected_float_2_weights_as_inputs

Those tests are not run on CI.

Thanks. I checked that file.

[ FAILED ] GeneratedTests.pad_BHWC_nnfw --> failed in cl::Command::enqueueReadBuffer()
[ FAILED ] GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer()

Both test cases fail in the call to cl::Command::enqueueReadBuffer(), which is used to read the operation's result back from the OpenCL buffer.

Side effects on the NEON side:

  • [x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs_relaxed
  • [x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs_relaxed
  • [x] [ FAILED ] GeneratedTests.softmax_float_1_relaxed
  • [x] [ FAILED ] GeneratedTests.softmax_float_2_relaxed
  • [x] [ FAILED ] GeneratedTests.fully_connected_float_large_weights_as_inputs
  • [x] [ FAILED ] GeneratedTests.fully_connected_float_weights_as_inputs
  • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_1_nnfw
  • [x] [ FAILED ] GeneratedTests.fully_connected_hybrid_2_nnfw
  • [x] [ FAILED ] GeneratedTests.softmax_float_1
  • [x] [ FAILED ] GeneratedTests.softmax_float_2
  • [x] [ FAILED ] GeneratedTests.softmax_quant8_1
  • [x] [ FAILED ] GeneratedTests.softmax_quant8_2
  • [ ] [ FAILED ] GeneratedTests.fully_connected_float_2_weights_as_inputs
  • [x] [ FAILED ] GeneratedTests.resize_bilinear_2
  • [x] [ FAILED ] GeneratedTests.resize_bilinear
  • [x] [ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw

@Hyunjun85 Is the GeneratedTests.pad_BHW_nnfw --> failed in cl::Command::enqueueReadBuffer() issue resolved?

Yes. It seems to be caused by changes to the internal pad implementation in v21.02, so I added parts of the v20.05 pad implementation to armcomputeex.

@Hyunjun85 GeneratedTests.softmax_float_1 fails for me with both acl_cl and acl_neon using commit 51aac11ff1b240110e.

This test case passes now. I updated the draft PR to share the latest code.

For the fully connected operation,
the window passed to the kernel is different. That looks like the cause.

in v21.02

151 /home/one/ws/one-odroid/compute/ARMComputeEx/src/core/NEON/kernels/NEGEMMMatrixAccumulateBiasesKernel.cpp
(gdb) p window
$1 = (const arm_compute::Window &) @0xb218d88c: {static DimX = 0, static DimY = 1, static DimZ = 2, static DimW = 3, 
  _dims = {
    _M_elems = {
        {_start = 0, _end = 16, _step = 16}, 
        {_start = 2, _end =  3, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}}}, 
        _is_broadcasted = {_M_elems = {false, false, false, false, false, false}}}

in v20.05

124 in src/core/NEON/kernels/NEGEMMMatrixAccumulateBiasesKernel.cpp
(gdb) p window
$1 = (const arm_compute::Window &) @0xb40a7cc4: {static DimX = 0, static DimY = 1, static DimZ = 2, 
  _dims = {
    _M_elems = {
        {_start = 0, _end = 16, _step = 16}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}, 
        {_start = 0, _end =  1, _step =  1}}},
        _is_broadcasted = {_M_elems = {false, false, false, false, false, false}}}
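
Comparing the two dumps, only the DimY entry differs: v21.02 passes {_start = 2, _end = 3, _step = 1} where v20.05 passes {_start = 0, _end = 1, _step = 1}. As a rough illustration of what that means for a kernel, the sketch below enumerates the coordinates visited along one dimension. It assumes a half-open range [start, end) advanced by step; `Dim` and `visited` are simplified, hypothetical stand-ins, not the real arm_compute::Window API.

```cpp
#include <vector>

// Simplified stand-in for one dimension of a window (not the real ACL class).
struct Dim
{
  int start;
  int end;
  int step;
};

// Returns the coordinates a kernel would visit along this dimension,
// assuming a half-open range [start, end) advanced by `step`.
std::vector<int> visited(const Dim &d)
{
  std::vector<int> out;
  for (int v = d.start; v < d.end; v += d.step)
  {
    out.push_back(v);
  }
  return out;
}
```

Under that assumption, the v20.05 window visits row 0 while the v21.02 window visits row 2, so a kernel written with the expectation that Y always starts at 0 would read or write a different row.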

@Hyunjun85 I found a patch that fixes the softmax failure (I still need to look into why):

➜  git diff
diff --git a/runtime/onert/backend/acl_neon/KernelGenerator.cc b/runtime/onert/backend/acl_neon/KernelGenerator.cc
index 8d597b3eb..2477818d6 100644
--- a/runtime/onert/backend/acl_neon/KernelGenerator.cc
+++ b/runtime/onert/backend/acl_neon/KernelGenerator.cc
@@ -991,7 +991,7 @@ void KernelGenerator::visit(const ir::operation::Softmax &node)
   // NOTE NESoftmaxLayer's default axis is -1
   auto fn = acl_common::generateLayer<arm_compute::NESoftmaxLayer>(
     _tensor_builder->acl_tensor_manager()->internal_buffer_manager(), input_tensor->handle(),
-    output_tensor->handle(), beta, 1);
+    output_tensor->handle(), beta);

   // Revert disabling applied dim_correction
   if (input_tensor->getShape().dim(0) == 1)

Thanks, I will test more.

I found several facts about the softmax definition and implementation:

  • TFLite has no parameter (or input) for axis, while TF and ComputeLibrary do.
  • ONE does not define an axis in the softmax operation (like TFLite).
  • TOCO (tflite_converter) inserts a TRANSPOSE both before and after the softmax layer during conversion if the axis is not -1.

Thus, ONE (for now) implicitly assumes that the softmax layer takes -1 as its axis value.
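
To make "axis -1" concrete, here is a minimal reference sketch of softmax over the innermost axis of a row-major buffer. It is not the ACL or ONE implementation; the function name and the 2-D [rows x cols] layout are assumptions made for illustration, and `beta` plays the same scaling role as the `beta` argument in the patch above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over the innermost (last) axis of a row-major [rows x cols] buffer.
// Requires cols > 0 and in.size() == rows * cols.
std::vector<float> softmax_last_axis(const std::vector<float> &in, std::size_t rows,
                                     std::size_t cols, float beta)
{
  std::vector<float> out(in.size());
  for (std::size_t r = 0; r < rows; ++r)
  {
    const float *src = in.data() + r * cols;
    float *dst = out.data() + r * cols;

    // Subtract the row maximum for numerical stability.
    const float max_v = *std::max_element(src, src + cols);

    float sum = 0.f;
    for (std::size_t c = 0; c < cols; ++c)
    {
      dst[c] = std::exp(beta * (src[c] - max_v));
      sum += dst[c];
    }
    for (std::size_t c = 0; c < cols; ++c)
    {
      dst[c] /= sum;
    }
  }
  return out;
}
```

Each row is normalized independently, so a softmax over any other axis can be expressed by transposing that axis to the innermost position, applying this routine, and transposing back, which matches the TRANSPOSE pair that TOCO inserts.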

Related PR : https://github.com/Samsung/ONE/pull/6328

@Hyunjun85 The softmax issue confused me a lot (see https://github.com/Samsung/ONE/pull/6328#issuecomment-806389967). I need more time to investigate it.

Default argument

It looks like ACL's internal softmax implementation changed, so the default axis may have changed as well.

disableDimCorrection() call

  • On 20.05, the validate function in acl-neon validates the axis, but acl-cl does not. So, fortunately, the acl-cl backend passed without disableDimCorrection().
  • Bug: On the acl-neon backend, after disableDimCorrection() and configure, we should call enableDimCorrection(), but we called disableDimCorrection() again. Fortunately, it still passed.
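
One way to make the disable/enable pairing hard to get wrong is a scope guard that restores the previous state automatically. The sketch below only illustrates the idea: `FakeTensor` and `DimCorrectionGuard` are hypothetical names, and the boolean flag is a stand-in for whatever acl_common::disableDimCorrection() and enableDimCorrection() actually toggle.

```cpp
// Hypothetical stand-in for a tensor carrying a dim-correction flag
// (not the real ACL or onert tensor type).
struct FakeTensor
{
  bool dim_correction = true;
};

// Scope guard: disables dim correction on construction and restores the
// previous state on destruction, so the "revert" step cannot be forgotten
// or replaced by a second disable call.
class DimCorrectionGuard
{
public:
  explicit DimCorrectionGuard(FakeTensor &tensor)
    : _tensor(tensor), _previous(tensor.dim_correction)
  {
    _tensor.dim_correction = false;
  }
  ~DimCorrectionGuard() { _tensor.dim_correction = _previous; }
  DimCorrectionGuard(const DimCorrectionGuard &) = delete;
  DimCorrectionGuard &operator=(const DimCorrectionGuard &) = delete;

private:
  FakeTensor &_tensor;
  bool _previous;
};
```

With a guard like this, the layer would be configured inside the guard's scope and the flag would return to its earlier value as soon as the scope ends.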

About GeneratedTests.fully_connected_float_2_weights_as_inputs:
The FullyConnected operation for NEON does not pick up changed weights when it is called multiple times.
The weights are applied on the first run, but after that run they never change. Perhaps it is a side effect of the updated matrix multiply operations.

This is only what I observed.
The weights are fixed when NEFullyConnectedLayerEx::config is called, and it is called only once.
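
The observed behaviour is what one would expect if the layer copies or reshapes its weights once and reuses that copy afterwards. The toy class below reproduces the symptom under that assumption; it is a guess at the mechanism for illustration only, and `CachingFullyConnected` is a hypothetical name unrelated to the real NEFullyConnectedLayerEx code.

```cpp
#include <cstddef>
#include <vector>

// Toy single-output fully connected layer that snapshots its weights on
// the first run, mimicking a one-off "prepare" step (hypothetical model).
class CachingFullyConnected
{
public:
  // Remember where the weights live; the snapshot is taken lazily.
  void configure(const std::vector<float> *weights)
  {
    _weights = weights;
    _prepared = false;
  }

  // Dot product of `input` with the cached weights.
  float run(const std::vector<float> &input)
  {
    if (!_prepared)
    {
      _cached = *_weights; // taken once; later weight updates are ignored
      _prepared = true;
    }
    float acc = 0.f;
    for (std::size_t i = 0; i < input.size() && i < _cached.size(); ++i)
    {
      acc += input[i] * _cached[i];
    }
    return acc;
  }

private:
  const std::vector<float> *_weights = nullptr;
  std::vector<float> _cached;
  bool _prepared = false;
};
```

In this model, changing the weight buffer between runs has no effect until `configure` is called again, which is the same pattern a weights-as-inputs test would trip over.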

Test                                        v20.05 CL   v20.05 NEON   v21.02 CL   v21.02 NEON
fully_connected_float_2_weights_as_inputs   Skip        Pass          Skip        Fail

The patch for the softmax issue looks like:

modified   runtime/onert/backend/acl_neon/KernelGenerator.cc
@@ -980,24 +980,12 @@ void KernelGenerator::visit(const ir::operation::Softmax &node)
   auto output_tensor = _tensor_reg->getAclTensor(output_index);
   auto input_tensor = _tensor_reg->getAclTensor(input_index);

-  // Disable applied dim_correction
-  if (static_cast<size_t>(input_tensor->getShape().rank()) !=
-      input_tensor->info()->num_dimensions())
-  {
-    // This means that high dimension's value is 1 and input tensor is applied dim_correction
-    acl_common::disableDimCorrection(input_tensor);
-  }

   // NOTE NESoftmaxLayer's default axis is -1
   auto fn = acl_common::generateLayer<arm_compute::NESoftmaxLayer>(
     _tensor_builder->acl_tensor_manager()->internal_buffer_manager(), input_tensor->handle(),
-    output_tensor->handle(), beta, 1);
+    output_tensor->handle(), beta);

-  // Revert disabling applied dim_correction
-  if (input_tensor->getShape().dim(0) == 1)
-  {
-    acl_common::disableDimCorrection(input_tensor);
-  }

   _return_fn = asAclFunction(std::move(fn));
 }


I will apply this patch to the PR.

The 3 items below are resolved.

[ FAILED ] GeneratedTests.resize_bilinear_2
[ FAILED ] GeneratedTests.resize_bilinear
[ FAILED ] GeneratedTests.resize_bilinear_quant8_nnfw

Cause:
the resize_bilinear op does not support padding, but ScaleKernelInfo's use_padding defaults to true.
So I changed the initialization code to set it to false.

About the fully_connected_float_2_weights_as_inputs test: we had better add it to the skip list since, for now, it does not seem easy to support this in the FC kernel (especially the arm_gemm::GemmInterleaved kernel).

Test case failures for OneOp_Cast_*:

The Cast op inherited from INESimpleFunction, which has a border handler as a member variable. That member was changed to a unique pointer, and a NULL check was added before running the kernel, so it always returned an error.

Now, NECast inherits from the INESimpleFunctionNoBorder class.

PR merged.
The remaining jobs for this are :
1) CI issue
2) tizen SR
