Currently, there are two reasons why `PermuteLayer` performs somewhat poorly:
- `PermuteLayer` does not use multi-threading.
- `arm_compute::ICLTensor::map()`: our runtime always calls `arm_compute::ICLTensor::map()` with the blocking flag set to true.

This task optimizes the `PermuteLayer` of controlflow by addressing both reasons:
- Make `PermuteLayer` use multi-threading through the thread pool of `ruy::Context`.
- Use `clEnqueueWriteBuffer()` or `clEnqueueReadBuffer()` instead of `arm_compute::ICLTensor::map()` when the output or input of `PermuteLayer` is a tensor of the `acl_cl` backend.

Draft PR : #4395
/cc @Samsung/nnfw_committers @periannath
I have one question. Can we benefit from vectorization when we copy memory?
@hyunsik-yoon
> I have one question. Can we benefit from vectorization when we copy memory?

I think it's a good point, but I don't plan to use it yet, for the following reason.
We could use vectorization (SIMD), but I'm not sure it would always improve performance. Copying with SIMD still requires explicit load and store instructions, and the result depends on how large and how fast the caches of the target device are. In other words, copying memory with SIMD could turn out slower than `memcpy()`, contrary to our expectations.
- Make our runtime call `arm_compute::ICLTensor::map()` with the blocking flag chosen per case.
I tried replacing `map()` with `clEnqueueMapBuffer()` for write-only access to CLTensors, but it failed because there is no way to get the memory buffer without mapping. The function always maps the data into the buffer, whichever of the available flags I tried. Since the mapping cannot be avoided, mismatched results cannot be avoided either.
So I will try `clEnqueueWriteBuffer()` instead of `clEnqueueMapBuffer()`.
- Make `PermuteLayer` use multi-threading through the thread pool of `ruy::Context`.
Depending on conditions such as the device or the model, multi-threading can either improve or worsen performance, so the number of threads has to be chosen carefully.
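
To illustrate why the thread count matters, here is a minimal sketch of how a row-wise copy might be divided among workers. It is not the code from the PR: `split_rows` and `copy_rows` are hypothetical helpers, and handing each range to the `ruy::Context` thread pool is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

using RowRange = std::pair<std::size_t, std::size_t>; // [begin, end)

// Hypothetical helper: splits `rows` into at most `num_threads` contiguous ranges.
// With few rows, fewer ranges than threads are produced, so extra threads would
// only add scheduling overhead.
std::vector<RowRange> split_rows(std::size_t rows, std::size_t num_threads)
{
  std::vector<RowRange> ranges;
  if (rows == 0)
    return ranges;
  num_threads = std::max<std::size_t>(1, std::min(num_threads, rows));
  const std::size_t chunk = (rows + num_threads - 1) / num_threads; // ceiling division
  for (std::size_t begin = 0; begin < rows; begin += chunk)
    ranges.emplace_back(begin, std::min(begin + chunk, rows));
  return ranges;
}

// Copies one range of densely packed rows. In a threaded version, each worker
// would run this on its own range; the ranges never overlap.
void copy_rows(const unsigned char *src, unsigned char *dst, std::size_t row_bytes,
               const RowRange &range)
{
  assert(range.first <= range.second);
  std::memcpy(dst + range.first * row_bytes, src + range.first * row_bytes,
              (range.second - range.first) * row_bytes);
}
```

For example, `split_rows(10, 4)` yields four ranges, while `split_rows(3, 8)` yields only three, one per row.
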
- Make our runtime call `clEnqueueWriteBuffer()` or `clEnqueueReadBuffer()` instead of `arm_compute::ICLTensor::map()` when the output or input of `PermuteLayer` is a tensor of the `acl_cl` backend.
This improves performance for models that create multiple `PermuteLayer`s, such as models with many inputs.
It also improves performance when a `PermuteLayer` is located in the middle of a model.
- Cache the offsets of tensors in `PermuteLayer`.
This improves performance for models that have large inputs and outputs with padding.
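
As a rough sketch of what caching offsets can mean for padded tensors (the actual PR code is not shown here, so `PaddedLayout`, `compute_row_offsets` and `CachedCopier` are hypothetical names), the row offsets are computed once at construction and reused on every run:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of a 2-D buffer whose rows are separated by padding.
struct PaddedLayout
{
  std::size_t rows;
  std::size_t row_bytes;    // payload bytes per row
  std::size_t stride_bytes; // distance between row starts, >= row_bytes
  std::size_t first_offset; // byte offset of the first payload row
};

// Computes the byte offset of every payload row.
std::vector<std::size_t> compute_row_offsets(const PaddedLayout &layout)
{
  assert(layout.stride_bytes >= layout.row_bytes);
  std::vector<std::size_t> offsets(layout.rows);
  for (std::size_t r = 0; r < layout.rows; ++r)
    offsets[r] = layout.first_offset + r * layout.stride_bytes;
  return offsets;
}

// Computes offsets once at construction; run() only performs the copies.
class CachedCopier
{
public:
  CachedCopier(const PaddedLayout &src, const PaddedLayout &dst)
    : _row_bytes(src.row_bytes), _src_offsets(compute_row_offsets(src)),
      _dst_offsets(compute_row_offsets(dst))
  {
    assert(src.rows == dst.rows && src.row_bytes == dst.row_bytes);
  }

  void run(const unsigned char *src, unsigned char *dst) const
  {
    for (std::size_t r = 0; r < _src_offsets.size(); ++r)
      std::memcpy(dst + _dst_offsets[r], src + _src_offsets[r], _row_bytes);
  }

private:
  std::size_t _row_bytes;
  std::vector<std::size_t> _src_offsets;
  std::vector<std::size_t> _dst_offsets;
};
```

The saving grows with the number of rows and the number of executions, which fits the observation that large padded inputs and outputs benefit most.
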
`acl_cl` backend, armv7l build (GEOMEAN of EXECUTE):

| model | before (d1886fab6a3bcbcef7a7cdd8a547c538b1574506) | after (#4395) | thread count | improvement rate, (before - after) / before * 100 |
|:--:|:--:|:--:|:--:|:--:|
| d1 | 21.417 ms | 19.612 ms | 1 | 8.4 % |
| d2 | 7.333 ms | 6.636 ms | 1 | 9.5 % |
| d3 | 19.513 ms | 17.487 ms | 1 | 10.4 % |
| e1 | 28.082 ms | 27.465 ms | 1 | 2.2 % |
| e2 | 41.323 ms | 37.822 ms | 1 | 8.5 % |
| mobilenet | 73.013 ms | 71.547 ms | 1 | 2.0 % |
| inception | 551.272 ms | 550.894 ms | 1 | 0.07 % |
| p1 | 56.565 ms | 55.283 ms | 1 | 2.3 % |
| p2 | 7.210 ms | 6.713 ms | 1 | 6.9 % |
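
The last column is the relative reduction in execution time. A minimal sketch of that arithmetic follows; the function name `improvement_percent` is made up for illustration.

```cpp
#include <cassert>

// Relative improvement in percent: (before - after) / before * 100.
// A negative result means the "after" run was slower.
double improvement_percent(double before_ms, double after_ms)
{
  assert(before_ms > 0.0);
  return (before_ms - after_ms) / before_ms * 100.0;
}

// Example with the d1 row above:
//   improvement_percent(21.417, 19.612) = 1.805 / 21.417 * 100, about 8.4
```
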

Before (d1886fab6a3bcbcef7a7cdd8a547c538b1574506), armv7l:

    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage d1/ -w 10 -r 1000
    Package Filename d1/
    ===================================
    MODEL_LOAD takes 22.722 ms
    PREPARE takes 1346.501 ms
    EXECUTE takes 21.419 ms
    - MEAN : 21.419 ms
    - MAX : 22.681 ms
    - MIN : 20.760 ms
    - GEOMEAN : 21.417 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage d2/ -w 10 -r 1000
    Package Filename d2/
    ===================================
    MODEL_LOAD takes 66.378 ms
    PREPARE takes 1295.190 ms
    EXECUTE takes 7.334 ms
    - MEAN : 7.334 ms
    - MAX : 7.760 ms
    - MIN : 6.935 ms
    - GEOMEAN : 7.333 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage d3/ -w 10 -r 1000
    Package Filename d3/
    ===================================
    MODEL_LOAD takes 792.003 ms
    PREPARE takes 2330.235 ms
    EXECUTE takes 19.516 ms
    - MEAN : 19.516 ms
    - MAX : 20.427 ms
    - MIN : 18.736 ms
    - GEOMEAN : 19.513 ms
    ==================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage e1/ -w 10 -r 1000
    Package Filename e1/
    ===================================
    MODEL_LOAD takes 220.728 ms
    PREPARE takes 1736.628 ms
    EXECUTE takes 28.084 ms
    - MEAN : 28.084 ms
    - MAX : 30.076 ms
    - MIN : 27.246 ms
    - GEOMEAN : 28.082 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage e2/ -w 10 -r 1000
    Package Filename e2/
    ===================================
    MODEL_LOAD takes 1598.363 ms
    PREPARE takes 2167.587 ms
    EXECUTE takes 41.329 ms
    - MEAN : 41.329 ms
    - MAX : 48.997 ms
    - MIN : 40.170 ms
    - GEOMEAN : 41.323 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage mobilenet_v2_1.0_224 -w 10 -r 1000
    Package Filename mobilenet_v2_1.0_224
    ===================================
    MODEL_LOAD takes 42.793 ms
    PREPARE takes 5378.029 ms
    EXECUTE takes 73.013 ms
    - MEAN : 73.013 ms
    - MAX : 81.730 ms
    - MIN : 72.660 ms
    - GEOMEAN : 73.013 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage inceptionv3_slim_2016 -w 10 -r 1000
    Package Filename inceptionv3_slim_2016
    ===================================
    MODEL_LOAD takes 256.788 ms
    PREPARE takes 12407.476 ms
    EXECUTE takes 551.273 ms
    - MEAN : 551.273 ms
    - MAX : 553.389 ms
    - MIN : 549.208 ms
    - GEOMEAN : 551.272 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage p1 -w 10 -r 1000
    Package Filename p1
    ===================================
    MODEL_LOAD takes 11.074 ms
    PREPARE takes 5193.269 ms
    EXECUTE takes 56.566 ms
    - MEAN : 56.566 ms
    - MAX : 58.516 ms
    - MIN : 56.057 ms
    - GEOMEAN : 56.565 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage p2 -w 10 -r 1000
    Package Filename p2
    ===================================
    MODEL_LOAD takes 5.239 ms
    PREPARE takes 1711.113 ms
    EXECUTE takes 7.211 ms
    - MEAN : 7.211 ms
    - MAX : 8.856 ms
    - MIN : 6.879 ms
    - GEOMEAN : 7.210 ms
    ===================================


After (#4395), armv7l:

    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage d1/ -w 10 -r 1000
    Package Filename d1/
    ===================================
    MODEL_LOAD takes 22.913 ms
    PREPARE takes 1334.244 ms
    EXECUTE takes 19.613 ms
    - MEAN : 19.613 ms
    - MAX : 20.682 ms
    - MIN : 19.278 ms
    - GEOMEAN : 19.612 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage d2/ -w 10 -r 1000
    Package Filename d2/
    ===================================
    MODEL_LOAD takes 7.850 ms
    PREPARE takes 1298.734 ms
    EXECUTE takes 6.637 ms
    - MEAN : 6.637 ms
    - MAX : 7.344 ms
    - MIN : 6.437 ms
    - GEOMEAN : 6.636 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage d3/ -w 10 -r 1000
    Package Filename d3/
    ===================================
    MODEL_LOAD takes 30.664 ms
    PREPARE takes 2324.531 ms
    EXECUTE takes 17.488 ms
    - MEAN : 17.488 ms
    - MAX : 19.223 ms
    - MIN : 17.019 ms
    - GEOMEAN : 17.487 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage e1/ -w 10 -r 1000
    Package Filename e1/
    ===================================
    MODEL_LOAD takes 14.761 ms
    PREPARE takes 1743.793 ms
    EXECUTE takes 27.471 ms
    - MEAN : 27.471 ms
    - MAX : 31.900 ms
    - MIN : 26.469 ms
    - GEOMEAN : 27.465 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage e2/ -w 10 -r 1000
    Package Filename e2/
    ===================================
    MODEL_LOAD takes 56.136 ms
    PREPARE takes 2170.272 ms
    EXECUTE takes 37.826 ms
    - MEAN : 37.826 ms
    - MAX : 40.546 ms
    - MIN : 36.727 ms
    - GEOMEAN : 37.822 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage mobilenet_v2_1.0_224 -w 10 -r 1000
    Package Filename mobilenet_v2_1.0_224
    ===================================
    MODEL_LOAD takes 47.294 ms
    PREPARE takes 5404.024 ms
    EXECUTE takes 71.547 ms
    - MEAN : 71.547 ms
    - MAX : 73.341 ms
    - MIN : 71.255 ms
    - GEOMEAN : 71.547 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage inceptionv3_slim_2016 -w 10 -r 1000
    Package Filename inceptionv3_slim_2016
    ===================================
    MODEL_LOAD takes 263.214 ms
    PREPARE takes 12379.828 ms
    EXECUTE takes 550.894 ms
    - MEAN : 550.894 ms
    - MAX : 551.998 ms
    - MIN : 548.923 ms
    - GEOMEAN : 550.894 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage p1 -w 10 -r 1000
    Package Filename p1
    ===================================
    MODEL_LOAD takes 8.615 ms
    PREPARE takes 5191.016 ms
    EXECUTE takes 55.285 ms
    - MEAN : 55.285 ms
    - MAX : 68.132 ms
    - MIN : 54.630 ms
    - GEOMEAN : 55.283 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage p2 -w 10 -r 1000
    Package Filename p2
    ===================================
    MODEL_LOAD takes 8.182 ms
    PREPARE takes 1713.505 ms
    EXECUTE takes 6.714 ms
    - MEAN : 6.714 ms
    - MAX : 7.572 ms
    - MIN : 6.421 ms
    - GEOMEAN : 6.713 ms
    ===================================

`acl_cl` backend, aarch64 build (GEOMEAN of EXECUTE):

| model | before (d1886fab6a3bcbcef7a7cdd8a547c538b1574506) | after (#4395) | thread count | improvement rate, (before - after) / before * 100 |
|:--:|:--:|:--:|:--:|:--:|
| d1 | 8.157 ms | 6.929 ms | 1 | 15.1 % |
| d2 | 4.614 ms | 3.154 ms | 2 | 31.6 % |
| d3 | 10.186 ms | 8.664 ms | 1 | 15.0 % |
| e1 | 12.780 ms | 12.245 ms | 2 | 4.2 % |
| e2 | 19.102 ms | 18.791 ms | 1 | 1.6 % |
| mobilenet | 40.779 ms | 40.727 ms | 1 | 0.13 % |
| inception | 365.441 ms | 366.077 ms | 1 | -0.2 % |
| p1 | 41.660 ms | 36.384 ms | 1 | 12.7 % |
| p2 | 6.633 ms | 5.863 ms | 1 | 11.6 % |

Before (d1886fab6a3bcbcef7a7cdd8a547c538b1574506), aarch64:

    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage d1/ -w 10 -r 1000
    Package Filename d1/
    ===================================
    MODEL_LOAD takes 11.712 ms
    PREPARE takes 2270.909 ms
    EXECUTE takes 8.191 ms
    - MEAN : 8.191 ms
    - MAX : 21.204 ms
    - MIN : 7.235 ms
    - GEOMEAN : 8.157 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage d2/ -w 10 -r 1000
    Package Filename d2/
    ===================================
    MODEL_LOAD takes 2.745 ms
    PREPARE takes 2438.955 ms
    EXECUTE takes 4.625 ms
    - MEAN : 4.625 ms
    - MAX : 7.464 ms
    - MIN : 3.201 ms
    - GEOMEAN : 4.614 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage d3/ -w 10 -r 1000
    Package Filename d3/
    ===================================
    MODEL_LOAD takes 11.728 ms
    PREPARE takes 4084.234 ms
    EXECUTE takes 10.203 ms
    - MEAN : 10.203 ms
    - MAX : 19.850 ms
    - MIN : 8.762 ms
    - GEOMEAN : 10.186 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage e1/ -w 10 -r 1000
    Package Filename e1/
    ===================================
    MODEL_LOAD takes 6.707 ms
    PREPARE takes 1983.457 ms
    EXECUTE takes 12.803 ms
    - MEAN : 12.803 ms
    - MAX : 24.841 ms
    - MIN : 11.472 ms
    - GEOMEAN : 12.780 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage e2/ -w 10 -r 1000
    Package Filename e2/
    ===================================
    MODEL_LOAD takes 20.181 ms
    PREPARE takes 2355.564 ms
    EXECUTE takes 19.121 ms
    - MEAN : 19.121 ms
    - MAX : 29.254 ms
    - MIN : 18.055 ms
    - GEOMEAN : 19.102 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage mobilenet_v2_1.0_224 -w 10 -r 1000
    Package Filename mobilenet_v2_1.0_224
    ===================================
    MODEL_LOAD takes 17.108 ms
    PREPARE takes 4341.270 ms
    EXECUTE takes 40.813 ms
    - MEAN : 40.813 ms
    - MAX : 58.547 ms
    - MIN : 39.492 ms
    - GEOMEAN : 40.779 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage inceptionv3_slim_2016 -w 10 -r 1000
    Package Filename inceptionv3_slim_2016
    ===================================
    MODEL_LOAD takes 76.632 ms
    PREPARE takes 11707.544 ms
    EXECUTE takes 365.472 ms
    - MEAN : 365.472 ms
    - MAX : 384.435 ms
    - MIN : 359.136 ms
    - GEOMEAN : 365.441 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage p1 -w 10 -r 1000
    Package Filename p1
    ===================================
    MODEL_LOAD takes 2.802 ms
    PREPARE takes 5363.464 ms
    EXECUTE takes 41.919 ms
    - MEAN : 41.919 ms
    - MAX : 63.400 ms
    - MIN : 36.434 ms
    - GEOMEAN : 41.660 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage p2 -w 10 -r 1000
    Package Filename p2
    ===================================
    MODEL_LOAD takes 1.851 ms
    PREPARE takes 2091.100 ms
    EXECUTE takes 6.669 ms
    - MEAN : 6.669 ms
    - MAX : 22.189 ms
    - MIN : 5.628 ms
    - GEOMEAN : 6.633 ms
    ===================================


After (#4395), aarch64:

    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage d1/ -w 10 -r 1000
    Package Filename d1/
    ===================================
    MODEL_LOAD takes 6.152 ms
    PREPARE takes 2261.286 ms
    EXECUTE takes 6.966 ms
    - MEAN : 6.966 ms
    - MAX : 19.381 ms
    - MIN : 5.963 ms
    - GEOMEAN : 6.929 ms
    ===================================
    $ RUY_THREADS=2 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage d2/ -w 10 -r 1000
    Package Filename d2/
    ===================================
    MODEL_LOAD takes 1.450 ms
    PREPARE takes 2438.586 ms
    EXECUTE takes 3.182 ms
    - MEAN : 3.182 ms
    - MAX : 5.737 ms
    - MIN : 2.640 ms
    - GEOMEAN : 3.154 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage d3/ -w 10 -r 1000
    Package Filename d3/
    ===================================
    MODEL_LOAD takes 7.580 ms
    PREPARE takes 4088.470 ms
    EXECUTE takes 8.676 ms
    - MEAN : 8.676 ms
    - MAX : 12.090 ms
    - MIN : 6.934 ms
    - GEOMEAN : 8.664 ms
    ===================================
    $ RUY_THREADS=2 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage e1/ -w 10 -r 1000
    Package Filename e1/
    ===================================
    MODEL_LOAD takes 3.468 ms
    PREPARE takes 1972.334 ms
    EXECUTE takes 12.249 ms
    - MEAN : 12.249 ms
    - MAX : 15.098 ms
    - MIN : 11.463 ms
    - GEOMEAN : 12.245 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage e2/ -w 10 -r 1000
    Package Filename e2/
    ===================================
    MODEL_LOAD takes 18.158 ms
    PREPARE takes 2331.893 ms
    EXECUTE takes 18.803 ms
    - MEAN : 18.803 ms
    - MAX : 31.326 ms
    - MIN : 17.971 ms
    - GEOMEAN : 18.791 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage mobilenet_v2_1.0_224 -w 10 -r 1000
    Package Filename mobilenet_v2_1.0_224
    ===================================
    MODEL_LOAD takes 12.159 ms
    PREPARE takes 4346.867 ms
    EXECUTE takes 40.760 ms
    - MEAN : 40.760 ms
    - MAX : 61.887 ms
    - MIN : 39.568 ms
    - GEOMEAN : 40.727 ms
    ===================================
    $ RUY_THREADS=1 BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage inceptionv3_slim_2016 -w 10 -r 1000
    Package Filename inceptionv3_slim_2016
    ===================================
    MODEL_LOAD takes 70.042 ms
    PREPARE takes 11668.927 ms
    EXECUTE takes 366.110 ms
    - MEAN : 366.110 ms
    - MAX : 386.276 ms
    - MIN : 359.115 ms
    - GEOMEAN : 366.077 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage p1 -w 10 -r 1000
    Package Filename p1
    ===================================
    MODEL_LOAD takes 2.814 ms
    PREPARE takes 5346.621 ms
    EXECUTE takes 36.441 ms
    - MEAN : 36.441 ms
    - MAX : 55.867 ms
    - MIN : 34.255 ms
    - GEOMEAN : 36.384 ms
    ===================================
    $ BACKENDS="acl_cl" ./Product/aarch64-linux.release/out/bin/nnpackage_run --nnpackage ../p2 -w 10 -r 1000
    Package Filename ../p2
    ===================================
    MODEL_LOAD takes 3.450 ms
    PREPARE takes 2094.042 ms
    EXECUTE takes 5.880 ms
    - MEAN : 5.880 ms
    - MAX : 7.869 ms
    - MIN : 4.340 ms
    - GEOMEAN : 5.863 ms
    ===================================

Done.