One: [luci-interpreter] Investigate performance and memory consumption on ARM Cortex M4 MCU

Created on 19 Nov 2020  ·  12 Comments  ·  Source: Samsung/ONE

Our goal: investigate the prospects of running neural networks on microcontrollers.

Current plan:

  1. Build and run luci interpreter on MCU
  2. Measure performance and memory consumption of luci interpreter
  3. Measure performance and memory consumption of tflite micro
  4. Compare our solution with TFLite micro, spot bottlenecks, and estimate how much effort we need to catch up with TFLite, if needed

Hardware we have: STM32F767, STM32F746 (this is not the final HW, just something that fits current needs)
NN to experiment with: link

landmark

Most helpful comment

At this moment, I have compiled luci interpreter for the STM32F767 microcontroller with the arm-none-eabi 9-2020-q2-update toolchain, together with the DSP library. It supports the following basic kernels:

luci::CircleAdd
luci::CircleAveragePool2D
luci::CircleConcatenation
luci::CircleConv2D
luci::CircleConst
luci::CircleDepthwiseConv2D
luci::CircleFullyConnected
luci::CircleInput
luci::CircleMaxPool2D
luci::CircleMul
luci::CircleOutput
luci::CircleReshape
luci::CircleSoftmax

I ran the simple NN from the TensorFlow Lite Micro examples, which generates a sine wave. It ran in a single Mbed OS thread with a 216 MHz system core clock, 512 kB of on-chip SRAM and 128 Mbit of SDRAM at 187 MHz on the FSMC.

  • luci: ~25us single run, on-chip SRAM
  • luci: ~70us single run, external SDRAM
> Luci Interpreter for microcontrollers
> STM32F767 SystemCoreClock 216000000
> model_no_quant.circle size: 2544
> circle::VerifyModelBuffer
> OK
> luci::Importer().importModule
> [luci] NodeFinder INPUT(0) = 0x2000f590
> [luci] NodeFinder const_node(1) -> 0x2000f708
> [luci] NodeFinder const_node(2) -> 0x2000f7d8
> [luci] NodeFinder const_node(3) -> 0x2000f8d8
> [luci] NodeFinder const_node(4) -> 0x2000f9d0
> [luci] NodeFinder const_node(5) -> 0x2000fae0
> [luci] NodeFinder const_node(6) -> 0x2000ffa0
> [luci] NodeFinder OUTPUT(9) = 0x20010518
> Name: main
> --- FixInterGraph main ---
> --- ValidateGraphProp main ---
> --- post_import_graph done ---
> OK
> Interpreter::Interpreter(const luci::Module *module)
> module->size() 1
> createTensors(main_graph);
> createExecutionSequence(main_graph);
> Finished in 29us   0.00000 0.04155
> Finished in 26us   1.00000 0.83879
> Finished in 26us   2.00000 0.91872
> Finished in 25us   3.00000 0.12998
> Finished in 26us   4.00000 -0.73403
> Finished in 26us   5.00000 -0.93702
> Finished in 26us   6.00000 -0.24592
> Finished in 25us   7.00000 0.44518

  • tflite micro: ~50us single run, on-chip SRAM
  • tflite micro: ~110us single run, external SDRAM
> TFLite micro interpreter
> STM32F767 SystemCoreClock 216000000
> read_file_to_buf /fs/model/model_no_quant.tflite 2788
> Verify OK
> TFModel: MLIR Converted. version: 3
> TFModel: metadata min_runtime_version: 11
> model_no_quant.tflite
> MODEL OK
> Finished 58us 0.00000 0.04155
> Finished 51us 1.00000 0.83879
> Finished 60us 2.00000 0.91872
> Finished 51us 3.00000 0.12998
> Finished 56us 4.00000 -0.73403
> Finished 60us 5.00000 -0.93702
> Finished 54us 6.00000 -0.24592
> Finished 54us 7.00000 0.44518

Benchmarks for other kernels are in progress, but it looks promising: it is already possible to recognize hotwords using MFCC, or gestures, in real time locally on an STM32F7 (and I think on a 100 MHz STM32F4 as well) using luci interpreter.
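
To put a number on the comparison, the per-run timings in the two logs above can be averaged. The following is a minimal sketch that parses the `Finished ...us` lines copied from those logs; the helper name `latencies_us` is made up for this illustration and is not part of luci or TFLite Micro.

```python
import re
from statistics import mean

# "Finished" lines copied verbatim from the two logs above.
LUCI_LOG = """\
Finished in 29us   0.00000 0.04155
Finished in 26us   1.00000 0.83879
Finished in 26us   2.00000 0.91872
Finished in 25us   3.00000 0.12998
Finished in 26us   4.00000 -0.73403
Finished in 26us   5.00000 -0.93702
Finished in 26us   6.00000 -0.24592
Finished in 25us   7.00000 0.44518
"""

TFLITE_LOG = """\
Finished 58us 0.00000 0.04155
Finished 51us 1.00000 0.83879
Finished 60us 2.00000 0.91872
Finished 51us 3.00000 0.12998
Finished 56us 4.00000 -0.73403
Finished 60us 5.00000 -0.93702
Finished 54us 6.00000 -0.24592
Finished 54us 7.00000 0.44518
"""

# Matches both "Finished in 29us" (luci) and "Finished 58us" (TFLite micro).
LATENCY_RE = re.compile(r"Finished(?: in)? (\d+)us")


def latencies_us(log):
    """Return the per-run latencies (in microseconds) found in a log."""
    return [int(m.group(1)) for m in LATENCY_RE.finditer(log)]


luci = latencies_us(LUCI_LOG)
tflite = latencies_us(TFLITE_LOG)

print(f"luci mean:         {mean(luci):.1f} us")
print(f"tflite micro mean: {mean(tflite):.1f} us")
print(f"ratio:             {mean(tflite) / mean(luci):.2f}x")
```

With only eight runs per interpreter this is a rough indication rather than a proper benchmark, but it agrees with the approximate figures quoted in the bullets.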

All 12 comments

@jinevening @struss @underflow101

Feel free to ask/discuss this issue, any feedback is welcome.

P.S. If you know anyone else who is interested in this task, please mention them =)

> Measure performance and memory consumption of luci interpreter
> Measure performance and memory consumption of tflite micro

Interesting project. Those measurements would be useful to start discussion about small-footprint NN runtime.

STM boards have very little memory (flash memory as well as SDRAM). Code size and model size should be reduced as much as possible. To that end, I think it would be better to specify an application (usage scenario) first, so that we can focus on specific models and operators.

> Feel free to ask/discuss this issue, any feedback is welcome.

The Arduino Nano 33 BLE Sense officially supports TensorFlow Lite, so it could be a reference for us.

Aside from the fact that I had an awful experience with TFLite on Arduino,
the Arduino Nano 33 BLE Sense has the following specification:

  • CPU: nRF52840 @64MHz (ARM Cortex-M4)
  • Flash: 1MB
  • RAM: 256KB

Binary image classification took about 18 seconds per frame; I used an ArduCAM with the example that TensorFlow provides.
Speech recognition took less time, but it still took 2 to 7.6 seconds per inference.

I think running luci-interpreter on an MCU, especially an ARM Cortex-M4, could be experimental, or innovative in some sense, but it will surely be hard work to optimize our features.

The CMSIS DSP library is really helpful: it significantly speeds up some operations and reduces memory consumption. The main problem for heavy nets (e.g. image recognition) may be the small amount of on-chip RAM, but inexpensive external SDRAM driven by the FSMC solves this problem. That opens up the possibility of using something like the STM32F7 even for image recognition, since it has HW accelerators and an L1 cache with separate buses for data and instructions.
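
External SDRAM is not free in terms of latency, though. As a rough illustration, the approximate single-run figures reported above for the sine model can be turned into a slowdown factor per interpreter. This is only arithmetic on those four numbers; the dictionary layout and the function name `sdram_slowdown` are invented for this sketch.

```python
# Approximate single-run latencies in microseconds, copied from the
# results reported above for the sine model.
RUNS_US = {
    "luci": {"sram": 25, "sdram": 70},
    "tflite micro": {"sram": 50, "sdram": 110},
}


def sdram_slowdown(timings):
    """How many times slower a run is from external SDRAM than from on-chip SRAM."""
    return timings["sdram"] / timings["sram"]


for name, timings in RUNS_US.items():
    print(f"{name}: {sdram_slowdown(timings):.1f}x slower from external SDRAM")
```

For this tiny model both interpreters slow down by a factor of two to three when running from SDRAM; whether that ratio holds for larger networks is not something these numbers can show.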

The result looks promising! 👏

I'm curious why luci-interpreter is much faster (~2x on-chip SRAM) than TFLite micro. What was the target model? Was it a general CNN composed of CONV and FC layers?

Did you use the existing kernels in luci-interpreter, or did you write new kernels using the CMSIS DSP library?

I'm also taking a look at the MCUNet paper from MIT, hoping it could help with this issue.
The paper claims that it can achieve 3x faster inference than TFLite Micro, with 87% accuracy.

> I'm curious why luci-interpreter is much faster (~2x on-chip SRAM) than TFLite micro.

Do we have an answer to this? Also, a comparison of memory usage would make it easier to judge.

@SlavikMIPT Could you share your current progress on this task and the issues you are facing at the moment?

> I'm curious why luci-interpreter is much faster (~2x on-chip SRAM) than TFLite micro.
> Do we have an answer to this? Also, a comparison of memory usage would make it easier to judge.

Generally speaking, they allocate tensors in different ways; the main difference is memory usage. I am investigating this further to give a more detailed answer, but debugging and profiling tools on microcontrollers are rather limited. It makes almost no difference which implementation of fully connected is used (optimized or reference).

> I'm curious why luci-interpreter is much faster (~2x on-chip SRAM) than TFLite micro.
> Do we have an answer to this? Also, a comparison of memory usage would make it easier to judge.

I did a memory trace via a malloc wrapper. The format is:
#m - malloc
#f - free
For malloc, the second field is the address where memory was allocated, the third is the caller (return address), and the fourth is the number of bytes allocated. For free, the last field is the address being freed.
Here are the raw traces for an 832-byte model:
Luci:

--- post_import_graph done ---
#m:0x2000e950;0x807142f-16
Interpreter::Interpreter(const luci::Module *module)
module->size() 1
#m:0x2000e210;0x804f59b-28
createTensors(main_graph);
#m:0x2000e238;0x804e451-16
#m:0x2000e258;0x804e545-68
#m:0x2000e2a8;0x8050435-300
#m:0x2000e3e0;0x804e5c3-12
#m:0x2000e3f8;0x804e34b-8
#m:0x2000e410;0x804e6c3-16
#m:0x2000e430;0x804e545-68
#m:0x2000e480;0x8050435-96
#m:0x2000e4f0;0x804e5c3-12
#m:0x2000e508;0x804e6c3-4
#m:0x2000e518;0x804e545-68
#m:0x2000e568;0x8050435-8
#m:0x2000e580;0x804e5c3-12
#m:0x2000e598;0x804e34b-20
#f:0x0;0x804e391-0x2000e3f8 // 8 bytes free
#m:0x2000e5b8;0x804e545-68
#m:0x2000e3f8;0x8050435-4
#m:0x2000e608;0x804e5c3-12
#m:0x2000e620;0x804e545-68
#m:0x2000e670;0x8050435-4
#m:0x2000e680;0x804e5c3-12
createExecutionSequence(main_graph);
#m:0x2000e698;0x804ce61-24
#m:0x2000e6c0;0x804d017-4
#f:0x0;0x804d409-0x2000e698 //24 bytes free
#m:0x2000e6d0;0x804c89d-32
#m:0x2000f1c0;0x804c8b3-512
#m:0x2000e698;0x804ca5f-20
#m:0x2000e700;0x804ca5f-20
#m:0x2000e720;0x804ca5f-20
#m:0x2000e740;0x804ca5f-20
#m:0x2000e760;0x804c7c3-4
#m:0x2000e770;0x804ca5f-20
#m:0x2000e790;0x804c7c3-8
#f:0x0;0x804c81f-0x2000e760
#m:0x2000e7a8;0x804ca5f-20
#m:0x2000e7c8;0x804c7c3-16
#f:0x0;0x804c81f-0x2000e790 //8 bytes free
#m:0x2000e7e8;0x804c7c3-32
#f:0x0;0x804c81f-0x2000e7c8  //16 bytes free
#f:0x0;0x804c9a5-0x2000f1c0 //512 bytes free
#f:0x0;0x804c9af-0x2000e6d0 //32 bytes free
#f:0x0;0x804b7e1-0x2000e698 //20 bytes free
#f:0x0;0x804c9c3-0x2000e700 //20 bytes free
#f:0x0;0x804b7e1-0x2000e720 //20 bytes free
#f:0x0;0x804b7e1-0x2000e7a8 //20 bytes free
#f:0x0;0x804c9c3-0x2000e770 //20 bytes free
#f:0x0;0x804c9c3-0x2000e740 //20 bytes free
#f:0x0;0x804e959-0x2000e6c0 //4 bytes free
node->opcode() 78
node->opcode() 8
node->opcode() 8
node->opcode() 9
_execution_sequence.push_back node->opcode() 9
#m:0x2000e818;0x80501af-52
#m:0x2000e858;0x804e0ed-4
node->opcode() 33
_execution_sequence.push_back node->opcode() 33
#m:0x2000e868;0x804fef3-44
#m:0x2000e8a0;0x804e0ed-8
#f:0x0;0x804e189-0x2000e858 //4 bytes free
node->opcode() 79
#f:0x0;0x804eb3d-0x2000e7e8 //32 bytes free
kernel->configure();
#m:0x2000e8b8;0x8054ebf-16
#m:0x2000e8d8;0x8050585-16
#m:0x2000e698;0x8050625-128
#f:0x0;0x8054edf-0x2000e3f8 //4 bytes free
#f:0x0;0x8054ee7-0x2000e8b8 //16 bytes free
kernel->configure();
#m:0x2000e8b8;0x80586f3-16
#m:0x2000e728;0x8050585-16
#m:0x2000e748;0x8050625-128
#f:0x0;0x8058713-0x2000e670 //4 bytes free
#f:0x0;0x805871b-0x2000e8b8 //16 bytes free
#m:0x2000f1c0;0x807154b-300
#f:0x0;0x8071581-0x2000f1c0 //300 bytes free
Bytes allocated: 1300

Tflite micro:

#m:0x200572c0;0x803b463-304
#m:0x20057400;0x803b477-32
#m:0x20057430;0x8036e1f-36
#m:0x20057460;0x8036e39-512
#m:0x20057670;0x802c761-28
#m:0x20057698;0x8036e6b-192
#m:0x20057768;0x8036ead-36
#m:0x20057798;0x8030baf-8192
Bytes allocated: 9332
Bytes allocated by mbed os: 5709

As we can see, TFLite micro allocates significantly more memory. The cache size of the STM32F7 is 16 KB, so I suppose the 8192-byte allocated block doesn't fit into the cache.
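
Traces in this format are easy to summarize mechanically. The sketch below is one possible parser, not the tool that produced the totals above: it assumes exactly the `#m:<address>;<caller>-<size>` and `#f:0x0;<caller>-<address>` line shapes seen in the traces, and the function name `summarize` is made up. It is run on the TFLite micro trace and, to show how a free is handled, on four lines picked from the luci trace.

```python
import re

# "#m:<allocated address>;<caller>-<size in bytes>"
MALLOC_RE = re.compile(r"^#m:(0x[0-9a-f]+);(0x[0-9a-f]+)-(\d+)")
# "#f:0x0;<caller>-<freed address>", optionally followed by a "// ..." comment
FREE_RE = re.compile(r"^#f:0x[0-9a-f]+;(0x[0-9a-f]+)-(0x[0-9a-f]+)")


def summarize(trace):
    """Return total bytes requested, peak live bytes and bytes still live."""
    live = {}  # address -> size of blocks that have not been freed yet
    total = 0
    peak = 0
    for line in trace.splitlines():
        line = line.strip()
        m = MALLOC_RE.match(line)
        if m:
            address, size = m.group(1), int(m.group(3))
            live[address] = size
            total += size
            peak = max(peak, sum(live.values()))
            continue
        f = FREE_RE.match(line)
        if f:
            # Unknown addresses are ignored, other lines are log output.
            live.pop(f.group(2), None)
    return {"total": total, "peak": peak, "live": sum(live.values())}


# The TFLite micro trace from above (mallocs only).
TFLITE_TRACE = """\
#m:0x200572c0;0x803b463-304
#m:0x20057400;0x803b477-32
#m:0x20057430;0x8036e1f-36
#m:0x20057460;0x8036e39-512
#m:0x20057670;0x802c761-28
#m:0x20057698;0x8036e6b-192
#m:0x20057768;0x8036ead-36
#m:0x20057798;0x8030baf-8192
"""

# Four lines picked from the luci trace: address 0x2000e3f8 is freed and reused.
LUCI_EXCERPT = """\
#m:0x2000e3f8;0x804e34b-8
#m:0x2000e598;0x804e34b-20
#f:0x0;0x804e391-0x2000e3f8 // 8 bytes free
#m:0x2000e3f8;0x8050435-4
"""

print(summarize(TFLITE_TRACE))
print(summarize(LUCI_EXCERPT))
```

Summing the eight TFLite micro allocations reproduces the 9332 bytes reported in the trace. Tracking live blocks by address, as done here, also gives the peak usage, which is the more relevant number when allocations and frees are interleaved as in the luci trace.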
