Our goal: investigate the prospects of running neural networks on microcontrollers.
Current plan:
- Measure performance and memory consumption of luci interpreter
- Measure performance and memory consumption of tflite micro
Hardware we have: STM32F767, STM32F746 (this is not the final HW, just something that fits our current needs)
NN to experiment with: link
@jinevening @struss @underflow101
Feel free to ask/discuss this issue, any feedback is welcome.
P.S. If you know who else is interested in this task, please mention them =)
Interesting project. Those measurements would be useful to start a discussion about a small-footprint NN runtime.
STM boards have extremely little memory (flash memory as well as SDRAM). Code size and model size should be reduced as much as possible. For that, I think it would be better to specify an application (usage scenario) first, so that we can focus on specific models and operators.
At this moment, I have compiled luci interpreter for the STM32F767 microcontroller with arm-none-eabi 9 2020-q2-update, together with the DSP library. It supports the following basic kernels:
luci::CircleAdd
luci::CircleAveragePool2D
luci::CircleConcatenation
luci::CircleConv2D
luci::CircleConst
luci::CircleDepthwiseConv2D
luci::CircleFullyConnected
luci::CircleInput
luci::CircleMaxPool2D
luci::CircleMul
luci::CircleOutput
luci::CircleReshape
luci::CircleSoftmax
I ran a simple NN from the TensorFlow Lite Micro examples, which generates a sine. It ran in a single MbedOS thread with a 216 MHz system core clock, 512 kB of on-chip SRAM, and 128 Mbit of SDRAM at 187 MHz on FSMC.
> Luci Interpreter for microcontrollers
> STM32F767 SystemCoreClock 216000000
> model_no_quant.circle size: 2544
> circle::VerifyModelBuffer
> OK
> luci::Importer().importModule
> [luci] NodeFinder INPUT(0) = 0x2000f590
> [luci] NodeFinder const_node(1) -> 0x2000f708
> [luci] NodeFinder const_node(2) -> 0x2000f7d8
> [luci] NodeFinder const_node(3) -> 0x2000f8d8
> [luci] NodeFinder const_node(4) -> 0x2000f9d0
> [luci] NodeFinder const_node(5) -> 0x2000fae0
> [luci] NodeFinder const_node(6) -> 0x2000ffa0
> [luci] NodeFinder OUTPUT(9) = 0x20010518
> Name: main
> --- FixInterGraph main ---
> --- ValidateGraphProp main ---
> --- post_import_graph done ---
> OK
> Interpreter::Interpreter(const luci::Module *module)
> module->size() 1
> createTensors(main_graph);
> createExecutionSequence(main_graph);
> Finished in 29us 0.00000 0.04155
> Finished in 26us 1.00000 0.83879
> Finished in 26us 2.00000 0.91872
> Finished in 25us 3.00000 0.12998
> Finished in 26us 4.00000 -0.73403
> Finished in 26us 5.00000 -0.93702
> Finished in 26us 6.00000 -0.24592
> Finished in 25us 7.00000 0.44518
> TFLite micro interpreter
> STM32F767 SystemCoreClock 216000000
> read_file_to_buf /fs/model/model_no_quant.tflite 2788
> Verify OK
> TFModel: MLIR Converted. version: 3
> TFModel: metadata min_runtime_version: 11
> model_no_quant.tflite
> MODEL OK
> Finished 58us 0.00000 0.04155
> Finished 51us 1.00000 0.83879
> Finished 60us 2.00000 0.91872
> Finished 51us 3.00000 0.12998
> Finished 56us 4.00000 -0.73403
> Finished 60us 5.00000 -0.93702
> Finished 54us 6.00000 -0.24592
> Finished 54us 7.00000 0.44518
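To put a number on the two logs above, the per-inference times can be averaged and compared. This is only a small sketch: the sample values are copied from the "Finished in ...us" lines, while the names `mean_us`, `kLuciUs`, and `kTfliteMicroUs` are made up for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Per-inference latencies in microseconds, copied from the logs above.
const std::vector<int> kLuciUs = {29, 26, 26, 25, 26, 26, 26, 25};
const std::vector<int> kTfliteMicroUs = {58, 51, 60, 51, 56, 60, 54, 54};

// Arithmetic mean of the samples; returns 0.0 for an empty vector.
double mean_us(const std::vector<int> &samples) {
    if (samples.empty()) return 0.0;
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// How many times slower the second runtime is than the first, on average.
double slowdown(const std::vector<int> &fast, const std::vector<int> &slow) {
    return mean_us(slow) / mean_us(fast);
}
```

With these eight samples per runtime, the means are 26.125 us for luci interpreter and 55.5 us for TFLite Micro, a ratio of roughly 2.1x. Eight runs of a tiny model is a small sample, so this is an indication rather than a benchmark.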
Benchmarks for other kernels are in progress, but it looks promising - it is already possible to recognize hotwords using MFCC, or gestures, in real time locally on an STM32F7 (and, I think, on a 100 MHz STM32F4 as well) using luci interpreter.
Arduino Nano 33 BLE Sense officially supports TensorFlow Lite, so it could be a reference for us.
That said, I had an awful experience with TFLite on Arduino.
Arduino Nano 33 BLE Sense has the following specification:
- CPU: nRF52840 @64MHz (ARM Cortex-M4)
- Flash: 1MB
- RAM: 256KB
Image classification with binary classes took about 18 seconds per frame; I used an ArduCAM with the example that TensorFlow provides.
Speech recognition took less time, but it still took 2 ~ 7.6 seconds per inference.
I think running luci-interpreter on an MCU, especially an ARM Cortex-M4, could be experimental, or innovative in some sense, but it will surely be hard work to optimize our features.
The CMSIS DSP library is really helpful: it significantly speeds up some operations and reduces memory consumption. The main problem for heavy nets (e.g. image recognition) may be the small amount of on-chip RAM, but inexpensive external FSMC-driven SDRAM solves this problem, opening new prospects for using something like the STM32F7 even for image recognition - it has HW accelerators and an L1 cache with separate buses for data and instructions.
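For context, here is what a plain reference fully connected kernel looks like. It is a portable sketch, not the luci-interpreter code; the function name and the row-major weight layout are assumptions made for illustration. On a Cortex-M target, the inner dot-product loop is the kind of code that a CMSIS DSP routine such as `arm_dot_prod_f32` could replace.

```cpp
#include <cstddef>
#include <vector>

// Reference fully connected layer:
//   output[o] = bias[o] + sum over i of weights[o * in_size + i] * input[i]
// Weights are assumed row-major, one row per output unit (illustrative layout).
std::vector<float> fully_connected(const std::vector<float> &input,
                                   const std::vector<float> &weights,
                                   const std::vector<float> &bias) {
    const std::size_t in_size = input.size();
    const std::size_t out_size = bias.size();
    std::vector<float> output(out_size);
    for (std::size_t o = 0; o < out_size; ++o) {
        float acc = bias[o];
        // Inner dot product: the candidate for a DSP-optimized routine.
        for (std::size_t i = 0; i < in_size; ++i) {
            acc += weights[o * in_size + i] * input[i];
        }
        output[o] = acc;
    }
    return output;
}
```

The caller is expected to pass `weights` with exactly `bias.size() * input.size()` elements; the sketch does not check this.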
The result looks promising! 👏
I'm curious why luci-interpreter is much faster (~2x on-chip SRAM) than TFLite micro. What was the target model? Was it a general CNN composed of CONV and FC layers?
Did you use the existing kernels in luci-interpreter, or did you write new kernels using the CMSIS DSP library?
I'm also taking a look at the MCUNet paper from MIT, hoping it could help with this issue.
The paper claims that it can achieve 3x faster inference than TFLite Micro, with 87% accuracy.
Draft PR: https://github.com/Samsung/ONE/pull/5475
I'm curious why luci-interpreter is much faster (~2x on-chip SRAM) than TFLite micro.
Do we have an answer to this? Also, a comparison of the memory usage would make it easier to judge.
@SlavikMIPT Could you share your current progress for this task and which issues you have at this moment?
Generally speaking, they allocate tensors in different ways, and the main difference is memory usage. I am investigating this further to give a more detailed answer; debugging and profiling tools on microcontrollers are rather limited. Which implementation of fully connected is used (optimized or reference) makes almost no difference.
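As background on what "different ways" can mean: TFLite Micro places its tensors in a single buffer supplied up front (the tensor arena), whereas a runtime can also request each buffer from the heap individually. The sketch below shows the arena idea only. `BumpArena` is a made-up class, not code from either runtime.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size arena: one buffer reserved up front, blocks handed
// out by bumping an offset. Nothing is freed individually.
class BumpArena {
public:
    BumpArena(std::uint8_t *buffer, std::size_t size)
        : buffer_(buffer), size_(size), used_(0) {}

    // Returns a block whose offset is a multiple of `alignment`, or nullptr
    // when the arena is exhausted. Assumes the buffer itself is suitably
    // aligned, since only the offset is rounded up.
    void *allocate(std::size_t bytes, std::size_t alignment = 16) {
        const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
        if (start > size_ || bytes > size_ - start) return nullptr;
        used_ = start + bytes;
        return buffer_ + start;
    }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return used_; }

private:
    std::uint8_t *buffer_;
    std::size_t size_;
    std::size_t used_;
};
```

The trade-off is that the arena has to be sized for the worst case before the model runs, so a small model can still reserve a large block.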
I did a memory trace via a malloc wrapper. The format is:
#m - malloc
#f - free
For malloc, the fields after the tag are: the address where the memory was allocated, the caller (return address), and the number of bytes allocated.
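A wrapper producing lines in this format could look roughly like the sketch below. It is not the wrapper used for these traces: `traced_malloc`, `traced_free`, and `g_bytes_allocated` are illustrative names, and `__builtin_return_address` is a GCC/Clang builtin. On a target build, such functions are typically hooked in with the GNU linker's `--wrap=malloc` option instead of being called directly.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Running total of bytes successfully requested through traced_malloc.
static std::size_t g_bytes_allocated = 0;

// Logs "#m:<address>;<caller>-<size>" and forwards the request to malloc.
__attribute__((noinline)) void *traced_malloc(std::size_t size) {
    void *caller = __builtin_return_address(0);  // GCC/Clang builtin
    void *ptr = std::malloc(size);
    if (ptr != nullptr) g_bytes_allocated += size;
    std::printf("#m:%p;%p-%zu\n", ptr, caller, size);
    return ptr;
}

// Logs "#f:0x0;<caller>-<address>" and forwards the request to free.
__attribute__((noinline)) void traced_free(void *ptr) {
    void *caller = __builtin_return_address(0);
    std::printf("#f:0x0;%p-%p\n", caller, ptr);
    std::free(ptr);
}
```

The return address identifies the call site, which can then be mapped back to a function with the linker map file or a tool such as `addr2line`.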
Here are the raw traces for an 832-byte model:
Luci:
--- post_import_graph done ---
#m:0x2000e950;0x807142f-16
Interpreter::Interpreter(const luci::Module *module)
module->size() 1
#m:0x2000e210;0x804f59b-28
createTensors(main_graph);
#m:0x2000e238;0x804e451-16
#m:0x2000e258;0x804e545-68
#m:0x2000e2a8;0x8050435-300
#m:0x2000e3e0;0x804e5c3-12
#m:0x2000e3f8;0x804e34b-8
#m:0x2000e410;0x804e6c3-16
#m:0x2000e430;0x804e545-68
#m:0x2000e480;0x8050435-96
#m:0x2000e4f0;0x804e5c3-12
#m:0x2000e508;0x804e6c3-4
#m:0x2000e518;0x804e545-68
#m:0x2000e568;0x8050435-8
#m:0x2000e580;0x804e5c3-12
#m:0x2000e598;0x804e34b-20
#f:0x0;0x804e391-0x2000e3f8 // 8 bytes free
#m:0x2000e5b8;0x804e545-68
#m:0x2000e3f8;0x8050435-4
#m:0x2000e608;0x804e5c3-12
#m:0x2000e620;0x804e545-68
#m:0x2000e670;0x8050435-4
#m:0x2000e680;0x804e5c3-12
createExecutionSequence(main_graph);
#m:0x2000e698;0x804ce61-24
#m:0x2000e6c0;0x804d017-4
#f:0x0;0x804d409-0x2000e698 //24 bytes free
#m:0x2000e6d0;0x804c89d-32
#m:0x2000f1c0;0x804c8b3-512
#m:0x2000e698;0x804ca5f-20
#m:0x2000e700;0x804ca5f-20
#m:0x2000e720;0x804ca5f-20
#m:0x2000e740;0x804ca5f-20
#m:0x2000e760;0x804c7c3-4
#m:0x2000e770;0x804ca5f-20
#m:0x2000e790;0x804c7c3-8
#f:0x0;0x804c81f-0x2000e760
#m:0x2000e7a8;0x804ca5f-20
#m:0x2000e7c8;0x804c7c3-16
#f:0x0;0x804c81f-0x2000e790 //8 bytes free
#m:0x2000e7e8;0x804c7c3-32
#f:0x0;0x804c81f-0x2000e7c8 //16 bytes free
#f:0x0;0x804c9a5-0x2000f1c0 //512 bytes free
#f:0x0;0x804c9af-0x2000e6d0 //32 bytes free
#f:0x0;0x804b7e1-0x2000e698 //20 bytes free
#f:0x0;0x804c9c3-0x2000e700 //20 bytes free
#f:0x0;0x804b7e1-0x2000e720 //20 bytes free
#f:0x0;0x804b7e1-0x2000e7a8 //20 bytes free
#f:0x0;0x804c9c3-0x2000e770 //20 bytes free
#f:0x0;0x804c9c3-0x2000e740 //20 bytes free
#f:0x0;0x804e959-0x2000e6c0 //4 bytes free
node->opcode() 78
node->opcode() 8
node->opcode() 8
node->opcode() 9
_execution_sequence.push_back node->opcode() 9
#m:0x2000e818;0x80501af-52
#m:0x2000e858;0x804e0ed-4
node->opcode() 33
_execution_sequence.push_back node->opcode() 33
#m:0x2000e868;0x804fef3-44
#m:0x2000e8a0;0x804e0ed-8
#f:0x0;0x804e189-0x2000e858 //4 bytes free
node->opcode() 79
#f:0x0;0x804eb3d-0x2000e7e8 //32 bytes free
kernel->configure();
#m:0x2000e8b8;0x8054ebf-16
#m:0x2000e8d8;0x8050585-16
#m:0x2000e698;0x8050625-128
#f:0x0;0x8054edf-0x2000e3f8 //4 bytes free
#f:0x0;0x8054ee7-0x2000e8b8 //16 bytes free
kernel->configure();
#m:0x2000e8b8;0x80586f3-16
#m:0x2000e728;0x8050585-16
#m:0x2000e748;0x8050625-128
#f:0x0;0x8058713-0x2000e670 //4 bytes free
#f:0x0;0x805871b-0x2000e8b8 //16 bytes free
#m:0x2000f1c0;0x807154b-300
#f:0x0;0x8071581-0x2000f1c0 //300 bytes free
Bytes allocated: 1300
Tflite micro:
#m:0x200572c0;0x803b463-304
#m:0x20057400;0x803b477-32
#m:0x20057430;0x8036e1f-36
#m:0x20057460;0x8036e39-512
#m:0x20057670;0x802c761-28
#m:0x20057698;0x8036e6b-192
#m:0x20057768;0x8036ead-36
#m:0x20057798;0x8030baf-8192
Bytes allocated: 9332
Bytes allocated by mbed os: 5709
As we can see, TFLite Micro allocates significantly more memory (9332 bytes versus 1300 bytes for luci). The cache size of the STM32F7 is 16 kB, so I suppose the 8192-byte allocated block doesn't fit into the cache.
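The totals can be cross-checked by summing the sizes of the `#m` lines in a trace. The helper below is a sketch written for this trace format; `sum_malloc_bytes` is a made-up name and it counts allocations only, ignoring frees.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>

// Sums the sizes of all "#m:<address>;<caller>-<size>" lines in a trace.
// Lines that do not start with "#m:" (frees, log text) are skipped.
std::size_t sum_malloc_bytes(const std::string &trace) {
    std::istringstream in(trace);
    std::string line;
    std::size_t total = 0;
    while (std::getline(in, line)) {
        if (line.rfind("#m:", 0) != 0) continue;  // not a malloc record
        const std::size_t dash = line.rfind('-');
        if (dash == std::string::npos) continue;  // malformed record
        // strtoul yields 0 when no digits follow the dash.
        total += std::strtoul(line.c_str() + dash + 1, nullptr, 10);
    }
    return total;
}
```

Applied to the eight `#m` lines of the TFLite Micro trace, this gives 9332, matching the reported total.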