onert consumes almost twice as much memory as the model size during model loading.
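Current model loading flow:

1. Read the model file into memory (`_buffer` variable)
2. Parse `_buffer` and create Graph
3. Copy weight and bias data into `CachedData` :point_left: Peak memory
4. Release `_buffer` from memory

At step 3, weight and bias data are duplicated in `_buffer` and `CachedData`.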
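The duplication at step 3 can be illustrated with a small sketch. The names `LoadTrace` and `trace_copy_load` are hypothetical and are not onert code; the sketch only counts how many bytes are live while a whole-file buffer and a copy of the weight data coexist.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping, not onert code: live bytes at three points of a
// copy-based load.
struct LoadTrace {
  std::size_t after_read = 0;    // only the whole-file buffer is live
  std::size_t after_copy = 0;    // buffer and copied weights are both live (peak)
  std::size_t after_release = 0; // only the copied weights remain
};

// Simulates loading a file of `file_size` bytes, of which the first
// `weight_size` bytes stand in for weight and bias data.
inline LoadTrace trace_copy_load(std::size_t file_size, std::size_t weight_size) {
  assert(weight_size <= file_size);
  LoadTrace t;

  std::vector<std::uint8_t> buffer(file_size); // file contents in memory
  t.after_read = buffer.size();

  // Copy the weight bytes out of the buffer into separately owned storage.
  std::vector<std::uint8_t> cached(
      buffer.begin(), buffer.begin() + static_cast<std::ptrdiff_t>(weight_size));
  t.after_copy = buffer.size() + cached.size();

  std::vector<std::uint8_t>().swap(buffer); // drop the buffer's storage
  t.after_release = cached.size();
  return t;
}
```

For a file that is almost entirely weights, `after_copy` approaches twice the file size, which matches the "almost twice the model size" observation.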
Draft change to `CachedData`:

- Use `mmap` to read the model file; `CachedData` holds the mmap address
- `munmap` is needed in the destructor of `CachedData`

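A minimal sketch of the mmap idea, assuming a POSIX system. `MappedFile` is a hypothetical RAII wrapper written for illustration; it is not the onert `CachedData` class, and the draft may organize this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical wrapper: maps a file read-only and unmaps it in the destructor.
class MappedFile {
public:
  explicit MappedFile(const std::string &path) {
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed: " + path);

    struct stat st;
    if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
      ::close(fd);
      throw std::runtime_error("fstat failed or empty file: " + path);
    }
    _size = static_cast<std::size_t>(st.st_size);

    void *addr = ::mmap(nullptr, _size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd); // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
    _base = static_cast<const std::uint8_t *>(addr);
  }

  // The munmap call lives in the destructor, as the list above requires.
  ~MappedFile() {
    if (_base != nullptr) ::munmap(const_cast<std::uint8_t *>(_base), _size);
  }

  MappedFile(const MappedFile &) = delete;
  MappedFile &operator=(const MappedFile &) = delete;

  const std::uint8_t *base() const { return _base; }
  std::size_t size() const { return _size; }

  // Byte at `offset`; the bounds check is only active in debug builds.
  std::uint8_t at(std::size_t offset) const {
    assert(offset < _size);
    return _base[offset];
  }

private:
  const std::uint8_t *_base = nullptr;
  std::size_t _size = 0;
};
```

Copying is disabled so that two objects can never unmap the same region.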
$ BACKENDS=cpu valgrind --tool=massif --pages-as-heap=yes --detailed-freq=1 ./Product/armv7l-linux.debug/out/bin/nnpackage_run --nnpackage ../nnpkg_tst/inception_v3_2018_04_27/
Valgrind is used to profile the memory usage of nnpackage_run.
Valgrind terminates after model prepare because some NEON instructions are not supported by valgrind.
[Figure: valgrind massif profile, onert master]

[Figure: valgrind massif profile, onert draft]

| Runtime | Memory | Load time | Prepare time |
|--|--|--|--|
| master | 187 MB | 419 ms | 288 ms |
| draft | 100 MB | 253 ms | 254 ms |
- `_buffer` is removed

Raw data from nnpackage_run
onert master
$ BACKENDS=cpu ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage ../nnpkg_tst/inception_v3_2018_04_27 --mem_poll=true
... run 0 takes 1283.27 ms
===================================
MODEL_LOAD takes 419.034 ms
PREPARE takes 288.84 ms
EXECUTE takes
- Min: 1283.27 ms
- Max: 1283.27 ms
- Mean: 1283.27 ms
- GeoMean: 1283.27 ms
===================================
RSS
- MODEL_LOAD takes 191704 kb
- PREPARE takes 203244 kb
- EXECUTE takes 134940 kb
- PEAK takes 203244 kb
===================================
HWM
- MODEL_LOAD takes 194080 kb
- PREPARE takes 203808 kb
- EXECUTE takes 203808 kb
- PEAK takes 203808 kb
===================================
onert draft
$ BACKENDS=cpu ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage ../nnpkg_tst/inception_v3_2018_04_27 --mem_poll=true
... run 0 takes 1252.29 ms
===================================
MODEL_LOAD takes 253.247 ms
PREPARE takes 254.015 ms
EXECUTE takes
- Min: 1252.29 ms
- Max: 1252.29 ms
- Mean: 1252.29 ms
- GeoMean: 1252.29 ms
===================================
RSS
- MODEL_LOAD takes 101928 kb
- PREPARE takes 203964 kb
- EXECUTE takes 137896 kb
- PEAK takes 203964 kb
===================================
HWM
- MODEL_LOAD takes 102936 kb
- PREPARE takes 204468 kb
- EXECUTE takes 204468 kb
- PEAK takes 204468 kb
===================================
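On Linux, resident set size and its high-water mark are exposed as the `VmRSS` and `VmHWM` fields of `/proc/self/status`, both in kB. How `nnpackage_run` collects its RSS and HWM figures is not shown in this thread, so the following is only an assumed sketch of one way to read such values. The helper names `read_status_kb` and `current_kb` are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the value in kB of a field such as "VmRSS" or "VmHWM" from text in
// the format of /proc/<pid>/status, or -1 if the field is missing or malformed.
inline long read_status_kb(std::istream &status, const std::string &field) {
  assert(!field.empty());
  const std::string prefix = field + ":";
  std::string line;
  while (std::getline(status, line)) {
    if (line.compare(0, prefix.size(), prefix) != 0) continue;
    std::istringstream rest(line.substr(prefix.size()));
    long kb = -1;
    rest >> kb; // the trailing "kB" unit is ignored
    return rest.fail() ? -1 : kb;
  }
  return -1;
}

// Convenience wrapper for the current process (Linux only).
inline long current_kb(const std::string &field) {
  std::ifstream status("/proc/self/status");
  return read_status_kb(status, field);
}
```

Calling `current_kb("VmRSS")` right after each phase would give per-phase numbers of the same kind as the RSS block above, and `current_kb("VmHWM")` the peak so far.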
- `ExternalData` has a base pointer and its size
- Change `CachedData` into `ExternalData`
- `ExternalData` has a pointer to the mmapped address
- `munmap` is not yet implemented

| Runtime | Load time memory | Inference memory | Load time | Prepare time |
|--|--|--|--|--|
| master | 187 MB | 132 MB | 419 ms | 288 ms |
| option 1 | 100 MB | 135 MB | 253 ms | 254 ms |
| option 2 | 14 MB | 224 MB | 6 ms | 295 ms |
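
The idea of an `ExternalData` object that holds only a base pointer and a size can be sketched as a non-owning view. `ExternalView` below is a hypothetical name chosen for illustration; it is not the actual onert `ExternalData` class, and whoever created the mapping remains responsible for unmapping it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical non-owning view over bytes that live elsewhere, for example in
// a memory-mapped model file. Constructing or copying it copies no data.
class ExternalView {
public:
  ExternalView(const std::uint8_t *base, std::size_t size) : _base(base), _size(size) {}

  const std::uint8_t *base() const { return _base; }
  std::size_t size() const { return _size; }

  // Sub-view of `length` bytes starting at `offset`, e.g. one tensor's weights.
  ExternalView slice(std::size_t offset, std::size_t length) const {
    assert(offset <= _size && length <= _size - offset);
    return ExternalView(_base + offset, length);
  }

private:
  const std::uint8_t *_base;
  std::size_t _size;
};
```

Because nothing is copied at load time, a design like this keeps load-time memory low; the pages are only counted once they are actually touched, which is consistent with the higher inference memory in the option 2 row.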
- `_buffer` and `CachedData` are removed

Raw data from nnpackage_run
onert master
$ BACKENDS=cpu ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage ../nnpkg_tst/inception_v3_2018_04_27 --mem_poll=true
... run 0 takes 1283.27 ms
===================================
MODEL_LOAD takes 419.034 ms
PREPARE takes 288.84 ms
EXECUTE takes
- Min: 1283.27 ms
- Max: 1283.27 ms
- Mean: 1283.27 ms
- GeoMean: 1283.27 ms
===================================
RSS
- MODEL_LOAD takes 191704 kb
- PREPARE takes 203244 kb
- EXECUTE takes 134940 kb
- PEAK takes 203244 kb
===================================
HWM
- MODEL_LOAD takes 194080 kb
- PREPARE takes 203808 kb
- EXECUTE takes 203808 kb
- PEAK takes 203808 kb
===================================
option 1
$ BACKENDS=cpu ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage ../nnpkg_tst/inception_v3_2018_04_27 --mem_poll=true
... run 0 takes 1252.29 ms
===================================
MODEL_LOAD takes 253.247 ms
PREPARE takes 254.015 ms
EXECUTE takes
- Min: 1252.29 ms
- Max: 1252.29 ms
- Mean: 1252.29 ms
- GeoMean: 1252.29 ms
===================================
RSS
- MODEL_LOAD takes 101928 kb
- PREPARE takes 203964 kb
- EXECUTE takes 137896 kb
- PEAK takes 203964 kb
===================================
HWM
- MODEL_LOAD takes 102936 kb
- PREPARE takes 204468 kb
- EXECUTE takes 204468 kb
- PEAK takes 204468 kb
===================================
option 2
$ BACKENDS=cpu ./Product/armv7l-linux.release/out/bin/nnpackage_run --nnpackage ../nnpkg_tst/inception_v3_2018_04_27 -m1
... run 0 takes 1235.91 ms
===================================
MODEL_LOAD takes 6.275 ms
PREPARE takes 295.026 ms
EXECUTE takes
- Min: 1235.91 ms
- Max: 1235.91 ms
- Mean: 1235.91 ms
- GeoMean: 1235.91 ms
===================================
RSS
- MODEL_LOAD takes 13984 kb
- PREPARE takes 200916 kb
- EXECUTE takes 229052 kb
- PEAK takes 229052 kb
===================================
HWM
- MODEL_LOAD takes 13984 kb
- PREPARE takes 200916 kb
- EXECUTE takes 229340 kb
- PEAK takes 229340 kb
===================================
I've personally taken this further and used `ExternalData` in all phases (ModelLoad through Execute): https://github.com/YongseopKim/ONE/commit/e69aa3a0c674fb8c74f449175b352329e299f280
device: xu4-ubuntu
backend: cpu
Memory
| version | ModelLoad | Prepare | Execute |
|---------|-----------|---------|---------|
| master | 192 MB | 203 MB | 135 MB |
| draft | 10 MB | 21 MB | 204 MB |
Latency
| version | ModelLoad | Prepare | Execute |
|---------|-----------|---------|---------|
| master | 400 ms | 229 ms | 1271 ms |
| draft | 71 ms | 109 ms | 1310 ms |
Memory
| version | ModelLoad | Prepare | Execute |
|---------|-----------|---------|---------|
| master | 33 MB | 41 MB | 34 MB |
| draft | 7 MB | 16 MB | 41 MB |
Latency
| version | ModelLoad | Prepare | Execute |
|---------|-----------|---------|---------|
| master | 149 ms | 153 ms | 296 ms |
| draft | 5 ms | 26 ms | 297 ms |
The draft can't run the other models for now.
The table above looked odd, so I have updated it.
I profiled my draft version because the peak RSS seemed too high (223 MB).

The draft uses external data from model load to execute.

Latency (ms)
| | MODEL_LOAD | PREPARE | EXECUTE |
|-----------------|------------|---------|---------|
| tflite | 16 | 0.5 | 1565 |
| onert-master | 235 | 1000 | 1278 |
| onert-draft-T=8 | 33 | 822 | 1248 |
| onert-draft-T=4 | 33 | 784 | 1331 |
| onert-draft-T=2 | 35 | 810 | 1494 |
| onert-draft-T=1 | 32 | 814 | 2155 |
RSS (kb)
| | MODEL_LOAD | PREPARE | EXECUTE |
|-----------------|------------|---------|---------|
| tflite | 8580 | 9752 | 228017 |
| onert-master | 101760 | 204152 | 136860 |
| onert-draft-T=8 | 7132 | 22236 | 223887 |
| onert-draft-T=4 | 7128 | 21440 | 223615 |
| onert-draft-T=2 | 7188 | 22756 | 224579 |
| onert-draft-T=1 | 7128 | 22232 | 205872 |
VMS (kb)
| | MODEL_LOAD | PREPARE | EXECUTE |
|-----------------|------------|---------|---------|
| tflite | 173016 | 305596 | 352232 |
| onert-master | 121924 | 217348 | 242270 |
| onert-draft-T=8 | 113740 | 123228 | 325640 |
| onert-draft-T=4 | 113868 | 123224 | 288756 |
| onert-draft-T=2 | 113736 | 123224 | 270276 |
| onert-draft-T=1 | 113736 | 123224 | 226891 |
If you're interested, please see https://github.com/YongseopKim/ONE/issues/2
Draft of @YongseopKim : https://github.com/YongseopKim/ONE/tree/test/use_external_data_pulled
I profiled the above draft using valgrind on an x86_64 system, where valgrind runs without any errors.


`filter_data`:

- `filter_data` in `CachedData` is deallocated
- `filter_data` in `ExternalData` is not unmapped
- Release `ExternalData` after it is copied for convolution

According to @periannath's advice, I can remove the external data in the execute phase.
draft2: https://github.com/YongseopKim/ONE/tree/test/use_external_data_pulled
Latency (ms)
| | MODEL_LOAD | PREPARE | EXECUTE |
|-----------------|------------|---------|---------|
| tflite | 16 | 0.5 | 1565 |
| onert-master | 235 | 1000 | 1278 |
| onert-draft | 33 | 822 | 1248 |
| onert-draft2 | 33 | 769 | 1251 |
RSS (kb)
| | MODEL_LOAD | PREPARE | EXECUTE |
|-----------------|------------|---------|---------|
| tflite | 8580 | 9752 | 228017 |
| onert-master | 101760 | 204152 | 136860 |
| onert-draft | 7132 | 22236 | 223887 |
| onert-draft2 | 7292 | 21860 | 132173 |
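
Dropping the external data once its contents have been copied is what separates draft2 from the first draft in the EXECUTE column. A minimal sketch of that step, assuming the data sits in its own mmap region: `copy_then_unmap` is a hypothetical helper, not onert code, and the real change may release memory at a different granularity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

#include <sys/mman.h>

// Hypothetical helper: copies `size` bytes out of a mapping and then unmaps it.
// `mapping` must be the address returned by mmap and `size` the mapped length.
inline std::vector<std::uint8_t> copy_then_unmap(void *mapping, std::size_t size) {
  assert(mapping != nullptr && mapping != MAP_FAILED && size > 0);

  std::vector<std::uint8_t> owned(size);
  std::memcpy(owned.data(), mapping, size); // the kernel's private copy

  const int rc = ::munmap(mapping, size);   // mapped pages stop counting as RSS
  assert(rc == 0);
  (void)rc; // silences the unused-variable warning when asserts are disabled
  return owned;
}
```

After the call, only the owned copy remains resident, so the same bytes are no longer counted twice during execution.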
All related PRs are merged. If anyone is interested in using `ExternalData` instead of `CachedData`, please see #2580.