If a tensor is static at first but becomes dynamic during the run, it is allocated by both the static and the dynamic memory managers. This wastes memory. However, deallocating the static memory in those cases is not easy, because its memory space is likely to be shared with other tensors whose lifetimes do not overlap.
FYI, some scenarios (they are similar) would be:
at compilation time, #0, #1 are allocated.
#0 = placeholder(shape=[100, 100])
#1 = relu(#0)
but after nnfw_prepare(), if we call
nnfw_set_input_tensorinfo(#0, [2, 2])
the static memory is still there, but #0 and #1 are now treated as dynamic tensors.
at compilation time, #0, #1, #2 are allocated (None is treated as dim == 1 in tflite)
#0 = placeholder(shape=[None, 100])
#1 = placeholder(shape=[None, 200])
#2 = add(#0, #1)
but after nnfw_prepare(), if we call
nnfw_set_input_tensorinfo(#0, [200, 100])
nnfw_set_input_tensorinfo(#1, [200, 200])
the static memory is still there, but #0, #1 and #2 are now treated as dynamic tensors.
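To get a feeling for how much memory the first scenario leaves unused, here is a rough sketch. It assumes float32 tensors (4 bytes per element) and ignores any sharing the static planner may do, so the static figure is an upper bound. `tensor_bytes` and `wasted_static_bytes` are hypothetical helper names, not part of the runtime.

```python
from math import prod

BYTES_PER_ELEMENT = 4  # assumption: float32 tensors


def tensor_bytes(shape):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * BYTES_PER_ELEMENT


def wasted_static_bytes(compile_time_shapes, runtime_shapes):
    """Return (static_total, dynamic_total) in bytes.

    static_total is what was planned at compilation time and is still held;
    dynamic_total is what the dynamic manager allocates after the resize.
    """
    static_total = sum(tensor_bytes(s) for s in compile_time_shapes.values())
    dynamic_total = sum(tensor_bytes(s) for s in runtime_shapes.values())
    return static_total, dynamic_total


# First scenario: #0 and #1 are planned as [100, 100], then resized to [2, 2].
static_total, dynamic_total = wasted_static_bytes(
    {0: [100, 100], 1: [100, 100]},
    {0: [2, 2], 1: [2, 2]},
)
print(static_total, dynamic_total)
```

Under these assumptions the static plan keeps 80000 bytes alive while the run itself only needs 32 bytes of dynamic memory.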
I think we can approach those cases differently.
From my observation, there does not seem to be a perfect solution for this, because any tensor that was static at first can become dynamic. However, static tensors share memory space with other tensors, so deallocating just one of them is not possible.
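To illustrate why one static tensor cannot simply be freed, here is a minimal sketch of a lifetime-based planner. It is a generic greedy first-fit planner written for illustration, not the planner used in the runtime; the tensor names, sizes and lifetimes are made up.

```python
def plan_offsets(tensors):
    """Greedy first-fit planner over one shared static buffer.

    tensors: dict name -> (size, first_use, last_use), lifetimes inclusive.
    Returns dict name -> byte offset into the shared buffer.
    """
    placed = []  # (offset, size, first_use, last_use)
    offsets = {}
    for name, (size, first, last) in tensors.items():
        # Regions held by tensors whose lifetime overlaps this one.
        busy = sorted(
            (off, sz) for off, sz, f, l in placed if not (last < f or l < first)
        )
        offset = 0
        for off, sz in busy:
            if offset + size <= off:
                break  # the gap before this busy region is large enough
            offset = max(offset, off + sz)
        placed.append((offset, size, first, last))
        offsets[name] = offset
    return offsets


# Hypothetical chain of operations: t0 -> t1 -> t2.
offsets = plan_offsets({
    "t0": (400, 0, 1),
    "t1": (400, 1, 2),
    "t2": (400, 2, 3),
})
print(offsets)
```

In this sketch t0 and t2 end up at the same offset because their lifetimes do not overlap. If t0 later becomes dynamic, its region still cannot be returned, since t2 is using the same bytes.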
Here are some candidates I came up with:
(#151 is a prerequisite.)
I think this is solvable, as we know that we are not using the static tensors. We should make those dynamic.
I ran a benchmark with one of our test models.
1) Running a model with --shape_run option (all dynamic tensors):
EXECUTE takes 15.488 ms
- MEAN : 15.488 ms
- MAX : 16.377 ms
- MIN : 15.256 ms
- GEOMEAN : 15.484 ms
2) Running a model _without_ --shape_run option (all static tensors):
EXECUTE takes 12.305 ms
- MEAN : 12.305 ms
- MAX : 12.413 ms
- MIN : 12.189 ms
- GEOMEAN : 12.305 ms
It seems that using dynamic tensors is slower because of the memory allocation and free.
I think we should _use static tensor as much as possible_.
how to run benchmark:
BACKENDS=cpu ./Product/out/bin/nnpackage_run test_model/package/dir -r 20 -w 10
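For reference, the gap between the two MEAN values above can be worked out as follows. This only uses the numbers from the two runs and says nothing about other models or backends.

```python
dynamic_mean_ms = 15.488  # run with --shape_run (all dynamic tensors)
static_mean_ms = 12.305   # run without --shape_run (all static tensors)

# Extra time per EXECUTE and the slowdown relative to the static run.
extra_ms = dynamic_mean_ms - static_mean_ms
slowdown_pct = extra_ms / static_mean_ms * 100

print(f"extra per run: {extra_ms:.3f} ms, slowdown: {slowdown_pct:.1f} %")
```

So for this model the all-dynamic run costs about 3.2 ms more per execution, which is roughly 26 % slower than the all-static run.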
I tried to list the cases in more detail:
(Internally, case 1) and case 2) are handled the same way.)
nnfw_set_input_tensorinfo() is not called at all:
nnfw_set_input_tensorinfo() is called before nnfw_prepare():
nnfw_set_input_tensorinfo(0, [1, 2, 3])
nnfw_prepare() ---> static tensors are allocated.
nnfw_run()
nnfw_run()
...
nnfw_set_input_tensorinfo() is called after nnfw_prepare():
nnfw_prepare() ---> currently, static tensors are allocated. not good.
nnfw_set_input_tensorinfo(0, [1, 2, 3])
nnfw_run() --> dynamic tensors are allocated and freed
nnfw_run() --> dynamic tensors are allocated and freed
...
Would it be good to change the behavior to the following?
nnfw_prepare() ---> do nothing (only memory for const is allocated)
nnfw_set_input_tensorinfo(0, [1, 2, 3])
nnfw_run() --> statically and also dynamically allocate memory
nnfw_run() --> reuse statically allocated memory + dynamic tensors are allocated / freed
...
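The proposed order above could be sketched like this. It is only a toy model of the call sequence, assuming static memory is planned lazily on the first run; `SessionSketch` and its log strings are hypothetical and do not reflect the real nnfw session implementation.

```python
class SessionSketch:
    """Toy model of the proposed call order; not the real nnfw session."""

    def __init__(self):
        self.static_allocated = False
        self.log = []

    def prepare(self):
        # Proposed: only memory for constants is allocated here.
        self.log.append("prepare: allocate const only")

    def set_input_tensorinfo(self, index, shape):
        self.log.append(f"set_input_tensorinfo({index}, {shape})")

    def run(self):
        if not self.static_allocated:
            # First run: input shapes are known now, so plan static memory once.
            self.log.append("run: allocate static")
            self.static_allocated = True
        else:
            self.log.append("run: reuse static")
        # Tensors that are still dynamic are allocated and freed on every run.
        self.log.append("run: allocate and free dynamic")


session = SessionSketch()
session.prepare()
session.set_input_tensorinfo(0, [1, 2, 3])
session.run()
session.run()
print("\n".join(session.log))
```

The point of the sketch is that the static allocation happens once, after the shape is set, and later runs reuse it.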
nnfw_set_input_tensorinfo() is called whenever nnfw_run() is called:
This seems to be the same as case C).
nnfw_prepare() ---> currently, static tensors are allocated. not good.
nnfw_set_input_tensorinfo(0, [1, 2, 3])
nnfw_run() --> dynamic tensors are allocated and freed
nnfw_set_input_tensorinfo(0, [3, 2, 3])
nnfw_run() --> dynamic tensors are allocated and freed
...
nnfw_set_input_tensorinfo(0, [1, 2, 3])
nnfw_prepare() ---> all static tensors! seems good!
nnfw_set_input_tensorinfo(0, [3, 2, 4]) --> hmm???
nnfw_run() --> dynamic tensors are allocated and freed
...
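The current behavior across the cases listed above can be summarized with a small sketch. It assumes that every nnfw_set_input_tensorinfo() call after nnfw_prepare() really changes the shape; the `run_modes` helper and the "set" / "prepare" / "run" encoding are hypothetical and only serve the illustration.

```python
def run_modes(calls):
    """Label each run as 'static' or 'dynamic' per the cases listed above.

    calls: list of "set", "prepare" and "run" strings.
    A "set" before "prepare" is folded into the static plan; a "set" after
    "prepare" makes the affected tensors dynamic from then on.
    """
    prepared = False
    dynamic = False
    modes = []
    for call in calls:
        if call == "prepare":
            prepared = True
        elif call == "set" and prepared:
            dynamic = True
        elif call == "run":
            modes.append("dynamic" if dynamic else "static")
    return modes


print(run_modes(["prepare", "run", "run"]))         # never set
print(run_modes(["set", "prepare", "run"]))         # set before prepare
print(run_modes(["prepare", "set", "run", "run"]))  # set after prepare
print(run_modes(["set", "prepare", "set", "run"]))  # set before and after
```

Only the first two sequences stay fully static; any set after prepare switches the following runs to dynamic allocation.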
Let me try to fix it as follows (thanks @wateret):
nnfw_run()
nnfw_run()
Cleared milestone since it was the plan when I work on it.
Related - #151
The current approach in #3122 (for the CPU backend) is:
plan, prepare, allocate
plan, prepare, allocate
nnfw_set_input_tensorinfo() was _not_ called for the 2nd run,
plan, prepare, allocate
updated: https://github.com/Samsung/ONE/issues/2525#issuecomment-658501660