One: [onert] Reducing copies of models' inputs/outputs

Created on 21 Apr 2020 · 7 Comments · Source: Samsung/ONE

For the scenario in #152, let's find out how many copies are performed.

The current nnfw_api (oneapi) requires users to pass input/output tensor memory.

  1. Prepare input buffer
  2. Run Model 1

    1. Copy input data from input buffer

    2. Run the model

    3. Copy output data to output buffer

  3. Run Model 2 (the output buffer above is the input buffer below)

    1. Copy input data from input buffer

    2. Run the model

    3. Copy output data to the output buffer

Note: It turns out that this scenario is not much different from running just one model.
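The copy count in the steps above can be made concrete with a small sketch. `ToyModel` and `run_pipeline` are hypothetical names used only for illustration; they are not part of nnfw_api, and the "inference" is a placeholder that adds 1 to every element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a loaded model; not part of nnfw_api.
struct ToyModel {
  std::vector<float> internal_in;   // runtime-owned input tensor
  std::vector<float> internal_out;  // runtime-owned output tensor
  std::size_t copies = 0;           // user <-> internal copies performed so far

  explicit ToyModel(std::size_t n) : internal_in(n), internal_out(n) {}

  // Mirrors the three steps listed above: copy in, run, copy out.
  void run(const std::vector<float> &user_in, std::vector<float> &user_out) {
    assert(user_in.size() == internal_in.size());
    assert(user_out.size() == internal_out.size());
    std::memcpy(internal_in.data(), user_in.data(), user_in.size() * sizeof(float));
    ++copies;
    for (std::size_t i = 0; i < internal_in.size(); ++i)
      internal_out[i] = internal_in[i] + 1.0f;  // placeholder "inference"
    std::memcpy(user_out.data(), internal_out.data(), user_out.size() * sizeof(float));
    ++copies;
  }
};

// Chains two models through a shared user buffer and returns the total copy count.
std::size_t run_pipeline(const std::vector<float> &input, std::vector<float> &result) {
  ToyModel m1(input.size()), m2(input.size());
  std::vector<float> shared(input.size());  // output of model 1, input of model 2
  m1.run(input, shared);
  m2.run(shared, result);
  return m1.copies + m2.copies;
}
```

Under these assumptions each model contributes two copies, so chaining two models costs four copies in total, exactly twice the single-model case.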

Ideas

To get rid of all 4 copies above, do either of these:

  1. Revise API setInput/setOutput to get the internal memory buffer
  2. Revise the runtime to allow using memory from outside

Constraints

This optimization is not always possible, for example when:

  • The layouts do not match
  • The internal tensor has padding (alignment)
  • GPU memory is used
  • And some more...
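The constraints above amount to a check the runtime could make before deciding to skip a copy. The following is a minimal sketch under assumed types: `TensorDesc`, `Layout`, `MemoryKind`, and `can_share_buffer` are hypothetical and do not correspond to existing onert types. It covers only the three listed conditions, not the "some more" cases.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptors, for illustration only.
enum class Layout { NHWC, NCHW };
enum class MemoryKind { Host, Device };

struct TensorDesc {
  Layout layout;
  MemoryKind memory;
  std::size_t row_bytes;     // bytes of real data per row
  std::size_t stride_bytes;  // bytes between the starts of consecutive rows
};

// Returns true when the user buffer could be used directly as the internal tensor.
// Assumes the user buffer itself is dense host memory.
bool can_share_buffer(const TensorDesc &user, const TensorDesc &internal) {
  assert(internal.stride_bytes >= internal.row_bytes);
  if (user.layout != internal.layout) return false;                // layout mismatch
  if (internal.stride_bytes != internal.row_bytes) return false;   // internal padding
  if (internal.memory != MemoryKind::Host) return false;           // e.g. GPU memory
  return true;
}
```

When the check fails, the runtime would fall back to the copy it performs today, so the optimization stays optional rather than mandatory.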

Possible solutions w/ pros and cons

Based on the Ideas above, let's think about their feasibility and pros/cons.

1. Revise API setInput/setOutput to get the internal memory buffer

  • Feasibility

    • Looks feasible

  • Considerations

    • If we support multiple async runs for a single session, it will be much more complicated
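A minimal sketch of what option 1 could look like from the user's side. `ToySession`, `input_buffer`, and `output_buffer` are hypothetical names, not existing nnfw_api functions, and the computation is a placeholder that doubles each element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical session; names do not come from nnfw_api.
class ToySession {
public:
  explicit ToySession(std::size_t n) : in_(n), out_(n) {}

  // Option 1: hand out runtime-owned memory instead of accepting a user buffer.
  float *input_buffer() { return in_.data(); }
  const float *output_buffer() const { return out_.data(); }
  std::size_t size() const { return in_.size(); }

  // Placeholder "inference" that reads and writes the internal tensors in place.
  void run() {
    assert(in_.size() == out_.size());
    for (std::size_t i = 0; i < in_.size(); ++i) out_[i] = in_[i] * 2.0f;
  }

private:
  std::vector<float> in_;
  std::vector<float> out_;
};
```

The user fills `input_buffer()` directly and reads the result from `output_buffer()`, so no copy happens at the API boundary. The sketch also hints at the consideration above: with a single pair of internal buffers, a second asynchronous run on the same session would overwrite memory the first run is still using.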

2. Revise the runtime to allow using memory from outside

Let's define a couple of terms.

Runtime MM (memory manager): The runtime has a memory manager for all tensors and passes memory handles to backends (what we want to do here)
Backend MM: Each backend has its own memory management system (as-is)

  • Feasibility

    • Need to check this is available in ACL backends

  • Considerations

    • Do we need to support both Backend MM and Runtime MM?

    • If we keep the as-is API, the user must provide sufficiently large output buffers, since we don't know the exact size of output tensors for dynamic models (same issue as with the Android NN API)

    • This is also mandatory if we want to optimize copying between backends (and possibly between subgraphs)
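Option 2 can be sketched as a session that borrows user memory instead of owning its I/O tensors. `BorrowingSession`, `bind_input`, and `bind_output` are hypothetical names for illustration, not existing runtime APIs; the placeholder computation adds 1 to each element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical session that uses caller-provided memory for its I/O tensors.
class BorrowingSession {
public:
  void bind_input(const float *data, std::size_t count) {
    in_ = data;
    in_count_ = count;
  }
  void bind_output(float *data, std::size_t capacity) {
    out_ = data;
    out_capacity_ = capacity;
  }

  // Runs directly on the bound memory and returns the number of elements written.
  std::size_t run() {
    assert(in_ != nullptr && out_ != nullptr);
    // The caller must provide enough room, as noted for dynamic models above.
    assert(out_capacity_ >= in_count_);
    for (std::size_t i = 0; i < in_count_; ++i) out_[i] = in_[i] + 1.0f;
    return in_count_;
  }

private:
  const float *in_ = nullptr;
  float *out_ = nullptr;
  std::size_t in_count_ = 0;
  std::size_t out_capacity_ = 0;
};
```

Binding the same buffer as the output of one session and the input of the next chains two models with no copies at the boundary, provided none of the constraints listed earlier apply.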


All 7 comments

  • Revise the runtime to allow using memory from outside

@wateret, in short, apart from the constraint cases, isn't all of this solved if just this single feature is supported?

@lemmaa Yes, as I stated, "do either one of these". However, that requires a huge amount of work. I will update the "Possible solutions" section soon.

Need to check this is available in ACL backends

If you plan to use tensor memory allocated by our runtime as the inputs/outputs of models, there are two issues.

  1. Padding
    If models are compiled separately, the padding of the previous model's output and the next model's input can be different.
  2. Dynamic tensor
    If the shape of the inputs changes on every execution, our runtime must sequentially call the configure methods of all layers and plan the memory size of tensors on every execution.

@ragmani

  1. Padding
    If models are compiled separately, the padding of the previous model's output and the next model's input can be different.

Yes, just like I mentioned in the Constraints section. This optimization is not always possible. In that case we may need to insert a conversion operation. The point is that by doing "Revise the runtime to allow using memory from outside", we give the runtime a chance not to copy inputs. If the conditions (such as no padding) are not met, we must still perform the copy. Right now, however, we always copy inputs.
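The "copy only when the conditions are not met" behaviour can be sketched as follows. `RowView` and `acquire_dense` are hypothetical names, and the sketch assumes the consumer needs densely packed rows; it is an illustration of the fallback idea, not the runtime's actual conversion operation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical view over a 2-D tensor whose rows may be padded.
struct RowView {
  const std::uint8_t *data;
  std::size_t row_bytes;     // bytes of real data per row
  std::size_t stride_bytes;  // bytes between the starts of consecutive rows
  std::size_t rows;
};

// Returns a pointer to densely packed rows. If the source is already dense it is
// returned as-is (no copy); otherwise rows are repacked into `scratch`.
const std::uint8_t *acquire_dense(const RowView &src, std::uint8_t *scratch, bool *copied) {
  assert(src.stride_bytes >= src.row_bytes);
  if (src.stride_bytes == src.row_bytes) {
    *copied = false;
    return src.data;
  }
  for (std::size_t r = 0; r < src.rows; ++r)
    std::memcpy(scratch + r * src.row_bytes, src.data + r * src.stride_bytes, src.row_bytes);
  *copied = true;
  return scratch;
}
```

The dense case takes the zero-copy path, while the padded case degrades to the same kind of copy that is always performed today.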

  2. Dynamic tensor
    If the shape of the inputs changes on every execution, our runtime must sequentially call the configure methods of all layers and plan the memory size of tensors on every execution.

As we discussed offline some time ago, yes, we will do that. It is the nature of ACL (CL), so I don't see any other solution.

@wateret

Yes, just like I mentioned in the Constraints section. This optimization is not always possible.

If inputs/outputs are dynamic tensors, their padding can change on every execution. In other words, some tensors may have padding on the second execution even if they had none on the first. We also have to take this into account to support reducing copies between models.

@ragmani I see, that means we have another constraint: dynamic tensors are not supported. Or did I misunderstand you?

@wateret
That's not what I mean. If a tensor does not have padding, there is no problem. So I think you can apply the copy reduction to dynamic tensors of the cpu backend.
