In most cases, onert uses the single fastest kernel for each operation in the cpu backend. For example, Conv2D (FP32) uses Eigen, and the FullyConnected operation with quantized weights uses the ruy library.
This policy has worked well until now, but I found a counter-example. When I tried the ruy library for the FullyConnected (FP32) layer, a benchmark showed that ruy is faster than the current kernel only in some cases. https://github.com/Samsung/ONE/issues/4482#issuecomment-704798829
Some models become 3x faster, while others become 3x slower.
I think it is better to support multiple kernels for one operation in onert and use one of them depending on each model.
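- Introduce OP_KERNEL_MAP environmental variable to select kernel type
- This variable is for testing only and its format is the same as OP_BACKEND_TYPE
- ex) OP_KERNEL_MAP="2=ruy;5=neon"
  - Operation 2 uses ruy library and operation 5 uses neon library

Any suggestions are welcome!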
/cc @Samsung/nnfw
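For illustration, here is a minimal sketch of how the proposed Op = Kernel format could be parsed. It is written in Python for brevity (onert itself is C++), and the function names `parse_op_kernel_map` and `load_op_kernel_map` are hypothetical, not part of onert.

```python
import os


def parse_op_kernel_map(value):
    """Parse an Op = Kernel string such as "2=ruy;5=neon" into {2: "ruy", 5: "neon"}."""
    mapping = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty value or a trailing ';'
        op_index, sep, kernel = entry.partition("=")
        kernel = kernel.strip()
        if not sep or not kernel:
            raise ValueError(f"malformed entry: {entry!r}")
        mapping[int(op_index)] = kernel  # int() raises ValueError on a non-numeric index
    return mapping


def load_op_kernel_map(environ=os.environ):
    """Read OP_KERNEL_MAP from the environment; an unset variable means no overrides."""
    return parse_op_kernel_map(environ.get("OP_KERNEL_MAP", ""))
```

With `OP_KERNEL_MAP="2=ruy;5=neon"` set, `load_op_kernel_map()` would return `{2: "ruy", 5: "neon"}`. The sketch assumes, as the proposal does, that every named kernel is available.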
As there can be lots of ops, we can think of another way, like ruy=2,3,10,11;neon=5,6,7
I don't know which would be better...
Just writing down what came to mind.
> As there can be lots of ops, we can think of another way, like
> ruy=2,3,10,11;neon=5,6,7
> I don't know which would be better...

I have a similar idea. We could extend the notation we already use. If a bare acl_cl maps acl_cl to all operations, then acl_cl=2,3,4,11 or acl_cl=2-4,11 could refer to a subset of operations in the same manner. Anyway, it's just one opinion. :)
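A minimal sketch of the Kernel = Op notation with the range shorthand suggested above, again in Python and with a hypothetical function name. The bare form (a backend name with no `=`, meaning "all operations") is deliberately left out, and rejecting an operation listed twice is my own assumption.

```python
def parse_kernel_op_map(value):
    """Parse "ruy=2,3,10,11;neon=5-7" into a per-operation dict {2: "ruy", ..., 5: "neon", ...}."""
    op_to_kernel = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        kernel, sep, ops = entry.partition("=")
        kernel = kernel.strip()
        if not sep or not kernel:
            raise ValueError(f"malformed entry: {entry!r}")
        for item in ops.split(","):
            first, dash, last = item.strip().partition("-")
            start = int(first)
            end = int(last) if dash else start  # "2-4" is an inclusive range
            for op_index in range(start, end + 1):
                if op_index in op_to_kernel:
                    raise ValueError(f"operation {op_index} is listed twice")
                op_to_kernel[op_index] = kernel
    return op_to_kernel
```

Both `acl_cl=2,3,4,11` and `acl_cl=2-4,11` produce the same mapping for operations 2, 3, 4 and 11.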
> - Introduce OP_KERNEL_MAP environmental variable to select kernel type
> - This variable is for testing only and its format is the same as OP_BACKEND_TYPE

Do we need to limit this to testing only? Isn't the purpose of the test eventually to find the best combination of kernels for each model? When the optimal combination is found, it would be good to use the manifest file of the NN package so that the information can serve as a hint at runtime. (Of course, if this is far from the original purpose of this issue, it could be discussed separately.)

> - ex) OP_KERNEL_MAP="2=ruy;5=neon"
>   - Operation 2 uses ruy library and operation 5 uses neon library

Trivial, but we also have to think about the meaning of ;. Although it may not have been intended, so far ; has separated fall-back targets. Would it mean the same fall-back here? Wouldn't it be better to use different symbols for different purposes? For example, : or so.
We also need to consider control flow.
FYI,
In the following code, mul's input shape changes inside the while loop:
the input of the first call to mul is small, but the input of the 100th call is huge.
So we should also consider a format like "n,m=ruy", where _n_ is the operation index and _m_ means the _m_-th call of operation _n_.
x = some tensor of shape [2, 2]
WHILE 100 times (input of WHILE is x)
    y = concat(x, x)
    z = mul(y, y)
    (output of WHILE op is z, which will be used as input x in next round of WHILE op)
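To make the growth concrete, here is a small sketch that tracks only the shapes in the loop above, plus a parser for the "n,m=kernel" idea. It is Python, the function names are hypothetical, and it assumes concat joins along axis 0 (the pseudocode does not say which axis).

```python
def mul_input_shapes(initial_shape, iterations, axis=0):
    """Yield the shape that mul receives on each round of the WHILE loop."""
    shape = list(initial_shape)
    for _ in range(iterations):
        shape[axis] *= 2    # y = concat(x, x) doubles one dimension (axis is an assumption)
        yield tuple(shape)  # z = mul(y, y) keeps y's shape and becomes the next x


def parse_call_kernel_map(value):
    """Parse "n,m=kernel" entries into {(n, m): kernel}; m is the 1-based call count."""
    mapping = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        key, sep, kernel = entry.partition("=")
        if not sep or not kernel.strip():
            raise ValueError(f"malformed entry: {entry!r}")
        n, m = (int(part) for part in key.split(","))
        mapping[(n, m)] = kernel.strip()
    return mapping


shapes = list(mul_input_shapes((2, 2), 100))
print(shapes[0])                  # (4, 2)
print(shapes[-1][0] == 2 ** 101)  # True
```

Under that assumption the first call sees 4 rows and the 100th call sees 2^101 rows, so a kernel that suits the first call need not suit the last one.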
@seanshpark @lemmaa Thanks! ruy=2,3,10,11;neon=5,6,7 looks better. Some models have a lot of operations, and this representation seems more suitable to support them.
@lemmaa
> Do we need to limit this to testing only? Isn't the purpose of the test eventually to find the best combination of kernels for each model? When the optimal combination is found, it would be good to use the manifest file of the NN package so that the information can serve as a hint at runtime. (Of course, if this is far from the original purpose of this issue, it could be discussed separately.)

I described this variable as testing-only because it cannot be used in applications. We need other ways, such as a manifest entry in nnpackage or an nnfw API, to support kernel selection if we want to use it in an application.

> Trivial, but we also have to think about the meaning of ;. Although it may not have been intended, so far ; has separated fall-back targets. Would it mean the same fall-back here? Wouldn't it be better to use different symbols for different purposes? For example, : or so.

I didn't think about fall-back targets. This variable just assumes that the selected kernel is available. ; is used as the separator between operations. : and ; look the same to me.

@hyunsik-yoon Wow... That's something I never thought of.
Defining the environment variable as "n,m=ruy" can handle the WHILE operation, but it can't handle the IF operation. Considering control flow looks far more complicated than expected.
@periannath , agree, it would be nice to start simply for experimental purposes. It's a good idea to leave notes on the assumptions, both to avoid unnecessary confusion and to guide future improvement. :)

> I described this variable as testing-only because it cannot be used in applications. We need other ways, such as a manifest entry in nnpackage or an nnfw API, to support kernel selection if we want to use it in an application.

I see it a little differently. Although this is starting out for experimental purposes, it could be applied to actual applications by adding a simple spec to the NN package. Of course, that would not be perfect. It would be enough to put the configurations that are passed as environment variables on the command line into the manifest of the NN package, as a single line, so that the runtime can interpret them. (Of course, this is separate from this issue. It's just a shared thought, so don't mind it.)

> I didn't think about fall-back targets. This variable just assumes that the selected kernel is available. ; is used as the separator between operations. : and ; look the same to me.

Yes. It's a trivial matter right now. I only hope that our notation will have a unified meaning in the future. :)
@periannath @lemmaa
Looks good for testing purposes. But I also see @lemmaa 's comment:

> I see it a little differently. Although this is starting out for experimental purposes, it could be applied to actual applications by adding a simple spec to the NN package.

And I would like us to think about these.
- Create ruy backend - then we can reuse OP_BACKEND_{OPNAME} (This way is the most proper way according to my initial design)
- Can we make cpu backend KernelGenerator smart enough to choose the better kernel considering the op's params and operands?

IMHO as we already have two places to manipulate kernel selection, I would like not to introduce another kernel-selection mechanism, for the sake of simplicity.
And one more thing I came up with:
- cpu backend is not part of onert, it is just the same as any other backend - it could be absent in some environments. So I'm not sure if it can be included in NN Package spec.
  - Maybe we could have it as hints, so the runtime can ignore it if unavailable
- I'm not sure what this exactly means - OP_KERNEL_MAP="2=ruy;5=neon", as ruy is a kernel neon is a backend

About the syntax, "n,m=ruy" I don't mind any syntax, but I wish all are consistent.
- ; is used as separator for all option variables, so if it is changed in one option we should change all
- OP_KERNEL_MAP and OP_BACKEND_MAP should have an identical syntax
- (... 2=cpu, then all operation ID 2 in all subgraphs are going to be assigned to cpu)

@wateret
> Create ruy backend - then we can reuse OP_BACKEND_{OPNAME} (This way is the most proper way according to my initial design)

I thought about that design. AFAIK, a lot of code, such as KernelGenerator, StaticTensorManager, and TensorBuilder, would have to be duplicated from the cpu backend to create ruy as a new backend. And it is not only ruy: other libraries such as XNNPACK can be used as well. Every time a backend is added, the total amount of code and the cost of maintaining it will increase.

> Can we make cpu backend KernelGenerator smart enough to choose the better kernel considering the op's params and operands?

We can build a heuristic that selects the proper kernel from an op's params and operands. However, that heuristic would be derived by profiling kernels on specific devices, so it may not work on other devices. If we have profile data, we can choose the kernels that give the fastest inference time on the target device.
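As a rough sketch of the profile-driven alternative: given per-operation timings measured on the target device, pick the fastest kernel for each operation and emit the result in the Op = Kernel format. This is Python, the function names are hypothetical, `"default"` is a placeholder name for the current kernel, and the latencies are made up for illustration, not benchmark results.

```python
def pick_fastest_kernels(profile):
    """profile maps op_index -> {kernel_name: latency_ms}; return op_index -> fastest kernel."""
    return {op: min(timings, key=timings.get) for op, timings in profile.items()}


def to_op_kernel_map(selection):
    """Render {2: "ruy", 5: "default"} as "2=ruy;5=default" (Op = Kernel format)."""
    return ";".join(f"{op}={kernel}" for op, kernel in sorted(selection.items()))


# Made-up latencies in milliseconds, only to show both outcomes.
profile = {
    2: {"default": 3.0, "ruy": 1.0},  # ruy wins for this operation
    5: {"default": 1.2, "ruy": 3.6},  # the current kernel wins for this one
}
selection = pick_fastest_kernels(profile)
print(to_op_kernel_map(selection))  # 2=ruy;5=default
```

Because the table comes from measurements on one device, it would have to be regenerated per target device, which is the point made above about heuristics.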
> - cpu backend is not part of onert, it is just the same as any other backend - it could be absent in some environments. So I'm not sure if it can be included in NN Package spec.
>   - Maybe we could have it as hints, so the runtime can ignore it if unavailable

I didn't think about environments where the cpu backend is not available. As you said, treating it as a hint would be better.
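A minimal sketch of what "treat it as a hint" could mean: requested kernels that the runtime does not have are set aside instead of causing a failure. This is Python with a hypothetical function name; how a real runtime would discover its available kernels is not covered here.

```python
def apply_kernel_hints(hints, available_kernels):
    """Split per-operation hints into those the runtime can honor and those it ignores."""
    honored, ignored = {}, {}
    for op_index, kernel in hints.items():
        target = honored if kernel in available_kernels else ignored
        target[op_index] = kernel
    return honored, ignored


# If only ruy is present, the neon hint is ignored and operation 5 keeps its default kernel.
honored, ignored = apply_kernel_hints({2: "ruy", 5: "neon"}, available_kernels={"ruy"})
print(honored)  # {2: 'ruy'}
print(ignored)  # {5: 'neon'}
```

Returning the ignored hints separately leaves room for logging them, which may help when a package tuned on one device runs on another.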
> I'm not sure what this exactly means - OP_KERNEL_MAP="2=ruy;5=neon", as ruy is a kernel neon is a backend

I use neon to mean the NEON kernels of the cpu backend. Some kernels use NEON instructions and others use plain C++ code only. I just want to distinguish the two kernel types in the current cpu backend.
I'd like to share a short note about the offline discussion with @periannath yesterday:
// x and y are inputs, and model output is 'mul(x, y)'
nnfw_prepare()
nnfw_set_input_tensorinfo( x, to some shape)
nnfw_set_input_tensorinfo( y, to some shape)
nnfw_run()
In some cases x and y are small, and in other cases x and y are large. The exact shape can then be known only at runtime, so kernels can be selected only at runtime.
(We need some additional kernel-mapping table that can be used at runtime.)
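One possible shape for such a runtime table, sketched in Python: once the input shape is known, look up a kernel by element count. The table contents, the threshold, and the choice of which kernel suits small or large inputs are all made-up assumptions, and `select_kernel_for_shape` is a hypothetical name.

```python
from math import prod

# Hypothetical table: (upper bound on element count, kernel name), sorted by bound.
# The threshold and the kernel choices are illustrative only, not measured.
KERNEL_TABLE = [
    (1024, "neon"),
    (float("inf"), "ruy"),
]


def select_kernel_for_shape(shape, table=KERNEL_TABLE):
    """Return the first kernel whose bound covers the tensor's element count."""
    size = prod(shape)
    for limit, kernel in table:
        if size <= limit:
            return kernel
    raise ValueError(f"no table entry covers {size} elements")


print(select_kernel_for_shape((2, 2)))        # neon
print(select_kernel_for_shape((1024, 1024)))  # ruy
```

In the flow above, the lookup would happen after the input tensor info is set, since that is the first point at which the shape is known.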
@periannath , is your thought like this?
* `OP_BACKEND_TYPE` will let scheduler choose from the items given. this value contains candidates to choose which backend
* `OP_KERNEL_MAP` will specifically inform which `cpu` backend kernel for the specific operator. if the scheduler choose not `cpu` then the value will be ignored
  * do we have an implementation for this in current scheduler? or you are going to introduce such logic?

If `OP_KERNEL_MAP` is not just for `cpu` backend, then we can think of a specific kernel name for each operator
@periannath @lemmaa
What I'm most concerned about is whether we can have these things while meeting these conditions:
For details:
- cpu backend is just a backend like others. onert core does not handle it specially. (There is some code that requires cpu backend, but it is a temporary workaround.)
- cpu backend implementation can change
- (...) #ifdef USE_RUY_GEMV

@seanshpark
> @periannath , is your thought like this?
> * `OP_BACKEND_TYPE` will let scheduler choose from the items given. this value contains candidates to choose which backend
> * `OP_KERNEL_MAP` will specifically inform which `cpu` backend kernel for the specific operator. if the scheduler choose not `cpu` then the value will be ignored

Yes it is.
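The agreed semantics can be sketched as a filter: a kernel override only takes effect for operations that the scheduler has assigned to the cpu backend. This is Python, and the function name and the dict-based representation of the scheduler's result are assumptions for illustration.

```python
def effective_kernel_overrides(backend_of_op, op_kernel_map):
    """Keep only the overrides for operations that were assigned to the cpu backend."""
    return {
        op_index: kernel
        for op_index, kernel in op_kernel_map.items()
        if backend_of_op.get(op_index) == "cpu"
    }


# Operation 5 went to acl_cl, so its entry in the kernel map is ignored.
overrides = effective_kernel_overrides({2: "cpu", 5: "acl_cl"}, {2: "ruy", 5: "neon"})
print(overrides)  # {2: 'ruy'}
```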
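> * do we have an implementation for this in current scheduler? or you are going to introduce such logic?

- We don't have any logic to choose kernel type in current scheduler. When this option is introduced, I will use it in KernelGenerator to choose kernel of FullyConnected layers in some models.

@periannath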
> Create ruy backend - then we can reuse OP_BACKEND_{OPNAME} (This way is the most proper way according to my initial design)
> I thought about that design. AFAIK, a lot of code, such as KernelGenerator, StaticTensorManager, and TensorBuilder, would have to be duplicated from the cpu backend to create ruy as a new backend. And it is not only ruy: other libraries such as XNNPACK can be used as well. Every time a backend is added, the total amount of code and the cost of maintaining it will increase.

Agree. With the current backend API, there will be a lot of work. But what if we simplified the backend API? (I have been having discussions with @YongseopKim , soon I will raise an issue.)
> Can we make cpu backend KernelGenerator smart enough to choose the better kernel considering the op's params and operands?

Just to clarify this, I meant KernelGenerator and ***Layer (for dynamic tensor cases).
> * do we have an implementation for this in current scheduler? or you are going to introduce such logic?
> - We don't have any logic to choose kernel type in current scheduler. When this option is introduced, I will use it in KernelGenerator to choose kernel of FullyConnected layers in some models.

As I see this now, introducing a kernel-choosing API would also complicate automatic scheduler implementations.
@wateret
> Agree. With the current backend API, there will be a lot of work. But what if we simplified the backend API? (I have been having discussions with @YongseopKim , soon I will raise an issue.)

If the backend implementation is simplified, it would be better to use the current backend API. Then we can reuse HEScheduler for kernel scheduling.
Added) I have one question. Can we change the backend assigned to an operation during runtime? We may need to change the kernel assigned to an operation during runtime, depending on runtime status.
> Can we change the backend assigned to an operation during runtime? We may need to change the kernel assigned to an operation during runtime, depending on runtime status.

No. 😢 We should implement the whole infra for that.
There is an old thread in our old repo - https://github.sec.samsung.net/STAR/nnfw/issues/1197#issuecomment-90097
1. CPU/GPU static scheduling + static execution (Goal by end of June)
2. CPU/GPU adaptive scheduling + static execution
3. CPU/GPU static/adaptive scheduling + CPU/GPU co-execution
4. CPU/GPU adaptive scheduling + CPU/GPU adaptive execution

What you are describing is 2 and 4, and those have not been discussed since then.
Let's summarize this discussion here.
- Format of OP_KERNEL_MAP
  - Op = Kernel: 2=ruy;5=neon
  - Kernel = Op: ruy=2,3,10,11;neon=5,6,7
- How to implement new kernel?
  - KernelGenerator or *Layer
  - We're going to introduce some kernels as backend. (RUY backend #4863 and XNNPACK backend #4968)
  - Maybe it is better to support infra for runtime backend switching.
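The two formats in the summary carry the same information, so either can be generated from a per-operation mapping. A short Python sketch with hypothetical function names:

```python
from collections import defaultdict


def to_op_kernel_format(op_to_kernel):
    """Render {2: "ruy", 5: "neon"} as "2=ruy;5=neon" (Op = Kernel)."""
    return ";".join(f"{op}={op_to_kernel[op]}" for op in sorted(op_to_kernel))


def to_kernel_op_format(op_to_kernel):
    """Render the same mapping grouped by kernel, e.g. "ruy=2,3;neon=5" (Kernel = Op)."""
    grouped = defaultdict(list)
    for op in sorted(op_to_kernel):
        grouped[op_to_kernel[op]].append(op)
    # Kernels appear in order of their lowest operation index.
    return ";".join(
        f"{kernel}={','.join(str(op) for op in ops)}" for kernel, ops in grouped.items()
    )


mapping = {2: "ruy", 3: "ruy", 10: "ruy", 11: "ruy", 5: "neon", 6: "neon", 7: "neon"}
print(to_kernel_op_format(mapping))               # ruy=2,3,10,11;neon=5,6,7
print(to_op_kernel_format({2: "ruy", 5: "neon"}))  # 2=ruy;5=neon
```

The Kernel = Op form stays short when many operations share a kernel, which matches the reason it was preferred in this thread.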