TVM: [TUTORIAL] Tutorial for inline micro-kernel ASM

Created on 4 Oct 2018 · 11 Comments · Source: apache/tvm

Background: inline asm is supported by #1276. According to @ajtulloch's experiments, it can be very useful for reaching state-of-the-art performance on certain platforms.

As with https://github.com/dmlc/tvm/pull/1774, we would like to have a tutorial introducing this feature to users. I think it would also be very helpful to have a plan for including some of the useful inline asm kernels.

help wanted

tqchen

All 11 comments

@ajtulloch would you be interested in sending in something :)?

tqchen on 4 Oct 2018

@tqchen sure, I can contribute this. Would something like a tutorial that modifies opt_gemm.py to hit peak performance on AVX2 be reasonable?

ajtulloch on 4 Oct 2018

👍2

Yes, that would be awesome. It would also be nice to put it in further context with conv2d (maybe as a separate tutorial), as that is what we usually care about.

tqchen on 4 Oct 2018

👍2

@ajtulloch I am also interested in the ARM / ARM64 platforms. It would be nice if you could leverage this tutorial: https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html

FrozenGene on 19 Oct 2018

We are working on custom compute engines whose instruction sets represent the universe of linear algebra. So we have instructions for dot, matvec, and matmul, but also cg, bicgstab, cholesky, etc. Think of what Tensilica did for custom instructions in a von Neumann machine; we do the same for distributed data flow machines.

Would this be the environment/discussion/tutorial in which these approaches to accelerate bottleneck operators could be introduced and expanded?

Ravenwater on 5 Nov 2018

@Ravenwater this issue is about the specific aspect of using micro-asm kernels. Your proposal seems to be broader and could be related to the general high-level IR in TVM, and we would love to see contributions on that front as well.

tqchen on 7 Nov 2018

@tqchen I see. I am looking for the interfaces that define the abstract hardware instructions of the execution engine. In our hardware, we separate the memory access patterns from the computation so that we can build memory streams that have optimal system resource utilization and performance properties. The hardware has coarse data flow command packet semantics, which, when mapped against the IR, are likely to collapse many nodes into a single instruction. If you can point me to the right interfaces, we can explore how to contribute our high-performance data flow hardware.

Ravenwater on 8 Nov 2018

Reviving this thread a bit. cc @ajtulloch @cowanmeg

tqchen on 13 Dec 2018

Trying to revive this thread a bit. @FrozenGene would you be interested in contributing a tutorial?

tqchen on 28 Jul 2019

@tqchen OK. What hardware would we like to focus on? I have to say I am busy with our internal DSP work, but I will try to do it as soon as possible.

FrozenGene on 29 Jul 2019

I think the idea is to demonstrate the feature (inline asm and how to use it), so we just have to do it for, say, x86.

tqchen on 29 Jul 2019

👍1
