Tvm: [TUTORIAL] Tutorial for inline micro-kernel ASM

Created on 4 Oct 2018 · 11 Comments · Source: apache/tvm

Background: inline asm is supported by #1276. According to @ajtulloch's experiment, it can be very useful for reaching state-of-the-art performance on certain platforms.

As with https://github.com/dmlc/tvm/pull/1774, we would like to have a tutorial introducing this feature to users. I think it would also be very helpful to have a plan for including some of the useful inline asm kernels.

help wanted

All 11 comments

@ajtulloch would you be interested in sending in something :)?

@tqchen sure, I can contribute this. Would something like a tutorial that modifies opt_gemm.py to hit peak performance on AVX2 be reasonable?

Yes, that would be awesome. It would also be nice to put it in further context with conv2d (maybe as a separate tutorial), as that is what we usually care about.

@ajtulloch I am also interested in the ARM / ARM64 platforms. It would be nice if you could leverage this tutorial: https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html

We are working on custom compute engines whose instruction sets represent the universe of linear algebra. So we have instructions for dot, matvec, and matmul, but also for cg, bicgstab, cholesky, etc. What Tensilica did for custom instructions in a von Neumann machine, we do for distributed data flow machines.

Would this be the environment/discussion/tutorial in which these approaches to accelerate bottleneck operators could be introduced and expanded?

@Ravenwater This issue is about the specific aspect of using micro-asm kernels. Your proposal seems broader and could be related to the general high-level IR in TVM; we would love to see contributions on that front as well.

@tqchen I see, I am looking for the interfaces that define the abstract hardware instructions of the execution engine. In our hardware, we separate the memory access patterns from the computation so that we can build memory streams that have optimal system resource utilization and performance properties. The hw has a coarse data flow command packet semantic, which when mapped against the IR is likely to collapse many nodes into a single instruction. If you can point me to the right interfaces, we can explore how to contribute our high-performance data flow hardware.

Reviving this thread a bit. cc @ajtulloch @cowanmeg

Trying to revive this thread a bit. @FrozenGene, would you be interested in contributing a tutorial?

@tqchen OK. What hardware would we like to focus on? I have to say I am busy with our internal DSP work, but I will try to do it as soon as possible.

I think the idea is to demonstrate the feature (inline asm and how to use it), so we just have to do it for, say, x86.
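
For readers wondering what an x86 micro-kernel with inline asm might look like, here is a minimal sketch in C. It is not taken from #1276 or from any TVM tutorial, and it does not show how the kernel would be hooked into a TVM schedule. The function name `axpy4` is hypothetical. It assumes a GCC- or Clang-compatible compiler and uses baseline SSE rather than AVX2 so that it runs on any x86-64 machine; on other targets it falls back to plain C. The operation, `acc[i] += a * b[i]` over four floats, is one row of the rank-1 update that sits at the core of a GEMM micro-kernel.

```c
/* Hypothetical micro-kernel sketch: acc[0..3] += a * b[0..3].
 * Assumes GCC/Clang extended inline asm (AT&T syntax) on x86-64. */
static void axpy4(float *acc, float a, const float *b) {
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile(
        /* xmm0 = b[0..3] (unaligned load) */
        "movups (%[b]), %%xmm0\n\t"
        /* xmm1 lane 0 = a, then broadcast lane 0 to all four lanes */
        "movss %[a], %%xmm1\n\t"
        "shufps $0, %%xmm1, %%xmm1\n\t"
        /* xmm0 = a * b[0..3] */
        "mulps %%xmm1, %%xmm0\n\t"
        /* xmm0 += acc[0..3], then store the result back to acc */
        "movups (%[acc]), %%xmm2\n\t"
        "addps %%xmm2, %%xmm0\n\t"
        "movups %%xmm0, (%[acc])\n\t"
        : /* no register outputs; the result is written through acc */
        : [acc] "r"(acc), [b] "r"(b), [a] "m"(a)
        : "xmm0", "xmm1", "xmm2", "memory");
#else
    /* Portable fallback with the same semantics. */
    for (int i = 0; i < 4; ++i) {
        acc[i] += a * b[i];
    }
#endif
}
```

A real micro-kernel would keep a whole tile of accumulators in registers across the reduction loop instead of loading and storing `acc` on every call; this sketch only shows the inline asm mechanics.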
