Tvm: Upgrade AutoTensorCore to a TIR Pass

Created on 6 Jun 2020  ·  6 Comments  ·  Source: apache/tvm

AutoTensorCore is a pattern detection util that detects the matrix multiplication pattern, and rewrites the compute to make use of the tensorcore intrinsics. https://github.com/apache/incubator-tvm/pull/4234

However, because part of the pattern analysis depends on the tensor expression information and couples with the analysis, it does not qualify as a pass.

Under the unified IR, a transformation pass should take in information from a PrimFunc in the IRModule and output the result as another PrimFunc. A pass should not take additional information from the high-level DSL stages. That being said, we could apply transformations at the high level to add decorations (e.g. pragmas) to a loop to achieve the same goal.
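To make the constraint concrete, here is a minimal sketch in plain Python. It is a toy model, not the TVM API: `Loop`, `ToyPrimFunc`, `toy_pass`, and the `"tensor_core"` pragma name are all hypothetical stand-ins. The point it illustrates is that the pass sees only the function and the decorations already attached to its loops, and returns a new function without consulting any higher-level stage.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Loop:
    """Toy loop node: a loop variable, an extent, a body, and optional pragmas."""
    var: str
    extent: int
    body: Union["Loop", str]
    pragmas: Tuple[str, ...] = ()


@dataclass(frozen=True)
class ToyPrimFunc:
    """Toy stand-in for a function in a module (assumption, not tvm.tir.PrimFunc)."""
    body: Union[Loop, str]


def rewrite_tagged(node):
    """Replace any loop carrying the hypothetical 'tensor_core' pragma with a marker."""
    if isinstance(node, str):
        return node
    if "tensor_core" in node.pragmas:
        return f"tensor_core_intrin({node.var}, extent={node.extent})"
    return Loop(node.var, node.extent, rewrite_tagged(node.body), node.pragmas)


def toy_pass(mod: Dict[str, ToyPrimFunc]) -> Dict[str, ToyPrimFunc]:
    """Module in, module out: every function is rewritten from its own contents only."""
    return {name: ToyPrimFunc(rewrite_tagged(f.body)) for name, f in mod.items()}


# The decoration is attached before the pass runs (e.g. by a high-level transform).
inner = Loop("k", 16, "C[i] += A[i, k] * B[k]", pragmas=("tensor_core",))
mod = {"main": ToyPrimFunc(Loop("i", 16, inner))}
out = toy_pass(mod)
```

Because the nodes are immutable, the input module is left untouched and the rewritten function is a fresh value, which mirrors the "take a PrimFunc, output another PrimFunc" contract described above.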

Due to the current restriction, the AutoTensorCore rewrite has been temporarily moved into a special post-processor in
https://github.com/apache/incubator-tvm/blob/master/src/te/schedule/schedule_postproc_rewrite_for_tensor_core.cc

However, this rewrite should really qualify as a pass. As part of the unified IR effort, we want to reduce "non-pass" transformations to a minimum set (only lowering from te to TIR).

This issue tracks the problem and discusses potential solutions. There are two potential ways to migrate the pass.

  • E0: Directly migrate the matmul pattern detector to search over the loop nest, instead of the te stage.
  • E1: If analysis on the te stage is necessary, run a lightweight transformation on te to tag the tensor-core-related information.

Ideally E0 is preferred. Notably, @Hzfengsy is also working on related changes to TIR to make direct pattern detection in the TIR easier.
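As a rough illustration of what E0 could look like, the sketch below detects a matmul-shaped update by inspecting only the loop nest and its buffer accesses. It is a toy model in plain Python, not the existing AutoTensorCore detector and not the TVM API; `Access`, `MulAccumulate`, `LoopNest`, and `detect_matmul` are hypothetical names. The assumed rule is: the reduction axis appears in both operands but not in the output, and each remaining axis appears in the output and exactly one operand.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Access:
    """A buffer access such as A[i, k]."""
    buffer: str
    indices: Tuple[str, ...]


@dataclass(frozen=True)
class MulAccumulate:
    """A statement of the form out += lhs * rhs."""
    out: Access
    lhs: Access
    rhs: Access


@dataclass(frozen=True)
class LoopNest:
    """Toy loop nest: the enclosing loop variables and the innermost update."""
    loop_vars: Tuple[str, ...]
    update: MulAccumulate


def detect_matmul(nest: LoopNest) -> Optional[Dict[str, str]]:
    """Return the loop variables playing the m/n/k roles, or None if no match."""
    u = nest.update
    if not all(len(a.indices) == 2 for a in (u.out, u.lhs, u.rhs)):
        return None
    out, lhs, rhs = set(u.out.indices), set(u.lhs.indices), set(u.rhs.indices)
    k_axes = (lhs & rhs) - out   # reduced away: in both operands, not in output
    m_axes = (lhs & out) - rhs   # shared by lhs and output only
    n_axes = (rhs & out) - lhs   # shared by rhs and output only
    if not (len(k_axes) == len(m_axes) == len(n_axes) == 1):
        return None
    roles = {"m": next(iter(m_axes)), "n": next(iter(n_axes)), "k": next(iter(k_axes))}
    # Every role must be bound by an enclosing loop, and no loop may be left over.
    if set(roles.values()) != set(nest.loop_vars):
        return None
    return roles


nest = LoopNest(
    loop_vars=("i", "j", "k"),
    update=MulAccumulate(
        out=Access("C", ("i", "j")),
        lhs=Access("A", ("i", "k")),
        rhs=Access("B", ("k", "j")),
    ),
)
roles = detect_matmul(nest)
```

The check is order-insensitive on indices, so a transposed operand such as `B[j, k]` still matches; a real detector would also need to verify data types, tile extents, and memory scopes before rewriting to tensor core intrinsics.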

All 6 comments

cc @minminsun @Hzfengsy @merrymercy @jcf94, @yangjunpro

#5498 is kind of related to this one.

The current TensorCore code generation is tricky: since we have no fundamental warp reduction support in TVM, generating TensorCore code inevitably messes up the thread binding. No threadIdx was bound to a reduce axis before, but we inevitably need a threadIdx for that reduce axis.

Is it possible to somehow represent the warp-wise reduce in the schedule, so that the TIR analyzer and rewriter can detect this opportunity to match TensorCore?
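For readers unfamiliar with the term, a warp-wise reduce combines one value per lane of a warp into a single result, typically with a shuffle-down tree. The sketch below simulates that pattern in plain Python; it is only a model of the idea, not TVM or CUDA code, and `warp_reduce_sum` is a hypothetical name. It shows why the reduce axis ends up mapped onto thread lanes.

```python
def warp_reduce_sum(lane_values):
    """Simulate a shuffle-down tree reduction across the lanes of one warp.

    Each list element models the value held by one lane. At every step, lane l
    adds the value held by lane l + offset. In this model a lane whose partner
    is out of range adds nothing; only lane 0's final value is used.
    """
    vals = list(lane_values)
    width = len(vals)
    # The tree halves the offset each step, so the width must be a power of two.
    assert width > 0 and width & (width - 1) == 0
    offset = width // 2
    while offset > 0:
        vals = [
            vals[lane] + (vals[lane + offset] if lane + offset < width else 0)
            for lane in range(width)
        ]
        offset //= 2
    return vals[0]


# A 32-lane warp where lane l holds the value l.
total = warp_reduce_sum(range(32))
```

With 32 lanes the loop runs five steps (offsets 16, 8, 4, 2, 1), after which lane 0 holds the sum of all lanes.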

cc @minminsun @Hzfengsy @merrymercy @jcf94, @yangjunpro

Thanks! We've also noticed that the current implementation of AutoTensorCore is not pretty enough.
We're now working with @merrymercy on enabling TVM to auto-generate schedules in the Ansor project, and we also had some discussions with @Hzfengsy during the development.
Auto TensorCore codegen support is an important feature for us; we'll continue to work on it and try to figure out a better way.

Thanks @jcf94 . I agree that making a better approach for tensorization is important and we should continue to push that direction.

My specific issue is about what to do with the pass in its current state. Specifically, there are a few options:

  • A0: Temporarily remove AutoTensorCore, assuming it has limited use cases and most current TensorCore usage goes through the tensorcore intrinsics, and add a better solution back later.
  • A1: Migrate AutoTensorCore to TIR by pattern matching the loop nest (instead of the compute expression) so that it can become a TIR pass, and replace the pass later once we have a better solution.
  • A2: Keep AutoTensorCore in its current location, which introduces a friction point in the overall design, maintain it as we refactor the code base while tolerating the design friction, and remove it once we have a better solution.

In all of these cases, the AutoTensorCore pass as it is will eventually be removed once we find a better solution. The options have different pros and cons. A2 brings a design friction point into the overall architecture and could cause problems if we want to release before we find a better solution. A0 is the easiest for the developers, but it also means the feature will be unavailable until we find a better solution. A1 maintains the codebase while continuing to migrate it to a better state that fits the current design; of course, it puts more demands on the developers themselves.

This is an interesting case: the code itself becomes a technical debt that we need to pay as maintainers. It is fun to develop new features. In the meanwhile, maintaining existing code, revisiting the design and continuing to migrate it to a better infrastructure is equally important, if not more important, for a healthy project, as all new features eventually become technical debt when other new features are added on top. It is important for us to keep innovating on and refactoring the infrastructure to reduce the number of key concepts back to a minimum, so that we can more effectively evolve to deliver great new features.

Would love to see thoughts on the three options.

Thanks @tqchen.

It is fun to develop new features. In the meanwhile, maintaining existing code, revisiting the design and continuing to migrate it to a better infrastructure is equally important

Can't agree more!
Just as @jcf94 said, this pass is required for Ansor to generate code for TensorCore, so we prefer not to remove it for now. We will try to figure out the possibility of matmul pattern matching on TIR instead of TE.

Closing for now due to inactive status.
