TVM v0.5 Roadmap

Created on 13 Aug 2018 · 27 Comments · Source: apache/tvm

This is the roadmap for TVM v0.5. TVM is a community-driven project and we love your feedback and proposals on where we should be heading. Please open discussions in the discussion forum and bring RFCs.

Feel free to volunteer if you are interested in trying out some items (they do not have to be on the list).
Please also check out the help wanted list in the GitHub issues for things that need help.

Features

Fully featured 8-bit network support
- [x] 8-bit quantizer
- [x] arbitrary bits quantization algorithm
- [x] ARM support
- [x] Intel CPU support
NVIDIA GPU 8-bit kernel
- [x] int8 gemm recipe
- [x] int8 conv2d
- [x] autotvm integration
Automated tuning and scheduling
- [x] AutoTVM optimizations for mobile GPUs
- [x] AutoTVM optimizations for CUDA
- [x] AutoTVM for x86
- [ ] graph level automated optimization
Ultra low-bit support
- [ ] tutorials of low-bit ops
- [ ] customized accelerator support
VTA enhancements
- [ ] support generic high level models
- [ ] Enhanced operator/model coverage
- [ ] Ultra-96, ZCU102 support
- [ ] Amazon F1 preliminary support
- [ ] Low-bit support, bit serial support
- [ ] Chisel version
High level IR improvements
- [x] A more coupled design with tvm runtime system
- [x] support control flows
- [x] Type system support
Runtime
- [x] Heterogeneous runtime
Micro-asm kernel exploration
- [ ] Core micro-asm primitives for certain ops
Hybrid python programming model
- [ ] transition of vision operators to hybrid mode.
RPC and Device API
- [ ] Support a C++ version of cross platform RPC
Security
- [x] tutorials on how to use SGX backend
Tutorials and docs
- [x] How to write a pass in python
- [x] General lowering flow of TVM
Language runtime
- [x] Golang runtime
- Rust support
  - [x] rust runtime
  - [x] rust frontend

tqchen

👍33 ❤12

All 27 comments

Shall we add heterogeneous graph runtime? @zhiics is working on that.

yzhliu on 13 Aug 2018

👍2

I am interested in implementing the Intel CPU support for INT8 quantization.

anijain2305 on 14 Aug 2018

👍4

I'm interested in implementing the Rust runtime.

siju-samuel on 14 Aug 2018

👍1

@tqchen @siju-samuel My Rust runtime (dylib) support, which follows the same generic API as Java, for example (CPU, GPU, etc.), is 70%-ish done! I still need to finish the callback support, add docs and clean up. Any contributions are welcome!

@nhynes Rust static support is in good shape as well, but it is specific to CPU with a custom allocator etc.

ehsanmok on 14 Aug 2018

👍1

@ehsanmok OK
Is anyone doing "Support a C++ version of cross platform RPC"? If not, I'm interested in taking this up.

siju-samuel on 14 Aug 2018

@tqchen I have started working on the 8-bit quantizer and its operator support for conv2d, dense and relu. To avoid duplicate work, please let me know if anyone else is working on this.

PariksheetPinjari909 on 14 Aug 2018
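
For readers new to the topic, the idea behind an 8-bit quantizer feeding dense and relu operators can be sketched as follows. This is a minimal illustration assuming symmetric per-tensor quantization; it is not TVM's quantizer, and the names `quantize_symmetric`, `dense_int8` and `relu` are hypothetical. The `num_bits` parameter hints at how the same scheme extends to other bit widths.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dense_int8(x_q, x_scale, w_q, w_scale):
    """Integer dot product with a wide accumulator, rescaled back to float."""
    acc = sum(a * b for a, b in zip(x_q, w_q))  # accumulate in a wide integer
    return acc * x_scale * w_scale


def relu(v):
    return max(0.0, v)


# Assumed toy data, for illustration only.
x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -2.0]
x_q, x_scale = quantize_symmetric(x)
w_q, w_scale = quantize_symmetric(w)
approx = dense_int8(x_q, x_scale, w_q, w_scale)
exact = sum(a * b for a, b in zip(x, w))
```

Comparing `approx` with `exact` shows the quantization error for one output element; a real quantizer also has to choose scales per layer or per channel and decide where to requantize between operators.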

PR for static Rust runtime in https://github.com/dmlc/tvm/issues/1597.

@ehsanmok I'm not sure what you mean by "custom allocator etc." It uses whatever GlobalAlloc you care to use.

nhynes on 14 Aug 2018

@nhynes I meant that you've defined your own allocator, threading and parallel backend support, for CPU use only, for staticlib compilation with xargo, while I've taken a different route that relies on existing layouts, for example, and seems to work for GPU. Though I admit I did the project for my own enrichment first.

ehsanmok on 14 Aug 2018

@PariksheetPinjari909 the UW SAMPL team is working on a generic n-bit quantizer, and hopefully things will get RFCed and upstreamed in this release cycle.

tqchen on 14 Aug 2018

Please feel free to open new issues to track the work items. @siju-samuel standalone RPC is tracked by https://github.com/dmlc/tvm/issues/1496

tqchen on 14 Aug 2018

The first post contains an initial list of things based on the community feedback. Please also feel free to propose new things and we will add them to the roadmap.

tqchen on 14 Aug 2018

Will the new graph runtime make it into this release? I'd love to upstream some training code, but it all depends on the semi-kluge FExpandCompute.

nhynes on 14 Aug 2018

@nhynes it belongs to the "high-level IR improvements"

tqchen on 14 Aug 2018

@tqchen OK. Let me know what support I can give on 8-bit quantization. I am interested in contributing here.

PariksheetPinjari909 on 14 Aug 2018

👍1

I would like to take up the control flow ops. Let me know if someone is working on that.

PariksheetPinjari909 on 14 Aug 2018

@PariksheetPinjari909 We will make a major RFC to upgrade the IR system, including control flow ops and the type system. After the first phase proposal is done, everyone is welcome to contribute.

tqchen on 14 Aug 2018

👍1

Sorry for being late. I'd like to add preliminary support for an HLS scheduler to allow compiling actual neural networks with the AOCL and SDAccel backends.

kazum on 16 Aug 2018

👍3

int8 cuda gemm recipe https://github.com/dmlc/tvm/pull/1614

tqchen on 21 Aug 2018

@tqchen from TVM perspective, any comments on ONNXIFI? I'm thinking about how TVM stack can fit into it.

JammyZhou on 22 Aug 2018

👍4

Re microkernels/tensorization, I've been looking at that stuff the last few months or so. There's some WIP stuff in https://github.com/ajtulloch/tvm/tree/tvm-using-val/tensorize, notably well-tuned assembly versions of:

- FP32 GEMM kernels (ARMv7, AVX2)
- Int8 x Int8 -> Int32 GEMM kernels (AVX2, adding ARMv7 shortly)

My hypothesis is that we can get a pretty decent part of the way with just GEMM microkernels for a lot of these dense workloads, but it's to-be-tested currently.

Some examples of using them in GEMM-based convs and for the batch gemm of a minimal F(6x6, 3x3) Winograd (~2-3x faster than current trunk on most configurations for ARMv7) are in that dir as well. For folks interested in the "Micro-asm kernel exploration" and "8-bit network stuff" (esp on CPUs), it'd be good to collaborate :).

ajtulloch on 25 Aug 2018
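
The "GEMM-based conv" idea mentioned above can be illustrated with a small im2col sketch: each input patch is unrolled into a row, so the convolution becomes one matrix multiply that a tuned GEMM microkernel could serve. This is a minimal pure-Python sketch assuming a single input channel, stride 1 and no padding; the function names are hypothetical and it is not the code in the linked directory.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def matmul(a, b):
    """Plain triple-loop GEMM; a tuned microkernel would replace this."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def conv2d_gemm(image, kernels):
    """Convolution as im2col followed by GEMM. Returns [filter][row][col]."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)                   # (oh*ow) x (kh*kw)
    weights = [[k[di][dj] for k in kernels]           # (kh*kw) x n_filters
               for di in range(kh) for dj in range(kw)]
    flat = matmul(patches, weights)                   # (oh*ow) x n_filters
    return [[[flat[i * ow + j][c] for j in range(ow)] for i in range(oh)]
            for c in range(len(kernels))]


def conv2d_direct(image, kernels):
    """Reference direct convolution (cross-correlation) for comparison."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[[sum(image[i + di][j + dj] * k[di][dj]
                  for di in range(kh) for dj in range(kw))
              for j in range(ow)] for i in range(oh)]
            for k in kernels]


# Assumed toy inputs, for illustration only.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernels = [
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    [[0, 1, 0], [1, -4, 1], [0, 1, 0]],
]
```

Both functions produce the same output on these inputs. The trade-off is that im2col duplicates input data to gain a GEMM-shaped inner loop, whereas a direct convolution avoids the copy but needs its own tuned loop nest.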

@ajtulloch I am working on Intel 8-bit Conv implementation using Intel Skylake AVX512 instructions (with the long-term goal of using VNNI instructions). I am not using GEMM-based convolution though. I am starting from NCHWc format direct convolution present in current conv2d topi implementation. I should have some numbers for the conv operator by the next weekend and can share them.

anijain2305 on 25 Aug 2018

@ajtulloch It would be great if you could send a tutorial or topi recipe.

merrymercy on 27 Aug 2018

@anijain2305 you might find https://github.com/ajtulloch/tvm/blob/tvm-using-val/tensorize/gemm__avx2.c#L424-L531 or a similar microkernel for AVX512 useful on Skylake (same as MKL-DNN's vpmaddubsw/vpmaddwd/vpaddd sequence on AVX2/AVX512 pre VNNI).

@merrymercy what would be useful to have documented/tutorialized or made into a recipe?

ajtulloch on 27 Aug 2018

I think making a simple runnable conv2d example and showing its speedup will be very useful.

merrymercy on 28 Aug 2018

👍1

+1 to one runnable conv2d example. Besides ARMv7 / AVX2, I think we should add SSE too, for embedded platforms that use Intel Atom processors. Intel Atom processors support at most SSE4.2, not AVX2.