Let's summarize steps needed to update our PyTorch support for the latest version, 1.6 (the CI is at 1.4).
[x] Quantization support: It is completely broken due to a bug introduced in 1.6 (see https://github.com/pytorch/pytorch/issues/42497). I found a simple workaround for that problem (finally!) and the fix is WIP. The representation of quantized weights and APIs of quantized ops have changed. To support both 1.6 and older versions, we need to handle both cases (with version check) for now.
[x] torchvision detection model support: Need to add converters for aten::tensor, aten::empty, aten::numel and aten::IntImplicit. All of these ops seem trivial. I found that other missing ops I mentioned in https://github.com/apache/incubator-tvm/pull/6449#issuecomment-693023530 can be removed using torch._C._jit_pass_onnx_function_substitution(...) pass. We also need to deal with one function batched_nms that was used to be traced but now it is scripted, see https://github.com/pytorch/vision/blob/6e10e3f88158f12b7a304d3c2f803d2bbdde0823/torchvision/ops/boxes.py#L42-L43
We need to solve a typing problem involving If and Any type.
[x] Disable or fix tests that no longer work. For example, tracing googlenet is broken on Torch 1.5 or higher (see https://github.com/pytorch/vision/issues/2161), so we need to disable googlenet tests.
[x] Upgrade CI docker image
cc @t-vi @siju-samuel @kevinthesun @yongwww @alexwong @tqchen
@kevinthesun @yongwww I've taken a look at supporting detection models in 1.6. It turned out we need to convert a function batched_nms that was traced in 1.4, but now it is scripted. This brings a very nasty typing problem.
Looking at this code,
https://github.com/pytorch/vision/blob/6e10e3f88158f12b7a304d3c2f803d2bbdde0823/torchvision/ops/boxes.py#L75-L86
One of the types of if branch is a zero dim empty tensor of dtype int64 https://github.com/pytorch/vision/blob/6e10e3f88158f12b7a304d3c2f803d2bbdde0823/torchvision/ops/boxes.py#L76
while the type of other branch is a dynamic 1D tensor of dtype int32 https://github.com/pytorch/vision/blob/6e10e3f88158f12b7a304d3c2f803d2bbdde0823/torchvision/ops/boxes.py#L85-L86
We cannot type check the converted Relay model with If, because dtype of two branches are different. Even if I force the dtype to be the same, if one branch is Any shape while the other is static, Relay type inference chooses a static shape for the type of If. So the return type of above function becomes a zero dim tensor.
Is there a way to turn a static shape tensor to dynamic, when one of If branch is dynamic while other is static?
@yongwww It seems torchvision nms op https://github.com/pytorch/vision/blob/master/torchvision/ops/boxes.py#L35 returns int64 indices, while nms in Relay returns int32 bit indices. We need to cast the output indices to int64, that should resolve the one of typing problem (two branches of If having different dtypes, int32 and int64).
@masahi That shape problem coming from If branch looks like a Relay type inference issue to me. Type inference should generate dynamic shape in this case.
@masahi
import tvm
from tvm import relay
dtype = "float32"
branch_a = relay.var("a", shape=(relay.Any(),), dtype=dtype)
branch_b = relay.var("b", shape=(0,), dtype=dtype)
cond = relay.var("cond", shape=(), dtype="bool")
out = relay.If(cond, branch_a, branch_b)
mod = tvm.IRModule()
mod["main"] = relay.Function([cond, branch_a,branch_b], out)
mod = relay.transform.InferType()(mod)
print(mod["main"])
Will generate
fn (%cond: bool, %a: Tensor[(0), float32], %b: Tensor[(0), float32]) -> Tensor[(0), float32] {
if (%cond) {
%a
} else {
%b
}
}
while input a should be with shape(?,). There is an issue when inferring such an expression.
@kevinthesun thanks for pointing this out and providing a minimum test case. I'm going to take a look at what's going on inside type checker and hopefully fix it. With this fixed, I can send a PR to support PyTorch 1.6 mask rcnn and faster rcnn.
Finding suspicious code was easy, here, when lhs is IntImmNode and rhs is AnyNode, ulhs is returned.
Returning urhs there seems to fix this issue. I'm not sure if this is intended or a bug.
@kevinthesun @jroesch @lixiaoquan @MarisaKirisame I found that this change was introduced in https://github.com/apache/incubator-tvm/pull/5795
If I make that above change, that effectively undos that PR and breaks the test case introduced there. Basically, what we want from type checking If involving both static and dynamic shape are complete opposite to the motivation of #5795.
What should we do about this? I think #5795 should be reverted, since type inference is supposed to pick the most general types.
Hmmm, I think that PR is an optimization for some cases but is actually not correct since static dim is just a case of Any but not vise versa. The case provided in that PR only stands when False is fed with (Any, 1) tensor at runtime.
In TF frontend there are lots of cases when we want to make expression as static as possible. My guessing is that False actually has shape (Any, 1) but somehow not fully optimized to be less dynamic. Usually such kind of optimization is done in the frontend.
@lixiaoquan We should revert it?
I've sent the revert to https://github.com/apache/incubator-tvm/pull/6658
With this and other minor fixes, I can finally run torchvision detection models from v1.6. See https://github.com/apache/incubator-tvm/pull/6659.
Most helpful comment
I've sent the revert to https://github.com/apache/incubator-tvm/pull/6658
With this and other minor fixes, I can finally run torchvision detection models from v1.6. See https://github.com/apache/incubator-tvm/pull/6659.