tvm 🚀 - [Performance regression] Revamp IntSet #3272 causing GluonCV SSD performance issue

@kevinthesun It would be great if you could look deeper into the issue.
For example, print out the entire loop nest (mainly the select, as in https://github.com/dmlc/tvm/pull/3132) and confirm whether the condition is something that can be simplified. Previously the simplification happened because the boundary guard condition provided a bound on ax0.ax1.fused.

tqchen on 30 Jun 2019

A snippet of the IR before:

T_concat[((ax0.ax1.fused*6) + ax2)] = tvm_if_then_else((116508 <= ax0.ax1.fused), tvm_if_then_else((2 <= ax2), tvm_if_then_else((5 <= ax2), ((((tvm_if_then_else((24512 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), placeholder[((((ax0.ax1.fused*4) + ax2) + -4) % 16)], tvm_if_then_else((24448 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), placeholder[((((((((ax0.ax1.fused*4) + ax2) + -4) % 16)*2) + ((((((ax0.ax1.fused*4) + ax2) + -4) % 24528) % 64)/32))*2) + ((((((ax0.ax1.fused*4) + ax2) + -4) % 24528)/16) % 2))], tvm_if_then_else(...

Now:

T_concat[((ax0.ax1.fused*6) + ax2)] = tvm_if_then_else((116508 <= ax0.ax1.fused), tvm_if_then_else((2 <= ax2), tvm_if_then_else((5 <= ax2), ((((tvm_if_then_else(((490548 - ax2) <= (ax0.ax1.fused*4)), placeholder[((((ax0.ax1.fused*4) + ax2) + -466036) % 16)], tvm_if_then_else(((490484 - ax2) <= (ax0.ax1.fused*4)), placeholder[((((ax0.ax1.fused*4) + ax2) + -466036) % 64)], tvm_if_then_else(((490100 - ax2) <= (ax0.ax1.fused*4)), placeholder[(((((((((ax0.ax1.fused*4) + ax2) + -466052) % 24)/12)*16) + ((((ax0.ax1.fused*4) + ax2) + -490100)/24))*12) + ((((ax0.ax1.fused*4) + ax2) + -466052) % 12))], tvm_if_then_else(...

Full ir : before now

kevinthesun on 30 Jun 2019

Looking at the IR, the main cause is that the form of the condition changed. The condition is now (490548 - ax2) <= (ax0.ax1.fused*4) instead of 490548 <= (ax0.ax1.fused*4) + ax2, and the const bound detector is not able to use the new form. I will send a possible fix for this.
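
A minimal sketch of why the two forms are interchangeable as guards: moving ax2 across the inequality does not change which iterations pass, only the shape of the expression a bound detector has to match. The function names and the sampled ranges are illustrative assumptions, not the real loop extents or any TVM API.

```python
# Brute-force check that the two guard forms select the same iterations.
# The sampled ranges are illustrative assumptions, not the real loop extents.

def guard_rearranged(fused, ax2):
    # Form seen in the newer IR: ax2 has been moved to the constant side.
    return (490548 - ax2) <= fused * 4

def guard_canonical(fused, ax2):
    # Form seen in the older IR: a plain constant compared with the index sum.
    return 490548 <= fused * 4 + ax2

def forms_agree(fused_range, ax2_range):
    # True when both guards give the same answer for every sampled pair.
    return all(
        guard_rearranged(f, a) == guard_canonical(f, a)
        for f in fused_range
        for a in ax2_range
    )

print(forms_agree(range(122600, 122700), range(0, 6)))
```

Both guards are the same predicate, so any slowdown would come from how later passes treat the rearranged form rather than from a change in semantics.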

tqchen on 1 Jul 2019

https://github.com/dmlc/tvm/pull/3467

tqchen on 1 Jul 2019

@tqchen I applied that patch but the issue still exists.

kevinthesun on 1 Jul 2019

I'm not sure whether this is related to this thread. In BERT, the fused batch_matmul-ones_like-where op generates:

produce T_where {
  for (ax0, 0, 12) {
    for (ax1, 0, 384) {
      for (ax2, 0, 384) {
        T_where[((((ax0*384) + ax1)*384) + ax2)] = tvm_if_then_else((placeholder[((((ax0*384) + ax1)*384) + ax2)] == 0.000000f), (T_full_like[((((ax0*384) + ax1)*384) + ax2)]*-999999984306749440.000000f), compute[((((ax0*384) + ax1)*384) + ax2)])
      }
    }
  }
}

and the performance is significantly lower.
@icemelon9

kevinthesun on 1 Jul 2019

@kevinthesun it would be great if you could dig a bit into what exactly happens, and see whether explicitly applying CanonicalSimplify helps (given that the new simplifier should at least fix some cases). Alternatively, we could construct a minimal case that illustrates the problem.

tqchen on 1 Jul 2019

@tqchen After applying your PR, the condition part is now fixed. One key difference in the indexing is:
before:

 tvm_if_then_else((24064 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), placeholder[((((((((ax0.ax1.fused*4) + ax2) + -4) % 24528) + -16) % 24)*16) + ((((((ax0.ax1.fused*4) + ax2) + -4) % 24528) + -24064)/24))], tvm_if_then_else((22528 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), ...

now:

tvm_if_then_else((24064 <= (((ax0.ax1.fused*4) + ax2) - 466036)), placeholder[((((((((ax0.ax1.fused*4) + ax2) - 466052) % 24)/12)*192) + ((((((ax0.ax1.fused*4) + ax2) - 466292) % 384)/24)*12)) + ((((ax0.ax1.fused*4) + ax2) - 466040) % 12))], ...

In the placeholder indexing, the mod operation involves a large number. I'm not sure whether this could be the cause.
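
A minimal sketch relating the two snippets, under a stated assumption: if (ax0.ax1.fused*4 + ax2 - 4) stays inside a single period [466032, 490560), taking it modulo 24528 is the same as subtracting 466032, which would explain how the "% 24528" in the old IR turns into the "- 466036" and "- 490100" constants in the new one. The window is an assumption made for illustration (466032 = 19 * 24528 = 116508 * 4); the real extent of ax0.ax1.fused is not shown in the thread, and x below stands for ax0.ax1.fused*4 + ax2.

```python
PERIOD = 24528        # modulus used in the older IR
BASE = 19 * PERIOD    # 466032, which also equals 116508 * 4

def cond_before(x):
    return 24064 <= (x - 4) % PERIOD

def cond_now(x):
    return 24064 <= x - 466036

def offset_before(x):
    # Second index term of the older IR (operands are non-negative here,
    # so Python's // matches the truncating division in the IR).
    return (((x - 4) % PERIOD) - 24064) // 24

def offset_now(x):
    # Corresponding term after the modulo has been replaced by a subtraction.
    return (x - 490100) // 24

def agree_in_window():
    # Assumed window: x - 4 lies in [BASE, BASE + PERIOD).
    for x in range(BASE + 4, BASE + 4 + PERIOD):
        if cond_before(x) != cond_now(x):
            return False
        if cond_now(x) and offset_before(x) != offset_now(x):
            return False
    return True

print(agree_in_window())
```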

kevinthesun on 3 Jul 2019

https://github.com/dmlc/tvm/issues/3478 might also help to alleviate the situation. From what I see, we may still need to enhance the simplifier a bit, e.g. 24064 <= (((ax0.ax1.fused*4) + ax2) - 466036) could have been rewritten to give a clear bound for the inner expression.

tqchen on 4 Jul 2019

@kevinthesun Please test again using the latest master, as some recent PRs enhance the simplifier.

tqchen on 8 Jul 2019

@tqchen I updated the gist to reflect the latest IR. The performance is better now, but still 3 ms slower than the original. The condition now has a clear bound. Would https://github.com/dmlc/tvm/issues/3478 be the next step for improvement?

kevinthesun on 9 Jul 2019

Yes, introducing floordiv/mod might improve the perf further, but it will need a few more PRs that change the division mode before we can benefit from it. I would encourage us to treat that as a separate issue. If you can still isolate things that can be improved, we can dig further here.
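
A minimal sketch of the two division modes being discussed, written in plain Python rather than against any TVM API (the helper names are made up for illustration). Truncating division rounds toward zero, as C does, while floor division rounds toward negative infinity, so for a positive divisor the floor-mode remainder always lies in [0, divisor), which is an easier range for a simplifier to reason about.

```python
def truncdiv(a, b):
    # Round the quotient toward zero (C-style integer division).
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def truncmod(a, b):
    # Remainder paired with truncdiv; takes the sign of the dividend.
    return a - truncdiv(a, b) * b

def floordiv(a, b):
    # Python's // already rounds toward negative infinity.
    return a // b

def floormod(a, b):
    # Python's % pairs with //; result is in [0, b) for positive b.
    return a % b

for a in (7, -7):
    print(a, truncdiv(a, 4), truncmod(a, 4), floordiv(a, 4), floormod(a, 4))
```

The two modes agree whenever the dividend is non-negative and the divisor is positive, so the choice only matters when the sign of an index expression cannot be proven.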

tqchen on 9 Jul 2019

@tqchen I found another regression related to the new simplifier: conv2d with a large batch size runs slower on x86 CPUs. These workloads appear in RCNN models.

One example:

import tvm
import topi
import numpy as np

from tvm.autotvm.record import decode

target = "llvm -mcpu=skylake-avx512"
run_times = 100
record = decode('{"i": ["llvm -mcpu=skylake-avx512", "topi_x86_conv2d_NCHWc", [["TENSOR", [300, 2048, 7, 7], "float32"], ["TENSOR", [512, 2048, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [300, 2048, 7, 7, "float32"], [512, 2048, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 185, "t": "direct", "c": null, "e": [["tile_ic", "sp", [64, 32]], ["tile_oc", "sp", [16, 32]], ["tile_ow", "sp", [1, 7]], ["tile_oh", "ot", 1]]}], "r": [[0.0156591308515625], 0, 4.519576787948608, 1559627669.4038029], "v": 0.1}')

wkl = record[0].task.workload
n, ic, ih, iw, _ = wkl[1]
oc, _, kh, kw, _ = wkl[2]
sh, sw = wkl[3]
ph, pw = wkl[4]
dh, dw = wkl[5]
dilated_kernel_h = (kh - 1) * dh + 1
dilated_kernel_w = (kw - 1) * dw + 1
oh = (ih + 2 * ph - dilated_kernel_h) // sh + 1
ow = (iw + 2 * pw - dilated_kernel_w) // sw + 1

cfg = record[0].config
ic_bn = cfg["tile_ic"].val if hasattr(cfg["tile_ic"], "val") else cfg["tile_ic"].size[-1]
oc_bn = cfg["tile_oc"].val if hasattr(cfg["tile_oc"], "val") else cfg["tile_oc"].size[-1]

dshape = (n, ic // ic_bn, ih, iw, ic_bn)
kshape = (oc // oc_bn, ic // ic_bn, kh, kw, ic_bn, oc_bn)
oshape = (n, oc // oc_bn, oh, ow, oc_bn)

data_layout = "NCHW%dc" % ic_bn
out_layout = "NCHW%dc" % oc_bn

data = tvm.placeholder(dshape, name="data")
kernel = tvm.placeholder(kshape, name="weight")
out = topi.x86.conv2d._declaration_conv_NCHWc(cfg, data, kernel, (sh, sw), (ph, pw), (dh, dw), data_layout, out_layout, "float32")
s = topi.x86.conv2d._schedule_conv2d_NCHWc(cfg, [out])
func = tvm.build(s, [data, kernel, out], target=target)
ctx = tvm.cpu()
d = tvm.nd.array(np.random.uniform(size=dshape).astype("float32"), ctx)
k = tvm.nd.array(np.random.uniform(size=kshape).astype("float32"), ctx)
o = tvm.nd.empty(oshape, "float32", ctx)
time_f = func.time_evaluator(func.entry_name, ctx, number=run_times)
cost = time_f(d, k, o).mean
tvm_time = cost * 1000

print(tvm_time)

On a c5.9x machine, the run time of this conv2d workload regresses from 16 ms to 19 ms. I tested some other workloads with different cfg combinations, and the difference can be as large as 80%.

IR before that PR:

Tensor(shape=[300, 64, 7, 7, 32], op.name=data) Tensor(shape=[16, 64, 1, 1, 32, 32], op.name=weight) (1, 1) (0, 0) (1, 1)
[('tile_ic', [64, 32]), ('tile_oc', [16, 32]), ('tile_ow', [1, 7]), ('tile_oh', 1)],direct,None,185
default_function
produce conv2d_NCHWc {
  parallel (n.oc_chunk.fused.oh.outer.fused, 0, 33600) {
    // attr [conv2d_NCHWc.global] storage_scope = "global"
    allocate conv2d_NCHWc.global[float32x32 * 7]
    produce conv2d_NCHWc.global {
      conv2d_NCHWc.global[ramp(0, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(32, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(64, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(96, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(128, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(160, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(192, 1, 32)] = x32(0.000000f)
      for (ic.outer, 0, 64) {
        for (ic.inner, 0, 32) {
          conv2d_NCHWc.global[ramp(0, 1, 32)] = (conv2d_NCHWc.global[ramp(0, 1, 32)] + (x32(data[(((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
          conv2d_NCHWc.global[ramp(32, 1, 32)] = (conv2d_NCHWc.global[ramp(32, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 32)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
          conv2d_NCHWc.global[ramp(64, 1, 32)] = (conv2d_NCHWc.global[ramp(64, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 64)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
          conv2d_NCHWc.global[ramp(96, 1, 32)] = (conv2d_NCHWc.global[ramp(96, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 96)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
          conv2d_NCHWc.global[ramp(128, 1, 32)] = (conv2d_NCHWc.global[ramp(128, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 128)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
          conv2d_NCHWc.global[ramp(160, 1, 32)] = (conv2d_NCHWc.global[ramp(160, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 160)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
          conv2d_NCHWc.global[ramp(192, 1, 32)] = (conv2d_NCHWc.global[ramp(192, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 192)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
        }
      }
    }
    for (ow.inner, 0, 7) {
      conv2d_NCHWc[ramp((((n.oc_chunk.fused.oh.outer.fused*7) + ow.inner)*32), 1, 32)] = conv2d_NCHWc.global[ramp((ow.inner*32), 1, 32)]
    }
  }
}

IR after that PR:

produce conv2d_NCHWc {
  parallel (n.oc_chunk.fused.oh.outer.fused, 0, 33600) {
    // attr [conv2d_NCHWc.global] storage_scope = "global"
    allocate conv2d_NCHWc.global[float32x32 * 7]
    produce conv2d_NCHWc.global {
      conv2d_NCHWc.global[ramp(0, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(32, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(64, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(96, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(128, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(160, 1, 32)] = x32(0.000000f)
      conv2d_NCHWc.global[ramp(192, 1, 32)] = x32(0.000000f)
      for (ic.outer, 0, 64) {
        for (ic.inner, 0, 32) {
          conv2d_NCHWc.global[ramp(0, 1, 32)] = (conv2d_NCHWc.global[ramp(0, 1, 32)] + (x32(data[(((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(32, 1, 32)] = (conv2d_NCHWc.global[ramp(32, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 32)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(64, 1, 32)] = (conv2d_NCHWc.global[ramp(64, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 64)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(96, 1, 32)] = (conv2d_NCHWc.global[ramp(96, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 96)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(128, 1, 32)] = (conv2d_NCHWc.global[ramp(128, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 128)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(160, 1, 32)] = (conv2d_NCHWc.global[ramp(160, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 160)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(192, 1, 32)] = (conv2d_NCHWc.global[ramp(192, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 192)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
        }
      }
    }
    for (ow.inner, 0, 7) {
      conv2d_NCHWc[ramp(((n.oc_chunk.fused.oh.outer.fused*224) + (ow.inner*32)), 1, 32)] = conv2d_NCHWc.global[ramp((ow.inner*32), 1, 32)]
    }
  }
}

The only difference is that the index expressions in the conv2d_NCHWc.global computation are now expanded and multiply by large constants. Can this be the cause?
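
A minimal sketch confirming that the nested and expanded index expressions in the two IR dumps address the same elements, so the change is in the shape of the arithmetic and not in what is loaded. The function names are made up for illustration, f stands for n.oc_chunk.fused.oh.outer.fused, and only a sample of f values is checked to keep the loop short.

```python
def data_index_before(f, ic_outer, ic_inner):
    # Nested form from the older IR.
    return ((((f // 112) * 64 + ic_outer) * 7 + (f % 7)) * 224) + ic_inner

def data_index_after(f, ic_outer, ic_inner):
    # Expanded form from the newer IR: 64*7*224 = 100352, 7*224 = 1568.
    return (f // 112) * 100352 + ic_outer * 1568 + (f % 7) * 224 + ic_inner

def weight_index_before(f, ic_outer, ic_inner):
    return ((((f % 112) // 7) * 64 + ic_outer) * 32 + ic_inner) * 32

def weight_index_after(f, ic_outer, ic_inner):
    # 64*32*32 = 65536, 32*32 = 1024.
    return ((f % 112) // 7) * 65536 + ic_outer * 1024 + ic_inner * 32

def indices_agree(step=331):
    # Sample the parallel loop (extent 33600) and sweep both inner loops fully.
    for f in range(0, 33600, step):
        for ic_outer in range(64):
            for ic_inner in range(32):
                args = (f, ic_outer, ic_inner)
                if data_index_before(*args) != data_index_after(*args):
                    return False
                if weight_index_before(*args) != weight_index_after(*args):
                    return False
    return True

print(indices_agree())
```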

kevinthesun on 5 Sep 2019

@kevinthesun please check again now that the integer simplification infra has landed

tqchen on 9 Oct 2019

produce conv2d_NCHWc {
  parallel (n.oc_chunk.fused.oh.outer.fused, 0, 33600) {
    // attr [conv2d_NCHWc.global] storage_scope = "global"
    allocate conv2d_NCHWc.global[float32x32 * 7]
    produce conv2d_NCHWc.global {
      conv2d_NCHWc.global[ramp(0, 1, 32)] = x32(0f)
      conv2d_NCHWc.global[ramp(32, 1, 32)] = x32(0f)
      conv2d_NCHWc.global[ramp(64, 1, 32)] = x32(0f)
      conv2d_NCHWc.global[ramp(96, 1, 32)] = x32(0f)
      conv2d_NCHWc.global[ramp(128, 1, 32)] = x32(0f)
      conv2d_NCHWc.global[ramp(160, 1, 32)] = x32(0f)
      conv2d_NCHWc.global[ramp(192, 1, 32)] = x32(0f)
      for (ic.outer, 0, 64) {
        for (ic.inner, 0, 32) {
          conv2d_NCHWc.global[ramp(0, 1, 32)] = (conv2d_NCHWc.global[ramp(0, 1, 32)] + (x32(data[((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(32, 1, 32)] = (conv2d_NCHWc.global[ramp(32, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 32)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(64, 1, 32)] = (conv2d_NCHWc.global[ramp(64, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 64)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(96, 1, 32)] = (conv2d_NCHWc.global[ramp(96, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 96)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(128, 1, 32)] = (conv2d_NCHWc.global[ramp(128, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 128)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(160, 1, 32)] = (conv2d_NCHWc.global[ramp(160, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 160)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
          conv2d_NCHWc.global[ramp(192, 1, 32)] = (conv2d_NCHWc.global[ramp(192, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 192)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
        }
      }
    }
    for (ow.inner, 0, 7) {
      conv2d_NCHWc[ramp(((n.oc_chunk.fused.oh.outer.fused*224) + (ow.inner*32)), 1, 32)] = conv2d_NCHWc.global[ramp((ow.inner*32), 1, 32)]
    }
  }
}

@tqchen I tested with the latest master. The performance is better, but the run time is still around 18.5 ms.

kevinthesun on 11 Oct 2019

@kevinthesun can you look a bit into what is happening? Specifically, please look at the code after LowerIntrin to make sure that floordiv/mod are lowered correctly to the corresponding https://github.com/dmlc/tvm/blob/master/python/tvm/build_module.py#L498

Note that we will likely need to retune the workload, given that the division order is different.

As for recovering the original behavior: right now we flatten all the multiplications for simplification, whereas the original code eagerly folded the common constants, e.g. (x * 4 + y) * 2 (original) vs x * 8 + y * 2 (flattened). If you find that this is a cause, adding an additional pass before low-level codegen that folds the common multiplication factors will likely resolve this problem.
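
A minimal sketch of what folding a common multiplication factor could look like on a flattened sum, assuming the sum is represented as a list of (coefficient, variable) pairs. The representation and the helper names are hypothetical and only illustrate the idea; they are not TVM passes or APIs.

```python
from functools import reduce
from math import gcd

def fold_common_factor(terms):
    """Factor the GCD of the coefficients out of sum(coef * var).

    terms: list of (coefficient, variable_name) pairs.
    Returns (factor, reduced_terms) with factor * sum(reduced) == sum(terms).
    """
    factor = reduce(gcd, (abs(c) for c, _ in terms)) or 1  # avoid dividing by 0
    return factor, [(c // factor, v) for c, v in terms]

def evaluate(terms, env, factor=1):
    # Evaluate factor * sum(coef * value) for the given variable bindings.
    return factor * sum(c * env[v] for c, v in terms)

flat = [(8, "x"), (2, "y")]            # x * 8 + y * 2
factor, inner = fold_common_factor(flat)
print(factor, inner)                   # 2 [(4, 'x'), (1, 'y')], i.e. (x * 4 + y) * 2

env = {"x": 5, "y": 3}
print(evaluate(flat, env) == evaluate(inner, env, factor))
```

A real pass would work on expression trees and decide per subexpression whether refolding is profitable, which this sketch does not attempt.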

tqchen on 11 Oct 2019

@tqchen Yes. I think retuning is necessary here. Let me redo the tuning and check the IR.

kevinthesun on 11 Oct 2019

After retuning, the run time comes down to 13.5 ms! I'll retune SSD as well.

kevinthesun on 11 Oct 2019

With the new IR simplifier infrastructure, the ssd resnet50_v1 run time improves from 33 ms to 26.5 ms. I'll close this issue and open a new one if more performance issues are found.

kevinthesun on 14 Oct 2019
