https://github.com/dmlc/tvm/pull/3272 is causing an issue similar to the one that used to happen in https://github.com/dmlc/tvm/issues/3097
The operator fused_strided_slice_greater_cast_strided_slice_zeros_like_add_add_add_add_add_ad_11203150218747419416_ is much slower because the mod operation is no longer simplified:
placeholder[((((ax0.ax1.fused*4) + ax2) + -466036) % 16)]
Before this PR it was:
placeholder[((((ax0.ax1.fused*4) + ax2) + -4) % 16)]
@tqchen @wweic
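One way to see that the two index expressions above address the same element is to check that their offsets differ by an exact multiple of 16. The sketch below is plain Python rather than TVM. `truncmod` is a hypothetical helper that mimics a C-style truncated remainder, which is assumed here to be what `%` means in this IR, and the sampled range of `ax0.ax1.fused` is an illustrative assumption, not the real loop extent.

```python
def truncmod(a, b):
    """C-style remainder: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def index_before(fused, ax2):
    # placeholder[((((ax0.ax1.fused*4) + ax2) + -4) % 16)]
    return truncmod(fused * 4 + ax2 - 4, 16)


def index_after(fused, ax2):
    # placeholder[((((ax0.ax1.fused*4) + ax2) + -466036) % 16)]
    return truncmod(fused * 4 + ax2 - 466036, 16)


# The two offsets differ by an exact multiple of 16.
offset_gap = 466036 - 4
gap_is_multiple_of_16 = offset_gap % 16 == 0

# Assumed sample of the guarded region (116508 <= fused, 5 <= ax2 < 6),
# where both dividends are non-negative.
indices_agree = all(
    index_before(fused, ax2) == index_after(fused, ax2)
    for fused in range(116508, 116508 + 1000)
    for ax2 in range(5, 6)
)
print(gap_is_multiple_of_16, indices_agree)
```

So the two forms pick the same element inside the guarded region; the extra cost comes from how the expression is lowered, not from a change in which element is read.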
@kevinthesun It would be great if you could look deeper into the issue.
For example, you could print out the entire loop nest (mainly the select, as in https://github.com/dmlc/tvm/pull/3132) and confirm whether the condition is something that can be simplified. Previously, the simplification happened because a boundary guard condition provided a bound on ax0.ax1.fused.
A snippet of the IR from before:
T_concat[((ax0.ax1.fused*6) + ax2)] = tvm_if_then_else((116508 <= ax0.ax1.fused), tvm_if_then_else((2 <= ax2), tvm_if_then_else((5 <= ax2), ((((tvm_if_then_else((24512 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), placeholder[((((ax0.ax1.fused*4) + ax2) + -4) % 16)], tvm_if_then_else((24448 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), placeholder[((((((((ax0.ax1.fused*4) + ax2) + -4) % 16)*2) + ((((((ax0.ax1.fused*4) + ax2) + -4) % 24528) % 64)/32))*2) + ((((((ax0.ax1.fused*4) + ax2) + -4) % 24528)/16) % 2))], tvm_if_then_else(...
Now:
T_concat[((ax0.ax1.fused*6) + ax2)] = tvm_if_then_else((116508 <= ax0.ax1.fused), tvm_if_then_else((2 <= ax2), tvm_if_then_else((5 <= ax2), ((((tvm_if_then_else(((490548 - ax2) <= (ax0.ax1.fused*4)), placeholder[((((ax0.ax1.fused*4) + ax2) + -466036) % 16)], tvm_if_then_else(((490484 - ax2) <= (ax0.ax1.fused*4)), placeholder[((((ax0.ax1.fused*4) + ax2) + -466036) % 64)], tvm_if_then_else(((490100 - ax2) <= (ax0.ax1.fused*4)), placeholder[(((((((((ax0.ax1.fused*4) + ax2) + -466052) % 24)/12)*16) + ((((ax0.ax1.fused*4) + ax2) + -490100)/24))*12) + ((((ax0.ax1.fused*4) + ax2) + -466052) % 12))], tvm_if_then_else(...
Looking into it, the main cause is that the condition has changed. It is now (490548 - ax2) <= (ax0.ax1.fused*4) instead of 490548 <= (ax0.ax1.fused*4) + ax2, and the const bound detector is not able to use a condition in this form. I will send a possible fix for this.
@tqchen I applied that patch but the issue still exists.
I'm not sure whether this is related to this thread. In BERT, the fused batch_matmul-ones_like-where op generates:
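The two forms of the condition are algebraically identical, which a small canonicalization step makes visible. The sketch below is a toy illustration in plain Python, not the TVM bound detector: `normalize` and the dict-based linear forms are hypothetical constructs introduced only for this example.

```python
def normalize(lhs, rhs):
    """Rewrite `lhs <= rhs` as `0 <= rhs - lhs`, returned as one linear form.

    Linear forms are dicts mapping a variable name to its coefficient;
    the key "const" holds the constant term.
    """
    out = dict(rhs)
    for name, coeff in lhs.items():
        out[name] = out.get(name, 0) - coeff
    return {name: coeff for name, coeff in out.items() if coeff != 0}


# New form: (490548 - ax2) <= (ax0.ax1.fused*4)
new_cond = normalize({"const": 490548, "ax2": -1}, {"fused": 4})
# Old form: 490548 <= (ax0.ax1.fused*4) + ax2
old_cond = normalize({"const": 490548}, {"fused": 4, "ax2": 1})

# Both normalize to 0 <= fused*4 + ax2 - 490548.
same_condition = new_cond == old_cond
# Lower bound on fused*4 + ax2 implied by the condition.
lower_bound = -new_cond["const"]
print(same_condition, lower_bound)
```

A detector that only matches the literal shape `const <= expr` sees the first form as a bound on `ax0.ax1.fused*4` with a non-constant left-hand side, so it cannot extract the bound until the terms are moved to one side.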
produce T_where {
for (ax0, 0, 12) {
for (ax1, 0, 384) {
for (ax2, 0, 384) {
T_where[((((ax0*384) + ax1)*384) + ax2)] = tvm_if_then_else((placeholder[((((ax0*384) + ax1)*384) + ax2)] == 0.000000f), (T_full_like[((((ax0*384) + ax1)*384) + ax2)]*-999999984306749440.000000f), compute[((((ax0*384) + ax1)*384) + ax2)])
}
}
}
}
and the performance is significantly lower.
@icemelon9
@kevinthesun it would be great if you can dig a bit into what exactly happens and see whether explicitly applying CanonicalSimplify helps (given that the new simplifier should at least fix some cases). Alternatively, we could construct a minimal case that illustrates the problem.
@tqchen After applying your PR, the condition part is now fixed. One key difference in the indexing is:
before:
tvm_if_then_else((24064 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), placeholder[((((((((ax0.ax1.fused*4) + ax2) + -4) % 24528) + -16) % 24)*16) + ((((((ax0.ax1.fused*4) + ax2) + -4) % 24528) + -24064)/24))], tvm_if_then_else((22528 <= ((((ax0.ax1.fused*4) + ax2) + -4) % 24528)), ...
now:
tvm_if_then_else((24064 <= (((ax0.ax1.fused*4) + ax2) - 466036)), placeholder[((((((((ax0.ax1.fused*4) + ax2) - 466052) % 24)/12)*192) + ((((((ax0.ax1.fused*4) + ax2) - 466292) % 384)/24)*12)) + ((((ax0.ax1.fused*4) + ax2) - 466040) % 12))], ...
In the placeholder indexing, the mod operations now involve large constants. I'm not sure whether this is the cause.
https://github.com/dmlc/tvm/issues/3478 might also help to alleviate the situation. From what I see, we perhaps still need to enhance the simplifier a bit, e.g. 24064 <= (((ax0.ax1.fused*4) + ax2) - 466036) could have been rewritten to give a clear bound for the inner expression.
@kevinthesun Please test again using the latest master, as some recent PRs enhance the simplifications.
@tqchen I updated the gist to reflect the latest IR. Performance is now better, but still 3 ms slower than the original. The condition now has a clear bound. Would https://github.com/dmlc/tvm/issues/3478 be the next step for improvement?
Yes, introducing floordiv/mod might improve the perf further, but it will need a few more PRs that change the division mode to take advantage of that. I would encourage us to separate the issues. If you can still isolate things that can be improved, we can dig further here.
@tqchen I found another regression related to the new simplifier: conv2d with a large batch size performs slower on x86 CPU. These workloads appear in RCNN models.
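For readers unfamiliar with the distinction, the sketch below contrasts truncated and floor division in plain Python. The helper names mirror the `floordiv`/`floormod` spelling used in the IR, but they are illustrative stand-ins written for this comment, not TVM functions.

```python
def truncdiv(a, b):
    """Division rounding toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def truncmod(a, b):
    """Remainder paired with truncdiv; takes the sign of the dividend."""
    return a - b * truncdiv(a, b)


def floordiv(a, b):
    """Division rounding toward negative infinity (Python's //)."""
    return a // b


def floormod(a, b):
    """Remainder paired with floordiv; takes the sign of the divisor."""
    return a % b


# The two modes agree for non-negative dividends and differ for negative ones.
for a in (7, -7):
    print(a, truncdiv(a, 16), truncmod(a, 16), floordiv(a, 16), floormod(a, 16))
```

With a positive divisor, `floormod` always lands in `[0, divisor)` regardless of the sign of the dividend, so a simplifier needs no separate proof that the dividend is non-negative before reasoning about the result's range.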
One example:
import tvm
import topi
import numpy as np
from tvm.autotvm.record import decode
target = "llvm -mcpu=skylake-avx512"
run_times = 100
record = decode('{"i": ["llvm -mcpu=skylake-avx512", "topi_x86_conv2d_NCHWc", [["TENSOR", [300, 2048, 7, 7], "float32"], ["TENSOR", [512, 2048, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [300, 2048, 7, 7, "float32"], [512, 2048, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 185, "t": "direct", "c": null, "e": [["tile_ic", "sp", [64, 32]], ["tile_oc", "sp", [16, 32]], ["tile_ow", "sp", [1, 7]], ["tile_oh", "ot", 1]]}], "r": [[0.0156591308515625], 0, 4.519576787948608, 1559627669.4038029], "v": 0.1}')
wkl = record[0].task.workload
n, ic, ih, iw, _ = wkl[1]
oc, _, kh, kw, _ = wkl[2]
sh, sw = wkl[3]
ph, pw = wkl[4]
dh, dw = wkl[5]
dilated_kernel_h = (kh - 1) * dh + 1
dilated_kernel_w = (kw - 1) * dw + 1
oh = (ih + 2 * ph - dilated_kernel_h) // sh + 1
ow = (iw + 2 * pw - dilated_kernel_w) // sw + 1
cfg = record[0].config
ic_bn = cfg["tile_ic"].val if hasattr(cfg["tile_ic"], "val") else cfg["tile_ic"].size[-1]
oc_bn = cfg["tile_oc"].val if hasattr(cfg["tile_oc"], "val") else cfg["tile_oc"].size[-1]
dshape = (n, ic // ic_bn, ih, iw, ic_bn)
kshape = (oc // oc_bn, ic // ic_bn, kh, kw, ic_bn, oc_bn)
oshape = (n, oc // oc_bn, oh, ow, oc_bn)
data_layout = "NCHW%dc" % ic_bn
out_layout = "NCHW%dc" % oc_bn
data = tvm.placeholder(dshape, name="data")
kernel = tvm.placeholder(kshape, name="weight")
out = topi.x86.conv2d._declaration_conv_NCHWc(cfg, data, kernel, (sh, sw), (ph, pw), (dh, dw), data_layout, out_layout, "float32")
s = topi.x86.conv2d._schedule_conv2d_NCHWc(cfg, [out])
func = tvm.build(s, [data, kernel, out], target=target)
ctx = tvm.cpu()
d = tvm.nd.array(np.random.uniform(size=dshape).astype("float32"), ctx)
k = tvm.nd.array(np.random.uniform(size=kshape).astype("float32"), ctx)
o = tvm.nd.empty(oshape, "float32", ctx)
time_f = func.time_evaluator(func.entry_name, ctx, number=run_times)
cost = time_f(d, k, o).mean
tvm_time = cost * 1000
print(tvm_time)
On a c5.9x machine, the performance of this conv2d workload drops from 16 ms to 19 ms. I tested some other workloads with different cfg combinations, and the difference can be as large as 80%.
IR before that PR:
Tensor(shape=[300, 64, 7, 7, 32], op.name=data) Tensor(shape=[16, 64, 1, 1, 32, 32], op.name=weight) (1, 1) (0, 0) (1, 1)
[('tile_ic', [64, 32]), ('tile_oc', [16, 32]), ('tile_ow', [1, 7]), ('tile_oh', 1)],direct,None,185
default_function
produce conv2d_NCHWc {
parallel (n.oc_chunk.fused.oh.outer.fused, 0, 33600) {
// attr [conv2d_NCHWc.global] storage_scope = "global"
allocate conv2d_NCHWc.global[float32x32 * 7]
produce conv2d_NCHWc.global {
conv2d_NCHWc.global[ramp(0, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(32, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(64, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(96, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(128, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(160, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(192, 1, 32)] = x32(0.000000f)
for (ic.outer, 0, 64) {
for (ic.inner, 0, 32) {
conv2d_NCHWc.global[ramp(0, 1, 32)] = (conv2d_NCHWc.global[ramp(0, 1, 32)] + (x32(data[(((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
conv2d_NCHWc.global[ramp(32, 1, 32)] = (conv2d_NCHWc.global[ramp(32, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 32)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
conv2d_NCHWc.global[ramp(64, 1, 32)] = (conv2d_NCHWc.global[ramp(64, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 64)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
conv2d_NCHWc.global[ramp(96, 1, 32)] = (conv2d_NCHWc.global[ramp(96, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 96)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
conv2d_NCHWc.global[ramp(128, 1, 32)] = (conv2d_NCHWc.global[ramp(128, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 128)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
conv2d_NCHWc.global[ramp(160, 1, 32)] = (conv2d_NCHWc.global[ramp(160, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 160)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
conv2d_NCHWc.global[ramp(192, 1, 32)] = (conv2d_NCHWc.global[ramp(192, 1, 32)] + (x32(data[((((((((n.oc_chunk.fused.oh.outer.fused/112)*64) + ic.outer)*7) + (n.oc_chunk.fused.oh.outer.fused % 7))*224) + ic.inner) + 192)])*weight[ramp((((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*64) + ic.outer)*32) + ic.inner)*32), 1, 32)]))
}
}
}
for (ow.inner, 0, 7) {
conv2d_NCHWc[ramp((((n.oc_chunk.fused.oh.outer.fused*7) + ow.inner)*32), 1, 32)] = conv2d_NCHWc.global[ramp((ow.inner*32), 1, 32)]
}
}
}
IR after that PR:
produce conv2d_NCHWc {
parallel (n.oc_chunk.fused.oh.outer.fused, 0, 33600) {
// attr [conv2d_NCHWc.global] storage_scope = "global"
allocate conv2d_NCHWc.global[float32x32 * 7]
produce conv2d_NCHWc.global {
conv2d_NCHWc.global[ramp(0, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(32, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(64, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(96, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(128, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(160, 1, 32)] = x32(0.000000f)
conv2d_NCHWc.global[ramp(192, 1, 32)] = x32(0.000000f)
for (ic.outer, 0, 64) {
for (ic.inner, 0, 32) {
conv2d_NCHWc.global[ramp(0, 1, 32)] = (conv2d_NCHWc.global[ramp(0, 1, 32)] + (x32(data[(((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(32, 1, 32)] = (conv2d_NCHWc.global[ramp(32, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 32)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(64, 1, 32)] = (conv2d_NCHWc.global[ramp(64, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 64)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(96, 1, 32)] = (conv2d_NCHWc.global[ramp(96, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 96)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(128, 1, 32)] = (conv2d_NCHWc.global[ramp(128, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 128)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(160, 1, 32)] = (conv2d_NCHWc.global[ramp(160, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 160)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(192, 1, 32)] = (conv2d_NCHWc.global[ramp(192, 1, 32)] + (x32(data[((((((n.oc_chunk.fused.oh.outer.fused/112)*100352) + (ic.outer*1568)) + ((n.oc_chunk.fused.oh.outer.fused % 7)*224)) + ic.inner) + 192)])*weight[ramp((((((n.oc_chunk.fused.oh.outer.fused % 112)/7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
}
}
}
for (ow.inner, 0, 7) {
conv2d_NCHWc[ramp(((n.oc_chunk.fused.oh.outer.fused*224) + (ow.inner*32)), 1, 32)] = conv2d_NCHWc.global[ramp((ow.inner*32), 1, 32)]
}
}
}
The only difference is that the index expressions in the conv2d_NCHWc.global computation are now expanded and multiply by large constants. Can this be the cause?
@kevinthesun please check again now that the integer simplification infra has landed
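To confirm that the two IR dumps compute the same addresses and differ only in form, the index expressions can be transcribed and compared directly. This is a plain-Python sketch with hypothetical function names; it samples the `n.oc_chunk.fused.oh.outer.fused` range instead of enumerating all 33600 values, and relies on every operand being non-negative so that `//` and `%` match the IR's `/` and `%`.

```python
def data_index_before(fused, ic_outer, ic_inner):
    # Nested form from the IR before the PR.
    return (((fused // 112) * 64 + ic_outer) * 7 + fused % 7) * 224 + ic_inner


def data_index_after(fused, ic_outer, ic_inner):
    # Flattened form from the IR after the PR.
    return (fused // 112) * 100352 + ic_outer * 1568 + (fused % 7) * 224 + ic_inner


def weight_index_before(fused, ic_outer, ic_inner):
    return ((((fused % 112) // 7) * 64 + ic_outer) * 32 + ic_inner) * 32


def weight_index_after(fused, ic_outer, ic_inner):
    return ((fused % 112) // 7) * 65536 + ic_outer * 1024 + ic_inner * 32


# Sample the parallel loop variable; cover the reduction loops fully.
indices_match = all(
    data_index_before(f, o, i) == data_index_after(f, o, i)
    and weight_index_before(f, o, i) == weight_index_after(f, o, i)
    for f in range(0, 33600, 211)
    for o in range(64)
    for i in range(32)
)
print(indices_match)
```

The large constants are just the products of the nested factors (64 * 7 * 224 = 100352, 7 * 224 = 1568, 64 * 32 * 32 = 65536, 32 * 32 = 1024), so the memory access pattern is unchanged and any slowdown has to come from how the arithmetic is generated.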
produce conv2d_NCHWc {
parallel (n.oc_chunk.fused.oh.outer.fused, 0, 33600) {
// attr [conv2d_NCHWc.global] storage_scope = "global"
allocate conv2d_NCHWc.global[float32x32 * 7]
produce conv2d_NCHWc.global {
conv2d_NCHWc.global[ramp(0, 1, 32)] = x32(0f)
conv2d_NCHWc.global[ramp(32, 1, 32)] = x32(0f)
conv2d_NCHWc.global[ramp(64, 1, 32)] = x32(0f)
conv2d_NCHWc.global[ramp(96, 1, 32)] = x32(0f)
conv2d_NCHWc.global[ramp(128, 1, 32)] = x32(0f)
conv2d_NCHWc.global[ramp(160, 1, 32)] = x32(0f)
conv2d_NCHWc.global[ramp(192, 1, 32)] = x32(0f)
for (ic.outer, 0, 64) {
for (ic.inner, 0, 32) {
conv2d_NCHWc.global[ramp(0, 1, 32)] = (conv2d_NCHWc.global[ramp(0, 1, 32)] + (x32(data[((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(32, 1, 32)] = (conv2d_NCHWc.global[ramp(32, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 32)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(64, 1, 32)] = (conv2d_NCHWc.global[ramp(64, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 64)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(96, 1, 32)] = (conv2d_NCHWc.global[ramp(96, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 96)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(128, 1, 32)] = (conv2d_NCHWc.global[ramp(128, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 128)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(160, 1, 32)] = (conv2d_NCHWc.global[ramp(160, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 160)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
conv2d_NCHWc.global[ramp(192, 1, 32)] = (conv2d_NCHWc.global[ramp(192, 1, 32)] + (x32(data[(((((floordiv(n.oc_chunk.fused.oh.outer.fused, 112)*100352) + (ic.outer*1568)) + (floormod(n.oc_chunk.fused.oh.outer.fused, 7)*224)) + ic.inner) + 192)])*weight[ramp((((floordiv(floormod(n.oc_chunk.fused.oh.outer.fused, 112), 7)*65536) + (ic.outer*1024)) + (ic.inner*32)), 1, 32)]))
}
}
}
for (ow.inner, 0, 7) {
conv2d_NCHWc[ramp(((n.oc_chunk.fused.oh.outer.fused*224) + (ow.inner*32)), 1, 32)] = conv2d_NCHWc.global[ramp((ow.inner*32), 1, 32)]
}
}
}
@tqchen I tested with the latest master. The performance is better, but still around 18.5 ms.
@kevinthesun can you look a bit into what was happening? Specifically, please look into the code after LowerIntrin to make sure that the floordiv/mod are lowered correctly to the corresponding https://github.com/dmlc/tvm/blob/master/python/tvm/build_module.py#L498
Note that we will likely need to retune the workload, given that the division order is different. As for recovering the original behavior: right now we flatten all the multiplications for simplification, whereas the original code eagerly folded the common constants, i.e.
(x * 4 + y) * 2 vs x * 8 + y * 2. If you find that this is the cause, adding an additional pass before low-level codegen that folds the common multiplication factors will likely resolve this problem.
@tqchen Yes. I think retuning is necessary here. Let me redo the tuning and check the IR.
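As a rough illustration of what folding common multiplication factors means, the sketch below pulls the greatest common divisor out of a flattened sum of terms. It is a toy in plain Python with hypothetical helper names, it assumes positive integer coefficients, and it handles only a single level of folding; it is not the pass proposed above.

```python
from functools import reduce
from math import gcd


def fold_common_factor(terms):
    """Split flattened terms [(name, coeff), ...] into (factor, reduced_terms)."""
    factor = reduce(gcd, (coeff for _, coeff in terms))
    return factor, [(name, coeff // factor) for name, coeff in terms]


def eval_flat(terms, env):
    """Evaluate sum(var * coeff) for the given variable bindings."""
    return sum(env[name] * coeff for name, coeff in terms)


def eval_folded(factor, reduced_terms, env):
    """Evaluate factor * sum(var * reduced_coeff)."""
    return factor * eval_flat(reduced_terms, env)


flat = [("x", 8), ("y", 2)]                 # x * 8 + y * 2
factor, reduced = fold_common_factor(flat)  # (x * 4 + y) * 2
env = {"x": 5, "y": 3}
print(factor, reduced, eval_flat(flat, env), eval_folded(factor, reduced, env))
```

Both forms evaluate to the same value; the folded form trades one multiplication per term for a single multiplication by the shared factor.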
After retuning, the performance comes to 13.5 ms! I'll retune ssd as well.
With the new IR simplifier infrastructure, ssd resnet50_v1 performance improves from 33 ms to 26.5 ms. I'll close this issue and open a new one if more performance issues are found.