Currently, slicing an MKLDNN array requires converting the array to the default layout before taking the slice. However, the MKLDNN library actually provides a view for MKLDNN memory. By taking advantage of the MKLDNN view, we don't really need to convert the data layout for slice.
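The win here is the same view-vs-copy distinction numpy users know. This is only an analogy (numpy, not the MKL-DNN API): a slice can be expressed as a zero-copy view into the original buffer, whereas a layout conversion ("reorder") forces a full copy.

```python
import numpy as np

# numpy analogy (not the MKL-DNN API): a slice along an axis can be a
# zero-copy view into the original buffer, while a forced layout
# change behaves like a reorder and copies the data.
x = np.zeros((32, 1025), dtype=np.float32)

view = x[:, -5:]                     # view: shares x's memory, no copy
copied = np.ascontiguousarray(view)  # copy: owns its own buffer

assert view.base is x                # still points at x's memory
assert copied.base is None           # independent allocation
```

An MKL-DNN view would let the slice OP behave like the first line instead of paying for the second.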
For details, please see the discussion here: https://github.com/intel/mkl-dnn/issues/306, https://github.com/intel/mkl-dnn/issues/69, https://github.com/intel/mkl-dnn/issues/290
@pengzhao-intel @TaoLv @azai91 @safrooze
Yes, I think it's doable and worth doing.
In other words, we need an MKL-DNN based slice OP.
Do you need our engineer to help with this kind of functionality?
@pengzhao-intel if your team has bandwidth to make it happen, it'll be great.
OK, we will take over this work and submit PR later.
@mxnet-label-bot : [MKLDNN, Feature Request]
@safrooze Could you provide a use case for @pengzhao-intel for testing?
@safrooze we're starting the implementation of the slice OP.
It will be more focused if you can provide the use case for us.
But it's also fine if that's not convenient on your side; we will make the OP as general as possible.
The use case is effectively implementing a circular buffer using concat+slice. Here is the code:

```python
from mxnet import nd, gluon, profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='/home/ec2-user/src/mkl_slice_op_profile.json')

class TestBlock(gluon.HybridBlock):
    def __init__(self):
        super(TestBlock, self).__init__()
        with self.name_scope():
            self.conv = gluon.nn.Conv2D(512, kernel_size=(1, 3), dilation=512)

    def hybrid_forward(self, F, x):
        out = self.conv(x)
        x = F.concat(x, out, dim=3)
        x = F.slice_axis(x, axis=3, begin=-1025, end=None)
        # x = F.slice(x, begin=(None, None, None, -1025), end=(None, None, None, None))
        return x

x = nd.random.uniform(shape=(32, 512, 1, 1025))
net = TestBlock()
net.initialize()
net.hybridize(static_alloc=True, static_shape=True)
x = net(x)

profiler.set_state('run')
for _ in range(100):
    x = net(x)
nd.waitall()
profiler.set_state('stop')
profiler.dump()
print(profiler.dumps(reset=True))
exit(0)
```
And here are the interesting profiling results.
**slice_axis operator (no MKL)** (as posted in `hybrid_forward()`; for the `slice` runs, uncomment `slice` and comment `slice_axis`)

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
slice_axis|200|4048.8311|20.1010|20.3790|20.2442
Concat|200|17641.7461|88.0750|89.5890|88.2087
Convolution|200|2944.2839|14.5890|14.8890|14.7214
DeleteVariable|206|517.0800|0.0030|2.6670|2.5101
**slice operator (no MKL)** (consistently performs ~2% better than `slice_axis`!)

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
slice|200|3938.1279|19.5190|19.9520|19.6906
Concat|200|17636.0566|88.0600|88.7120|88.1803
Convolution|200|2945.0759|14.5760|14.8420|14.7254
DeleteVariable|206|521.2870|0.0030|2.6960|2.5305
**slice_axis operator (with MKLDNN)**

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.9610|0.0000|1.3190|0.0147
slice_axis|200|4979.5488|24.6100|26.1240|24.8977
Concat|200|881.7350|4.3000|4.5370|4.4087
Convolution|200|1231.0720|5.9080|11.6130|6.1554
DeleteVariable|408|982.9400|0.0030|2.8100|2.4092
**slice operator (with MKLDNN)**

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.8510|0.0000|1.2710|0.0141
slice|200|5012.6240|24.8500|27.0280|25.0631
Concat|200|880.1710|4.2900|4.5270|4.4009
Convolution|200|1252.7841|5.9060|11.7800|6.2639
DeleteVariable|408|970.0030|0.0040|2.8370|2.3775
Thanks @safrooze :)
@fall4knight will follow up on your test cases.
@safrooze Thanks for your use case. I have implemented a first version of an MKL-DNN-backed slice OP.
For the nChw16c format, which is the most widely used, MKL-DNN proves able to speed up the slice OP considerably.
Additionally, we found that the larger the input, the bigger the improvement in the nChw16c case.
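For readers unfamiliar with nChw16c: it blocks the channel dimension in groups of 16. A rough numpy sketch of the index arithmetic (illustrative only, not the MKL-DNN API):

```python
import numpy as np

def nchw_to_nChw16c(x):
    # x: (N, C, H, W) with C divisible by 16.
    # Returns the blocked layout (N, C//16, H, W, 16), the channel-blocked
    # order that nChw16c denotes.
    n, c, h, w = x.shape
    assert c % 16 == 0
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 32 * 1 * 3, dtype=np.float32).reshape(2, 32, 1, 3)
blocked = nchw_to_nChw16c(x)
assert blocked.shape == (2, 2, 1, 3, 16)
# channel 17 = block 1, offset 1:
assert blocked[0, 1, 0, 0, 1] == x[0, 17, 0, 0]
```

Slicing along H or W keeps whole channel blocks intact, which is why a view works there without reordering.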
Please check the profile log down below.
Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.808|0|1.318|0.0139
slice|202|1145.891|5.357|6.295|5.6727
Convolution|202|518.247|2.423|5.015|2.5656
CopyCPU2CPU|4|4.495|0.02|2.228|1.1237
Concat|202|352.702|1.668|4.333|1.746
_full|2|0.023|0.011|0.012|0.0115
_random_uniform|4|19.74|0.386|9.484|4.935
_zeros|8|6.206|0.003|2.733|0.7757
DeleteVariable|408|102.104|0.003|0.349|0.2503
ResourceParallelRandomSetSeed|2|6.704|3.351|3.353|3.352

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.212|0|1.012|0.011
slice|202|507.673|2.395|2.802|2.5132
Convolution|202|520.934|2.372|4.951|2.5789
CopyCPU2CPU|4|5.424|0.023|2.689|1.356
Concat|202|332.056|1.601|2.755|1.6438
_full|2|0.025|0.012|0.013|0.0125
_random_uniform|4|19.853|0.413|9.515|4.9633
_zeros|8|8.877|0.004|4.09|1.1096
DeleteVariable|408|37.766|0.005|0.217|0.1833
ResourceParallelRandomSetSeed|2|7.638|3.818|3.82|3.819
Great results @fall4knight! Does it make sense to you that slice is about 50% more expensive than concat and almost as expensive as convolution?
@safrooze I think the reason is that your use case sets dilation=512, in which case the convolution is almost completely skipped by the chosen algorithm. You can set dilation to a commonly used value like 1 and see what happens. Thanks.
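To see why dilation=512 collapses the convolution: the effective kernel width is 1 + (3 − 1) × 512 = 1025, exactly the input width, so each forward pass produces a single output column. A quick check of the standard output-size arithmetic (illustrative, not MXNet internals):

```python
def conv_out_size(in_size, kernel, dilation=1, stride=1, pad=0):
    # Standard formula: effective kernel extent = 1 + (kernel - 1) * dilation.
    effective = 1 + (kernel - 1) * dilation
    return (in_size + 2 * pad - effective) // stride + 1

# Width axis of the use case above: input 1025, kernel 3, dilation 512.
assert 1 + (3 - 1) * 512 == 1025              # effective kernel spans the input
assert conv_out_size(1025, 3, dilation=512) == 1     # one output column
assert conv_out_size(1025, 3, dilation=1) == 1023    # dilation=1 does real work
```

With only one output column, convolution cost nearly vanishes and slice dominates the profile.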
@fall4knight Any update on submitting a PR for this fix?
Thanks @safrooze. We are still working on the other types of slice, like SliceChannel, and the backward path.
@pengzhao-intel , @safrooze
Updated profile results with #13730 [Add mkldnn OP for slice].
The MKL-DNN slice implementation speeds slice up by about 2x, which is consistent with @fall4knight's results.
Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|3.2840|0.0000|1.5200|0.0163
slice|200|948.5400|4.3540|5.5340|4.7427
Concat|200|258.6810|1.2110|1.4840|1.2934
Convolution|200|474.5550|2.1420|3.8940|2.3728
DeleteVariable|408|140.0790|0.0050|0.5690|0.3433

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|3.8630|0.0000|1.7790|0.0191
slice|200|437.4760|1.9620|2.4460|2.1874
Concat|200|273.7890|1.2180|1.6770|1.3689
Convolution|200|486.8030|2.1530|4.0300|2.4340
DeleteVariable|206|47.9190|0.0050|0.4690|0.2326
Closed via #13730 .