Currently, slicing an MKLDNN array requires converting the array to the default layout before taking the slice. However, the MKLDNN library actually provides a view for MKLDNN memory. By taking advantage of the MKLDNN view, we don't really need to convert the data layout for slice.
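The win here is the same view-vs-copy distinction numpy users know. This is only an analogy (numpy, not the MKL-DNN API): a slice can be expressed as a zero-copy view into the original buffer, whereas a layout conversion ("reorder") forces a full copy.

```python
import numpy as np

# numpy analogy (not the MKL-DNN API): a slice along an axis can be a
# zero-copy view into the original buffer, while a forced layout
# change behaves like a reorder and copies the data.
x = np.zeros((32, 1025), dtype=np.float32)

view = x[:, -5:]                     # view: shares x's memory, no copy
copied = np.ascontiguousarray(view)  # copy: owns its own buffer

assert view.base is x                # still points at x's memory
assert copied.base is None           # independent allocation
```

An MKL-DNN view would let the slice OP behave like the first line instead of paying for the second.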
For details, please see the discussion here: https://github.com/intel/mkl-dnn/issues/306, https://github.com/intel/mkl-dnn/issues/69, https://github.com/intel/mkl-dnn/issues/290
@pengzhao-intel @TaoLv @azai91 @safrooze
Yes, I think it's doable and worth doing.
In other words, we need an MKL-DNN based slice OP.
Do you need our engineer to help with this kind of functionality?
@pengzhao-intel if your team has bandwidth to make it happen, it'll be great.
OK, we will take over this work and submit PR later.
@mxnet-label-bot : [MKLDNN, Feature Request]
@safrooze Could you provide a use case for @pengzhao-intel for testing?
@safrooze we're starting the implementation of the slice OP.
It will be more focused if you can provide the use case for us.
But it's also fine if that's not convenient on your side; we will make the OP as general as possible.
The use case is effectively implementing a circular buffer using concat+slice. Here is the code:

```python
from mxnet import nd, gluon, profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='/home/ec2-user/src/mkl_slice_op_profile.json')

class TestBlock(gluon.HybridBlock):
    def __init__(self):
        super(TestBlock, self).__init__()
        with self.name_scope():
            self.conv = gluon.nn.Conv2D(512, kernel_size=(1, 3), dilation=512)

    def hybrid_forward(self, F, x):
        out = self.conv(x)
        x = F.concat(x, out, dim=3)
        x = F.slice_axis(x, axis=3, begin=-1025, end=None)
        # x = F.slice(x, begin=(None, None, None, -1025), end=(None, None, None, None))
        return x

x = nd.random.uniform(shape=(32, 512, 1, 1025))
net = TestBlock()
net.initialize()
net.hybridize(static_alloc=True, static_shape=True)
x = net(x)

profiler.set_state('run')
for _ in range(100):
    x = net(x)
nd.waitall()
profiler.set_state('stop')
profiler.dump()
print(profiler.dumps(reset=True))
exit(0)
```
And here are the interesting profiling results.
**slice_axis operator (no MKL)** (as posted in `hybrid_forward()`; for the `slice` runs, uncomment `slice` and comment `slice_axis`)

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
slice_axis|200|4048.8311|20.1010|20.3790|20.2442
Concat|200|17641.7461|88.0750|89.5890|88.2087
Convolution|200|2944.2839|14.5890|14.8890|14.7214
DeleteVariable|206|517.0800|0.0030|2.6670|2.5101
**slice operator (no MKL)** (consistently performs ~2% better than `slice_axis`!)

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
slice|200|3938.1279|19.5190|19.9520|19.6906
Concat|200|17636.0566|88.0600|88.7120|88.1803
Convolution|200|2945.0759|14.5760|14.8420|14.7254
DeleteVariable|206|521.2870|0.0030|2.6960|2.5305
**slice_axis operator (with MKLDNN)**

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.9610|0.0000|1.3190|0.0147
slice_axis|200|4979.5488|24.6100|26.1240|24.8977
Concat|200|881.7350|4.3000|4.5370|4.4087
Convolution|200|1231.0720|5.9080|11.6130|6.1554
DeleteVariable|408|982.9400|0.0030|2.8100|2.4092
**slice operator (with MKLDNN)**

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.8510|0.0000|1.2710|0.0141
slice|200|5012.6240|24.8500|27.0280|25.0631
Concat|200|880.1710|4.2900|4.5270|4.4009
Convolution|200|1252.7841|5.9060|11.7800|6.2639
DeleteVariable|408|970.0030|0.0040|2.8370|2.3775
Thanks @safrooze :)
@fall4knight will follow up on your test cases.
@safrooze Thanks for your use case. I have implemented a first version of an MKL-DNN-backed slice OP.
For the nChw16c format, which is the most widely used, MKL-DNN proves able to speed up the slice OP considerably.
Additionally, we found that the larger the input, the bigger the improvement in the nChw16c case.
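For readers unfamiliar with nChw16c: it blocks the channel dimension in groups of 16. A rough numpy sketch of the index arithmetic (illustrative only, not the MKL-DNN API):

```python
import numpy as np

def nchw_to_nChw16c(x):
    # x: (N, C, H, W) with C divisible by 16.
    # Returns the blocked layout (N, C//16, H, W, 16), the channel-blocked
    # order that nChw16c denotes.
    n, c, h, w = x.shape
    assert c % 16 == 0
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 32 * 1 * 3, dtype=np.float32).reshape(2, 32, 1, 3)
blocked = nchw_to_nChw16c(x)
assert blocked.shape == (2, 2, 1, 3, 16)
# channel 17 = block 1, offset 1:
assert blocked[0, 1, 0, 0, 1] == x[0, 17, 0, 0]
```

Slicing along H or W keeps whole channel blocks intact, which is why a view works there without reordering.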
Please check the profile log down below.
Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.808|0|1.318|0.0139
slice|202|1145.891|5.357|6.295|5.6727
Convolution|202|518.247|2.423|5.015|2.5656
CopyCPU2CPU|4|4.495|0.02|2.228|1.1237
Concat|202|352.702|1.668|4.333|1.746
_full|2|0.023|0.011|0.012|0.0115
_random_uniform|4|19.74|0.386|9.484|4.935
_zeros|8|6.206|0.003|2.733|0.7757
DeleteVariable|408|102.104|0.003|0.349|0.2503
ResourceParallelRandomSetSeed|2|6.704|3.351|3.353|3.352

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|2.212|0|1.012|0.011
slice|202|507.673|2.395|2.802|2.5132
Convolution|202|520.934|2.372|4.951|2.5789
CopyCPU2CPU|4|5.424|0.023|2.689|1.356
Concat|202|332.056|1.601|2.755|1.6438
_full|2|0.025|0.012|0.013|0.0125
_random_uniform|4|19.853|0.413|9.515|4.9633
_zeros|8|8.877|0.004|4.09|1.1096
DeleteVariable|408|37.766|0.005|0.217|0.1833
ResourceParallelRandomSetSeed|2|7.638|3.818|3.82|3.819
Great results @fall4knight! Does it make sense to you that slice is about 50% more expensive than concat and almost as expensive as convolution?
@safrooze I think the reason is that your use case sets dilation=512, in which case the convolution is almost completely skipped by the chosen algorithm. You can set dilation to a commonly used value like 1 and see what happens. Thanks.
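To see why dilation=512 collapses the convolution: the effective kernel width is 1 + (3 − 1) × 512 = 1025, exactly the input width, so each forward pass produces a single output column. A quick check of the standard output-size arithmetic (illustrative, not MXNet internals):

```python
def conv_out_size(in_size, kernel, dilation=1, stride=1, pad=0):
    # Standard formula: effective kernel extent = 1 + (kernel - 1) * dilation.
    effective = 1 + (kernel - 1) * dilation
    return (in_size + 2 * pad - effective) // stride + 1

# Width axis of the use case above: input 1025, kernel 3, dilation 512.
assert 1 + (3 - 1) * 512 == 1025              # effective kernel spans the input
assert conv_out_size(1025, 3, dilation=512) == 1     # one output column
assert conv_out_size(1025, 3, dilation=1) == 1023    # dilation=1 does real work
```

With only one output column, convolution cost nearly vanishes and slice dominates the profile.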
@fall4knight Any update on submitting a PR for this fix?
Thanks @safrooze. We are still working on the other types of slice, like SliceChannel, and the backward path.
@pengzhao-intel , @safrooze
Updated profile results with #13730 [Add mkldnn OP for slice].
The MKL-DNN slice implementation speeds slice up by about 2x, which is consistent with @fall4knight's results.
Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|3.2840|0.0000|1.5200|0.0163
slice|200|948.5400|4.3540|5.5340|4.7427
Concat|200|258.6810|1.2110|1.4840|1.2934
Convolution|200|474.5550|2.1420|3.8940|2.3728
DeleteVariable|408|140.0790|0.0050|0.5690|0.3433

Name|Total Count|Time (ms)|Min Time (ms)|Max Time (ms)|Avg Time (ms)
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Reorder|202|3.8630|0.0000|1.7790|0.0191
slice|200|437.4760|1.9620|2.4460|2.1874
Concat|200|273.7890|1.2180|1.6770|1.3689
Convolution|200|486.8030|2.1530|4.0300|2.4340
DeleteVariable|206|47.9190|0.0050|0.4690|0.2326
Closed via #13730 .