Incubator-mxnet: performance degradation from 1.3.1 to 1.4.0

Created on 21 Mar 2019 · 15 Comments · Source: apache/incubator-mxnet

There appears to be some performance degradation between the 1.3.1 and 1.4.0 releases. So far we know of the imdecode and transpose operators having reduced performance.

We've tracked the responsible PRs for these operators to:
Transpose:
https://github.com/dmlc/mshadow/pull/359/files
https://github.com/apache/incubator-mxnet/pull/11742

Imdecode:
https://github.com/apache/incubator-mxnet/pull/8757

I'm using the cat.jpg from here as the input data:
https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg
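
To reproduce, the image can be fetched into the working directory first; a small helper, assuming the file is reachable at the corresponding raw.githubusercontent.com path:

import urllib.request

url = ('https://raw.githubusercontent.com/dmlc/web-data/master/'
       'mxnet/doc/tutorials/python/predict_image/cat.jpg')
urllib.request.urlretrieve(url, 'cat.jpg')  # saved next to the benchmark scripts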

Here's the current performance benchmarking script for imdecode:

import time
import mxnet as mx
from mxnet import image as img
import numpy as np

# imdecode arguments: flag=1 decodes as a 3-channel colour image, to_rgb converts BGR to RGB
flag = 1
to_rgb = True
out = None

dtimes = []
for i in range(1000):
    buf = open("cat.jpg", 'rb').read()  # raw JPEG bytes; the file read is not timed

    start = time.time()
    img.imdecode(buf, flag, to_rgb, out)
    decode = (time.time() - start) * 1000  # decode latency in ms
    dtimes.append(decode)

print('mxnet version: %s' % mx.__version__)
print('decode:---------')
print('p50: %4.2f ms' % np.percentile(dtimes,50))
print('p90: %4.2f ms' % np.percentile(dtimes,90))
print('p99: %4.2f ms' % np.percentile(dtimes,99))

And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:

mxnet version: 1.4.0
decode:--------- 
p50: 13.75 ms 
p90: 15.74 ms 
p99: 19.89 ms 

mxnet version: 1.3.1 
decode:--------- 
p50: 12.82 ms 
p90: 13.59 ms 
p99: 13.88 ms 

Passing flag = 1 + 128 (instead of just 1) to imdecode gives the following results:

mxnet version: 1.4.0 
load:--------- 
p50: 0.10 ms 
p90: 0.16 ms 
p99: 0.18 ms 
decode:--------- 
p50: 13.27 ms 
p90: 13.83 ms 
p99: 15.63 ms 

So there is some additional work going on that makes imdecode take longer. This extra work is a "feature" of the v1.4 release that changed the defaults, and it accounts for at least some of the performance degradation in the imdecode function.
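
For reference, a minimal sketch of that variant, assuming the extra bit is simply added to the flag value that the script above already passes through to imdecode:

# Same benchmark as above; only the flag value changes.
flag = 1 + 128   # instead of flag = 1
img.imdecode(buf, flag, to_rgb, out)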

Here's the current performance benchmarking script for transpose:

import mxnet as mx
import time
import numpy as np

sizes = [10, 50, 100, 200, 500]      # cube edge lengths to test
iters = [10000, 1000, 500, 200, 20]  # fewer iterations for the larger tensors
times = []
for size in range(len(sizes)):
    data = []
    s = sizes[size]
    print(s)
    for i in range(iters[size]):
        x = mx.nd.ones((s, s, s))
        mx.nd.waitall()              # make sure the input is materialized before timing
        start = time.time()
        y = mx.nd.transpose(x, (2, 0, 1))
        mx.nd.waitall()              # block until the transpose actually finishes
        data.append((time.time() - start) * 1000)
    times.append(data)

print('mxnet version: %s' % mx.__version__)
for s in range(len(sizes)):
    print('--------------------')
    print('size: %s' % str(sizes[s]))
    print('p50: %4.2f ms' % np.percentile(times[s],50))
    print('p90: %4.2f ms' % np.percentile(times[s],90))
    print('p99: %4.2f ms' % np.percentile(times[s],99))

And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:

mxnet version: 1.4.0 
-------------------- 
size: 10 
p50: 0.04 ms 
p90: 0.04 ms 
p99: 0.05 ms 
-------------------- 
size: 50 
p50: 1.08 ms 
p90: 1.09 ms 
p99: 1.13 ms 
-------------------- 
size: 100 
p50: 12.50 ms 
p90: 12.53 ms 
p99: 13.52 ms 
-------------------- 
size: 200 
p50: 123.06 ms 
p90: 125.63 ms 
p99: 125.95 ms 
-------------------- 
size: 500 
p50: 2768.49 ms 
p90: 2797.46 ms 
p99: 2809.45 ms 


mxnet version: 1.3.1 
-------------------- 
size: 10 
p50: 0.03 ms 
p90: 0.04 ms 
p99: 0.04 ms 
-------------------- 
size: 50 
p50: 0.36 ms 
p90: 0.37 ms 
p99: 0.37 ms 
-------------------- 
size: 100 
p50: 2.79 ms 
p90: 2.90 ms 
p99: 3.97 ms 
-------------------- 
size: 200 
p50: 46.05 ms 
p90: 50.07 ms 
p99: 50.11 ms 
-------------------- 
size: 500 
p50: 1094.50 ms 
p90: 1095.89 ms 
p99: 1096.44 ms 
Labels: Backend, Performance


All 15 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

So, a question for the community: are others seeing performance issues moving from 1.3.1 to 1.4.0?

And for which operators do you see issues?

@mxnet-label-bot add [Backend, Performance]

@apeforest @zheng-da

I was able to verify the performance degradation of the transpose operator using Sam's script.

The performance slowdown is mainly due to the difference in arithmetic performance between the int32_t and int64_t data types. I changed the index data type in the mshadow Tensor object from index_t (which is typedef'd to int64_t) to int32_t and saw the runtime of the transpose operator come back to almost the same as in 1.3.1.

I am doing the following experiments:

1) casting the tensor index data type to int32_t at runtime based on the tensor size
2) using an environment variable to select between a large tensor object (defined by int64_t) and a small tensor object (defined by int32_t)

I will post the experiment results and, based on them, see how we can generalize one of these approaches.

@pengzhao-intel Are there any compiler optimization flags for int64_t data types on Intel architectures? Also, how does MKLDNN handle int64_t performance? Your insight would be appreciated.

Although option 1 will enlarge the size of the binary library, I am in favor of it in performance-sensitive code.

I wrote an example: https://github.com/wkcn/incubator-mxnet/blob/support_large_array/src/operator/mxnet_op.h#L41

@apeforest
Regarding INT64, MKL-DNN only supports INT32 for now. The next major version, 1.0, will support INT64 by default.
On the other hand, the Intel MKL library can handle INT64 (libmkl_intel_ilp64.a) if MXNet is built with CFLAGS += -DMKL_ILP64.

Just to clarify: the data type change from 32-bit to 64-bit refers to the tensor shape data type, not the actual tensor values. So, for example, @apeforest's change allowed a 64-bit shape data type for a tensor with float elements.

And this performance degradation for transpose was measured with pure-C++ (mshadow) implementations, not MKLDNN.

@pengzhao-intel does MKLDNN support 64-bit tensor shapes?

I can't imagine that MXNet's changes from 1.3.1 to 1.4.0 reduced the performance of MKLDNN operators, but has anyone verified whether there were any issues for MKLDNN ops compared to older versions?

@sun-dev The performance issue discussed here is NOT related to MKLDNN. We have verified the performance of MKLDNN; please see the link below.

https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking

MKLDNN doesn't support 64-bit tensor shapes at the moment. It is supposed to fall back to the original CPU implementation.

Feel free to let me know if you see any issues in MKLDNN.

Please refer to the following to check whether an MKLDNN op is used: https://mxnet.incubator.apache.org/versions/master/tutorials/mkldnn/MKLDNN_README.html#4
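
As a quick sanity check, MKL-DNN's verbose mode prints one line per executed primitive. A minimal sketch, assuming the MKLDNN_VERBOSE environment variable used by the MKL-DNN version shipped with these releases (the convolution here is just an arbitrary MKLDNN-backed op):

import os
os.environ['MKLDNN_VERBOSE'] = '1'   # set before the first MKL-DNN primitive executes

import mxnet as mx

x = mx.nd.ones((1, 3, 224, 224))
w = mx.nd.ones((8, 3, 3, 3))
# Convolution goes through MKLDNN on CPU builds; look for 'mkldnn_verbose' lines in stdout.
y = mx.nd.Convolution(data=x, weight=w, kernel=(3, 3), num_filter=8, no_bias=True)
mx.nd.waitall()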

@sun-dev is correct. The data type in question is the shape/index data type, not the data type of the tensor elements. There are adds, multiplications, and divisions involved in the transpose operator here.

I did a performance comparison of different arithmetic operations between 32-bit and 64-bit integers on CPU. There are noticeable differences, shown below. FYI, you can use this code to reproduce.

result = 49995000
Add 32 time in clocks 24869
Add 32 time in ms 1359
result = 49995000
Add 64 time in clocks 6070
Add 64 time in ms 1971
result = 349965000
Add Mul 32 time in clocks 3601
Add Mul 32 time in ms 1196
result = 349965000
Add Mul 64 time in clocks 9967
Add Mul 64 time in ms 3477
result = 7137858
Add Div 32 time in clocks 8273
Add Div 32 time in ms 2878
result = 7137858
Add Div 64 time in clocks 24016
Add Div 64 time in ms 8499
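
The reproduction code referred to above is not included in this archive. As a rough stand-in only, here is a hedged numpy sketch that compares add/multiply/divide throughput for int32 versus int64 arrays on the CPU; it is not the original benchmark (which reported both clock ticks and milliseconds), and the absolute numbers will differ, but it shows the kind of comparison being made. N and the constants are arbitrary choices.

import time
import numpy as np

N = 10_000_000                       # element count; arbitrary choice for illustration

def bench(dtype):
    a = np.full(N, 3, dtype=dtype)
    b = np.full(N, 7, dtype=dtype)
    start = time.time()
    c = a + b                        # add
    d = c * b                        # multiply
    e = d // b                       # integer divide
    elapsed = (time.time() - start) * 1000
    return elapsed, int(e.sum())     # touch the result so the work is not skipped

for dtype in (np.int32, np.int64):
    ms, check = bench(dtype)
    print('%-5s %8.2f ms (checksum=%d)' % (np.dtype(dtype).name, ms, check))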

Is there any fix for transpose? I noticed that transpose now takes a significant amount of time in BERT.

@eric-haibin-lin FYI, https://github.com/apache/incubator-mxnet/pull/14545#issuecomment-479921467 improved the performance of transpose a lot by using MKLDNN.

@samskalicky could you try the latest master and see if the performance is improved?
