Incubator-mxnet: performance degradation from 1.3.1 to 1.4.0

Created on 21 Mar 2019 · 15 Comments · Source: apache/incubator-mxnet

There appears to be some performance degradation between the 1.3.1 and 1.4.0 releases. So far we know of the imdecode and transpose operators having reduced performance.

We've tracked the responsible PRs for these operators to:
Transpose:
https://github.com/dmlc/mshadow/pull/359/files
https://github.com/apache/incubator-mxnet/pull/11742

Imdecode:
https://github.com/apache/incubator-mxnet/pull/8757

I'm using the cat.jpg from here as the input data:
https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg
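
To reproduce, the image can be fetched into the working directory first; a small helper, assuming the file is reachable at the corresponding raw.githubusercontent.com path:

import urllib.request

url = ('https://raw.githubusercontent.com/dmlc/web-data/master/'
       'mxnet/doc/tutorials/python/predict_image/cat.jpg')
urllib.request.urlretrieve(url, 'cat.jpg')  # saved next to the benchmark scripts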

Here's the current performance benchmarking script for imdecode:

import time
import mxnet as mx
from mxnet import image as img
import numpy as np

# imdecode arguments: flag=1 decodes as a 3-channel colour image, to_rgb converts BGR to RGB
flag = 1
to_rgb = True
out = None

dtimes = []
for i in range(1000):
    buf = open("cat.jpg", 'rb').read()  # raw JPEG bytes; the file read is not timed

    start = time.time()
    img.imdecode(buf, flag, to_rgb, out)
    decode = (time.time() - start) * 1000  # decode latency in ms
    dtimes.append(decode)

print('mxnet version: %s' % mx.__version__)
print('decode:---------')
print('p50: %4.2f ms' % np.percentile(dtimes,50))
print('p90: %4.2f ms' % np.percentile(dtimes,90))
print('p99: %4.2f ms' % np.percentile(dtimes,99))

And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:

mxnet version: 1.4.0
decode:--------- 
p50: 13.75 ms 
p90: 15.74 ms 
p99: 19.89 ms 

mxnet version: 1.3.1 
decode:--------- 
p50: 12.82 ms 
p90: 13.59 ms 
p99: 13.88 ms 

Passing flag = 1 + 128 (instead of just 1) to imdecode gives the following results:

mxnet version: 1.4.0 
load:--------- 
p50: 0.10 ms 
p90: 0.16 ms 
p99: 0.18 ms 
decode:--------- 
p50: 13.27 ms 
p90: 13.83 ms 
p99: 15.63 ms 

So there is some additional work going on that makes imdecode take longer. This extra work is a "feature" of the v1.4 release that changed the defaults, and it accounts for at least some of the performance degradation in the imdecode function.
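
For reference, a minimal sketch of that variant, assuming the extra bit is simply added to the flag value that the script above already passes through to imdecode:

# Same benchmark as above; only the flag value changes.
flag = 1 + 128   # instead of flag = 1
img.imdecode(buf, flag, to_rgb, out)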

Here's the current performance benchmarking script for transpose:

import mxnet as mx
import time
import numpy as np

sizes = [10, 50, 100, 200, 500]      # cube edge lengths to test
iters = [10000, 1000, 500, 200, 20]  # fewer iterations for the larger tensors
times = []
for size in range(len(sizes)):
    data = []
    s = sizes[size]
    print(s)
    for i in range(iters[size]):
        x = mx.nd.ones((s, s, s))
        mx.nd.waitall()              # make sure the input is materialized before timing
        start = time.time()
        y = mx.nd.transpose(x, (2, 0, 1))
        mx.nd.waitall()              # block until the transpose actually finishes
        data.append((time.time() - start) * 1000)
    times.append(data)

print('mxnet version: %s' % mx.__version__)
for s in range(len(sizes)):
    print('--------------------')
    print('size: %s' % str(sizes[s]))
    print('p50: %4.2f ms' % np.percentile(times[s],50))
    print('p90: %4.2f ms' % np.percentile(times[s],90))
    print('p99: %4.2f ms' % np.percentile(times[s],99))

And here are the performance results using mxnet-cu90mkl for 1.4.0 and 1.3.1:

mxnet version: 1.4.0 
-------------------- 
size: 10 
p50: 0.04 ms 
p90: 0.04 ms 
p99: 0.05 ms 
-------------------- 
size: 50 
p50: 1.08 ms 
p90: 1.09 ms 
p99: 1.13 ms 
-------------------- 
size: 100 
p50: 12.50 ms 
p90: 12.53 ms 
p99: 13.52 ms 
-------------------- 
size: 200 
p50: 123.06 ms 
p90: 125.63 ms 
p99: 125.95 ms 
-------------------- 
size: 500 
p50: 2768.49 ms 
p90: 2797.46 ms 
p99: 2809.45 ms 


mxnet version: 1.3.1 
-------------------- 
size: 10 
p50: 0.03 ms 
p90: 0.04 ms 
p99: 0.04 ms 
-------------------- 
size: 50 
p50: 0.36 ms 
p90: 0.37 ms 
p99: 0.37 ms 
-------------------- 
size: 100 
p50: 2.79 ms 
p90: 2.90 ms 
p99: 3.97 ms 
-------------------- 
size: 200 
p50: 46.05 ms 
p90: 50.07 ms 
p99: 50.11 ms 
-------------------- 
size: 500 
p50: 1094.50 ms 
p90: 1095.89 ms 
p99: 1096.44 ms 
Labels: Backend, Performance


All 15 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

So, a question for the community: are others seeing performance issues moving from 1.3.1 to 1.4.0?

And for which operators do you see issues?

@mxnet-label-bot add [Backend, Performance]

@apeforest @zheng-da

I was able to verify the performance degradation of the transpose operator using Sam's script.

The performance slowdown is mainly due to the difference in arithmetic performance between the int32_t and int64_t data types. I changed the index data type in the mshadow Tensor object from index_t (which is typedef'd to int64_t) to int32_t and saw the runtime of the transpose operator come back to almost the same as in 1.3.1.

I am doing the following experiments:

1) casting the tensor index data type to int32_t at runtime based on the tensor size
2) using an environment variable to select between a large tensor object (defined by int64_t) and a small tensor object (defined by int32_t)

I will post the experiment results and, based on them, see how we can generalize one of these approaches.

@pengzhao-intel Are there any compiler optimization flags for int64_t data types on Intel architectures? Also, how does MKLDNN handle int64_t performance? Your insight would be appreciated.

Although option 1 will enlarge the size of the binary library, I am in favor of it in performance-sensitive code.

I wrote an example: https://github.com/wkcn/incubator-mxnet/blob/support_large_array/src/operator/mxnet_op.h#L41

@apeforest
Regarding INT64, MKL-DNN only supports INT32 for now. The next major version, 1.0, will support INT64 by default.
On the other hand, the Intel MKL library can handle INT64 (libmkl_intel_ilp64.a) if MXNet is built with CFLAGS += -DMKL_ILP64.

Just to clarify: the data type change from 32-bit to 64-bit refers to the tensor shape data type, not the actual tensor values. So, for example, @apeforest's change allowed a 64-bit shape data type for a tensor with float elements.

And this performance degradation for transpose was measured with pure-C++ (mshadow) implementations, not MKLDNN.

@pengzhao-intel does MKLDNN support 64-bit tensor shapes?

I can't imagine that MXNet's changes from 1.3.1 to 1.4.0 reduced the performance of MKLDNN operators, but has anyone verified whether there were any issues for MKLDNN ops compared to older versions?

@sun-dev The performance issue discussed here is NOT related to MKLDNN. We have verified the performance of MKLDNN; please see the link below.

https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking

MKLDNN doesn't support 64-bit tensor shapes at the moment. It is supposed to fall back to the original CPU implementation.

Feel free to let me know if you see any issues in MKLDNN.

Please refer to the following to check whether an MKLDNN op is used: https://mxnet.incubator.apache.org/versions/master/tutorials/mkldnn/MKLDNN_README.html#4
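
As a quick sanity check, MKL-DNN's verbose mode prints one line per executed primitive. A minimal sketch, assuming the MKLDNN_VERBOSE environment variable used by the MKL-DNN version shipped with these releases (the convolution here is just an arbitrary MKLDNN-backed op):

import os
os.environ['MKLDNN_VERBOSE'] = '1'   # set before the first MKL-DNN primitive executes

import mxnet as mx

x = mx.nd.ones((1, 3, 224, 224))
w = mx.nd.ones((8, 3, 3, 3))
# Convolution goes through MKLDNN on CPU builds; look for 'mkldnn_verbose' lines in stdout.
y = mx.nd.Convolution(data=x, weight=w, kernel=(3, 3), num_filter=8, no_bias=True)
mx.nd.waitall()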

@sun-dev is correct. The data type in question is the shape/index data type, not the data type of the tensor elements. There are adds, multiplications, and divisions involved in the transpose operator here.

I did a performance comparison of different arithmetic operations between 32-bit and 64-bit integers on CPU. There are noticeable differences, shown below. FYI, you can use this code to reproduce.

result = 49995000
Add 32 time in clocks 24869
Add 32 time in ms 1359
result = 49995000
Add 64 time in clocks 6070
Add 64 time in ms 1971
result = 349965000
Add Mul 32 time in clocks 3601
Add Mul 32 time in ms 1196
result = 349965000
Add Mul 64 time in clocks 9967
Add Mul 64 time in ms 3477
result = 7137858
Add Div 32 time in clocks 8273
Add Div 32 time in ms 2878
result = 7137858
Add Div 64 time in clocks 24016
Add Div 64 time in ms 8499
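
The reproduction code referred to above is not included in this archive. As a rough stand-in only, here is a hedged numpy sketch that compares add/multiply/divide throughput for int32 versus int64 arrays on the CPU; it is not the original benchmark (which reported both clock ticks and milliseconds), and the absolute numbers will differ, but it shows the kind of comparison being made. N and the constants are arbitrary choices.

import time
import numpy as np

N = 10_000_000                       # element count; arbitrary choice for illustration

def bench(dtype):
    a = np.full(N, 3, dtype=dtype)
    b = np.full(N, 7, dtype=dtype)
    start = time.time()
    c = a + b                        # add
    d = c * b                        # multiply
    e = d // b                       # integer divide
    elapsed = (time.time() - start) * 1000
    return elapsed, int(e.sum())     # touch the result so the work is not skipped

for dtype in (np.int32, np.int64):
    ms, check = bench(dtype)
    print('%-5s %8.2f ms (checksum=%d)' % (np.dtype(dtype).name, ms, check))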

Is there any fix for transpose? I noticed that transpose now takes a significant amount of time in BERT.

@eric-haibin-lin FYI, https://github.com/apache/incubator-mxnet/pull/14545#issuecomment-479921467 improved the performance of transpose a lot by using MKLDNN.

@samskalicky could you try the latest master and see if the performance is improved?
