Incubator-mxnet: Operator Performance Regression on CPU

Created on 2 Jul 2019 · 23 comments · Source: apache/incubator-mxnet

Follow up on dev list discussion:

https://lists.apache.org/thread.html/154ef1e4010671e7375c7a7cbedb413d5a4a3677321488440fb32a3a@%3Cdev.mxnet.apache.org%3E

We have found performance regressions in some operators using the operator benchmark module here:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf

@sandeep-krishnamurthy has helped to run the benchmark and this is the training mode result:
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50

The above result uses training mode (autograd.record()) and measures both forward and backward time.
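For reference, a minimal sketch of what training-mode timing means here (forward under autograd.record() plus a backward pass); this is illustrative only, not the exact measurement loop used by the opperf module:

```python
# Minimal sketch of "training mode" timing: forward under autograd.record()
# plus a backward pass. Illustrative only, not the opperf measurement loop.
import time
import mxnet as mx
from mxnet import autograd

x = mx.nd.random.uniform(shape=(1024, 1024))
x.attach_grad()

mx.nd.waitall()
start = time.time()
with autograd.record():
    y = mx.nd.relu(x)
y.backward()
mx.nd.waitall()
print("forward + backward: %.6f s" % (time.time() - start))
```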

As most users use CPU for inference only, I have also run the scripts in inference mode to further investigate the impact on inference.

Please find the inference and training mode results here:
https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit?usp=sharing

I have calculated the regression percentages and sorted them; thanks to @aaronmarkham for providing the first version.

Although perf numbers vary between runs, we observe that the following commonly used operators are consistently slower.

We need to look into them and fix them once the root cause is found.

  • [x] Dropout
  • [x] relu
  • [x] LeakyReLU
  • [x] dot
  • [x] element-wise ops (mul, div, sub)
  • [x] broadcast ops (mul, sub)

Some op regressions seem to happen only on the mxnet-mkl version (refer to the 4th sheet of the Google Sheet).

Environment:

AWS C5.18xLarge
Deep Learning Base AMI (Ubuntu) Version 18.1
Python 3.6

MXNet versions:

with MKLDNN:

```
pip install mxnet-mkl==1.5.0b20190627
pip install mxnet-mkl==1.4.1
```

without MKLDNN:

```
pip install mxnet==1.5.0b20190627
pip install mxnet==1.4.1
```

Note: the 20190627 nightly contains the latest commit on the v1.5.x branch.

Scripts:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
Note: you need to modify the scripts a bit to run them:

  1. Requires Python 3.6; add the scripts path to your PYTHONPATH and follow the instructions to run the benchmark on all operators.
  2. To run operators in inference mode, you need to set the flag to False at this line
    and change run_backward to False in all files under
    https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf/nd_operations
    for example here. (A sketch of a single-operator inference-mode run is shown below.)
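For reference, a minimal sketch of timing a single operator in inference mode (forward only) with the run_performance_test helper, mirroring the broadcast_add example later in this thread; it assumes the repository root is on PYTHONPATH and that relu takes its input under the 'data' key, as in the shape column of the tables below:

```python
# Minimal sketch: time one operator forward-only (inference mode).
# Assumes the incubator-mxnet repository root is on PYTHONPATH so that
# benchmark.opperf is importable.
import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

res = run_performance_test(nd.relu, run_backward=False, dtype='float32', ctx=mx.cpu(),
                           inputs=[{'data': (10000, 1)}],
                           warmup=10, runs=100)
print(res)
```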

Most helpful comment

Conclusion

  1. No significant regression for broadcast ops and BatchNorm. This issue should not block the 1.5.0 release.
  2. The regression on ReLU is expected; the extra check is necessary, as stated in https://github.com/apache/incubator-mxnet/pull/14262.
  3. We should use the timeit or time module for benchmarks, or use the same profiler module in both versions, to make sure it is an apples-to-apples comparison.
  4. We need better benchmark methods to reduce the variance between runs and clearly identify whether there is a regression; otherwise it is hard to root-cause and fix. Right now we use a fixed seed and set OMP_NUM_THREADS.

All 23 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

cc @PatricZhao @szha @ddavydenko

script command:

```
python incubator-mxnet/benchmark/opperf/opperf.py --output-format md --output-file mxnet-{version}-infer.md
```

raw script output:
https://drive.google.com/drive/folders/1OE-bedTFd-HN0Og6vW8XGWkswrRrdyW0?usp=sharing

Maybe this is the same issue as #15430. @ciyongch is working on it.

Yes, #15430 is also caused by op perf regression. I will try to reproduce the case and focus on inference first.

@roywei I've collected some performance data for Dropout, relu, and dot on C5.18xlarge, and saw large run-to-run variance on dot and Dropout without binding cores (Table 1).
After binding cores to a socket via the commands below before running the benchmark, I got stable and better results (Table 2).

```
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18
```

I ran 5 times with 10 warmup + 100 runs each time. I'm not sure if you're setting these environment variables when running the benchmark. I suggest re-running the ops several times to see whether it's a real degradation.

As for the relu operator getting worse perf on v1.5: I found that v1.5 includes commit c64559178 (#14262), which added an "IsNan(a)" check on the data and introduced some overhead. I will add more data and more operator results later.
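For context, the behavioral change behind that check can be seen from Python; a rough sketch, assuming the semantics described in #14262 (relu should propagate NaN rather than silently mapping it to zero):

```python
# Rough check of the NaN handling the "IsNan(a)" change addresses
# (assumes the semantics described in #14262: relu propagates NaN on v1.5+).
import mxnet as mx

x = mx.nd.array([float('nan'), -1.0, 2.0])
print(mx.nd.relu(x).asnumpy())  # expected on v1.5+: [nan, 0., 2.]
```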

Table 1

| Operator | Avg Forward Time (ms) 1.4.1 | Avg Forward Time (ms) 1.5.0 | Regression | data shape |
|----------|-----------------------------|-----------------------------|------------|----------------------------------------------------------------------------------|
| dot | 0.1791 | 0.1576 | 12.0% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.2505 | 0.1146 | 54.3% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.1798 | 0.2531 | -40.8% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.1529 | 0.2786 | -82.2% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.4024 | 0.0763 | 81.0% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.1824 | 0.3437 | -88.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.1107 | 0.2491 | -125.0% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.4099 | 0.2401 | 41.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 1.4348 | 0.2439 | 83.0% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.3124 | 0.2922 | 6.5% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| Dropout | 0.0551 | 0.0477 | 13.4% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.0446 | 0.0462 | -3.6% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.078 | 0.0567 | 27.3% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.0311 | 0.0797 | -156.3% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.3707 | 0.0671 | 81.9% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |

Table 2

| Operator | Avg Forward Time (ms) 1.4.1 | Avg Forward Time (ms) 1.5.0 | Regression | data shape |
|----------|-----------------------------|-----------------------------|------------|----------------------------------------------------------------------------------|
| dot | 0.0079 | 0.0087 | -10.1% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0076 | 0.0089 | -17.1% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0079 | 0.009 | -13.9% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.008 | 0.0093 | -16.3% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0078 | 0.0089 | -14.1% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0946 | 0.0895 | 5.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.0937 | 0.0889 | 5.1% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.093 | 0.0895 | 3.8% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.097 | 0.0898 | 7.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.0908 | 0.089 | 2.0% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| Dropout | 0.0238 | 0.0235 | 1.3% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.0244 | 0.0234 | 4.1% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.024 | 0.0234 | 2.5% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.024 | 0.0235 | 2.1% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.022 | 0.0234 | -6.4% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |

Thanks @ciyongch, setting the environment variables did reduce the variance.
I have updated the document (1st sheet): https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit#gid=196553607

With the current data, dot and Dropout are not a big concern now. relu's regression is something we have to accept, as otherwise it could lead to NaNs and bugs.

@roywei I looked at the table; aside from a few ops with large variance, most of the degraded ops listed in the table fall into mshadow. I did some further analysis and found that the performance regression is caused by commit 6e94643 (https://github.com/dmlc/mshadow/pull/371), which changed the data type of index_t. It seems the quick fix in that PR didn't work for these ops, and a better solution is required for the type change.

Here's some perf data collected on my local machine (SKX-8180); you can give it a try on your platform :)

| Operator | Avg Forward Time (ms) 1.4.1 | Avg Forward Time (ms) 1.5.0 | Regression | Avg Forward Time (ms) 1.5.0 (revert 6e94643) | Updated Regression | data shape |
|-------------|-----------------------------|-----------------------------|------------|----------------------------------------------|------------|----------------------|
| relu | 0.0047 | 0.0107 | 127.7% | 0.0057 | 21.3% | {'data': (10000, 1)} |
| relu | 0.0065 | 0.0143 | 120.0% | 0.005 | -23.1% | {'data': (10000, 1)} |
| relu | 0.005 | 0.0144 | 188.0% | 0.0051 | 2.0% | {'data': (10000, 1)} |
| relu | 0.0052 | 0.0115 | 121.2% | 0.0066 | 26.9% | {'data': (10000, 1)} |
| relu | 0.005 | 0.0138 | 176.0% | 0.0061 | 22.0% | {'data': (10000, 1)} |
| logical_not | 0.005 | 0.0221 | 342.0% | 0.0057 | 14.0% | {'data': (10000, 1)} |
| logical_not | 0.0051 | 0.0214 | 319.6% | 0.0056 | 9.8% | {'data': (10000, 1)} |
| logical_not | 0.0054 | 0.0215 | 298.1% | 0.0056 | 3.7% | {'data': (10000, 1)} |
| logical_not | 0.0051 | 0.0214 | 319.6% | 0.0057 | 11.8% | {'data': (10000, 1)} |
| logical_not | 0.0051 | 0.0216 | 323.5% | 0.0056 | 9.8% | {'data': (10000, 1)} |
| softsign | 0.0054 | 0.0166 | 207.4% | 0.0059 | 9.3% | {'data': (10000, 1)} |
| softsign | 0.0053 | 0.0166 | 213.2% | 0.006 | 13.2% | {'data': (10000, 1)} |
| softsign | 0.0058 | 0.0163 | 181.0% | 0.006 | 3.4% | {'data': (10000, 1)} |
| softsign | 0.0053 | 0.0162 | 205.7% | 0.006 | 13.2% | {'data': (10000, 1)} |
| softsign | 0.0054 | 0.0164 | 203.7% | 0.0062 | 14.8% | {'data': (10000, 1)} |

cc @apeforest

Thanks @ciyongch for diving deep and helping identify the root cause of the issues.

Can you please help me understand why we see variance only for a few operators when the KMP and OMP environment variables are not set? Our users don't do this environment variable setup, so should we worry that these operators will perform badly and non-deterministically without it?

@TaoLv @ciyongch Thanks for the investigation. I will try to verify it on my machine.

@sandeep-krishnamurthy Since most of MXNet's ops parallelize via OpenMP on CPU, from a performance perspective binding CPU cores reduces degradation from cache misses and gives better thread scheduling.
Without binding the cores, performance variance can be larger due to thread/context switches among the available cores; in addition, the time to benchmark an individual op is usually small, on the order of a few to a hundred microseconds as shown in the tables above (with 10 warmup + 100 runs per op).
BTW, op-level performance does not always reflect real performance in a full model (for example, data is more likely to stay cached in op benchmarking, which gives better performance than a real model scenario). It would be better if future CI combined some basic/general model benchmarks to determine regressions. What do you think?
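A small sketch of applying the same core binding from inside a Python script rather than the shell; the values assume one socket of a C5.18xlarge as above, and they must be set before MXNet (and its OpenMP runtime) is imported:

```python
# Sketch: pin the benchmark to one socket from Python. Set the variables
# before importing mxnet so the OpenMP runtime picks them up.
import os
os.environ['KMP_AFFINITY'] = 'granularity=fine,noduplicates,compact,1,0'
os.environ['OMP_NUM_THREADS'] = '18'

import mxnet as mx
print(mx.nd.broadcast_add(mx.nd.ones((2, 2)), mx.nd.ones((1, 2))))
```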

I agree that what impacts actual user experience is the final speed of model inference/training, since operators get fused and other performance-improvement techniques are applied to the overall model. Currently we don't have enough data to say whether these op regressions will impact actual model speed.

Regarding the op regressions, we are focusing on root-causing the regression of broadcast ops; the rest of the ops should not block the 1.5.0 release.

We found that regardless of the flag (int32/int64), there is around a 15% regression on broadcast ops between 1.4.1 and 1.5.0, on both the mxnet-mkl and mxnet pip packages. I'm still root-causing it.

| Operator | mxnet-mkl 1.4.1 (int64) | mxnet-mkl 1.5.0 (int32) | mxnet-mkl 1.5.0 (int64) | mkl 1.5 int32 vs 1.4.1 | mkl 1.5 int64 vs 1.4.1 | mxnet 1.5.0 | mxnet 1.4.1 | mxnet regression |
|---|---|---|---|---|---|---|---|---|
| broadcast_add | 0.00242 | 0.00286 | 0.0029 | 18% | 20% | 0.0021 | 0.0016 | 31% |
| broadcast_div | 0.00244 | 0.00286 | 0.00296 | 17% | 21% | 0.0021 | 0.0017 | 24% |
| broadcast_equal | 0.0024 | 0.00294 | 0.00294 | 23% | 23% | 0.0021 | 0.0016 | 31% |
| broadcast_greater | 0.00244 | 0.00282 | 0.00294 | 16% | 20% | 0.0021 | 0.0016 | 31% |
| broadcast_greater_equal | 0.00244 | 0.00284 | 0.00286 | 16% | 17% | 0.002 | 0.0017 | 18% |
| broadcast_hypot | 0.0025 | 0.00292 | 0.00302 | 17% | 21% | 0.0021 | 0.0016 | 31% |
| broadcast_lesser | 0.00242 | 0.00288 | 0.00298 | 19% | 23% | 0.002 | 0.0015 | 33% |
| broadcast_lesser_equal | 0.00246 | 0.00288 | 0.00288 | 17% | 17% | 0.0021 | 0.0017 | 24% |
| broadcast_logical_and | 0.0024 | 0.0029 | 0.00292 | 21% | 22% | 0.0021 | 0.0016 | 31% |
| broadcast_logical_or | 0.0025 | 0.00288 | 0.00288 | 15% | 15% | 0.0021 | 0.0016 | 31% |
| broadcast_logical_xor | 0.00242 | 0.0029 | 0.00296 | 20% | 22% | 0.0021 | 0.0016 | 31% |
| broadcast_maximum | 0.00248 | 0.00288 | 0.00284 | 16% | 15% | 0.0021 | 0.0016 | 31% |
| broadcast_minimum | 0.0025 | 0.00286 | 0.00284 | 14% | 14% | 0.002 | 0.0015 | 33% |
| broadcast_mod | 0.00262 | 0.00294 | 0.00302 | 12% | 15% | 0.0021 | 0.0016 | 31% |
| broadcast_mul | 0.00244 | 0.00288 | 0.00298 | 18% | 22% | 0.0021 | 0.0017 | 24% |
| broadcast_not_equal | 0.00258 | 0.00296 | 0.00294 | 15% | 14% | 0.002 | 0.0016 | 25% |
| broadcast_power | 0.00296 | 0.00336 | 0.00348 | 14% | 18% | 0.0023 | 0.0019 | 21% |
| broadcast_sub | 0.0025 | 0.00296 | 0.003 | 18% | 20% | 0.002 | 0.0016 | 25% |

I have run a search for the perf regression on broadcast_add and there are 2 regressions: between 0412 and 0425, and between 0619 and 0621. (A sketch of automating such a nightly sweep follows the table below.)

| mxnet-mkl version | broadcast_add speed | commit |
|---|---|---|
| 1.5.0b20190411 | 0.003 | |
| 1.5.0b20190412 | 0.003 | 30d479feae11b04d423f2ec055d124fb6d2d5730 |
| 1.5.0b20190425 | 0.0034 | 587d480372f41b25fcbffd45ce77373ad93d41b5 |
| 1.5.0b20190426 | 0.0034 | |
| 1.5.0b20190619 | 0.0033 | ccbbf6b4b76ea536a6583c99497c83b65a20817b |
| 1.5.0b20190620 | 0.0035 | 2de0db0911f2e71728fa85ab342bd99a10974fc9 |
| 1.5.0b20190621 | 0.0037 | 8b5f376d17b644385706016caf8b1e58e95d96df |
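A rough sketch of how such a nightly sweep could be automated; the version list and timing snippet are illustrative only, not the exact procedure used above:

```python
# Rough sketch of automating the nightly sweep above. Each nightly is installed
# into the current environment and timed in a fresh interpreter; the version
# list and the timing snippet are illustrative, not the exact procedure used.
import subprocess
import sys

NIGHTLIES = ['1.5.0b20190411', '1.5.0b20190412', '1.5.0b20190425',
             '1.5.0b20190426', '1.5.0b20190619', '1.5.0b20190620',
             '1.5.0b20190621']

TIMING_SNIPPET = """
import time
import mxnet as mx
a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1, 1024))
mx.nd.waitall()
start = time.time()
for _ in range(1000):
    mx.nd.broadcast_add(a, b).wait_to_read()
print('%.6f' % ((time.time() - start) / 1000))
"""

for version in NIGHTLIES:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet',
                    'mxnet-mkl==%s' % version], check=True)
    # Run the timing in a fresh interpreter so each version is loaded cleanly.
    out = subprocess.run([sys.executable, '-c', TIMING_SNIPPET], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
    print(version, out.stdout.strip())
```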

@sandeep-krishnamurthy @Zha0q1
https://github.com/apache/incubator-mxnet/pull/15240 This PR introduced the regression from 0619 to 0621.
broadcast_add speed before this PR: 0.004; after this PR: 0.0047.
Note the speed differs from the pip-installed version due to build flags, but the regression is there.
build: `sudo make -j$(nproc) USE_MKLDNN=1`
script:

```python
import mxnet as mx
from mxnet import nd
mx.random.seed(10)
from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test(nd.broadcast_add, run_backward=False, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'lhs': [(1024, 1024), (10000, 10), (10000, 1)], 'rhs': [(1024, 1024), (10000, 10), (10000, 1)]}],
                               warmup=10, runs=500)

print("speed:", add_res[0]['broadcast_add'][0]['avg_time_forward_broadcast_add'])
```

Another regression comes from PR https://github.com/apache/incubator-mxnet/pull/14661.
broadcast_add speed:
before: 0.0033
after: 0.0039

Update: all benchmark results before this comment may be inaccurate, as the profiler code differs between 1.4.1 and 1.5 and the opperf script uses the profiler.

There is not much regression (accounting for variance) between 1.4.1 and 1.5.0 on broadcast ops when using Python time(); the earlier gap may be due to the profiler change between versions.
Results on C5.18xlarge:

| Operator | 1.4.1 run 1 | 1.4.1 run 2 | 1.4.1 run 3 | 1.4.1 average | 1.5.0 run 1 | 1.5.0 run 2 | 1.5.0 run 3 | 1.5.0 run 4 | 1.5.0 average |
|---|---|---|---|---|---|---|---|---|---|
| broadcast_add | 0.055 | 0.059 | 0.056624 | 0.0567635 | 0.058 | 0.06 | 0.057 | 0.057 | 0.058183 |
| elemwise_add | 0.104 | 0.104460 | 0.104491 | 0.10444 | 0.121223 | 0.127 | 0.124 | 0.124 | 0.125145 |

Thanks to @reminisce for helping out and providing the scripts.

```python
import time
import mxnet as mx

mx.random.seed(0)

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))

repeat = 10000

mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    c = mx.nd.broadcast_add(a, b)
    # c = mx.nd.elemwise_add(a, b)
    c.wait_to_read()
elapse = time.time() - start

print("elapse time: %fms" % (elapse * 1000 / repeat))
```

https://github.com/apache/incubator-mxnet/pull/15240 may have introduced extra runtime in the profiler in MXNet 1.5. Using the timeit module, broadcast_add does not show a significant difference between MXNet 1.4 and MXNet 1.5:

```python
import timeit

SETUP_CODE = '''
import mxnet as mx
from mxnet import nd
mx.random.seed(10)
def binary_op(lhs, rhs):
    return nd.broadcast_add(lhs=lhs, rhs=rhs, dtype='float32', ctx=mx.cpu())

lhs = nd.random.uniform(shape=(1024, 1024), dtype='float32', ctx=mx.cpu())
rhs = nd.random.uniform(shape=(1, 1024), dtype='float32', ctx=mx.cpu())
res = binary_op(lhs, rhs)
'''

TEST_CODE = '''
res.wait_to_read()
'''

# timeit.repeat statement
times = timeit.repeat(setup=SETUP_CODE,
                      stmt=TEST_CODE,
                      repeat=3,
                      number=10000)

# printing minimum exec. time
print('time: {}'.format(min(times)))
```

Results using the script above:

| Operator | 1.4.1 | 1.5.0 |
|---|---|---|
| broadcast_add | 0.0073 | 0.0071 |
| elemwise_add | 0.0079 | 0.0072 |
| sigmoid | 0.0079 | 0.0075 |
| hard_sigmoid | 0.0072 | 0.0069 |

There is also no significant regression on the BatchNorm op between 1.4.1 and 1.5.0.
The speed actually improves with the int64 flag turned on.

| BatchNorm | 1.4.1 (int64) | 1.5.0 (int32) | 1.5.0 (int64) |
|---|---|---|---|
| 1st run | 2.609942 | 2.63 | 2.031 |
| 2nd run | 2.621809 | 2.63 | 2.054 |
| 3rd run | 2.621809 | 2.611 | 2.041 |
| average | 2.60764 | 2.61148 | 2.042131 |

script:

```python
import mxnet as mx
import time

mx.random.seed(0)

data = mx.nd.random.uniform(0, 256, (32, 3, 256, 256))
beta = mx.nd.random.uniform(shape=(3,))
gamma = mx.nd.random.uniform(shape=(3,))
mean = mx.nd.random.uniform(shape=(3,))
var = mx.nd.random.uniform(shape=(3,))
repeat = 1000

mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    c = mx.nd.BatchNorm(data=data, gamma=gamma, beta=beta, moving_mean=mean, moving_var=var)
    c.wait_to_read()
elapse = time.time() - start

print("elapse time: %fms" % (elapse * 1000 / repeat))
```

Conclusion

  1. No significant regression for broadcast ops and BatchNorm. This issue should not block the 1.5.0 release.
  2. The regression on ReLU is expected; the extra check is necessary, as stated in https://github.com/apache/incubator-mxnet/pull/14262.
  3. We should use the timeit or time module for benchmarks, or use the same profiler module in both versions, to make sure it is an apples-to-apples comparison.
  4. We need better benchmark methods to reduce the variance between runs and clearly identify whether there is a regression; otherwise it is hard to root-cause and fix. Right now we use a fixed seed and set OMP_NUM_THREADS.

@roywei does this mean the operator profiling results with the current profiler module are not accurate?

@ciyongch There are changes in the profiler code that went into 1.5 but were not in the 1.4.1 release: https://github.com/apache/incubator-mxnet/pull/15240. This could have made a difference in the measured runtime, as per-operator runtimes are very sensitive.

Going forward, we should enhance the profiler module and make it available in CI so we can catch performance degradation in time.
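To keep the comparison apples-to-apples, a minimal sketch of driving the built-in profiler directly around a single op (assuming the mx.profiler set_config/set_state/dumps API available in 1.4 and 1.5):

```python
# Sketch: measure one op with MXNet's built-in profiler so both versions are
# profiled with the same mechanism (assumes mx.profiler as in 1.4/1.5).
import mxnet as mx
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='broadcast_add_profile.json')

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1, 1024))
mx.nd.waitall()

profiler.set_state('run')
for _ in range(100):
    mx.nd.broadcast_add(a, b).wait_to_read()
profiler.set_state('stop')

print(profiler.dumps())  # aggregated per-operator statistics
```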

@apeforest Thanks for the info. Since the current perf degradation only happens for some ops or certain shapes on v1.5, any suggestions for profiling a real model? Can we still rely on the profiling results from the latest MXNet code base (with #15240) for now?

