Follow up on dev list discussion:
We have found performance regressions in some operators using the operator benchmark module here:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
@sandeep-krishnamurthy has helped to run the benchmark and this is the training mode result:
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
The above result uses training mode (`autograd.record()`) and measures both forward and backward time.
As most users use CPU for inference only, I have also run the scripts in inference mode to further investigate the impact on inference.
Please find the inference and training mode results here:
https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit?usp=sharing
I have calculated the regression percentages and sorted them; thanks to @aaronmarkham for providing the first version.
Although there is variance in the perf numbers between runs, we observe that the following commonly used operators are consistently slower. We need to look into them and fix them once root-caused.
Some op regressions seem to happen only on the mxnet-mkl version (refer to the 4th sheet of the Google Sheet).
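For clarity on how the numbers in the sheet are read: the regression percentage is the relative change in average op time between 1.4.1 and 1.5.0. A minimal helper along these lines reproduces the values (a sketch; the sample values are illustrative, and note that Table 1/Table 2 later in the thread use the opposite sign convention, where positive means improvement):

```python
def regression_pct(old_ms, new_ms):
    """Relative change in average op time; positive means the new version is slower."""
    return (new_ms - old_ms) / old_ms * 100.0

# e.g. an op going from 0.0016 ms to 0.0021 ms average forward time
print("%.0f%%" % regression_pct(0.0016, 0.0021))  # -> 31%
```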
Environment:
AWS C5.18xLarge
Deep Learning Base AMI (Ubuntu) Version 18.1
Python 3.6
MXNet versions:
with MKLDNN
pip install mxnet-mkl==1.5.0b20190627
pip install mxnet-mkl==1.4.1
without MKLDNN
pip install mxnet==1.5.0b20190627
pip install mxnet==1.4.1
Note: nightly 20190627 contains the latest commit on v1.5.x
Scripts:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
Notes: to run inference mode you need to modify the scripts a bit: set `run_backward` to `False` at this line in all files under:
cc @PatricZhao @szha @ddavydenko
script command:
python incubator-mxnet/benchmark/opperf/opperf.py --output-format md --output-file mxnet-{version}-infer.md
raw script output:
https://drive.google.com/drive/folders/1OE-bedTFd-HN0Og6vW8XGWkswrRrdyW0?usp=sharing
Maybe this is the same issue as #15430. @ciyongch is working on it.
Yes, #15430 is also caused by an op perf regression. I will try to reproduce the case and focus on inference first.
@roywei I've collected some performance data for `Dropout`, `relu`, and `dot` on C5.18xlarge, and saw some big run-to-run variance on `dot` and `Dropout` without binding cores, as shown in Table 1.
After binding cores to a socket via the commands below before running the benchmark, I got stable and better results, as shown in Table 2.
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18
I ran 5 times with 10 warmup + 100 runs each time. I'm not sure if you're setting the env variables when running the benchmark, and I suggest re-running the ops several times to see if it's a real degradation.
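Re-running and looking at the spread is easy to automate. A minimal timing harness for that purpose could look like this (pure Python; the lambda is a stand-in for the actual op call):

```python
import statistics
import time

def bench(fn, warmup=10, runs=100):
    """Return (mean, stdev) of per-call wall time in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)

mean_ms, std_ms = bench(lambda: sum(range(10000)))
print("avg %.4f ms, stdev %.4f ms" % (mean_ms, std_ms))
```

A stdev that is large relative to the mean suggests environment/affinity noise rather than a real regression.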
As for the `relu` operator getting worse perf on v1.5: I found v1.5 includes commit c64559178 (#14262), which added an `IsNan(a)` check on the data and introduced some overhead. Will add more data and more operator results later.
Table 1
| Operator | Avg Forward Time (ms) 1.4.1 | Avg Forward Time (ms) 1.5.0 | Regression | data shape |
|----------|-----------------------------|-----------------------------|------------|----------------------------------------------------------------------------------|
| dot | 0.1791 | 0.1576 | 12.0% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.2505 | 0.1146 | 54.3% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.1798 | 0.2531 | -40.8% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.1529 | 0.2786 | -82.2% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.4024 | 0.0763 | 81.0% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.1824 | 0.3437 | -88.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.1107 | 0.2491 | -125.0% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.4099 | 0.2401 | 41.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 1.4348 | 0.2439 | 83.0% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.3124 | 0.2922 | 6.5% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| Dropout | 0.0551 | 0.0477 | 13.4% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.0446 | 0.0462 | -3.6% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.078 | 0.0567 | 27.3% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.0311 | 0.0797 | -156.3% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.3707 | 0.0671 | 81.9% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
Table 2
| Operator | Avg Forward Time (ms) 1.4.1 | Avg Forward Time (ms) 1.5.0 | Regression | data shape |
|----------|-----------------------------|-----------------------------|------------|----------------------------------------------------------------------------------|
| dot | 0.0079 | 0.0087 | -10.1% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0076 | 0.0089 | -17.1% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0079 | 0.009 | -13.9% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.008 | 0.0093 | -16.3% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0078 | 0.0089 | -14.1% | {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True} |
| dot | 0.0946 | 0.0895 | 5.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.0937 | 0.0889 | 5.1% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.093 | 0.0895 | 3.8% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.097 | 0.0898 | 7.4% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| dot | 0.0908 | 0.089 | 2.0% | {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True} |
| Dropout | 0.0238 | 0.0235 | 1.3% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.0244 | 0.0234 | 4.1% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.024 | 0.0234 | 2.5% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.024 | 0.0235 | 2.1% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
| Dropout | 0.022 | 0.0234 | -6.4% | {'mode': 'always', 'data': (10000, 10), 'p': 0.5} |
Thanks @ciyongch, setting the environment variables did reduce the variance.
I have updated the document (1st sheet): https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit#gid=196553607
With the current data, Dot and Dropout are not a big concern now. Relu's regression is something we have to accept, as otherwise it could lead to NaNs and bugs.
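For context on why skipping NaN handling in relu is risky: any comparison with NaN is false, so a naive comparison-based relu can either hide a NaN or propagate it depending on how the branch is written. This is not MXNet's actual implementation, just a pure-Python illustration of the hazard the `IsNan` check guards against:

```python
import math

nan = float("nan")

# Any comparison with NaN evaluates to False, so the outcome
# depends entirely on operand order in a comparison-based relu.
relu_a = nan if nan > 0 else 0.0   # NaN > 0 is False -> 0.0, the NaN is silently hidden
relu_b = max(nan, 0.0)             # max keeps its first argument here -> NaN propagates

print(relu_a)              # -> 0.0
print(math.isnan(relu_b))  # -> True
```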
@roywei I looked at the table; except for a few ops with big variance, most of the degraded ops listed in the table drop into mshadow. I did some further analysis and found that the performance regression is caused by commit 6e94643 (https://github.com/dmlc/mshadow/pull/371), which changed the data type of `index_t`. It seems the quick fix in that PR didn't work for these ops, and a better solution is required for the type change.
Here's some perf data collected on my local machine (SKX-8180); you can have a try on your platform :)
| Operator | Avg Forward Time (ms) 1.4.1 | Avg Forward Time (ms) 1.5.0 | Regression | Avg Forward Time (ms) 1.5.0 (revert 6e94643) | Updated Regression | data shape |
|-------------|-----------------------------|-----------------------------|------------|----------------------------------------------|------------|----------------------|
| relu | 0.0047 | 0.0107 | 127.7% | 0.0057 | 21.3% | {'data': (10000, 1)} |
| relu | 0.0065 | 0.0143 | 120.0% | 0.005 | -23.1% | {'data': (10000, 1)} |
| relu | 0.005 | 0.0144 | 188.0% | 0.0051 | 2.0% | {'data': (10000, 1)} |
| relu | 0.0052 | 0.0115 | 121.2% | 0.0066 | 26.9% | {'data': (10000, 1)} |
| relu | 0.005 | 0.0138 | 176.0% | 0.0061 | 22.0% | {'data': (10000, 1)} |
| logical_not | 0.005 | 0.0221 | 342.0% | 0.0057 | 14.0% | {'data': (10000, 1)} |
| logical_not | 0.0051 | 0.0214 | 319.6% | 0.0056 | 9.8% | {'data': (10000, 1)} |
| logical_not | 0.0054 | 0.0215 | 298.1% | 0.0056 | 3.7% | {'data': (10000, 1)} |
| logical_not | 0.0051 | 0.0214 | 319.6% | 0.0057 | 11.8% | {'data': (10000, 1)} |
| logical_not | 0.0051 | 0.0216 | 323.5% | 0.0056 | 9.8% | {'data': (10000, 1)} |
| softsign | 0.0054 | 0.0166 | 207.4% | 0.0059 | 9.3% | {'data': (10000, 1)} |
| softsign | 0.0053 | 0.0166 | 213.2% | 0.006 | 13.2% | {'data': (10000, 1)} |
| softsign | 0.0058 | 0.0163 | 181.0% | 0.006 | 3.4% | {'data': (10000, 1)} |
| softsign | 0.0053 | 0.0162 | 205.7% | 0.006 | 13.2% | {'data': (10000, 1)} |
| softsign | 0.0054 | 0.0164 | 203.7% | 0.0062 | 14.8% | {'data': (10000, 1)} |
cc @apeforest
Thanks @ciyongch for diving deep and helping identify the root cause of the issues.
Can you please help me understand why we see variance only for a few operators without setting the KMP and OMP environment variables? Our users don't do this environment variable setup, so is it something we need to worry about that these operators will perform poorly and non-deterministically without that setting?
@TaoLv @ciyongch Thanks for the investigation. I will try to verify it on my machine.
@sandeep-krishnamurthy As most of MXNet's ops parallelize via OpenMP on CPU, from a performance perspective binding CPU cores reduces performance degradation due to cache misses and gives better thread scheduling.
Without binding the cores, the performance variance can be bigger due to thread/context switches among the available cores, and the benchmarking time for an individual op is usually small, several to a hundred microseconds as shown in the table above (in the case of 10 warmup + 100 runs per op).
BTW, op-level performance numbers do not always reflect real performance in a real model (for example, data is more likely to be cached in op benchmarking, which results in better performance compared to a real-model scenario). It would be better if the future CI combined some basic/general model benchmarks to determine regressions. What do you think?
I agree that what impacts actual user experience is the final speed of model inference/training, as operators get fused and other performance improvement techniques are applied to the overall model. Currently we don't have enough data to say these op regressions will impact actual model speed.
Regarding the op regressions, we are focusing on root-causing the regression in the broadcast ops; the rest should not block the 1.5.0 release.
We found that no matter what the flag is (int32/int64), there is around 15% regression on broadcast ops, in both the mxnet-mkl and mxnet pip packages, between 1.4.1 and 1.5.0. I'm still root-causing it.
| Operator | mxnet-mkl 1.4.1 (int64) | mxnet-mkl 1.5.0 (int32) | mxnet-mkl 1.5.0 (int64) | 1.5 int32 vs 1.4.1 | 1.5 int64 vs 1.4.1 | mxnet 1.5.0 | mxnet 1.4.1 | mxnet regression |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| broadcast_add | 0.00242 | 0.00286 | 0.0029 | 18% | 20% | 0.0021 | 0.0016 | 31% |
| broadcast_div | 0.00244 | 0.00286 | 0.00296 | 17% | 21% | 0.0021 | 0.0017 | 24% |
| broadcast_equal | 0.0024 | 0.00294 | 0.00294 | 23% | 23% | 0.0021 | 0.0016 | 31% |
| broadcast_greater | 0.00244 | 0.00282 | 0.00294 | 16% | 20% | 0.0021 | 0.0016 | 31% |
| broadcast_greater_equal | 0.00244 | 0.00284 | 0.00286 | 16% | 17% | 0.002 | 0.0017 | 18% |
| broadcast_hypot | 0.0025 | 0.00292 | 0.00302 | 17% | 21% | 0.0021 | 0.0016 | 31% |
| broadcast_lesser | 0.00242 | 0.00288 | 0.00298 | 19% | 23% | 0.002 | 0.0015 | 33% |
| broadcast_lesser_equal | 0.00246 | 0.00288 | 0.00288 | 17% | 17% | 0.0021 | 0.0017 | 24% |
| broadcast_logical_and | 0.0024 | 0.0029 | 0.00292 | 21% | 22% | 0.0021 | 0.0016 | 31% |
| broadcast_logical_or | 0.0025 | 0.00288 | 0.00288 | 15% | 15% | 0.0021 | 0.0016 | 31% |
| broadcast_logical_xor | 0.00242 | 0.0029 | 0.00296 | 20% | 22% | 0.0021 | 0.0016 | 31% |
| broadcast_maximum | 0.00248 | 0.00288 | 0.00284 | 16% | 15% | 0.0021 | 0.0016 | 31% |
| broadcast_minimum | 0.0025 | 0.00286 | 0.00284 | 14% | 14% | 0.002 | 0.0015 | 33% |
| broadcast_mod | 0.00262 | 0.00294 | 0.00302 | 12% | 15% | 0.0021 | 0.0016 | 31% |
| broadcast_mul | 0.00244 | 0.00288 | 0.00298 | 18% | 22% | 0.0021 | 0.0017 | 24% |
| broadcast_not_equal | 0.00258 | 0.00296 | 0.00294 | 15% | 14% | 0.002 | 0.0016 | 25% |
| broadcast_power | 0.00296 | 0.00336 | 0.00348 | 14% | 18% | 0.0023 | 0.0019 | 21% |
| broadcast_sub | 0.0025 | 0.00296 | 0.003 | 18% | 20% | 0.002 | 0.0016 | 25% |
I have run a search for the perf regression on `broadcast_add`, and there are 2 regression windows: 0412-0425 and 0619-0621.
mxnet-mkl version | broadcast_add speed | commit|
-- | --| --
1.5.0b20190411 | 0.003 |
1.5.0b20190412 | 0.003 | 30d479feae11b04d423f2ec055d124fb6d2d5730
1.5.0b20190425 | 0.0034 | 587d480372f41b25fcbffd45ce77373ad93d41b5
1.5.0b20190426 | 0.0034 |
1.5.0b20190619 | 0.0033 | ccbbf6b4b76ea536a6583c99497c83b65a20817b
1.5.0b20190620 | 0.0035 | 2de0db0911f2e71728fa85ab342bd99a10974fc9
1.5.0b20190621 | 0.0037 | 8b5f376d17b644385706016caf8b1e58e95d96df
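The nightly search above is effectively a bisection over chronologically ordered builds. The search logic can be sketched like this (the `is_slow` predicate is hypothetical; in practice it would pip-install the nightly and run the `broadcast_add` benchmark against a threshold):

```python
def first_slow_version(versions, is_slow):
    """Binary search for the first version where the regression appears.

    Assumes versions are in chronological order and that the regression,
    once introduced, persists in all later versions.
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(versions[mid]):
            hi = mid      # regression already present, look earlier
        else:
            lo = mid + 1  # still fast, look later
    return versions[lo]

nightlies = ["1.5.0b20190411", "1.5.0b20190412", "1.5.0b20190425", "1.5.0b20190426"]
# stand-in predicate mirroring the measurements above (slowdown appears at 0425)
print(first_slow_version(nightlies, lambda v: v >= "1.5.0b20190425"))  # -> 1.5.0b20190425
```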
@sandeep-krishnamurthy @Zha0q1
https://github.com/apache/incubator-mxnet/pull/15240 is the PR that introduced the regression in the 0619-0621 window.
`broadcast_add` speed before this PR: 0.004; after this PR: 0.0047.
Note: the speed differs from the pip-installed version due to build flags, but the regression is there.
build: sudo make -j$(nproc) USE_MKLDNN=1
script:
import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

mx.random.seed(10)
add_res = run_performance_test(nd.broadcast_add, run_backward=False, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'lhs': [(1024, 1024), (10000, 10), (10000, 1)],
                                        'rhs': [(1024, 1024), (10000, 10), (10000, 1)]}],
                               warmup=10, runs=500)
print("speed:", add_res[0]['broadcast_add'][0]['avg_time_forward_broadcast_add'])
Another regression comes from PR https://github.com/apache/incubator-mxnet/pull/14661.
`broadcast_add` speed: before 0.0033, after 0.0039.
There is not much regression (accounting for variance as well) between 1.4.1 and 1.5.0 on broadcast ops when using Python's `time()`. It may be due to the profiler change between versions.
Results on C5.18xlarge:
| Operator | 1.4.1 run 1 | 1.4.1 run 2 | 1.4.1 run 3 | 1.4.1 average | 1.5.0 run 1 | 1.5.0 run 2 | 1.5.0 run 3 | 1.5.0 run 4 | 1.5.0 average |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| broadcast_add | 0.055 | 0.059 | 0.056624 | 0.0567635 | 0.058 | 0.06 | 0.057 | 0.057 | 0.058183 |
| elemwise_add | 0.104 | 0.104460 | 0.104491 | 0.10444 | 0.121223 | 0.127 | 0.124 | 0.124 | 0.125145 |
Thanks to @reminisce for helping out and providing the scripts.
import time
import mxnet as mx

mx.random.seed(0)
a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))
repeat = 10000
mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    c = mx.nd.broadcast_add(a, b)
    # c = mx.nd.elemwise_add(a, b)
    c.wait_to_read()
elapse = time.time() - start
print("elapse time: %fms" % (elapse * 1000 / repeat))
https://github.com/apache/incubator-mxnet/pull/15240 may have introduced extra runtime in the profiler in MXNet 1.5. Using the `timeit` module, `broadcast_add` does not show a significant difference between MXNet 1.4 and MXNet 1.5:
import timeit

SETUP_CODE = '''
import mxnet as mx
from mxnet import nd
mx.random.seed(10)

def binary_op(lhs, rhs):
    return nd.broadcast_add(lhs=lhs, rhs=rhs, dtype='float32', ctx=mx.cpu())

lhs = nd.random.uniform(shape=(1024, 1024), dtype='float32', ctx=mx.cpu())
rhs = nd.random.uniform(shape=(1, 1024), dtype='float32', ctx=mx.cpu())
res = binary_op(lhs, rhs)
'''

TEST_CODE = '''
res.wait_to_read()
'''

# timeit.repeat statement
times = timeit.repeat(setup=SETUP_CODE,
                      stmt=TEST_CODE,
                      repeat=3,
                      number=10000)

# printing minimum exec. time
print('time: {}'.format(min(times)))
Results using the script above:
聽 | 1.4.1 | 1.5.0 |
-- | -- | -- |
broadcast_add | 0.0073 | 0.0071 |
elemwise_add | 0.0079 | 0.0072 |
sigmoid | 0.0079 | 0.0075 |
hard_sigmoid | 0.0072 | 0.0069 |
There is also no significant regression in the BatchNorm op between 1.4.1 and 1.5.0.
The speed actually improves if the int64 flag is turned on.
|   | 1.4.1 (int64) | 1.5.0 (int32) | 1.5.0 (int64) |
| -- | -- | -- | -- |
| BatchNorm (1st run) | 2.609942 | 2.63 | 2.031 |
| 2nd run | 2.621809 | 2.63 | 2.054 |
| 3rd run | 2.621809 | 2.611 | 2.041 |
| average | 2.60764 | 2.61148 | 2.042131 |
script:
import mxnet as mx
import time

mx.random.seed(0)
data = mx.nd.random.uniform(0, 256, (32, 3, 256, 256))
beta = mx.nd.random.uniform(shape=(3,))
gamma = mx.nd.random.uniform(shape=(3,))
mean = mx.nd.random.uniform(shape=(3,))
var = mx.nd.random.uniform(shape=(3,))
repeat = 1000
mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    c = mx.nd.BatchNorm(data=data, gamma=gamma, beta=beta, moving_mean=mean, moving_var=var)
    c.wait_to_read()
elapse = time.time() - start
print("elapse time: %fms" % (elapse * 1000 / repeat))
> Use the `timeit` or `time` module for benchmarking, or use the same profiler module to make sure it's an apples-to-apples comparison.

@roywei does it mean the operator profiling results with the current profiler module are not accurate?
@ciyongch The profiler was implemented in 1.5, and there are changes in the profiler code that were not in the 1.4.1 release: https://github.com/apache/incubator-mxnet/pull/15240. This could have made a difference in runtime, as the measured runtime of each operator is very sensitive.
Going forward, we should enhance the profiler module and make it available in CI so we can catch performance degradations in time.
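A CI gate of that kind boils down to comparing per-op timings between a baseline build and a candidate build and flagging anything above a threshold. A sketch (op names, values, and the 10% threshold are illustrative):

```python
def find_regressions(baseline_ms, candidate_ms, threshold_pct=10.0):
    """Return {op: pct_slowdown} for ops slower than baseline by more than threshold."""
    flagged = {}
    for op, base in baseline_ms.items():
        cand = candidate_ms.get(op)
        if cand is None:
            continue  # op not measured in the candidate build
        pct = (cand - base) / base * 100.0
        if pct > threshold_pct:
            flagged[op] = round(pct, 1)
    return flagged

baseline = {"relu": 0.0050, "dot": 0.0900, "broadcast_add": 0.0016}
candidate = {"relu": 0.0140, "dot": 0.0910, "broadcast_add": 0.0020}
print(find_regressions(baseline, candidate))  # -> {'relu': 180.0, 'broadcast_add': 25.0}
```

The same comparison logic works whether the timings come from opperf, `timeit`, or the profiler, as long as both sides are measured the same way.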
@apeforest Thanks for the info. As we saw, the current perf degradation only happens for some ops or certain shapes with v1.5; any suggestion on how to profile a real model? Could we still rely on the profiling results with the latest MXNet code base (with #15240) so far?