Incubator-mxnet: Performance regression from 1.4.1 to 1.5.1

Created on 23 Dec 2019 · 42 Comments · Source: apache/incubator-mxnet

Description

In a recent test, we noticed a 33% performance drop between 1.4.1 and 1.5.1 in resnext-50 inference. We have triaged and found the culprit to be https://github.com/apache/incubator-mxnet/pull/15802.

@PatricZhao Please help to investigate.

@jonatan1626 @ChaiBapchya will provide more data points.

Labels: Bug, MKLDNN, Performance

All 42 comments

There is a commit between 1.4.1 and 1.5.1 that is causing the regression. Initially I found that #15802 was the issue, but after rerunning/reproducing it to verify, I am getting different results. I am currently going through the commits to figure out where the regression is and will post my results once I have found it!

These were the initial numbers, which is why this investigation is going on:

Now we are trying to find the culprit for the regression from 1.4.1 -> 1.5.1.

There was a gain from 1.5.1 -> 1.6 from the MKLDNN upgrade #15422.

=========================== MXNet 1.4.1 Pip ==============================
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 84.53512191772461
First Inference: 116.00756645202637
p99: 19.801616668701172
p90: 19.596172332763672
p50: 15.244483947753906
Avg: 15.836643512970845
StdDev: 3.903761278317364
Iterations: 100000
Context: cpu

=========================== MXNet 1.5.1 Pip ==============================
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 84.19561386108398
First Inference: 121.11115455627441
p99: 34.098148345947266
p90: 30.607223510742188
p50: 27.56810188293457
Avg: 27.974184123915748
StdDev: 7.246931590666572
Iterations: 100000
Context: cpu

=========================== MXNet 1.6 Pip ==============================
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 71.62761688232422
First Inference: 104.33459281921387
p99: 21.777629852294922
p90: 20.840167999267578
p50: 17.279386520385742
Avg: 18.166795754279853
StdDev: 4.648682601625848
Iterations: 100000
Context: cpu

Quick update on this: commit id c5265fbf635965b5124df2d7dcea1b57c6508b73 (I'm labeling this commit number 182, meaning it is 182 commits behind 1.5.1) gives us:

"p50": 18.60,
"p90": 22.57,
"p99": 27.02

The following commits give this error (181, 180):

  File "gluon_resnext50.py", line 24, in <module>
    run_resnext50_gluon(*parse_inputs())
  File "gluon_resnext50.py", line 20, in run_resnext50_gluon
    log_name='resnext50'
  File "/home/ubuntu/recreate/recreate/perf_script/utils.py", line 67, in run_model
    infer_time1,infer_results = infer(model, inputs, bsize)
  File "/home/ubuntu/recreate/recreate/perf_script/infer.py", line 115, in infer_imagenet_gluon
    return gluon_forward(model,mx_img,bsize)
  File "/home/ubuntu/recreate/recreate/perf_script/infer.py", line 151, in gluon_forward
    results.wait_to_read()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1819, in wait_to_read
    check_call(_LIB.MXNDArrayWaitToRead(self.handle))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [02:10:27] src/operator/nn/mkldnn/mkldnn_base.cc:398: Unknown MKLDNN format for 5 dimensions: 108
Stack trace:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2445fb) [0x7f33d0e745fb]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x250fe5) [0x7f33d0e80fe5]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::GetMKLDNNData(mkldnn::memory::primitive_desc const&) const+0x1f4) [0x7f33d320d784]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::GetWeights(mxnet::NDArray const&, mkldnn::memory::primitive_desc const&, int)+0x21) [0x7f33d0e87d11]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::op::MKLDNNConvolutionForwardFullFeature(mxnet::op::MKLDNNConvFullParam const&, mxnet::OpContext const&, mxnet::op::MKLDNNConvForward*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x4da) [0x7f33d0ea35ba]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x434) [0x7f33d0ea4d04]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x72497a) [0x7f33d135497a]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x241b547) [0x7f33d304b547]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2468cef) [0x7f33d3098cef]

Commit id e5902ec626400b8b12b2ac1da082ed86bafa67ed (22) is also not runnable.

This is the next commit that works, commit id d6fd15131f6ab915fc3085b9fdd505e521cc66a1 (21), which gives:

"p50": 27.37
"p90": 30.61,
"p99": 34.53

There are 160-ish commits in between these two commit ids, and I haven't tried running them all. I guess the next step would be to run them and see which ones work at all; git bisect could automate the hunt, as sketched below.
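For reference, a minimal git bisect sketch for this hunt, using the fast/slow commits identified above as the good/bad endpoints (the rebuild-and-benchmark step at each round is whatever workflow you already use):

git clone --recursive https://github.com/apache/incubator-mxnet && cd incubator-mxnet
git bisect start
git bisect bad d6fd15131f6ab915fc3085b9fdd505e521cc66a1    # commit 21: slow
git bisect good c5265fbf635965b5124df2d7dcea1b57c6508b73   # commit 182: fast
# At each step: rebuild, rerun the benchmark, then mark the checkout with
# "git bisect good" (fast) or "git bisect bad" (slow).
# Commits that crash with the "Unknown MKLDNN format" error can be skipped:
git bisect skip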

@wuxun-zhang please take a look at this issue.

@apeforest how about the performance on 1.6 and the latest master?

@apeforest @jonatan1626 I will look into this issue. BTW, could you also share the benchmarking scripts you used? Thanks.

Thanks @wuxun-zhang you can find the scripts here: https://github.com/jonatan1626/mxnet_performance

The folder build_mxnet contains the docker build for mxnet. I put the commit number in the Dockerfile of the mxnet version that I want to build.
Build using docker build -t mxnet . within that directory.
Then docker run -it mxnet bash. The wheel can be found in the dist directory.

I copy that out and then pip install it on the instance.

I then go to the perf_script folder and run python gluon_resnext50.py --iteration 10000. That will take a little while and print out the numbers I'm reporting! (The whole flow is condensed in the sketch below.)
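Condensed, the build-and-install flow looks roughly like this sketch; the in-image path to the built wheel (/mxnet/dist here) is an assumption and may differ in the actual Dockerfile:

cd build_mxnet
docker build -t mxnet .
cid=$(docker create mxnet)
docker cp "$cid":/mxnet/dist ./dist   # hypothetical location of the wheel inside the image
docker rm "$cid"
pip install dist/*.whl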

Let me know if you run into any issues! I am having trouble building 1.6 and the latest master using this way, am currently trying to figure out what's wrong.

@jonatan1626 Thanks for sharing. But there is an issue when I tried to download the image from S3 on my local machine.

(mxnet_v1.5) $ python gluon_resnext50.py --iteration 100    [ae78845]
EIA context not available, trying GPU...
GPU not available, trying CPU...
Traceback (most recent call last):
  File "gluon_resnext50.py", line 24, in <module>
    run_resnext50_gluon(*parse_inputs())
  File "gluon_resnext50.py", line 7, in run_resnext50_gluon
    img = load_image(dtype,bsize)
  File "/home/wuxunzha/mxnet_performance/perf_script/utils.py", line 19, in load_image
    s3.Bucket('perf-benchmark').download_file(s3_path, filename)
......
botocore.exceptions.NoCredentialsError: Unable to locate credentials

Could you temporarily upload the testing image and resnext50 model files to the github repo if the file size is not too large? Thanks.

I tried to use the resnext50 model from the GluonCV model zoo since I cannot get access to the model in S3. I used a local CLX-8280 machine with 28 physical cores, and the inference command is like below:

export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-27 --membind=0 python gluon_resnext50.py --iteration 10000

Output based on mxnet-mkl==1.5.1

EIA context not available, trying GPU...
GPU not available, trying CPU...
1.5.1 <module 'mxnet' from '~/anaconda3/envs/mxnet_v1.5/lib/python3.6/site-packages/mxnet/__init__.py'>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
starting test for Model: resnext-50 as float32 with batch size 1
[18:00:41] src/nnvm/legacy_json_util.cc:204: Warning: loading symbol saved by MXNet version 10600 with lower version of MXNet v10501. May cause undefined behavior. Please update MXNet if you encounter any issue
~/anaconda3/envs/mxnet_v1.5/lib/python3.6/site-packages/mxnet/gluon/block.py:1159: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
        data: None
  input_sym_arg_type = in_param.infer_type()[0]
~/anaconda3/envs/mxnet_v1.5/lib/python3.6/site-packages/mxnet/gluon/block.py:548: UserWarning: The 1-th input to HybridBlock is not used by any computation. Is this intended?
  out = self.forward(*args)
[18:03:18] src/nnvm/legacy_json_util.cc:204: Warning: loading symbol saved by MXNet version 10600 with lower version of MXNet v10501. May cause undefined behavior. Please update MXNet if you encounter any issue
Model: resnext-50
        ----Latency----
        Model Partition Time: 0
        Model Load Time: 79.47278022766113
        First Inference: 108.0780029296875
        p99: 15.761613845825195
        p90: 15.525102615356445
        p50: 15.347003936767578
        Avg: 15.376607843213481
        StdDev: 0.1906057337722647

Output based on mxnet master

EIA context not available, trying GPU...
GPU not available, trying CPU...
1.6.0 <module 'mxnet' from '~/github/incubator-mxnet/python/mxnet/__init__.py'>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
starting test for Model: resnext-50 as float32 with batch size 1
~/github/incubator-mxnet/python/mxnet/gluon/block.py:1398: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
        data: None
  input_sym_arg_type = in_param.infer_type()[0]
~/github/incubator-mxnet/python/mxnet/gluon/block.py:693: UserWarning: The 1-th input to HybridBlock is not used by any computation. Is this intended?
  out = self.forward(*args)
Model: resnext-50
        ----Latency----
        Model Partition Time: 0
        Model Load Time: 85.40558815002441
        First Inference: 113.36183547973633
        p99: 16.376495361328125
        p90: 16.173124313354492
        p50: 15.985250473022461
        Avg: 16.017432952330072
        StdDev: 0.18273338901509342

From my side, I didn't see any significant performance change from 1.5.1 to 1.6.0. I didn't test with mxnet-mkl 1.4.1 since the resnext50 model from the GluonCV model zoo is not compatible with mxnet 1.4.1. So it would be a great help if you guys can share the resnext50 model you used; then I can give it a try with mxnet 1.4.1. Thanks.

Hey @wuxun-zhang, sorry about that! I have put the model files and input data that I am using in the repo as a zip file, params.zip.

Should I also be running

export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-27 --membind=0 python gluon_resnext50.py --iteration 10000

when executing the test? Could that be the cause of the difference?

@pengzhao-intel performance in 1.6 came back a little but is still not as good as 1.4. Also, we are not sure if the performance recovery was due to other performance enhancements.

@jonatan1626 Could you strip any EIA-related settings from your script? Let's use the Apache MXNet script to test.

@apeforest so I think it's using Apache MXNet already? When running the script this is printed:

EIA context not available, trying GPU...
GPU not available, trying CPU...
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
starting test for Model: resnext-50 as float32 with batch size 1

I reran the benchmark with the above script on c5.18xlarge:

export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-27 --membind=0 python gluon_resnext50.py --iteration 100000

1.4.1
Install: pip install mxnet-mkl==1.4.1
Output:

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 91.20917320251465
    First Inference: 116.30487442016602
    p99: 16.54815673828125
    p90: 16.46280288696289
    p50: 16.413450241088867
    Avg: 16.413932993352866
    StdDev: 0.09641098586944327

1.5
Install: pip install mxnet-mkl==1.5.1
Output:

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 96.67277336120605
    First Inference: 135.10727882385254
    p99: 32.649993896484375
    p90: 31.419754028320312
    p50: 29.414892196655273
    Avg: 29.275005764308506
    StdDev: 1.75405292166976

1.6
Install: pip install mxnet-mkl==1.6.0b20191025
Output:

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 87.45336532592773
    First Inference: 116.45269393920898
    p99: 18.494844436645508
    p90: 18.326282501220703
    p50: 18.211841583251953
    Avg: 18.22383847160674
    StdDev: 0.09122314713059498

@jonatan1626 Maybe not related to the regression, but c5.18xlarge should have 2 CPU sockets with 18 cores on each. The env variables should look like:

export OMP_NUM_THREADS=36
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
python gluon_resnext50.py --iteration 100000

If you just want to leverage the first CPU socket (like what @wuxun-zhang did on his machine):

export OMP_NUM_THREADS=18
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 100000

Thanks @TaoLv for pointing out the different settings above.

@jonatan1626 Just updated the collected results on c5.18xlarge. I got very different performance numbers for mxnet-mkl 1.5.1 compared with yours. Could you double-check on your side?

With below commands (using all physical cores on 1 socket):

export OMP_NUM_THREADS=18
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 10000
1.4.1
Install: pip install mxnet-mkl==1.4.1
Output:

Model: resnext-50
        ----Latency----
        Model Partition Time: 0
        Model Load Time: 68.3131217956543
        First Inference: 90.35015106201172
        p99: 14.545679092407227
        p90: 14.46223258972168
        p50: 14.419078826904297
        Avg: 14.41427937695006
        StdDev: 0.09572181406460434



1.5.1
Install: pip install mxnet-mkl==1.5.1
Output:

Model: resnext-50
        ----Latency----
        Model Partition Time: 0
        Model Load Time: 68.64523887634277
        First Inference: 93.96791458129883
        p99: 16.599690914154053
        p90: 16.532421112060547
        p50: 16.48259162902832
        Avg: 16.486327974449974
        StdDev: 0.09714292044513262



1.6.0
Install: pip install mxnet-mkl==1.6.0b20191025
Output:

Model: resnext-50
        ----Latency----
        Model Partition Time: 0
        Model Load Time: 74.61214065551758
        First Inference: 100.08096694946289
        p99: 16.619694232940674
        p90: 16.5402889251709
        p50: 16.489028930664062
        Avg: 16.493797707719867
        StdDev: 0.09944463033333369

Thanks @TaoLv for that information!

@wuxun-zhang I may be missing something, but these are the numbers from the rerun.

These are my steps:

- I start a new c5.18xlarge instance
- Choose: Deep Learning AMI (Ubuntu 16.04) Version 26.0 - ami-07728e9e2742b0662
- Launch the instance and copy the git repo to it.
- sudo apt-get install numactl
- pip install opencv-python
- pip install <mxnet_version>
- export OMP_NUM_THREADS=18
- export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
- numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 10000

Does that seem right or am I missing something?

I have also installed mxnet-mkl==1.5.1 on the other machines and rerun the benchmark, and they give around the same numbers I've posted below.

1.4.1
Install: pip install mxnet-mkl==1.4.1
Output:
Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 48.482418060302734
    First Inference: 71.05255126953125
    p99: 14.76740837097168
    p90: 14.683246612548828
    p50: 14.641046524047852
    Avg: 14.643481583276303
    StdDev: 0.41109802796022626

1.5.1
Install: pip install mxnet-mkl==1.5.1
Output:
Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 59.892892837524414
    First Inference: 90.64745903015137
    p99: 23.79322052001953
    p90: 23.267030715942383
    p50: 22.50361442565918
    Avg: 22.492807133890796
    StdDev: 0.5872117870215227

1.6.0
Install: pip install mxnet-mkl==1.6.0b20191025
Output:
Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 240.96202850341797
    First Inference: 266.71481132507324
    p99: 16.834497451782227
    p90: 16.765832901000977
    p50: 16.708850860595703
    Avg: 16.70833106470542
    StdDev: 0.10597406036163494

@jonatan1626 could you do some performance debugging locally?

1) Following the link below, you can check whether MKLDNN is used, and then compare the runtime of each primitive in your network.
https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/mkldnn/mkldnn_readme#4
2) Use the MXNet profiling tools to dump the runtime of each operator and compare the runtimes between the two versions.

If you see an obvious performance drop at the primitive/operator level, feel free to paste it here.
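As a concrete starting point, a minimal sketch of both steps (not the exact commands used later in this thread). MKL-DNN primitive logging is turned on with an environment variable:

export MKLDNN_VERBOSE=1   # logs each MKL-DNN primitive call with its execution time

Per-operator times come from MXNet's built-in profiler, wrapped around the inference loop:

import mxnet as mx
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile.json')
profiler.set_state('run')
# ... run the inference loop under test here ...
mx.nd.waitall()               # flush async work so it is captured in the profile
profiler.set_state('stop')
print(profiler.dumps())       # aggregated per-operator table like the ones below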

@wuxun-zhang I think we need a document to illustrate how to debug performance for MXNet and MKLDNN, like what we did for the MKLDNN readme. Please try to write a draft and we can publish it on the website soon.

@jonatan1626 any thoughts? Could you compare the performance of the AMI, pip install, and build-from-source for 1.4 and 1.5.1?

@wuxun-zhang I think we need a document to illustrate how to debug performance for MXNet and MKLDNN, like what we did for the MKLDNN readme. Please try to write a draft and we can publish it on the website soon.

Yes, that will be very helpful. I'll start this work.

@pengzhao-intel sure I will look at the docs and can get the profiling numbers!

Also wanted to double check, we should look at: DL AMI 25/26, then do a pip install (1.4.1, 1.5.1, 1.6), and then build from source for 1.4.1 and 1.5.1 and run the benchmark? So a total of 12 different runs.

@jonatan1626 I just created an auto-run script to run the benchmark with the MXNet profiler. You can find these files in my gist repo. First, you need to apply the patch symbolic_run.diff to your original repo, and then just run ./auto_run_with_profiler.sh to collect profiling information (it will print the results to stdout). Below is what I got on my local machine.

mxnet-mkl 1.5.1

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_sg_mkldnn_conv                    530000       93196.5547           0.0440           2.7290           0.1758
CopyCPU2CPU                         10002        1750.9480           0.1330           0.2320           0.1751
_sg_mkldnn_fully_connected           10000        1408.5420           0.1180           0.4730           0.1409
BatchNorm                           10000         889.8060           0.0480           0.4280           0.0890
SoftmaxOutput                       10000         753.9630           0.0500           1.0160           0.0754
Pooling                             20000         834.3010           0.0230           0.3370           0.0417
Flatten                             10000         360.9760           0.0240           1.0100           0.0361
WaitForVar                          10000          67.7780           0.0040           0.0230           0.0068
Reorder                             10000          28.7770           0.0010           0.0130           0.0029

mxnet-mkl==1.4.1

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
SoftmaxOutput                       10000         310.8290           0.0240           0.0710           0.0311
DeleteVariable                      20000          61.4390           0.0000           0.0120           0.0031
Reorder                             30000         118.7160           0.0020           0.0120           0.0040
WaitForVar                          10000          67.3600           0.0040           0.0160           0.0067
Pooling                             20000         782.9750           0.0210           0.0790           0.0391
_sg_mkldnn_conv                    530000      106754.1406           0.0370           4.6030           0.2014
BatchNorm                           10000         517.9470           0.0430           0.0670           0.0518
FullyConnected                      10000        2523.4150           0.2110           0.6170           0.2523
Flatten                             10000        1566.0520           0.1320           0.4620           0.1566
CopyCPU2CPU                         10002        1775.1851           0.1430           0.2000           0.1775

Thanks @wuxun-zhang!

Here are the results of my run (I switched the iterations to 10000):

mxnet-mkl==1.4.1

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
SoftmaxOutput                       40000        1150.4480           0.0200          15.1680           0.0288
DeleteVariable                     100000         398.5110           0.0000           0.0290           0.0040
Reorder                            120000         284.6760           0.0010           0.2120           0.0024
WaitForVar                          40000         394.4030           0.0050           0.0280           0.0099
Pooling                             80000        3309.8770           0.0180          19.7900           0.0414
_sg_mkldnn_conv                   2120000      425492.1875           0.0320          32.0680           0.2007
BatchNorm                           40000        4496.3232           0.0350          31.7840           0.1124
FullyConnected                      40000       11512.5957           0.2460          27.1810           0.2878
Flatten                             40000        6089.9600           0.1300          12.0980           0.1522
CopyCPU2CPU                         60002        7034.2251           0.0890           0.1550           0.1172

mxnet-mkl==1.5.1

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_sg_mkldnn_conv                   2120000      382184.0938           0.0340          32.2100           0.1803
_sg_mkldnn_fully_connected           40000        5580.2979           0.1250          13.9900           0.1395
CopyCPU2CPU                         60002        6630.0391           0.0780           0.2140           0.1105
BatchNorm                           40000        3444.3301           0.0360          20.4180           0.0861
SoftmaxOutput                       40000        2328.8540           0.0420          23.9040           0.0582
Pooling                             80000        3521.3201           0.0200          21.3320           0.0440
Flatten                             40000         889.5670           0.0170          17.4920           0.0222
WaitForVar                          40000         518.3480           0.0060           0.0290           0.0130
DeleteVariable                      20000         169.5460           0.0040           0.0310           0.0085
Reorder                             40000          75.0850           0.0010           0.0190           0.0019

mxnet-mkl==1.6

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_sg_mkldnn_conv                   1060000      196001.8281           0.0360          31.9680           0.1849
CopyCPU2CPU                         30001        3590.7620           0.0860           0.1810           0.1197
_sg_mkldnn_fully_connected           20000        3062.9231           0.1370          25.2740           0.1531
BatchNorm                           20000        1971.8280           0.0570          26.9690           0.0986
Pooling                             40000        1834.4380           0.0210          21.8530           0.0459
SoftmaxOutput                       20000         987.8520           0.0370          19.0080           0.0494
Flatten                             20000         462.1840           0.0170          11.9900           0.0231
WaitForVar                          19999         264.5960           0.0080           0.0300           0.0132
DeleteVariable                      10000          90.5490           0.0050           0.0310           0.0091
Reorder                             20000          48.6460           0.0020           0.0180           0.0024

EIA context not available, trying GPU...

Why is this line ever printed at all? Apache MXNet should have no knowledge of an EIA context.

@apeforest Look at the code here. If I understood correctly, this script just uses try...except... to determine the running context; ctx = mx.eia() fails by throwing an exception.
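If I read it right, the fallback presumably boils down to something like this sketch (the exact script differs; mx.eia() only exists in the EIA-enabled build of MXNet, so on stock builds it raises and the script falls through):

import mxnet as mx

def pick_context():
    try:
        return mx.eia()              # only defined in the EIA build of MXNet
    except Exception:
        print("EIA context not available, trying GPU...")
    try:
        ctx = mx.gpu()
        mx.nd.zeros((1,), ctx=ctx)   # force an allocation so a missing GPU fails here
        return ctx
    except Exception:
        print("GPU not available, trying CPU...")
    return mx.cpu()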

@jonatan1626 Glad it works. It looks like conv and fc account for most of the time (>90%). From the above profiling results, we didn't see any obvious performance drop for any single operator, right?

@apeforest This benchmark script is from the EIA benchmark that we are using for inference tests. So it makes a call to check whether an EIA context is available; otherwise it will use the mx.cpu() context.

@wuxun-zhang yeah, I didn't see any regression! I'm looking closer at how we are timing it in the script. I'm a little confused because when you ran it, you're not getting any regression in the numbers, but when I run it, it looks like I have some regression.

Also, is there any difference between using load_imagenet_gluon vs load_imagenet_module?

@wuxun-zhang Based on your earlier performance results, it seems we still see the inference runtime increase from Avg: 14.64 in mxnet-1.4 to Avg: 16.486327974449974 in mxnet-1.5. Maybe it is not related to the MKLDNN operators.

@jonatan1626 Could you please continue to root-cause which commits caused the performance regression between 1.4 and 1.5?

Also, is there any difference between using load_imagenet_gluon vs load_imagenet_module?

Yes. load_imagenet_gluon uses the Gluon hybridized model, while load_imagenet_module uses the symbolic model. If you want, you can also use load_imagenet_module to re-collect the latency of resnext50 with the different mxnet-mkl versions.
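Roughly, the two loaders presumably boil down to the following; the checkpoint file names here are placeholders, not the ones in the benchmark repo:

import mxnet as mx

ctx = mx.cpu()
shape = (1, 3, 224, 224)

# Gluon path (load_imagenet_gluon): hybridized SymbolBlock, called imperatively.
net = mx.gluon.nn.SymbolBlock.imports('resnext50-symbol.json', ['data'],
                                      'resnext50-0000.params', ctx=ctx)
net(mx.nd.ones(shape, ctx=ctx)).wait_to_read()

# Module path (load_imagenet_module): symbolic executor; per the discussion above,
# this path picks up conv+bn fusion via the MKLDNN subgraph backend.
sym, arg_params, aux_params = mx.model.load_checkpoint('resnext50', 0)
mod = mx.mod.Module(sym, data_names=['data'], label_names=None, context=ctx)
mod.bind(data_shapes=[('data', shape)], for_training=False)
mod.set_params(arg_params, aux_params)
mod.forward(mx.io.DataBatch(data=[mx.nd.ones(shape, ctx=ctx)]))
mod.get_outputs()[0].wait_to_read()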

@apeforest
Is this what you wanted, with a 1.4.1 pip run, 1.5.1 pip run, 1.4.1 source run, and 1.5.1 source run?

pip mxnet-mkl==1.4.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 87.3405933380127
First Inference: 113.36421966552734
p99: 20.80082893371582
p90: 19.6378231048584
p50: 15.682220458984375
Avg: 16.990032211302186
StdDev: 6.2820178635081145

source mxnet-mkl==1.4.1
Wasn't able to build; will edit once I figure out what is wrong.

pip mxnet-mkl==1.5.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 80.60860633850098
First Inference: 121.35004997253418
p99: 37.58740425109863
p90: 30.339956283569336
p50: 27.198076248168945
Avg: 27.73231802529281
StdDev: 6.622306336227176

source mxnet-mkl==1.5.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 73.66228103637695
First Inference: 114.89081382751465
p99: 46.54550552368164
p90: 30.79986572265625
p50: 27.560710906982422
Avg: 28.353479811816488
StdDev: 10.021275501124565


@jonatan1626 Yes. That's an apples-to-apples comparison, as we discussed.

Can you also verify if there is degradation between pip install mxnet==1.5.0b20181209, which is the earliest pre-release of 1.5 on PyPI, and pip 1.4? If there is no degradation, we can just do a binary search on the daily builds of 1.5 on PyPI after that date. If there is still degradation, we may need to build from source and compare.
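One round of that binary search would look something like this; the build date below is an illustrative midpoint, and the actual set of nightly wheels published on PyPI would need to be checked:

pip install --force-reinstall mxnet-mkl==1.5.0b20190315   # hypothetical midpoint build
export OMP_NUM_THREADS=18
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 10000
# fast -> the regression landed after this date; slow -> on or before it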


Hi @jonatan1626, may I know your exact command to build MXNet with MKLDNN from source (mine is make USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel -j)? Which version of gcc/g++ did you use to build mxnet? I found that there may be a big performance difference when using different gcc versions on c5.18xlarge.

The results below were collected on c5.18xlarge - Deep Learning AMI (Ubuntu 16.04) Version 26.0 (ami-025ed45832b817a35).

gcc version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

I compared the performance of several primitives across three MXNet builds: pip-installed, source built with gcc 4.8, and source built with gcc 5.4. It looks like the performance of the batch_norm primitive (ref:any pass) is significantly affected by the gcc/g++ version. Previously, I also mentioned that the performance regression only happens when running Gluon hybridized models (no op fusion by default). For symbolic models, batch_norm is fused with conv automatically (MXNET_SUBGRAPH_BACKEND=MKLDNN), so there is no need to call the batch_norm primitive at all.

mkldnn primitives (ms) | gcc 4.8.5 (pip) | gcc 4.8.5 (source, apt-get install gcc-4.8) | gcc 5.4.0 (source, default)
-- | -- | -- | --
Conv | 5705.7 | 5672.18 | 5636.06
inner product | 100.172 | 106.25 | 101.991
batch_norm | 11484.1 | 11757.1 | 5065.54
softmax | 9.24 | 11.82 | 11.94
reorder | 305.375 | 306.86 | 308.302

Could you help double-check whether my results are reasonable on your side? You can use this script to do profiling on MKLDNN primitives (no other patches needed), and then you can get results like the above. Please ping me freely if you have any questions. Thanks.

Hey @wuxun-zhang, so I am using a docker build that seems to be running gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.4); you can find it here: mxnet-build

I am running './build.sh mkl' to build each mxnet version, but this now makes sense as to why we are getting different numbers!

The cause of the regression doesn't seem to be a problem with MKL, as we saw the same or better operator performance in the numbers posted above. We think another package may be the problem, so we are now trying to find the cause!

@apeforest Should I now just run all the benchmarks with the regular mxnet version or should I keep going with the mkl versions?

@jonatan1626 When building from source, please use exactly the same cmake command as @wuxun-zhang is using. ./build.sh is used to generate the pip wheel.

@apeforest Because the benchmarking scripts measure the performance of the pip wheels, shouldn't we only be building pip wheels and then running on those?
Also, should I be doing
pip install mxnet==1.5.0b20181209 or pip install mxnet-mkl==1.5.0b20181216?

Basically, should we stop looking at mkl for the performance regression or continue to use it?

@wuxun-zhang Sure, I can double-check on my side. Is that gcc run with the mxnet master branch? Also, what commands are you using to change the gcc version on the Ubuntu system?

@jonatan1626
When we say test the pip wheel, we mean using the pip wheel published at PyPI. When building from source, we always use cmake with specified flags. I think using ./build.sh to test performance is different from "build from source".

Is that gcc run with the mxnet master branch? Also, what commands are you using to change the gcc version on the Ubuntu system?

I just tested with the mxnet v1.5.1 branch. FYI, this link provides step-by-step guidance for switching between different gcc versions on Ubuntu.
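For completeness: switching compilers on Ubuntu is usually done with update-alternatives, roughly as below (the version numbers are examples, not taken from that guide):

sudo apt-get install gcc-4.8 g++-4.8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 40 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-4.8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-5
sudo update-alternatives --config gcc   # pick the default interactively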

@jonatan1626 Could you please provide an update? What's the final conclusion, or what are the action items, for the performance regression from 1.4 to 1.5?

@apeforest So I wasn't able to build from source and test mxnet 1.6.0.rc0, but these are the results from building from source on my side using make USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel -j:

gcc version 4.8.5 (Ubuntu 4.8.5-4ubuntu8~16.04.1)
mxnet 1.4.1

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 84.43331718444824
    First Inference: 114.0604019165039
    p99: 19.7904109954834
    p90: 18.867969512939453
    p50: 15.221357345581055
    Avg: 16.203637552348134
    StdDev: 4.929330571600373

mxnet 1.5.1

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 123.49724769592285
    First Inference: 154.80685234069824
    p99: 20.685195922851562
    p90: 19.562959671020508
    p50: 16.03412628173828
    Avg: 16.875816558337718
    StdDev: 4.678177418423743

@apeforest You can close this issue:

I have built from source and run the benchmark; there doesn't seem to be any regression between 1.5.1 and 1.6.0.rc1. I was able to build 1.6.0.rc1 without using the DL-AMI.

1.5.1

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 111.120700836
    First Inference: 142.073869705
    p99: 20.83611488342285
    p90: 19.79804039001465
    p50: 16.29781723022461
    Avg: 17.51398781550943
    StdDev: 5.346191697373225

1.6.0

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 170.495986938
    First Inference: 197.278022766
    p99: 21.157026290893555
    p90: 19.52815055847168
    p50: 16.55292510986328
    Avg: 17.2545698187063
    StdDev: 5.3719987490252725

The original issue was a regression between 1.4 and 1.5. Could you also post results built from source using 1.4?

Yup, these were the results when building from source. It doesn't look like there's any regression related to the source code; it may have been caused somewhere within the pip builds.

1.4.1

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 91.81904792785645
    First Inference: 125.44941902160645
    p99: 20.407915115356445
    p90: 19.57225799560547
    p50: 18.459558486938477
    Avg: 18.543348969393833
    StdDev: 4.550632101690607

1.5.1

Model: resnext-50
    ----Latency----
    Model Partition Time: 0
    Model Load Time: 91.66574478149414
    First Inference: 123.6727237701416
    p99: 22.166967391967773
    p90: 20.053863525390625
    p50: 18.6159610748291
    Avg: 18.976717993636427
    StdDev: 3.9545673816784284

No regression was found when building mxnet from source with 1.4, 1.5, and 1.6. The difference in performance could be due to inconsistencies in how the pip wheels used for testing were generated.

