In a recent test, we noticed a 33% performance drop between 1.4.1 and 1.5.1 in resnext-50 inference. We have triaged and found the culprit to be https://github.com/apache/incubator-mxnet/pull/15802.
@PatricZhao Please help to investigate.
@jonatan1626 @ChaiBapchya will provide more data points.
There is a commit between 1.4.1 -> 1.5.1 that is causing the regression. Initially I found that #15802 was the issue, but after rerunning/reproducing it to verify, I am getting different results. I am currently going through the commits to figure out where the regression is and will post my results once I have found it!
These were the initial numbers that prompted this investigation:
Now we are trying to find the culprit for the regression from 1.4.1 -> 1.5.1.
There was a gain from 1.5.1 -> 1.6 from the MKLDNN upgrade #15422.
=========================== MXNet 1.4.1 Pip ==============================
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 84.53512191772461
First Inference: 116.00756645202637
p99: 19.801616668701172
p90: 19.596172332763672
p50: 15.244483947753906
Avg: 15.836643512970845
StdDev: 3.903761278317364
Iterations: 100000
Context: cpu
=========================== MXNet 1.5.1 Pip ==============================
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 84.19561386108398
First Inference: 121.11115455627441
p99: 34.098148345947266
p90: 30.607223510742188
p50: 27.56810188293457
Avg: 27.974184123915748
StdDev: 7.246931590666572
Iterations: 100000
Context: cpu
=========================== MXNet 1.6 Pip ==============================
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 71.62761688232422
First Inference: 104.33459281921387
p99: 21.777629852294922
p90: 20.840167999267578
p50: 17.279386520385742
Avg: 18.166795754279853
StdDev: 4.648682601625848
Iterations: 100000
Context: cpu
Quick update on this: commit id c5265fbf635965b5124df2d7dcea1b57c6508b73 (I'm labeling this commit number 182, meaning it is 182 commits behind 1.5.1) gives us:
"p50": 18.60,
"p90": 22.57,
"p99": 27.02
File "gluon_resnext50.py", line 24, in <module>
run_resnext50_gluon(*parse_inputs())
File "gluon_resnext50.py", line 20, in run_resnext50_gluon
log_name='resnext50'
File "/home/ubuntu/recreate/recreate/perf_script/utils.py", line 67, in run_model
infer_time1,infer_results = infer(model, inputs, bsize)
File "/home/ubuntu/recreate/recreate/perf_script/infer.py", line 115, in infer_imagenet_gluon
return gluon_forward(model,mx_img,bsize)
File "/home/ubuntu/recreate/recreate/perf_script/infer.py", line 151, in gluon_forward
results.wait_to_read()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1819, in wait_to_read
check_call(_LIB.MXNDArrayWaitToRead(self.handle))
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [02:10:27] src/operator/nn/mkldnn/mkldnn_base.cc:398: Unknown MKLDNN format for 5 dimensions: 108
Stack trace:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2445fb) [0x7f33d0e745fb]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x250fe5) [0x7f33d0e80fe5]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::GetMKLDNNData(mkldnn::memory::primitive_desc const&) const+0x1f4) [0x7f33d320d784]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::GetWeights(mxnet::NDArray const&, mkldnn::memory::primitive_desc const&, int)+0x21) [0x7f33d0e87d11]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::op::MKLDNNConvolutionForwardFullFeature(mxnet::op::MKLDNNConvFullParam const&, mxnet::OpContext const&, mxnet::op::MKLDNNConvForward*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x4da) [0x7f33d0ea35ba]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x434) [0x7f33d0ea4d04]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x72497a) [0x7f33d135497a]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x241b547) [0x7f33d304b547]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2468cef) [0x7f33d3098cef]
Commit id e5902ec626400b8b12b2ac1da082ed86bafa67ed (22) is also not runnable.
The next commit that works is commit id d6fd15131f6ab915fc3085b9fdd505e521cc66a1 (21), which gives:
"p50": 27.37
"p90": 30.61,
"p99": 34.53
There are 160-ish commits between these two commit ids, and I haven't tried running them all. I guess the next step would be to run them and see whether any of them work at all?
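Rather than running all ~160 commits, one option is to binary-search them. A minimal sketch, assuming a hypothetical `build_and_bench.sh` helper that builds the checked-out commit and prints its p50 latency, and an assumed ~20 ms threshold separating "good" from "regressed":

```python
import subprocess

def p50_at(commit):
    """Check out a commit, rebuild, run the benchmark, and return its p50 latency in ms."""
    subprocess.check_call(["git", "checkout", commit])
    out = subprocess.check_output(["./build_and_bench.sh"])  # hypothetical build+benchmark helper
    return float(out.decode().strip())

def first_bad_commit(commits, threshold_ms=20.0):
    """commits is ordered oldest to newest; commits[0] is known good, commits[-1] is known bad."""
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if p50_at(commits[mid]) < threshold_ms:
            lo = mid   # still fast here, so the regression landed later
        else:
            hi = mid   # already slow, so the regression is at or before this commit
    return commits[hi]
```

Commits that don't build or run at all (like the MKLDNN format error above) would have to be skipped, e.g. the way `git bisect skip` handles them.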
@wuxun-zhang please take a look at this issue
@apeforest how about the performance on 1.6 and latest master?
@apeforest @jonatan1626 I will look into this issue. BTW, could you also share the benchmarking scripts you used? Thanks.
Thanks @wuxun-zhang you can find the scripts here: https://github.com/jonatan1626/mxnet_performance
The folder build_mxnet contains the Docker build for MXNet. I put the commit id of the MXNet version I want to build in the Dockerfile.
Build using docker build -t mxnet . within that directory.
Then docker run -it mxnet bash. The wheel can be found in the dist directory.
I copy that out and then pip install it on the instance.
I then go to the perf_script folder and run python gluon_resnext50.py --iteration 10000. That will take a little while and print out the numbers I'm reporting!
Let me know if you run into any issues! I am having trouble building 1.6 and the latest master this way; I am currently trying to figure out what's wrong.
@jonatan1626 Thanks for sharing. But there is an issue when I try to download the image from S3 on my local machine.
(mxnet_v1.5) $ python gluon_resnext50.py --iteration 100 [ae78845]
EIA context not available, trying GPU...
GPU not available, trying CPU...
Traceback (most recent call last):
File "gluon_resnext50.py", line 24, in <module>
run_resnext50_gluon(*parse_inputs())
File "gluon_resnext50.py", line 7, in run_resnext50_gluon
img = load_image(dtype,bsize)
File "/home/wuxunzha/mxnet_performance/perf_script/utils.py", line 19, in load_image
s3.Bucket('perf-benchmark').download_file(s3_path, filename)
......
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Could you temporarily upload the test image and resnext50 model files to the GitHub repo if the file size is not too large? Thanks.
I tried the resnext50 model from the GluonCV model zoo since I cannot access the model in S3. I used a local CLX-8280 machine with 28 physical cores, and the inference command is as below:
export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-27 --membind=0 python gluon_resnext50.py --iteration 10000
Output based on mxnet-mkl==1.5.1
EIA context not available, trying GPU...
GPU not available, trying CPU...
1.5.1 <module 'mxnet' from '~/anaconda3/envs/mxnet_v1.5/lib/python3.6/site-packages/mxnet/__init__.py'>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
starting test for Model: resnext-50 as float32 with batch size 1
[18:00:41] src/nnvm/legacy_json_util.cc:204: Warning: loading symbol saved by MXNet version 10600 with lower version of MXNet v10501. May cause undefined behavior. Please update MXNet if you encounter any issue
~/anaconda3/envs/mxnet_v1.5/lib/python3.6/site-packages/mxnet/gluon/block.py:1159: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
data: None
input_sym_arg_type = in_param.infer_type()[0]
~/anaconda3/envs/mxnet_v1.5/lib/python3.6/site-packages/mxnet/gluon/block.py:548: UserWarning: The 1-th input to HybridBlock is not used by any computation. Is this intended?
out = self.forward(*args)
[18:03:18] src/nnvm/legacy_json_util.cc:204: Warning: loading symbol saved by MXNet version 10600 with lower version of MXNet v10501. May cause undefined behavior. Please update MXNet if you encounter any issue
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 79.47278022766113
First Inference: 108.0780029296875
p99: 15.761613845825195
p90: 15.525102615356445
p50: 15.347003936767578
Avg: 15.376607843213481
StdDev: 0.1906057337722647
Output based on mxnet master
EIA context not available, trying GPU...
GPU not available, trying CPU...
1.6.0 <module 'mxnet' from '~/github/incubator-mxnet/python/mxnet/__init__.py'>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
starting test for Model: resnext-50 as float32 with batch size 1
~/github/incubator-mxnet/python/mxnet/gluon/block.py:1398: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
data: None
input_sym_arg_type = in_param.infer_type()[0]
~/github/incubator-mxnet/python/mxnet/gluon/block.py:693: UserWarning: The 1-th input to HybridBlock is not used by any computation. Is this intended?
out = self.forward(*args)
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 85.40558815002441
First Inference: 113.36183547973633
p99: 16.376495361328125
p90: 16.173124313354492
p50: 15.985250473022461
Avg: 16.017432952330072
StdDev: 0.18273338901509342
From my side, I don't see any significant performance change from 1.5.1 to 1.6.0. I didn't test with mxnet-mkl 1.4.1 since the resnext50 model from the GluonCV model zoo is not compatible with MXNet 1.4.1. So it would be a great help if you could share the resnext50 model you used; then I can give it a try with MXNet 1.4.1. Thanks.
Hey @wuxun-zhang sorry about that! I have put the model files and input data that I am using in the repo in a zip file params.zip.
Should I also be running
export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-27 --membind=0 python gluon_resnext50.py --iteration 10000
when executing the test? Could that be the cause of the difference?
@pengzhao-intel performance in 1.6 came back a little but is still not as good as 1.4. Also, we are not sure whether the performance recovery was due to other performance enhancements.
@jonatan1626 Could you strip any EIA-related settings from your script? Let's use the Apache MXNet script to test.
@apeforest so I think it's using Apache MXNet already? When running the script this is printed:
EIA context not available, trying GPU...
GPU not available, trying CPU...
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
starting test for Model: resnext-50 as float32 with batch size 1
I reran the benchmark with the above script on c5.18xlarge:
export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-27 --membind=0 python gluon_resnext50.py --iteration 100000
1.4.1
Install: pip install mxnet-mkl==1.4.1
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 91.20917320251465
First Inference: 116.30487442016602
p99: 16.54815673828125
p90: 16.46280288696289
p50: 16.413450241088867
Avg: 16.413932993352866
StdDev: 0.09641098586944327
1.5
Install: pip install mxnet-mkl==1.5.1
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 96.67277336120605
First Inference: 135.10727882385254
p99: 32.649993896484375
p90: 31.419754028320312
p50: 29.414892196655273
Avg: 29.275005764308506
StdDev: 1.75405292166976
1.6
Install: pip install mxnet-mkl==1.6.0b20191025
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 87.45336532592773
First Inference: 116.45269393920898
p99: 18.494844436645508
p90: 18.326282501220703
p50: 18.211841583251953
Avg: 18.22383847160674
StdDev: 0.09122314713059498
@jonatan1626 Maybe not related to the regression, but c5.18xlarge should have 2 CPU sockets with 18 cores on each. The env variables should look like:
export OMP_NUM_THREADS=36
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
python gluon_resnext50.py --iteration 100000
If you just want to leverage the first cpu socket (like what @wuxun-zhang did on his machine):
export OMP_NUM_THREADS=18
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 100000
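If it is easier to keep these settings inside the benchmark script itself, a rough Python equivalent is sketched below (the values match the single-socket suggestion above; they have to be set before mxnet is imported so the OpenMP runtime picks them up, and the numactl memory binding still has to come from the shell):

```python
import os

# Single-socket settings (18 physical cores per socket on c5.18xlarge);
# exported before importing mxnet so libiomp/OpenMP reads them at load time.
os.environ["OMP_NUM_THREADS"] = "18"
os.environ["KMP_AFFINITY"] = "granularity=fine,noduplicates,compact,1,0"

import mxnet as mx  # intentionally imported after the environment is configured

print(mx.__version__)
```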
Thanks @TaoLv for pointing out the different settings above.
@jonatan1626 Just updated the collected results on c5.18xlarge. I got very different performance numbers for mxnet-mkl 1.5.1 compared with yours. Could you double check on your side?
With the commands below (using all physical cores on one socket):
export OMP_NUM_THREADS=18
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 10000
1.4.1
Install: pip install mxnet-mkl==1.4.1
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 68.3131217956543
First Inference: 90.35015106201172
p99: 14.545679092407227
p90: 14.46223258972168
p50: 14.419078826904297
Avg: 14.41427937695006
StdDev: 0.09572181406460434
1.5.1
Install: pip install mxnet-mkl==1.5.1
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 68.64523887634277
First Inference: 93.96791458129883
p99: 16.599690914154053
p90: 16.532421112060547
p50: 16.48259162902832
Avg: 16.486327974449974
StdDev: 0.09714292044513262
1.6.0
Install: pip install mxnet-mkl==1.6.0b20191025
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 74.61214065551758
First Inference: 100.08096694946289
p99: 16.619694232940674
p90: 16.5402889251709
p50: 16.489028930664062
Avg: 16.493797707719867
StdDev: 0.09944463033333369
Thanks @TaoLv for that information!
@wuxun-zhang I may be missing something, but these are the numbers from the rerun.
These are my steps:
- I start a new c5.18xlarge instance
- Choose: Deep Learning AMI (Ubuntu 16.04) Version 26.0 - ami-07728e9e2742b0662
- Launch the instance and copy the git repo to it.
- sudo apt-get install numactl
- pip install opencv-python
- pip install <mxnet_version>
- export OMP_NUM_THREADS=18
- export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
- numactl --physcpubind=0-17 --membind=0 python gluon_resnext50.py --iteration 10000
Does that seem right or am I missing something?
I have also installed mxnet-mkl==1.5.1 on the other machines and reran the benchmark, and they give around the same numbers as I've posted below.
1.4.1
Install: pip install mxnet-mkl==1.4.1
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 48.482418060302734
First Inference: 71.05255126953125
p99: 14.76740837097168
p90: 14.683246612548828
p50: 14.641046524047852
Avg: 14.643481583276303
StdDev: 0.41109802796022626
Install: pip install mxnet-mkl==1.5.1
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 59.892892837524414
First Inference: 90.64745903015137
p99: 23.79322052001953
p90: 23.267030715942383
p50: 22.50361442565918
Avg: 22.492807133890796
StdDev: 0.5872117870215227
Install: pip install mxnet-mkl==1.6.0b20191025
Output:
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 240.96202850341797
First Inference: 266.71481132507324
p99: 16.834497451782227
p90: 16.765832901000977
p50: 16.708850860595703
Avg: 16.70833106470542
StdDev: 0.10597406036163494
@jonatan1626 could you do some performance debugging locally?
1) Following the link below, you can check whether MKLDNN is used, and then you can compare the runtime of each primitive in your network.
https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/mkldnn/mkldnn_readme#4
2) Use the MXNet profiling tools to dump the runtime of each operator and compare the runtimes between the two versions (see the sketch below).
If you see an obvious performance drop at the primitive/operator level, feel free to paste it here.
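For point 2, a minimal sketch of wrapping the built-in profiler around the inference loop in gluon_resnext50.py (the filename and the inference loop are placeholders for whatever the script already does):

```python
import mxnet as mx

# Record all operators and keep aggregate statistics so dumps() prints a summary table.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename="resnext50_profile.json")
mx.profiler.set_state("run")

# ... run the existing inference loop from gluon_resnext50.py here,
# e.g. results = net(img); results.wait_to_read() ...

mx.nd.waitall()               # make sure all asynchronous work has finished
mx.profiler.set_state("stop")
print(mx.profiler.dumps())    # per-operator Count / Total / Min / Max / Avg table
```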
@wuxun-zhang I think we need a document to illustrate how to debug performance for MXNet and MKLDNN, like what we did for the MKLDNN readme. Please try to write a draft and we can publish it on the website soon.
@jonatan1626 Any thoughts? Could you compare the performance of the AMI, pip install, and build from source for 1.4 and 1.5.1?
Yes, a document on debugging MXNet/MKLDNN performance will be very helpful. I'll start this work.
@pengzhao-intel sure I will look at the docs and can get the profiling numbers!
Also, I wanted to double check: we should look at DL AMI 25/26, then do a pip install (1.4.1, 1.5.1, 1.6), and then build from source for 1.4.1 and 1.5.1 and run the benchmark? So a total of 12 different runs.
@jonatan1626 I just created an auto-run script to run the benchmark with the MXNet profiler. You can find these files in my gist repo. First, you need to apply the patch symbolic_run.diff to your original repo, and then just run ./auto_run_with_profiler.sh to collect the profiling information (it will print the results to stdout). Below is what I got on my local machine.
mxnet-mkl 1.5.1
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 530000 93196.5547 0.0440 2.7290 0.1758
CopyCPU2CPU 10002 1750.9480 0.1330 0.2320 0.1751
_sg_mkldnn_fully_connected 10000 1408.5420 0.1180 0.4730 0.1409
BatchNorm 10000 889.8060 0.0480 0.4280 0.0890
SoftmaxOutput 10000 753.9630 0.0500 1.0160 0.0754
Pooling 20000 834.3010 0.0230 0.3370 0.0417
Flatten 10000 360.9760 0.0240 1.0100 0.0361
WaitForVar 10000 67.7780 0.0040 0.0230 0.0068
Reorder 10000 28.7770 0.0010 0.0130 0.0029
mxnet-mkl==1.4.1
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
SoftmaxOutput 10000 310.8290 0.0240 0.0710 0.0311
DeleteVariable 20000 61.4390 0.0000 0.0120 0.0031
Reorder 30000 118.7160 0.0020 0.0120 0.0040
WaitForVar 10000 67.3600 0.0040 0.0160 0.0067
Pooling 20000 782.9750 0.0210 0.0790 0.0391
_sg_mkldnn_conv 530000 106754.1406 0.0370 4.6030 0.2014
BatchNorm 10000 517.9470 0.0430 0.0670 0.0518
FullyConnected 10000 2523.4150 0.2110 0.6170 0.2523
Flatten 10000 1566.0520 0.1320 0.4620 0.1566
CopyCPU2CPU 10002 1775.1851 0.1430 0.2000 0.1775
Thanks @wuxun-zhang!
Here are the results of my run (I switched the iterations to 10000):
mxnet-mkl==1.4.1
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
SoftmaxOutput 40000 1150.4480 0.0200 15.1680 0.0288
DeleteVariable 100000 398.5110 0.0000 0.0290 0.0040
Reorder 120000 284.6760 0.0010 0.2120 0.0024
WaitForVar 40000 394.4030 0.0050 0.0280 0.0099
Pooling 80000 3309.8770 0.0180 19.7900 0.0414
_sg_mkldnn_conv 2120000 425492.1875 0.0320 32.0680 0.2007
BatchNorm 40000 4496.3232 0.0350 31.7840 0.1124
FullyConnected 40000 11512.5957 0.2460 27.1810 0.2878
Flatten 40000 6089.9600 0.1300 12.0980 0.1522
CopyCPU2CPU 60002 7034.2251 0.0890 0.1550 0.1172
mxnet-mkl==1.5.1
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 2120000 382184.0938 0.0340 32.2100 0.1803
_sg_mkldnn_fully_connected 40000 5580.2979 0.1250 13.9900 0.1395
CopyCPU2CPU 60002 6630.0391 0.0780 0.2140 0.1105
BatchNorm 40000 3444.3301 0.0360 20.4180 0.0861
SoftmaxOutput 40000 2328.8540 0.0420 23.9040 0.0582
Pooling 80000 3521.3201 0.0200 21.3320 0.0440
Flatten 40000 889.5670 0.0170 17.4920 0.0222
WaitForVar 40000 518.3480 0.0060 0.0290 0.0130
DeleteVariable 20000 169.5460 0.0040 0.0310 0.0085
Reorder 40000 75.0850 0.0010 0.0190 0.0019
mxnet-mkl==1.6
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 1060000 196001.8281 0.0360 31.9680 0.1849
CopyCPU2CPU 30001 3590.7620 0.0860 0.1810 0.1197
_sg_mkldnn_fully_connected 20000 3062.9231 0.1370 25.2740 0.1531
BatchNorm 20000 1971.8280 0.0570 26.9690 0.0986
Pooling 40000 1834.4380 0.0210 21.8530 0.0459
SoftmaxOutput 20000 987.8520 0.0370 19.0080 0.0494
Flatten 20000 462.1840 0.0170 11.9900 0.0231
WaitForVar 19999 264.5960 0.0080 0.0300 0.0132
DeleteVariable 10000 90.5490 0.0050 0.0310 0.0091
Reorder 20000 48.6460 0.0020 0.0180 0.0024
EIA context not available, trying GPU...
Why is this line ever printed at all? Apache MXNet should have no knowledge of an EIA context.
@apeforest Looking at the code here, if I understood correctly, this script just uses try...except... to determine the running context; ctx = mx.eia() fails by throwing an exception.
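A rough sketch of that pattern (names are illustrative, not the exact code in the script):

```python
import mxnet as mx
from mxnet.base import MXNetError

def pick_context():
    # mx.eia() only exists in the EIA build of MXNet, so on Apache MXNet the call
    # fails and we fall back to GPU, then CPU.
    try:
        return mx.eia()
    except (AttributeError, MXNetError):
        print("EIA context not available, trying GPU...")
    try:
        ctx = mx.gpu()
        mx.nd.zeros((1,), ctx=ctx)  # force an allocation to confirm a usable GPU
        return ctx
    except MXNetError:
        print("GPU not available, trying CPU...")
        return mx.cpu()
```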
@jonatan1626 Glad it works. It looks like conv and fc account for most of the time (>90%). From the above profiling results, we don't see any obvious performance drop for any individual operator, right?
@apeforest This benchmark script is from the EIA benchmark that we are using for inference tests. So it makes a call to check whether an EIA context is available; otherwise it uses the mx.cpu() context.
@wuxun-zhang yeah, I didn't see any regression! I'm looking closer at how we are timing it in the script. I'm a little confused because when you run it you don't get any regression in the numbers, but when I run it, it looks like I have some regression.
Also, is there any difference between using load_imagenet_gluon vs load_imagenet_module?
@wuxun-zhang Based on your earlier performance results, it seems we still see the inference runtime increase from Avg 14.64 ms in mxnet-1.4 to Avg ~16.49 ms in mxnet-1.5. Maybe it is not related to the MKLDNN operators.
@jonatan1626 Could you please continue to rootcause the commits that caused the performance regression between 1.4 and 1.5?
@jonatan1626 Yes. load_imagenet_gluon will use the Gluon hybridized model while load_imagenet_module will use the symbolic model. If you want, you can also use load_imagenet_module to re-collect the latency of resnext50 with the different mxnet-mkl versions.
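For reference, a minimal sketch of the difference between the two paths (the file names, checkpoint prefix, and input shape are illustrative, not the actual files in the repo):

```python
import mxnet as mx

img = mx.nd.random.uniform(shape=(1, 3, 224, 224))

# Gluon path (roughly what load_imagenet_gluon does): import the exported
# symbol/params into a SymbolBlock and call it like a hybridized block.
net = mx.gluon.nn.SymbolBlock.imports("resnext50-symbol.json", ["data"],
                                      "resnext50-0000.params", ctx=mx.cpu())
net(img).wait_to_read()

# Module path (roughly what load_imagenet_module does): load the checkpoint
# and run inference through the symbolic Module API.
sym, arg_params, aux_params = mx.model.load_checkpoint("resnext50", 0)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)
mod.forward(mx.io.DataBatch([img]), is_train=False)
mod.get_outputs()[0].wait_to_read()
```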
@apeforest
Is this what you wanted, with a 1.4.1 pip run, 1.5.1 pip run, 1.4.1 source run, and 1.5.1 source run?
pip mxnet-mkl==1.4.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 87.3405933380127
First Inference: 113.36421966552734
p99: 20.80082893371582
p90: 19.6378231048584
p50: 15.682220458984375
Avg: 16.990032211302186
StdDev: 6.2820178635081145
source mxnet-mkl==1.4.1
Wasn't able to build; will edit once I figure out what is wrong.
pip mxnet-mkl==1.5.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 80.60860633850098
First Inference: 121.35004997253418
p99: 37.58740425109863
p90: 30.339956283569336
p50: 27.198076248168945
Avg: 27.73231802529281
StdDev: 6.622306336227176
source mxnet-mkl==1.5.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 73.66228103637695
First Inference: 114.89081382751465
p99: 46.54550552368164
p90: 30.79986572265625
p50: 27.560710906982422
Avg: 28.353479811816488
StdDev: 10.021275501124565
@jonatan1626 Yes. That's an apples-to-apples comparison, as we discussed.
Can you also verify whether there is degradation between pip install mxnet==1.5.0b20181209, which is the earliest pre-release of 1.5 on PyPI, and pip 1.4? If there is no degradation, we can just do a binary search on the daily builds of 1.5 on PyPI after that date. If there is still degradation, we may need to build from source and compare.
Hi @jonatan1626, may I know your exact command to build MXNet with MKLDNN from source (mine is make USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel -j)? Which version of gcc/g++ do you use to build MXNet? I found that there may be a large performance difference when using different gcc versions on c5.18xlarge.
The results below were collected on c5.18xlarge - Deep Learning AMI (Ubuntu 16.04) Version 26.0 (ami-025ed45832b817a35).
gcc version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
I compared the performance of several primitives for MXNet: pip-installed, built from source with gcc 4.8, and built from source with gcc 5.4. It looks like the performance of the batch_norm primitive (ref:any pass) is significantly affected by the gcc/g++ version. Previously, I also mentioned that the performance regression only happens when running Gluon hybridized models (no op fusion by default). For symbolic models, batch_norm is fused with conv automatically (MXNET_SUBGRAPH_BACKEND=MKLDNN), so there is no need to call the batch_norm primitive.
mkldnn primitives (ms) | gcc 4.8.5 (pip) | gcc 4.8.5 (source, apt-get install gcc-4.8) | gcc 5.4.0 (source, default)
-- | -- | -- | --
Conv | 5705.7 | 5672.18 | 5636.06
inner product | 100.172 | 106.25 | 101.991
batch_norm | 11484.1 | 11757.1 | 5065.54
softmax | 9.24 | 11.82 | 11.94
reorder | 305.375| 306.86 | 308.302
Could you help double check whether my results are reasonable on your side? You can use this script to do profiling on the MKLDNN primitives (no other patches needed), and then you can get results like the above. Please feel free to ping me if you have any questions. Thanks.
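Without the script, a rough way to get per-primitive totals like the table above is MKL-DNN's verbose mode. A sketch under the assumption that the verbose lines look like `mkldnn_verbose,exec,<primitive>,...,<time in ms>` (the exact format can differ slightly across MKL-DNN versions):

```python
import os
import subprocess
import sys
from collections import defaultdict

# MKLDNN_VERBOSE=1 makes the MKL-DNN backend print one line per executed primitive,
# with the primitive kind in the third field and its execution time (ms) in the last.
env = dict(os.environ, MKLDNN_VERBOSE="1")
proc = subprocess.run([sys.executable, "gluon_resnext50.py", "--iteration", "1000"],
                      env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                      universal_newlines=True)

totals = defaultdict(float)
for line in proc.stdout.splitlines():
    if line.startswith("mkldnn_verbose,exec,"):
        fields = line.split(",")
        totals[fields[2]] += float(fields[-1])   # primitive kind -> total time in ms

for prim, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print("%-20s %10.2f ms" % (prim, ms))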
Hey @wuxun-zhang, so I am using a Docker build that seems to be running gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.4); you can find it here: mxnet-build
I am running './build.sh mkl' to build the MXNet version, but now it makes sense why we are getting different numbers!
The cause of the regression doesn't seem to be a problem with MKL, as we saw the same or better operator performance in the numbers posted above. We think another package may be the problem, so we are now trying to find the cause.
@apeforest Should I now just run all the benchmarks with the regular mxnet version or should I keep going with the mkl versions?
@jonatan1626 When building from source, please use exactly the same build command as @wuxun-zhang is using. The ./build.sh script is used to generate the pip wheel.
@apeforest Because the benchmarks measure the performance of the pip wheels, shouldn't we only be building pip wheels and then running on those?
Also, should I be doing:
pip install mxnet==1.5.0b20181209 or pip install mxnet-mkl==1.5.0b20181216?
Basically, should we not look at mkl for the performance regression anymore or continue to use it?
@wuxun-zhang Sure, I can double check on my side. Is that gcc comparison run with the mxnet master branch? Also, what commands are you using to change the gcc version on the Ubuntu system?
@jonatan1626
When we say test the pip wheel, we mean using the pip wheel published on PyPI. When building from source, we always use cmake with the specified flags. I think using ./build.sh to test performance is different from "build from source".
@jonatan1626 I just tested with the mxnet v1.5.1 branch. FYI, this link provides step-by-step guidance for switching between different gcc versions on Ubuntu.
@jonatan1626 Could you please provide an update? What are the final conclusions or action items for the performance from 1.4 to 1.5?
@apeforest So I wasn't able to build from source and test MXNet 1.6.0.rc0, but these are the results from the build from source on my side, using make USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel -j.
gcc version 4.8.5 (Ubuntu 4.8.5-4ubuntu8~16.04.1)
mxnet 1.4.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 84.43331718444824
First Inference: 114.0604019165039
p99: 19.7904109954834
p90: 18.867969512939453
p50: 15.221357345581055
Avg: 16.203637552348134
StdDev: 4.929330571600373
mxnet 1.5.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 123.49724769592285
First Inference: 154.80685234069824
p99: 20.685195922851562
p90: 19.562959671020508
p50: 16.03412628173828
Avg: 16.875816558337718
StdDev: 4.678177418423743
@apeforest You can close this issue:
I have built from source and run the benchmark; there doesn't seem to be any regression between 1.5.1 and 1.6.0.rc1. I was able to build 1.6.0.rc1 without using the DL AMI.
1.5.1
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 111.120700836
First Inference: 142.073869705
p99: 20.83611488342285
p90: 19.79804039001465
p50: 16.29781723022461
Avg: 17.51398781550943
StdDev: 5.346191697373225
1.6.0
Model: resnext-50
----Latency----
Model Partition Time: 0
Model Load Time: 170.495986938
First Inference: 197.278022766
p99: 21.157026290893555
p90: 19.52815055847168
p50: 16.55292510986328
Avg: 17.2545698187063
StdDev: 5.3719987490252725
The original issue was a regression between 1.4 and 1.5. Could you also post results built from source using 1.4?
Yup, these are the results when building from source. It doesn't look like there's any regression in the source code itself; it may have been introduced somewhere within the pip builds.
Model: resnext-50
1.4.1
----Latency----
Model Partition Time: 0
Model Load Time: 91.81904792785645
First Inference: 125.44941902160645
p99: 20.407915115356445
p90: 19.57225799560547
p50: 18.459558486938477
Avg: 18.543348969393833
StdDev: 4.550632101690607
1.5.1
----Latency----
Model Partition Time: 0
Model Load Time: 91.66574478149414
First Inference: 123.6727237701416
p99: 22.166967391967773
p90: 20.053863525390625
p50: 18.6159610748291
Avg: 18.976717993636427
StdDev: 3.9545673816784284
No regression was found when building MXNet from source with 1.4, 1.5, and 1.6. The difference in performance could be due to an inconsistent build script used when generating the pip wheels for testing.