The change that upgraded MKLDNN to 1.0 caused training throughput (images/sec) to drop by roughly 200 images/sec.
Throughput during training dropped to about 1300 images/sec.
Prior to this change, the throughput was in the range of 1500-1530 images/sec.
The attached gzip file contains the training script that trains the resnet18_v2 network on the CIFAR-10 dataset.
image_classification.tar.gz
The above numbers were measured on a C5.18xlarge Ubuntu instance.
The commands that reproduce the issue:
pip install psutil gluoncv
export KMP_AFFINITY='granularity=fine,compact,1,0' && export OMP_NUM_THREADS=36
python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64
The sample output looks like the following.
/usr/local/lib/python2.7/dist-packages/mxnet/numpy_op_signature.py:61: UserWarning: Some mxnet.numpy operator signatures may not be displayed consistently with their counterparts in the official NumPy package due to too-low Python version 2.7.12 (default, Oct 8 2019, 14:14:10)
[GCC 5.4.0 20160609]. Python >= 3.5 is required to make the signatures display correctly.
.format(str(sys.version)))
Namespace(batch_norm=False, batch_size=64, benchmark=False, dataset='cifar10', dtype='float32', epochs=25, gpus=0, kvstore='local', log_interval=50, lr=0.01, mode='symbolic', model='resnet18_v2', seed=123, use_pretrained=False, use_thumbnail=False, wd=0.0001)
[01:23:04] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[01:23:04] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[01:23:04] src/executor/graph_executor.cc:1936: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 892.55 samples/sec accuracy=0.288909
INFO:root:Epoch[0] Batch [50-100] Speed: 1390.86 samples/sec accuracy=0.390625
INFO:root:Epoch[0] Batch [100-150] Speed: 987.58 samples/sec accuracy=0.421250
INFO:root:Epoch[0] Batch [150-200] Speed: 1407.58 samples/sec accuracy=0.440312
INFO:root:Epoch[0] Batch [200-250] Speed: 1310.79 samples/sec accuracy=0.468438
INFO:root:Epoch[0] Batch [250-300] Speed: 1331.61 samples/sec accuracy=0.500313
INFO:root:Epoch[0] Batch [300-350] Speed: 1420.91 samples/sec accuracy=0.522500
INFO:root:Epoch[0] Batch [350-400] Speed: 1469.40 samples/sec accuracy=0.527813
INFO:root:Epoch[0] Batch [400-450] Speed: 1195.95 samples/sec accuracy=0.550312
INFO:root:Epoch[0] Batch [450-500] Speed: 1146.35 samples/sec accuracy=0.573125
INFO:root:Epoch[0] Batch [500-550] Speed: 1543.27 samples/sec accuracy=0.568125
INFO:root:Epoch[0] Batch [550-600] Speed: 1251.45 samples/sec accuracy=0.574688
INFO:root:Epoch[0] Batch [600-650] Speed: 1303.13 samples/sec accuracy=0.602187
INFO:root:Epoch[0] Batch [650-700] Speed: 1283.89 samples/sec accuracy=0.618750
INFO:root:Epoch[0] Batch [700-750] Speed: 955.70 samples/sec accuracy=0.607187
INFO:root:Epoch[0] Train-accuracy=0.514007
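For reference, the throughput figures quoted above are averages of the per-interval Speed values. A minimal sketch to compute that average from a captured log (train.log is a hypothetical file holding output like the above):
# Average the per-interval throughput from a captured training log.
grep -o 'Speed: [0-9.]*' train.log | awk '{sum += $2; n++} END {if (n) printf "%.2f samples/sec\n", sum / n}'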
@TaoLv @pengzhao-intel @zixuanweeei @samskalicky
@mxnet-label-bot add [R1.6.0]
@leleamol How did you install the mxnet package, from source code or the nightly build? If built from source, could you please share the make line as well? #16555 removed the libiomp5 library from the MXNet default build to comply with Apache License requirements. That could be the reason for this issue, but I still need to reproduce it to confirm. If possible, could you please try to build mxnet with USE_BLAS=mkl? It will pull in the libiomp5 library. To install MKL BLAS, please refer to https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_mkl.sh. Thanks!
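A build sketch for this suggestion, assuming the install script above places MKL under /opt/intel (the exact make invocation is not spelled out in this thread):
# Install MKL BLAS, then rebuild from the source tree with MKL as the BLAS backend.
sudo bash ci/docker/install/ubuntu_mkl.sh
make -j$(nproc) USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel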
Our test results: https://github.com/apache/incubator-mxnet/issues/16845#issuecomment-557757080
@TaoLv I have built the mxnet package from source.
I followed the instructions mentioned in the README.md.
I just put them in script form for quicker execution, as below.
To build the mkl variant, invoke the following script with "mkl" as the command-line parameter.
#!/usr/bin/env bash
CURRENT_DIR=`pwd`
echo $CURRENT_DIR
PIP_BUILD=$HOME/pip_build
MXNET_BUILD=$PIP_BUILD/mxnet-build
cd $HOME
mkdir -p $PIP_BUILD
# Move the cloned source tree into the pip build area.
mv $HOME/incubator-mxnet $MXNET_BUILD
cd $MXNET_BUILD
echo "Building mxnet."
# $1 selects the build variant, e.g. "mkl".
source tools/staticbuild/build.sh $1 pip
cd $PIP_BUILD
cp -r $MXNET_BUILD/tools/pip/. .
export mxnet_variant=$1
python setup.py bdist_wheel
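A typical invocation would then look like the following (the script name is illustrative; setup.py bdist_wheel drops the wheel under dist/):
# Build the mkl variant and install the resulting wheel.
bash build_mxnet.sh mkl            # hypothetical name for the script above
pip install $HOME/pip_build/dist/mxnet*.whl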
@zachgk assign [@apeforest ]
We ran CPU tests on both v1.5.x and v1.6.x with MKLDNN + OpenBLAS, but found no regression.
So could you try USE_BLAS=mkl, as @TaoLv suggested above, and test again?
I tried to use build.sh but it failed with: CMake Error at simd/CMakeLists.txt:41 (enable_language):
No CMAKE_ASM_NASM_COMPILER could be found.
So for v1.5 and v1.6 I built with:
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0
and set the OpenBLAS include and lib directories.
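For completeness, one way to point the make build at a custom OpenBLAS location is via the ADD_CFLAGS/ADD_LDFLAGS make variables (the paths here are illustrative):
# Build against OpenBLAS headers/libs in non-default locations.
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0 \
     ADD_CFLAGS="-I/usr/include/openblas" ADD_LDFLAGS="-L/usr/lib64"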
platform: skx-8180
1.5:
[rongzha1@mlt-ace ds2_training_inference]$ cd mxnet_1.5/
[rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep open
libopenblas.so.0 => /lib64/libopenblas.so.0 (0x00007f8db5ff9000)
libopencv_highgui.so.2.4 => /lib64/libopencv_highgui.so.2.4 (0x00007f8dacdaf000)
libopencv_imgproc.so.2.4 => /lib64/libopencv_imgproc.so.2.4 (0x00007f8dac931000)
libopencv_core.so.2.4 => /lib64/libopencv_core.so.2.4 (0x00007f8dac4f7000)
[rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep mkl
libmklml_intel.so => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libmklml_intel.so (0x00007f9707c8d000)
libmkldnn.so.0 => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libmkldnn.so.0 (0x00007f970671d000)
(mxnet) [rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep omp
libiomp5.so => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libiomp5.so (0x00007f75cbc42000)
libXcomposite.so.1 => /lib64/libXcomposite.so.1 (0x00007f75c2647000)
1.6.x:
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep open
libopenblas.so.0 => /usr/lib64/libopenblas.so.0 (0x00007fc101c03000)
libopencv_highgui.so.2.4 => /usr/lib64/libopencv_highgui.so.2.4 (0x00007fc1004cf000)
libopencv_imgproc.so.2.4 => /usr/lib64/libopencv_imgproc.so.2.4 (0x00007fc100051000)
libopencv_core.so.2.4 => /usr/lib64/libopencv_core.so.2.4 (0x00007fc0ffc18000)
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep mkl
libmkldnn.so.1 => /home/rongzha1/project/mxnet/ds2_training_inference/perf_regression/lib/libmkldnn.so.1 (0x00007f8378240000)
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep omp
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007f1357b17000)
libXcomposite.so.1 => /usr/lib64/libXcomposite.so.1 (0x00007f13509a1000)
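The ldd output above shows the suspected cause at a glance: the 1.5 build links libiomp5.so while the 1.6.x build links libgomp.so. A one-liner to check any build the same way:
# Report which OpenMP runtime (and BLAS) libmxnet.so is linked against.
ldd lib/libmxnet.so | grep -E 'iomp5|gomp|libomp|openblas|mkl'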
v1.5.x:
OMP=56
[21:43:26] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[21:43:26] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
INFO:root:Epoch[0] Batch [0-50] Speed: 1668.60 samples/sec accuracy=0.273897
INFO:root:Epoch[0] Batch [50-100] Speed: 1699.64 samples/sec accuracy=0.380312
INFO:root:Epoch[0] Batch [100-150] Speed: 1692.57 samples/sec accuracy=0.425000
INFO:root:Epoch[0] Batch [150-200] Speed: 1696.67 samples/sec accuracy=0.444063
INFO:root:Epoch[0] Batch [200-250] Speed: 1698.27 samples/sec accuracy=0.465000
INFO:root:Epoch[0] Batch [250-300] Speed: 1693.87 samples/sec accuracy=0.497812
INFO:root:Epoch[0] Batch [300-350] Speed: 1698.26 samples/sec accuracy=0.505625
INFO:root:Epoch[0] Batch [350-400] Speed: 1691.21 samples/sec accuracy=0.520000
INFO:root:Epoch[0] Batch [400-450] Speed: 1694.42 samples/sec accuracy=0.538750
INFO:root:Epoch[0] Batch [450-500] Speed: 1693.73 samples/sec accuracy=0.576875
INFO:root:Epoch[0] Batch [500-550] Speed: 1688.67 samples/sec accuracy=0.579063
INFO:root:Epoch[0] Batch [550-600] Speed: 1686.91 samples/sec accuracy=0.585313
INFO:root:Epoch[0] Batch [600-650] Speed: 1691.39 samples/sec accuracy=0.605313
INFO:root:Epoch[0] Batch [650-700] Speed: 1693.22 samples/sec accuracy=0.612812
INFO:root:Epoch[0] Batch [700-750] Speed: 1692.32 samples/sec accuracy=0.603750
INFO:root:Epoch[0] Train-accuracy=0.511549
INFO:root:Epoch[0] Time cost=29.955
INFO:root:Epoch[0] Validation-accuracy=0.642317
OMP=36
[22:10:31] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[22:10:31] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
INFO:root:Epoch[0] Batch [0-50] Speed: 1969.98 samples/sec accuracy=0.279412
INFO:root:Epoch[0] Batch [50-100] Speed: 2014.50 samples/sec accuracy=0.380937
INFO:root:Epoch[0] Batch [100-150] Speed: 2009.43 samples/sec accuracy=0.428125
INFO:root:Epoch[0] Batch [150-200] Speed: 2013.70 samples/sec accuracy=0.450313
INFO:root:Epoch[0] Batch [200-250] Speed: 2012.61 samples/sec accuracy=0.460625
INFO:root:Epoch[0] Batch [250-300] Speed: 2014.29 samples/sec accuracy=0.497812
INFO:root:Epoch[0] Batch [300-350] Speed: 2013.60 samples/sec accuracy=0.505000
INFO:root:Epoch[0] Batch [350-400] Speed: 2009.98 samples/sec accuracy=0.532500
INFO:root:Epoch[0] Batch [400-450] Speed: 2014.39 samples/sec accuracy=0.557500
INFO:root:Epoch[0] Batch [450-500] Speed: 2015.02 samples/sec accuracy=0.576250
INFO:root:Epoch[0] Batch [500-550] Speed: 2015.25 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [550-600] Speed: 2012.03 samples/sec accuracy=0.581250
INFO:root:Epoch[0] Batch [600-650] Speed: 2014.64 samples/sec accuracy=0.608437
INFO:root:Epoch[0] Batch [650-700] Speed: 2017.28 samples/sec accuracy=0.616563
INFO:root:Epoch[0] Batch [700-750] Speed: 2017.49 samples/sec accuracy=0.604688
INFO:root:Epoch[0] Train-accuracy=0.514086
INFO:root:Epoch[0] Time cost=24.895
INFO:root:Epoch[0] Validation-accuracy=0.635052
v1.6.x:
OMP=36
[22:02:24] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[22:02:25] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[22:02:25] src/executor/graph_executor.cc:1979: Subgraph backend MKLDNN is activated.
/home/rongzha1/anaconda3/envs/mxnet/lib/python3.6/site-packages/scipy/__init__.py:115: UserWarning: Numpy 1.13.3 or above is required for this version of scipy (detected version 1.13.1)
UserWarning)
INFO:root:Epoch[0] Batch [0-50] Speed: 2119.74 samples/sec accuracy=0.280025
INFO:root:Epoch[0] Batch [50-100] Speed: 2161.65 samples/sec accuracy=0.392500
INFO:root:Epoch[0] Batch [100-150] Speed: 2145.79 samples/sec accuracy=0.425938
INFO:root:Epoch[0] Batch [150-200] Speed: 2145.72 samples/sec accuracy=0.448125
INFO:root:Epoch[0] Batch [200-250] Speed: 2158.03 samples/sec accuracy=0.461250
INFO:root:Epoch[0] Batch [250-300] Speed: 2151.47 samples/sec accuracy=0.498125
INFO:root:Epoch[0] Batch [300-350] Speed: 2157.60 samples/sec accuracy=0.515312
INFO:root:Epoch[0] Batch [350-400] Speed: 2133.91 samples/sec accuracy=0.530625
INFO:root:Epoch[0] Batch [400-450] Speed: 2143.35 samples/sec accuracy=0.545625
INFO:root:Epoch[0] Batch [450-500] Speed: 2153.24 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [500-550] Speed: 2154.20 samples/sec accuracy=0.577500
INFO:root:Epoch[0] Batch [550-600] Speed: 2151.89 samples/sec accuracy=0.580625
INFO:root:Epoch[0] Batch [600-650] Speed: 2162.29 samples/sec accuracy=0.596250
INFO:root:Epoch[0] Batch [650-700] Speed: 2161.74 samples/sec accuracy=0.609062
INFO:root:Epoch[0] Batch [700-750] Speed: 2156.80 samples/sec accuracy=0.597812
INFO:root:Epoch[0] Train-accuracy=0.512828
INFO:root:Epoch[0] Time cost=23.642
INFO:root:Epoch[0] Validation-accuracy=0.613455
Considering @rongzha1's comment, I don't consider this issue to be a blocker for the 1.6 release. Please comment if you disagree, @leleamol @samskalicky.
@ptrendx @rongzha1 @PatricZhao thanks for looking into this, but the issue is not resolved until we verify by running the script @leleamol shared. build.sh is the script used to generate the pip wheels; using make doesn't follow the same steps and won't reproduce the problem.
If you can't reproduce the build using the same scripts, I can share a pre-built pip wheel with you separately.
Regarding the following error:
No CMAKE_ASM_NASM_COMPILER could be found.
you can fix it by installing nasm: sudo apt-get install nasm
Hi @samskalicky, I used the AWS Deep Learning AMI on a c5.18xlarge instance with Ubuntu 14.04, same as yours.
Using @leleamol's shared script to build mxnet:
mxnet1.5:
git checkout v1.5.x (commit c9818480680f84daa6e281a974ab263691302ba8)
During training, an error occurred:
mxnet.base.MXNetError: [08:18:23] src/operator/nn/mkldnn/mkldnn_base.cc:372: Unknown MKLDNN format for 4 dimensions: 53
So which version did you use? What's the commit id?
mxnet1.6:
git checkout v1.6.x (commit 200f0ec8ff55c7264554786822d8467dd9b15174)
With both the script build and the make build, training speed is about 1700 samples/sec.
I cannot reproduce the performance regression.
Details:
Using @leleamol's shared script to build mxnet (with 2 minor issues along the way).
The results are as follows:
[08:45:29] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[08:45:29] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[08:45:29] src/executor/graph_executor.cc:1984: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1444.97 samples/sec accuracy=0.267770
INFO:root:Epoch[0] Batch [50-100] Speed: 1657.16 samples/sec accuracy=0.381563
INFO:root:Epoch[0] Batch [100-150] Speed: 1629.53 samples/sec accuracy=0.423438
INFO:root:Epoch[0] Batch [150-200] Speed: 1686.67 samples/sec accuracy=0.441875
INFO:root:Epoch[0] Batch [200-250] Speed: 1671.42 samples/sec accuracy=0.462187
INFO:root:Epoch[0] Batch [250-300] Speed: 1723.94 samples/sec accuracy=0.510000
INFO:root:Epoch[0] Batch [300-350] Speed: 1699.66 samples/sec accuracy=0.507500
INFO:root:Epoch[0] Batch [350-400] Speed: 1665.39 samples/sec accuracy=0.523125
INFO:root:Epoch[0] Batch [400-450] Speed: 1724.03 samples/sec accuracy=0.531250
INFO:root:Epoch[0] Batch [450-500] Speed: 1723.66 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [500-550] Speed: 1724.53 samples/sec accuracy=0.574375
INFO:root:Epoch[0] Batch [550-600] Speed: 1721.45 samples/sec accuracy=0.581250
INFO:root:Epoch[0] Batch [600-650] Speed: 1658.77 samples/sec accuracy=0.607500
INFO:root:Epoch[0] Batch [650-700] Speed: 1725.24 samples/sec accuracy=0.606250
INFO:root:Epoch[0] Batch [700-750] Speed: 1726.21 samples/sec accuracy=0.606563
I also built with:
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0
cd python/ && python setup.py install
The results are as follows:
Archive: cifar10.zip
creating: cifar/
inflating: cifar/test.rec
inflating: cifar/test.lst
inflating: cifar/train.lst
inflating: cifar/train.rec
[07:38:12] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[07:38:12] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[07:38:12] src/executor/graph_executor.cc:1984: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1416.12 samples/sec accuracy=0.278799
INFO:root:Epoch[0] Batch [50-100] Speed: 1673.98 samples/sec accuracy=0.385313
INFO:root:Epoch[0] Batch [100-150] Speed: 1624.87 samples/sec accuracy=0.424687
INFO:root:Epoch[0] Batch [150-200] Speed: 1668.53 samples/sec accuracy=0.438750
INFO:root:Epoch[0] Batch [200-250] Speed: 1664.30 samples/sec accuracy=0.478438
INFO:root:Epoch[0] Batch [250-300] Speed: 1696.48 samples/sec accuracy=0.511250
INFO:root:Epoch[0] Batch [300-350] Speed: 1701.83 samples/sec accuracy=0.517188
INFO:root:Epoch[0] Batch [350-400] Speed: 1616.46 samples/sec accuracy=0.545000
INFO:root:Epoch[0] Batch [400-450] Speed: 1697.75 samples/sec accuracy=0.556875
INFO:root:Epoch[0] Batch [450-500] Speed: 1703.83 samples/sec accuracy=0.575625
INFO:root:Epoch[0] Batch [500-550] Speed: 1703.13 samples/sec accuracy=0.572812
INFO:root:Epoch[0] Batch [550-600] Speed: 1699.32 samples/sec accuracy=0.587187
INFO:root:Epoch[0] Batch [600-650] Speed: 1682.87 samples/sec accuracy=0.604688
INFO:root:Epoch[0] Batch [650-700] Speed: 1671.12 samples/sec accuracy=0.612187
INFO:root:Epoch[0] Batch [700-750] Speed: 1705.85 samples/sec accuracy=0.611875
INFO:root:Epoch[0] Train-accuracy=0.516964
INFO:root:Epoch[0] Time cost=30.561
INFO:root:Epoch[0] Validation-accuracy=0.628085
(Screenshots attached; images omitted here.)
Hi @TaoLv, is there an ETA to have this issue fixed? It's causing quite some concern around here.
Thanks,
Omar
Added a script for easy repro:
To run:
piotr@34-215-197-42:130:~$ for i in 1 2 4 8 16 32 64 128 256 512 1024 2048; do ./imagenet.sh $i 2>&1 | tee run_$i.log; done
piotr@34-215-197-42:1:~$ ./table.py
@oorqueda @samskalicky @leleamol As mentioned in https://github.com/apache/incubator-mxnet/issues/16891#issuecomment-557760466, I suspect that the regression is caused by the removal of libiomp5.so. To verify, please try applying the patch below to make/pip/pip_linux_mkl.mk:
diff --git a/make/pip/pip_linux_mkl.mk b/make/pip/pip_linux_mkl.mk
index 1cf389ae4..dd23434fa 100644
--- a/make/pip/pip_linux_mkl.mk
+++ b/make/pip/pip_linux_mkl.mk
@@ -49,7 +49,7 @@ ADD_CFLAGS += -I$(DEPS_PATH)/include -ffunction-sections -fdata-sections
# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
-USE_BLAS=openblas
+USE_BLAS=mkl
# whether use opencv during compilation
# you can disable it, however, you will not able to use
@@ -98,7 +98,7 @@ USE_LAPACK_PATH = $(DEPS_PATH)/lib
# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
-USE_INTEL_PATH = NONE
+USE_INTEL_PATH = /opt/intel/
And then build MXNet with:
tools/staticbuild/build.sh mkl pip
If it's true, I don't think we have any way to avoid the regression in pip packages, as removing libiomp5.so is a requirement from Apache. Please refer to https://github.com/apache/incubator-mxnet/issues/15544. Thanks!
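After rebuilding with the patch, one can sanity-check which BLAS backend was compiled in; a sketch assuming MXNet >= 1.5, where the runtime feature API is available:
# Confirm the BLAS backend of the installed build via MXNet's feature flags.
python -c "import mxnet; f = mxnet.runtime.Features(); print('BLAS_MKL:', f.is_enabled('BLAS_MKL'), 'BLAS_OPEN:', f.is_enabled('BLAS_OPEN'))"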
@leleamol could you help to confirm the current test status based on our feedback?
I don't want it to block the 1.6 release.
cc @samskalicky @apeforest
> As mentioned in #16891 (comment), I suspect that the regression is caused by the removal of libiomp5.so. [...] If it's true, I don't think we have any way to avoid the regression in pip packages as removing libiomp5.so is a requirement from Apache. Please refer to #15544. Thanks!
Retried with this patch after installing MKL BLAS with https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_mkl.sh and got these results:
Average Throughput: 1663.49 samples/sec
INFO:root:Epoch[0] Batch [0-50] Speed: 1414.31 samples/sec accuracy=0.281863
INFO:root:Epoch[0] Batch [50-100] Speed: 1610.74 samples/sec accuracy=0.382500
INFO:root:Epoch[0] Batch [100-150] Speed: 1625.33 samples/sec accuracy=0.430000
INFO:root:Epoch[0] Batch [150-200] Speed: 1649.23 samples/sec accuracy=0.432500
INFO:root:Epoch[0] Batch [200-250] Speed: 1663.87 samples/sec accuracy=0.465000
INFO:root:Epoch[0] Batch [250-300] Speed: 1640.63 samples/sec accuracy=0.495625
INFO:root:Epoch[0] Batch [300-350] Speed: 1671.83 samples/sec accuracy=0.502500
INFO:root:Epoch[0] Batch [350-400] Speed: 1669.90 samples/sec accuracy=0.516563
INFO:root:Epoch[0] Batch [400-450] Speed: 1600.49 samples/sec accuracy=0.548125
INFO:root:Epoch[0] Batch [450-500] Speed: 1669.11 samples/sec accuracy=0.562500
INFO:root:Epoch[0] Batch [500-550] Speed: 1671.51 samples/sec accuracy=0.558750
INFO:root:Epoch[0] Batch [550-600] Speed: 1667.67 samples/sec accuracy=0.586875
INFO:root:Epoch[0] Batch [600-650] Speed: 1670.19 samples/sec accuracy=0.591562
INFO:root:Epoch[0] Batch [650-700] Speed: 1652.81 samples/sec accuracy=0.611250
INFO:root:Epoch[0] Batch [700-750] Speed: 1630.58 samples/sec accuracy=0.600000
INFO:root:Epoch[0] Train-accuracy=0.508252
INFO:root:Epoch[0] Time cost=30.680
INFO:root:Epoch[0] Validation-accuracy=0.632166
INFO:root:Epoch[1] Batch [0-50] Speed: 1648.76 samples/sec accuracy=0.625613
INFO:root:Epoch[1] Batch [50-100] Speed: 1660.23 samples/sec accuracy=0.629375
INFO:root:Epoch[1] Batch [100-150] Speed: 1616.19 samples/sec accuracy=0.640312
INFO:root:Epoch[1] Batch [150-200] Speed: 1670.47 samples/sec accuracy=0.643125
INFO:root:Epoch[1] Batch [200-250] Speed: 1670.92 samples/sec accuracy=0.657500
INFO:root:Epoch[1] Batch [250-300] Speed: 1671.10 samples/sec accuracy=0.655625
INFO:root:Epoch[1] Batch [300-350] Speed: 1669.03 samples/sec accuracy=0.651250
INFO:root:Epoch[1] Batch [350-400] Speed: 1669.22 samples/sec accuracy=0.655312
INFO:root:Epoch[1] Batch [400-450] Speed: 1671.08 samples/sec accuracy=0.672813
INFO:root:Epoch[1] Batch [450-500] Speed: 1671.26 samples/sec accuracy=0.673750
INFO:root:Epoch[1] Batch [500-550] Speed: 1650.34 samples/sec accuracy=0.682500
INFO:root:Epoch[1] Batch [550-600] Speed: 1663.81 samples/sec accuracy=0.681250
INFO:root:Epoch[1] Batch [600-650] Speed: 1671.43 samples/sec accuracy=0.695625
INFO:root:Epoch[1] Batch [650-700] Speed: 1622.47 samples/sec accuracy=0.698438
INFO:root:Epoch[1] Batch [700-750] Speed: 1671.23 samples/sec accuracy=0.687187
INFO:root:Epoch[1] Train-accuracy=0.664633
INFO:root:Epoch[1] Time cost=30.096
INFO:root:Epoch[1] Validation-accuracy=0.673878
INFO:root:Epoch[2] Batch [0-50] Speed: 1668.44 samples/sec accuracy=0.701900
INFO:root:Epoch[2] Batch [50-100] Speed: 1673.86 samples/sec accuracy=0.698750
INFO:root:Epoch[2] Batch [100-150] Speed: 1669.55 samples/sec accuracy=0.712500
INFO:root:Epoch[2] Batch [150-200] Speed: 1673.31 samples/sec accuracy=0.713750
INFO:root:Epoch[2] Batch [200-250] Speed: 1673.31 samples/sec accuracy=0.726562
INFO:root:Epoch[2] Batch [250-300] Speed: 1672.89 samples/sec accuracy=0.717187
INFO:root:Epoch[2] Batch [300-350] Speed: 1651.81 samples/sec accuracy=0.725938
INFO:root:Epoch[2] Batch [350-400] Speed: 1623.66 samples/sec accuracy=0.718750
INFO:root:Epoch[2] Batch [400-450] Speed: 1672.81 samples/sec accuracy=0.729688
INFO:root:Epoch[2] Batch [450-500] Speed: 1672.86 samples/sec accuracy=0.736563
INFO:root:Epoch[2] Batch [500-550] Speed: 1669.99 samples/sec accuracy=0.730625
INFO:root:Epoch[2] Batch [550-600] Speed: 1670.90 samples/sec accuracy=0.728750
INFO:root:Epoch[2] Batch [600-650] Speed: 1673.84 samples/sec accuracy=0.739375
INFO:root:Epoch[2] Batch [650-700] Speed: 1675.46 samples/sec accuracy=0.750313
INFO:root:Epoch[2] Batch [700-750] Speed: 1675.23 samples/sec accuracy=0.739062
INFO:root:Epoch[2] Train-accuracy=0.725112
INFO:root:Epoch[2] Time cost=29.959
INFO:root:Epoch[2] Validation-accuracy=0.699419
INFO:root:Epoch[3] Batch [0-50] Speed: 1620.48 samples/sec accuracy=0.747243
INFO:root:Epoch[3] Batch [50-100] Speed: 1665.64 samples/sec accuracy=0.747188
INFO:root:Epoch[3] Batch [100-150] Speed: 1669.65 samples/sec accuracy=0.744375
INFO:root:Epoch[3] Batch [150-200] Speed: 1672.57 samples/sec accuracy=0.756563
INFO:root:Epoch[3] Batch [200-250] Speed: 1673.09 samples/sec accuracy=0.755625
INFO:root:Epoch[3] Batch [250-300] Speed: 1672.16 samples/sec accuracy=0.757500
INFO:root:Epoch[3] Batch [300-350] Speed: 1671.06 samples/sec accuracy=0.757812
INFO:root:Epoch[3] Batch [350-400] Speed: 1670.54 samples/sec accuracy=0.754687
INFO:root:Epoch[3] Batch [400-450] Speed: 1673.20 samples/sec accuracy=0.774375
INFO:root:Epoch[3] Batch [450-500] Speed: 1656.83 samples/sec accuracy=0.768750
INFO:root:Epoch[3] Batch [500-550] Speed: 1672.77 samples/sec accuracy=0.772813
INFO:root:Epoch[3] Batch [550-600] Speed: 1662.18 samples/sec accuracy=0.770312
INFO:root:Epoch[3] Batch [600-650] Speed: 1672.07 samples/sec accuracy=0.770000
INFO:root:Epoch[3] Batch [650-700] Speed: 1642.67 samples/sec accuracy=0.780000
INFO:root:Epoch[3] Batch [700-750] Speed: 1670.11 samples/sec accuracy=0.776875
INFO:root:Epoch[3] Train-accuracy=0.762764
INFO:root:Epoch[3] Time cost=30.022
INFO:root:Epoch[3] Validation-accuracy=0.731771
INFO:root:Epoch[4] Batch [0-50] Speed: 1667.95 samples/sec accuracy=0.778493
INFO:root:Epoch[4] Batch [50-100] Speed: 1672.75 samples/sec accuracy=0.790312
INFO:root:Epoch[4] Batch [100-150] Speed: 1669.29 samples/sec accuracy=0.776875
INFO:root:Epoch[4] Batch [150-200] Speed: 1673.50 samples/sec accuracy=0.792500
INFO:root:Epoch[4] Batch [200-250] Speed: 1672.97 samples/sec accuracy=0.783438
INFO:root:Epoch[4] Batch [250-300] Speed: 1672.72 samples/sec accuracy=0.796250
INFO:root:Epoch[4] Batch [300-350] Speed: 1658.90 samples/sec accuracy=0.784687
INFO:root:Epoch[4] Batch [350-400] Speed: 1669.21 samples/sec accuracy=0.790937
INFO:root:Epoch[4] Batch [400-450] Speed: 1664.05 samples/sec accuracy=0.800312
INFO:root:Epoch[4] Batch [450-500] Speed: 1637.17 samples/sec accuracy=0.789375
INFO:root:Epoch[4] Batch [500-550] Speed: 1665.37 samples/sec accuracy=0.799687
INFO:root:Epoch[4] Batch [550-600] Speed: 1668.98 samples/sec accuracy=0.806562
INFO:root:Epoch[4] Batch [600-650] Speed: 1672.85 samples/sec accuracy=0.809375
INFO:root:Epoch[4] Batch [650-700] Speed: 1674.14 samples/sec accuracy=0.816562
INFO:root:Epoch[4] Batch [700-750] Speed: 1674.87 samples/sec accuracy=0.800000
INFO:root:Epoch[4] Train-accuracy=0.794457
INFO:root:Epoch[4] Time cost=29.996
INFO:root:Epoch[4] Validation-accuracy=0.741740
INFO:root:Epoch[5] Batch [0-50] Speed: 1668.07 samples/sec accuracy=0.809436
INFO:root:Epoch[5] Batch [50-100] Speed: 1673.35 samples/sec accuracy=0.810312
INFO:root:Epoch[5] Batch [100-150] Speed: 1651.66 samples/sec accuracy=0.807500
INFO:root:Epoch[5] Batch [150-200] Speed: 1667.67 samples/sec accuracy=0.809063
INFO:root:Epoch[5] Batch [200-250] Speed: 1668.76 samples/sec accuracy=0.808750
INFO:root:Epoch[5] Batch [250-300] Speed: 1672.72 samples/sec accuracy=0.810937
INFO:root:Epoch[5] Batch [300-350] Speed: 1671.69 samples/sec accuracy=0.816562
INFO:root:Epoch[5] Batch [350-400] Speed: 1672.54 samples/sec accuracy=0.818750
INFO:root:Epoch[5] Batch [400-450] Speed: 1631.24 samples/sec accuracy=0.822187
INFO:root:Epoch[5] Batch [450-500] Speed: 1665.93 samples/sec accuracy=0.815937
INFO:root:Epoch[5] Batch [500-550] Speed: 1674.52 samples/sec accuracy=0.819063
INFO:root:Epoch[5] Batch [550-600] Speed: 1670.75 samples/sec accuracy=0.812500
INFO:root:Epoch[5] Batch [600-650] Speed: 1673.81 samples/sec accuracy=0.825937
INFO:root:Epoch[5] Batch [650-700] Speed: 1676.04 samples/sec accuracy=0.827187
INFO:root:Epoch[5] Batch [700-750] Speed: 1675.77 samples/sec accuracy=0.817813
INFO:root:Epoch[5] Train-accuracy=0.815501
INFO:root:Epoch[5] Time cost=29.948
INFO:root:Epoch[5] Validation-accuracy=0.749399
INFO:root:Epoch[6] Batch [0-50] Speed: 1669.17 samples/sec accuracy=0.837623
INFO:root:Epoch[6] Batch [50-100] Speed: 1661.24 samples/sec accuracy=0.813750
INFO:root:Epoch[6] Batch [100-150] Speed: 1667.14 samples/sec accuracy=0.830313
INFO:root:Epoch[6] Batch [150-200] Speed: 1667.80 samples/sec accuracy=0.826250
INFO:root:Epoch[6] Batch [200-250] Speed: 1673.15 samples/sec accuracy=0.826562
INFO:root:Epoch[6] Batch [250-300] Speed: 1646.27 samples/sec accuracy=0.836875
INFO:root:Epoch[6] Batch [300-350] Speed: 1666.01 samples/sec accuracy=0.829375
INFO:root:Epoch[6] Batch [350-400] Speed: 1672.95 samples/sec accuracy=0.834688
INFO:root:Epoch[6] Batch [400-450] Speed: 1673.64 samples/sec accuracy=0.835625
INFO:root:Epoch[6] Batch [450-500] Speed: 1675.71 samples/sec accuracy=0.843437
INFO:root:Epoch[6] Batch [500-550] Speed: 1674.81 samples/sec accuracy=0.849688
INFO:root:Epoch[6] Batch [550-600] Speed: 1670.66 samples/sec accuracy=0.848750
INFO:root:Epoch[6] Batch [600-650] Speed: 1674.67 samples/sec accuracy=0.850000
INFO:root:Epoch[6] Batch [650-700] Speed: 1676.15 samples/sec accuracy=0.852187
INFO:root:Epoch[6] Batch [700-750] Speed: 1662.28 samples/sec accuracy=0.840625
INFO:root:Epoch[6] Train-accuracy=0.837408
INFO:root:Epoch[6] Time cost=29.926
INFO:root:Epoch[6] Validation-accuracy=0.755609
INFO:root:Epoch[7] Batch [0-50] Speed: 1669.53 samples/sec accuracy=0.851409
INFO:root:Epoch[7] Batch [50-100] Speed: 1673.99 samples/sec accuracy=0.851875
INFO:root:Epoch[7] Batch [100-150] Speed: 1664.78 samples/sec accuracy=0.845000
INFO:root:Epoch[7] Batch [150-200] Speed: 1643.95 samples/sec accuracy=0.848125
INFO:root:Epoch[7] Batch [200-250] Speed: 1673.32 samples/sec accuracy=0.846250
INFO:root:Epoch[7] Batch [250-300] Speed: 1674.50 samples/sec accuracy=0.854062
INFO:root:Epoch[7] Batch [300-350] Speed: 1667.81 samples/sec accuracy=0.868750
INFO:root:Epoch[7] Batch [350-400] Speed: 1672.58 samples/sec accuracy=0.856875
INFO:root:Epoch[7] Batch [400-450] Speed: 1674.09 samples/sec accuracy=0.856563
INFO:root:Epoch[7] Batch [450-500] Speed: 1674.60 samples/sec accuracy=0.855000
INFO:root:Epoch[7] Batch [500-550] Speed: 1674.48 samples/sec accuracy=0.868125
INFO:root:Epoch[7] Batch [550-600] Speed: 1670.71 samples/sec accuracy=0.854688
INFO:root:Epoch[7] Batch [600-650] Speed: 1674.68 samples/sec accuracy=0.859375
INFO:root:Epoch[7] Batch [650-700] Speed: 1675.54 samples/sec accuracy=0.867812
INFO:root:Epoch[7] Batch [700-750] Speed: 1636.57 samples/sec accuracy=0.861250
INFO:root:Epoch[7] Train-accuracy=0.856634
INFO:root:Epoch[7] Time cost=29.935
INFO:root:Epoch[7] Validation-accuracy=0.751202
INFO:root:Epoch[8] Batch [0-50] Speed: 1666.25 samples/sec accuracy=0.862745
INFO:root:Epoch[8] Batch [50-100] Speed: 1667.20 samples/sec accuracy=0.871563
INFO:root:Epoch[8] Batch [100-150] Speed: 1638.39 samples/sec accuracy=0.859688
INFO:root:Epoch[8] Batch [150-200] Speed: 1668.52 samples/sec accuracy=0.874687
INFO:root:Epoch[8] Batch [200-250] Speed: 1664.86 samples/sec accuracy=0.866875
INFO:root:Epoch[8] Batch [250-300] Speed: 1670.59 samples/sec accuracy=0.866250
INFO:root:Epoch[8] Batch [300-350] Speed: 1672.36 samples/sec accuracy=0.872500
INFO:root:Epoch[8] Batch [350-400] Speed: 1667.79 samples/sec accuracy=0.876250
INFO:root:Epoch[8] Batch [400-450] Speed: 1672.58 samples/sec accuracy=0.875938
INFO:root:Epoch[8] Batch [450-500] Speed: 1672.51 samples/sec accuracy=0.871250
INFO:root:Epoch[8] Batch [500-550] Speed: 1671.49 samples/sec accuracy=0.878750
INFO:root:Epoch[8] Batch [550-600] Speed: 1668.27 samples/sec accuracy=0.884062
INFO:root:Epoch[8] Batch [600-650] Speed: 1656.65 samples/sec accuracy=0.882812
INFO:root:Epoch[8] Batch [650-700] Speed: 1671.64 samples/sec accuracy=0.884062
INFO:root:Epoch[8] Batch [700-750] Speed: 1673.34 samples/sec accuracy=0.874687
INFO:root:Epoch[8] Train-accuracy=0.873581
INFO:root:Epoch[8] Time cost=30.010
INFO:root:Epoch[8] Validation-accuracy=0.766421
INFO:root:Epoch[9] Batch [0-50] Speed: 1669.04 samples/sec accuracy=0.879289
INFO:root:Epoch[9] Batch [50-100] Speed: 1671.88 samples/sec accuracy=0.887188
INFO:root:Epoch[9] Batch [100-150] Speed: 1662.53 samples/sec accuracy=0.867500
INFO:root:Epoch[9] Batch [150-200] Speed: 1672.37 samples/sec accuracy=0.881875
INFO:root:Epoch[9] Batch [200-250] Speed: 1672.11 samples/sec accuracy=0.886563
INFO:root:Epoch[9] Batch [250-300] Speed: 1635.77 samples/sec accuracy=0.870938
INFO:root:Epoch[9] Batch [300-350] Speed: 1670.30 samples/sec accuracy=0.884062
INFO:root:Epoch[9] Batch [350-400] Speed: 1671.09 samples/sec accuracy=0.879375
INFO:root:Epoch[9] Batch [400-450] Speed: 1667.68 samples/sec accuracy=0.883125
INFO:root:Epoch[9] Batch [450-500] Speed: 1673.33 samples/sec accuracy=0.885000
INFO:root:Epoch[9] Batch [500-550] Speed: 1672.83 samples/sec accuracy=0.883750
INFO:root:Epoch[9] Batch [550-600] Speed: 1668.54 samples/sec accuracy=0.887500
INFO:root:Epoch[9] Batch [600-650] Speed: 1672.97 samples/sec accuracy=0.890312
INFO:root:Epoch[9] Batch [650-700] Speed: 1653.01 samples/sec accuracy=0.889062
INFO:root:Epoch[9] Batch [700-750] Speed: 1673.44 samples/sec accuracy=0.889062
INFO:root:Epoch[9] Train-accuracy=0.883263
INFO:root:Epoch[9] Time cost=29.960
INFO:root:Epoch[9] Validation-accuracy=0.762520
INFO:root:Epoch[10] Batch [0-50] Speed: 1666.71 samples/sec accuracy=0.887868
INFO:root:Epoch[10] Batch [50-100] Speed: 1672.06 samples/sec accuracy=0.882500
INFO:root:Epoch[10] Batch [100-150] Speed: 1668.15 samples/sec accuracy=0.881250
INFO:root:Epoch[10] Batch [150-200] Speed: 1667.18 samples/sec accuracy=0.899062
INFO:root:Epoch[10] Batch [200-250] Speed: 1670.72 samples/sec accuracy=0.881563
INFO:root:Epoch[10] Batch [250-300] Speed: 1671.63 samples/sec accuracy=0.890000
INFO:root:Epoch[10] Batch [300-350] Speed: 1669.62 samples/sec accuracy=0.905625
INFO:root:Epoch[10] Batch [350-400] Speed: 1664.69 samples/sec accuracy=0.904375
INFO:root:Epoch[10] Batch [400-450] Speed: 1671.13 samples/sec accuracy=0.901250
INFO:root:Epoch[10] Batch [450-500] Speed: 1666.08 samples/sec accuracy=0.896250
INFO:root:Epoch[10] Batch [500-550] Speed: 1670.59 samples/sec accuracy=0.905312
INFO:root:Epoch[10] Batch [550-600] Speed: 1667.69 samples/sec accuracy=0.894687
INFO:root:Epoch[10] Batch [600-650] Speed: 1671.95 samples/sec accuracy=0.895938
INFO:root:Epoch[10] Batch [650-700] Speed: 1672.98 samples/sec accuracy=0.909375
INFO:root:Epoch[10] Batch [700-750] Speed: 1624.72 samples/sec accuracy=0.909375
INFO:root:Epoch[10] Train-accuracy=0.896667
INFO:root:Epoch[10] Time cost=29.974
INFO:root:Epoch[10] Validation-accuracy=0.764123
@NihalHarish thanks for verifying
@TaoLv
The patch doesn't seem to be merged on the master branch. Is there a reason it wasn't done along with the PR that bumped MKLDNN to v1.0 (https://github.com/apache/incubator-mxnet/pull/16555)?
diff --git a/make/pip/pip_linux_mkl.mk b/make/pip/pip_linux_mkl.mk
index 1cf389ae4..dd23434fa 100644
--- a/make/pip/pip_linux_mkl.mk
+++ b/make/pip/pip_linux_mkl.mk
@@ -49,7 +49,7 @@ ADD_CFLAGS += -I$(DEPS_PATH)/include -ffunction-sections -fdata-sections
# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
-USE_BLAS=openblas
+USE_BLAS=mkl
# whether use opencv during compilation
# you can disable it, however, you will not able to use
@@ -98,7 +98,7 @@ USE_LAPACK_PATH = $(DEPS_PATH)/lib
# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
-USE_INTEL_PATH = NONE
+USE_INTEL_PATH = /opt/intel/
If it was omitted by mistake, and since it is required, I could push a PR for it.
Thanks.
@ChaiBapchya The file is used to build the mxnet-mkl pip package. If you want to change the configurations, I think you need to raise a proposal on dev@.
What is the status of this issue? From the conversation it seems to me that Intel people think it is not an issue (or at least it is unavoidable) and Amazon people are concerned about this. Is that accurate? If so, how does it affect the 1.6 release - should I go ahead and make the RC despite this issue or is there active work going on to fix it?
@TaoLv are you saying that we should keep the current config where we build the mkl flavor with openblas:
master:
https://github.com/apache/incubator-mxnet/blob/7895f93e67dc3e9da360f7a9c667e3c0f1e76c0f/make/staticbuild/linux_mkl.mk#L52
1.6.x branch:
https://github.com/apache/incubator-mxnet/blob/a576531836c5a5c4fb6dfbc944de94b619d6ccfa/make/pip/pip_linux_mkl.mk#L52
Or are you proposing that it needs to be changed to build the mkl flavor with mkl blas instead of openblas?
mkl flavor packages have always been built with USE_BLAS=openblas. We can change that to MKL BLAS if we are allowed to include a dependency with a Category X license [1] in MXNet convenience releases.
Thanks @TaoLv
I was able to rebuild and reproduce Nihal's results:
$ python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64
Namespace(batch_norm=False, batch_size=64, benchmark=False, dataset='cifar10', dtype='float32', epochs=25, gpus=0, kvstore='local', log_interval=50, lr=0.01, mode='symbolic', model='resnet18_v2', seed=123, use_pretrained=False, use_thumbnail=False, wd=0.0001)
Archive: cifar10.zip
creating: cifar/
inflating: cifar/test.rec
inflating: cifar/test.lst
inflating: cifar/train.lst
inflating: cifar/train.rec
[05:12:00] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[05:12:00] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[05:12:00] src/executor/graph_executor.cc:1979: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1583.17 samples/sec accuracy=0.285846
INFO:root:Epoch[0] Batch [50-100] Speed: 1508.38 samples/sec accuracy=0.388750
INFO:root:Epoch[0] Batch [100-150] Speed: 1623.32 samples/sec accuracy=0.433125
INFO:root:Epoch[0] Batch [150-200] Speed: 1613.61 samples/sec accuracy=0.443437
INFO:root:Epoch[0] Batch [200-250] Speed: 1642.54 samples/sec accuracy=0.455000
INFO:root:Epoch[0] Batch [250-300] Speed: 1625.45 samples/sec accuracy=0.506250
INFO:root:Epoch[0] Batch [300-350] Speed: 1620.83 samples/sec accuracy=0.515312
INFO:root:Epoch[0] Batch [350-400] Speed: 1637.02 samples/sec accuracy=0.537500
INFO:root:Epoch[0] Batch [400-450] Speed: 1635.96 samples/sec accuracy=0.550937
INFO:root:Epoch[0] Batch [450-500] Speed: 1641.26 samples/sec accuracy=0.574688
INFO:root:Epoch[0] Batch [500-550] Speed: 1643.39 samples/sec accuracy=0.569063
INFO:root:Epoch[0] Batch [550-600] Speed: 1639.69 samples/sec accuracy=0.573125
INFO:root:Epoch[0] Batch [600-650] Speed: 1644.01 samples/sec accuracy=0.598437
INFO:root:Epoch[0] Batch [650-700] Speed: 1644.10 samples/sec accuracy=0.614375
INFO:root:Epoch[0] Batch [700-750] Speed: 1644.86 samples/sec accuracy=0.601250
The root cause of this performance regression is the change of BLAS library (switching from MKL BLAS to OpenBLAS) and the removal of the libiomp5.so library.
Now the next step is to determine how we want to proceed. Do we continue with OpenBLAS and take the performance hit, or, as @TaoLv mentioned, can we use the Category X licensed dependency?
Hi @TaoLv, @samskalicky,
Intel MKL-DNN includes a GEMM implementation that is comparable in performance to Intel MKL's. Is using mkldnn_gemm an option here?
@TaoLv @pengzhao-intel Are there features in MXNet that require MKL as the BLAS library? I was able to find this line:
https://github.com/apache/incubator-mxnet/blob/c82af38211dbf8356a4f3b35f023632c5bf880ae/src/operator/quantization/quantized_fully_connected.cc#L291
I'm rereading the previous comment and now I'm confused:
@oorqueda @samskalicky @leleamol As mentioned in #16891 (comment), I suspect that the regression is caused by the removal of libiomp5.so.
...
If it's true, I don't think we have any choice to avoid the regression in pip packages as removing libiomp5.so is a requirement from Apache. Please refer to #15544. Thanks!
Is the performance difference coming from using Intel's OpenMP library (libiomp5) or from using the MKL BLAS library itself and some routines like GEMM (as @vpirogov mentions)?
@vpirogov @samskalicky Although MKL BLAS may also have a positive impact on the case demonstrated above, I think the main gap is from the different OMP runtimes. Setting USE_BLAS=mkl will help to pull in iomp5. Sure, I'm going to replace cblas_sgemm and cblas_sgemm_batch with the MatMul primitive from DNNL once it's released, but I don't think that will fill the gap between gomp and iomp5.
@samskalicky The code you referred to will not be called in the ResNet18 case. Most of the computation in ResNet18 should go to DNNL.
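One quick (unofficial) way to test whether the OMP runtime alone accounts for the gap is to preload a different runtime into the gomp-linked build before benchmarking; this is an experiment, not a supported configuration, and the library path is illustrative:
# Preload Intel's OpenMP runtime ahead of libgomp for a single benchmark run.
LD_PRELOAD=/opt/intel/lib/intel64/libiomp5.so \
python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 1 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64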
@TaoLv, is anything preventing us from using the LLVM OpenMP runtime (libomp)? It is pretty much an open source version of libiomp5.
@vpirogov We can do that. My only concern is its interoperability. Also, from the MXNet perspective, we would need to move the release process from make to cmake, which I don't think can be done within the schedule of the 1.6.0 release.
What do you mean by interoperability exactly?
@TaoLv To get closure on this topic, would it be possible to move the discussion forward?
Thanks
@vpirogov @ChaiBapchya The interoperability concern means: when more than one OpenMP runtime is loaded into the same process, they can conflict.
@TaoLv,
You are right that when different OpenMP runtimes are used in the same application there's a potential for interoperability issues. For this particular discussion it's important to note that the interoperability considerations are the same for libiomp5 and libomp. From that perspective using libomp does not introduce any additional issues in comparison to what MXNet used before (i.e. libiomp5).
@vpirogov, yes, that's true. libomp and libiomp5 should have the same interoperability issues. From this perspective, the current release build solution (makefile + gomp) sounds like the safer choice, though it has relatively worse performance. I assume that gomp has better interoperability than the other two runtimes, though that may not be true.
@samskalicky and all,
The problem is very clear now. I think we need to make a decision and move forward.
Two possible paths, as below:
Keep the build as-is with gomp
pros: stable and mature now
cons: a slight performance drop
Re-build with LLVM OpenMP via CMake
pros: same performance as before
cons: effort on improving the CMake path and potential interoperability issues
From my side, I prefer the first option. What's your opinion?
Hi @pengzhao-intel, in MXNet 2.0 CMake is planned to be the only build system: https://github.com/apache/incubator-mxnet/projects/18#card-30594044
Would that address the cons in Option 2?
> Would that address the cons in Option 2?
It's a good chance to make the system clean :)
Closing, since the fix has already been included with the latest MKLDNN version update.