The change that upgraded MKLDNN to 1.0 caused training throughput (images/sec) to drop by roughly 200 images/sec.
Throughput during training dropped to about 1300 images/sec.
Prior to this change, the throughput was in the range of 1500-1530 images/sec.
The attached gzip file contains the training script that trains the resnet18_v2 network on the CIFAR-10 dataset.
image_classification.tar.gz
The above numbers were measured on a C5.18xlarge Ubuntu instance.
The commands that reproduce the issue:
pip install psutil gluoncv
export KMP_AFFINITY='granularity=fine,compact,1,0' && export OMP_NUM_THREADS=36
python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64
The sample output looks like the following.
/usr/local/lib/python2.7/dist-packages/mxnet/numpy_op_signature.py:61: UserWarning: Some mxnet.numpy operator signatures may not be displayed consistently with their counterparts in the official NumPy package due to too-low Python version 2.7.12 (default, Oct 8 2019, 14:14:10)
[GCC 5.4.0 20160609]. Python >= 3.5 is required to make the signatures display correctly.
.format(str(sys.version)))
Namespace(batch_norm=False, batch_size=64, benchmark=False, dataset='cifar10', dtype='float32', epochs=25, gpus=0, kvstore='local', log_interval=50, lr=0.01, mode='symbolic', model='resnet18_v2', seed=123, use_pretrained=False, use_thumbnail=False, wd=0.0001)
[01:23:04] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[01:23:04] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[01:23:04] src/executor/graph_executor.cc:1936: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 892.55 samples/sec accuracy=0.288909
INFO:root:Epoch[0] Batch [50-100] Speed: 1390.86 samples/sec accuracy=0.390625
INFO:root:Epoch[0] Batch [100-150] Speed: 987.58 samples/sec accuracy=0.421250
INFO:root:Epoch[0] Batch [150-200] Speed: 1407.58 samples/sec accuracy=0.440312
INFO:root:Epoch[0] Batch [200-250] Speed: 1310.79 samples/sec accuracy=0.468438
INFO:root:Epoch[0] Batch [250-300] Speed: 1331.61 samples/sec accuracy=0.500313
INFO:root:Epoch[0] Batch [300-350] Speed: 1420.91 samples/sec accuracy=0.522500
INFO:root:Epoch[0] Batch [350-400] Speed: 1469.40 samples/sec accuracy=0.527813
INFO:root:Epoch[0] Batch [400-450] Speed: 1195.95 samples/sec accuracy=0.550312
INFO:root:Epoch[0] Batch [450-500] Speed: 1146.35 samples/sec accuracy=0.573125
INFO:root:Epoch[0] Batch [500-550] Speed: 1543.27 samples/sec accuracy=0.568125
INFO:root:Epoch[0] Batch [550-600] Speed: 1251.45 samples/sec accuracy=0.574688
INFO:root:Epoch[0] Batch [600-650] Speed: 1303.13 samples/sec accuracy=0.602187
INFO:root:Epoch[0] Batch [650-700] Speed: 1283.89 samples/sec accuracy=0.618750
INFO:root:Epoch[0] Batch [700-750] Speed: 955.70 samples/sec accuracy=0.607187
INFO:root:Epoch[0] Train-accuracy=0.514007
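For reference, the throughput figures quoted above are averages of the per-interval Speed values. A minimal sketch to compute that average from a captured log (train.log is a hypothetical file holding output like the above):
# Average the per-interval throughput from a captured training log.
grep -o 'Speed: [0-9.]*' train.log | awk '{sum += $2; n++} END {if (n) printf "%.2f samples/sec\n", sum / n}'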
@TaoLv @pengzhao-intel @zixuanweeei @samskalicky
@mxnet-label-bot add [R1.6.0]
@leleamol How did you install the mxnet package, from source code or the nightly build? If built from source, could you please share the make line as well? #16555 removed the libiomp5 library from the MXNet default build to comply with Apache License requirements. That could be the reason for this issue, but I still need to reproduce it to confirm. If possible, could you please try to build mxnet with USE_BLAS=mkl? It will pull in the libiomp5 library. To install MKL BLAS, please refer to https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_mkl.sh. Thanks!
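A build sketch for this suggestion, assuming the install script above places MKL under /opt/intel (the exact make invocation is not spelled out in this thread):
# Install MKL BLAS, then rebuild from the source tree with MKL as the BLAS backend.
sudo bash ci/docker/install/ubuntu_mkl.sh
make -j$(nproc) USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel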
Our test results: https://github.com/apache/incubator-mxnet/issues/16845#issuecomment-557757080
@TaoLv I have built the mxnet package from source.
I followed the instructions mentioned in the README.md.
I just put them in script form for quicker execution, as below.
To build the mkl variant, invoke the following script with "mkl" as the command-line parameter.
#!/usr/bin/env bash
CURRENT_DIR=`pwd`
echo $CURRENT_DIR
PIP_BUILD=$HOME/pip_build
MXNET_BUILD=$PIP_BUILD/mxnet-build
cd $HOME
mkdir -p $PIP_BUILD
# Move the cloned source tree into the pip build area.
mv $HOME/incubator-mxnet $MXNET_BUILD
cd $MXNET_BUILD
echo "Building mxnet."
# $1 selects the build variant, e.g. "mkl".
source tools/staticbuild/build.sh $1 pip
cd $PIP_BUILD
cp -r $MXNET_BUILD/tools/pip/. .
export mxnet_variant=$1
python setup.py bdist_wheel
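A typical invocation would then look like the following (the script name is illustrative; setup.py bdist_wheel drops the wheel under dist/):
# Build the mkl variant and install the resulting wheel.
bash build_mxnet.sh mkl            # hypothetical name for the script above
pip install $HOME/pip_build/dist/mxnet*.whl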
@zachgk assign [@apeforest ]
We ran CPU tests on both v1.5.x and v1.6.x with MKLDNN + OpenBLAS, but found no regression.
So could you try USE_BLAS=mkl, as @TaoLv suggested above, and test again?
I tried to use build.sh but it failed with: CMake Error at simd/CMakeLists.txt:41 (enable_language):
No CMAKE_ASM_NASM_COMPILER could be found.
So for v1.5 and v1.6 I built with:
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0
and set the OpenBLAS include and lib directories.
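For completeness, one way to point the make build at a custom OpenBLAS location is via the ADD_CFLAGS/ADD_LDFLAGS make variables (the paths here are illustrative):
# Build against OpenBLAS headers/libs in non-default locations.
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0 \
     ADD_CFLAGS="-I/usr/include/openblas" ADD_LDFLAGS="-L/usr/lib64"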
platform: skx-8180
1.5:
[rongzha1@mlt-ace ds2_training_inference]$ cd mxnet_1.5/
[rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep open
libopenblas.so.0 => /lib64/libopenblas.so.0 (0x00007f8db5ff9000)
libopencv_highgui.so.2.4 => /lib64/libopencv_highgui.so.2.4 (0x00007f8dacdaf000)
libopencv_imgproc.so.2.4 => /lib64/libopencv_imgproc.so.2.4 (0x00007f8dac931000)
libopencv_core.so.2.4 => /lib64/libopencv_core.so.2.4 (0x00007f8dac4f7000)
[rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep mkl
libmklml_intel.so => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libmklml_intel.so (0x00007f9707c8d000)
libmkldnn.so.0 => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libmkldnn.so.0 (0x00007f970671d000)
(mxnet) [rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep omp
libiomp5.so => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libiomp5.so (0x00007f75cbc42000)
libXcomposite.so.1 => /lib64/libXcomposite.so.1 (0x00007f75c2647000)
1.6.x:
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep open
libopenblas.so.0 => /usr/lib64/libopenblas.so.0 (0x00007fc101c03000)
libopencv_highgui.so.2.4 => /usr/lib64/libopencv_highgui.so.2.4 (0x00007fc1004cf000)
libopencv_imgproc.so.2.4 => /usr/lib64/libopencv_imgproc.so.2.4 (0x00007fc100051000)
libopencv_core.so.2.4 => /usr/lib64/libopencv_core.so.2.4 (0x00007fc0ffc18000)
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep mkl
libmkldnn.so.1 => /home/rongzha1/project/mxnet/ds2_training_inference/perf_regression/lib/libmkldnn.so.1 (0x00007f8378240000)
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep omp
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007f1357b17000)
libXcomposite.so.1 => /usr/lib64/libXcomposite.so.1 (0x00007f13509a1000)
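The ldd output above shows the suspected cause at a glance: the 1.5 build links libiomp5.so while the 1.6.x build links libgomp.so. A one-liner to check any build the same way:
# Report which OpenMP runtime (and BLAS) libmxnet.so is linked against.
ldd lib/libmxnet.so | grep -E 'iomp5|gomp|libomp|openblas|mkl'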
v1.5.x:
OMP=56
[21:43:26] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[21:43:26] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
INFO:root:Epoch[0] Batch [0-50] Speed: 1668.60 samples/sec accuracy=0.273897
INFO:root:Epoch[0] Batch [50-100] Speed: 1699.64 samples/sec accuracy=0.380312
INFO:root:Epoch[0] Batch [100-150] Speed: 1692.57 samples/sec accuracy=0.425000
INFO:root:Epoch[0] Batch [150-200] Speed: 1696.67 samples/sec accuracy=0.444063
INFO:root:Epoch[0] Batch [200-250] Speed: 1698.27 samples/sec accuracy=0.465000
INFO:root:Epoch[0] Batch [250-300] Speed: 1693.87 samples/sec accuracy=0.497812
INFO:root:Epoch[0] Batch [300-350] Speed: 1698.26 samples/sec accuracy=0.505625
INFO:root:Epoch[0] Batch [350-400] Speed: 1691.21 samples/sec accuracy=0.520000
INFO:root:Epoch[0] Batch [400-450] Speed: 1694.42 samples/sec accuracy=0.538750
INFO:root:Epoch[0] Batch [450-500] Speed: 1693.73 samples/sec accuracy=0.576875
INFO:root:Epoch[0] Batch [500-550] Speed: 1688.67 samples/sec accuracy=0.579063
INFO:root:Epoch[0] Batch [550-600] Speed: 1686.91 samples/sec accuracy=0.585313
INFO:root:Epoch[0] Batch [600-650] Speed: 1691.39 samples/sec accuracy=0.605313
INFO:root:Epoch[0] Batch [650-700] Speed: 1693.22 samples/sec accuracy=0.612812
INFO:root:Epoch[0] Batch [700-750] Speed: 1692.32 samples/sec accuracy=0.603750
INFO:root:Epoch[0] Train-accuracy=0.511549
INFO:root:Epoch[0] Time cost=29.955
INFO:root:Epoch[0] Validation-accuracy=0.642317
OMP=36
[22:10:31] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[22:10:31] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
INFO:root:Epoch[0] Batch [0-50] Speed: 1969.98 samples/sec accuracy=0.279412
INFO:root:Epoch[0] Batch [50-100] Speed: 2014.50 samples/sec accuracy=0.380937
INFO:root:Epoch[0] Batch [100-150] Speed: 2009.43 samples/sec accuracy=0.428125
INFO:root:Epoch[0] Batch [150-200] Speed: 2013.70 samples/sec accuracy=0.450313
INFO:root:Epoch[0] Batch [200-250] Speed: 2012.61 samples/sec accuracy=0.460625
INFO:root:Epoch[0] Batch [250-300] Speed: 2014.29 samples/sec accuracy=0.497812
INFO:root:Epoch[0] Batch [300-350] Speed: 2013.60 samples/sec accuracy=0.505000
INFO:root:Epoch[0] Batch [350-400] Speed: 2009.98 samples/sec accuracy=0.532500
INFO:root:Epoch[0] Batch [400-450] Speed: 2014.39 samples/sec accuracy=0.557500
INFO:root:Epoch[0] Batch [450-500] Speed: 2015.02 samples/sec accuracy=0.576250
INFO:root:Epoch[0] Batch [500-550] Speed: 2015.25 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [550-600] Speed: 2012.03 samples/sec accuracy=0.581250
INFO:root:Epoch[0] Batch [600-650] Speed: 2014.64 samples/sec accuracy=0.608437
INFO:root:Epoch[0] Batch [650-700] Speed: 2017.28 samples/sec accuracy=0.616563
INFO:root:Epoch[0] Batch [700-750] Speed: 2017.49 samples/sec accuracy=0.604688
INFO:root:Epoch[0] Train-accuracy=0.514086
INFO:root:Epoch[0] Time cost=24.895
INFO:root:Epoch[0] Validation-accuracy=0.635052
v1.6.x:
OMP=36
[22:02:24] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[22:02:25] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[22:02:25] src/executor/graph_executor.cc:1979: Subgraph backend MKLDNN is activated.
/home/rongzha1/anaconda3/envs/mxnet/lib/python3.6/site-packages/scipy/__init__.py:115: UserWarning: Numpy 1.13.3 or above is required for this version of scipy (detected version 1.13.1)
UserWarning)
INFO:root:Epoch[0] Batch [0-50] Speed: 2119.74 samples/sec accuracy=0.280025
INFO:root:Epoch[0] Batch [50-100] Speed: 2161.65 samples/sec accuracy=0.392500
INFO:root:Epoch[0] Batch [100-150] Speed: 2145.79 samples/sec accuracy=0.425938
INFO:root:Epoch[0] Batch [150-200] Speed: 2145.72 samples/sec accuracy=0.448125
INFO:root:Epoch[0] Batch [200-250] Speed: 2158.03 samples/sec accuracy=0.461250
INFO:root:Epoch[0] Batch [250-300] Speed: 2151.47 samples/sec accuracy=0.498125
INFO:root:Epoch[0] Batch [300-350] Speed: 2157.60 samples/sec accuracy=0.515312
INFO:root:Epoch[0] Batch [350-400] Speed: 2133.91 samples/sec accuracy=0.530625
INFO:root:Epoch[0] Batch [400-450] Speed: 2143.35 samples/sec accuracy=0.545625
INFO:root:Epoch[0] Batch [450-500] Speed: 2153.24 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [500-550] Speed: 2154.20 samples/sec accuracy=0.577500
INFO:root:Epoch[0] Batch [550-600] Speed: 2151.89 samples/sec accuracy=0.580625
INFO:root:Epoch[0] Batch [600-650] Speed: 2162.29 samples/sec accuracy=0.596250
INFO:root:Epoch[0] Batch [650-700] Speed: 2161.74 samples/sec accuracy=0.609062
INFO:root:Epoch[0] Batch [700-750] Speed: 2156.80 samples/sec accuracy=0.597812
INFO:root:Epoch[0] Train-accuracy=0.512828
INFO:root:Epoch[0] Time cost=23.642
INFO:root:Epoch[0] Validation-accuracy=0.613455
Considering @rongzha1's comment, I don't consider this issue to be a blocker for the 1.6 release. Please comment if you disagree, @leleamol @samskalicky.
@ptrendx @rongzha1 @PatricZhao thanks for looking into this, but the issue is not resolved until we verify by running the script @leleamol shared. build.sh is the script used to generate the pip wheels; using make doesn't follow the same steps and won't reproduce the problem.
If you can't reproduce the build using the same scripts, I can share a pre-built pip wheel with you separately.
Regarding the following error:
No CMAKE_ASM_NASM_COMPILER could be found.
you can fix it by installing nasm: sudo apt-get install nasm
Hi @samskalicky, I used the AWS Deep Learning AMI on a c5.18xlarge instance with Ubuntu 14.04, same as yours.
Using @leleamol's shared script to build mxnet:
mxnet1.5:
git checkout v1.5.x (commit c9818480680f84daa6e281a974ab263691302ba8)
During training, an error occurred:
mxnet.base.MXNetError: [08:18:23] src/operator/nn/mkldnn/mkldnn_base.cc:372: Unknown MKLDNN format for 4 dimensions: 53
So which version did you use? What's the commit id?
mxnet1.6:
git checkout v1.6.x (commit 200f0ec8ff55c7264554786822d8467dd9b15174)
With both the script build and the make build, training speed is about 1700 samples/sec.
I cannot reproduce the performance regression.
Details:
Using @leleamol's shared script to build mxnet (with 2 minor issues along the way).
The results are as follows:
[08:45:29] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[08:45:29] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[08:45:29] src/executor/graph_executor.cc:1984: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1444.97 samples/sec accuracy=0.267770
INFO:root:Epoch[0] Batch [50-100] Speed: 1657.16 samples/sec accuracy=0.381563
INFO:root:Epoch[0] Batch [100-150] Speed: 1629.53 samples/sec accuracy=0.423438
INFO:root:Epoch[0] Batch [150-200] Speed: 1686.67 samples/sec accuracy=0.441875
INFO:root:Epoch[0] Batch [200-250] Speed: 1671.42 samples/sec accuracy=0.462187
INFO:root:Epoch[0] Batch [250-300] Speed: 1723.94 samples/sec accuracy=0.510000
INFO:root:Epoch[0] Batch [300-350] Speed: 1699.66 samples/sec accuracy=0.507500
INFO:root:Epoch[0] Batch [350-400] Speed: 1665.39 samples/sec accuracy=0.523125
INFO:root:Epoch[0] Batch [400-450] Speed: 1724.03 samples/sec accuracy=0.531250
INFO:root:Epoch[0] Batch [450-500] Speed: 1723.66 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [500-550] Speed: 1724.53 samples/sec accuracy=0.574375
INFO:root:Epoch[0] Batch [550-600] Speed: 1721.45 samples/sec accuracy=0.581250
INFO:root:Epoch[0] Batch [600-650] Speed: 1658.77 samples/sec accuracy=0.607500
INFO:root:Epoch[0] Batch [650-700] Speed: 1725.24 samples/sec accuracy=0.606250
INFO:root:Epoch[0] Batch [700-750] Speed: 1726.21 samples/sec accuracy=0.606563
I also built with:
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0
cd python/ && python setup.py install
The results are as follows:
Archive: cifar10.zip
creating: cifar/
inflating: cifar/test.rec
inflating: cifar/test.lst
inflating: cifar/train.lst
inflating: cifar/train.rec
[07:38:12] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[07:38:12] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[07:38:12] src/executor/graph_executor.cc:1984: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1416.12 samples/sec accuracy=0.278799
INFO:root:Epoch[0] Batch [50-100] Speed: 1673.98 samples/sec accuracy=0.385313
INFO:root:Epoch[0] Batch [100-150] Speed: 1624.87 samples/sec accuracy=0.424687
INFO:root:Epoch[0] Batch [150-200] Speed: 1668.53 samples/sec accuracy=0.438750
INFO:root:Epoch[0] Batch [200-250] Speed: 1664.30 samples/sec accuracy=0.478438
INFO:root:Epoch[0] Batch [250-300] Speed: 1696.48 samples/sec accuracy=0.511250
INFO:root:Epoch[0] Batch [300-350] Speed: 1701.83 samples/sec accuracy=0.517188
INFO:root:Epoch[0] Batch [350-400] Speed: 1616.46 samples/sec accuracy=0.545000
INFO:root:Epoch[0] Batch [400-450] Speed: 1697.75 samples/sec accuracy=0.556875
INFO:root:Epoch[0] Batch [450-500] Speed: 1703.83 samples/sec accuracy=0.575625
INFO:root:Epoch[0] Batch [500-550] Speed: 1703.13 samples/sec accuracy=0.572812
INFO:root:Epoch[0] Batch [550-600] Speed: 1699.32 samples/sec accuracy=0.587187
INFO:root:Epoch[0] Batch [600-650] Speed: 1682.87 samples/sec accuracy=0.604688
INFO:root:Epoch[0] Batch [650-700] Speed: 1671.12 samples/sec accuracy=0.612187
INFO:root:Epoch[0] Batch [700-750] Speed: 1705.85 samples/sec accuracy=0.611875
INFO:root:Epoch[0] Train-accuracy=0.516964
INFO:root:Epoch[0] Time cost=30.561
INFO:root:Epoch[0] Validation-accuracy=0.628085
(Screenshots attached; images omitted here.)
Hi @TaoLv, is there an ETA to have this issue fixed? It's causing quite some concern around here.
Thanks,
Omar
Added a script for easy repro:
To run:
piotr@34-215-197-42:130:~$ for i in 1 2 4 8 16 32 64 128 256 512 1024 2048; do ./imagenet.sh $i 2>&1 | tee run_$i.log; done
piotr@34-215-197-42:1:~$ ./table.py
@oorqueda @samskalicky @leleamol As mentioned in https://github.com/apache/incubator-mxnet/issues/16891#issuecomment-557760466, I suspect that the regression is caused by the removal of libiomp5.so. To verify, please try applying the patch below to make/pip/pip_linux_mkl.mk:
diff --git a/make/pip/pip_linux_mkl.mk b/make/pip/pip_linux_mkl.mk
index 1cf389ae4..dd23434fa 100644
--- a/make/pip/pip_linux_mkl.mk
+++ b/make/pip/pip_linux_mkl.mk
@@ -49,7 +49,7 @@ ADD_CFLAGS += -I$(DEPS_PATH)/include -ffunction-sections -fdata-sections
# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
-USE_BLAS=openblas
+USE_BLAS=mkl
# whether use opencv during compilation
# you can disable it, however, you will not able to use
@@ -98,7 +98,7 @@ USE_LAPACK_PATH = $(DEPS_PATH)/lib
# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
-USE_INTEL_PATH = NONE
+USE_INTEL_PATH = /opt/intel/
And then build MXNet with:
tools/staticbuild/build.sh mkl pip
If it's true, I don't think we have any way to avoid the regression in pip packages, as removing libiomp5.so is a requirement from Apache. Please refer to https://github.com/apache/incubator-mxnet/issues/15544. Thanks!
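After rebuilding with the patch, one can sanity-check which BLAS backend was compiled in; a sketch assuming MXNet >= 1.5, where the runtime feature API is available:
# Confirm the BLAS backend of the installed build via MXNet's feature flags.
python -c "import mxnet; f = mxnet.runtime.Features(); print('BLAS_MKL:', f.is_enabled('BLAS_MKL'), 'BLAS_OPEN:', f.is_enabled('BLAS_OPEN'))"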
@leleamol could you help to confirm the current test status based on our feedback?
I don't want it to block the 1.6 release.
cc @samskalicky @apeforest
> As mentioned in #16891 (comment), I suspect that the regression is caused by the removal of libiomp5.so. [...] If it's true, I don't think we have any way to avoid the regression in pip packages as removing libiomp5.so is a requirement from Apache. Please refer to #15544. Thanks!
Retried with this patch after installing MKL BLAS with https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_mkl.sh and got these results:
Average Throughput: 1663.49 samples/sec
INFO:root:Epoch[0] Batch [0-50] Speed: 1414.31 samples/sec accuracy=0.281863
INFO:root:Epoch[0] Batch [50-100] Speed: 1610.74 samples/sec accuracy=0.382500
INFO:root:Epoch[0] Batch [100-150] Speed: 1625.33 samples/sec accuracy=0.430000
INFO:root:Epoch[0] Batch [150-200] Speed: 1649.23 samples/sec accuracy=0.432500
INFO:root:Epoch[0] Batch [200-250] Speed: 1663.87 samples/sec accuracy=0.465000
INFO:root:Epoch[0] Batch [250-300] Speed: 1640.63 samples/sec accuracy=0.495625
INFO:root:Epoch[0] Batch [300-350] Speed: 1671.83 samples/sec accuracy=0.502500
INFO:root:Epoch[0] Batch [350-400] Speed: 1669.90 samples/sec accuracy=0.516563
INFO:root:Epoch[0] Batch [400-450] Speed: 1600.49 samples/sec accuracy=0.548125
INFO:root:Epoch[0] Batch [450-500] Speed: 1669.11 samples/sec accuracy=0.562500
INFO:root:Epoch[0] Batch [500-550] Speed: 1671.51 samples/sec accuracy=0.558750
INFO:root:Epoch[0] Batch [550-600] Speed: 1667.67 samples/sec accuracy=0.586875
INFO:root:Epoch[0] Batch [600-650] Speed: 1670.19 samples/sec accuracy=0.591562
INFO:root:Epoch[0] Batch [650-700] Speed: 1652.81 samples/sec accuracy=0.611250
INFO:root:Epoch[0] Batch [700-750] Speed: 1630.58 samples/sec accuracy=0.600000
INFO:root:Epoch[0] Train-accuracy=0.508252
INFO:root:Epoch[0] Time cost=30.680
INFO:root:Epoch[0] Validation-accuracy=0.632166
INFO:root:Epoch[1] Batch [0-50] Speed: 1648.76 samples/sec accuracy=0.625613
INFO:root:Epoch[1] Batch [50-100] Speed: 1660.23 samples/sec accuracy=0.629375
INFO:root:Epoch[1] Batch [100-150] Speed: 1616.19 samples/sec accuracy=0.640312
INFO:root:Epoch[1] Batch [150-200] Speed: 1670.47 samples/sec accuracy=0.643125
INFO:root:Epoch[1] Batch [200-250] Speed: 1670.92 samples/sec accuracy=0.657500
INFO:root:Epoch[1] Batch [250-300] Speed: 1671.10 samples/sec accuracy=0.655625
INFO:root:Epoch[1] Batch [300-350] Speed: 1669.03 samples/sec accuracy=0.651250
INFO:root:Epoch[1] Batch [350-400] Speed: 1669.22 samples/sec accuracy=0.655312
INFO:root:Epoch[1] Batch [400-450] Speed: 1671.08 samples/sec accuracy=0.672813
INFO:root:Epoch[1] Batch [450-500] Speed: 1671.26 samples/sec accuracy=0.673750
INFO:root:Epoch[1] Batch [500-550] Speed: 1650.34 samples/sec accuracy=0.682500
INFO:root:Epoch[1] Batch [550-600] Speed: 1663.81 samples/sec accuracy=0.681250
INFO:root:Epoch[1] Batch [600-650] Speed: 1671.43 samples/sec accuracy=0.695625
INFO:root:Epoch[1] Batch [650-700] Speed: 1622.47 samples/sec accuracy=0.698438
INFO:root:Epoch[1] Batch [700-750] Speed: 1671.23 samples/sec accuracy=0.687187
INFO:root:Epoch[1] Train-accuracy=0.664633
INFO:root:Epoch[1] Time cost=30.096
INFO:root:Epoch[1] Validation-accuracy=0.673878
INFO:root:Epoch[2] Batch [0-50] Speed: 1668.44 samples/sec accuracy=0.701900
INFO:root:Epoch[2] Batch [50-100] Speed: 1673.86 samples/sec accuracy=0.698750
INFO:root:Epoch[2] Batch [100-150] Speed: 1669.55 samples/sec accuracy=0.712500
INFO:root:Epoch[2] Batch [150-200] Speed: 1673.31 samples/sec accuracy=0.713750
INFO:root:Epoch[2] Batch [200-250] Speed: 1673.31 samples/sec accuracy=0.726562
INFO:root:Epoch[2] Batch [250-300] Speed: 1672.89 samples/sec accuracy=0.717187
INFO:root:Epoch[2] Batch [300-350] Speed: 1651.81 samples/sec accuracy=0.725938
INFO:root:Epoch[2] Batch [350-400] Speed: 1623.66 samples/sec accuracy=0.718750
INFO:root:Epoch[2] Batch [400-450] Speed: 1672.81 samples/sec accuracy=0.729688
INFO:root:Epoch[2] Batch [450-500] Speed: 1672.86 samples/sec accuracy=0.736563
INFO:root:Epoch[2] Batch [500-550] Speed: 1669.99 samples/sec accuracy=0.730625
INFO:root:Epoch[2] Batch [550-600] Speed: 1670.90 samples/sec accuracy=0.728750
INFO:root:Epoch[2] Batch [600-650] Speed: 1673.84 samples/sec accuracy=0.739375
INFO:root:Epoch[2] Batch [650-700] Speed: 1675.46 samples/sec accuracy=0.750313
INFO:root:Epoch[2] Batch [700-750] Speed: 1675.23 samples/sec accuracy=0.739062
INFO:root:Epoch[2] Train-accuracy=0.725112
INFO:root:Epoch[2] Time cost=29.959
INFO:root:Epoch[2] Validation-accuracy=0.699419
INFO:root:Epoch[3] Batch [0-50] Speed: 1620.48 samples/sec accuracy=0.747243
INFO:root:Epoch[3] Batch [50-100] Speed: 1665.64 samples/sec accuracy=0.747188
INFO:root:Epoch[3] Batch [100-150] Speed: 1669.65 samples/sec accuracy=0.744375
INFO:root:Epoch[3] Batch [150-200] Speed: 1672.57 samples/sec accuracy=0.756563
INFO:root:Epoch[3] Batch [200-250] Speed: 1673.09 samples/sec accuracy=0.755625
INFO:root:Epoch[3] Batch [250-300] Speed: 1672.16 samples/sec accuracy=0.757500
INFO:root:Epoch[3] Batch [300-350] Speed: 1671.06 samples/sec accuracy=0.757812
INFO:root:Epoch[3] Batch [350-400] Speed: 1670.54 samples/sec accuracy=0.754687
INFO:root:Epoch[3] Batch [400-450] Speed: 1673.20 samples/sec accuracy=0.774375
INFO:root:Epoch[3] Batch [450-500] Speed: 1656.83 samples/sec accuracy=0.768750
INFO:root:Epoch[3] Batch [500-550] Speed: 1672.77 samples/sec accuracy=0.772813
INFO:root:Epoch[3] Batch [550-600] Speed: 1662.18 samples/sec accuracy=0.770312
INFO:root:Epoch[3] Batch [600-650] Speed: 1672.07 samples/sec accuracy=0.770000
INFO:root:Epoch[3] Batch [650-700] Speed: 1642.67 samples/sec accuracy=0.780000
INFO:root:Epoch[3] Batch [700-750] Speed: 1670.11 samples/sec accuracy=0.776875
INFO:root:Epoch[3] Train-accuracy=0.762764
INFO:root:Epoch[3] Time cost=30.022
INFO:root:Epoch[3] Validation-accuracy=0.731771
INFO:root:Epoch[4] Batch [0-50] Speed: 1667.95 samples/sec accuracy=0.778493
INFO:root:Epoch[4] Batch [50-100] Speed: 1672.75 samples/sec accuracy=0.790312
INFO:root:Epoch[4] Batch [100-150] Speed: 1669.29 samples/sec accuracy=0.776875
INFO:root:Epoch[4] Batch [150-200] Speed: 1673.50 samples/sec accuracy=0.792500
INFO:root:Epoch[4] Batch [200-250] Speed: 1672.97 samples/sec accuracy=0.783438
INFO:root:Epoch[4] Batch [250-300] Speed: 1672.72 samples/sec accuracy=0.796250
INFO:root:Epoch[4] Batch [300-350] Speed: 1658.90 samples/sec accuracy=0.784687
INFO:root:Epoch[4] Batch [350-400] Speed: 1669.21 samples/sec accuracy=0.790937
INFO:root:Epoch[4] Batch [400-450] Speed: 1664.05 samples/sec accuracy=0.800312
INFO:root:Epoch[4] Batch [450-500] Speed: 1637.17 samples/sec accuracy=0.789375
INFO:root:Epoch[4] Batch [500-550] Speed: 1665.37 samples/sec accuracy=0.799687
INFO:root:Epoch[4] Batch [550-600] Speed: 1668.98 samples/sec accuracy=0.806562
INFO:root:Epoch[4] Batch [600-650] Speed: 1672.85 samples/sec accuracy=0.809375
INFO:root:Epoch[4] Batch [650-700] Speed: 1674.14 samples/sec accuracy=0.816562
INFO:root:Epoch[4] Batch [700-750] Speed: 1674.87 samples/sec accuracy=0.800000
INFO:root:Epoch[4] Train-accuracy=0.794457
INFO:root:Epoch[4] Time cost=29.996
INFO:root:Epoch[4] Validation-accuracy=0.741740
INFO:root:Epoch[5] Batch [0-50] Speed: 1668.07 samples/sec accuracy=0.809436
INFO:root:Epoch[5] Batch [50-100] Speed: 1673.35 samples/sec accuracy=0.810312
INFO:root:Epoch[5] Batch [100-150] Speed: 1651.66 samples/sec accuracy=0.807500
INFO:root:Epoch[5] Batch [150-200] Speed: 1667.67 samples/sec accuracy=0.809063
INFO:root:Epoch[5] Batch [200-250] Speed: 1668.76 samples/sec accuracy=0.808750
INFO:root:Epoch[5] Batch [250-300] Speed: 1672.72 samples/sec accuracy=0.810937
INFO:root:Epoch[5] Batch [300-350] Speed: 1671.69 samples/sec accuracy=0.816562
INFO:root:Epoch[5] Batch [350-400] Speed: 1672.54 samples/sec accuracy=0.818750
INFO:root:Epoch[5] Batch [400-450] Speed: 1631.24 samples/sec accuracy=0.822187
INFO:root:Epoch[5] Batch [450-500] Speed: 1665.93 samples/sec accuracy=0.815937
INFO:root:Epoch[5] Batch [500-550] Speed: 1674.52 samples/sec accuracy=0.819063
INFO:root:Epoch[5] Batch [550-600] Speed: 1670.75 samples/sec accuracy=0.812500
INFO:root:Epoch[5] Batch [600-650] Speed: 1673.81 samples/sec accuracy=0.825937
INFO:root:Epoch[5] Batch [650-700] Speed: 1676.04 samples/sec accuracy=0.827187
INFO:root:Epoch[5] Batch [700-750] Speed: 1675.77 samples/sec accuracy=0.817813
INFO:root:Epoch[5] Train-accuracy=0.815501
INFO:root:Epoch[5] Time cost=29.948
INFO:root:Epoch[5] Validation-accuracy=0.749399
INFO:root:Epoch[6] Batch [0-50] Speed: 1669.17 samples/sec accuracy=0.837623
INFO:root:Epoch[6] Batch [50-100] Speed: 1661.24 samples/sec accuracy=0.813750
INFO:root:Epoch[6] Batch [100-150] Speed: 1667.14 samples/sec accuracy=0.830313
INFO:root:Epoch[6] Batch [150-200] Speed: 1667.80 samples/sec accuracy=0.826250
INFO:root:Epoch[6] Batch [200-250] Speed: 1673.15 samples/sec accuracy=0.826562
INFO:root:Epoch[6] Batch [250-300] Speed: 1646.27 samples/sec accuracy=0.836875
INFO:root:Epoch[6] Batch [300-350] Speed: 1666.01 samples/sec accuracy=0.829375
INFO:root:Epoch[6] Batch [350-400] Speed: 1672.95 samples/sec accuracy=0.834688
INFO:root:Epoch[6] Batch [400-450] Speed: 1673.64 samples/sec accuracy=0.835625
INFO:root:Epoch[6] Batch [450-500] Speed: 1675.71 samples/sec accuracy=0.843437
INFO:root:Epoch[6] Batch [500-550] Speed: 1674.81 samples/sec accuracy=0.849688
INFO:root:Epoch[6] Batch [550-600] Speed: 1670.66 samples/sec accuracy=0.848750
INFO:root:Epoch[6] Batch [600-650] Speed: 1674.67 samples/sec accuracy=0.850000
INFO:root:Epoch[6] Batch [650-700] Speed: 1676.15 samples/sec accuracy=0.852187
INFO:root:Epoch[6] Batch [700-750] Speed: 1662.28 samples/sec accuracy=0.840625
INFO:root:Epoch[6] Train-accuracy=0.837408
INFO:root:Epoch[6] Time cost=29.926
INFO:root:Epoch[6] Validation-accuracy=0.755609
INFO:root:Epoch[7] Batch [0-50] Speed: 1669.53 samples/sec accuracy=0.851409
INFO:root:Epoch[7] Batch [50-100] Speed: 1673.99 samples/sec accuracy=0.851875
INFO:root:Epoch[7] Batch [100-150] Speed: 1664.78 samples/sec accuracy=0.845000
INFO:root:Epoch[7] Batch [150-200] Speed: 1643.95 samples/sec accuracy=0.848125
INFO:root:Epoch[7] Batch [200-250] Speed: 1673.32 samples/sec accuracy=0.846250
INFO:root:Epoch[7] Batch [250-300] Speed: 1674.50 samples/sec accuracy=0.854062
INFO:root:Epoch[7] Batch [300-350] Speed: 1667.81 samples/sec accuracy=0.868750
INFO:root:Epoch[7] Batch [350-400] Speed: 1672.58 samples/sec accuracy=0.856875
INFO:root:Epoch[7] Batch [400-450] Speed: 1674.09 samples/sec accuracy=0.856563
INFO:root:Epoch[7] Batch [450-500] Speed: 1674.60 samples/sec accuracy=0.855000
INFO:root:Epoch[7] Batch [500-550] Speed: 1674.48 samples/sec accuracy=0.868125
INFO:root:Epoch[7] Batch [550-600] Speed: 1670.71 samples/sec accuracy=0.854688
INFO:root:Epoch[7] Batch [600-650] Speed: 1674.68 samples/sec accuracy=0.859375
INFO:root:Epoch[7] Batch [650-700] Speed: 1675.54 samples/sec accuracy=0.867812
INFO:root:Epoch[7] Batch [700-750] Speed: 1636.57 samples/sec accuracy=0.861250
INFO:root:Epoch[7] Train-accuracy=0.856634
INFO:root:Epoch[7] Time cost=29.935
INFO:root:Epoch[7] Validation-accuracy=0.751202
INFO:root:Epoch[8] Batch [0-50] Speed: 1666.25 samples/sec accuracy=0.862745
INFO:root:Epoch[8] Batch [50-100] Speed: 1667.20 samples/sec accuracy=0.871563
INFO:root:Epoch[8] Batch [100-150] Speed: 1638.39 samples/sec accuracy=0.859688
INFO:root:Epoch[8] Batch [150-200] Speed: 1668.52 samples/sec accuracy=0.874687
INFO:root:Epoch[8] Batch [200-250] Speed: 1664.86 samples/sec accuracy=0.866875
INFO:root:Epoch[8] Batch [250-300] Speed: 1670.59 samples/sec accuracy=0.866250
INFO:root:Epoch[8] Batch [300-350] Speed: 1672.36 samples/sec accuracy=0.872500
INFO:root:Epoch[8] Batch [350-400] Speed: 1667.79 samples/sec accuracy=0.876250
INFO:root:Epoch[8] Batch [400-450] Speed: 1672.58 samples/sec accuracy=0.875938
INFO:root:Epoch[8] Batch [450-500] Speed: 1672.51 samples/sec accuracy=0.871250
INFO:root:Epoch[8] Batch [500-550] Speed: 1671.49 samples/sec accuracy=0.878750
INFO:root:Epoch[8] Batch [550-600] Speed: 1668.27 samples/sec accuracy=0.884062
INFO:root:Epoch[8] Batch [600-650] Speed: 1656.65 samples/sec accuracy=0.882812
INFO:root:Epoch[8] Batch [650-700] Speed: 1671.64 samples/sec accuracy=0.884062
INFO:root:Epoch[8] Batch [700-750] Speed: 1673.34 samples/sec accuracy=0.874687
INFO:root:Epoch[8] Train-accuracy=0.873581
INFO:root:Epoch[8] Time cost=30.010
INFO:root:Epoch[8] Validation-accuracy=0.766421
INFO:root:Epoch[9] Batch [0-50] Speed: 1669.04 samples/sec accuracy=0.879289
INFO:root:Epoch[9] Batch [50-100] Speed: 1671.88 samples/sec accuracy=0.887188
INFO:root:Epoch[9] Batch [100-150] Speed: 1662.53 samples/sec accuracy=0.867500
INFO:root:Epoch[9] Batch [150-200] Speed: 1672.37 samples/sec accuracy=0.881875
INFO:root:Epoch[9] Batch [200-250] Speed: 1672.11 samples/sec accuracy=0.886563
INFO:root:Epoch[9] Batch [250-300] Speed: 1635.77 samples/sec accuracy=0.870938
INFO:root:Epoch[9] Batch [300-350] Speed: 1670.30 samples/sec accuracy=0.884062
INFO:root:Epoch[9] Batch [350-400] Speed: 1671.09 samples/sec accuracy=0.879375
INFO:root:Epoch[9] Batch [400-450] Speed: 1667.68 samples/sec accuracy=0.883125
INFO:root:Epoch[9] Batch [450-500] Speed: 1673.33 samples/sec accuracy=0.885000
INFO:root:Epoch[9] Batch [500-550] Speed: 1672.83 samples/sec accuracy=0.883750
INFO:root:Epoch[9] Batch [550-600] Speed: 1668.54 samples/sec accuracy=0.887500
INFO:root:Epoch[9] Batch [600-650] Speed: 1672.97 samples/sec accuracy=0.890312
INFO:root:Epoch[9] Batch [650-700] Speed: 1653.01 samples/sec accuracy=0.889062
INFO:root:Epoch[9] Batch [700-750] Speed: 1673.44 samples/sec accuracy=0.889062
INFO:root:Epoch[9] Train-accuracy=0.883263
INFO:root:Epoch[9] Time cost=29.960
INFO:root:Epoch[9] Validation-accuracy=0.762520
INFO:root:Epoch[10] Batch [0-50] Speed: 1666.71 samples/sec accuracy=0.887868
INFO:root:Epoch[10] Batch [50-100] Speed: 1672.06 samples/sec accuracy=0.882500
INFO:root:Epoch[10] Batch [100-150] Speed: 1668.15 samples/sec accuracy=0.881250
INFO:root:Epoch[10] Batch [150-200] Speed: 1667.18 samples/sec accuracy=0.899062
INFO:root:Epoch[10] Batch [200-250] Speed: 1670.72 samples/sec accuracy=0.881563
INFO:root:Epoch[10] Batch [250-300] Speed: 1671.63 samples/sec accuracy=0.890000
INFO:root:Epoch[10] Batch [300-350] Speed: 1669.62 samples/sec accuracy=0.905625
INFO:root:Epoch[10] Batch [350-400] Speed: 1664.69 samples/sec accuracy=0.904375
INFO:root:Epoch[10] Batch [400-450] Speed: 1671.13 samples/sec accuracy=0.901250
INFO:root:Epoch[10] Batch [450-500] Speed: 1666.08 samples/sec accuracy=0.896250
INFO:root:Epoch[10] Batch [500-550] Speed: 1670.59 samples/sec accuracy=0.905312
INFO:root:Epoch[10] Batch [550-600] Speed: 1667.69 samples/sec accuracy=0.894687
INFO:root:Epoch[10] Batch [600-650] Speed: 1671.95 samples/sec accuracy=0.895938
INFO:root:Epoch[10] Batch [650-700] Speed: 1672.98 samples/sec accuracy=0.909375
INFO:root:Epoch[10] Batch [700-750] Speed: 1624.72 samples/sec accuracy=0.909375
INFO:root:Epoch[10] Train-accuracy=0.896667
INFO:root:Epoch[10] Time cost=29.974
INFO:root:Epoch[10] Validation-accuracy=0.764123
@NihalHarish thanks for verifying
@TaoLv
The patch doesn't seem to be merged on the master branch. Is there a reason it wasn't done along with the PR that bumped MKLDNN to v1.0 (https://github.com/apache/incubator-mxnet/pull/16555)?
diff --git a/make/pip/pip_linux_mkl.mk b/make/pip/pip_linux_mkl.mk
index 1cf389ae4..dd23434fa 100644
--- a/make/pip/pip_linux_mkl.mk
+++ b/make/pip/pip_linux_mkl.mk
@@ -49,7 +49,7 @@ ADD_CFLAGS += -I$(DEPS_PATH)/include -ffunction-sections -fdata-sections
# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
-USE_BLAS=openblas
+USE_BLAS=mkl
# whether use opencv during compilation
# you can disable it, however, you will not able to use
@@ -98,7 +98,7 @@ USE_LAPACK_PATH = $(DEPS_PATH)/lib
# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
-USE_INTEL_PATH = NONE
+USE_INTEL_PATH = /opt/intel/
If it was omitted by mistake, and since it is required, I could push a PR for it.
Thanks.
@ChaiBapchya The file is used to build the mxnet-mkl pip package. If you want to change the configurations, I think you need to raise a proposal on dev@.
What is the status of this issue? From the conversation it seems to me that Intel people think it is not an issue (or at least it is unavoidable) and Amazon people are concerned about this. Is that accurate? If so, how does it affect the 1.6 release - should I go ahead and make the RC despite this issue or is there active work going on to fix it?
@TaoLv are you saying that we should keep the current config where we build the mkl flavor with openblas:
master:
https://github.com/apache/incubator-mxnet/blob/7895f93e67dc3e9da360f7a9c667e3c0f1e76c0f/make/staticbuild/linux_mkl.mk#L52
1.6.x branch:
https://github.com/apache/incubator-mxnet/blob/a576531836c5a5c4fb6dfbc944de94b619d6ccfa/make/pip/pip_linux_mkl.mk#L52
Or are you proposing that it needs to be changed to build the mkl flavor with mkl blas instead of openblas?
mkl flavor packages have always been built with USE_BLAS=openblas. We can change that to MKL BLAS if we are allowed to include a dependency with a Category X license [1] in MXNet convenience releases.
Thanks @TaoLv
I was able to rebuild and reproduce Nihal's results:
$ python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64
Namespace(batch_norm=False, batch_size=64, benchmark=False, dataset='cifar10', dtype='float32', epochs=25, gpus=0, kvstore='local', log_interval=50, lr=0.01, mode='symbolic', model='resnet18_v2', seed=123, use_pretrained=False, use_thumbnail=False, wd=0.0001)
Archive: cifar10.zip
creating: cifar/
inflating: cifar/test.rec
inflating: cifar/test.lst
inflating: cifar/train.lst
inflating: cifar/train.rec
[05:12:00] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[05:12:00] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[05:12:00] src/executor/graph_executor.cc:1979: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1583.17 samples/sec accuracy=0.285846
INFO:root:Epoch[0] Batch [50-100] Speed: 1508.38 samples/sec accuracy=0.388750
INFO:root:Epoch[0] Batch [100-150] Speed: 1623.32 samples/sec accuracy=0.433125
INFO:root:Epoch[0] Batch [150-200] Speed: 1613.61 samples/sec accuracy=0.443437
INFO:root:Epoch[0] Batch [200-250] Speed: 1642.54 samples/sec accuracy=0.455000
INFO:root:Epoch[0] Batch [250-300] Speed: 1625.45 samples/sec accuracy=0.506250
INFO:root:Epoch[0] Batch [300-350] Speed: 1620.83 samples/sec accuracy=0.515312
INFO:root:Epoch[0] Batch [350-400] Speed: 1637.02 samples/sec accuracy=0.537500
INFO:root:Epoch[0] Batch [400-450] Speed: 1635.96 samples/sec accuracy=0.550937
INFO:root:Epoch[0] Batch [450-500] Speed: 1641.26 samples/sec accuracy=0.574688
INFO:root:Epoch[0] Batch [500-550] Speed: 1643.39 samples/sec accuracy=0.569063
INFO:root:Epoch[0] Batch [550-600] Speed: 1639.69 samples/sec accuracy=0.573125
INFO:root:Epoch[0] Batch [600-650] Speed: 1644.01 samples/sec accuracy=0.598437
INFO:root:Epoch[0] Batch [650-700] Speed: 1644.10 samples/sec accuracy=0.614375
INFO:root:Epoch[0] Batch [700-750] Speed: 1644.86 samples/sec accuracy=0.601250
The root cause of this performance regression is the change of BLAS library (switching from MKL BLAS to OpenBLAS) and the removal of the libiomp5.so library.
Now the next step is to determine how we want to proceed. Do we continue with OpenBLAS and take the performance hit, or, as @TaoLv mentioned, can we use the Category X licensed dependency?
Hi @TaoLv, @samskalicky,
Intel MKL-DNN includes a GEMM implementation that is comparable in performance to Intel MKL's. Is using mkldnn_gemm an option here?
@TaoLv @pengzhao-intel Are there features in MXNet that require MKL as the BLAS library? I was able to find this line:
https://github.com/apache/incubator-mxnet/blob/c82af38211dbf8356a4f3b35f023632c5bf880ae/src/operator/quantization/quantized_fully_connected.cc#L291
I'm rereading the previous comment and now I'm confused:
@oorqueda @samskalicky @leleamol As mentioned in #16891 (comment), I suspect that the regression is caused by the removal of libiomp5.so.
...
If it's true, I don't think we have any choice to avoid the regression in pip packages as removing libiomp5.so is a requirement from Apache. Please refer to #15544. Thanks!
Is the performance difference coming from using Intel's OpenMP library (libiomp5) or from using the MKL BLAS library itself and some routines like GEMM (as @vpirogov mentions)?
@vpirogov @samskalicky Although MKL BLAS may also have a positive impact on the case demonstrated above, I think the main gap is from the different OMP runtimes. Setting USE_BLAS=mkl will help to pull in iomp5. Sure, I'm going to replace cblas_sgemm and cblas_sgemm_batch with the MatMul primitive from DNNL once it's released, but I don't think that will fill the gap between gomp and iomp5.
@samskalicky The code you referred to will not be called in the ResNet18 case. Most of the computation in ResNet18 should go to DNNL.
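One quick (unofficial) way to test whether the OMP runtime alone accounts for the gap is to preload a different runtime into the gomp-linked build before benchmarking; this is an experiment, not a supported configuration, and the library path is illustrative:
# Preload Intel's OpenMP runtime ahead of libgomp for a single benchmark run.
LD_PRELOAD=/opt/intel/lib/intel64/libiomp5.so \
python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 1 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64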
@TaoLv, is anything preventing us from using the LLVM OpenMP runtime (libomp)? It is pretty much an open source version of libiomp5.
@vpirogov We can do that. My only concern is its interoperability. Also, from the MXNet perspective, we would need to move the release process from make to cmake, which I don't think can be done within the schedule of the 1.6.0 release.
What do you mean by interoperability exactly?
@TaoLv To get closure on this topic, would it be possible to move the discussion forward?
Thanks
@vpirogov @ChaiBapchya The interoperability concern means: when more than one OpenMP runtime is loaded into the same process, they can conflict.
@TaoLv,
You are right that when different OpenMP runtimes are used in the same application there's a potential for interoperability issues. For this particular discussion it's important to note that the interoperability considerations are the same for libiomp5 and libomp. From that perspective using libomp does not introduce any additional issues in comparison to what MXNet used before (i.e. libiomp5).
@vpirogov, yes, that's true. libomp and libiomp5 should have the same interoperability issues. From this perspective, the current release build solution (makefile + gomp) sounds like the safer choice, though it has relatively worse performance. I assume that gomp has better interoperability than the other two runtimes, though that may not be true.
@samskalicky and all,
The problem is very clear now. I think we need to make a decision and move forward.
Two possible paths, as below:
Keep the build as-is with gomp
pros: stable and mature now
cons: a slight performance drop
Re-build with LLVM OpenMP via CMake
pros: same performance as before
cons: effort on improving the CMake path and potential interoperability issues
From my side, I prefer the first option. What's your opinion?
Hi @pengzhao-intel, in MXNet 2.0 CMake is planned to be the only build system: https://github.com/apache/incubator-mxnet/projects/18#card-30594044
Would that address the cons in Option 2?
> Would that address the cons in Option 2?
It's a good chance to make the system clean :)
Closing, since the fix has already been included with the latest MKLDNN version update.