Incubator-mxnet: unix-cpu MKL/MKL-DNN Test Time

Created on 6 May 2020 · 14 comments · Source: apache/incubator-mxnet

Description

In #18146 we introduced parallel testing in CI with the goal of reducing test time. During that effort we noticed that the MKL and MKLDNN builds run their tests slower than the build without MKL or MKLDNN. This issue summarizes the setup and the current time differences.

Setup

The results in this issue come from this CI run, and the time difference is similar in master-branch validation runs.

To show the results, we compare the following test nodes.

Python 3: CPU

Build

build_ubuntu_cpu_openblas() {
    set -ex
    cd /work/build
    CXXFLAGS="-Wno-error=strict-overflow" CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_TVM_OP=ON \
        -DUSE_CPP_PACKAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_DIST_KVSTORE=ON \
        -DBUILD_CYTHON_MODULES=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKL-CPU

Build

build_ubuntu_cpu_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Python 3: MKLDNN-CPU

Build

build_ubuntu_cpu_mkldnn() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_CPP_PACKAGE=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKLDNN-MKL-CPU

Build

build_ubuntu_cpu_mkldnn_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Tests

Each test node runs one of the following two test functions.

python3_ut

unittest_ubuntu_python3_cpu() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
}

python_ut_mkldnn

unittest_ubuntu_python3_cpu_mkldnn() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
}

Test steps

To show fine-grained timing results, I break down the steps in each test node as follows:

  • docker launch: the time it takes to prepare and launch the Docker container
  • parallel unittest: pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
  • serial unittest: pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
  • quantization test: pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
  • mkl test: pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
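The split between the parallel and serial invocations above relies on a pytest marker. As a minimal sketch (the `serial` marker name comes from the commands above; the test name and body are hypothetical), a test opts out of the `-n 4` run like this:

```python
import pytest

# Tests carrying this marker are excluded by `pytest -m 'not serial' -n 4`
# and picked up by the follow-up `pytest -m 'serial'` run.
@pytest.mark.serial
def test_heavy_operator():  # hypothetical test name
    assert 1 + 1 == 2
```

In practice the marker is registered in the project's pytest configuration (under `markers`) so that `-m` filtering works without unknown-marker warnings.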

Results

The unit for all of the following results is seconds. The Python 3: CPU results are the baseline.

| Test | docker launch | parallel unittest | serial unittest | quantization | mkl test |
|--------------------------|---------------|--------------------------|--------------------|--------------|----------|
| Python 3: CPU | 892 | 981 | 951 | 39 | N/A |
| Python 3: MKL-CPU | 874 | 4996 (+409%) | 4380 (+361%)| 67 | N/A |
| Python 3: MKLDNN-CPU | 892 | 5096 (+419%) | 3314 (+248%)| N/A | 1080 |
| Python 3: MKLDNN-MKL-CPU | 901 | 3899 (+297%) | 3507 (+269%)| N/A | 1210 |

Labels: CI, MKL, MKLDNN


All 14 comments

cc @TaoLv @PatricZhao

Thanks for raising this concern. We will take a look soon.

I notice MXNET_MKLDNN_DEBUG is already disabled. I suspect some of the overhead comes from primitive creation: the backend is designed for real-world usage, where primitives are created and cached in the first iteration and reused in the following ones. But in unit tests, most primitives are created and used only once.

@szha Do you have any detailed report where we can find the time for each test in each test file?

@TaoLv in the example link I included, you can find the log where the 50 most time-consuming tests are listed for each run.

Any update on this?

For now I have disabled the MKL/MKLDNN parallel testing and the naive-engine change due to the excessive testing time these builds take. To restore them for testing, revert this commit: https://github.com/apache/incubator-mxnet/pull/18252/commits/9b7b7e26e37c087f7d009006b8dea31fdaab88ca

@szha I've done some testing in Docker for the parallel tests and have an overview. For testing I used about 35 long-running tests.
pytest -n 4 spawns many more threads for the MKL/MKLDNN build, and CPU usage is at 100% for the whole test run (24 cores used out of 24), so it's probably a starvation problem.
Threads spawned:
MKL/MKLDNN build: about 250 most of the time, with a peak of 265 near the end
CPU build: 140 (peak 150)

I've tested the MKLDNN build with OMP_NUM_THREADS=n/4 (n = number of cores)
and it was significantly faster (about 1.7x).

I'm going to do some testing with MKL and OMP flags to pin down the best configuration, and also figure out the 'serial' case

cc: @TaoLv
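The n/4 rule of thumb above can be sketched as a small shell snippet (assuming a Linux environment with `nproc`; the variable name is illustrative):

```shell
# Give each of the 4 pytest workers an equal share of the cores,
# with a floor of 1 so small machines still get a thread.
CORES=$(nproc)
export OMP_NUM_THREADS=$(( CORES / 4 > 0 ? CORES / 4 : 1 ))

echo "cores=$CORES omp_threads_per_worker=$OMP_NUM_THREADS"
```

Exporting this before the `pytest -n 4 ...` step keeps 4 workers, each with a full-machine OpenMP pool, from oversubscribing the host.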

The second problem is that, in the CI Docker image, mxnet built with CMake is probably not using Intel OpenMP but a different OpenMP runtime.

The strange thing is that the ldd command shows different linking for the two build systems:

  • for make: ldd lib/libmxnet.so | grep omp shows:
    libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
  • for cmake: ldd build/libmxnet.so | grep omp shows:
    libomp.so => /work/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
    Removing 3rdparty/openmp didn't help

• mxnet built with the make build system is about 8x faster:
  o test_operator.py::test_lstm_bidirectional – cmake build execution time: 305 s
  o test_operator.py::test_lstm_bidirectional – make build execution time: 34.05 s

• I've tried lots of build combinations with different environment variables,
e.g. setting LD_LIBRARY_PATH, etc.
Nothing helps; trying and digging more

The cmake build uses MKL by default if it is available; you can check the cmake configuration output. To force MKL, set cmake -DUSE_BLAS=MKL. Reference: https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/cmake/ChooseBlas.cmake#L21-L32

If MKL Blas is used, OpenMP will not be built

https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/CMakeLists.txt#L402-L427

There may be a bug in these two referenced places, or your build environment isn't set up correctly?
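For reference, pinning the BLAS choice explicitly (rather than relying on auto-detection) would look like the sketch below; apart from USE_BLAS, the flag values are copied from the build functions at the top of this issue:

```shell
# Sketch: pin the BLAS backend so ChooseBlas.cmake cannot silently
# fall back to another implementation.
cmake \
    -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
    -DUSE_BLAS=MKL \
    -DUSE_MKL_IF_AVAILABLE=ON \
    -DUSE_MKLDNN=ON \
    -DUSE_CUDA=OFF \
    -G Ninja /work/mxnet
```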

I discovered that the slowdown is caused by the ENABLE_TESTCOVERAGE=ON flag, not by OpenMP.
I think the overhead comes from the injected instrumentation code locking some resources, because
OMP_NUM_THREADS=1 improves the speed of the tests.

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 305s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 479.93s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 236.45s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 336.66s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 40.93s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 58.14s

Summary:

  • setting OMP_NUM_THREADS=1 helps
  • the ENABLE_TESTCOVERAGE flag is the root cause of the slowdown
  • setting OMP_NUM_THREADS to (number of cores)/4 for the parallel tests also helps (as an alternative to the first bullet)

CC: @TaoLv , @leezu , @szha

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

@leezu

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

This affects the CPU (OpenBLAS) build as well.
CPU build measurements:

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 105.87s and 112.04s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 89.02s and 91.87s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 86.25s and 86.75s

Without ENABLE_TESTCOVERAGE & OMP_NUM_THREADS=12:

  • test_operator.py::test_lstm_bidirectional – test time: 64.68s and 65.76s

The last test case shows that the OpenBLAS build has some problems utilizing all threads.

@szha Can we close this issue, as the root cause was found?

@bgawrych yes. thank you for tracking this down.
