Incubator-mxnet: unix-cpu MKL/MKL-DNN Test Time

Created on 6 May 2020 · 14 comments · Source: apache/incubator-mxnet

Description

In #18146 we introduced parallel testing in CI with the goal of reducing test time. During that effort we noticed that the MKL and MKLDNN builds run their tests slower than the build without MKL or MKLDNN. This issue summarizes the setup and the current time differences.

Setup

The results in this issue come from this CI run, and the time difference is similar in master-branch validation runs.

To show the results, we compare the following test nodes.

Python 3: CPU

Build

build_ubuntu_cpu_openblas() {
    set -ex
    cd /work/build
    CXXFLAGS="-Wno-error=strict-overflow" CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_TVM_OP=ON \
        -DUSE_CPP_PACKAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_DIST_KVSTORE=ON \
        -DBUILD_CYTHON_MODULES=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKL-CPU

Build

build_ubuntu_cpu_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Python 3: MKLDNN-CPU

Build

build_ubuntu_cpu_mkldnn() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_CPP_PACKAGE=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKLDNN-MKL-CPU

Build

build_ubuntu_cpu_mkldnn_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Tests

Each test node runs one of the following two test functions.

python3_ut

unittest_ubuntu_python3_cpu() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
}

python_ut_mkldnn

unittest_ubuntu_python3_cpu_mkldnn() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
}

Test steps

To show fine-grained timing results, I break down the steps in each test node as follows:

  • docker launch: the time it takes to prepare and launch the Docker container
  • parallel unittest: pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
  • serial unittest: pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
  • quantization test: pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
  • mkl test: pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
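The split between the parallel and serial invocations above relies on a pytest marker. As a minimal sketch (the `serial` marker name comes from the commands above; the test name and body are hypothetical), a test opts out of the `-n 4` run like this:

```python
import pytest

# Tests carrying this marker are excluded by `pytest -m 'not serial' -n 4`
# and picked up by the follow-up `pytest -m 'serial'` run.
@pytest.mark.serial
def test_heavy_operator():  # hypothetical test name
    assert 1 + 1 == 2
```

In practice the marker is registered in the project's pytest configuration (under `markers`) so that `-m` filtering works without unknown-marker warnings.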

Results

The unit for all of the following results is seconds. The Python 3: CPU results are the baseline.

| Test | docker launch | parallel unittest | serial unittest | quantization | mkl test |
|--------------------------|---------------|--------------------------|--------------------|--------------|----------|
| Python 3: CPU | 892 | 981 | 951 | 39 | N/A |
| Python 3: MKL-CPU | 874 | 4996 (+409%) | 4380 (+361%)| 67 | N/A |
| Python 3: MKLDNN-CPU | 892 | 5096 (+419%) | 3314 (+248%)| N/A | 1080 |
| Python 3: MKLDNN-MKL-CPU | 901 | 3899 (+297%) | 3507 (+269%)| N/A | 1210 |

Labels: CI, MKL, MKLDNN


All 14 comments

cc @TaoLv @PatricZhao

Thanks for raising this concern. We will take a look soon.

I notice MXNET_MKLDNN_DEBUG is already disabled. I suspect some of the overhead comes from primitive creation: the backend is designed for real-world usage, where primitives are created and cached in the first iteration and reused in the following ones. But in unit tests, most primitives are created and used only once.

@szha Do you have any detailed report where we can find the time for each test in each test file?

@TaoLv in the example link I included, you can find the log where the 50 most time-consuming tests are listed for each run.

Any update on this?

For now I have disabled the MKL/MKLDNN parallel testing and the naive-engine change due to the excessive testing time these builds take. To restore them for testing, revert this commit: https://github.com/apache/incubator-mxnet/pull/18252/commits/9b7b7e26e37c087f7d009006b8dea31fdaab88ca

@szha I've done some testing in Docker for the parallel tests and have an overview. For testing I used about 35 long-running tests.
pytest -n 4 spawns many more threads for the MKL/MKLDNN build, and CPU usage is at 100% for the whole test run (24 cores used out of 24), so it's probably a starvation problem.
Threads spawned:
MKL/MKLDNN build: about 250 most of the time, with a peak of 265 near the end
CPU build: 140 (peak 150)

I've tested the MKLDNN build with OMP_NUM_THREADS=n/4 (n = number of cores)
and it was significantly faster (about 1.7x).

I'm going to do some testing with MKL and OMP flags to pin down the best configuration, and also figure out the 'serial' case

cc: @TaoLv
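The n/4 rule of thumb above can be sketched as a small shell snippet (assuming a Linux environment with `nproc`; the variable name is illustrative):

```shell
# Give each of the 4 pytest workers an equal share of the cores,
# with a floor of 1 so small machines still get a thread.
CORES=$(nproc)
export OMP_NUM_THREADS=$(( CORES / 4 > 0 ? CORES / 4 : 1 ))

echo "cores=$CORES omp_threads_per_worker=$OMP_NUM_THREADS"
```

Exporting this before the `pytest -n 4 ...` step keeps 4 workers, each with a full-machine OpenMP pool, from oversubscribing the host.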

The second problem is that, in the CI Docker image, mxnet built with CMake is probably not using Intel OpenMP but a different OpenMP runtime.

The strange thing is that the ldd command shows different linking for the two build systems:

  • for make: ldd lib/libmxnet.so | grep omp shows:
    libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
  • for cmake: ldd build/libmxnet.so | grep omp shows:
    libomp.so => /work/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
    Removing 3rdparty/openmp didn't help

• mxnet built with the make build system is about 8x faster:
  o test_operator.py::test_lstm_bidirectional – cmake build execution time: 305 s
  o test_operator.py::test_lstm_bidirectional – make build execution time: 34.05 s

• I've tried lots of build combinations with different environment variables,
e.g. setting LD_LIBRARY_PATH, etc.
Nothing helps; trying and digging more

The cmake build uses MKL by default if it is available; you can check the cmake configuration output. To force MKL, set cmake -DUSE_BLAS=MKL. Reference: https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/cmake/ChooseBlas.cmake#L21-L32

If MKL Blas is used, OpenMP will not be built

https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/CMakeLists.txt#L402-L427

There may be a bug in these two referenced places, or your build environment isn't set up correctly?
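For reference, pinning the BLAS choice explicitly (rather than relying on auto-detection) would look like the sketch below; apart from USE_BLAS, the flag values are copied from the build functions at the top of this issue:

```shell
# Sketch: pin the BLAS backend so ChooseBlas.cmake cannot silently
# fall back to another implementation.
cmake \
    -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
    -DUSE_BLAS=MKL \
    -DUSE_MKL_IF_AVAILABLE=ON \
    -DUSE_MKLDNN=ON \
    -DUSE_CUDA=OFF \
    -G Ninja /work/mxnet
```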

I discovered that the slowdown is caused by the ENABLE_TESTCOVERAGE=ON flag, not by OpenMP.
I think the overhead comes from the injected instrumentation code locking some resources, because
OMP_NUM_THREADS=1 improves the speed of the tests.

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 305s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 479.93s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 236.45s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 336.66s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 40.93s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 58.14s

Summary:

  • setting OMP_NUM_THREADS=1 helps
  • the ENABLE_TESTCOVERAGE flag is the root cause of the slowdown
  • setting OMP_NUM_THREADS to (number of cores)/4 for the parallel tests also helps (as an alternative to the first bullet)

CC: @TaoLv , @leezu , @szha

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

@leezu

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

This affects the CPU (OpenBLAS) build as well.
CPU build measurements:

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 105.87s and 112.04s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 89.02s and 91.87s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 86.25s and 86.75s

Without ENABLE_TESTCOVERAGE & OMP_NUM_THREADS=12:

  • test_operator.py::test_lstm_bidirectional – test time: 64.68s and 65.76s

The last test case shows that the OpenBLAS build has some problems utilizing all threads.

@szha Can we close this issue, as the root cause was found?

@bgawrych yes. thank you for tracking this down.
