In #18146 we introduced parallel testing in CI in the hope of reducing test time. However, during that effort we noticed that the MKL and MKLDNN test runs are slower than the corresponding runs without MKL or MKLDNN. This issue summarizes the setup and the current time difference.
The results in this issue come from this CI run, and the time difference is similar in master branch validation.
To show the results, we compare the following test nodes.
build_ubuntu_cpu_openblas() {
    set -ex
    cd /work/build
    CXXFLAGS="-Wno-error=strict-overflow" CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_TVM_OP=ON \
        -DUSE_CPP_PACKAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_DIST_KVSTORE=ON \
        -DBUILD_CYTHON_MODULES=ON \
        -G Ninja /work/mxnet
    ninja
}
build_ubuntu_cpu_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}
build_ubuntu_cpu_mkldnn() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_CPP_PACKAGE=ON \
        -G Ninja /work/mxnet
    ninja
}
build_ubuntu_cpu_mkldnn_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}
Each test node runs one of the following two test functions:
unittest_ubuntu_python3_cpu() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
}
unittest_ubuntu_python3_cpu_mkldnn() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
}
To show fine-grained timing results, I break down the steps in each test node as follows:
pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
The unit for the results below is seconds. The Python 3: CPU results are the baseline.
| Test                     | docker launch | parallel unittest | serial unittest | quantization | mkl test |
|--------------------------|---------------|-------------------|-----------------|--------------|----------|
| Python 3: CPU            | 892           | 981               | 951             | 39           | N/A      |
| Python 3: MKL-CPU        | 874           | 4996 (+409%)      | 4380 (+361%)    | 67           | N/A      |
| Python 3: MKLDNN-CPU     | 892           | 5096 (+419%)      | 3314 (+248%)    | N/A          | 1080     |
| Python 3: MKLDNN-MKL-CPU | 901           | 3899 (+297%)      | 3507 (+269%)    | N/A          | 1210     |
cc @TaoLv @PatricZhao
Thanks for raising this concern. We will take a look soon.
I notice that MXNET_MKLDNN_DEBUG is already disabled. I suspect some of the overhead comes from primitive creation: the backend is designed for real-world usage, where primitives are created and cached in the first iteration and then reused in the following ones. In unit tests, however, most primitives are created and used only once.
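For illustration, here is a minimal sketch of that effect, assuming an MKLDNN-enabled build of mxnet is importable; the shapes and timings are only illustrative. The first call to an MKLDNN-backed operator pays the primitive-creation cost, while subsequent calls reuse the cached primitive.

# Minimal sketch (assumes an MKLDNN-enabled mxnet; numbers are illustrative):
# the first call pays the primitive-creation cost, later calls reuse the cache.
python3 - <<'EOF'
import time
import mxnet as mx

data = mx.nd.random.uniform(shape=(1, 3, 224, 224))
weight = mx.nd.random.uniform(shape=(64, 3, 3, 3))

for i in range(3):
    start = time.time()
    out = mx.nd.Convolution(data=data, weight=weight, no_bias=True,
                            kernel=(3, 3), num_filter=64)
    out.wait_to_read()  # force synchronization so the timing is meaningful
    print("call %d: %.4f s" % (i, time.time() - start))
EOF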
@szha Do you have any detailed report where we can find the test time for each test or each test file?
@TaoLv In the CI run I linked, you can find a log for each run that lists the 50 most time-consuming tests.
Any update on this?
For now I have disabled the MKL/MKLDNN parallel testing and the naive-engine change due to the excessive testing time these builds take. For testing, revert this commit: https://github.com/apache/incubator-mxnet/pull/18252/commits/9b7b7e26e37c087f7d009006b8dea31fdaab88ca
@szha I've done some testing in docker for the parallel tests and have an overview. For testing I used about 35 long-running tests.
pytest -n 4 spawns many more threads for the MKL/MKLDNN builds, and CPU usage stays at 100% for the whole test run (24 out of 24 cores used). So it is probably a starvation problem.
Threads spawned:
MKL/MKLDNN build: about 250 most of the time, with a peak of 265 shortly before the end
CPU build: 140 (peak 150)
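For reference, a rough sketch of how such thread counts can be sampled while the test run is in progress; this is a hypothetical helper, and the pgrep pattern may need adjusting to catch the xdist worker processes on the CI image:

# Sample the total thread count (NLWP) of all processes whose command line
# mentions pytest, once every 5 seconds, until the run finishes.
while pgrep -f pytest > /dev/null; do
    pgrep -f pytest | xargs -r -n1 ps -o nlwp= -p | awk '{s+=$1} END {print s, "threads"}'
    sleep 5
done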
I tested the MKLDNN build with OMP_NUM_THREADS=n/4 (n = number of cores)
and it was significantly faster (about 1.7x).
I'm going to do more testing with MKL and OMP flags to find the best configuration, and also look into the 'serial' case.
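A sketch of the setting tested above, reusing the parallel pytest invocation from the CI scripts; the n/4 value is the heuristic from this comment (e.g. 6 threads on a 24-core machine):

# Limit OpenMP threads per worker to roughly (number of cores) / (pytest -n value)
export OMP_NUM_THREADS=$(( $(nproc) / 4 ))
pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest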
cc: @TaoLv
The second problem is that on the CI docker image, mxnet built with cmake is probably not using Intel OpenMP but GNU (GCC) OpenMP:

But the strange thing is that the ldd command shows different linking:
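As a rough sketch, the check amounts to something like the following; the library path is an example, so point ldd at the libmxnet.so your build produced:

# GNU OpenMP shows up as libgomp.so.1, Intel OpenMP as libiomp5.so
ldd /work/build/libmxnet.so | grep -iE 'omp|mkl'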
- mxnet built with the make build system is about 8x faster:
  - test_operator.py::test_lstm_bidirectional with the cmake build: 305 sec
  - test_operator.py::test_lstm_bidirectional with the make build: 34.05 sec
- I've tried lots of build combinations with different env variables (e.g. setting LD_LIBRARY_PATH, etc.); nothing helps, still trying and digging more.
The cmake build uses MKL by default if it is available. You can look at the cmake configuration output. To force MKL, set cmake -DUSE_BLAS=MKL. Reference: https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/cmake/ChooseBlas.cmake#L21-L32
If MKL BLAS is used, OpenMP will not be built.
There may be a bug in these two references, or your build environment isn't set up correctly?
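For example, a sketch of forcing MKL explicitly and then checking what the configuration actually resolved to; run it from the build directory, and note that CMakeCache.txt records the cached option values:

# Force MKL BLAS, then inspect the cached configuration
cmake -DUSE_BLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -GNinja /work/mxnet
grep -iE 'USE_BLAS|MKL' CMakeCache.txt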
I discovered that the slowdown is caused by the ENABLE_TESTCOVERAGE=ON flag, not by OpenMP.
I think the overhead comes from the injected coverage code locking some resources, because
OMP_NUM_THREADS=1 improves test speed (fewer threads means less contention on whatever the instrumentation locks).
With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:
With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:
Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:
Summary:
CC: @TaoLv , @leezu , @szha
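For completeness, a sketch of the kind of A/B comparison described above; the flags mirror the CI build functions earlier in this issue and the test chosen is the slow one named in the previous comment, but both are illustrative:

# Rebuild without coverage instrumentation (run from /work/build), then time one slow test
cmake -DCMAKE_BUILD_TYPE="RelWithDebInfo" -DENABLE_TESTCOVERAGE=OFF \
      -DUSE_MKLDNN=ON -DUSE_CUDA=OFF -GNinja /work/mxnet && ninja
pytest --durations=0 --verbose tests/python/unittest/test_operator.py::test_lstm_bidirectional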
Thank you for looking into this! Do you have any insights into why this only affects MKL builds?
@leezu
> Thank you for looking into this! Do you have any insights into why this only affects MKL builds?
This affects the CPU (openBLAS) build as well.
CPU build measurement:
With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:
With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:
Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:
Without ENABLE_TESTCOVERAGE & OMP_NUM_THREADS=12:
The last test case shows that the openBLAS build has some problems utilizing all threads.
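If one wants to pin the thread count explicitly for the openBLAS build, the relevant knobs look roughly like this; OPENBLAS_NUM_THREADS is OpenBLAS's own environment variable and is mentioned here as an assumption, since the measurements above only vary OMP_NUM_THREADS:

# Hypothetical sketch: pin both the OpenMP and OpenBLAS thread pools before running tests
export OMP_NUM_THREADS=12
export OPENBLAS_NUM_THREADS=12   # assumption: OpenBLAS sizes its own pool from this variable
pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest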
@szha Can we close this issue, since the root cause was found?
@bgawrych yes. thank you for tracking this down.