MMDetection: multi-GPU training problem

Created on 20 Sep 2019 · 2 Comments · Source: open-mmlab/mmdetection

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug
I ran CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/ssd512_coco.py 4 --validate to train on multiple GPUs. Right after the messages about loading the COCO annotations, training fails with the error messages shown in the Error traceback section below.

It seems to be an OpenMP problem, but I have no idea how to solve it.

Reproduction

What command or script did you run?

CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/ssd512_coco.py 4 --validate

Did you make any modifications on the code or config? Did you understand what you have modified?

I used the default code for training.

What dataset did you use?

COCO 2017

Environment

- OS: Ubuntu 18.04.1
- GCC version: 7.3.0
- PyTorch version: 1.2.0
- How you installed PyTorch: conda
- GPU model: V100
- CUDA and CUDNN version: CUDA 10.1 / CUDNN 7.6.2

Error traceback

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.

Bug fix
The errors appear to be caused by intel-openmp 2019.5.
The suggested solution is to downgrade intel-openmp to 2019.4 with
conda install -y intel-openmp=2019.4
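
Before or after downgrading, it can help to confirm which intel-openmp version is actually installed in the active conda environment. The following is a minimal sketch, not part of MMDetection: the helper names (`find_package_version`, `is_affected`, `installed_intel_openmp_version`) are hypothetical, and it assumes `conda list` prints the package name in the first column and the version in the second.

```python
import subprocess

# Version named in this issue as problematic (assumption: any 2019.5.x
# patch release is treated the same way).
AFFECTED_VERSION = "2019.5"


def find_package_version(conda_list_output, name="intel-openmp"):
    """Return the version column for `name` in `conda list` text, or None."""
    for line in conda_list_output.splitlines():
        fields = line.split()
        # Skip blank lines and the '#' header lines that conda prints.
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == name and len(fields) >= 2:
            return fields[1]
    return None


def is_affected(version):
    """True if `version` is the 2019.5 release that this issue blames."""
    if version is None:
        return False
    return version == AFFECTED_VERSION or version.startswith(AFFECTED_VERSION + ".")


def installed_intel_openmp_version():
    """Ask conda for the installed version (requires conda on PATH)."""
    result = subprocess.run(
        ["conda", "list", "intel-openmp"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_package_version(result.stdout)
```

Calling `is_affected(installed_intel_openmp_version())` inside the training environment would indicate whether the downgrade is still needed. The parsing is kept separate from the `conda` call so it can be checked without conda installed.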
