MMDetection: multi-GPU training problem

Created on 20 Sep 2019 · 2 Comments · Source: open-mmlab/mmdetection

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
I ran the command CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/ssd512_coco.py 4 --validate for multi-GPU training. After the message about loading the COCO annotations, the run fails with the error messages shown in the Error traceback section.

It seems to be an OpenMP problem, but I have no idea how to solve it.

Reproduction

  1. What command or script did you run?
    CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/ssd512_coco.py 4 --validate
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    No, I used the default code for training.
  3. What dataset did you use?
    COCO 2017

Environment

  • OS: Ubuntu 18.04.1
  • GCC: 7.3.0
  • PyTorch version: 1.2.0
  • How you installed PyTorch: conda
  • GPU model: V100
  • CUDA and cuDNN version: CUDA 10.1 / cuDNN 7.6.2
  • [optional] Other information that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
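
For anyone filing a similar report, the environment fields above can be gathered with a short script. This is only a sketch: the function name collect_env is hypothetical (it is not part of mmdetection), and the PyTorch fields are filled in only if PyTorch can be imported.

```python
import platform
import sys


def collect_env():
    """Gather the environment fields the issue template asks for."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch  # optional: only reported when PyTorch is installed
    except ImportError:
        info["pytorch"] = "not installed"
        return info
    info["pytorch"] = torch.__version__
    info["cuda"] = torch.version.cuda               # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The GCC version and the way PyTorch was installed are not covered here and still need to be filled in by hand.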

Error traceback

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
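
The same error/hint pair appears four times in the traceback above, which matches the four GPUs requested, so it is plausible (though not confirmed in this thread) that each worker process hit the same assertion. A minimal sketch for spotting this signature in a saved training log follows; the helper name summarize_omp_errors and the sample text are made up for illustration.

```python
import re

# Matches lines such as "OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361)."
OMP_ERROR = re.compile(r"^OMP: Error #(\d+): (.*)$")


def summarize_omp_errors(log_text):
    """Return {(error_number, message): count} for OMP error lines in a log."""
    counts = {}
    for line in log_text.splitlines():
        match = OMP_ERROR.match(line.strip())
        if match:
            key = (int(match.group(1)), match.group(2))
            counts[key] = counts.get(key, 0) + 1
    return counts


# Sample log: the error/hint pair repeated four times, as in the traceback.
sample = "\n".join(
    [
        "OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).",
        "OMP: Hint (text shortened for this example)",
    ]
    * 4
)
print(summarize_omp_errors(sample))
# prints {(13, 'Assertion failure at z_Linux_util.cpp(2361).'): 4}
```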

Bug fix
The referenced issue reports errors with intel-openmp=2019.5.
The suggested solution is therefore to downgrade intel-openmp:
conda install -y intel-openmp=2019.4
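
Before downgrading, it can help to confirm which intel-openmp version is actually installed. The sketch below parses the output of conda list --json, assuming each entry carries "name" and "version" keys; the helper names (find_version, advice, installed_openmp_version) are hypothetical, and the sample JSON is invented for illustration.

```python
import json
import subprocess

AFFECTED_VERSION = "2019.5"   # version this thread reports as problematic
SUGGESTED_VERSION = "2019.4"  # version this thread downgrades to


def find_version(conda_list_json, package="intel-openmp"):
    """Return the version of `package` from `conda list --json` output, or None."""
    for entry in json.loads(conda_list_json):
        if entry.get("name") == package:
            return entry.get("version")
    return None


def advice(version):
    """Turn a detected version into a short recommendation string."""
    if version is None:
        return "intel-openmp is not installed in this environment"
    if version == AFFECTED_VERSION:
        return f"found {version}; try: conda install -y intel-openmp={SUGGESTED_VERSION}"
    return f"found {version}; not the version reported in this thread"


def installed_openmp_version():
    """Query the active conda environment (requires conda on PATH)."""
    result = subprocess.run(
        ["conda", "list", "--json"], capture_output=True, text=True, check=True
    )
    return find_version(result.stdout)


# Illustration with made-up `conda list --json` output:
sample_json = '[{"name": "intel-openmp", "version": "2019.5"}, {"name": "numpy", "version": "1.17.2"}]'
print(advice(find_version(sample_json)))
# prints found 2019.5; try: conda install -y intel-openmp=2019.4
```

In a real environment, print(advice(installed_openmp_version())) would run the check against the active conda environment instead of the sample text.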

All 2 comments

Hi @wakananai ,

I got the same error and solved it with conda install -y intel-openmp=2019.4. I followed this issue.

Hi @LcDog ,

After downgrading intel-openmp with conda install -y intel-openmp=2019.4, multi-GPU training runs without errors.

Thank you for your kind assistance.
