OpenMP error when running fairseq-train (even with --num-workers 0)
fairseq-generate runs without crashing.
RUN:
DATA_FOLDER=errorsim-pw1; ARCH=fconv; FOLDER=fconv_pw1_test2
fairseq-train ./data-bin/$DATA_FOLDER --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
    --arch $ARCH --save-dir ./checkpoints/$FOLDER --no-progress-bar --log-interval 50 \
    --num-workers 0 --cpu
(--cpu can be dropped if you have a GPU; the error is the same with any value of --num-workers.)
ERROR LOG:
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
Traceback (most recent call last):
  File "/homes/3/serai/pytorch_vib/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/homes/3/serai/fairseq_vib/fairseq_cli/train.py", line 329, in cli_main
    nprocs=args.distributed_world_size,
  File "/homes/3/serai/pytorch_vib/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/homes/3/serai/pytorch_vib/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGABRT
Install method (pip, source): pip, as an editable install.
OpenMP: echo | cpp -fopenmp -dM | grep -i open prints "#define _OPENMP 201107".
This is a new-ish RHEL7 environment I'm trying to work with. So far, everything else I've done with PyTorch, fairseq, and the other libraries I use works fine.
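For reference, the _OPENMP value is a yyyymm date identifying the OpenMP specification the compiler supports. A small lookup, assuming the standard release-date table below, shows that 201107 means OpenMP 3.1. Note that cpp here is the system compiler's preprocessor, so this only describes compiler support; it says nothing about the OpenMP runtime that raised the assertion (the intel.com hint suggests Intel's runtime). The helper name openmp_version is hypothetical.

```python
# Hypothetical helper: map the _OPENMP macro (a yyyymm date) to the
# OpenMP specification version it denotes.
OPENMP_RELEASES = {
    200505: "2.5",
    200805: "3.0",
    201107: "3.1",
    201307: "4.0",
    201511: "4.5",
    201811: "5.0",
    202011: "5.1",
    202111: "5.2",
}


def openmp_version(macro_line: str) -> str:
    """Parse a line like '#define _OPENMP 201107' and return the spec version."""
    parts = macro_line.split()
    if len(parts) != 3 or parts[:2] != ["#define", "_OPENMP"]:
        raise ValueError(f"not an _OPENMP define: {macro_line!r}")
    date = int(parts[2])
    # Dates not in the table are reported rather than guessed.
    return OPENMP_RELEASES.get(date, f"unknown ({date})")


print(openmp_version("#define _OPENMP 201107"))  # prints 3.1
```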
Maybe this is relevant? It looks like they've since pulled the affected packages from the registry, so you could try upgrading the package. Otherwise, setting KMP_INIT_AT_FORK=FALSE reportedly helped others.
That was a bullseye! Setting KMP_INIT_AT_FORK=FALSE helped me too. I'm going to look into upgrading my Anaconda install to see if that also helps.
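The simplest way to apply the workaround is to run export KMP_INIT_AT_FORK=FALSE in the shell before running fairseq-train. If you launch training from a Python script instead, a minimal sketch could set the variable only for the child process, as shown below. The helper names workaround_env and run_fairseq_train are hypothetical, and the sketch assumes fairseq-train is on PATH.

```python
import os
import subprocess


def workaround_env(base=None):
    """Return a copy of the environment with KMP_INIT_AT_FORK=FALSE set."""
    env = dict(os.environ if base is None else base)
    # Workaround reported in this thread for the Intel OpenMP fork assertion.
    env["KMP_INIT_AT_FORK"] = "FALSE"
    return env


def run_fairseq_train(args):
    """Run fairseq-train with the workaround variable set; return its exit code."""
    return subprocess.call(["fairseq-train", *args], env=workaround_env())


# Example (not executed here):
# run_fairseq_train(["./data-bin/errorsim-pw1", "--arch", "fconv", "--cpu"])
```

Passing env= to subprocess leaves the parent process's environment unchanged, so the variable affects only the training run.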