Training on the IWSLT 2014 dataset, train.py works only when CUDA_VISIBLE_DEVICES exposes a single GPU; with more than one GPU visible, it fails with the errors below.
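For reference, a minimal reproduction sketch (the flags are taken from the failing command in the traceback below; the GPU ids are placeholders for my setup):

# Works: only one GPU visible.
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 400

# Fails with the errors below: multiple GPUs visible.
CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/iwslt14.tokenized.de-en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 400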
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Traceback (most recent call last):
  File "/home/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 339, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 1 terminated with exit code 1

Traceback (most recent call last):
  File "train.py", line 11, in <module>
    cli_main()
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 326, in cli_main
    nprocs=torch.cuda.device_count(),
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 308, in distributed_main
    main(args, init_distributed=True)
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 48, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/workhub/codes/fairseq/fairseq/distributed_utils.py", line 81, in distributed_init
    raise ValueError('Cannot initialize distributed with distributed_world_size=1')
ValueError: Cannot initialize distributed with distributed_world_size=1

Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/anaconda3/bin/python', '-u', 'train.py', '--local_rank=0', 'data-bin/iwslt14.tokenized.de-en', '--lr', '0.25', '--clip-norm', '0.1', '--dropout', '0.2', '--max-tokens', '400']' returned non-zero exit status 1.
I installed NCCL following section 3.1 of https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html, but I'm not sure whether the error is related to NCCL.
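A quick way to check which NCCL build PyTorch itself sees (torch.cuda.nccl.version() should be available in CUDA builds of PyTorch):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import torch; print(torch.cuda.nccl.version())"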
If it works on one GPU, then this seems like an issue with your environment rather than with fairseq. Can you try running some of the nccl-tests (https://github.com/NVIDIA/nccl-tests) and the PyTorch distributed tests (https://github.com/pytorch/pytorch/tree/master/test/distributed)? See the sketch below.
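A typical nccl-tests run, following that repo's README (assumes CUDA and NCCL are installed system-wide; adjust -g to your GPU count):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
# All-reduce benchmark from 8 bytes to 128 MB, doubling each step, on 2 GPUs.
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2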
For what it's worth, I got the same error in completely unrelated (single-GPU) code after upgrading to PyTorch 1.5.0. Setting MKL_SERVICE_FORCE_INTEL=1 made the error message go away and the code worked, but I don't know what side effects it may have had on performance.
Related bug on pytorch repo: https://github.com/pytorch/pytorch/issues/37377
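If you want to try that workaround, a minimal sketch (the training command is the one from the traceback above; MKL_THREADING_LAYER=GNU is an alternative sometimes suggested for the same issue in the PyTorch bug linked above):

# Workaround, not a root-cause fix: force Intel's MKL threading layer
# before any worker process imports numpy/torch.
export MKL_SERVICE_FORCE_INTEL=1
fairseq-train data-bin/iwslt14.tokenized.de-en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 400

# Alternative: select the GNU threading layer instead.
export MKL_THREADING_LAYER=GNU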
@alexholdenmiller Can you confirm whether there were any side effects in terms of performance?
I didn't benchmark it carefully in this case (and note that the flag appeared to be necessary in PyTorch 1.6 as well). It would be hard to tell whether any change was due to the flag or to the new version, since both happened at once. That said, I didn't notice any drastic change in performance (positive or negative) after the version change in my code base.