Fairseq: Can not train with multi-gpus

Created on 26 Apr 2020 · 4 comments · Source: pytorch/fairseq

โ“ Questions and Help

Starting from the IWSLT 2014 dataset, train.py works fine only when CUDA_VISIBLE_DEVICES is set to a single GPU.

  • If I use multiple GPUs, either by setting CUDA_VISIBLE_DEVICES=0,1,2,3 or by omitting CUDA_VISIBLE_DEVICES from the command line entirely:
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
    Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
    Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
    Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Traceback (most recent call last):
  File "/home/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 339, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 1 terminated with exit code 1
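The mkl-service errors above are printed once per spawned worker, and the message itself suggests two workarounds. A minimal sketch of one of them (assuming, as the message states, that the conflict is with the GNU/libgomp threading layer) is to select the threading layer before anything initializes MKL, so that spawned workers inherit it:

```python
# Sketch: choose the MKL threading layer before MKL is initialized.
# This must run before numpy/torch are imported; child processes
# spawned by torch.multiprocessing then inherit the setting.
import os

os.environ.setdefault("MKL_THREADING_LAYER", "GNU")

# MKL-backed imports must come after the environment is set.
import numpy as np

print(os.environ["MKL_THREADING_LAYER"])
```

The same effect can be had from the shell with `export MKL_THREADING_LAYER=GNU` before running fairseq-train.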
  • If I use torch.distributed.launch:
    python -m torch.distributed.launch --nproc_per_node 1 train.py data-bin/iwslt14.tokenized.de-en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 400
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    cli_main()
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 326, in cli_main
    nprocs=torch.cuda.device_count(),
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 308, in distributed_main
    main(args, init_distributed=True)
  File "/workhub/codes/fairseq/fairseq_cli/train.py", line 48, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/workhub/codes/fairseq/fairseq/distributed_utils.py", line 81, in distributed_init
    raise ValueError('Cannot initialize distributed with distributed_world_size=1')
ValueError: Cannot initialize distributed with distributed_world_size=1

Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/anaconda3/bin/python', '-u', 'train.py', '--local_rank=0', 'data-bin/iwslt14.tokenized.de-en', '--lr', '0.25', '--clip-norm', '0.1', '--dropout', '0.2', '--max-tokens', '400']' returned non-zero exit status 1.
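For context, the ValueError comes from a guard in fairseq's distributed_utils.distributed_init, which refuses to set up distributed training when only one process would participate. A paraphrased sketch of that guard (not the exact fairseq source) shows the failure mode:

```python
# Paraphrased sketch of the guard seen in the traceback above
# (fairseq distributed_utils.distributed_init); not the exact source.
def distributed_init(distributed_world_size: int) -> int:
    if distributed_world_size == 1:
        raise ValueError(
            "Cannot initialize distributed with distributed_world_size=1"
        )
    # torch.distributed.init_process_group(...) would follow here
    return 0

try:
    distributed_init(1)
except ValueError as e:
    print(e)
```

In other words, with --nproc_per_node 1 the effective world size stays 1, which this code path rejects; with a single GPU, fairseq-train normally runs non-distributed and never reaches this guard.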

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I installed NCCL as shown in section 3.1 of https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html, and I'm not sure whether the error is related to NCCL.

Code

What have you tried?

What's your environment?

  • fairseq Version (0.9.0)
  • PyTorch Version (1.5.0)
  • OS (Ubuntu 18.04 Linux)
  • How you installed fairseq (pip install --editable .)
  • Python version: 3.7
  • CUDA/cuDNN version: 10.2/7.6.5
  • GPU models and configuration: Tesla K80
  • Any other relevant information: NCCL 2.6.4
Labels: needs triage, question

All 4 comments

If it works on one GPU, then it seems like an issue with your environment rather than with fairseq. Can you try running some of the nccl-tests (https://github.com/NVIDIA/nccl-tests) and also the PyTorch distributed tests (https://github.com/pytorch/pytorch/tree/master/test/distributed)?
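The suggested nccl-tests can be built and run roughly as follows. This is a sketch following the nccl-tests README; the CUDA_HOME and NCCL_HOME paths are assumptions to adjust for your installation:

```shell
# Build nccl-tests and run a basic all-reduce benchmark on one node.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr

# -b/-e: min/max message size, -f: size multiplier, -g: number of GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```

If this benchmark fails or hangs, the problem is in the NCCL/driver setup rather than in fairseq.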

For what it's worth, I got the same error in completely unrelated (single-GPU) code after upgrading to PyTorch 1.5.0. Setting MKL_SERVICE_FORCE_INTEL=1 made the error message go away and the code worked, but I don't know what side effects it may have had on performance.

Related bug on pytorch repo: https://github.com/pytorch/pytorch/issues/37377
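The workaround from the comment above can be applied either in the shell (export MKL_SERVICE_FORCE_INTEL=1) or at the very top of the entry script, before anything initializes MKL. A minimal sketch of the in-script variant:

```python
# Sketch: force the Intel threading layer despite libgomp being loaded,
# per the workaround above. Must run before numpy/torch (or anything
# else that initializes MKL) is imported.
import os

os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"

import numpy as np  # MKL initialization now proceeds without the error

print(os.environ["MKL_SERVICE_FORCE_INTEL"])
```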

@alexholdenmiller Can you confirm if there were any side effects in terms of performance?

I didn't benchmark it carefully in this case (and note that this appeared to be necessary in PyTorch 1.6 as well). It would be hard to tell whether any change came from this flag or from the new version, since both changed at once. That said, I didn't notice any drastic performance change (positive or negative) with the version changes in my code base.
