I'm a novice following the "Training a model" section of the fairseq documentation.
I have 8 GTX 1080 Ti GPUs on a single machine, so I want to use multiple GPUs.
When I run the command:
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
--lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
Although I set CUDA_VISIBLE_DEVICES=0, it actually uses only 1 GPU; everything else works fine.
I checked that torch.cuda.device_count() is 8.
Is training supposed to be data-parallel when we set CUDA_VISIBLE_DEVICES > 0? I don't know why it only uses 1 GPU on my machine.
I then tried distributed training on this single machine with 4 GPUs:
python -m torch.distributed.launch --nproc_per_node=4 \
--nnodes=1 --node_rank=0 --master_addr=localhost \
--master_port=1234 \
$(which fairseq-train) data-bin/iwslt14.tokenized.de-en \
--lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv --distributed-world-size 4
output:
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 7): env://
| initialized host pci-SYS-4028GR-TR as rank 7
| distributed init (rank 6): env://
| initialized host pci-SYS-4028GR-TR as rank 6
| distributed init (rank 5): env://
| initialized host pci-SYS-4028GR-TR as rank 5
| initialized host pci-SYS-4028GR-TR as rank 0
...
Traceback (most recent call last):
File "/home/users/bone/anaconda3/envs/ompi/bin/fairseq-train", line 10, in <module>
sys.exit(cli_main())
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq_cli/train.py", line 283, in cli_main
nprocs=torch.cuda.device_count(),
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq_cli/train.py", line 265, in distributed_main
main(args, init_distributed=True)
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq_cli/train.py", line 36, in main
args.distributed_rank = distributed_utils.distributed_init(args)
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 87, in distributed_init
dist.all_reduce(torch.rand(1).cuda())
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 900, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /home/users/junyu/pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:39, invalid argument
I then tried adding --distributed-init-method tcp://localhost:16344, or --ddp-backend no_c10d, or both; every attempt ends with the same error.
Thanks a lot for reading this far; if someone could help me it would be highly appreciated.
Python 3.6.8
torch 1.2.0
CUDA 9.0
fairseq 0.7.2
apex 0.1
Nvidia Driver: 384.130
CUDA_VISIBLE_DEVICES should contain a comma-separated list of device IDs to use. So CUDA_VISIBLE_DEVICES=4 would use the fifth GPU on your system.
If you don't set CUDA_VISIBLE_DEVICES, fairseq will use all visible GPUs automatically; there is no need to set --distributed-init-method or to use torch.distributed.launch.
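To make the semantics concrete: CUDA_VISIBLE_DEVICES is a mask, not a count. The process sees only the listed physical devices, renumbered from 0 in the order given. A minimal sketch of that renumbering (an illustration of the documented behavior, not CUDA's actual code):

```python
def visible_devices(cuda_visible_devices: str):
    """Mimic how CUDA interprets CUDA_VISIBLE_DEVICES: each entry is a
    physical device ID, and the process sees them renumbered from 0
    in the order listed."""
    physical = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    # Map the logical index (what torch.cuda sees) to the physical GPU.
    return {logical: phys for logical, phys in enumerate(physical)}

# CUDA_VISIBLE_DEVICES=0 exposes exactly one GPU (physical GPU 0 as cuda:0),
# which is why fairseq trains on a single GPU.
print(visible_devices("0"))        # {0: 0}
# CUDA_VISIBLE_DEVICES=4 exposes only the fifth GPU, seen as cuda:0.
print(visible_devices("4"))        # {0: 4}
# A comma-separated list exposes several GPUs.
print(visible_devices("0,1,2,3"))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

So to train on the first four GPUs you would run with CUDA_VISIBLE_DEVICES=0,1,2,3, and to use all eight you can simply leave the variable unset.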
Hi,
I tried distributed training on two 8-GPU machines. Using the following commands, I got the same error as Anderbone reported above:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:39, invalid argument
On the first machine,
python -m torch.distributed.launch --nproc_per_node=1 \
--master_addr=my.server.ip --master_port=12358 --nnodes=2 --node_rank=0 \
train.py data/binary/wmt16_en_de_bpe32k ...
On the second machine,
python -m torch.distributed.launch --nproc_per_node=1 \
--master_addr=my.server.ip --master_port=12358 --nnodes=2 --node_rank=1 \
train.py data/binary/wmt16_en_de_bpe32k ...
By the way, if I set nproc_per_node to 8, it starts 8 processes on each GPU, which is quite weird.
I have the same exact issue. Did you manage to find a solution?
Hi, I found the solution for distributed training: add --distributed-no-spawn to prevent the program from starting 8 jobs on each GPU. The following are the commands I used for distributed training.
For the first machine,
python -m torch.distributed.launch --nproc_per_node=8 \
--master_addr=my.server.ip --master_port=8080 --nnodes=2 --node_rank=0 \
train.py data/binary/wmt16_en_de_bpe32k --distributed-no-spawn ...
For the second one,
python -m torch.distributed.launch --nproc_per_node=8 \
--master_addr=my.server.ip --master_port=8080 --nnodes=2 --node_rank=1 \
train.py data/binary/wmt16_en_de_bpe32k --distributed-no-spawn ...
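As a side note on why omitting --distributed-no-spawn misbehaves: torch.distributed.launch already starts nproc_per_node worker processes per node and assigns each one a global rank, while fairseq's train.py would additionally spawn one process per visible GPU, multiplying the workers. The rank bookkeeping that launch performs can be sketched as follows (a simplified illustration, not the launcher's actual code):

```python
def launch_ranks(nnodes: int, nproc_per_node: int, node_rank: int):
    """Sketch of torch.distributed.launch's rank arithmetic: the world
    size is nnodes * nproc_per_node, and each local process on a node
    gets the global rank node_rank * nproc_per_node + local_rank."""
    world_size = nnodes * nproc_per_node
    ranks = [node_rank * nproc_per_node + local_rank
             for local_rank in range(nproc_per_node)]
    return world_size, ranks

# Two 8-GPU machines: 16 workers in total, ranks 0-7 on node 0
# and ranks 8-15 on node 1.
print(launch_ranks(nnodes=2, nproc_per_node=8, node_rank=0))
# (16, [0, 1, 2, 3, 4, 5, 6, 7])
print(launch_ranks(nnodes=2, nproc_per_node=8, node_rank=1))
# (16, [8, 9, 10, 11, 12, 13, 14, 15])
```

With --distributed-no-spawn, each launched process trains on exactly one GPU and the world size matches what the launcher set up, which is why the commands above work.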