I'm a novice following the "Training a model" section of the fairseq documentation.
I have 8 GTX 1080 Ti GPUs on a single machine, so I want to use multiple GPUs.
When I run the command:
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
--lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
Although I set CUDA_VISIBLE_DEVICES=0, it actually uses only 1 GPU; everything else works fine.
I checked that torch.cuda.device_count() is 8.
Is training supposed to be data-parallel when we set CUDA_VISIBLE_DEVICES > 0? I don't know why it only uses 1 GPU on my machine.
I then tried distributed training on this single machine with 4 GPUs:
python -m torch.distributed.launch --nproc_per_node=4 \
--nnodes=1 --node_rank=0 --master_addr=localhost \
--master_port=1234 \
$(which fairseq-train) data-bin/iwslt14.tokenized.de-en \
--lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv --distributed-world-size 4
output:
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 7): env://
| initialized host pci-SYS-4028GR-TR as rank 7
| distributed init (rank 6): env://
| initialized host pci-SYS-4028GR-TR as rank 6
| distributed init (rank 5): env://
| initialized host pci-SYS-4028GR-TR as rank 5
| initialized host pci-SYS-4028GR-TR as rank 0
...
Traceback (most recent call last):
File "/home/users/bone/anaconda3/envs/ompi/bin/fairseq-train", line 10, in <module>
sys.exit(cli_main())
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq_cli/train.py", line 283, in cli_main
nprocs=torch.cuda.device_count(),
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq_cli/train.py", line 265, in distributed_main
main(args, init_distributed=True)
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq_cli/train.py", line 36, in main
args.distributed_rank = distributed_utils.distributed_init(args)
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 87, in distributed_init
dist.all_reduce(torch.rand(1).cuda())
File "/home/users/bone/anaconda3/envs/ompi/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 900, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /home/users/junyu/pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:39, invalid argument
I then tried adding --distributed-init-method tcp://localhost:16344, or --ddp-backend no_c10d, or both; every attempt ends with the same error.
Thanks a lot for reading this far; if someone could help me it would be highly appreciated.
Python 3.6.8
torch 1.2.0
CUDA 9.0
fairseq 0.7.2
apex 0.1
Nvidia Driver: 384.130
CUDA_VISIBLE_DEVICES should contain a comma-separated list of device IDs to use. So CUDA_VISIBLE_DEVICES=4 would use the fifth GPU on your system.
If you don't set CUDA_VISIBLE_DEVICES, fairseq will use all visible GPUs automatically; there is no need to set --distributed-init-method or to use torch.distributed.launch.
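To make the semantics concrete: CUDA_VISIBLE_DEVICES is a mask, not a count. The process sees only the listed physical devices, renumbered from 0 in the order given. A minimal sketch of that renumbering (an illustration of the documented behavior, not CUDA's actual code):

```python
def visible_devices(cuda_visible_devices: str):
    """Mimic how CUDA interprets CUDA_VISIBLE_DEVICES: each entry is a
    physical device ID, and the process sees them renumbered from 0
    in the order listed."""
    physical = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    # Map the logical index (what torch.cuda sees) to the physical GPU.
    return {logical: phys for logical, phys in enumerate(physical)}

# CUDA_VISIBLE_DEVICES=0 exposes exactly one GPU (physical GPU 0 as cuda:0),
# which is why fairseq trains on a single GPU.
print(visible_devices("0"))        # {0: 0}
# CUDA_VISIBLE_DEVICES=4 exposes only the fifth GPU, seen as cuda:0.
print(visible_devices("4"))        # {0: 4}
# A comma-separated list exposes several GPUs.
print(visible_devices("0,1,2,3"))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

So to train on the first four GPUs you would run with CUDA_VISIBLE_DEVICES=0,1,2,3, and to use all eight you can simply leave the variable unset.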
Hi,
I tried distributed training on two 8-GPU machines. Using the following commands, I got the same error as Anderbone reported above:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:39, invalid argument
On the first machine,
python -m torch.distributed.launch --nproc_per_node=1 \
--master_addr=my.server.ip --master_port=12358 --nnodes=2 --node_rank=0 \
train.py data/binary/wmt16_en_de_bpe32k ...
On the second machine,
python -m torch.distributed.launch --nproc_per_node=1 \
--master_addr=my.server.ip --master_port=12358 --nnodes=2 --node_rank=1 \
train.py data/binary/wmt16_en_de_bpe32k ...
By the way, if I set nproc_per_node to 8, it starts 8 processes on each GPU, which is quite weird.
I have the same exact issue. Did you manage to find a solution?
Hi, I found the solution for distributed training: add --distributed-no-spawn to prevent the program from starting 8 jobs on each GPU. The following are the commands I used for distributed training.
For the first machine,
python -m torch.distributed.launch --nproc_per_node=8 \
--master_addr=my.server.ip --master_port=8080 --nnodes=2 --node_rank=0 \
train.py data/binary/wmt16_en_de_bpe32k --distributed-no-spawn ...
For the second one,
python -m torch.distributed.launch --nproc_per_node=8 \
--master_addr=my.server.ip --master_port=8080 --nnodes=2 --node_rank=1 \
train.py data/binary/wmt16_en_de_bpe32k --distributed-no-spawn ...
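As a side note on why omitting --distributed-no-spawn misbehaves: torch.distributed.launch already starts nproc_per_node worker processes per node and assigns each one a global rank, while fairseq's train.py would additionally spawn one process per visible GPU, multiplying the workers. The rank bookkeeping that launch performs can be sketched as follows (a simplified illustration, not the launcher's actual code):

```python
def launch_ranks(nnodes: int, nproc_per_node: int, node_rank: int):
    """Sketch of torch.distributed.launch's rank arithmetic: the world
    size is nnodes * nproc_per_node, and each local process on a node
    gets the global rank node_rank * nproc_per_node + local_rank."""
    world_size = nnodes * nproc_per_node
    ranks = [node_rank * nproc_per_node + local_rank
             for local_rank in range(nproc_per_node)]
    return world_size, ranks

# Two 8-GPU machines: 16 workers in total, ranks 0-7 on node 0
# and ranks 8-15 on node 1.
print(launch_ranks(nnodes=2, nproc_per_node=8, node_rank=0))
# (16, [0, 1, 2, 3, 4, 5, 6, 7])
print(launch_ranks(nnodes=2, nproc_per_node=8, node_rank=1))
# (16, [8, 9, 10, 11, 12, 13, 14, 15])
```

With --distributed-no-spawn, each launched process trains on exactly one GPU and the world size matches what the launcher set up, which is why the commands above work.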