Hi Fairseq Team,
I have just gone through the fairseq documentation about distributed training, and I have the following questions.
Does fairseq use only the NCCL library to perform distributed training? Please correct me if I'm wrong.
Is there any example available where Horovod has been used with fairseq, especially for a Neural Machine Translation task? If so, could you please share the link?
Is there any detailed documentation available on how to set up servers and workers (nodes) in order to run distributed training using fairseq with NCCL? If so, could you please share the link?
Thanks,
Jalaj
fairseq uses PyTorch's DistributedDataParallel, which uses NCCL under the hood: https://pytorch.org/docs/stable/nn.html?highlight=distributeddataparallel#torch.nn.parallel.DistributedDataParallel
Once you have set up PyTorch and NCCL properly, instructions for doing distributed training with fairseq can be found here: https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training
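Before launching fairseq itself, it can help to confirm that `torch.distributed` initializes and a collective completes. Below is a minimal single-process sketch using the CPU `gloo` backend; the address, port, rank, and world size are placeholder values, and a real two-node run would use `backend="nccl"` with `world_size=2` and the correct rank on each node:

```python
import torch
import torch.distributed as dist

# Sketch: single-process sanity check of torch.distributed.
# Placeholder init settings; adjust backend/rank/world_size for a real run.
dist.init_process_group(
    backend="gloo",                       # use "nccl" on the GPU nodes
    init_method="tcp://127.0.0.1:29500",  # master address and port
    rank=0,
    world_size=1,
)

t = torch.tensor([dist.get_rank()])
out = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(out, t)  # run one all_gather collective end to end
print(out[0].item())
dist.destroy_process_group()
```

If this hangs or errors on your nodes, the problem is in the PyTorch/NCCL setup rather than in fairseq.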
Hi fairseq team and @myleott, thanks for helping me.
I'm getting the following error when I run distributed training. I referred to the following issues to try to resolve it, but they didn't help much.
I have a simple multi-node GPU setup: 2 nodes with 1 GPU each, so 2 GPUs in total.
Log from the worker node:
Traceback (most recent call last):
File "software//fairseq-py/train.py", line 347, in <module>
distributed_main(args)
File "software/fairseq-py/distributed_train.py", line 39, in main
single_process_main(args)
File "software/fairseq-py/train.py", line 87, in main
train(args, trainer, task, epoch_itr)
File "software/fairseq-py/train.py", line 125, in train
log_output = trainer.train_step(sample, update_params=True)
File "software/fairseq-py/fairseq/trainer.py", line 137, in train_step
(sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
File "software/fairseq-py/fairseq/distributed_utils.py", line 77, in all_gather_list
torch.distributed.all_gather(out_buffers, in_buffer.cuda())
File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 439, in all_gather
return all_gather_multigpu([tensor_list], [tensor], group)
File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 413, in all_gather_multigpu
group)
RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error
CUDA 10.1
cuDNN 7.6.4
NCCL 2.4.6
PyTorch 1.1.0
NCCL environment variables
export NCCL_SOCKET_IFNAME=ens3
export NCCL_DEBUG=INFO
export NCCL_IB_CUDA_SUPPORT=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NET_GDR_LEVEL=3
export NCCL_NET_GDR_READ=0
export NCCL_SHM_DISABLE=0
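One thing worth double-checking is that the NCCL_* exports above actually reach the environment of every worker process, not just the shell where they were typed. A small sketch of such a check (the values assigned here simply mirror the exports above, for illustration):

```python
import os

# For illustration only: in a real run these come from the shell exports
# above, performed on every node before launching training.
os.environ["NCCL_SOCKET_IFNAME"] = "ens3"
os.environ["NCCL_DEBUG"] = "INFO"

def nccl_env():
    """Return all NCCL_* variables visible to this process."""
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}

# Print what NCCL will actually see in this process.
for name, value in sorted(nccl_env().items()):
    print(f"{name}={value}")
```

With NCCL_DEBUG=INFO in effect, the worker logs should also contain NCCL INFO lines that often pinpoint the underlying "unhandled system error" (e.g. a wrong network interface name).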
I ran nccl-tests with the following command and it ran perfectly: ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
As far as I can tell, the CUDA, cuDNN, and NCCL versions are compatible with each other. Is there anything I'm missing? Any help or suggestions would be appreciated.
Thanks,
Hi, are there any updates on this issue? I'm also having trouble running fairseq with torch.distributed.launch.
@xiongchenyan I will be trying out a few new experiments soon, and will update here if I can resolve this issue.