Fairseq: Need more information about distributed training

Created on 19 Sep 2019 · 4 comments · Source: pytorch/fairseq

Hi Fairseq Team,

I have just gone through the fairseq documentation about distributed training, and I have the following questions.

Does fairseq use only the NCCL library to perform distributed training? Please correct me if I'm wrong.

Is there any example available where Horovod has been used with fairseq, especially for a neural machine translation task? If yes, can you please share the link?

Is there any detailed documentation available on how to set up servers and workers (nodes) in order to run distributed training using fairseq with NCCL? If yes, can you please share the link?

Thanks,
Jalaj


All 4 comments

fairseq uses PyTorch's DistributedDataParallel, which uses NCCL under the hood: https://pytorch.org/docs/stable/nn.html?highlight=distributeddataparallel#torch.nn.parallel.DistributedDataParallel

Once you have setup PyTorch and NCCL properly, instructions for doing distributed training with fairseq can be found here: https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training
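For reference, a two-node launch following that guide might look like the sketch below. The hostnames, port, data path, and model flags here are placeholders, not values from this thread; the key point is that every node runs the same command except for `--distributed-rank`:

```shell
# On node 0 (which also acts as the rendezvous master):
python train.py data-bin/my-dataset \
  --arch transformer --max-tokens 4000 \
  --distributed-world-size 2 \
  --distributed-init-method tcp://node0:12345 \
  --distributed-rank 0

# On node 1, the same command with only the rank changed:
python train.py data-bin/my-dataset \
  --arch transformer --max-tokens 4000 \
  --distributed-world-size 2 \
  --distributed-init-method tcp://node0:12345 \
  --distributed-rank 1
```

With one GPU per node, `--distributed-world-size` equals the number of nodes; with multiple GPUs per node you would also pass a `--distributed-port` or use the per-GPU rank scheme described in the linked docs.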

Hi Fairseq team and @myleott Thanks for helping me.

I'm getting the following error when I run distributed training. I have referred to related issues to try to resolve it, but they didn't help much.

I have a simple multi-node GPU setup: 2 nodes in total, with 1 GPU on each node, so 2 GPUs overall.

Log on the worker node:

Traceback (most recent call last):
  File "software//fairseq-py/train.py", line 347, in <module>
    distributed_main(args)
  File "software/fairseq-py/distributed_train.py", line 39, in main
    single_process_main(args)
  File "software/fairseq-py/train.py", line 87, in main
    train(args, trainer, task, epoch_itr)
  File "software/fairseq-py/train.py", line 125, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "software/fairseq-py/fairseq/trainer.py", line 137, in train_step
    (sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
  File "software/fairseq-py/fairseq/distributed_utils.py", line 77, in all_gather_list
    torch.distributed.all_gather(out_buffers, in_buffer.cuda())
  File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 439, in all_gather
    return all_gather_multigpu([tensor_list], [tensor], group)
  File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 413, in all_gather_multigpu
    group)
RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error

CUDA 10.1
cuDNN 7.6.4
NCCL 2.4.6
PyTorch 1.1.0

NCCL environment variables

export NCCL_SOCKET_IFNAME=ens3
export NCCL_DEBUG=INFO
export NCCL_IB_CUDA_SUPPORT=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NET_GDR_LEVEL=3
export NCCL_NET_GDR_READ=0
export NCCL_SHM_DISABLE=0

I have run nccl-tests with the following command and it runs perfectly: ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

As far as I can tell, the CUDA, cuDNN, and NCCL versions are compatible with each other. Is there anything I'm missing? Any help or suggestions would be appreciated.
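One thing worth noting: the nccl-tests command above uses `-g 1`, i.e. a single local GPU in a single process, so it does not exercise the network path between the two nodes at all. A way to isolate whether the problem is NCCL itself or the rendezvous/networking setup is to run the same collective through PyTorch's CPU-based `gloo` backend, which uses plain TCP. This is only a sketch (the port and rank values are placeholders): run one copy per node with the matching rank and a `tcp://<master-ip>:<port>` init method, or run it as a single-process world as a local smoke test.

```python
import torch
import torch.distributed as dist

def all_gather_smoke_test(rank, world_size, init_method, backend="gloo"):
    """Minimal torch.distributed all_gather check, mirroring the call
    that fails in fairseq's all_gather_list, but on the CPU-only gloo
    backend so GPU/NCCL problems are taken out of the picture."""
    dist.init_process_group(backend=backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    tensor = torch.full((4,), float(rank))            # each rank contributes its id
    out = [torch.zeros(4) for _ in range(world_size)] # one slot per rank
    dist.all_gather(out, tensor)                      # the collective under test
    dist.destroy_process_group()
    return out

if __name__ == "__main__":
    # Single-process smoke test; for the real two-node check, run one
    # copy per node with rank 0/1, world_size=2, and the master node's
    # address in init_method.
    result = all_gather_smoke_test(rank=0, world_size=1,
                                   init_method="tcp://127.0.0.1:29500")
    print(result)
```

If this succeeds across both nodes while NCCL still fails, the problem is likely NCCL-specific (e.g. the interface named in NCCL_SOCKET_IFNAME); if gloo also hangs or errors, the rendezvous address/port or firewall is the more likely culprit.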

Thanks,

Hi, are there any updates on this issue? I'm also having trouble running fairseq with torch.distributed.launch.

@xiongchenyan I will be trying out a few new experiments soon; I will update here if I can resolve this issue.
