This is basically reopening this issue since I'm experiencing the same bug.
When I run the following code on two separate machines, the distributed_utils.all_gather_list call just hangs and never returns. The GPUs sit at 100% utilization, but nothing happens.
NODE_RANK=0
MASTER_IP=14.92.25.162
MASTER_PORT=1234
DATA_DIR=/data/wikitext-10
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=16 # Max sequence length
MAX_POSITIONS=16 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=4 # Number of sequences per batch (batch size)
UPDATE_FREQ=2 # Accumulate gradients to increase the effective batch size 2x
fairseq-train --fp16 $DATA_DIR \
--distributed-world-size 4 \
--distributed-rank $NODE_RANK \
--distributed-init-method 'tcp://'$MASTER_IP':'$MASTER_PORT \
--task masked_lm \
--criterion masked_lm \
--arch roberta_large \
--sample-break-mode complete \
--tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay \
--lr $PEAK_LR \
--warmup-updates $WARMUP_UPDATES \
--total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --no-progress-bar
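(For reference, the effective batch size here works out to MAX_SENTENCES x UPDATE_FREQ x world size = 4 x 2 x 4 = 32 sequences per update, each at most TOKENS_PER_SAMPLE = 16 tokens, i.e. up to 512 tokens per update.)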
I've traced it to a call to item() in fairseq.distributed_utils.
When I reproduce it _on the same machine_, i.e. running the script twice on the same machine (setting the GPUs accordingly), the training works.
The all_gather_list call returns.
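(For anyone trying to isolate this: a minimal two-node sanity check along the lines below, which mirrors the hanging collective, can tell you whether the problem is in fairseq or in the underlying NCCL setup. This is a sketch, not fairseq code; the address and port are the placeholders from my script above.)

# minimal_allgather_check.py -- a sketch for isolating multi-node hangs.
# Run once per node, e.g.
#   node 0: python minimal_allgather_check.py --rank 0
#   node 1: python minimal_allgather_check.py --rank 1
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--rank', type=int, required=True)
parser.add_argument('--world-size', type=int, default=2)
parser.add_argument('--init', default='tcp://14.92.25.162:1234')
args = parser.parse_args()

dist.init_process_group(backend='nccl', init_method=args.init,
                        world_size=args.world_size, rank=args.rank)

# all_gather blocks here if the ranks cannot reach each other,
# mirroring the all_gather_list hang inside fairseq.
local = torch.full((1,), float(args.rank), device='cuda')
gathered = [torch.zeros(1, device='cuda') for _ in range(args.world_size)]
dist.all_gather(gathered, local)
print('rank', args.rank, 'sees', [t.item() for t in gathered])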
Environment
fairseq Version: 0.9.0
PyTorch Version: 1.4.0
OS: CentOS Linux release 7.6.1810
How you installed fairseq: pip
Python version: 3.6.8
CUDA/cuDNN version: Cuda compilation tools, release 10.2, V10.2.89
GPU models and configuration: V100s across 2 machines
Are the two machines able to communicate with each other properly? You say it's 4 GPUs (--distributed-world-size=4) and "V100s across 2 machines". Does that mean you have 2 nodes, each with 2 GPUs?
Can you try a simpler training command to start? You'll need to install fairseq from source and then try something like:
python train.py --fp16 --ddp-backend=no_c10d \
--no-save --disable-validation \
--task dummy_masked_lm --dataset-size 100000 --dict-size 49995 \
--arch dummy_model \
--criterion masked_lm \
--optimizer adam --lr 1e-4 \
--tokens-per-sample 128 \
--max-sentences 4 \
--log-format json --log-interval 10 \
--max-epoch 1
Hi, and thanks for your answer, @myleott!
Does that mean you have 2 nodes each with 2 GPUs?
Yes, exactly.
Can you try a simpler training command to start? You'll need to install fairseq from source and then try something like
Unfortunately, I don't have the permissions to be able to install from source.
I could, however, run the script from the PyTorch tutorial, and it trained across both machines, which to me says the machines can communicate with each other?
I also tried to set export NCCL_LL_THRESHOLD=0 but it still hangs.
I get a similar behaviour when trying to use the gloo backend instead.
Unfortunately, I don't have the permissions to be able to install from source.
It should work if you do: pip install --editable --user .
I still can't install with the --user flag, but I think it is installed as editable. Using the code above, I'm getting the following error:
-- Process 8 terminated with the following error:
...
...
site-packages/torch/nn/functional.py", line 2094, in nll_loss
.format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (512) to match target batch_size (72).
However, using criterion=cross_entropy it starts training, but when running distributed across the two machines I get RuntimeError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=no_c10d.
I already had the no_c10d flag set; removing it puts me back in the dead hang I described in my original post.
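(For context, that error is a cross-worker consistency check: after the gradient all-reduce, every worker should hold identical gradients, so fairseq compares gradient norms across workers and aborts on a mismatch. Conceptually it does something like the sketch below; this is an illustration of the idea, not fairseq's actual code.)

import torch
import torch.distributed as dist

def check_grad_norms(model, world_size):
    # Each worker computes its local gradient norm...
    local_norm = torch.sqrt(sum(
        p.grad.norm() ** 2 for p in model.parameters() if p.grad is not None
    )).cuda()
    # ...and gathers everyone else's norm.
    norms = [torch.zeros_like(local_norm) for _ in range(world_size)]
    dist.all_gather(norms, local_norm)
    # After a correct all-reduce the gradients (and hence the norms) must
    # match on every worker; a mismatch means the workers have diverged.
    if not all(torch.allclose(n, norms[0]) for n in norms):
        raise RuntimeError(
            'Fatal error: gradients are inconsistent between workers.')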
I can't use the solution here either: then I get AttributeError: 'FairseqBMUF' object has no attribute 'scaler' (only a problem when using --fp16).
Also, all nccl-tests ran successfully across the two machines, so the collective operations themselves work.
Using the gloo backend also gives the gradients are inconsistent between workers error.
Thank you! :D
UPDATE:
Using the combination
--distributed-backend nccl \
--use-bmuf \
--ddp-backend=no_c10d \
and _removing_ the flag --fp16 got it training!
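For reference, that means invoking the same script as in my first post with something like:
fairseq-train $DATA_DIR \
    --distributed-backend nccl \
    --use-bmuf \
    --ddp-backend=no_c10d \
    ... # all other flags as before, but without --fp16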
What are the implications of using bmuf? And not fp16?
I'm thinking something might still be weird as the other combinations should work, right?
I am facing the same issue, and the NCCL_DEBUG=INFO trace is the same as in this PyTorch forum discussion and another.
Reinstalling torch did not help.
Edit: It worked with local instances (2 RTX machines with 4 GPUs each) but not with private cloud instances (2 RTX machines with 8 GPUs each). I used the following flags, but they did not help:
export NCCL_P2P_LEVEL=2      # restrict P2P to closer topology levels (moot when P2P is disabled below)
export NCCL_P2P_DISABLE=1    # disable GPU peer-to-peer transport entirely
export NCCL_DEBUG=INFO       # verbose NCCL logging
export NCCL_DEBUG_SUBSYS=ALL # log all NCCL subsystems
Please suggest any way to debug this, if possible.
Seems like a connection issue, so this is not the right place to ask. Thanks.
What are the implications of using bmuf? And not fp16?
FYI, you probably don't want to use BMUF for general training.
By default fairseq implements synchronous distributed SGD training (a.k.a. distributed data parallel). BMUF is something different. It stands for Blockwise Model-Update Filtering and is described here.
You should generally prefer the default options, or use --ddp-backend=no_c10d if you run into inconsistent gradient issues.
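(For the curious: in BMUF each worker trains independently for a block of steps, after which the global model is updated with a momentum-filtered average of the workers' models. A rough sketch of the block-level update as I read the paper; the constants are illustrative and this is not fairseq's FairseqBMUF implementation:)

import torch

def bmuf_block_update(global_params, worker_params, deltas,
                      block_momentum=0.875, block_lr=1.0):
    # worker_params: one list of parameter tensors per worker, taken after
    # each worker has run a block of independent local optimizer steps.
    avg = [torch.stack(ps).mean(dim=0) for ps in zip(*worker_params)]
    new_global, new_deltas = [], []
    for g, a, d in zip(global_params, avg, deltas):
        update = a - g                              # block-level model update
        d = block_momentum * d + block_lr * update  # momentum filtering
        new_global.append(g + d)
        new_deltas.append(d)
    return new_global, new_deltas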
Edit : It worked with local instances (2 RTX with 4 GPU's each) but not with private cloud instances (2 RTX with 8GPU's each)
Sorry, this is probably unrelated to fairseq. I recommend running Nvidia's nccl-tests whenever working with a new environment, to ensure that the nodes are able to communicate properly.
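For example (host names and process counts below are placeholders; see the nccl-tests README for build details, and build with MPI support for the multi-node run):
git clone https://github.com/NVIDIA/nccl-tests && cd nccl-tests && make
# single node, 8 GPUs:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# two nodes via MPI, one process per GPU:
mpirun -np 16 -H node1:8,node2:8 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1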
@myleott I got RuntimeError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=no_c10d. even after adding that flag to my training script. Would you please give me some advice?
@myleott same issue as @SystemErrorWang
I got the same problem; in my case the gradients were NaN for some reason, but I am not sure why. Unfortunately I couldn't reproduce it to investigate further. Since this is not caught by the earlier NaN checks, the problem may be the division by the number of samples, but I am not sure a zero is possible there...