I am pretraining RoBERTa on a decently sized setup (40 Tesla V100 32GB GPUs spread across 4 nodes) on about 100GB of text data. I can train for about 1 epoch on this data before the loss starts increasing significantly. My question is: is this expected for a setup of this size/configuration, or am I perhaps running things incorrectly?
I am running the following command on all four nodes (NODE_RANK is set to 0, 10, 20 or 30 on each node):
fairseq-train $DATA_DIR \
--distributed-world-size 40 \
--distributed-rank $NODE_RANK \
--distributed-backend nccl \
--use-bmuf \
--ddp-backend=no_c10d \
--distributed-init-method 'tcp://'$MASTER_IP':'$MASTER_PORT \
--task masked_lm \
--criterion masked_lm \
--arch roberta_large \
--sample-break-mode none \
--max-tokens 7000 \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay \
--lr 0.0004 \
--warmup-updates 30000 \
--total-num-update 500000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--update-freq 28 \
--save-interval-updates 1000 \
--skip-invalid-size-inputs-valid-test \
--max-update 500000 --log-format simple --log-interval 1 --no-progress-bar
I have looked through similar issues in this repo, where the general advice seems to be that this kind of thing can happen with small batch sizes. However, my batch size should be fairly large, since I'm almost fully allocating all 40 GPU cards (and also accumulating gradients via --update-freq). I do, however, get this curious output:
| epoch 002: 2040 / 3958 loss=8.783, nll_loss=8.783, ppl=440.58, wps=8344, ups=0, wpb=202347.618, bsz=395.210, num_updates=5999, lr=7.99867e-05, gnorm=33.638, clip=0.000, oom=0.000, wall=145875, train_wall=141009
I find it strange that the bsz output is that low. I'm assuming that the output specifies a per-card batch size?
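For reference, here is the back-of-the-envelope arithmetic I would have expected for the effective batch under these flags (a rough sketch; --max-tokens is only a cap and I'm assuming the usual 512 tokens per sample, so these are upper bounds):

# Rough upper bound on the effective batch per optimizer step under the flags above.
max_tokens_per_gpu = 7000   # --max-tokens (a cap, real batches can be smaller)
world_size = 40             # --distributed-world-size
update_freq = 28            # --update-freq (gradient accumulation steps)
tokens_per_sample = 512     # assumed RoBERTa sequence length

tokens_per_update = max_tokens_per_gpu * world_size * update_freq
sequences_per_update = tokens_per_update / tokens_per_sample

print(f"{tokens_per_update:,} tokens per update")           # 7,840,000
print(f"{sequences_per_update:,.0f} sequences per update")  # ~15,312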
You should remove --use-bmuf. That flag enables Blockwise Model-Update Filtering (BMUF), which you don't want. That may also be why your total batch size is so small (bsz=395.210).
Also, you can remove --ddp-backend=no_c10d, it should be faster with the default value of --ddp-backend=c10d.
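In case it helps to see the difference: with plain c10d DDP, every worker's gradients are averaged on every single step, conceptually like the minimal PyTorch sketch below (illustrative only, not fairseq's actual trainer code). BMUF, roughly speaking, instead lets each worker run locally and only synchronizes model updates periodically with block-wise momentum.

from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, optimizer, criterion, batch):
    # Each worker computes gradients on its own shard of the global batch;
    # DDP all-reduces (averages) them during backward(), so all 40 workers
    # apply the same synchronous update on every step.
    optimizer.zero_grad()
    loss = criterion(model(batch["input"]), batch["target"])
    loss.backward()   # gradients are averaged across workers here
    optimizer.step()
    return loss.item()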
I find it strange that the bsz output is that low. I'm assuming that the output specifies a per-card batch size?
No, the bsz counter is the cumulative batch size across all workers. It may be wrong because you're using BMUF.
Thanks for the quick response. I am currently re-running training with --use-bmuf removed and --ddp-backend=c10d. I'll know if this solves my loss increase issue sometime tomorrow.
I left pretraining running overnight (also adding the --fp16 flag), and sure enough, halfway through epoch 2 the losses start increasing again. Eventually, I get this error:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/usr/local/lib64/python3.6/site-packages/fairseq_cli/train.py", line 296, in distributed_main
main(args, init_distributed=True)
File "/usr/local/lib64/python3.6/site-packages/fairseq_cli/train.py", line 86, in main
train(args, trainer, task, epoch_itr)
File "/usr/local/lib64/python3.6/site-packages/fairseq_cli/train.py", line 127, in train
log_output = trainer.train_step(samples)
File "/usr/local/lib64/python3.6/site-packages/fairseq/trainer.py", line 433, in train_step
grad_norm = self.optimizer.clip_grad_norm(self.args.clip_norm)
File "/usr/local/lib64/python3.6/site-packages/fairseq/optim/fp16_optimizer.py", line 146, in clip_grad_norm
).format(self.min_loss_scale))
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
Here is some training output that precedes the stack trace:
...
| epoch 002: 3703 / 3958 loss=8.820, nll_loss=8.820, ppl=452.03, wps=574318, ups=0, wpb=4587519.613, bsz=8959.999, num_updates=7646, lr=0.000101947, gnorm=1.368, clip=0.000, oom=0.000, loss_scale=0.002, wall=61221, train_wall=57358
| WARNING: overflow detected, setting loss scale to: 0.0009765625
| epoch 002: 3704 / 3958 loss=8.820, nll_loss=8.820, ppl=452.03, wps=574318, ups=0, wpb=4587519.613, bsz=8959.999, num_updates=7646, lr=0.000101947, gnorm=1.368, clip=0.000, oom=0.000, loss_scale=0.002, wall=61221, train_wall=57358
...
Note that my batch size is 8960 and that the learning rate is well below the peak learning rate in the original RoBERTa paper. I shouldn't really need gradient clipping here, right?
As per other issues in this repo, I'll try reloading the model from a checkpoint and using a different seed. However, those issues mostly seem to involve more modest setups where batch sizes are limited. Is it reasonable to see these kinds of loss explosions on bigger setups as well, or is something else going on here?
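For reference, my understanding of what the fp16 dynamic loss scaler is doing when it hits that floor is roughly the following (a simplified sketch, not fairseq's actual fp16_optimizer code; the init/min values are the defaults I believe I'm running with):

class DynamicLossScaler:
    # Scale the loss up so small fp16 gradients don't underflow; on an
    # overflow (inf/nan gradients) skip the update and halve the scale.
    # The real scaler also grows the scale again after a run of clean steps.
    def __init__(self, init_scale=128.0, min_scale=1e-4):
        self.scale = init_scale
        self.min_scale = min_scale

    def check_overflow(self, overflow_detected: bool):
        if overflow_detected:
            self.scale /= 2.0   # "WARNING: overflow detected, setting loss scale to: ..."
            if self.scale < self.min_scale:
                raise FloatingPointError(
                    f"Minimum loss scale reached ({self.min_scale}). "
                    "Your loss is probably exploding."
                )

So the repeated overflows mean the gradients keep blowing up no matter how small the scale gets, which is why the error fires.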
Using a different seed didn't work, so I caved and went with a lower peak learning rate on Friday. Still running, and things seem OK.
The learning rate often needs to be adjusted depending on the dataset. Your batch size is now actually a bit larger than the original RoBERTa one (bsz=8960), so that's good.
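To make that concrete, the schedule your flags describe is linear warmup to the peak LR over 30k updates, then polynomial decay, roughly like this (a sketch, not fairseq's exact polynomial_decay scheduler):

def lr_at(step, peak_lr=4e-4, warmup_updates=30000, total_updates=500000,
          end_lr=0.0, power=1.0):
    # Linear warmup from 0 to peak_lr, then polynomial (power=1 -> linear)
    # decay down to end_lr at total_updates.
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    remaining = (total_updates - step) / (total_updates - warmup_updates)
    return end_lr + (peak_lr - end_lr) * max(0.0, remaining) ** power

print(lr_at(7646))   # ≈ 0.000101947, matching the lr in your log at num_updates=7646

So when the run blew up you were still only about a quarter of the way through warmup; lowering the peak LR just flattens the whole curve.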
Using a different seed didn't work, so I caved and went with a lower peak learning rate on Friday. Still running, and things seem OK.
Great! A lower learning rate is fine, you just want it to be fairly large.