I am pretraining RoBERTa on a decently sized setup (40 Tesla V100 32GB GPUs spread across 4 nodes) on about 100GB of text data. I can train for about 1 epoch on this data before the loss starts increasing significantly. My question is: is this expected for a setup of this size/configuration, or am I perhaps running things incorrectly?
I am running the following command on all four nodes (NODE_RANK is set to 0, 10, 20 or 30 on each node):
fairseq-train $DATA_DIR \
--distributed-world-size 40 \
--distributed-rank $NODE_RANK \
--distributed-backend nccl \
--use-bmuf \
--ddp-backend=no_c10d \
--distributed-init-method 'tcp://'$MASTER_IP':'$MASTER_PORT \
--task masked_lm \
--criterion masked_lm \
--arch roberta_large \
--sample-break-mode none \
--max-tokens 7000 \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay \
--lr 0.0004 \
--warmup-updates 30000 \
--total-num-update 500000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--update-freq 28 \
--save-interval-updates 1000 \
--skip-invalid-size-inputs-valid-test \
--max-update 500000 --log-format simple --log-interval 1 --no-progress-bar
I have looked through similar issues in this repo, where the general advice seems to be that this kind of thing can happen with small batch sizes. However, my batch size should be fairly large, since I'm almost fully allocating all 40 GPU cards (and also accumulating gradients via --update-freq). I do, however, get this curious output:
| epoch 002: 2040 / 3958 loss=8.783, nll_loss=8.783, ppl=440.58, wps=8344, ups=0, wpb=202347.618, bsz=395.210, num_updates=5999, lr=7.99867e-05, gnorm=33.638, clip=0.000, oom=0.000, wall=145875, train_wall=141009
I find it strange that the bsz output is that low. I'm assuming that the output specifies a per-card batch size?
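For reference, here is the back-of-the-envelope arithmetic I would have expected for the effective batch under these flags (a rough sketch; --max-tokens is only a cap and I'm assuming the usual 512 tokens per sample, so these are upper bounds):

# Rough upper bound on the effective batch per optimizer step under the flags above.
max_tokens_per_gpu = 7000   # --max-tokens (a cap, real batches can be smaller)
world_size = 40             # --distributed-world-size
update_freq = 28            # --update-freq (gradient accumulation steps)
tokens_per_sample = 512     # assumed RoBERTa sequence length

tokens_per_update = max_tokens_per_gpu * world_size * update_freq
sequences_per_update = tokens_per_update / tokens_per_sample

print(f"{tokens_per_update:,} tokens per update")           # 7,840,000
print(f"{sequences_per_update:,.0f} sequences per update")  # ~15,312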
You should remove --use-bmuf. That flag enables Blockwise Model-Update Filtering (BMUF), which you don't want. That may also be why your total batch size is so small (bsz=395.210).
Also, you can remove --ddp-backend=no_c10d, it should be faster with the default value of --ddp-backend=c10d.
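In case it helps to see the difference: with plain c10d DDP, every worker's gradients are averaged on every single step, conceptually like the minimal PyTorch sketch below (illustrative only, not fairseq's actual trainer code). BMUF, roughly speaking, instead lets each worker run locally and only synchronizes model updates periodically with block-wise momentum.

from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, optimizer, criterion, batch):
    # Each worker computes gradients on its own shard of the global batch;
    # DDP all-reduces (averages) them during backward(), so all 40 workers
    # apply the same synchronous update on every step.
    optimizer.zero_grad()
    loss = criterion(model(batch["input"]), batch["target"])
    loss.backward()   # gradients are averaged across workers here
    optimizer.step()
    return loss.item()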
I find it strange that the bsz output is that low. I'm assuming that the output specifies a per-card batch size?
No, the bsz counter is the cumulative batch size across all workers. It may be wrong because you're using BMUF.
Thanks for the quick response. I am currently re-running training with --use-bmuf removed and --ddp-backend=c10d. I'll know if this solves my loss increase issue sometime tomorrow.
I left pretraining running overnight (also adding the --fp16 flag), and sure enough, halfway through epoch 2 the losses start increasing again. Eventually, I get this error:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/usr/local/lib64/python3.6/site-packages/fairseq_cli/train.py", line 296, in distributed_main
main(args, init_distributed=True)
File "/usr/local/lib64/python3.6/site-packages/fairseq_cli/train.py", line 86, in main
train(args, trainer, task, epoch_itr)
File "/usr/local/lib64/python3.6/site-packages/fairseq_cli/train.py", line 127, in train
log_output = trainer.train_step(samples)
File "/usr/local/lib64/python3.6/site-packages/fairseq/trainer.py", line 433, in train_step
grad_norm = self.optimizer.clip_grad_norm(self.args.clip_norm)
File "/usr/local/lib64/python3.6/site-packages/fairseq/optim/fp16_optimizer.py", line 146, in clip_grad_norm
).format(self.min_loss_scale))
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
Here is some training output that precedes the stack trace:
...
| epoch 002: 3703 / 3958 loss=8.820, nll_loss=8.820, ppl=452.03, wps=574318, ups=0, wpb=4587519.613, bsz=8959.999, num_updates=7646, lr=0.000101947, gnorm=1.368, clip=0.000, oom=0.000, loss_scale=0.002, wall=61221, train_wall=57358
| WARNING: overflow detected, setting loss scale to: 0.0009765625
| epoch 002: 3704 / 3958 loss=8.820, nll_loss=8.820, ppl=452.03, wps=574318, ups=0, wpb=4587519.613, bsz=8959.999, num_updates=7646, lr=0.000101947, gnorm=1.368, clip=0.000, oom=0.000, loss_scale=0.002, wall=61221, train_wall=57358
...
Note that my batch size is 8960 and that the learning rate is well below the peak learning rate in the original RoBERTa paper. I shouldn't really need gradient clipping here, right?
As per other issues in this repo, I'll try reloading the model from a checkpoint and using a different seed. However, those issues mostly seem to involve more modest setups where batch sizes are limited. Is it reasonable to see these kinds of loss explosions on bigger setups as well, or is something else going on here?
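For reference, my understanding of what the fp16 dynamic loss scaler is doing when it hits that floor is roughly the following (a simplified sketch, not fairseq's actual fp16_optimizer code; the init/min values are the defaults I believe I'm running with):

class DynamicLossScaler:
    # Scale the loss up so small fp16 gradients don't underflow; on an
    # overflow (inf/nan gradients) skip the update and halve the scale.
    # The real scaler also grows the scale again after a run of clean steps.
    def __init__(self, init_scale=128.0, min_scale=1e-4):
        self.scale = init_scale
        self.min_scale = min_scale

    def check_overflow(self, overflow_detected: bool):
        if overflow_detected:
            self.scale /= 2.0   # "WARNING: overflow detected, setting loss scale to: ..."
            if self.scale < self.min_scale:
                raise FloatingPointError(
                    f"Minimum loss scale reached ({self.min_scale}). "
                    "Your loss is probably exploding."
                )

So the repeated overflows mean the gradients keep blowing up no matter how small the scale gets, which is why the error fires.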
Using a different seed didn't work, so I caved and went with a lower peak learning rate on Friday. Still running, and things seem OK.
The learning rate often needs to be adjusted depending on the dataset. Your batch size is now actually a bit larger than the original RoBERTa one (bsz=8960), so that's good.
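To make that concrete, the schedule your flags describe is linear warmup to the peak LR over 30k updates, then polynomial decay, roughly like this (a sketch, not fairseq's exact polynomial_decay scheduler):

def lr_at(step, peak_lr=4e-4, warmup_updates=30000, total_updates=500000,
          end_lr=0.0, power=1.0):
    # Linear warmup from 0 to peak_lr, then polynomial (power=1 -> linear)
    # decay down to end_lr at total_updates.
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    remaining = (total_updates - step) / (total_updates - warmup_updates)
    return end_lr + (peak_lr - end_lr) * max(0.0, remaining) ** power

print(lr_at(7646))   # ≈ 0.000101947, matching the lr in your log at num_updates=7646

So when the run blew up you were still only about a quarter of the way through warmup; lowering the peak LR just flattens the whole curve.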
Using a different seed didn't work, so I caved and went with a lower peak learning rate on Friday. Still running, and things seem OK.
Great! A lower learning rate is fine, you just want it to be fairly large.