Fairseq: Is my training when finetuning RoBERTa normal?

Created on 8 Aug 2019 · 15 comments · Source: pytorch/fairseq

Hi, my custom sentence-pair classification task behaves strangely when I try to finetune RoBERTa. I followed the official instructions in finetune_custom_classification.md. The mini-batch accuracy is only about 72% after 4.5 epochs, and the training loss does not change at all.
Below is part of the training log.

| epoch 004:  60%|6| 11710/19494 [9:13:13<6:22:36,  2.95s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.335, bsz=63.999, num_updates=70192, lr=4.39121e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11711/19494 [9:13:16<6:07:54,  2.84s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.324, bsz=63.999, num_updates=70193, lr=4.39117e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11712/19494 [9:13:18<5:57:49,  2.76s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.301, bsz=63.999, num_updates=70194, lr=4.39113e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11713/19494 [9:13:22<6:07:35,  2.83s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.305, bsz=63.999, num_updates=70195, lr=4.3911e-06, gnorm=2.331, clip=0.000, oom=0.000
| epoch 004:  60%|6| 11714/19494 [9:13:24<6:12:05,  2.87s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70196, lr=4.39106e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11715/19494 [9:13:27<6:15:06,  2.89s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70197, lr=4.39102e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11716/19494 [9:13:30<6:01:41,  2.79s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.299, bsz=63.999, num_updates=70198, lr=4.39098e-06, gnorm=2.331, clip=0.000, oom=0.000, wall=199126, train_wall=195055, accuracy=0.727044]

And the AUC on the test set is around 56%:

| Model | AUC of Test Set |
| ---- | ---- |
| checkpoint1.pt |0.5563589297270759|
| checkpoint_1_6000.pt | 0.5355381491151726 |
| checkpoint_1_12000.pt | 0.55602419048894259|
| checkpoint_1_18000.pt | 0.5745017964339114|
| checkpoint2.pt | 0.5630760304389548 |
| checkpoint_2_24000.pt | 0.5613800182990784 |
| checkpoint_2_30000.pt | 0.5706188212715628 |
| checkpoint_2_36000.pt | 0.5615139139943317 |
| checkpoint3.pt | 0.5755729619959384 |
| checkpoint_3_42000.pt | 0.555890294793689 |
| checkpoint_3_48000.pt | 0.5390417531409699 |
| checkpoint_3_54000.pt | 0.559014527682935 |

I tried learning rates from 5e-5 to 6e-5, and the above is the best result.

I found 9 types in the label dictionary; is that expected, given that this is just a binary classification task?

loading archive file /home/fecheng/project/fairseq/checkpoints/lr7e-6_mp150
loading archive file data/list_qp_train_en_filter.tsv/
| [input] dictionary: 50265 types
| [label] dictionary: 9 types

Below are my environment and training command:

python : 3.6.7
pytorch: 1.0
GPU: P40 22G
input_data_dir=data/list_qp_train_en_filter.tsv/
TOTAL_NUM_UPDATES=187500  # after TOTAL_NUM_UPDATES, lr will be 0
WARMUP_UPDATES=500      # 6 percent of the number of updates
LR=1e-5
NUM_CLASSES=2
BATCH_SIZE=16
max_positions=150
save_dir=checkpoints/lr${LR}_mp${max_positions}
train_log=$save_dir/train.log
mkdir -p $save_dir

CUDA_VISIBLE_DEVICES=2 python -u train.py $input_data_dir \
--restore-file models/pretrained/roberta.large/ \
--max-positions $max_positions \
--max-sentences $BATCH_SIZE \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--truncate-sequence \
--update-freq 4 \
--save-dir $save_dir \
--save-interval-updates 6000 \
--keep-interval-updates -1 \
--log-format tqdm \
--find-unused-parameters 


All 15 comments

It's hard to say what's going on without knowing more details about the task/dataset.

But a few things:

1) 9 types is okay, since the labels are also treated as a normal fairseq dictionary, so the special symbols are added to it as well. You can see the label dictionary in data/list_qp_train_en_filter.tsv/label/dict.txt (a quick way to inspect it from Python is sketched just after this list).

2) You seem to be using bsz=64 because you have --update-freq=4. Typically we have found that bsz=32 and lr=1e-5 is most stable across various tasks and datasets. Maybe give that a try?
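
For reference, here is a minimal sketch of how to inspect that label dictionary, assuming fairseq is importable and the dict.txt path above matches your data layout:

```python
# Sketch: load and print the binarized label dictionary mentioned above.
# The path is the one from this thread; adjust it to your own data dir.
from fairseq.data import Dictionary

label_dict = Dictionary.load('data/list_qp_train_en_filter.tsv/label/dict.txt')
print(len(label_dict))      # total number of types, including special symbols
print(label_dict.symbols)   # the labels plus <s>, <pad>, </s>, <unk>, etc.
```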

The dataset should be fine: it has been trained with BERT-Large and XLNet, and for both of them the test-set AUC reached above 80% within 1 or 2 epochs.

Sure, I will try a different batch size and will post more results here after the experiments finish.

Besides, I have noticed many times that the loss stays the same across different mini-batches for a long time. It's very weird. Is that expected?
| epoch 001: 84%|8| 16295/19494 [12:50:14<2:27:09, 2.76s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.866, bsz=63.999, num_updates=16295, lr=9.15535e-06, gnorm=3.938, clip=0.000, oom=0.000, wall=46226, train_wall=45365, accuracy=0.6
| epoch 001: 84%|8| 16296/19494 [12:50:16<2:25:33, 2.73s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.868, bsz=63.999, num_updates=16296, lr=9.15529e-06, gnorm=3.938, clip=0.000, oom=0.000, wall=46229, train_wall=45368, accuracy=0.6
| epoch 001: 84%|8| 16297/19494 [12:50:19<2:25:46, 2.74s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.859, bsz=63.999, num_updates=16297, lr=9.15524e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46231, train_wall=45371, accuracy=0.6
| epoch 001: 84%|8| 16298/19494 [12:50:22<2:29:06, 2.80s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.866, bsz=63.999, num_updates=16298, lr=9.15519e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46234, train_wall=45374, accuracy=0.6
| epoch 001: 84%|8| 16299/19494 [12:50:25<2:25:39, 2.74s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.855, bsz=63.999, num_updates=16299, lr=9.15513e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46237, train_wall=45376, accuracy=0.6
| epoch 001: 84%|8| 16300/19494 [12:50:27<2:26:51, 2.76s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.861, bsz=63.999, num_updates=16300, lr=9.15508e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46240, train_wall=45379, accuracy=0.6
| epoch 001: 84%|8| 16301/19494 [12:50:30<2:29:29, 2.81s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.853, bsz=63.999, num_updates=16301, lr=9.15503e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46243, train_wall=45382, accuracy=0.6
| epoch 001: 84%|8| 16302/19494 [12:50:33<2:30:13, 2.82s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.853, bsz=63.999, num_updates=16302, lr=9.15497e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46246, train_wall=45385, accuracy=0.6
| epoch 001: 84%|8| 16303/19494 [12:50:36<2:32:24, 2.87s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.861, bsz=63.999, num_updates=16303, lr=9.15492e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46249, train_wall=45388, accuracy=0.6
| epoch 001: 84%|8| 16304/19494 [12:50:39<2:27:14, 2.77s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.859, bsz=63.999, num_updates=16304, lr=9.15487e-06, gnorm=3.940, clip=0.000, oom=0.000, wall=46251, train_wall=45390, accuracy=0.6
| epoch 001: 84%|8| 16305/19494 [12:50:42<2:28:38, 2.80s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.876, bsz=63.999, num_updates=16305, lr=9.15481e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46254, train_wall=45393, accuracy=0.6
| epoch 001: 84%|8| 16306/19494 [12:50:44<2:30:54, 2.84s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.884, bsz=63.999, num_updates=16306, lr=9.15476e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46257, train_wall=45396, accuracy=0.6
| epoch 001: 84%|8| 16307/19494 [12:50:48<2:36:16, 2.94s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.904, bsz=63.999, num_updates=16307, lr=9.15471e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46260, train_wall=45399, accuracy=0.6
| epoch 001: 84%|8| 16308/19494 [12:50:50<2:35:11, 2.92s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.896, bsz=63.999, num_updates=16308, lr=9.15465e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46263, train_wall=45402, accuracy=0.6

Hmm, that's not expected. That definitely means it's not training.
Did you look at your preprocessed data? Does everything look as expected there?

I've checked train.0.bpe, train.1.bpe, and train.label.bpe, and they all look normal. I didn't check the data in the input0, input1, and label directories because that data is binary.
In the warm-up phase the initial loss is above 1.0, and now the loss has stayed at 0.869 for a long time.

Can you please try --lr-scheduler fixed --lr 1e-5 --update-freq 2 --max-sentences 16? Sorry, I don't have any other suggestions. Is this a public dataset? If so, I can take a look.

Sorry, it's not a public dataset T^T. I will follow your suggestion and will keep posting the latest results here.

I think your command has a typo:

CUDA_VISIBLE_DEVICES=2 python -u train.py $input_data_dir \
--restore-file models/pretrained/roberta.large/
(...)

--restore-file should point to a .pt file. So it's probably using a randomly initialized model instead of the RoBERTa model. Can you confirm whether you see the line loaded checkpoint (...)/model.pt (epoch 0 @ 0 updates) in your training log?
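
As a quick sanity check (a sketch, assuming the checkpoint sits at the path below), you can also load the .pt file directly and confirm it contains a model state dict:

```python
# Sketch: confirm that --restore-file points at a real fairseq checkpoint.
# The path is assumed; use wherever you downloaded roberta.large to.
import torch

state = torch.load('models/pretrained/roberta.large/model.pt', map_location='cpu')
print(list(state.keys()))    # a fairseq checkpoint should include a 'model' entry
print(len(state['model']))   # number of parameter tensors in the model state dict
```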

Hi @ngoyal2707, I tried different batch sizes and it still does not work. I am not sure whether the root issue is that I am probably using a randomly initialized model, as @myleott said. I ran an experiment on IMDB with the same parameters as the official instructions; below is the result. Is it expected?

|Epoch|Train ACC|Valid ACC|
|--|--|--|
|10|~96.5|~87.3|

I also found this information in my training log:

| model roberta_large, criterion SentencePredictionCriterion
| num. model params: 356462683 (num. trained: 356462683)
| training on 1 GPUs
| max tokens per GPU = 4400 and max sentences per GPU = 8
| no existing checkpoint found checkpoints/imdb/models/pretrained/roberta.large/
| loading train data for epoch 0

Hi @myleott, I thought it needed a directory. I see this in my log when I use --restore-file models/pretrained/roberta.large/model.pt:

| training on 1 GPUs
| max tokens per GPU = 4400 and max sentences per GPU = 16
| no existing checkpoint found checkpoints/lr1e-5_mp150/models/pretrained/roberta.large/model.pt
| loading train data for epoch 0

So the actual path of the pretrained model is ${save_dir}/${--restore-file} rather than ${--restore-file}? However, there is an issue loading the checkpoint when I use --restore-file ../../models/pretrained/roberta.large/model.pt:

| model roberta_large, criterion SentencePredictionCriterion
| num. model params: 356091995 (num. trained: 356091995)
| training on 1 GPUs
| max tokens per GPU = 4400 and max sentences per GPU = 16
Overwriting classification_heads.sentence_classification_head.dense.weight
Overwriting classification_heads.sentence_classification_head.dense.bias
Overwriting classification_heads.sentence_classification_head.out_proj.weight
Overwriting classification_heads.sentence_classification_head.out_proj.bias
Traceback (most recent call last):
  File "/home/fecheng/project/fairseq/fairseq/trainer.py", line 150, in load_checkpoint
    self.get_model().load_state_dict(state['model'], strict=True)
  File "/home/fecheng/project/fairseq/fairseq/models/fairseq_model.py", line 70, in load_state_dict
    return super().load_state_dict(state_dict, strict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaModel:
        size mismatch for decoder.sentence_encoder.embed_positions.weight: copying a param with shape torch.Size([514, 1024]) from checkpoint, the shape in current model is torch.Size([152, 1024]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 325, in <module>
    cli_main()
  File "train.py", line 321, in cli_main
    main(args)
  File "train.py", line 68, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/home/fecheng/project/fairseq/fairseq/checkpoint_utils.py", line 110, in load_checkpoint
    reset_meters=args.reset_meters,
  File "/home/fecheng/project/fairseq/fairseq/trainer.py", line 153, in load_checkpoint
    'Cannot load model parameters from checkpoint, '
Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.

I also found this information in my training log:
(...)
| no existing checkpoint found checkpoints/imdb/models/pretrained/roberta.large/

That means you're not using RoBERTa or pretraining at all -- you're just using a randomly initialized model with the BERT architecture.

So the actual path of the pretrained model is ${save_dir}/${--restore-file} rather than ${--restore-file}

It's dynamic based on whether you specify an absolute path or not: https://github.com/pytorch/fairseq/blob/832491962b30fb2164bed696e1489685a885402f/fairseq/checkpoint_utils.py#L100-L103
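
Roughly, the resolution logic boils down to something like this sketch (simplified; see the linked lines for the actual implementation):

```python
import os

# Simplified sketch of how load_checkpoint resolves --restore-file:
# an absolute path is used as-is, anything else is joined onto --save-dir.
save_dir = 'checkpoints/lr1e-5_mp150'                       # --save-dir
restore_file = 'models/pretrained/roberta.large/model.pt'   # --restore-file

if os.path.isabs(restore_file):
    checkpoint_path = restore_file
else:
    checkpoint_path = os.path.join(save_dir, restore_file)

print(checkpoint_path)  # checkpoints/lr1e-5_mp150/models/pretrained/roberta.large/model.pt
```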

I'll probably modify this code to be a bit more robust to non-absolute paths.

However, there is an issue loading the checkpoint when I use --restore-file ../../models/pretrained/roberta.large/model.pt

Yes, because you have --max-positions 150 in your command. The pretrained model expects --max-positions 512, so when you try to load the checkpoint it sees extra positional embeddings and can't load them. I can try to add a fallback that trims the unused positional embeddings, but the easiest thing is to change --max-positions=512.
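
If you do want to keep a shorter model, one workaround (just a sketch based on the shapes in the traceback above, not an official fairseq feature) is to trim the pretrained positional-embedding table before restoring:

```python
# Sketch: trim roberta.large's positional embeddings from 514 to 152 rows so a
# --max-positions 150 model can load them. Paths/output name are placeholders;
# the key name and shapes come from the traceback earlier in this thread.
import torch

state = torch.load('models/pretrained/roberta.large/model.pt', map_location='cpu')
key = 'decoder.sentence_encoder.embed_positions.weight'
state['model'][key] = state['model'][key][:152].clone()   # keep the first 152 positions
torch.save(state, 'models/pretrained/roberta.large/model_mp150.pt')
```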

Thanks @myleott, I will try it and post the result here. For tasks whose data has shorter sentences, a smaller --max-positions would still be useful because it saves training cost. 😄

Note that you can use --tokens-per-sample 150 and it will only create sequences of max length 150. --max-positions is related but slightly different -- it's the number of positional embeddings that are learned.

So --max-positions 512 --tokens-per-sample 150 should work and is probably what you want.
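
To illustrate the distinction (a rough sketch, assuming the usual fairseq convention of adding two extra rows to the positional table, which is where the 514 and 152 in the traceback come from):

```python
import torch.nn as nn

# --max-positions sizes the learned positional-embedding table;
# --tokens-per-sample only caps how long each training sequence is.
max_positions = 512       # positions the model can represent
tokens_per_sample = 150   # longest sequence actually fed to the model

embed_positions = nn.Embedding(max_positions + 2, 1024)   # +2 extra rows -> 514, as in the checkpoint
print(embed_positions.weight.shape)                        # torch.Size([514, 1024])
```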

It works now. Thanks @myleott and @ngoyal2707.

Besides, when I use --tokens-per-sample 150, there is an unrecognized-arguments issue:

usage: train.py [-h] [--no-progress-bar] [--log-interval N]
                [--log-format {json,none,simple,tqdm}]
                [--tensorboard-logdir DIR] [--tbmf-wrapper] [--seed N] [--cpu]
                [--fp16] [--memory-efficient-fp16]
                [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--min-loss-scale D]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                [--user-dir USER_DIR]
                [--criterion {cross_entropy,adaptive_loss,label_smoothed_cross_entropy,sentence_prediction,sentence_ranking,binary_cross_entropy,masked_lm,legacy_masked_lm_loss,composite_loss}]
                [--tokenizer {nltk,space,moses}]
                [--bpe {subword_nmt,gpt2,sentencepiece,fastbpe}]
                [--optimizer {nag,adadelta,adagrad,adam,adafactor,adamax,sgd}]
                [--lr-scheduler {inverse_sqrt,cosine,triangular,polynomial_decay,fixed,reduce_lr_on_plateau}]
                [--task TASK] [--num-workers N]
                [--skip-invalid-size-inputs-valid-test] [--max-tokens N]
                [--max-sentences N] [--required-batch-size-multiple N]
                [--dataset-impl FORMAT] [--train-subset SPLIT]
                [--valid-subset SPLIT] [--validate-interval N]
                [--disable-validation] [--max-tokens-valid N]
                [--max-sentences-valid N] [--curriculum N]
                [--distributed-world-size N]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}]
                [--bucket-cap-mb MB] [--fix-batches-to-gpus]
                [--find-unused-parameters] --arch ARCH [--max-epoch N]
                [--max-update N] [--clip-norm NORM] [--sentence-avg]
                [--update-freq N1,N2,...,N_K] [--lr LR_1,LR_2,...,LR_N]
                [--min-lr LR] [--use-bmuf] [--save-dir DIR]
                [--restore-file RESTORE_FILE] [--reset-dataloader]
                [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer]
                [--optimizer-overrides DICT] [--save-interval N]
                [--save-interval-updates N] [--keep-interval-updates N]
                [--keep-last-epochs N] [--no-save] [--no-epoch-checkpoints]
                [--no-last-checkpoints] [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--encoder-layers L]
                [--encoder-embed-dim H] [--encoder-ffn-embed-dim F]
                [--encoder-attention-heads A]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--encoder-normalize-before] [--dropout D]
                [--attention-dropout D] [--activation-dropout D]
                [--pooler-dropout D] [--max-positions MAX_POSITIONS]
                [--load-checkpoint-heads] [--save-predictions FILE]
                [--adam-betas B] [--adam-eps D] [--weight-decay WD]
                [--force-anneal N] [--lr-shrink LS] [--warmup-updates N]
                [--num-classes NUM_CLASSES] [--init-token INIT_TOKEN]
                [--separator-token SEPARATOR_TOKEN] [--regression-target]
                [--no-shuffle] [--truncate-sequence]
                FILE
train.py: error: unrecognized arguments: --tokens-per-sample 150

@chengfx Try --max-tokens 150?

No, you should use --task masked_lm --tokens-per-sample 150. The argument is unrecognized because you probably forgot to specify --task :) It's defined here: https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/masked_lm.py#L43-L45

--max-tokens is different. It controls the total batch size per worker. So if you use --max-tokens 300 --tokens-per-sample 150 then it'll create a batch with 2 sequences each of length 150.
