Fairseq: Is my training when finetuning RoBERTa normal?

Created on 8 Aug 2019 · 15 comments · Source: pytorch/fairseq

Hi, my custom sentence-pair classification task behaves strangely when I try to finetune RoBERTa. I followed the official instructions in finetune_custom_classification.md. The mini-batch accuracy is only about 72% after 4.5 epochs, and the training loss does not change at all.
Below is part of the training log.

| epoch 004:  60%|6| 11710/19494 [9:13:13<6:22:36,  2.95s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.335, bsz=63.999, num_updates=70192, lr=4.39121e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11711/19494 [9:13:16<6:07:54,  2.84s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.324, bsz=63.999, num_updates=70193, lr=4.39117e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11712/19494 [9:13:18<5:57:49,  2.76s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.301, bsz=63.999, num_updates=70194, lr=4.39113e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11713/19494 [9:13:22<6:07:35,  2.83s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.305, bsz=63.999, num_updates=70195, lr=4.3911e-06, gnorm=2.331, clip=0.000, oom=0.000
| epoch 004:  60%|6| 11714/19494 [9:13:24<6:12:05,  2.87s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70196, lr=4.39106e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11715/19494 [9:13:27<6:15:06,  2.89s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70197, lr=4.39102e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004:  60%|6| 11716/19494 [9:13:30<6:01:41,  2.79s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.299, bsz=63.999, num_updates=70198, lr=4.39098e-06, gnorm=2.331, clip=0.000, oom=0.000, wall=199126, train_wall=195055, accuracy=0.727044]

And the AUC on the test set is around 56%:

| Model | AUC of Test Set |
| ---- | ---- |
| checkpoint1.pt |0.5563589297270759|
| checkpoint_1_6000.pt | 0.5355381491151726 |
| checkpoint_1_12000.pt | 0.55602419048894259|
| checkpoint_1_18000.pt | 0.5745017964339114|
| checkpoint2.pt | 0.5630760304389548 |
| checkpoint_2_24000.pt | 0.5613800182990784 |
| checkpoint_2_30000.pt | 0.5706188212715628 |
| checkpoint_2_36000.pt | 0.5615139139943317 |
| checkpoint3.pt | 0.5755729619959384 |
| checkpoint_3_42000.pt | 0.555890294793689 |
| checkpoint_3_48000.pt | 0.5390417531409699 |
| checkpoint_3_54000.pt | 0.559014527682935 |

I tried learning rates from 5e-5 to 6e-5, and the above is the best result.

I found 9 types in the label dictionary; is that expected, given that this is just a binary classification task?

loading archive file /home/fecheng/project/fairseq/checkpoints/lr7e-6_mp150
loading archive file data/list_qp_train_en_filter.tsv/
| [input] dictionary: 50265 types
| [label] dictionary: 9 types

Below are my environment and training command:

python : 3.6.7
pytorch: 1.0
GPU: P40 22G
input_data_dir=data/list_qp_train_en_filter.tsv/
TOTAL_NUM_UPDATES=187500  # after TOTAL_NUM_UPDATES, lr will be 0
WARMUP_UPDATES=500      # 6 percent of the number of updates
LR=1e-5
NUM_CLASSES=2
BATCH_SIZE=16
max_positions=150
save_dir=checkpoints/lr${LR}_mp${max_positions}
train_log=$save_dir/train.log
mkdir -p $save_dir

CUDA_VISIBLE_DEVICES=2 python -u train.py $input_data_dir \
--restore-file models/pretrained/roberta.large/ \
--max-positions $max_positions \
--max-sentences $BATCH_SIZE \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--truncate-sequence \
--update-freq 4 \
--save-dir $save_dir \
--save-interval-updates 6000 \
--keep-interval-updates -1 \
--log-format tqdm \
--find-unused-parameters 


All 15 comments

It's hard to say what's going on without knowing more details about the task/dataset.

But a few things:

1) 9 types is okay, since the labels are also treated as a normal fairseq dictionary, so the special symbols are added to it as well. You can see the label dictionary in data/list_qp_train_en_filter.tsv/label/dict.txt (a quick way to inspect it from Python is sketched just after this list).

2) You seem to be using bsz=64 because you have --update-freq=4. Typically we have found that bsz=32 and lr=1e-5 is most stable across various tasks and datasets. Maybe give that a try?
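
For reference, here is a minimal sketch of how to inspect that label dictionary, assuming fairseq is importable and the dict.txt path above matches your data layout:

```python
# Sketch: load and print the binarized label dictionary mentioned above.
# The path is the one from this thread; adjust it to your own data dir.
from fairseq.data import Dictionary

label_dict = Dictionary.load('data/list_qp_train_en_filter.tsv/label/dict.txt')
print(len(label_dict))      # total number of types, including special symbols
print(label_dict.symbols)   # the labels plus <s>, <pad>, </s>, <unk>, etc.
```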

The dataset should be fine: it has been trained with BERT-Large and XLNet, and for both of them the test-set AUC reached above 80% within 1 or 2 epochs.

Sure, I will try a different batch size and will post more results here after the experiments finish.

Besides, I have noticed many times that the loss stays the same across different mini-batches for a long time. It's very weird. Is that expected?
| epoch 001: 84%|8| 16295/19494 [12:50:14<2:27:09, 2.76s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.866, bsz=63.999, num_updates=16295, lr=9.15535e-06, gnorm=3.938, clip=0.000, oom=0.000, wall=46226, train_wall=45365, accuracy=0.6
| epoch 001: 84%|8| 16296/19494 [12:50:16<2:25:33, 2.73s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.868, bsz=63.999, num_updates=16296, lr=9.15529e-06, gnorm=3.938, clip=0.000, oom=0.000, wall=46229, train_wall=45368, accuracy=0.6
| epoch 001: 84%|8| 16297/19494 [12:50:19<2:25:46, 2.74s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.859, bsz=63.999, num_updates=16297, lr=9.15524e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46231, train_wall=45371, accuracy=0.6
| epoch 001: 84%|8| 16298/19494 [12:50:22<2:29:06, 2.80s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.866, bsz=63.999, num_updates=16298, lr=9.15519e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46234, train_wall=45374, accuracy=0.6
| epoch 001: 84%|8| 16299/19494 [12:50:25<2:25:39, 2.74s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.855, bsz=63.999, num_updates=16299, lr=9.15513e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46237, train_wall=45376, accuracy=0.6
| epoch 001: 84%|8| 16300/19494 [12:50:27<2:26:51, 2.76s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.861, bsz=63.999, num_updates=16300, lr=9.15508e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46240, train_wall=45379, accuracy=0.6
| epoch 001: 84%|8| 16301/19494 [12:50:30<2:29:29, 2.81s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.853, bsz=63.999, num_updates=16301, lr=9.15503e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46243, train_wall=45382, accuracy=0.6
| epoch 001: 84%|8| 16302/19494 [12:50:33<2:30:13, 2.82s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.853, bsz=63.999, num_updates=16302, lr=9.15497e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46246, train_wall=45385, accuracy=0.6
| epoch 001: 84%|8| 16303/19494 [12:50:36<2:32:24, 2.87s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.861, bsz=63.999, num_updates=16303, lr=9.15492e-06, gnorm=3.939, clip=0.000, oom=0.000, wall=46249, train_wall=45388, accuracy=0.6
| epoch 001: 84%|8| 16304/19494 [12:50:39<2:27:14, 2.77s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.859, bsz=63.999, num_updates=16304, lr=9.15487e-06, gnorm=3.940, clip=0.000, oom=0.000, wall=46251, train_wall=45390, accuracy=0.6
| epoch 001: 84%|8| 16305/19494 [12:50:42<2:28:38, 2.80s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.876, bsz=63.999, num_updates=16305, lr=9.15481e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46254, train_wall=45393, accuracy=0.6
| epoch 001: 84%|8| 16306/19494 [12:50:44<2:30:54, 2.84s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.884, bsz=63.999, num_updates=16306, lr=9.15476e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46257, train_wall=45396, accuracy=0.6
| epoch 001: 84%|8| 16307/19494 [12:50:48<2:36:16, 2.94s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.904, bsz=63.999, num_updates=16307, lr=9.15471e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46260, train_wall=45399, accuracy=0.6
| epoch 001: 84%|8| 16308/19494 [12:50:50<2:35:11, 2.92s/it, loss=0.869, nll_loss=0.010, ppl=1.01, wps=1986, ups=0, wpb=5632.896, bsz=63.999, num_updates=16308, lr=9.15465e-06, gnorm=3.941, clip=0.000, oom=0.000, wall=46263, train_wall=45402, accuracy=0.6

Hmm, that's not expected. That definitely means it's not training.
Did you look at your preprocessed data? Does everything look as expected there?

I've checked train.0.bpe, train.1.bpe, and train.label.bpe, and they all look normal. I didn't check the data in the input0, input1, and label directories because that data is binary.
In the warm-up phase the initial loss is above 1.0, and now the loss has stayed at 0.869 for a long time.

Can you please try --lr-scheduler fixed --lr 1e-5 --update-freq 2 --max-sentences 16? Sorry, I don't have any other suggestions. Is this a public dataset? If so, I can take a look.

Sorry, it's not a public dataset T^T. I will follow your suggestion and will keep posting the latest results here.

I think your command has a typo:

CUDA_VISIBLE_DEVICES=2 python -u train.py $input_data_dir \
--restore-file models/pretrained/roberta.large/
(...)

--restore-file should point to a .pt file. So it's probably using a randomly initialized model instead of the RoBERTa model. Can you confirm whether you see the line loaded checkpoint (...)/model.pt (epoch 0 @ 0 updates) in your training log?
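
As a quick sanity check (a sketch, assuming the checkpoint sits at the path below), you can also load the .pt file directly and confirm it contains a model state dict:

```python
# Sketch: confirm that --restore-file points at a real fairseq checkpoint.
# The path is assumed; use wherever you downloaded roberta.large to.
import torch

state = torch.load('models/pretrained/roberta.large/model.pt', map_location='cpu')
print(list(state.keys()))    # a fairseq checkpoint should include a 'model' entry
print(len(state['model']))   # number of parameter tensors in the model state dict
```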

Hi @ngoyal2707, I tried different batch sizes and it still does not work. I am not sure whether the root issue is that I am probably using a randomly initialized model, as @myleott said. I ran an experiment on IMDB with the same parameters as the official instructions; below is the result. Is it expected?

|Epoch|Train ACC|Valid ACC|
|--|--|--|
|10|~96.5|~87.3|

I also found this information in my training log:

| model roberta_large, criterion SentencePredictionCriterion
| num. model params: 356462683 (num. trained: 356462683)
| training on 1 GPUs
| max tokens per GPU = 4400 and max sentences per GPU = 8
| no existing checkpoint found checkpoints/imdb/models/pretrained/roberta.large/
| loading train data for epoch 0

Hi @myleott, I thought it needed a directory. I see this in my log when I use --restore-file models/pretrained/roberta.large/model.pt:

| training on 1 GPUs
| max tokens per GPU = 4400 and max sentences per GPU = 16
| no existing checkpoint found checkpoints/lr1e-5_mp150/models/pretrained/roberta.large/model.pt
| loading train data for epoch 0

So the actual path of the pretrained model is ${save_dir}/${--restore-file} rather than ${--restore-file}? However, there is an issue loading the checkpoint when I use --restore-file ../../models/pretrained/roberta.large/model.pt:

| model roberta_large, criterion SentencePredictionCriterion
| num. model params: 356091995 (num. trained: 356091995)
| training on 1 GPUs
| max tokens per GPU = 4400 and max sentences per GPU = 16
Overwriting classification_heads.sentence_classification_head.dense.weight
Overwriting classification_heads.sentence_classification_head.dense.bias
Overwriting classification_heads.sentence_classification_head.out_proj.weight
Overwriting classification_heads.sentence_classification_head.out_proj.bias
Traceback (most recent call last):
  File "/home/fecheng/project/fairseq/fairseq/trainer.py", line 150, in load_checkpoint
    self.get_model().load_state_dict(state['model'], strict=True)
  File "/home/fecheng/project/fairseq/fairseq/models/fairseq_model.py", line 70, in load_state_dict
    return super().load_state_dict(state_dict, strict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaModel:
        size mismatch for decoder.sentence_encoder.embed_positions.weight: copying a param with shape torch.Size([514, 1024]) from checkpoint, the shape in current model is torch.Size([152, 1024]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 325, in <module>
    cli_main()
  File "train.py", line 321, in cli_main
    main(args)
  File "train.py", line 68, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/home/fecheng/project/fairseq/fairseq/checkpoint_utils.py", line 110, in load_checkpoint
    reset_meters=args.reset_meters,
  File "/home/fecheng/project/fairseq/fairseq/trainer.py", line 153, in load_checkpoint
    'Cannot load model parameters from checkpoint, '
Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.

I also found this information in my training log:
(...)
| no existing checkpoint found checkpoints/imdb/models/pretrained/roberta.large/

That means you're not using RoBERTa or pretraining at all -- you're just using a randomly initialized model with the BERT architecture.

So the actual path of the pretrained model is ${save_dir}/${--restore-file} rather than ${--restore-file}

It's dynamic based on whether you specify an absolute path or not: https://github.com/pytorch/fairseq/blob/832491962b30fb2164bed696e1489685a885402f/fairseq/checkpoint_utils.py#L100-L103
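
Roughly, the resolution logic boils down to something like this sketch (simplified; see the linked lines for the actual implementation):

```python
import os

# Simplified sketch of how load_checkpoint resolves --restore-file:
# an absolute path is used as-is, anything else is joined onto --save-dir.
save_dir = 'checkpoints/lr1e-5_mp150'                       # --save-dir
restore_file = 'models/pretrained/roberta.large/model.pt'   # --restore-file

if os.path.isabs(restore_file):
    checkpoint_path = restore_file
else:
    checkpoint_path = os.path.join(save_dir, restore_file)

print(checkpoint_path)  # checkpoints/lr1e-5_mp150/models/pretrained/roberta.large/model.pt
```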

I'll probably modify this code to be a bit more robust to non-absolute paths.

However, there is an issue loading the checkpoint when I use --restore-file ../../models/pretrained/roberta.large/model.pt

Yes, because you have --max-positions 150 in your command. The pretrained model expects --max-positions 512, so when you try to load the checkpoint it sees extra positional embeddings and can't load them. I can try to add a fallback that trims the unused positional embeddings, but the easiest thing is to change --max-positions=512.
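
If you do want to keep a shorter model, one workaround (just a sketch based on the shapes in the traceback above, not an official fairseq feature) is to trim the pretrained positional-embedding table before restoring:

```python
# Sketch: trim roberta.large's positional embeddings from 514 to 152 rows so a
# --max-positions 150 model can load them. Paths/output name are placeholders;
# the key name and shapes come from the traceback earlier in this thread.
import torch

state = torch.load('models/pretrained/roberta.large/model.pt', map_location='cpu')
key = 'decoder.sentence_encoder.embed_positions.weight'
state['model'][key] = state['model'][key][:152].clone()   # keep the first 152 positions
torch.save(state, 'models/pretrained/roberta.large/model_mp150.pt')
```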

Thanks @myleott, I will try it and post the result here. For tasks whose data has shorter sentences, a smaller --max-positions would still be useful because it saves training cost. 😄

Note that you can use --tokens-per-sample 150 and it will only create sequences of max length 150. --max-positions is related but slightly different -- it's the number of positional embeddings that are learned.

So --max-positions 512 --tokens-per-sample 150 should work and is probably what you want.
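
To illustrate the distinction (a rough sketch, assuming the usual fairseq convention of adding two extra rows to the positional table, which is where the 514 and 152 in the traceback come from):

```python
import torch.nn as nn

# --max-positions sizes the learned positional-embedding table;
# --tokens-per-sample only caps how long each training sequence is.
max_positions = 512       # positions the model can represent
tokens_per_sample = 150   # longest sequence actually fed to the model

embed_positions = nn.Embedding(max_positions + 2, 1024)   # +2 extra rows -> 514, as in the checkpoint
print(embed_positions.weight.shape)                        # torch.Size([514, 1024])
```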

It works now. Thanks @myleott and @ngoyal2707.

Besides, when I use --tokens-per-sample 150, there is an unrecognized-arguments issue:

usage: train.py [-h] [--no-progress-bar] [--log-interval N]
                [--log-format {json,none,simple,tqdm}]
                [--tensorboard-logdir DIR] [--tbmf-wrapper] [--seed N] [--cpu]
                [--fp16] [--memory-efficient-fp16]
                [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--min-loss-scale D]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                [--user-dir USER_DIR]
                [--criterion {cross_entropy,adaptive_loss,label_smoothed_cross_entropy,sentence_prediction,sentence_ranking,binary_cross_entropy,masked_lm,legacy_masked_lm_loss,composite_loss}]
                [--tokenizer {nltk,space,moses}]
                [--bpe {subword_nmt,gpt2,sentencepiece,fastbpe}]
                [--optimizer {nag,adadelta,adagrad,adam,adafactor,adamax,sgd}]
                [--lr-scheduler {inverse_sqrt,cosine,triangular,polynomial_decay,fixed,reduce_lr_on_plateau}]
                [--task TASK] [--num-workers N]
                [--skip-invalid-size-inputs-valid-test] [--max-tokens N]
                [--max-sentences N] [--required-batch-size-multiple N]
                [--dataset-impl FORMAT] [--train-subset SPLIT]
                [--valid-subset SPLIT] [--validate-interval N]
                [--disable-validation] [--max-tokens-valid N]
                [--max-sentences-valid N] [--curriculum N]
                [--distributed-world-size N]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn] [--ddp-backend {c10d,no_c10d}]
                [--bucket-cap-mb MB] [--fix-batches-to-gpus]
                [--find-unused-parameters] --arch ARCH [--max-epoch N]
                [--max-update N] [--clip-norm NORM] [--sentence-avg]
                [--update-freq N1,N2,...,N_K] [--lr LR_1,LR_2,...,LR_N]
                [--min-lr LR] [--use-bmuf] [--save-dir DIR]
                [--restore-file RESTORE_FILE] [--reset-dataloader]
                [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer]
                [--optimizer-overrides DICT] [--save-interval N]
                [--save-interval-updates N] [--keep-interval-updates N]
                [--keep-last-epochs N] [--no-save] [--no-epoch-checkpoints]
                [--no-last-checkpoints] [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--encoder-layers L]
                [--encoder-embed-dim H] [--encoder-ffn-embed-dim F]
                [--encoder-attention-heads A]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--encoder-normalize-before] [--dropout D]
                [--attention-dropout D] [--activation-dropout D]
                [--pooler-dropout D] [--max-positions MAX_POSITIONS]
                [--load-checkpoint-heads] [--save-predictions FILE]
                [--adam-betas B] [--adam-eps D] [--weight-decay WD]
                [--force-anneal N] [--lr-shrink LS] [--warmup-updates N]
                [--num-classes NUM_CLASSES] [--init-token INIT_TOKEN]
                [--separator-token SEPARATOR_TOKEN] [--regression-target]
                [--no-shuffle] [--truncate-sequence]
                FILE
train.py: error: unrecognized arguments: --tokens-per-sample 150

@chengfx Try --max-tokens 150?

No, you should use --task masked_lm --tokens-per-sample 150. The argument is unrecognized because you probably forgot to specify --task :) It's defined here: https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/masked_lm.py#L43-L45

--max-tokens is different. It controls the total batch size per worker. So if you use --max-tokens 300 --tokens-per-sample 150 then it'll create a batch with 2 sequences each of length 150.
