Transformers: XLNet-large-cased: hyper-parameters for fine-tuning on SST-2

Created on 16 Jul 2019 · 21 Comments · Source: huggingface/transformers

I tried to finetune XLNet on one of the classification tasks from GLUE (Ubuntu, GPU Titan RTX, CUDA 10.0, pytorch 1.1):

export GLUE_DIR=/path/to/glue

python ./examples/run_glue.py \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--do_train \
--do_eval \
--task_name=sst-2 \
--data_dir=${GLUE_DIR}/SST-2 \
--output_dir=./proc_data/sst-2 \
--max_seq_length=128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--max_steps=1200 \
--model_name=xlnet-large-cased \
--overwrite_output_dir \
--overwrite_cache \
--warmup_steps=120

Training and evaluation run without errors, but accuracy doesn't increase during training. I evaluated every 500 steps:

07/16/2019 22:29:30 - INFO - __main__ - Eval results
07/16/2019 22:29:30 - INFO - __main__ - acc = 0.5091743119266054

07/16/2019 22:32:16 - INFO - __main__ - Loading features from cached file glue_data/SST-2/cached_dev_xlnet-large-cased_128_sst-2 | 999/8419 [05:37<41:47, 2.96it/s]
07/16/2019 22:32:17 - INFO - __main__ - Running evaluation
07/16/2019 22:32:17 - INFO - __main__ - Num examples = 872
07/16/2019 22:32:17 - INFO - __main__ - Batch size = 8

07/16/2019 22:32:25 - INFO - __main__ - Eval results
07/16/2019 22:32:25 - INFO - __main__ - acc = 0.5091743119266054

And at the end, the same accuracy:

07/16/2019 22:33:59 - INFO - __main__ - Eval results
07/16/2019 22:33:59 - INFO - __main__ - acc = 0.5091743119266054

I see the same thing with my own classification dataset: accuracy doesn't change during training. Something seems to be wrong with fine-tuning XLNet.

wontfix

avostryakov

All 21 comments

I also tried to fine-tune XLNet base on SQuAD 2.0, but the numbers on dev are pretty bad:
Results: {'exact': 3.0405120862461046, 'f1': 6.947601433150003, 'total': 11873, 'HasAns_exact': 6.056005398110662, 'HasAns_f1': 13.881388632893048, 'HasAns_total': 5928, 'NoAns_exact': 0.0336417157275021, 'NoAns_f1': 0.0336417157275021, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

tbright17 on 17 Jul 2019

I suspect something is wrong with the evaluation code. Looking into it now.

tbright17 on 17 Jul 2019

@tbright17 Nothing is wrong with the evaluation. Accuracy and evaluation loss don't change during training. I used my own evaluation script, and I also tried the old BertAdam and OpenAIAdam optimizers, without success.
@thomwolf Can you help?

avostryakov on 17 Jul 2019

I'll take a look; I've only tested XLNet on STS-B for the moment. You should check the hyper-parameters as well: they probably won't be the same as the ones for STS-B (some are mentioned in the XLNet paper).

thomwolf on 17 Jul 2019

First thing that comes to mind is that SST-2 is ~10 times bigger than STS-B (see the GLUE paper), so you need to increase the number of training steps a lot if you want to do at least one full epoch on the SST-2 training dataset (here you use the value for STS-B). And you should probably do several epochs (e.g. we do 6-7 epochs on STS-B). Check the examples of recommended hyper-parameters in table 8 of the XLNet paper.

You can also directly specify the number of epochs instead of the maximum number of steps in the script. You can see all the hyper-parameters of the script with python ./run_glue.py --help.
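
To make the steps-versus-epochs point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes about 67,349 training sentences for SST-2 (an assumption taken from the GLUE release, which matches the 8419 iterations per epoch visible in the log of the original post at batch size 8) and a single GPU. The function name is made up for illustration and is not part of run_glue.py, and the rounding may differ slightly from what the script does.

```python
import math

def epochs_covered(num_examples, per_gpu_batch_size, max_steps,
                   n_gpu=1, gradient_accumulation_steps=1):
    """Return (optimizer steps per epoch, epochs covered by max_steps)."""
    effective_batch = per_gpu_batch_size * n_gpu * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, max_steps / steps_per_epoch

SST2_TRAIN_SIZE = 67349  # assumed size of the SST-2 training set

steps, epochs = epochs_covered(SST2_TRAIN_SIZE, per_gpu_batch_size=8, max_steps=1200)
print(steps, round(epochs, 2))  # 8419 0.14
```

Under these assumptions, --max_steps=1200 at batch size 8 covers only about a seventh of one epoch on SST-2.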

thomwolf on 17 Jul 2019

I trained on the STS-B task and hit the same problem. Below is the output with evaluation every 100 steps (I added the training and evaluation losses to the output):

07/17/2019 13:09:55 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:09:55 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:09:55 - INFO - __main__ -     Batch size = 8
07/17/2019 13:10:09 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:10:09 - INFO - __main__ -     corr = -0.05367882385720809
07/17/2019 13:10:09 - INFO - __main__ -     eval_loss = 2.8412214481133096
07/17/2019 13:10:09 - INFO - __main__ -     pearson = -0.041275192
07/17/2019 13:10:09 - INFO - __main__ -     spearmanr = -0.06608245566229025
07/17/2019 13:10:09 - INFO - __main__ -   Training loss: 307.258519500494
07/17/2019 13:10:41 - INFO - __main__ -   Loading features from cached file ...glue_data/STS-B/cached_dev_xlnet-large-cased_128_sts-b
07/17/2019 13:10:41 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:10:41 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:10:41 - INFO - __main__ -     Batch size = 8
07/17/2019 13:10:56 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:10:56 - INFO - __main__ -     corr = 0.13943037650184956
07/17/2019 13:10:56 - INFO - __main__ -     eval_loss = 2.3762524007482733
07/17/2019 13:10:56 - INFO - __main__ -     pearson = 0.13502572
07/17/2019 13:10:56 - INFO - __main__ -     spearmanr = 0.1438350282350605
07/17/2019 13:10:56 - INFO - __main__ -   Training loss: 533.9101385176182
07/17/2019 13:11:28 - INFO - __main__ -   Loading features from cached file .../glue_data/STS-B/cached_dev_xlnet-large-cased_128_sts-b
07/17/2019 13:11:28 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:11:28 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:11:28 - INFO - __main__ -     Batch size = 8
07/17/2019 13:11:42 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:11:42 - INFO - __main__ -     corr = -0.0830871973267994
07/17/2019 13:11:42 - INFO - __main__ -     eval_loss = 2.5565993221516305
07/17/2019 13:11:42 - INFO - __main__ -     pearson = -0.08915693
07/17/2019 13:11:42 - INFO - __main__ -     spearmanr = -0.077017461524765
07/17/2019 13:11:42 - INFO - __main__ -   Training loss: 761.6802722513676
07/17/2019 13:12:15 - INFO - __main__ -   Loading features from cached file .../glue_data/STS-B/cached_dev_xlnet-large-cased_128_sts-b
07/17/2019 13:12:15 - INFO - __main__ -   ***** Running evaluation  *****
07/17/2019 13:12:15 - INFO - __main__ -     Num examples = 1500
07/17/2019 13:12:15 - INFO - __main__ -     Batch size = 8
07/17/2019 13:12:29 - INFO - __main__ -   ***** Eval results  *****
07/17/2019 13:12:29 - INFO - __main__ -     corr = -0.08715267932681456
07/17/2019 13:12:29 - INFO - __main__ -     eval_loss = 2.398741365113157
07/17/2019 13:12:29 - INFO - __main__ -     pearson = -0.08428703
07/17/2019 13:12:29 - INFO - __main__ -     spearmanr = -0.09001832616862088
07/17/2019 13:12:29 - INFO - __main__ -   Training loss: 974.8287971913815

As you can see, the training loss is increasing, the eval loss stays almost the same, and the other metrics fluctuate around 0.

avostryakov on 17 Jul 2019

@thomwolf So it looks like training is happening, but in the opposite direction for some reason.

avostryakov on 17 Jul 2019

Maybe you haven't fully read the explanation accompanying the STS-B example in the readme?

It says "On this machine we thus have a batch size of 32, please increase gradient_accumulation_steps to reach the same batch size if you have a smaller machine."
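
For anyone unsure what that README note means in practice: the effective batch size is the per-GPU batch size times the number of GPUs times gradient_accumulation_steps. The toy sketch below (plain Python, with a one-parameter least-squares model made up for illustration; it is not the run_glue.py code) shows why accumulation works. Summing the gradients of k micro-batch losses, each divided by k, gives the same gradient as one batch k times larger, provided the micro-batches all have the same size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches, scaling each by 1/k."""
    k = len(xs) // micro_batch_size  # plays the role of gradient_accumulation_steps
    total = 0.0
    for i in range(k):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / k
    return total

# Toy data: 32 examples, processed as 4 micro-batches of 8.
xs = [0.1 * i for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=8)
print(abs(full - accum) < 1e-9)
```

With per_gpu_train_batch_size=8 on one GPU, gradient_accumulation_steps=4 would therefore give the batch size of 32 mentioned in the README, at the cost of four forward/backward passes per optimizer step.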

thomwolf on 17 Jul 2019

@avostryakov Did you try reducing the learning rate? I had a similar issue training the TensorFlow version of XLNet on only one GPU. I tried reducing the learning rate from 5e-5 to 1e-5, and it worked. Hope this helps.
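
To see how the peak learning rate interacts with --warmup_steps, here is a small sketch of a linear warmup followed by a linear decay, which is, as far as I know, the shape of schedule run_glue.py uses. The function is my own illustration with a hypothetical name, not the library's scheduler class.

```python
def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Settings from the original command: 120 warmup steps out of 1200 total.
for peak in (5e-5, 1e-5):
    print(peak, [lr_at_step(s, peak, 120, 1200) for s in (0, 60, 120, 660, 1200)])
```

Lowering the peak from 5e-5 to 1e-5 scales every step of the schedule by the same factor of five; the warmup only controls how quickly that peak is reached.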

bugface on 17 Jul 2019

@thomwolf @tbright17 I got numbers similar to yours on SQuAD 2.0. It seems the model probably isn't learning much. I'll print out the losses to explore. Should we change the LR as well?
The best I got when fine-tuning on SQuAD 2.0 with train_batch_size=8 and gradient_accumulation_steps=1 (everything else at the defaults) on a single V100 GPU was:
07/16/2019 16:21:43 - INFO - __main__ - Results: {'exact': 26.438136949380947, 'f1': 28.470459931964722, 'total': 11873, 'HasAns_exact': 0.08434547908232119, 'HasAns_f1': 4.154819630940996, 'HasAns_total': 5928, 'NoAns_exact': 52.716568544995795, 'NoAns_f1': 52.716568544995795, 'NoAns_total': 5945, 'best_exact': 50.07159100480081, 'best_exact_thresh': 0.0, 'best_f1': 50.07159100480081, 'best_f1_thresh': 0.0}

avisil on 17 Jul 2019

It may also be a batch-size problem: the authors use a batch size between 32 and 128 in the paper.

What effective batch size do you have (printed during training)?

While we reproduce the official XLNet number on STS-B, I still have to work a bit on the SQuAD example for XLNet. The XLNet authors used complex pre- and post-processing of the data (smarter than BERT's) that I haven't fully integrated into our run_squad example yet.

thomwolf on 17 Jul 2019

@thomwolf You are right, STS-B started to train with batch size 32 and gradient_accumulation_steps = 2. Now I'm wondering why it depends so heavily on batch size. But it doesn't help for SST-2: I set max_steps=5000 (that's about 5 epochs) and the training and evaluation losses didn't change at all during training. I'm now trying a learning rate of 1e-5, as recommended by @alexpython1988.

avostryakov on 17 Jul 2019

@thomwolf Maybe. Also, my sequence length is 384; the authors mention they probably used 512. Here's my batch-size-related printout. I think the number of examples seems a little low, no? I think SQuAD has about 150K examples (has-answer and no-answer questions), and with doc_stride it should be more than 150K.

07/15/2019 13:23:32 - INFO - __main__ - ***** Running training *****
07/15/2019 13:23:32 - INFO - __main__ - Num examples = 133947
07/15/2019 13:23:32 - INFO - __main__ - Num Epochs = 3
07/15/2019 13:23:32 - INFO - __main__ - Instantaneous batch size per GPU = 4
07/15/2019 13:23:32 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
07/15/2019 13:23:32 - INFO - __main__ - Gradient Accumulation steps = 1
07/15/2019 13:23:32 - INFO - __main__ - Total optimization steps = 100461

I saw in renatoviolin's repo that they use the following, which gives them 86 F1 on an RTX 2080:
flags.DEFINE_integer("max_seq_length", default=512, help="Max sequence length")
flags.DEFINE_integer("max_query_length", default=64, help="Max query length")
flags.DEFINE_integer("doc_stride", default=128, help="Doc stride")
flags.DEFINE_integer("max_answer_length", default=64, help="Max answer length")

Also, their learning rate is different from ours (5e-5 in this repo):
flags.DEFINE_float("learning_rate", default=3e-5, help="initial learning rate")
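
On the example count: with doc_stride, a long paragraph is cut into overlapping windows, so one question can produce several training features. Below is a minimal sketch of that sliding-window split as I understand it from the BERT-style SQuAD preprocessing. The function name and the token counts are made up for illustration; the real script works on tokenized text and also reserves room for the query and special tokens.

```python
def split_into_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) windows covering a document of num_doc_tokens tokens."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans

print(split_into_spans(100, 200, 128))  # short paragraph: [(0, 100)]
print(split_into_spans(500, 200, 128))  # [(0, 200), (128, 200), (256, 200), (384, 116)]
```

So the number of features depends on how long the paragraphs are relative to the window size: a larger max_seq_length leaves more room per window and yields fewer windows per paragraph.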

avisil on 17 Jul 2019

A learning rate of 1e-5 helps to train SST-2, together with batch size 32 and accumulation steps = 2. I need more experiments, but it works. Thanks, @thomwolf and @alexpython1988!

avostryakov on 17 Jul 2019

Great to hear, good job and good luck @avostryakov! Feel free to share good hyper-parameters if you find a nice set and I can add them to the documentation (with credits).

thomwolf on 17 Jul 2019

I was using per_gpu_train_batch_size=8 for SQuAD 2.0. Maybe a powerful model is just hard to tune.

tbright17 on 17 Jul 2019

@thomwolf My best result for SST-2 so far is an accuracy of 94.15 (the XLNet paper reports 95.6). It's better than BERT-large. I trained with the following parameters:

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train  \
    --evaluate_during_training \
    --do_eval   \
    --logging_steps 500 \
    --save_steps 3000 \
    --task_name=sst-2     \
    --data_dir=${GLUE_DIR}/SST-2  \
    --output_dir=./proc_data/sst-2   \
    --max_seq_length=128   \
    --learning_rate 1e-5 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --gradient_accumulation_steps=1 \
    --max_steps=16000  \
    --model_name=xlnet-large-cased   \
    --overwrite_output_dir   \
    --overwrite_cache \
    --warmup_steps=120 \
    --fp16

avostryakov on 17 Jul 2019

@thomwolf OK, the latest result for SST-2 almost matches the XLNet paper, with an accuracy of 95.4:

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train  \
    --evaluate_during_training \
    --do_eval   \
    --logging_steps 400 \
    --save_steps 3000 \
    --task_name=sst-2     \
    --data_dir=${GLUE_DIR}/SST-2  \
    --output_dir=./proc_data/sst-2   \
    --max_seq_length=128   \
    --learning_rate 1e-5 \
    --per_gpu_eval_batch_size=16   \
    --per_gpu_train_batch_size=16   \
    --gradient_accumulation_steps=1 \
    --max_steps=8000  \
    --model_name=xlnet-large-cased   \
    --overwrite_output_dir   \
    --overwrite_cache \
    --warmup_steps=120 \
    --fp16

Thank you for your work!

avostryakov on 18 Jul 2019

This is great @avostryakov! Thanks for sharing the results!
I'm editing the issue title until I have time to add the hyperparameters to the doc.

thomwolf on 18 Jul 2019

Hi, how could I fine-tune the model for text generation? Is it possible to fine-tune with just raw text?

sakalouski on 25 Jul 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.