I want to fine-tune albert-xxlarge-v1 on SQuAD 2.0 and am in need of optimal hyperparameters. I did not find any discussion of suggested fine-tuning hyperparameters in the original ALBERT paper, of the kind provided in the original XLNet paper. I did find the following hard-coded parameters in the google-research ALBERT run_squad_sp.py code:
'do_lower_case' = True
'train_batch_size' = 32
'predict_batch_size' = 8
'learning_rate' = 5e-5
'num_train_epochs' = 3.0
'warmup_proportion' = 0.1
With fine-tuning taking ~69 hours on my two GPUs, I'd like to shrink the number of fine-tuning iterations needed to reach optimal model performance. Anyone have a bead on the optimal hyperparameters?
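For concreteness, here is a sketch of how those defaults might map onto Transformers' legacy examples/run_squad.py (flag names are from that script; the data file paths are placeholders, and the --warmup_steps value is worked through just below). With two visible GPUs, --per_gpu_train_batch_size 16 yields an effective batch size of 32, and --fp16 (via apex) can cut wall-clock time considerably:

# sketch only -- verify flag names against your local examples/run_squad.py
python run_squad.py \
  --model_type albert \
  --model_name_or_path albert-xxlarge-v1 \
  --do_train \
  --do_eval \
  --do_lower_case \
  --version_2_with_negative \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --per_gpu_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --warmup_steps 1250 \
  --fp16 \
  --output_dir ./albert_xxlargev1_squad2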
Also, the google-research comments in run_squad_sp.py describe warmup_proportion as the "Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training." Since fine-tuning SQuAD 2.0 for 3 epochs at batch size 32 works out to approximately 12.5K total optimization steps, would I set --warmup_steps = 1250 when calling Transformers' run_squad.py?
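As a quick sanity check of that arithmetic (a sketch, assuming ~130,319 SQuAD 2.0 training examples; the actual number of features after doc-stride windowing is somewhat higher, so the true step count would rise accordingly):

# total optimization steps for 3 epochs at batch size 32, plus a 10% warmup
python -c "ex, bs, ep = 130_319, 32, 3; ts = ex // bs * ep; print(ts, ts // 10)"
# prints: 12216 1221

which lands close to the ~12.5K total steps and --warmup_steps = 1250 figures above.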
Thanks in advance for any input.
Wondering this as well, but for GLUE tasks. There doesn't seem to be a good consensus on hyperparameters such as weight decay.
Results using the hyperparameters from my first post above, varying only the batch size:
albert_xxlargev1_squad2_512_bs32:
{
  "exact": 83.67725090541565,
  "f1": 87.51235434089064,
  "total": 11873,
  "HasAns_exact": 81.86572199730094,
  "HasAns_f1": 89.54692697189559,
  "HasAns_total": 5928,
  "NoAns_exact": 85.48359966358284,
  "NoAns_f1": 85.48359966358284,
  "NoAns_total": 5945
}
albert_xxlargev1_squad2_512_bs48:
{
  "exact": 83.65198349195654,
  "f1": 87.4736247587816,
  "total": 11873,
  "HasAns_exact": 81.73076923076923,
  "HasAns_f1": 89.38501126197984,
  "HasAns_total": 5928,
  "NoAns_exact": 85.5677039529016,
  "NoAns_f1": 85.5677039529016,
  "NoAns_total": 5945
}


@ahotrod There is a table in the appendix section of the ALBERT paper, which shows hyperparameters for ALBERT on downstream tasks:

[table image from the ALBERT paper appendix: fine-tuning hyperparameters for downstream tasks]