Transformers: Albert Hyperparameters for Fine-tuning SQuAD 2.0

Created on 27 Nov 2019 · 4 comments · Source: huggingface/transformers

โ“ Questions & Help

I want to fine-tune albert-xxlarge-v1 on SQuAD 2.0 and am in need of optimal hyperparameters. I did not find any discussion of suggested fine-tuning hyperparameters in the original ALBERT paper, such as is provided in the original XLNet paper. I did find the following hard-coded parameters in the google-research ALBERT run_squad_sp.py code (a possible mapping to Transformers' run_squad.py flags is sketched after the list):

'do_lower_case' = True
'train_batch_size' = 32
'predict_batch_size' = 8
'learning_rate' = 5e-5
'num_train_epochs' = 3.0
'warmup_proportion' = 0.1
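
For reference, here is a minimal sketch of how those hard-coded values might be passed to Transformers' examples/run_squad.py. The file paths, doc_stride, output directory, and the 16-per-GPU split across 2 GPUs are assumptions on my part, not values taken from the Google script.

import subprocess

# Hedged mapping of the google-research defaults onto run_squad.py flags;
# paths and the per-GPU batch split are assumptions.
subprocess.run([
    "python", "run_squad.py",
    "--model_type", "albert",
    "--model_name_or_path", "albert-xxlarge-v1",
    "--do_train", "--do_eval", "--do_lower_case",
    "--version_2_with_negative",            # SQuAD 2.0 includes unanswerable questions
    "--train_file", "train-v2.0.json",      # assumed local path
    "--predict_file", "dev-v2.0.json",      # assumed local path
    "--per_gpu_train_batch_size", "16",     # 16 x 2 GPUs = effective batch size 32
    "--per_gpu_eval_batch_size", "8",
    "--learning_rate", "5e-5",
    "--num_train_epochs", "3.0",
    "--max_seq_length", "512",
    "--doc_stride", "128",                  # script default, assumed
    "--warmup_steps", "1250",               # ~10% of total steps, see below
    "--output_dir", "albert_xxlargev1_squad2_512",
], check=True)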

With a single fine-tuning run on my 2x GPUs taking ~69 hours, I'd like to minimize the number of fine-tuning runs needed to reach optimal model performance. Anyone have a bead on the optimal hyperparameters?

Also, the google-research comments in run_squad_sp.py state that warmup_proportion is the "Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training." Since 3 epochs at batch size 32 on SQuAD 2.0 works out to approximately 12.5K total optimization steps, would I set --warmup_steps = 1250 when calling Transformers' run_squad.py?
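
For what it's worth, the arithmetic behind that conversion as a small sketch; the feature count is an assumption, since the exact number depends on max_seq_length and doc_stride after tokenization:

# Rough arithmetic behind the --warmup_steps value; treat the result as an
# approximation since the feature count below is assumed.
num_train_features = 132_000     # ~SQuAD 2.0 training features (assumed)
effective_batch_size = 32        # per_gpu_train_batch_size * n_gpu * grad_accum
num_epochs = 3
warmup_proportion = 0.1

total_steps = (num_train_features // effective_batch_size) * num_epochs
warmup_steps = int(total_steps * warmup_proportion)

print(total_steps, warmup_steps)  # roughly 12375 and 1237, i.e. ~12.5K and ~1250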

Thanks in advance for any input.

wontfix

All 4 comments

Wondering this as well, but for GLUE tasks. There doesn't seem to be a good consensus on hyperparameters such as weight decay.

Results using hyperparameters from my first post above, varying only batch size:

albert_xxlargev1_squad2_512_bs32:
{
  "exact": 83.67725090541565,
  "f1": 87.51235434089064,
  "total": 11873,
  "HasAns_exact": 81.86572199730094,
  "HasAns_f1": 89.54692697189559,
  "HasAns_total": 5928,
  "NoAns_exact": 85.48359966358284,
  "NoAns_f1": 85.48359966358284,
  "NoAns_total": 5945
}

albert_xxlargev1_squad2_512_bs48:
{
  "exact": 83.65198349195654,
  "f1": 87.4736247587816,
  "total": 11873,
  "HasAns_exact": 81.73076923076923,
  "HasAns_f1": 89.38501126197984,
  "HasAns_total": 5928,
  "NoAns_exact": 85.5677039529016,
  "NoAns_f1": 85.5677039529016,
  "NoAns_total": 5945
}

[Charts: learning rate (lr) and training loss curves from these runs]
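
Presumably the lr curve above is the linear warmup followed by linear decay that run_squad.py configures. A minimal sketch reproducing that schedule with the transformers optimizer utilities (helper names vary by release, e.g. WarmupLinearSchedule in 2.1/2.2 vs. get_linear_schedule_with_warmup in later versions; this uses the newer name):

import torch
from transformers import AdamW, get_linear_schedule_with_warmup

total_steps = 12_500      # ~3 epochs of SQuAD 2.0 at effective batch size 32
warmup_steps = 1_250      # 10% of training, i.e. warmup_proportion = 0.1

# A dummy parameter stands in for the ALBERT model parameters.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = AdamW(params, lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

lrs = []
for _ in range(total_steps):
    optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
# lrs rises linearly to 5e-5 over the first 1,250 steps, then decays linearly to 0.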

@ahotrod There is a table in the appendix section of the ALBERT paper, which shows hyperparameters for ALBERT in downstream tasks:
[Image: table of ALBERT fine-tuning hyperparameters for downstream tasks, from the paper appendix]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
