I want to fine-tune albert-xxlarge-v1 on SQuAD 2.0 and am in need of optimal hyperparameters. I did not find any discussion of suggested fine-tuning hyperparameters in the original ALBERT paper, of the kind provided in the original XLNet paper. I did find the following hard-coded parameters in the google-research ALBERT run_squad_sp.py code:
'do_lower_case' = True
'train_batch_size' = 32
'predict_batch_size' = 8
'learning_rate' = 5e-5
'num_train_epochs' = 3.0
'warmup_proportion' = 0.1
With fine-tuning taking ~69 hours on my two GPUs, I'd like to shrink the number of fine-tuning iterations needed to reach optimal model performance. Anyone have a bead on the optimal hyperparameters?
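For concreteness, here is a sketch of how those defaults might map onto Transformers' legacy examples/run_squad.py (flag names are from that script; the data file paths are placeholders, and the --warmup_steps value is worked through just below). With two visible GPUs, --per_gpu_train_batch_size 16 yields an effective batch size of 32, and --fp16 (via apex) can cut wall-clock time considerably:

# sketch only -- verify flag names against your local examples/run_squad.py
python run_squad.py \
  --model_type albert \
  --model_name_or_path albert-xxlarge-v1 \
  --do_train \
  --do_eval \
  --do_lower_case \
  --version_2_with_negative \
  --train_file train-v2.0.json \
  --predict_file dev-v2.0.json \
  --per_gpu_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --max_seq_length 512 \
  --doc_stride 128 \
  --warmup_steps 1250 \
  --fp16 \
  --output_dir ./albert_xxlargev1_squad2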
Also, the google-research comments in run_squad_sp.py describe warmup_proportion as the "Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training." Since fine-tuning SQuAD 2.0 for 3 epochs at batch size 32 works out to approximately 12.5K total optimization steps, would I set --warmup_steps = 1250 when calling Transformers' run_squad.py?
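As a quick sanity check of that arithmetic (a sketch, assuming ~130,319 SQuAD 2.0 training examples; the actual number of features after doc-stride windowing is somewhat higher, so the true step count would rise accordingly):

# total optimization steps for 3 epochs at batch size 32, plus a 10% warmup
python -c "ex, bs, ep = 130_319, 32, 3; ts = ex // bs * ep; print(ts, ts // 10)"
# prints: 12216 1221

which lands close to the ~12.5K total steps and --warmup_steps = 1250 figures above.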
Thanks in advance for any input.
Wondering this as well, but for GLUE tasks. There doesn't seem to be a good consensus on hyperparameters such as weight decay.
Results using the hyperparameters from my first post above, varying only the batch size:
albert_xxlargev1_squad2_512_bs32:
{
  "exact": 83.67725090541565,
  "f1": 87.51235434089064,
  "total": 11873,
  "HasAns_exact": 81.86572199730094,
  "HasAns_f1": 89.54692697189559,
  "HasAns_total": 5928,
  "NoAns_exact": 85.48359966358284,
  "NoAns_f1": 85.48359966358284,
  "NoAns_total": 5945
}
albert_xxlargev1_squad2_512_bs48:
{
  "exact": 83.65198349195654,
  "f1": 87.4736247587816,
  "total": 11873,
  "HasAns_exact": 81.73076923076923,
  "HasAns_f1": 89.38501126197984,
  "HasAns_total": 5928,
  "NoAns_exact": 85.5677039529016,
  "NoAns_f1": 85.5677039529016,
  "NoAns_total": 5945
}


@ahotrod There is a table in the appendix section of the ALBERT paper, which shows hyperparameters for ALBERT on downstream tasks:

[table image from the ALBERT paper appendix: fine-tuning hyperparameters for downstream tasks]