I tried to fine-tune ALBERT v2 models on SQuAD 2.0, but sometimes the loss doesn't decrease and performance on the dev set is low. In my case the problem can happen with albert-large-v2 and albert-xlarge-v2. Any suggestions?



What GPU(s) and hyperparameters are you using?
Specifically:
--learning_rate ?
--per_gpu_train_batch_size ?
--gradient_accumulation_steps ?
--warmup_steps ?
I'm on my third xxlarge-v1 fine-tune, ~23 hours per epoch plus eval on 2x NVIDIA 1080Ti. Results are relatively good, the best of all the models I've fine-tuned on SQuAD 2.0 so far:
albert_xxlargev1_squad2_512_bs32:
{
"exact": 83.67725090541565,
"f1": 87.51235434089064,
"total": 11873,
"HasAns_exact": 81.86572199730094,
"HasAns_f1": 89.54692697189559,
"HasAns_total": 5928,
"NoAns_exact": 85.48359966358284,
"NoAns_f1": 85.48359966358284,
"NoAns_total": 5945
}


I use 6xP40 for xlarge-v2 and 4xP40 for large-v2 with the same total batch size of 48 (8x6 & 12x4); lr is set to 3e-5 for all the runs. Other options remain at their defaults.
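For clarity, the effective batch size here is per-GPU batch × number of GPUs × gradient accumulation steps; a minimal sketch of that arithmetic (illustrative only, not code from run_squad.py):

```python
# Effective batch size = per-GPU batch * number of GPUs * gradient accumulation steps.
# The two calls below mirror the setups described above (8x6 and 12x4), both giving 48.
def effective_batch_size(per_gpu_batch: int, n_gpus: int, grad_accum_steps: int = 1) -> int:
    return per_gpu_batch * n_gpus * grad_accum_steps

print(effective_batch_size(8, 6))   # xlarge-v2 on 6x P40 -> 48
print(effective_batch_size(12, 4))  # large-v2 on 4x P40  -> 48
```

Both setups give the same effective batch of 48, just split differently across devices.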
I also launched several runs with the same settings; sometimes the problem happened and sometimes it didn't, which is strange because I didn't even change the random seed.
I meant to include this link in my post above, which details the Google-Research (GR) run_squad_sp.py hyperparameters: https://github.com/huggingface/transformers/issues/1974
As demonstrated in that link, GR's bs=32 was a very slight improvement for me over my initial bs=48 fine-tune (the batch size you also chose). Peak learning_rate=5e-5 after a 10% linear lr warm-up proportion, with linear lr decay after that.
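A minimal sketch of that schedule using the transformers linear-warmup scheduler (the toy model and step counts below are placeholders, not my actual run configuration):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Placeholders: substitute your real model and the step counts of your run.
model = torch.nn.Linear(10, 2)            # stand-in for the ALBERT model
total_steps = 10000                       # batches per epoch * number of epochs
warmup_steps = int(0.1 * total_steps)     # 10% linear warm-up proportion

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # peak learning rate
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# In the training loop, call optimizer.step() and then scheduler.step() each batch,
# which gives the linear ramp-up to the peak lr followed by linear decay to zero.
```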
Hope this helps, please post your results for comparison.
From TensorBoard, the best-performing one is albert-xxlarge-v2 with 88.49 F1 and 84.83 EM at step 25k. I didn't run any experiments on v1 models.
Nice results, 6 epochs?
According to GR at the time of V2 release, the xxlarge-V1 model outperforms the xxlarge-V2 model.
Not sure if this is related, but I found that ALBERT is very unstable. When running in non-deterministic mode, it will sometimes get stuck in a very strange spot and never recover. This becomes very clear when you use a secondary score as a sanity check (e.g. Pearson correlation for regression, f1 for classification). So for the exact same parameters (but each time presumably another random seed), I would sometimes get e.g. r=0.02 and other times r=0.77.
I'd have to test more to get conclusive results, but it's something that I haven't experienced before with other models.
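If anyone wants to rule seed variance in or out, here is a minimal sketch of pinning the usual RNGs and forcing deterministic cuDNN (generic PyTorch setup at some speed cost, independent of whatever seed handling the fine-tuning script already does):

```python
import random
import numpy as np
import torch

def set_full_determinism(seed: int = 42) -> None:
    # Pin every RNG the training loop touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade cuDNN autotuning speed for reproducible kernel selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_full_determinism(42)
```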
The best I can get with xxlarge-v2 is
Results: {'exact': 84.86481933799377, 'f1': 88.43795242530017, 'total': 11873, 'HasAns_exact': 82.05128205128206, 'HasAns_f1': 89.20779506504576, 'HasAns_total': 5928, 'NoAns_exact': 87.67031118587047, 'NoAns_f1': 87.67031118587047, 'NoAns_total': 5945, 'best_exact': 84.86481933799377, 'best_exact_thresh': 0.0, 'best_f1': 88.4379524253, 'best_f1_thresh': 0.0}
with 2e-5 lr, 4xV100, 2 samples per GPU, no gradient accumulation, trained for 3 epochs.
The current results are about the same as RoBERTa-large, but I expect better performance from ALBERT.
Still tuning. Any ideas on how to improve it?
I have the same issue with albert-large-v2 but don't know why. Any results?