Transformers: Having trouble reproducing SQuAD 2.0 results using ALBERT v2 models

Created on 9 Dec 2019  ·  9 comments  ·  Source: huggingface/transformers

❓ Questions & Help

I tried to fine-tune ALBERT v2 models on SQuAD 2.0, but sometimes the loss doesn't decrease and performance on the dev set is low. In my case the problem happens with albert-large-v2 and albert-xlarge-v2. Any suggestions?
[Three screenshots attached]

Label: wontfix

All 9 comments

What GPU(s) and hyperparameters are you using?

Specifically:
--learning_rate ?
--per_gpu_train_batch_size ?
--gradient_accumulation_steps ?
--warmup_steps ?

I'm on my third xxlarge-v1 fine-tune, ~23 hours each epoch plus eval on 2x NVIDIA 1080Ti. Results are relatively good, best of all the models I've fine-tuned on SQuAD 2.0 so far:

albert_xxlargev1_squad2_512_bs32:
{
  "exact": 83.67725090541565,
  "f1": 87.51235434089064,
  "total": 11873,
  "HasAns_exact": 81.86572199730094,
  "HasAns_f1": 89.54692697189559,
  "HasAns_total": 5928,
  "NoAns_exact": 85.48359966358284,
  "NoAns_f1": 85.48359966358284,
  "NoAns_total": 5945
}

[TensorBoard charts: lr and loss]

I use 6x P40 for xlarge-v2 and 4x P40 for large-v2 with the same total batch size of 48 (8x6 and 12x4); lr is set to 3e-5 for all runs. Other options are left at their defaults.

I also launched several runs with the same settings; sometimes the problem happened and sometimes it didn't, which is weird because I didn't even change the random seed.

I meant to include this link in my post above, which details the Google Research (GR) run_squad_sp.py hyperparameters: https://github.com/huggingface/transformers/issues/1974

As demonstrated in that link, GR's bs=32 was a very slight improvement for me over my initial bs=48 fine-tune (the batch size you also chose). Peak learning_rate=5e-5 is reached after a 10% linear lr warm-up proportion, with linear lr decay after that.
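
Note that run_squad.py takes --warmup_steps as an absolute count rather than a proportion, so the 10% has to be converted. A minimal sketch, assuming a hypothetical feature count rather than the real SQuAD 2.0 numbers:

def warmup_steps_from_proportion(num_train_features, effective_batch_size,
                                 num_train_epochs, warmup_proportion=0.1):
    """Convert a warm-up proportion into the absolute step count run_squad.py expects."""
    steps_per_epoch = -(-num_train_features // effective_batch_size)  # ceiling division
    total_steps = steps_per_epoch * num_train_epochs
    return int(total_steps * warmup_proportion)

# Hypothetical example: 130k training features, bs=32, 3 epochs -> roughly 1,200 warm-up steps.
print(warmup_steps_from_proportion(130_000, 32, 3))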

Hope this helps, please post your results for comparison.

From TensorBoard, the best-performing run is albert-xxlarge-v2 with 88.49 F1 and 84.83 EM at step 25k. I didn't run any experiments on the v1 models.


Nice results, 6 epochs?

According to GR at the time of V2 release, the xxlarge-V1 model outperforms the xxlarge-V2 model.

Not sure if this is related, but I found that ALBERT is very unstable. When running in non-deterministic mode, it will sometimes get stuck in a very strange spot and never recover. This becomes very clear when you use a secondary score as a sanity check (e.g. Pearson correlation for regression, f1 for classification). So for the exact same parameters (but each time presumably another random seed), I would sometimes get e.g. r=0.02 and other times r=0.77.

I'd have to test more to get conclusive results, but it's something that I haven't experienced before with other models.
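
For what it's worth, this is the kind of seeding I do before each run to rule seeds out. A minimal PyTorch sketch; note that the cuDNN flags trade speed for reproducibility, and multi-GPU runs can still vary.

import random

import numpy as np
import torch

def set_seed(seed=42):
    """Seed Python, NumPy and PyTorch so reruns start from the same point."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (slower; some ops on multi-GPU
    # setups can still be non-deterministic).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)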

The best I can get with xxlarge-v2 is:
Results: {
  'exact': 84.86481933799377,
  'f1': 88.43795242530017,
  'total': 11873,
  'HasAns_exact': 82.05128205128206,
  'HasAns_f1': 89.20779506504576,
  'HasAns_total': 5928,
  'NoAns_exact': 87.67031118587047,
  'NoAns_f1': 87.67031118587047,
  'NoAns_total': 5945,
  'best_exact': 84.86481933799377,
  'best_exact_thresh': 0.0,
  'best_f1': 88.4379524253,
  'best_f1_thresh': 0.0
}
with 2e-5 lr, 4x V100, 2 samples per GPU, no gradient accumulation, run for 3 epochs.
The current results are about the same as RoBERTa-large, but I expected better performance from ALBERT.
Still tuning. Any ideas on how to improve it?

Same issue with albert-large-v2 here, but I don't know why. Any results?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
