Transformers: Having trouble reproducing SQuAD 2.0 results using ALBERT v2 models

Created on 9 Dec 2019  ·  9 comments  ·  Source: huggingface/transformers

❓ Questions & Help

I tried to fine-tune ALBERT v2 models on SQuAD 2.0, but sometimes the loss doesn't decrease and performance on the dev set is low. In my case the problem happens with albert-large-v2 and albert-xlarge-v2. Any suggestions?
[Three screenshots attached]

Label: wontfix

All 9 comments

What GPU(s) and hyperparameters are you using?

Specifically:
--learning_rate ?
--per_gpu_train_batch_size ?
--gradient_accumulation_steps ?
--warmup_steps ?

I'm on my third xxlarge-v1 fine-tune, ~23 hours each epoch plus eval on 2x NVIDIA 1080Ti. Results are relatively good, best of all the models I've fine-tuned on SQuAD 2.0 so far:

albert_xxlargev1_squad2_512_bs32:
{
  "exact": 83.67725090541565,
  "f1": 87.51235434089064,
  "total": 11873,
  "HasAns_exact": 81.86572199730094,
  "HasAns_f1": 89.54692697189559,
  "HasAns_total": 5928,
  "NoAns_exact": 85.48359966358284,
  "NoAns_f1": 85.48359966358284,
  "NoAns_total": 5945
}

[TensorBoard charts: lr and loss]

I use 6x P40 for xlarge-v2 and 4x P40 for large-v2 with the same total batch size of 48 (8x6 and 12x4); lr is set to 3e-5 for all runs. Other options are left at their defaults.

I also launched several runs with the same settings; sometimes the problem happened and sometimes it didn't, which is weird because I didn't even change the random seed.

I meant to include this link in my post above, which details the Google Research (GR) run_squad_sp.py hyperparameters: https://github.com/huggingface/transformers/issues/1974

As demonstrated in that link, GR's bs=32 was a very slight improvement for me over my initial bs=48 fine-tune (the batch size you also chose). Peak learning_rate=5e-5 is reached after a 10% linear lr warm-up proportion, with linear lr decay after that.
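
Note that run_squad.py takes --warmup_steps as an absolute count rather than a proportion, so the 10% has to be converted. A minimal sketch, assuming a hypothetical feature count rather than the real SQuAD 2.0 numbers:

def warmup_steps_from_proportion(num_train_features, effective_batch_size,
                                 num_train_epochs, warmup_proportion=0.1):
    """Convert a warm-up proportion into the absolute step count run_squad.py expects."""
    steps_per_epoch = -(-num_train_features // effective_batch_size)  # ceiling division
    total_steps = steps_per_epoch * num_train_epochs
    return int(total_steps * warmup_proportion)

# Hypothetical example: 130k training features, bs=32, 3 epochs -> roughly 1,200 warm-up steps.
print(warmup_steps_from_proportion(130_000, 32, 3))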

Hope this helps, please post your results for comparison.

From TensorBoard, the best-performing run is albert-xxlarge-v2 with 88.49 F1 and 84.83 EM at step 25k. I didn't run any experiments on the v1 models.


Nice results, 6 epochs?

According to GR at the time of V2 release, the xxlarge-V1 model outperforms the xxlarge-V2 model.

Not sure if this is related, but I found that ALBERT is very unstable. When running in non-deterministic mode, it will sometimes get stuck in a very strange spot and never recover. This becomes very clear when you use a secondary score as a sanity check (e.g. Pearson correlation for regression, f1 for classification). So for the exact same parameters (but each time presumably another random seed), I would sometimes get e.g. r=0.02 and other times r=0.77.

I'd have to test more to get conclusive results, but it's something that I haven't experienced before with other models.
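
For what it's worth, this is the kind of seeding I do before each run to rule seeds out. A minimal PyTorch sketch; note that the cuDNN flags trade speed for reproducibility, and multi-GPU runs can still vary.

import random

import numpy as np
import torch

def set_seed(seed=42):
    """Seed Python, NumPy and PyTorch so reruns start from the same point."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (slower; some ops on multi-GPU
    # setups can still be non-deterministic).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)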

The best I can get with xxlarge-v2 is:
Results: {
  'exact': 84.86481933799377,
  'f1': 88.43795242530017,
  'total': 11873,
  'HasAns_exact': 82.05128205128206,
  'HasAns_f1': 89.20779506504576,
  'HasAns_total': 5928,
  'NoAns_exact': 87.67031118587047,
  'NoAns_f1': 87.67031118587047,
  'NoAns_total': 5945,
  'best_exact': 84.86481933799377,
  'best_exact_thresh': 0.0,
  'best_f1': 88.4379524253,
  'best_f1_thresh': 0.0
}
with 2e-5 lr, 4x V100, 2 samples per GPU, no gradient accumulation, run for 3 epochs.
The current results are about the same as RoBERTa-large, but I expected better performance from ALBERT.
Still tuning. Any ideas on how to improve it?

Same issue with albert-large-v2 here, but I don't know why. Any results?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
