Thanks for the impressive work on XLM-R.
Recently I found that the results on XNLI were updated: the avg-acc of XLM-R_base increased from 74.6 to 76.1.
The best result I can obtain is 74.6, by fine-tuning for 5 epochs with lr=1e-5, a batch size of 32, weight decay 0.1, and 10% warmup.
I have also tried the suggestion by @kartikayk from Issue-1367, but it doesn't seem to work for me.
I train the model with a batch size of 32 and 4-step gradient accumulation, 5K steps per epoch, using either a fixed lr=5e-6 or lr=5e-6 with linear decay and 10% warmup. However, I cannot obtain the 76.1 result.
Maybe I am missing some important details.
Could you provide more details or your fine-tuning code?
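For reference, the linear-decay-with-10%-warmup schedule described above can be sketched as below. This is just an illustrative helper, not code from either party; the function name and the 25K total steps (5 epochs × 5K steps, per the numbers above) are assumptions:

```python
def lr_at_step(step, total_steps=25_000, peak_lr=5e-6, warmup_frac=0.1):
    """Linear warmup over the first 10% of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # warmup: ramp linearly from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # decay: ramp linearly from peak_lr down to 0 at total_steps
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```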
Thanks.
CC @kartikayk
Thanks in advance for the reply! @lematt1991 @kartikayk
I have the same reproduction problem with the following settings:
--learning_rate 5e-5
--batch_size 32
--n_gpu 3 (using DataParallel)
--max_steps 12000 (roughly 3 epochs)
--save_steps 2000
--warmup_steps 1200 (first 10% of training steps)
--max_seq_length 128
The obtained result is 74.2 (averaged over 5 runs).
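As a back-of-the-envelope check that `--max_steps 12000` is indeed roughly 3 epochs (a hypothetical calculation, assuming `--batch_size` is per GPU so the effective batch is 32 × 3 = 96, and using the ~393K-example MNLI English training set typically used for XNLI cross-lingual transfer):

```python
# Rough epoch count implied by --max_steps 12000. Assumption for illustration:
# effective batch = per-GPU batch * num_gpus under DataParallel.
train_examples = 392_702          # approx. MNLI English training set size
effective_batch = 32 * 3          # --batch_size 32 on 3 GPUs
steps_per_epoch = train_examples / effective_batch
epochs = 12_000 / steps_per_epoch
print(round(epochs, 2))           # roughly 3 epochs, as stated above
```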
Are there steps outlined for XNLI fine-tuning on XLM-R?
Hi! Apologies for the delayed response here, seems like I missed some questions. A couple of comments:
- Please ensure that you have the latest checkpoint for XLMR-Base here. The updated numbers in the paper are with a checkpoint that was trained for 1.5M updates (more details on the fairseq page).
- For finetuning, you can look at the PyText Tutorial (https://github.com/facebookresearch/pytext/blob/master/demo/notebooks/xlm_r_tutorial.ipynb)
For the settings I used, following are the details:
- Batch size:
-- batch_size_per_gpu = 8
-- num_gpus = 8
-- gradient_accumulation_steps = 2
-- effective_batch_size = 8 * 8 * 2 = 128
- We run validation after each epoch - where each epoch consists of 10K batches with data randomly sampled from the training set - and select the checkpoint with the best validation-set result. This is quite important. In all, we run training for 30 epochs.
- We use Adam with an LR of 0.0000075 (7.5e-6), without any warmup or decay.
- max_seq_length is 256.
- We select the model with the best result on the validation set and then report the final number on the test set by averaging the results from 5 runs.
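The epoch/validation protocol described here could be sketched roughly as follows. This is a hypothetical illustration rather than the actual fine-tuning code; `train_one_epoch` and `evaluate` are stand-in callables supplied by the caller:

```python
def select_best_checkpoint(train_one_epoch, evaluate, num_epochs=30):
    """Train for a fixed number of epochs (each one 10K randomly sampled
    batches) and keep the checkpoint with the best validation accuracy."""
    best_acc, best_ckpt = float("-inf"), None
    for epoch in range(num_epochs):
        ckpt = train_one_epoch(epoch)   # returns a checkpoint after 10K batches
        acc = evaluate(ckpt)            # validation-set accuracy of that checkpoint
        if acc > best_acc:
            best_acc, best_ckpt = acc, ckpt
    return best_ckpt, best_acc
```

The final test-set number would then be the average of this procedure over 5 independent runs.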
Thanks for the link and details.
Thanks a lot @kartikayk !
By the way, what about the XLM-R large model? Does it use the same hyperparameters as the base model?
@kartikayk Thanks for your reply. Where can we find the latest checkpoint for XLMR-Base?
@kartikayk Hi! Thanks for the hyper-params for XLMR-Base.
For "Fine-tune multilingual model on English training set (Cross-lingual Transfer)" in Table 1 of https://arxiv.org/pdf/1911.02116.pdf:
What are the hyper-params for XLMR-Large?
Also, when you select the best model in this setup, did you use the validation set for all languages or only English?