Fairseq: XNLI Results Reproduction of XLM-R

Created on 24 Apr 2020 · 10 comments · Source: pytorch/fairseq

Thanks for the impressive work on XLM-R.

Recently I found that the results on XNLI were updated: the average accuracy of XLM-R_base increased from 74.6 to 76.1.

I can obtain a best result of 74.6 by fine-tuning for 5 epochs with lr=1e-5, a batch size of 32, weight decay 0.1, and 10% warmup.

I have also tried the suggestion by @kartikayk from Issue-1367, but it does not seem to work for me.
I train the model with a batch size of 32 and 4-step gradient accumulation, 5K steps per epoch, and either a fixed lr=5e-6 or lr=5e-6 with linear decay and 10% warmup (a rough sketch of that schedule is below). However, I still cannot reach 76.1.
Maybe I am missing some important detail.
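
For concreteness, here is a minimal sketch of that 10% warmup plus linear decay schedule in plain PyTorch. It is only an illustration; the helper name `linear_warmup_decay` is made up here and is not the code I actually ran:

```python
import torch

def linear_warmup_decay(optimizer, total_steps, warmup_frac=0.1):
    """LR ramps 0 -> base_lr over the warmup steps, then decays linearly back to 0."""
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Example: 5 epochs x 5K steps with base lr=5e-6, as in the settings above.
model = torch.nn.Linear(10, 3)   # placeholder for the XLM-R classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)
scheduler = linear_warmup_decay(optimizer, total_steps=5 * 5000)
```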

Could you provide more details or your fine-tuning code?

Thanks.

question

All 10 comments

CC @kartikayk

Thanks in advance for the reply! @lematt1991 @kartikayk

I have the same reproduction problem with the following settings:

--learning_rate 5e-5
--batch_size 32
--n_gpu 3 (using DataParallel)
--max_steps 12000 (roughly 3 epochs)
--save_steps 2000
--warmup_steps 1200 (first 10% training steps)
--max_seq_length 128

The obtained result is 74.2 (averaged over 5 runs).
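
A quick back-of-the-envelope check of those step counts, assuming the English MultiNLI training set (~393k examples) and reading `--batch_size 32` as per-GPU under DataParallel (both assumptions, not stated in the thread):

```python
# Rough step accounting for the flags above.
train_examples = 392_702                              # English MultiNLI train set (assumed)
effective_batch = 32 * 3                              # --batch_size per GPU * --n_gpu (assumed per-GPU)
steps_per_epoch = train_examples // effective_batch   # ~4090
epochs = 12_000 / steps_per_epoch                     # ~2.9 -> "roughly 3 epochs"
warmup_fraction = 1_200 / 12_000                      # 0.1 -> first 10% of training steps
print(steps_per_epoch, round(epochs, 1), warmup_fraction)
```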

Are there steps outlined to do the XNLI fine-tuning on xlm-r?

Hi! Apologies for the delayed response here, seems like I missed some questions. A couple of comments:

  • Please ensure that you have the latest checkpoint for XLMR-Base here. The updated numbers in the paper are with a checkpoint that was trained for 1.5M updates (more details on the fairseq page); a snippet for loading it is shown after this list.
  • For finetuning, you can look at the PyText Tutorial (https://github.com/facebookresearch/pytext/blob/master/demo/notebooks/xlm_r_tutorial.ipynb)
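
For reference, loading the released XLM-R Base checkpoint via torch.hub follows the pattern in the fairseq README; this is only a minimal sketch, and the assumption is that the `xlmr.base` hub entry corresponds to the updated checkpoint mentioned above:

```python
import torch

# Downloads and loads the released xlmr.base checkpoint (this should be the
# updated 1.5M-update one; if in doubt, compare against the fairseq page).
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.base')
xlmr.eval()

tokens = xlmr.encode('Hello world!')        # sentencepiece BPE encoding
features = xlmr.extract_features(tokens)    # shape: (1, seq_len, 768) for the Base model
print(features.shape)
```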

For the settings I used, here are the details (a rough sketch of an equivalent training loop follows this list):

  • Batch Size:
    -- batch_size_per_gpu = 8
    -- num_gpus = 8
    -- gradient_accumulation_steps = 2
    -- effective_batch_size = 8 * 8 * 2 = 128
  • We run validation after each epoch, where an epoch consists of 10K batches randomly sampled from the training set, and select the checkpoint with the best validation-set result. This is quite important. In all, we run training for 30 epochs.

  • We use Adam with an LR of 7.5e-6, without any warmup or decay.

  • max_seq_length is 256.

  • We select the model with the best result on the validation set and then pick the final number on the test set by averaging the results from 5 runs.
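
The fine-tuning above was done with PyText (tutorial linked earlier). As a rough, non-authoritative sketch of the same settings in plain PyTorch using the HuggingFace `transformers` XLM-R classes (an assumption for illustration, not the code that was actually used; it also omits the 8-GPU data-parallel part and assumes a recent `transformers` version):

```python
import torch
from transformers import XLMRobertaForSequenceClassification

# Hyperparameters from the list above.
GRAD_ACCUM = 2              # with 8 GPUs x batch 8 x 2 accumulation steps, effective batch = 128;
                            # this single-process sketch only shows the accumulation part
LR = 7.5e-6                 # constant Adam LR, no warmup or decay
BATCHES_PER_EPOCH = 10_000  # one "epoch" = 10K batches sampled from the training set
NUM_EPOCHS = 30

def evaluate(model, valid_loader, device):
    """Plain accuracy on the validation set."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for batch in valid_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            inputs = {k: v for k, v in batch.items() if k != 'labels'}
            preds = model(**inputs).logits.argmax(dim=-1)
            correct += (preds == batch['labels']).sum().item()
            total += batch['labels'].numel()
    return correct / max(1, total)

def finetune(train_loader, valid_loader, device='cuda'):
    """train_loader/valid_loader should yield dicts of input_ids, attention_mask and
    labels, tokenized with max_seq_length 256; building them from MultiNLI/XNLI is
    not shown here."""
    model = XLMRobertaForSequenceClassification.from_pretrained(
        'xlm-roberta-base', num_labels=3).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)

    best_acc, train_iter = 0.0, iter(train_loader)
    for epoch in range(NUM_EPOCHS):
        model.train()
        optimizer.zero_grad()
        for step in range(BATCHES_PER_EPOCH):
            try:
                batch = next(train_iter)
            except StopIteration:            # re-shuffle / restart the training stream
                train_iter = iter(train_loader)
                batch = next(train_iter)
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss / GRAD_ACCUM
            loss.backward()
            if (step + 1) % GRAD_ACCUM == 0:
                optimizer.step()
                optimizer.zero_grad()
        # Validate after every 10K-batch "epoch" and keep the best checkpoint.
        acc = evaluate(model, valid_loader, device)
        if acc > best_acc:
            best_acc = acc
            torch.save(model.state_dict(), 'best_checkpoint.pt')
    return best_acc
```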

Thanks for the link and details.

Thanks a lot @kartikayk !
By the way, what about the XLM-R Large model? Does it use the same hyperparameters as the Base model?

@kartikayk Thanks for your reply. Where can we find the latest checkpoint for XLMR-Base?

@kartikayk Hi! Thanks for the hyperparameters for XLMR-Base.
For the "Fine-tune multilingual model on English training set (Cross-lingual Transfer)" setting in Table 1 of https://arxiv.org/pdf/1911.02116.pdf:
What are the hyperparameters for XLMR-Large?
Also, when you select the best model in this setup, did you use the validation sets for all languages or only English?

What's the difference between the latest XLM-R checkpoint and the first checkpoint?

Also, are the improvements in the second version of the XNLI results in the XLM-R paper (v1 vs. v2) mainly from the pre-trained models, or from the fine-tuning strategies?
@kartikayk
