Fairseq: Which evaluation metric is used for each GLUE dataset in RoBERTa?

Created on 30 Dec 2019 · 5 comments · Source: pytorch/fairseq

Hi,

I have a question about the results of RoBERTa on GLUE.
According to the GLUE leaderboard, there are two different metrics for MRPC, STS, and QQP. Which evaluation metric do you use to compute the results for these datasets reported in the RoBERTa paper and on this page?

I tried to figure this out by aligning the ensemble test-set results of RoBERTa in Table 5 of the paper with the RoBERTa entry on the GLUE leaderboard. From that alignment, the evaluation metrics appear to be as follows:

  • STS: Pearson
  • MRPC: F1
  • QQP: Accuracy

However, this conflicts with related papers such as ELECTRA, which directly copies the results from the RoBERTa paper and states (in its Section 3.1) that the evaluation metrics are as follows:

  • STS: Spearman
  • MRPC: Accuracy
  • QQP: Accuracy

To conclude, I just want to know which metric is used for each GLUE dataset in the RoBERTa paper.

question

All 5 comments

CC @myleott @ngoyal2707

Following are the metrics used for the 3 tasks you mentioned (a short sketch of computing them follows the list):

  • STS: Pearson
  • MRPC: ACC
  • QQP: ACC
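
For reference, here is a minimal sketch of how these dev-set metrics could be computed from model predictions using scipy and scikit-learn. The prediction and label arrays below are made-up placeholders for illustration, not actual fairseq or RoBERTa outputs:

```python
# Hypothetical example of the dev-set metrics listed above.
# `*_preds` / `*_labels` are placeholder arrays, not real model outputs.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score

# STS-B is a regression task: report Pearson correlation.
sts_preds = [2.5, 4.0, 1.2, 3.8]
sts_labels = [3.0, 4.2, 1.0, 3.5]
sts_pearson, _ = pearsonr(sts_preds, sts_labels)

# MRPC and QQP are binary classification tasks: report accuracy.
mrpc_preds = [1, 0, 1, 1]
mrpc_labels = [1, 0, 0, 1]
mrpc_acc = accuracy_score(mrpc_labels, mrpc_preds)

print(f"STS-B Pearson: {sts_pearson:.4f}, MRPC Acc: {mrpc_acc:.4f}")
```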

However, in Table 5 of the RoBERTa paper, MRPC obtains 92.3 on the test set, and according to the GLUE leaderboard that score is F1, not accuracy (Acc). @ngoyal2707

@luofuli good catch. I think it's a mistake in our manuscript: we are reporting Acc for dev and F1 for test.
The systems are still comparable, since all systems report the same measures, but we will update the next version to make this clear.
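
To illustrate why the distinction matters, here is a small sketch (with made-up predictions, not RoBERTa outputs) showing that F1 and accuracy can diverge noticeably on a class-imbalanced binary task like MRPC:

```python
# Hypothetical MRPC-style predictions; F1 and accuracy differ on the same outputs.
from sklearn.metrics import accuracy_score, f1_score

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # skewed toward the positive class, like MRPC
preds  = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # two false positives

print("Acc:", accuracy_score(labels, preds))  # 0.8
print("F1 :", f1_score(labels, preds))        # ~0.857
```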

Thanks!

Thank you very much, @ngoyal2707.
