Fairseq: Which evaluation metric is used for each GLUE dataset in RoBERTa?

Created on 30 Dec 2019 · 5 comments · Source: pytorch/fairseq

Hi,

I have a question about the results of RoBERTa on GLUE.
According to the GLUE leaderboard, there are two different metrics for MRPC, STS, and QQP. Which evaluation metric do you use to compute the results for these datasets reported in the RoBERTa paper and on this page?

I tried to figure this out by aligning the ensemble test-set results of RoBERTa in Table 5 of the paper with the RoBERTa entry on the GLUE leaderboard. From that alignment, the evaluation metrics appear to be as follows:

  • STS: Pearson
  • MRPC: F1
  • QQP: Accuracy

However, this conflicts with related papers such as ELECTRA, which directly copies the results from the RoBERTa paper and states (in its Section 3.1) that the evaluation metrics are as follows:

  • STS: Spearman
  • MRPC: Accuracy
  • QQP: Accuracy

To conclude, I just want to know which metric is used for each GLUE dataset in the RoBERTa paper.

question

All 5 comments

CC @myleott @ngoyal2707

Following are the metrics used for the 3 tasks you mentioned (a short sketch of computing them follows the list):

  • STS: Pearson
  • MRPC: ACC
  • QQP: ACC
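
For reference, here is a minimal sketch of how these dev-set metrics could be computed from model predictions using scipy and scikit-learn. The prediction and label arrays below are made-up placeholders for illustration, not actual fairseq or RoBERTa outputs:

```python
# Hypothetical example of the dev-set metrics listed above.
# `*_preds` / `*_labels` are placeholder arrays, not real model outputs.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score

# STS-B is a regression task: report Pearson correlation.
sts_preds = [2.5, 4.0, 1.2, 3.8]
sts_labels = [3.0, 4.2, 1.0, 3.5]
sts_pearson, _ = pearsonr(sts_preds, sts_labels)

# MRPC and QQP are binary classification tasks: report accuracy.
mrpc_preds = [1, 0, 1, 1]
mrpc_labels = [1, 0, 0, 1]
mrpc_acc = accuracy_score(mrpc_labels, mrpc_preds)

print(f"STS-B Pearson: {sts_pearson:.4f}, MRPC Acc: {mrpc_acc:.4f}")
```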

However, in Table 5 of the RoBERTa paper, MRPC obtains 92.3 on the test set, and according to the GLUE leaderboard that score is F1, not accuracy (Acc). @ngoyal2707

@luofuli good catch. I think it's a mistake in our manuscript: we are reporting Acc for dev and F1 for test.
The systems are still comparable, since all systems report the same measures, but we will update the next version to make this clear.
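
To illustrate why the distinction matters, here is a small sketch (with made-up predictions, not RoBERTa outputs) showing that F1 and accuracy can diverge noticeably on a class-imbalanced binary task like MRPC:

```python
# Hypothetical MRPC-style predictions; F1 and accuracy differ on the same outputs.
from sklearn.metrics import accuracy_score, f1_score

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # skewed toward the positive class, like MRPC
preds  = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # two false positives

print("Acc:", accuracy_score(labels, preds))  # 0.8
print("F1 :", f1_score(labels, preds))        # ~0.857
```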

Thanks!

Thank you very much, @ngoyal2707.
