Hi,
I have a question about the results of RoBERTa on GLUE.
According to the GLUE leaderboard, there are two different metrics for MRPC, STS-B, and QQP. Which evaluation metric do you use to compute the results for these datasets shown in the RoBERTa paper and on this page?
I tried to figure this out by aligning the ensemble results of RoBERTa on the test set (Table 5 in the paper) with the RoBERTa entry on the GLUE leaderboard. From that, their evaluation metrics appear to be:

- STS: Pearson
- MRPC: F1
- QQP: Accuracy

However, this conflicts with some related papers such as ELECTRA. ELECTRA directly copies your results from the RoBERTa paper, but states (in Section 3.1) that its evaluation metrics are:
- STS: Spearman
- MRPC: Accuracy
- QQP: Accuracy

To conclude, I just want to know which metric is used for each dataset in the RoBERTa paper's GLUE results.
CC @myleott @ngoyal2707
Following are the metrics used for the 3 tasks you mentioned:
- STS: Pearson
- MRPC: Acc
- QQP: Acc

However, in Table 5 of the RoBERTa paper, MRPC obtains 92.3 on the test set. According to the GLUE leaderboard, this score is F1, not accuracy (Acc). @ngoyal2707

@luofuli good catch. I think it's a mistake in our manuscript: we are reporting Acc for dev and F1 for test.
The systems are still comparable, since all systems report the same measures, but we will update the next version to make this clear.
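For anyone comparing numbers across tables, the distinction between these metrics matters most on imbalanced data. Below is a minimal pure-Python sketch of the three metrics discussed here (the function names are illustrative, not from the RoBERTa or GLUE codebases):

```python
# Illustrative implementations of the GLUE metrics discussed above:
# Pearson correlation (STS-B), and F1 vs. accuracy (MRPC/QQP).
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def accuracy(gold, pred):
    """Fraction of exact label matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, positive=1):
    """Binary F1 for the positive class (the MRPC leaderboard metric)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# F1 and accuracy diverge on the same predictions, which is why it matters
# which one a results table reports:
gold = [1, 1, 1, 0]
pred = [1, 1, 0, 0]
print(accuracy(gold, pred))  # 0.75
print(f1(gold, pred))        # 0.8
```

This also shows why a dev-set Acc number and a test-set F1 number for MRPC are not directly comparable, even for the same model.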
Thanks!
Thank you very much. @ngoyal2707