Transformers: pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada

Created on 16 Apr 2019 · 7 Comments · Source: huggingface/transformers

For some reason I only see 26% accuracy when evaluating the GPT-2 checkpoint on Lambada, instead of the expected 45.99%.

Here's a file of predictions with sets of 3 lines of the form:

ground truth
predicted last_word
is_counted_as_error

Generated by this script

Could this be caused by the way the GPT-2 checkpoint was imported into HuggingFace?
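For context, here is a minimal sketch of this kind of last-word evaluation, assuming a plain-text lambada test file with one passage per line; the file name, generation length, and exact-match comparison are illustrative, not the script linked above:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def predict_last_word(passage, max_new_tokens=6):
    """Greedily continue the context and return (predicted word, ground truth)."""
    context, target = passage.rsplit(" ", 1)
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + max_new_tokens,
            do_sample=False,  # greedy decoding
        )
    continuation = tokenizer.decode(output[0, input_ids.shape[1]:]).strip()
    predicted = continuation.split()[0] if continuation else ""
    return predicted, target

correct = total = 0
with open("lambada_test.txt") as f:  # hypothetical file name
    for line in f:
        line = line.strip()
        if not line:
            continue
        predicted, target = predict_last_word(line)
        correct += int(predicted == target)
        total += 1
print(f"last-word accuracy: {correct / total:.4f}")
```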

Discussion

All 7 comments

Accuracy goes to 31% if I use a stop-word filter, but that still seems lower than expected (predictions)
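For reference, one possible interpretation of such a filter is to skip function words among the model's top-ranked next tokens; the stop-word list and `top_k` value below are assumptions, since the thread never pins down OpenAI's exact heuristic:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative stop-word list; OpenAI's actual filter is not specified.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "that", "it", "is", "was", "he", "she"}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def predict_with_stop_word_filter(context, top_k=50):
    """Return the highest-ranked next word that is not a stop word."""
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids)[0]  # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]
    for token_id in torch.topk(next_token_logits, top_k).indices.tolist():
        word = tokenizer.decode([token_id]).strip()
        if word.isalpha() and word.lower() not in STOP_WORDS:
            return word
    return ""
```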

Hi, I doubt it's a problem with the model. Usually the culprit is to be found in the pre-processing logic.

Your dataset seems to be pre-processed, but Radford, Wu et al. say they are using a version without preprocessing (end of section 3.3). GPT-2 is likely sensitive to tokenization issues and the like.

If you want to check the model itself, you could try comparing with the predictions of the TensorFlow version on a few Lambada completions.

Applying detokenization raises accuracy to 33.11%

I spot-checked a few errors against the TF implementation and it gives the same errors, so it seems likely the difference is due to the eval protocol rather than the checkpoint.

IMHO "without pre-processing" means taking the original dataset without modification, which is what I also did here.

However, in the original dataset everything is tokenized, i.e. "haven't" was turned into "have n't".
Either way, undoing this tokenization only yields an improvement of about 2%, so there must be some deeper underlying difference in the way OpenAI did their evaluation.
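A rough detokenizer along these lines, with a handful of illustrative (and certainly incomplete) rules for undoing the PTB-style tokenization in the lambada release:

```python
import re

def detokenize(text):
    """Undo some of the PTB-style tokenization in the lambada release."""
    text = re.sub(r" n't", "n't", text)                    # "have n't" -> "haven't"
    text = re.sub(r" '(s|re|ve|ll|d|m)\b", r"'\1", text)   # "it 's"    -> "it's"
    text = re.sub(r" ([.,!?;:])", r"\1", text)             # " ."       -> "."
    text = text.replace("`` ", '"').replace(" ''", '"')    # PTB quotes -> plain quotes
    return text

print(detokenize("they have n't seen it ."))
# they haven't seen it.
```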

Indeed. It's not very clear to me what they mean exactly by "stop-word filter". It seems like the kind of heuristic that can have a very large impact on performance.

Maybe better filtering is key. I would probably go with a sort of beam search to compute the probability of having a punctuation/end-of-sentence token after the predicted word and use that to filter the results.
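One way such a filter might look, assuming we rescore a handful of candidate last words by the probability that a sentence-ending token comes right after them; the choice of end tokens is an assumption:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Tokens treated as "end of sentence" for scoring purposes (an assumption).
END_TOKEN_IDS = [tokenizer.encode(t)[0] for t in [".", "!", "?", "\n"]]

def end_of_sentence_score(context, candidate):
    """Probability mass on sentence-ending tokens right after the candidate word."""
    ids = tokenizer.encode(context + " " + candidate, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[END_TOKEN_IDS].sum().item()

def rerank(context, candidates):
    """Keep the candidate most likely to be followed by an end-of-sentence token."""
    return max(candidates, key=lambda w: end_of_sentence_score(context, w))
```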

I spoke with Alec and it turns out that for evaluation they used the "raw" lambada corpus, which was obtained by finding the original sentences in BookCorpus that matched the tokenized versions in the lambada release. So to reproduce the numbers we need the "raw" corpus: https://github.com/openai/gpt-2/issues/131

I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:

  1. Evaluate on OpenAI's version of lambada, which adds extra formatting.
  2. Evaluate by counting the number of times the last BPE token is predicted incorrectly, instead of the last word; details are in https://github.com/openai/gpt-2/issues/131#issuecomment-497136199 (a minimal sketch of this criterion follows below).
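A minimal sketch of that second point, assuming OpenAI-formatted passages are already loaded into a list; the criterion is whether the greedy prediction for the final BPE token matches it, not whether the whole last word is right:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_bpe_token_correct(passage):
    """True if the argmax prediction for the final BPE token matches it."""
    ids = tokenizer.encode(passage, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]  # (1, seq_len, vocab_size)
    # Logits at position -2 predict the token at position -1.
    return logits[0, -2].argmax().item() == ids[0, -1].item()

passages = ["... she put the flowers in the vase"]  # toy example; real eval uses OpenAI's file
accuracy = sum(last_bpe_token_correct(p) for p in passages) / len(passages)
print(f"last-BPE-token accuracy: {accuracy:.4f}")
```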
