Transformers: pretrained GPT-2 checkpoint gets only 31% accuracy on Lambada

Created on 16 Apr 2019 · 7 Comments · Source: huggingface/transformers

For some reason I only see 26% accuracy when evaluating the GPT-2 checkpoint on Lambada, instead of the expected 45.99%.

Here's a file of predictions with sets of 3 lines of the form:

ground truth
predicted last_word
is_counted_as_error

Generated by this script

Could this be caused by the way the GPT-2 checkpoint was imported into HuggingFace?
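For context, here is a minimal sketch of this kind of last-word evaluation, assuming a plain-text lambada test file with one passage per line; the file name, generation length, and exact-match comparison are illustrative, not the script linked above:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def predict_last_word(passage, max_new_tokens=6):
    """Greedily continue the context and return (predicted word, ground truth)."""
    context, target = passage.rsplit(" ", 1)
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + max_new_tokens,
            do_sample=False,  # greedy decoding
        )
    continuation = tokenizer.decode(output[0, input_ids.shape[1]:]).strip()
    predicted = continuation.split()[0] if continuation else ""
    return predicted, target

correct = total = 0
with open("lambada_test.txt") as f:  # hypothetical file name
    for line in f:
        line = line.strip()
        if not line:
            continue
        predicted, target = predict_last_word(line)
        correct += int(predicted == target)
        total += 1
print(f"last-word accuracy: {correct / total:.4f}")
```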

Discussion

All 7 comments

Accuracy goes to 31% if I use a stop-word filter, but that still seems lower than expected (predictions)
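For reference, one possible interpretation of such a filter is to skip function words among the model's top-ranked next tokens; the stop-word list and `top_k` value below are assumptions, since the thread never pins down OpenAI's exact heuristic:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative stop-word list; OpenAI's actual filter is not specified.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "that", "it", "is", "was", "he", "she"}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def predict_with_stop_word_filter(context, top_k=50):
    """Return the highest-ranked next word that is not a stop word."""
    input_ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids)[0]  # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]
    for token_id in torch.topk(next_token_logits, top_k).indices.tolist():
        word = tokenizer.decode([token_id]).strip()
        if word.isalpha() and word.lower() not in STOP_WORDS:
            return word
    return ""
```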

Hi, I doubt it's a problem with the model. Usually the culprit is to be found in the pre-processing logic.

Your dataset seems to be pre-processed, but Radford, Wu et al. say they are using a version without preprocessing (end of section 3.3). GPT-2 is likely sensitive to tokenization issues and the like.

If you want to check the model itself, you could try comparing with the predictions of the TensorFlow version on a few Lambada completions.

Applying detokenization raises accuracy to 33.11%

I spot-checked a few errors against the TF implementation and it gives the same errors, so it seems likely the difference is due to the eval protocol rather than the checkpoint.

IMHO "without pre-processing" means taking the original dataset without modification, which is what I also did here.

However, in the original dataset everything is tokenized, i.e. "haven't" was turned into "have n't".
Either way, undoing this tokenization only yields an improvement of about 2%, so there must be some deeper underlying difference in the way OpenAI did their evaluation.
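A rough detokenizer along these lines, with a handful of illustrative (and certainly incomplete) rules for undoing the PTB-style tokenization in the lambada release:

```python
import re

def detokenize(text):
    """Undo some of the PTB-style tokenization in the lambada release."""
    text = re.sub(r" n't", "n't", text)                    # "have n't" -> "haven't"
    text = re.sub(r" '(s|re|ve|ll|d|m)\b", r"'\1", text)   # "it 's"    -> "it's"
    text = re.sub(r" ([.,!?;:])", r"\1", text)             # " ."       -> "."
    text = text.replace("`` ", '"').replace(" ''", '"')    # PTB quotes -> plain quotes
    return text

print(detokenize("they have n't seen it ."))
# they haven't seen it.
```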

Indeed. It's not very clear to me what they mean exactly by "stop-word filter". It seems like the kind of heuristic that can have a very large impact on performance.

Maybe better filtering is key. I would probably go with a sort of beam search to compute the probability of having a punctuation/end-of-sentence token after the predicted word and use that to filter the results.
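One way such a filter might look, assuming we rescore a handful of candidate last words by the probability that a sentence-ending token comes right after them; the choice of end tokens is an assumption:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Tokens treated as "end of sentence" for scoring purposes (an assumption).
END_TOKEN_IDS = [tokenizer.encode(t)[0] for t in [".", "!", "?", "\n"]]

def end_of_sentence_score(context, candidate):
    """Probability mass on sentence-ending tokens right after the candidate word."""
    ids = tokenizer.encode(context + " " + candidate, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[END_TOKEN_IDS].sum().item()

def rerank(context, candidates):
    """Keep the candidate most likely to be followed by an end-of-sentence token."""
    return max(candidates, key=lambda w: end_of_sentence_score(context, w))
```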

I spoke with Alec and it turns out that for evaluation they used the "raw" lambada corpus, which was obtained by finding the original sentences in BookCorpus that matched the tokenized versions in the lambada release. So to reproduce the numbers we need the "raw" corpus: https://github.com/openai/gpt-2/issues/131

I'm now able to get within 1% of their reported accuracy on GPT2-small. The two missing modifications were:

  1. Evaluate on OpenAI's version of lambada, which adds extra formatting.
  2. Evaluate by counting the number of times the last BPE token is predicted incorrectly, instead of the last word; details are in https://github.com/openai/gpt-2/issues/131#issuecomment-497136199 (a minimal sketch of this criterion follows below).
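A minimal sketch of that second point, assuming OpenAI-formatted passages are already loaded into a list; the criterion is whether the greedy prediction for the final BPE token matches it, not whether the whole last word is right:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_bpe_token_correct(passage):
    """True if the argmax prediction for the final BPE token matches it."""
    ids = tokenizer.encode(passage, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]  # (1, seq_len, vocab_size)
    # Logits at position -2 predict the token at position -1.
    return logits[0, -2].argmax().item() == ids[0, -1].item()

passages = ["... she put the flowers in the vase"]  # toy example; real eval uses OpenAI's file
accuracy = sum(last_bpe_token_correct(p) for p in passages) / len(passages)
print(f"last-BPE-token accuracy: {accuracy:.4f}")
```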
