I wonder if the GPT-2 model has some examples of how to do fine-tuning, like GPT. The DoubleHeadsModel interface of GPT-2 looks similar to GPT's, but there's no special-token handling for the GPT-2 tokenizer. Is that necessary?
What kind of task are you fine-tuning on? If it's something like the ROCStories task, you need the extra tokens. I think people are using BERT for such downstream tasks because bidirectional context gives better results than left-to-right.
Hi @yaroslavvb, I am mostly focusing on classification tasks (like ROCStories, as you mentioned). I just want to confirm that the special tokens used by GPT and GPT-2 should be treated the same way: added to the end of the vocabulary and fed in along with the original text.
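For concreteness, here is a minimal sketch of that step, assuming a recent version of the transformers library; the token string `[CLS]` and the `"gpt2"` checkpoint name are just illustrative choices, not anything GPT-2 was trained with:

```python
# Minimal sketch: append a classification token to the GPT-2 vocabulary,
# grow the model's embedding matrix to match, and feed the token in with the text.
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# Add the special token to the end of the vocabulary ...
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
# ... and allocate a (randomly initialized) embedding row for it.
model.resize_token_embeddings(len(tokenizer))

# The token is then fed in along with the original text, e.g. appended to each choice.
input_ids = tokenizer.encode("Some story context and one candidate ending. [CLS]",
                             return_tensors="pt")
```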
We would like to run both BERT and GPT-2, which are the two SOTA models, since they shine in different ways.
I will be happy to submit a pull request if this is a valid extension to the code base or if anyone is interested.
Indeed we should probably add additional embeddings for GPT-2 also.
I'll give it a look, should be pretty easy to add.
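For anyone following along, what "additional embeddings" amounts to is roughly the sketch below; it is illustrative only, not the library's actual code, and `extend_embeddings` is a made-up helper name (the library's resize utility does essentially this for you):

```python
# Illustrative only: grow the token embedding matrix by num_special rows,
# keep the pretrained rows, and randomly initialize the new ones.
import torch.nn as nn

def extend_embeddings(old_embed: nn.Embedding, num_special: int) -> nn.Embedding:
    old_num, dim = old_embed.weight.shape
    new_embed = nn.Embedding(old_num + num_special, dim)
    new_embed.weight.data.normal_(mean=0.0, std=0.02)        # GPT-2's init scale
    new_embed.weight.data[:old_num] = old_embed.weight.data  # keep pretrained rows
    return new_embed
```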
I just tried implementing the changes suggested in order to make GPT-2 amenable to being fine-tuned on the Cloze Story Task from ROCStories. However, my eval accuracy seems to be topping out at 68%. Is this what others are getting?
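For context, a rough sketch of how one two-choice example can be fed to GPT2DoubleHeadsModel, assuming the `[CLS]` setup above (story text, padding choice, and shapes are illustrative, not the exact training script):

```python
# Rough sketch of feeding one two-choice ROCStories example to GPT2DoubleHeadsModel.
import torch
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

story = "Karen packed for her first year of college."            # illustrative text
endings = ["She got along well with her new roommate.",
           "She decided to stay home instead."]

# One sequence per choice, each ending with the classification token.
encoded = [tokenizer.encode(f"{story} {e} [CLS]") for e in endings]
max_len = max(len(ids) for ids in encoded)
padded = [ids + [0] * (max_len - len(ids)) for ids in encoded]    # pad id is arbitrary here

input_ids = torch.tensor(padded).unsqueeze(0)                     # (batch=1, choices=2, seq_len)
mc_token_ids = torch.tensor([[len(ids) - 1 for ids in encoded]])  # position of [CLS] in each choice
mc_labels = torch.tensor([0])                                     # index of the correct ending

outputs = model(input_ids=input_ids, mc_token_ids=mc_token_ids, mc_labels=mc_labels)
print(outputs.mc_loss, outputs.mc_logits)                         # classification loss and logits
```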
I also want to fine-tune GPT-2 for QA and run it on SQuAD. I am new to the field. Should I be following BertForQuestionAnswering and the BERT SQuAD example as a model for doing the same with GPT-2?
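In case it's useful, here is a hypothetical sketch of what a GPT-2 analogue of BertForQuestionAnswering could look like; this class does not exist in the library, the names are made up, and note that GPT-2's left-to-right attention means each span logit only sees preceding context:

```python
# Hypothetical span-prediction head on top of GPT-2, mirroring BertForQuestionAnswering.
# Not part of the library; it only illustrates the analogy.
import torch.nn as nn
from transformers import GPT2Model

class GPT2ForQuestionAnswering(nn.Module):
    def __init__(self, pretrained_name="gpt2"):
        super().__init__()
        self.transformer = GPT2Model.from_pretrained(pretrained_name)
        self.qa_outputs = nn.Linear(self.transformer.config.n_embd, 2)  # start/end logits

    def forward(self, input_ids, attention_mask=None):
        hidden = self.transformer(input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        start_logits, end_logits = self.qa_outputs(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)         # (batch, seq_len) each
```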
Hi @rohuns, I ran into the same situation: the performance of the pre-trained GPT-2 with an extra task head is poor on both ROCStories and STS. I don't really know the reason.
My hypothesis is that GPT-2 was not trained in the same "multi-task" fashion as GPT-1, so adding the special tokens either breaks the model or at least makes it hard to generate good sentence representations for the downstream tasks.
I hope someone can get fine-tuning to work and disprove the above hypothesis.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.