Hi all,
I would like to fine-tune the pretrained gpt2 model on a newspaper dataset. Do you know how that would be possible? I haven't found any training script for gpt2.
Thanks a lot.
Hi, we have an example to fine-tune several models on language modeling here.
You can look into GPT-2's training on the CLM task, which is done on WikiText-2 in this example.
@LysandreJik would you please provide an example of usage?
In the code, WikiText-2 is only mentioned in a docstring.
I believe the input file is a text file without any newlines, right?
Can't we pass an input file with one sentence per line?
Good catch, it was initially made for WikiText-2 but it was generalized to be used with any text file. ~I'll add an example of usage shortly in our Documentation section.~ An example is now available in the documentation.
You can run it like so:
python run_lm_finetuning.py \
--train_data_file=$TEXT_FILE \
--output_dir=$OUTPUT_DIRECTORY \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train
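If you also want to track evaluation loss on a held-out text file, the same script exposes evaluation and batch-size flags; the exact flag names below are taken from the version of run_lm_finetuning.py I have in mind and may differ in yours, so treat this as a sketch:
python run_lm_finetuning.py \
    --train_data_file=$TEXT_FILE \
    --eval_data_file=$EVAL_TEXT_FILE \
    --output_dir=$OUTPUT_DIRECTORY \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --per_gpu_train_batch_size=2 \
    --num_train_epochs=3 \
    --do_train \
    --do_eval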
You don't need to remove the newlines in your text file; it all depends on what you're looking for. If you keep the line returns, the model will learn to generate line returns as well.
You can easily change the way the model inputs are built by changing the TextDataset class.
Right now, with:
while len(tokenized_text) >= block_size:  # Truncate in block of block_size
    self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
    tokenized_text = tokenized_text[block_size:]
We are simply creating token lists (of size block_size) that will then be fed to the model. We are not doing any special preprocessing (such as removing the line returns).
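If you would rather build one training example per line (as asked above), here is a minimal sketch of how that loop could be changed, assuming the same tokenizer object and the text variable read from the input file; this is a hypothetical modification, not what the shipped script does:

# Hypothetical alternative: build one example per non-empty line
for line in text.splitlines():
    line = line.strip()
    if not line:
        continue
    tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
    # Keep at most block_size tokens per example
    self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_line[:block_size]))

Note that examples of unequal length would then need padding (or a batch size of 1) when collated, which the fixed-block approach above avoids.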
@LysandreJik Great thanks.
The current version of the TextDataset class will concatenate text from different articles (if any) together, right? I mean, there is no notion of separate documents (articles) and it's all a continuous collection of tokens?
That's true. If you're looking to get the best prediction out of it, you should be careful that unrelated pieces of text are not concatenated in a single input. We didn't do it in that example for simplicity's sake.
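For instance, one rough way to keep articles separate is to chunk each article on its own, assuming articles are delimited by blank lines in the input file; this is a sketch of a possible TextDataset modification, not something the example script does:

# Hypothetical preprocessing: chunk each article separately,
# assuming articles are separated by blank lines in the file
for article in text.split("\n\n"):
    tokenized_article = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(article))
    while len(tokenized_article) >= block_size:
        self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_article[:block_size]))
        tokenized_article = tokenized_article[block_size:]
    # Leftover tokens at the end of an article are dropped here for simplicity

You could also append GPT-2's end-of-text token between articles instead, if you prefer to keep a single token stream.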
@LysandreJik in Line 76 of the code:
self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
If a model other than BERT is used, then the tokenizer does not make use of special tokens, right? Is that only applicable to BERT?
Both BERT and RoBERTa use special tokens. For GPT and GPT-2, no special token will be added using this method, since, as you said, they do not make use of special tokens.
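To make that concrete, here is a small check you can run; depending on your installed version the package may be pytorch_transformers instead of transformers, and the method may be named build_inputs_with_special_tokens instead of add_special_tokens_single_sentence, so treat the exact names as an assumption:

from transformers import BertTokenizer, GPT2Tokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

bert_ids = bert_tok.convert_tokens_to_ids(bert_tok.tokenize("hello world"))
gpt2_ids = gpt2_tok.convert_tokens_to_ids(gpt2_tok.tokenize("hello world"))

# BERT wraps the sequence as [CLS] ... [SEP]
print(bert_tok.add_special_tokens_single_sentence(bert_ids))
# GPT-2 returns the ids unchanged: no special tokens are added
print(gpt2_tok.add_special_tokens_single_sentence(gpt2_ids))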
In the code you mentioned that we might want to add model-specific padding. I wonder if GPT-2 has padding implemented? If yes, does it accept right-side zero padding similar to BERT?
I want to fine-tune GPT-2 on a dataset in which each instance is generally shorter than 65 tokens, and I want to make them all the same length by adding zero padding up to a max_length of 128.
Any idea?
How can we add a [CLS] token to the beginning of every input for gpt2 (and add it to the vocabulary) and fine-tune it?
I see an example of adding [CLS] in modeling_gpt2.py for the GPT2DoubleHeadsModel class. I wonder if we can fine-tune gpt2 with an added [CLS] token?
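For reference, the pattern shown in the GPT2DoubleHeadsModel docstring boils down to registering the new token and resizing the embedding matrix; a minimal sketch (the fine-tuning loop itself is up to you):

from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# Register [CLS] as a special token and grow the embeddings to match
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
model.resize_token_embeddings(len(tokenizer))

# The new token can now be appended to every input before fine-tuning
input_ids = tokenizer.encode("Hello, my dog is cute [CLS]")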
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
> In the code you mentioned that we might want to add model-specific padding. I wonder if GPT-2 has padding implemented? If yes, does it accept right-side zero padding similar to BERT?
> I want to fine-tune GPT-2 on a dataset in which each instance is generally shorter than 65 tokens, and I want to make them all the same length by adding zero padding up to a max_length of 128.
> Any idea?
I think you can use ANY token for padding, since GPT-2 is causal. You just need to mask out those positions when calculating the loss.
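A minimal sketch of what that looks like, assuming right-side padding and the standard -100 label-masking convention of PyTorch's cross-entropy loss (pad_id and max_length are illustrative names):

import torch

def pad_example(input_ids, max_length, pad_id):
    # Padded positions get attention_mask 0 and label -100,
    # so they contribute nothing to attention or to the loss
    pad_len = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    labels = input_ids + [-100] * pad_len
    input_ids = input_ids + [pad_id] * pad_len
    return torch.tensor(input_ids), torch.tensor(attention_mask), torch.tensor(labels)

GPT-2 has no pad token of its own, so people often reuse the end-of-text token (or any id) as pad_id; since those positions are masked in both the attention mask and the labels, the choice does not affect the loss.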