Hi all,
I would like to fine-tune the pretrained gpt2 model on a newspaper dataset. Do you know how that would be possible? I haven't found any training script for gpt2.
Thanks a lot.
Hi, we have an example to fine-tune several models on language modeling here.
You can look into GPT-2's training on the CLM task, which is done on WikiText-2 in this example.
@LysandreJik would you please provide an example of usage?
In the code, WikiText-2 is only mentioned in a docstring.
I believe the input file is a text file without any newlines, right?
Can't we pass an input file with one sentence per line?
Good catch, it was initially made for WikiText-2 but it was generalized to be used with any text file. ~I'll add an example of usage shortly in our Documentation section.~ An example is now available in the documentation.
You can run it like so:
python run_lm_finetuning.py \
--train_data_file=$TEXT_FILE \
--output_dir=$OUTPUT_DIRECTORY \
--model_type=gpt2 \
--model_name_or_path=gpt2 \
--do_train
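If you also want to track evaluation loss on a held-out text file, the same script exposes evaluation and batch-size flags; the exact flag names below are taken from the version of run_lm_finetuning.py I have in mind and may differ in yours, so treat this as a sketch:
python run_lm_finetuning.py \
    --train_data_file=$TEXT_FILE \
    --eval_data_file=$EVAL_TEXT_FILE \
    --output_dir=$OUTPUT_DIRECTORY \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --per_gpu_train_batch_size=2 \
    --num_train_epochs=3 \
    --do_train \
    --do_eval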
You don't need to remove the newlines in your text file; it all depends on what you're looking for. If you keep the line returns, the model will learn to generate line returns as well.
You can easily change the way the model inputs are built by changing the TextDataset class.
Right now, with:
while len(tokenized_text) >= block_size:  # Truncate in block of block_size
    self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
    tokenized_text = tokenized_text[block_size:]
We are simply creating token lists (of size block_size) that will then be fed to the model. We are not doing any special preprocessing (such as removing the line returns).
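If you would rather build one training example per line (as asked above), here is a minimal sketch of how that loop could be changed, assuming the same tokenizer object and the text variable read from the input file; this is a hypothetical modification, not what the shipped script does:

# Hypothetical alternative: build one example per non-empty line
for line in text.splitlines():
    line = line.strip()
    if not line:
        continue
    tokenized_line = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
    # Keep at most block_size tokens per example
    self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_line[:block_size]))

Note that examples of unequal length would then need padding (or a batch size of 1) when collated, which the fixed-block approach above avoids.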
@LysandreJik Great thanks.
The current version of the TextDataset class will concatenate text from different articles (if any) together, right? I mean, there is no notion of separate documents (articles) and it's all a continuous collection of tokens?
That's true. If you're looking to get the best prediction out of it, you should be careful that unrelated pieces of text are not concatenated in a single input. We didn't do it in that example for simplicity's sake.
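For instance, one rough way to keep articles separate is to chunk each article on its own, assuming articles are delimited by blank lines in the input file; this is a sketch of a possible TextDataset modification, not something the example script does:

# Hypothetical preprocessing: chunk each article separately,
# assuming articles are separated by blank lines in the file
for article in text.split("\n\n"):
    tokenized_article = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(article))
    while len(tokenized_article) >= block_size:
        self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_article[:block_size]))
        tokenized_article = tokenized_article[block_size:]
    # Leftover tokens at the end of an article are dropped here for simplicity

You could also append GPT-2's end-of-text token between articles instead, if you prefer to keep a single token stream.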
@LysandreJik in Line 76 of the code:
self.examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[:block_size]))
If a model other than BERT is used, then the tokenizer does not make use of special tokens, right? Is that only applicable to BERT?
Both BERT and RoBERTa use special tokens. For GPT and GPT-2, no special token will be added using this method, since, as you said, they do not make use of special tokens.
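To make that concrete, here is a small check you can run; depending on your installed version the package may be pytorch_transformers instead of transformers, and the method may be named build_inputs_with_special_tokens instead of add_special_tokens_single_sentence, so treat the exact names as an assumption:

from transformers import BertTokenizer, GPT2Tokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

bert_ids = bert_tok.convert_tokens_to_ids(bert_tok.tokenize("hello world"))
gpt2_ids = gpt2_tok.convert_tokens_to_ids(gpt2_tok.tokenize("hello world"))

# BERT wraps the sequence as [CLS] ... [SEP]
print(bert_tok.add_special_tokens_single_sentence(bert_ids))
# GPT-2 returns the ids unchanged: no special tokens are added
print(gpt2_tok.add_special_tokens_single_sentence(gpt2_ids))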
In the code you mentioned that we might want to add model-specific padding. I wonder if GPT-2 has padding implemented? If yes, does it accept right-side zero padding similar to BERT?
I want to fine-tune GPT-2 on a dataset in which each instance is generally shorter than 65 tokens, and I want to make them all the same length by adding zero padding up to a max_length of 128.
Any idea?
How can we add a [CLS] token to the beginning of every input for gpt2 (and add it to the vocabulary) and fine-tune it?
I see an example of adding [CLS] in modeling_gpt2.py for the GPT2DoubleHeadsModel class. I wonder if we can fine-tune gpt2 with an added [CLS] token?
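For reference, the pattern shown in the GPT2DoubleHeadsModel docstring boils down to registering the new token and resizing the embedding matrix; a minimal sketch (the fine-tuning loop itself is up to you):

from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# Register [CLS] as a special token and grow the embeddings to match
tokenizer.add_special_tokens({"cls_token": "[CLS]"})
model.resize_token_embeddings(len(tokenizer))

# The new token can now be appended to every input before fine-tuning
input_ids = tokenizer.encode("Hello, my dog is cute [CLS]")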
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
> In the code you mentioned that we might want to add model-specific padding. I wonder if GPT-2 has padding implemented? If yes, does it accept right-side zero padding similar to BERT?
> I want to fine-tune GPT-2 on a dataset in which each instance is generally shorter than 65 tokens, and I want to make them all the same length by adding zero padding up to a max_length of 128.
> Any idea?
I think you can use ANY token for padding, since GPT-2 is causal. You just need to mask out those positions when calculating the loss.
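A minimal sketch of what that looks like, assuming right-side padding and the standard -100 label-masking convention of PyTorch's cross-entropy loss (pad_id and max_length are illustrative names):

import torch

def pad_example(input_ids, max_length, pad_id):
    # Padded positions get attention_mask 0 and label -100,
    # so they contribute nothing to attention or to the loss
    pad_len = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    labels = input_ids + [-100] * pad_len
    input_ids = input_ids + [pad_id] * pad_len
    return torch.tensor(input_ids), torch.tensor(attention_mask), torch.tensor(labels)

GPT-2 has no pad token of its own, so people often reuse the end-of-text token (or any id) as pad_id; since those positions are masked in both the attention mask and the labels, the choice does not affect the loss.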