Transformers: Best way to fine tune GPT-2 in order to create a custom text generator?

Created on 13 Nov 2019 · 9 comments · Source: huggingface/transformers

Hello to everyone, and thanks for this wonderful work.

I am new to this library, and I would appreciate some help with a task I want to accomplish, just to know whether I am approaching it correctly: creating a custom English text generator that, given an input (a title or sentence), generates 200-300 words based on that input.

My questions are:
1) I have prepared my dataset (each input basically consists of a title and a body). Which script should I use to fine-tune GPT-2: run_lm_finetuning.py? How many epochs/iterations do you suggest for fine-tuning? How large should the dataset be?

2) Once I have fine-tuned GPT-2, how do I generate my custom text, giving a title/sentence as input and using the fine-tuned model?

Thanks a lot

wontfix

All 9 comments

Hi, you can use a combination of the scripts run_lm_finetuning.py and run_generation.py to accomplish what you want:

  • Fine-tune GPT-2 on your dataset using run_lm_finetuning.py. The default parameters should work well enough; I usually use three epochs (rather than the default 1) when training on small datasets. I have had success with datasets as small as a few tens of MB, but I have never tried with less.

  • Generate text using run_generation.py, specifying your custom checkpoint and a --length of 200 or 300 (a sketch of this generation step in plain Python follows below).
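
If you prefer to do the generation step directly in Python rather than through run_generation.py, a minimal sketch looks like the following. It assumes a reasonably recent transformers version (one that provides model.generate), and that run_lm_finetuning.py saved the fine-tuned model and tokenizer to an output/ directory; the prompt text is just a placeholder.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Load the fine-tuned checkpoint (the path is whatever --output_dir you used).
    tokenizer = GPT2Tokenizer.from_pretrained("output")
    model = GPT2LMHeadModel.from_pretrained("output")
    model.eval()

    prompt = "Title: How to brew better coffee"  # hypothetical title/sentence
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + 250,  # roughly 200-300 generated tokens
            do_sample=True,
            top_k=50,
            top_p=0.95,
        )

    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))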

Thanks @LysandreJik!

Can you point me to how I should organize my dataset file(s), or where to look within the repository? Also, does fine-tuning handle OOV words?

Thanks again

Merge your files into one, separating them with the <|endoftext|> token. You could also split the dataset into two files in order to have a separate evaluation set; I use 90% for training and 10% for evaluation. Don't remove any stopwords etc.; GPT-2 will do the rest. Keep in mind that GPT-2 is powerful enough to learn your dataset very closely, so it may overfit if you don't have enough data. For example, train GPT-2 with just 10 MB of data and you'll see it won't generate anything other than what it learnt from the dataset.
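
A minimal sketch of that preparation step, assuming each document (title plus body) sits in its own .txt file under a docs/ folder (the paths and the 90/10 ratio are just the conventions described above):

    import glob
    import random

    # Read every document; each file is assumed to hold one title + body.
    docs = []
    for path in sorted(glob.glob("docs/*.txt")):
        with open(path, encoding="utf-8") as f:
            docs.append(f.read().strip())

    # Shuffle, then split 90/10 into training and evaluation sets.
    random.shuffle(docs)
    split = int(len(docs) * 0.9)

    with open("train.txt", "w", encoding="utf-8") as f:
        f.write("\n<|endoftext|>\n".join(docs[:split]))

    with open("eval.txt", "w", encoding="utf-8") as f:
        f.write("\n<|endoftext|>\n".join(docs[split:]))

The two resulting files can then be passed to run_lm_finetuning.py as its training and evaluation files (the exact flag names depend on your transformers version; check the script's --help).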

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Did you use a special token to separate the title and the corpus?

I think this would depend on what specifically you are inputting. If it's a title that you want to be part of the body (part of the first sentence), then you wouldn't want to break that sentence up with a separator token. If it's a title that you want the document to derive the topic from but not include as part of the body, then a separator token might be helpful to prevent the model from expanding the title to form the body. :)

If it's a title that you want the document to derive the topic from but not include as part of the body, then a separator token might be helpful to prevent the model from expanding the title to form the body. :)

So in essence, if you want the title to be used as context but not included as part of the body, you should structure the data as:

Title: This is a great title
<|endoftext|>
Titles are one of the greatest inventions of humanity. 
Well crafted titles continue to save countless man-years by not requiring readers to actually read the article.
<|endoftext|>
Title: ...

Is that what you mean? Because intuitively I would assume that this wouldn't work as intended: since the title is separated off by a different token, it shouldn't influence the next tokens. Or am I missing something?

Same question. Is there a different token for separating the title and the body?

You could add a special token to the tokenizer and train on a dataset formatted with it.
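
A rough sketch of what that could look like. The <|title|> token name here is arbitrary (pick whatever separator you like); the key steps are registering the token and resizing the model's embeddings to match the new vocabulary.

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Register the new separator token and resize the embedding matrix so the
    # model has a (randomly initialized) embedding for it to learn during fine-tuning.
    tokenizer.add_special_tokens({"additional_special_tokens": ["<|title|>"]})
    model.resize_token_embeddings(len(tokenizer))

    # Training examples would then be formatted along the lines of:
    #   This is a great title <|title|> Body text of the article ... <|endoftext|>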

