Transformers: How is it possible to further tune GPT-2 (or GPT) in a seq2seq manner?

Created on 8 Oct 2019 · 10 comments · Source: huggingface/transformers

Hi,

Can we further fine-tune a GPT-2 pretrained model in a sequence-to-sequence manner, where we minimize the negative log-likelihood -log p(y|x)?
In other words, our dataset has both source and target, and we want to generate the target given the source.
But I want to start from the GPT-2 weights and then tune them.

wontfix

All 10 comments

Hi, this is on our mid-term roadmap (seq2seq models).

@Hannabrahman In the original GPT-2 paper (section 3.7, Translation) the authors used the format "english sentence = french sentence" to produce translations. You can definitely fine-tune the model with this format, using the existing scripts, if you structure your seq2seq data this way.
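
For illustration, formatting your pairs might look something like the sketch below; the [SOS]/[EOS] markers, the helper function, and the example data are assumptions made for the example, not code from the repo:

    # Minimal sketch: turn (source, target) pairs into single GPT-2 training strings.
    pairs = [
        ("I like tea.", "J'aime le thé."),
        ("Where is the station?", "Où est la gare ?"),
    ]

    def format_pair(src, tgt, sos="[SOS]", eos="[EOS]"):
        # "source = target" mirrors the translation prompt format from the GPT-2 paper.
        return f"{sos} {src} = {tgt} {eos}"

    training_texts = [format_pair(s, t) for s, t in pairs]
    # e.g. "[SOS] I like tea. = J'aime le thé. [EOS]"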

@dvaltchanov and @thomwolf thanks for pointing that out to me.
Do you think I need to pass another input to the forward method of the GPT LM head model (a list containing the length of each source sequence) so that I can zero out the loss calculated for the source tokens?
In other words, do I have to mask the lm_logits associated with the source-sequence tokens so that they are not counted in the loss calculation?

Or does it not matter if we include the loss on the source tokens in our total loss?

@Hannabrahman Based on my tests, it doesn't matter if you include them. Your total loss will be higher, but you're mainly interested in the validation loss on the translations anyway. As long as you use the "start of text" and "end of text" tokens to wrap your "sequence = sequence" text, the model seems to figure it out after a little bit of fine-tuning.
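
If you did want to zero out the loss on the source tokens, here is a rough sketch of one way to do it; the prefix-length bookkeeping and the example strings are assumptions, not code from any script in this repo. The idea is to mark the source positions in the labels with -100 so the cross-entropy loss ignores them:

    import torch
    from torch.nn import CrossEntropyLoss
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Hypothetical "source = target" training string.
    text = "something in English = quelque chose en français <|endoftext|>"
    source_prefix = "something in English ="

    input_ids = torch.tensor([tokenizer.encode(text)])
    lm_logits = model(input_ids)[0]            # (batch, seq_len, vocab_size)

    labels = input_ids.clone()
    prefix_len = len(tokenizer.encode(source_prefix))
    labels[:, :prefix_len] = -100              # mask the source tokens

    # Shift so that position i predicts token i + 1, as in the model's own LM loss.
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    # ignore_index=-100 means the masked source positions contribute no loss.
    loss_fct = CrossEntropyLoss(ignore_index=-100)
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

As noted above, in practice including the source tokens mostly just raises the reported training loss; masking them like this only changes which positions receive a training signal.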

@dvaltchanov Thanks.
Just one question, since you have experimented with this.
I want to fine-tune GPT on a new dataset using the format you described and this script, which is for fine-tuning a pretrained model on a new dataset.

1- Should I add special tokens ([SOS], a separator token between source and target, and [EOS]) and train it like this:

    # Add [SOS], [SEP] and [EOS] to the vocabulary (their embeddings need to be trained too!)
    tokenizer.add_special_tokens({'bos_token': '[SOS]', 'sep_token': '[SEP]', 'eos_token': '[EOS]'})
    model.resize_token_embeddings(len(tokenizer))  # Update the model embeddings with the new vocabulary size

2- The instances in my dataset have different lengths (60-85 tokens). I either have to trim them to the same size (which is not really good for my use case) or pad them to the same size. However, I read somewhere in this repo that GPT and GPT-2 don't handle right padding. How did you solve this issue while fine-tuning GPT on your own use case and dataset?

Many thanks in advance.

@Hannabrahman Great questions:

  1. This is up to you. The model can learn the sequence of known tokens (e.g. "[", "E", "OS", "]") and use that as a prompt. I used such a sequence and found that it worked well enough, so I did not try adding extra tokens. There is already an "<|endoftext|>" token in the vocabulary which you can leverage.

  2. I created a custom data loader which concatenates the desired sample with randomly selected sequences from the data up to the desired length. E.g., a training sample may be a concatenation of translation samples #1 and #32, which would look like this: "[SOS] something in English_#1 = something in French_#1 [EOS] [SOS] something in English_#32 = something in French_#32 [EOS] [SOS] .. etc"

This then gets tokenized and truncated to the max length. This will allow the model to learn variable length sequences.

You can accomplish the same effect by concatenating all of your text into a single string and sampling sections of it. However, if you do this the model will learn associations between neighbouring samples over multiple epochs, so I recommend having something that shuffles the order of concatenated samples each epoch.

During generation you prompt with "[SOS] something in English = " and stop generating when it produces an [EOS] token.
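
To make that concrete, a rough sketch of such a data loader is below; the class name, helper names, and block size are assumptions for illustration rather than my actual code:

    import random

    import torch
    from torch.utils.data import Dataset

    class ConcatTranslationDataset(Dataset):
        # Packs shuffled "[SOS] source = target [EOS]" samples into fixed-length token blocks.

        def __init__(self, pairs, tokenizer, block_size=128, seed=0):
            self.pairs = pairs              # list of (source, target) string pairs
            self.tokenizer = tokenizer
            self.block_size = block_size
            self.rng = random.Random(seed)
            self.reshuffle()

        def reshuffle(self):
            # Call at the start of every epoch so neighbouring samples change.
            order = list(self.pairs)
            self.rng.shuffle(order)
            text = " ".join(f"[SOS] {src} = {tgt} [EOS]" for src, tgt in order)
            ids = self.tokenizer.encode(text)
            # Cut the long token stream into block_size chunks, dropping the remainder.
            self.blocks = [ids[i:i + self.block_size]
                           for i in range(0, len(ids) - self.block_size + 1, self.block_size)]

        def __len__(self):
            return len(self.blocks)

        def __getitem__(self, idx):
            ids = torch.tensor(self.blocks[idx])
            return ids, ids                 # inputs and labels coincide for LM fine-tuning

With something like this you would call reshuffle() once per epoch before iterating, so the model does not keep seeing the same samples next to each other.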

@dvaltchanov
Regarding 2, I didn't completely get it.
Where is the padding in your example batch? Also, did you mean that you concatenate the instances back to back into a single instance (so that #32 comes right after #1), or is #32 just another instance in the same batch? In other words, is the input of shape [bs, max_seq_len] (bs = 2 in this example)?
Also, did you add a [PAD] token to the vocabulary, since GPT and GPT-2 don't have a padding token? Or did you follow the same strategy as in question 1?

Do you have your custom data loader code somewhere so that I can take a look?

@Hannabrahman See my edited response above. I hope my clarification helps.

@dvaltchanov Thanks. Basically you followed the same approach as in here: they read all the input into one long string and then truncate it into max_len chunks. However, that approach doesn't have any sampling or shuffling.
My data is stories, and each story is around 60-80 tokens. If I read all the stories into one long string and truncate each section to 128 tokens, the beginning of a story sometimes falls into the previous sample's section while the rest goes into the next section.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
