Transformers: Fine-tuning T5 model

Created on 1 May 2020 · 28 comments · Source: huggingface/transformers

Hi,
I want to fine-tune T5 for a seq2seq task and I'm using the T5ForConditionalGeneration as it seems to have an LM decoder on top.
As there's no code example for this, I have lots of questions:

  1. Am I doing the right thing?
  2. I'm using the Adam optimizer. Is it ok?
  3. I'm a bit confused about the forward inputs in the training phase. I read this explanation over and over again and I don't understand whether I should just use input_ids and lm_labels for the training or not. Also somewhere in this issue someone's mentioned that:
    > T5 input sequence should be formatted with [CLS] and [SEP] tokens

So which one is right? I'm super confused.

Labels: LM (Finetuning), LM (Pretraining)

Most helpful comment

@amitness

E.g. in your summarization case, it would look something like:

from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")
decoder_input_ids = tokenizer.encode("<pad>", return_tensors="pt") 
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
outputs[0]

Do note that T5ForConditionalGeneration already prepends the padding by default. Above is only necessary if you're doing a forward pass straight from T5Model.

Regarding your question about making your own prefix, yes, you should be able to train on your own prefix. This is the whole point of T5's text-to-text approach. You should be able to specify any problem through this kind of approach (e.g. Appendix D in the T5 paper).

All 28 comments

+1. I'm also confused about how to structure the lm_labels and the decoder_input_ids.

Given T5's universal text-to-text objective, I'm under the impression that the T5 summarization example should be applicable for all T5 tasks, as long as the input and target sequences are correctly structured for the specified task. Hope this can be confirmed!

Sample input and target structures for specific tasks can be found at Appendix D in the T5 paper.

To correctly train T5 one should follow the instructions at https://huggingface.co/transformers/model_doc/t5.html#training .

For training, there is no need to provide the decoder_input_ids - they are created automatically. One only has to provide the lm_labels.

As @enzoampil mentioned, Appendix D of the paper gives good input/output examples.
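
For example, a minimal training forward pass might look like the sketch below. It uses the lm_labels argument discussed in this thread (later transformers versions rename it to labels); with lm_labels provided, the decoder_input_ids are created internally and the first output is the LM loss.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")
lm_labels = tokenizer.encode("A cute dog", return_tensors="pt")  # placeholder target

# decoder_input_ids are built internally by shifting lm_labels to the right
outputs = model(input_ids=input_ids, lm_labels=lm_labels)
loss = outputs[0]
loss.backward()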

@patrickvonplaten What exactly would be the lm_labels for something like summarization?

Example use case:
Text: "ABC" with maximum length 500
Summary: "XYZ" with maximum length 50

I understand that we can prepare input_ids and attention_mask like this for the document.

x = tokenizer.encode_plus(sentence, 
                          max_length=500, 
                          pad_to_max_length=True, 
                          return_tensors='pt')

Now for the lm_labels, i.e. the summary, is simply doing this enough?

lm_labels = tokenizer.encode(summary,  
                            return_tensors='pt', 
                            max_length=50, 
                            pad_to_max_length=True)

And call the model as

model = T5ForConditionalGeneration.from_pretrained('t5-small')
model(input_ids=..., lm_labels=lm_labels, attention_mask=...)

In your examples folder for summarization, I've seen some preprocessing like this for lm_labels. I didn't understand why this is being done.

y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100

Hi @amitness,

For T5 summarization you will have to prepend the prefix "summarize: " to every input. But you are more or less right. All you have to do is:

  1. Prepare input data
x = tokenizer.encode_plus("summarize: " + sentence, 
                          max_length=500, 
                          pad_to_max_length=True, 
                          return_tensors='pt')
  2. Prepare labels
lm_labels = tokenizer.encode(summary,
                             return_tensors='pt',
                             max_length=50,
                             pad_to_max_length=True)
  3. For tokens that are padded (which is only relevant if you train with batch_size > 1) you need to make sure that no loss is calculated on those tokens, so
lm_labels[lm_labels == tokenizer.pad_token_id] = -100

There is no need to shift the tokens as you show at the end of your comment because T5 does that automatically - see https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_t5.py#L1063.

This is also explained in https://huggingface.co/transformers/model_doc/t5.html#training .
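
Putting the three steps together, a single training step could look like the following sketch (the sentence/summary strings and the Adam learning rate are placeholders; pad_to_max_length is kept from the snippets above, although newer tokenizer versions use padding='max_length'):

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

sentence = "The quick brown fox jumps over the lazy dog."  # placeholder document
summary = "A fox jumps over a dog."                         # placeholder summary

# 1. Prepare input data (task prefix + document)
x = tokenizer.encode_plus("summarize: " + sentence,
                          max_length=500,
                          pad_to_max_length=True,
                          return_tensors='pt')

# 2. Prepare labels
lm_labels = tokenizer.encode(summary,
                             return_tensors='pt',
                             max_length=50,
                             pad_to_max_length=True)

# 3. No loss on padded label positions
lm_labels[lm_labels == tokenizer.pad_token_id] = -100

# The first output is the LM loss when lm_labels is given
outputs = model(input_ids=x['input_ids'],
                attention_mask=x['attention_mask'],
                lm_labels=lm_labels)
loss = outputs[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()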

Thanks for this clarification @patrickvonplaten! Finally got it to work from my side 😄

The gotcha for me was that the decoder_input_ids at inference should be prepended with the padding token, as stated in the docs for T5ForConditionalGeneration.

@enzoampil Can you give an example code of what you meant by prepending padding token at inference time?

@patrickvonplaten Thank you.

Besides the built-in prefixes like summarize:, translate:, etc., can I train with my own prefix? Let's say there is a prefix called "simplify:" and I have a dataset of paired examples. Is adding the prefix and preparing the data in the format you mentioned above enough?

@amitness

E.g. in your summarization case, it would look something like:

from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")
decoder_input_ids = tokenizer.encode("<pad>", return_tensors="pt") 
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
outputs[0]

Do note that T5ForConditionalGeneration already prepends the padding by default. Above is only necessary if you're doing a forward pass straight from T5Model.

Regarding your question about making your own prefix, yes, you should be able to train on your own prefix. This is the whole point of T5's text-to-text approach. You should be able to specify any problem through this kind of approach (e.g. Appendix D in the T5 paper).

@enzoampil Makes sense. Thank you so much.

> @patrickvonplaten Thank you.
>
> Besides the built-in prefixes like summarize:, translate:, etc., can I train with my own prefix? Let's say there is a prefix called "simplify:" and I have a dataset of paired examples. Is adding the prefix and preparing the data in the format you mentioned above enough?

Sure, you can train with your own prefix.
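
For instance, a hypothetical "simplify:" task would reuse exactly the same recipe as above; the sentence pair below is made up for illustration.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

complex_text = "The feline reposed upon the rug."  # made-up source
simple_text = "The cat lay on the rug."            # made-up target

# Same preparation as for "summarize:", just with a custom task prefix
input_ids = tokenizer.encode("simplify: " + complex_text, return_tensors="pt")
lm_labels = tokenizer.encode(simple_text, return_tensors="pt")
# ...then feed input_ids / lm_labels to T5ForConditionalGeneration as shown above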

> Thanks for this clarification @patrickvonplaten! Finally got it to work from my side
>
> The gotcha for me was that the decoder_input_ids at inference should be prepended with the padding token, as stated in the docs for T5ForConditionalGeneration.

Yeah that's actually a bit hidden in the code. So to clarify:
During training, there is no need to prepend the padding token since this is done automatically in T5 when lm_labels is provided.
During evaluation, one has to prepend the PAD token as you stated in your example.

After training, the model can be used with the generate() method (which actually powers the summarization, translation and text-generation pipelines).
In the generate() method, the padding token is automatically prepended.
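
So inference after fine-tuning can be as simple as the sketch below; the generation parameters (num_beams, max_length) are arbitrary choices, not values from this thread.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')  # or your fine-tuned checkpoint

input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")

# generate() prepends the pad (decoder start) token internally,
# so no decoder_input_ids are needed here
summary_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))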

@patrickvonplaten One thing I've noticed is the discrepancy between huggingface's and the original google-research tokenization.

In the official colab by the paper authors, they seem to add </s> to the end of each text when tokenizing. But when we use tokenizers from Hugging Face, it is not added. Not sure if it is a problem or not.
Here is an excerpt from their official colab:

'inputs_plaintext': b'trivia question: what is the population of fayetteville north carolina?', 'inputs': array([22377,   822,    10,   125,    19,     8,  2074,    13,     3,
          89,     9,    63,  1954,  1420,  3457,   443, 12057,     9,
          58,     1])

You can see 1 added at the end of the token_ids. But if we tokenize the same sentence with the Hugging Face tokenizer, we don't get 1 at the end.

tokenizer.encode('trivia question: what is the population of fayetteville north carolina?')
# [22377,   822,    10,   125,    19,     8,  2074,    13,     3, 89,     9,    63,  1954,  1420,  3457,   443, 12057,     9, 58]

When I was prototyping with the models, I tried preparing data like this to solve it. This adds 1 to the end. Not sure if we need to do this or not.

tokenizer.encode("summarize: Hello world</s>", return_tensors="pt")

Yes you are right, you should add the </s> token to the end of a sentence. I think this is also shown in the docs: https://huggingface.co/transformers/model_doc/t5.html#training.
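
A small sketch of that, assuming the tokenizer version from this thread which does not add the EOS token automatically:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Append </s> so the ids end with token 1, matching the google-research preprocessing
input_ids = tokenizer.encode("summarize: Hello world</s>", return_tensors="pt")
lm_labels = tokenizer.encode("Hi world</s>", return_tensors="pt")
print(input_ids[0, -1].item())  # 1, the id of </s>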

Thanks to @patrickvonplaten for all the clarifications and to the others for their further questions that led to more details on the subject.

Hello everyone,

I am currently working on finetuning the TFT5ForConditionalGeneration model on a parallel dataset.
Questions:

  1. Can I call model.fit like this - model.fit([x,y]) where x is input_ids and y is lm_labels?

If not, how do I pass in lm_labels and train the model?

Thanks.

@patrickvonplaten

For the tensorflow version you have to input input_ids, decoder_input_ids and lm_labels yourself. The model should work fine with the keras framework!

I will soon add more documentation for T5 for tensorflow. It's true that there is not enough documentation for TF at the moment.

Okay, I would appreciate that. So, do I add the input_ids, decoder_input_ids and lm_labels as keyword arguments when calling model.fit (which I doubt), or where do I do that?

I have not tried training the TensorFlow version with the Keras model.fit function yet. The forward pass in tensorflow's T5 implementation needs both input_ids and decoder_input_ids as you can see when going through this function:
https://github.com/huggingface/transformers/blob/fd2174664c8879c747ada3e6e0a2486858808421/src/transformers/modeling_tf_t5.py#L980

So, depending on your code you will have to create input_ids, decoder_input_ids and lm_labels yourself. Feel free to share your code here if you have a working training pipeline for TFT5 :-)
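
As a rough, untested sketch (not an official recipe), a manual training step for TFT5ForConditionalGeneration could build decoder_input_ids by prepending the pad token and shifting the labels one position to the right, then mask the pad positions when computing the loss:

import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Placeholder pair; in practice iterate over a tf.data.Dataset of such examples
input_ids = tokenizer.encode("summarize: Hello, my dog is cute</s>", return_tensors="tf")
labels = tokenizer.encode("A cute dog</s>", return_tensors="tf")

# Shift the targets right: prepend the pad token, drop the last label token
pad_id = tokenizer.pad_token_id
decoder_input_ids = tf.pad(labels, [[0, 0], [1, 0]], constant_values=pad_id)[:, :-1]

with tf.GradientTape() as tape:
    outputs = model(input_ids, decoder_input_ids=decoder_input_ids)
    logits = outputs[0]                       # (batch, seq_len, vocab_size)
    per_token_loss = loss_fn(labels, logits)  # (batch, seq_len)
    mask = tf.cast(tf.not_equal(labels, pad_id), per_token_loss.dtype)
    loss = tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)

grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))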

Hi Patrick. Got it to work with Pytorch. However, I have a question:

Is it possible to use a different vocab size with this pretrained model? I have a trained SentencePiece model and it only works with this pretrained t5 when I use a beam size of 1. I have manually changed the vocab size by setting model.config.vocab_size = tokenizer.vocab_size. However, the beam size problem still persists and it returns a shape mismatch error.

Please let me know if this is possible, thanks.

@patrickvonplaten

I think it will only work if the target pieces in the new vocab are the same as in the old one.
Besides, what is the benefit of the pretrained T5 if the SentencePiece targets change?

Created a little repo for NMT finetuning https://github.com/keleog/finetune_huggingace_t5

> I have not tried training the TensorFlow version with the Keras model.fit function yet. The forward pass in tensorflow's T5 implementation needs both input_ids and decoder_input_ids as you can see when going through this function:
> https://github.com/huggingface/transformers/blob/fd2174664c8879c747ada3e6e0a2486858808421/src/transformers/modeling_tf_t5.py#L980
>
> So, depending on your code you will have to create input_ids, decoder_input_ids and lm_labels yourself. Feel free to share your code here if you have a working training pipeline for TFT5 :-)

Hi @patrickvonplaten, I was able to create a data source with the input data and labels as you described.
Now I'm trying to use that data for keras fit with the loss function tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
The shape of the labels is (batch_size, seq_len), and I would expect that the model TFT5ForConditionalGeneration would return the logits of shape (batch_size, seq_len, vocab_size). However its call method returns this:
return decoder_outputs + encoder_outputs
so I get an error:
ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 27 array(s), for inputs ['output_1', 'output_2', 'output_3', 'output_4', 'output_5', 'output_6', 'output_7', 'output_8', 'output_9', 'output_10', 'output_11', 'output_12', 'output_13', 'output_14', 'output_15', 'output_16', 'output_17', 'output_18', 'output_19', 'output_20', 'output_21', 'output_22', 'output_23', 'output_24', 'output_25', 'output_26', 'output_27'] but instead got the following list of 1 arrays: [<tf.Tensor 'args_4:0' shape=(32, 128) dtype=int32>]...

I can think of two solutions, neither of which sounds good:

  1. override the call method in a subclass and return only the decoder outputs (roughly sketched below)
  2. use a custom loss function that extracts the decoder outputs from the model output

What would you advise?
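
For reference, a minimal sketch of the first option: wrap the model so that call() returns only the logits, which gives Keras a single output to match against the (batch_size, seq_len) label tensor. The wrapper name and hyperparameters are illustrative, and pad positions could additionally be down-weighted via sample_weight in fit().

import tensorflow as tf
from transformers import TFT5ForConditionalGeneration

class T5ForKeras(tf.keras.Model):  # hypothetical wrapper, not part of transformers
    def __init__(self, model_name='t5-small'):
        super().__init__()
        self.t5 = TFT5ForConditionalGeneration.from_pretrained(model_name)

    def call(self, inputs):
        input_ids, decoder_input_ids = inputs
        outputs = self.t5(input_ids, decoder_input_ids=decoder_input_ids)
        return outputs[0]  # logits of shape (batch, seq_len, vocab_size)

model = T5ForKeras()
model.compile(optimizer=tf.keras.optimizers.Adam(5e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit((input_ids, decoder_input_ids), labels, batch_size=..., epochs=...)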

Hi @patrickvonplaten ,
I am working on a question answering task using TFT5. I have done the text encoding step.
My raw input is the question and the target is the answer.
(image of sample question/answer pairs omitted)

How should I configure the input so that I can pass it to the model.fit() method like this? I am able to get the input_ids and attention mask.

from tensorflow import keras
from transformers import TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = keras.optimizers.Adam(lr=5e-5)
model.compile(optimizer=optimizer)
model.fit(
    x_train,
    y_train,
    epochs=1,  
    verbose=2,
    batch_size=2,
)

Here is the Colab Notebook
