Hi, I am implementing the T5 model on the SQuAD 1.1 dataset.
When I fine-tune the model with the Adam or AdaFactor optimizer, the validation accuracy goes down while the training accuracy goes up.
Could you give me any advice?
I feed the inputs into the model as below:
input_ids = ['(QUESTION)', Q_W1, Q_W2, ..., '(CONTEXT)', C_W1, C_W2, ..., '(PAD)', ...]
attention_masks = [1, 1, 1, ..., 1, 1, 1, ..., 0, ...]
decoder_input_ids = ['(PAD)', W1, W2, ..., '(PAD)', '(PAD)', ...]
decoder_attention_masks = [1, 1, 1, ..., 0, 0, ...]
lm_labels = [W1, W2, ..., '(EOS)', '(PAD)', ...]
I matched the shape of 'decoder_input_ids' to 'lm_labels', so no extra shift is applied inside the model.
The '(PAD)' tokens in 'lm_labels' are converted to -100 during the loss calculation.
For generation, the 'decoder_input_ids' are produced by the decoder itself, starting from the initial '(PAD)' token.
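For illustration, the decoder side can be prepared roughly like this (a sketch, not my exact code; the tokenizer call assumes a recent transformers version, where the loss argument is called labels instead of lm_labels, and the max length is just a placeholder):

```python
import torch
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

def make_target_features(answer, max_tgt_len=32):
    # The tokenizer appends the (EOS) token after the answer and pads up to max_tgt_len.
    tgt = tokenizer(answer, max_length=max_tgt_len, padding="max_length",
                    truncation=True, return_tensors="pt")
    lm_labels = tgt["input_ids"].clone()            # [W1, W2, ..., (EOS), (PAD), ...]
    decoder_attention_mask = tgt["attention_mask"]  # [1, 1, ..., 1, 0, ...]

    # decoder_input_ids are the labels shifted right and prefixed with (PAD),
    # so both tensors keep the same length.
    decoder_input_ids = torch.full_like(lm_labels, tokenizer.pad_token_id)
    decoder_input_ids[:, 1:] = lm_labels[:, :-1]

    # (PAD) positions in the labels become -100 so the cross-entropy ignores them.
    lm_labels[lm_labels == tokenizer.pad_token_id] = -100
    return decoder_input_ids, decoder_attention_mask, lm_labels
```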
The result from the 'T5-Small' pretrained weights on the dataset is
{
"exact": 71.03122043519394,
"f1": 81.08158598580584,
"total": 10570,
"HasAns_exact": 71.03122043519394,
"HasAns_f1": 81.08158598580584,
"HasAns_total": 10570
}
Update:
When I pre-processed the tokens without lower-casing, the EM from the initial weights was about 76.85. (I don't remember the exact score.)
In the TensorFlow version, the initial weights give 76.30 EM.
However, their pre-processing applies lower-casing to the question, document, and answer.
With lower-casing, I got 74.02 EM at the initial step.
Also, I used the answer text itself as the target input and output instead of the answer extracted from a span.
If the answer appears in the extracted document, the example is used for training.
But the serious problem is that the validation performance decreases after a few training steps.
When I validated my trained model at 100 steps, the score dropped to about 62.XX.
I think the problem is one of the following: the batch size, wrong pre-processing, or bugs in my T5 model.
For the optimizer, I tested both AdaFactor and Adam, but the results are the same.
One thing I didn't understand while implementing it is that the loaded pre-trained weights don't contain a weight for the 'lm_head' layer.
I guess this is meant for users who want to use their own vocabulary.
But I think it could be the reason the validation accuracy is lower than the TensorFlow version at the initial step (the 'lm_head' layer would then be randomly initialized).
When I applied a mask to the outputs and inputs_embeds of the encoder and decoder, the validation accuracy went up. [Value * (mask == 1).float().unsqueeze(2); the features at '(PAD)' positions should be zero.]
But I have to train the T5 model longer to verify whether this is correct.
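Roughly, the masking I mean is something like this (variable names are placeholders, not my exact code):

```python
import torch

def zero_out_padding(hidden_states, attention_mask):
    # hidden_states:  (batch, seq_len, d_model)
    # attention_mask: (batch, seq_len), 1 for real tokens and 0 for (PAD)
    mask = (attention_mask == 1).float().unsqueeze(2)  # (batch, seq_len, 1)
    return hidden_states * mask
```

I apply this to the inputs_embeds before the encoder/decoder stacks and to their output hidden states.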
Also, a lower learning rate works better than the learning rate from the original paper. (The original paper uses 1e-3, but I used 5e-5 as mentioned in BERT.)
Lastly, in the previous comment I forgot to mention something about my inputs.
input_ids = ['(QUESTION)', Q_W1, Q_W2, ..., '(CONTEXT)', C_W1, C_W2, (EOS), '(PAD)', ...]
attention_masks = [1, 1, 1, ..., 1, 1, 1, 1, ..., 0, ...]
decoder_input_ids = ['(PAD)', W1, W2, ..., '(PAD)', '(PAD)', ...]
decoder_attention_masks = [1, 1, 1, ..., 0, 0, ...]
lm_labels = [W1, W2, ..., '(EOS)', '(PAD)', ...]
The EOS token should be added at the end of the context.
So the tokens of the input_ids are [question: Q_W1, Q_W2, ..., context: C_W1, C_W2, (EOS), (PAD), ...]
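For example, with a recent tokenizer version the source side can be built roughly like this (a sketch; the 'question:'/'context:' prefixes and the max length are placeholders for my (QUESTION)/(CONTEXT) markers):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

question = "When did the war end?"
context = "The war ended in 1945 after ..."

source = "question: " + question + " context: " + context
encoded = tokenizer(source, max_length=512, padding="max_length",
                    truncation=True, return_tensors="pt")

# The tokenizer appends the (EOS) token after the context; everything
# after it is (PAD) with attention mask 0.
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
```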
I hope this will be helpful for anyone implementing it.
I will write more about it when I finish training my model.
Hi @h19920918,
Thanks for the detailed report. Could you quickly post your environment information here as well?
You can simply run python transformers-cli env in the root folder of your cloned transformers repo and copy-paste it below.
And do you use T5 with TensorFlow or PyTorch?
Also, it would be great if you could copy-paste your code for the above experiment here :-)
@patrickvonplaten Thank you for your answer.
Unfortunately, I didn't use all of your code.
I only used parts of it for my implementation.
First, my environment is below:
Python == 3.6.4
Pytorch == 1.4.0+cu92
CUDA == 9.2
CuDNN == 6 or 7? (I don't know exactly.)
Transformers == 2.5.1
Actually, I solved the problem.
Paper:
T5-Small: EM: 79.10 || F1: 87.24
Own:
T5-Small: EM: 79.03 || F1: 87.35
I suspected four things:
Learning rate
I adjusted the learning rate from 1e-3 to 1e-4 with the AdaFactor optimizer.
Masking for the 'inputs_embeds', 'encoder_outputs', and 'decoder_outputs'
I masked all three of them with [Value * (mask == 1).float().unsqueeze(2)].
Loss scale
Originally, the loss is calculated by dividing by the number of tokens.
But I changed this to dividing by the batch size.
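Roughly, the change looks like this (a sketch with PyTorch's cross-entropy, not my exact code):

```python
import torch.nn.functional as F

def loss_per_batch(logits, lm_labels):
    # logits: (batch, tgt_len, vocab_size); lm_labels: (batch, tgt_len) with -100 at (PAD).
    # reduction="sum" adds up the per-token losses instead of averaging over
    # the number of non-ignored tokens; I then divide by the batch size.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        lm_labels.reshape(-1),
        ignore_index=-100,
        reduction="sum",
    )
    return loss / logits.size(0)
```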
Additionally, I changed the pre-processing to only use an example if the answer appears in the extracted document.
However, this can be problematic since some documents contain the answer string even when it is not a reasonable answer.
The reason for doing this is that some span-extracted answers differ slightly from the original answer. (e.g. 'answer,' != 'answer')
Also, some span-extracted answers get converted into (UNK) tokens. (I'm not sure this is handled correctly after changing my pre-processing code.)
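The filtering idea is roughly this (a simplified sketch; the field names follow the SQuAD JSON layout and it is not my exact code):

```python
def build_examples(squad_examples):
    # Keep a (question, context, answer) triple only if the gold answer text
    # literally appears in the context, and use the raw answer text as the
    # generation target instead of re-extracting it from the span offsets.
    kept = []
    for ex in squad_examples:
        question = ex["question"]
        context = ex["context"]
        answer = ex["answers"][0]["text"]
        # Span-based extraction can give e.g. "answer," instead of "answer",
        # or tokens that end up as (UNK); the raw answer string avoids that.
        if answer in context:
            kept.append({"question": question, "context": context, "target": answer})
    return kept
```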
I will upload my code to GitHub as soon as possible.
Great, happy that you solved it :-)
I think this will be very useful for others. If you could link your uploaded GitHub code to this issue, that would be very helpful :-)
I uploaded it to GitHub; you can see the code at https://github.com/h19920918/T5_SQuAD.
But the code is quite messy, so let me point out which parts to look at.
Most of the implementation came from your code.
Mask
https://github.com/h19920918/T5_SQuAD/blob/c75d44544c3f18b87a4d8d09ed320742f9aaab36/models/modeling_t5.py#L556
https://github.com/h19920918/T5_SQuAD/blob/c75d44544c3f18b87a4d8d09ed320742f9aaab36/models/modeling_t5.py#L659
Loss
https://github.com/h19920918/T5_SQuAD/blob/c75d44544c3f18b87a4d8d09ed320742f9aaab36/models/modeling_t5.py#L941
https://github.com/h19920918/T5_SQuAD/blob/c75d44544c3f18b87a4d8d09ed320742f9aaab36/models/modeling_t5.py#L943
Pre-processing
https://github.com/h19920918/T5_SQuAD/blob/c75d44544c3f18b87a4d8d09ed320742f9aaab36/datasets/squad.py#L101
The links above are my modifications.
With these modifications, my model could be trained.
I'm sorry I can't provide cleaner code since I'm still working on things in this repository.
I hope it will be helpful for someone.
p.s. I still have to do an ablation study to find out which part was the real problem.
@patrickvonplaten I have a question.
Are the T5 checkpoints pre-trained with the TensorFlow version or with yours?
In other words, I want to know whether the checkpoints were converted from somewhere else or not.
I forgot to mention something.
The results from the initial checkpoint are the same across runs.
However, I don't understand this, since the 'lm_head' layer should be initialized randomly. (I used a different seed for each run.)
Thanks for linking your code! I think especially the pre-processing code can be very useful for others!
The T5 checkpoints are the official Google checkpoints pre-trained by the T5 team: https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints .
These checkpoints were obtained by pretraining on both the unsupervised C4 dataset (using a denoising objective) and the mixed multi-task supervised dataset (see paper). The PyTorch weights were retrieved by conversion from these weights but correspond 1-to-1 to the same values as the original TF weights.
Does that make sense?
Thank you for your detailed answer.
Still, I don't understand why the pre-trained weights output the same results with different seeds.
As I understand it, the 'lm_head' layer is used during inference to generate tokens.
However, the layer would be initialized randomly, since it is not in the pre-trained weights.
My guess is that the pre-trained weights dominate all the features, so the outputs end up the same regardless of the 'lm_head' layer.
Is my inference correct?
The lm_head layer corresponds to the "inverse" token embeddings. It is tied to the input embeddings. It should not be randomly initialized when loading weights from the pretrained models.
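For example, you can verify the tying like this (a quick check assuming a recent transformers version, where the model class is T5ForConditionalGeneration and word embeddings are tied by default):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# With tied embeddings, the output projection (lm_head) reuses the input
# embedding matrix, so both point to the same parameter tensor.
print(model.get_input_embeddings().weight.data_ptr()
      == model.get_output_embeddings().weight.data_ptr())  # True
```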
Thank you for your answer.
Sorry, it is my mistake.