from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2', output_hidden_states=True)
model.eval()
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, labels=input_ids)
hidden_states = outputs[3]  # outputs is (loss, logits, past, hidden_states) when labels are passed
Here the shape of hidden_states is (13, 6, 768). I have two questions; please help me with them. Thank you in advance!
Hi! The hidden_states output is indeed of shape (13, seq_len, 768), ignoring the batch dimension of 1. The first value (hidden_states[0]), of shape (seq_len, 768), corresponds to the sum of the word and positional embeddings. The subsequent values are added every time the model goes through an attention layer.
Without taking into account the dropout, you would therefore have:
hidden_states[0] -> word_embeddings(inputs) + positional_embeds(positions)
hidden_states[1] -> first_attention_layer(hidden_states[0])
hidden_states[2] -> second_attention_layer(hidden_states[1])
...
If by top layer you mean the first attention layer of the model, then it would be hidden_states[1]. If by top you mean last, it would be hidden_states[12], which is the same as the last hidden state returned by the base GPT2Model (outputs[0] in the snippet further down).
Those have shape (13, seq_len, 768) and not (13, 1, 768) because the model computes hidden states for every token, not only the last one.
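As a quick sanity check, here is a minimal sketch (reusing tokenizer, model, input_ids and hidden_states from your snippet above; dropout is a no-op because of model.eval()) verifying that hidden_states[0] really is the sum of the token and positional embeddings:

import torch

# hidden_states is a tuple of 13 tensors, each of shape (batch, seq_len, 768)
print(len(hidden_states), hidden_states[0].shape)  # 13 torch.Size([1, 6, 768])

# Recompute the embedding sum by hand: token embeddings + positional embeddings
positions = torch.arange(input_ids.size(-1)).unsqueeze(0)
embedding_sum = model.transformer.wte(input_ids) + model.transformer.wpe(positions)

print(torch.allclose(hidden_states[0], embedding_sum))  # True, since dropout is disabled in eval mode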
Hi! Thank you for your reply. I wonder whether the states for the previous tokens are used to compute the attention when predicting a later token? Is that the reason you store the states for the previous tokens?
The model keeps the key-value pairs so that they are not recomputed on the next model pass. These are stored in the past output and can reduce the amount of computation for each following model pass if you feed them to the next forward pass (as we do in run_generation).
The hidden states won't be used for this, but you can use them to extract intermediate features from the transformer.
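To make the past mechanism concrete, here is a minimal, self-contained sketch of incremental decoding with the cached key/values (note: the keyword argument is past in older transformers releases and past_key_values in newer ones, so adjust it to your version):

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

context = torch.tensor(tokenizer.encode("Hello, my dog is")).unsqueeze(0)
with torch.no_grad():
    logits, past = model(context)[:2]                # first pass: full context, cache returned
    next_token = logits[:, -1, :].argmax(-1, keepdim=True)
    logits, past = model(next_token, past=past)[:2]  # next pass: only the new token plus the cache
print(tokenizer.decode(next_token[0]))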
Hi! Thank you for your reply. That really helps.
So now I want to make sure I understand the code block in question: since hidden_states[12] is the top layer, I extract hidden_states[12][0][5], whose size is 768. Is that the vector used for prediction based on the word "cute" (and the five preceding tokens)?
Yes, you're right. You could also retrieve this vector by using a GPT2Model instead of a GPT2LMHeadModel, which is the base transformer:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

lm_model = GPT2LMHeadModel.from_pretrained('gpt2', output_hidden_states=True)
lm_model.eval()

model = GPT2Model.from_pretrained('gpt2')
model.eval()

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1

outputs = model(input_ids)                          # base transformer: (last_hidden_state, past)
lm_outputs = lm_model(input_ids, labels=input_ids)  # LM head model: (loss, logits, past, hidden_states)

transformer_output = outputs[0]            # last hidden state of the base model
transformer_hidden_states = lm_outputs[3]  # the 13 hidden states of the LM head model

print(transformer_hidden_states[12][:, -1, :] - transformer_output[:, -1, :])
This should output a tensor of 0s as the two tensors are equal.
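As one more optional check in the same spirit, projecting that last hidden vector through the model's output head (lm_head) should reproduce the next-token logits that the GPT2LMHeadModel returns at that position; the indexing below is a sketch that assumes the same tuple layout as above, i.e. lm_outputs = (loss, logits, past, hidden_states):

last_vector = transformer_hidden_states[12][0, -1]    # top-layer hidden state for "cute", shape (768,)
logits_from_vector = lm_model.lm_head(last_vector)    # project through the tied output embedding
print(torch.allclose(logits_from_vector, lm_outputs[1][0, -1]))  # should print True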
@LysandreJik Thank you so much for your help! That works.