from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2', output_hidden_states=True)
model.eval()
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, labels=input_ids)
hidden_states = outputs[3]  # outputs is (loss, logits, past, hidden_states) when labels are passed
Here the shape of hidden_states is (13, 6, 768). I have two questions; please help me with them. Thank you in advance!
Hi! The hidden_states output is indeed of shape (13, seq_len, 768), ignoring the batch dimension of 1. The first value (hidden_states[0]), of shape (seq_len, 768), corresponds to the sum of the word and positional embeddings. The subsequent values are added every time the model goes through an attention layer.
Without taking into account the dropout, you would therefore have:
hidden_states[0] -> word_embeddings(inputs) + positional_embeds(positions)
hidden_states[1] -> first_attention_layer(hidden_states[0])
hidden_states[2] -> second_attention_layer(hidden_states[1])
...
If by top layer you mean the first attention layer of the model, then it would be hidden_states[1]. If by top you mean last, it would be hidden_states[12], which is the same as the last hidden state returned by the base GPT2Model (outputs[0] in the snippet further down).
Those have shape (13, seq_len, 768) and not (13, 1, 768) because the model computes hidden states for every token, not only the last one.
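As a quick sanity check, here is a minimal sketch (reusing tokenizer, model, input_ids and hidden_states from your snippet above; dropout is a no-op because of model.eval()) verifying that hidden_states[0] really is the sum of the token and positional embeddings:

import torch

# hidden_states is a tuple of 13 tensors, each of shape (batch, seq_len, 768)
print(len(hidden_states), hidden_states[0].shape)  # 13 torch.Size([1, 6, 768])

# Recompute the embedding sum by hand: token embeddings + positional embeddings
positions = torch.arange(input_ids.size(-1)).unsqueeze(0)
embedding_sum = model.transformer.wte(input_ids) + model.transformer.wpe(positions)

print(torch.allclose(hidden_states[0], embedding_sum))  # True, since dropout is disabled in eval mode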
Hi! Thank you for your reply. I wonder whether the states for the previous tokens are used to compute the attention when predicting a later token? Is that the reason you store the states for the previous tokens?
The model keeps the key-value pairs so that they are not recomputed on the next model pass. These are stored in the past output and can reduce the amount of computation for each following model pass if you feed them to the next forward pass (as we do in run_generation).
The hidden states won't be used for this, but you can use them to extract intermediate features from the transformer.
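To make the past mechanism concrete, here is a minimal, self-contained sketch of incremental decoding with the cached key/values (note: the keyword argument is past in older transformers releases and past_key_values in newer ones, so adjust it to your version):

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

context = torch.tensor(tokenizer.encode("Hello, my dog is")).unsqueeze(0)
with torch.no_grad():
    logits, past = model(context)[:2]                # first pass: full context, cache returned
    next_token = logits[:, -1, :].argmax(-1, keepdim=True)
    logits, past = model(next_token, past=past)[:2]  # next pass: only the new token plus the cache
print(tokenizer.decode(next_token[0]))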
Hi! Thank you for your reply. That really helps.
So now I want to make sure I understand the code block in question: since hidden_states[12] is the top layer, I extract hidden_states[12][0][5], whose size is 768. Is that the vector used for prediction based on the word "cute" (and the five preceding tokens)?
Yes, you're right. You could also retrieve this vector by using a GPT2Model instead of a GPT2LMHeadModel, which is the base transformer:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Model
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

lm_model = GPT2LMHeadModel.from_pretrained('gpt2', output_hidden_states=True)
lm_model.eval()

model = GPT2Model.from_pretrained('gpt2')
model.eval()

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1

outputs = model(input_ids)                          # base transformer: (last_hidden_state, past)
lm_outputs = lm_model(input_ids, labels=input_ids)  # LM head model: (loss, logits, past, hidden_states)

transformer_output = outputs[0]            # last hidden state of the base model
transformer_hidden_states = lm_outputs[3]  # the 13 hidden states of the LM head model

print(transformer_hidden_states[12][:, -1, :] - transformer_output[:, -1, :])
This should output a tensor of 0s as the two tensors are equal.
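As one more optional check in the same spirit, projecting that last hidden vector through the model's output head (lm_head) should reproduce the next-token logits that the GPT2LMHeadModel returns at that position; the indexing below is a sketch that assumes the same tuple layout as above, i.e. lm_outputs = (loss, logits, past, hidden_states):

last_vector = transformer_hidden_states[12][0, -1]    # top-layer hidden state for "cute", shape (768,)
logits_from_vector = lm_model.lm_head(last_vector)    # project through the tied output embedding
print(torch.allclose(logits_from_vector, lm_outputs[1][0, -1]))  # should print True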
@LysandreJik Thank you so much for your help! That works.