Transformers: How to get the hidden states of all 12 layers of BERT?

Created on 14 Nov 2019  ·  17 Comments  ·  Source: huggingface/transformers

❓ Questions & Help


I tried setting output_hidden_states=True, but I only got 3 layers of hidden states in the model outputs for BERT, while theoretically there should be 12. How can I get them?

Most helpful comment

You should have obtained the 12 layers as well as the embedding output. Are you sure you're not mistaking the output of the forward call (which is a tuple as well) for the hidden states?

Just to make sure, this is the correct way to obtain the hidden states:

from transformers import BertModel, BertConfig

config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)

outputs = model(inputs)
print(len(outputs))  # 3

hidden_states = outputs[2]
print(len(hidden_states))  # 13

embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]

All 17 comments

You should have obtained the 12 layers as well as the embedding output. Are you sure you're not mistaking the output of the forward call (which is a tuple as well) for the hidden states?

Just to make sure, this is the correct way to obtain the hidden states:

from transformers import BertModel, BertConfig

config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)

outputs = model(inputs)
print(len(outputs))  # 3

hidden_states = outputs[2]
print(len(hidden_states))  # 13

embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]
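
For reference, here is a runnable end-to-end version of the snippet above, as a minimal sketch: it assumes the bert-base-uncased checkpoint (any BERT checkpoint works), the tuple-style outputs used in this thread, and an arbitrary example sentence.

import torch
from transformers import BertTokenizer, BertModel, BertConfig

# Any BERT checkpoint can be used; bert-base-uncased has 12 encoder layers.
config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", config=config)
model.eval()

# Build a (1, seq_len) tensor of token ids, including [CLS] and [SEP].
input_ids = torch.tensor([tokenizer.encode("Hello, world!", add_special_tokens=True)])

with torch.no_grad():
    outputs = model(input_ids)

hidden_states = outputs[2]           # tuple of 13 tensors
embedding_output = hidden_states[0]  # output of the embedding layer
layer_outputs = hidden_states[1:]    # one tensor per encoder layer
print(len(layer_outputs))            # 12
print(layer_outputs[-1].shape)       # (1, seq_len, 768)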

Thanks a lot, I just hadn't realized that the hidden states are stored at index 2 of the outputs.
By the way, where can I find the docs about the meaning of stored vectors at each index of the tuples?

In the docs for the outputs of `BertModel`, it's here: https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel

But for the BERT model there are two outputs, pooled_output and sequence_output:
pooled_output, sequence_output = bert_layer([input_word_id, input_mask, segment_id])
From here, how can I get the last 3 hidden layer outputs?

Hidden states will be returned if you specify it in the BERT config, as noted above:

config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)
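
To make the "last 3 hidden layers" part of the question concrete, here is a minimal sketch in PyTorch (the checkpoint, the sentence, and the choice to concatenate the layers are illustrative assumptions, not something prescribed in the thread):

import torch
from transformers import BertTokenizer, BertModel, BertConfig

config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", config=config)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Some example sentence.", add_special_tokens=True)])
with torch.no_grad():
    hidden_states = model(input_ids)[2]  # embeddings + 12 encoder layers

# The last 3 encoder layers, each of shape (batch, seq_len, hidden_size).
last_three = hidden_states[-3:]
# One common choice is to concatenate them along the hidden dimension.
concat = torch.cat(last_three, dim=-1)   # (batch, seq_len, 3 * hidden_size)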

@LysandreJik In your code, what are output[0] and output[1]?

As mentioned in the documentation, the BERT model returns (last_hidden_state, pooler_output, hidden_states [optional], attentions [optional]).

output[0] is therefore the last hidden state and output[1] is the pooler output.

@LysandreJik What exactly is the pooler output?

It's written in the documentation:

Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.

This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

@LysandreJik Sorry, but I don't understand what this means. If I want to get a vector for a sentence, I use the hidden state (output[0]), right?
What could pooler output be used for?

pooler (batch, hidden_dim): can be used when you want a representation for the whole sequence, like the last state of an RNN would give you. It is used, for instance, in text classification tasks where the predicted label doesn't depend on each token in the input.
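
As a side-by-side illustration of the two sequence-level representations discussed here, a minimal sketch (the checkpoint and sentence are placeholders, and the plain mean assumes a single unpadded sentence):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

input_ids = torch.tensor([tokenizer.encode("This desk is green.", add_special_tokens=True)])
with torch.no_grad():
    sequence_output, pooled_output = model(input_ids)[:2]

# Option 1: the pooler output, one vector for the whole sequence.
print(pooled_output.shape)                 # (1, 768)
# Option 2: average the per-token vectors of the last layer.
# (With padded batches, mask out padding tokens before averaging.)
mean_pooled = sequence_output.mean(dim=1)
print(mean_pooled.shape)                   # (1, 768)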

@mfuntowicz So output[0] is for a separate representation of each word in the sequence, and the pooler is for a joint representation of the entire sequence?

Exactly 😊

@mfuntowicz Great thanks!
Two question please:

  1. When taking the hidden states, I can also access the per-token representations of intermediate layers by adding the config option. Is it possible to access the pooler_output of an intermediate layer?
  2. So if I want to analyse sentence similarity (so that the sentence "this desk is green" is more similar to "this chair is yellow" than to "We ate pizza"), is it better to take the pooler output or to average the token representations in the hidden states?

@mfuntowicz Can you please help?

Hi @orko19,

  1. No, this is not possible, because the "pooler" is a layer in itself in BERT that depends on the last representation.

  2. The best would be to fine-tune the pooling representation for your task and then use the pooler. Using either the pooling layer or the averaged representation of the tokens as is might be too biased towards the objective the model was initially trained for. These layers are directly linked to the loss, so they are very prone to high bias (a rough sketch of the averaging approach follows below).
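
For question 2 above, the mechanics of the averaging approach look roughly like this. This is a minimal sketch only: the checkpoint is a placeholder, the sentence_vector helper is made up for illustration, and, as noted in the answer, similarities from a model that has not been fine-tuned may not be reliable.

import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def sentence_vector(text):
    # Average the last-layer token vectors into a single sentence vector.
    input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
    with torch.no_grad():
        sequence_output = model(input_ids)[0]  # (1, seq_len, hidden_size)
    return sequence_output.mean(dim=1)         # (1, hidden_size)

a = sentence_vector("This desk is green.")
b = sentence_vector("This chair is yellow.")
c = sentence_vector("We ate pizza.")

# Without fine-tuning these numbers are only indicative.
print(F.cosine_similarity(a, b).item())
print(F.cosine_similarity(a, c).item())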

@mfuntowicz Great thanks!
