Transformers: How to get the hidden states of all 12 layers of BERT?

Created on 14 Nov 2019  ·  17 Comments  ·  Source: huggingface/transformers

❓ Questions & Help


I tried setting output_hidden_states=True, but I only got 3 layers of hidden states in the model outputs for BERT, while theoretically there should be 12. How can I get them?

Most helpful comment

You should have obtained the 12 layers as well as the embedding output. Are you sure you're not mistaking the output of the forward call (which is a tuple as well) for the hidden states?

Just to make sure, this is the correct way to obtain the hidden states:

from transformers import BertModel, BertConfig

config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)

outputs = model(inputs)
print(len(outputs))  # 3

hidden_states = outputs[2]
print(len(hidden_states))  # 13

embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]

All 17 comments

You should have obtained the 12 layers as well as the embedding output. Are you sure you're not mistaking the output of the forward call (which is a tuple as well) for the hidden states?

Just to make sure, this is the correct way to obtain the hidden states:

from transformers import BertModel, BertConfig

config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)

outputs = model(inputs)
print(len(outputs))  # 3

hidden_states = outputs[2]
print(len(hidden_states))  # 13

embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]
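
For reference, here is a runnable end-to-end version of the snippet above, as a minimal sketch: it assumes the bert-base-uncased checkpoint (any BERT checkpoint works), the tuple-style outputs used in this thread, and an arbitrary example sentence.

import torch
from transformers import BertTokenizer, BertModel, BertConfig

# Any BERT checkpoint can be used; bert-base-uncased has 12 encoder layers.
config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", config=config)
model.eval()

# Build a (1, seq_len) tensor of token ids, including [CLS] and [SEP].
input_ids = torch.tensor([tokenizer.encode("Hello, world!", add_special_tokens=True)])

with torch.no_grad():
    outputs = model(input_ids)

hidden_states = outputs[2]           # tuple of 13 tensors
embedding_output = hidden_states[0]  # output of the embedding layer
layer_outputs = hidden_states[1:]    # one tensor per encoder layer
print(len(layer_outputs))            # 12
print(layer_outputs[-1].shape)       # (1, seq_len, 768)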

Thanks a lot, I just hadn't realized that the hidden states are stored at index 2 of the outputs.
By the way, where can I find the docs about the meaning of stored vectors at each index of the tuples?

In the docs for the outputs of `BertModel`, it's here: https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel

But for the BERT model there are two outputs, pooled_output and sequence_output:
pooled_output, sequence_output = bert_layer([input_word_id, input_mask, segment_id])
From here, how can I get the last 3 hidden layer outputs?

Hidden states will be returned if you specify it in the BERT config, as noted above:

config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)
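
To make the "last 3 hidden layers" part of the question concrete, here is a minimal sketch in PyTorch (the checkpoint, the sentence, and the choice to concatenate the layers are illustrative assumptions, not something prescribed in the thread):

import torch
from transformers import BertTokenizer, BertModel, BertConfig

config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", config=config)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Some example sentence.", add_special_tokens=True)])
with torch.no_grad():
    hidden_states = model(input_ids)[2]  # embeddings + 12 encoder layers

# The last 3 encoder layers, each of shape (batch, seq_len, hidden_size).
last_three = hidden_states[-3:]
# One common choice is to concatenate them along the hidden dimension.
concat = torch.cat(last_three, dim=-1)   # (batch, seq_len, 3 * hidden_size)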

@LysandreJik In your code, what are output[0] and output[1]?

As mentioned in the documentation, the BERT model returns (last_hidden_state, pooler_output, hidden_states [optional], attentions [optional]).

output[0] is therefore the last hidden state and output[1] is the pooler output.

@LysandreJik What exactly is the pooler output?

It's written in the documentation:

Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.

This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

@LysandreJik Sorry, but I don't understand what this means. If I want to get a vector for a sentence, I use the hidden state (output[0]), right?
What could pooler output be used for?

pooler (batch, hidden_dim): can be used when you want a representation for the whole sequence, like the last state of an RNN would give you. It is used, for instance, in text classification tasks where the predicted label doesn't depend on each token in the input.
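
As a side-by-side illustration of the two sequence-level representations discussed here, a minimal sketch (the checkpoint and sentence are placeholders, and the plain mean assumes a single unpadded sentence):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

input_ids = torch.tensor([tokenizer.encode("This desk is green.", add_special_tokens=True)])
with torch.no_grad():
    sequence_output, pooled_output = model(input_ids)[:2]

# Option 1: the pooler output, one vector for the whole sequence.
print(pooled_output.shape)                 # (1, 768)
# Option 2: average the per-token vectors of the last layer.
# (With padded batches, mask out padding tokens before averaging.)
mean_pooled = sequence_output.mean(dim=1)
print(mean_pooled.shape)                   # (1, 768)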

@mfuntowicz So output[0] is for a separate representation of each word in the sequence, and the pooler is for a joint representation of the entire sequence?

Exactly 😊

@mfuntowicz Great thanks!
Two question please:

  1. When taking the hidden states, I can also access the per-token representations of intermediate layers by adding the config option. Is it possible to access the pooler_output of an intermediate layer?
  2. So if I want to analyse sentence similarity (so that the sentence "this desk is green" is more similar to "this chair is yellow" than to "We ate pizza"), is it better to take the pooler output or to average the token representations in the hidden states?

@mfuntowicz Can you please help?

Hi @orko19,

  1. No, this is not possible, because the "pooler" is a layer in itself in BERT that depends on the last representation.

  2. The best would be to fine-tune the pooling representation for your task and then use the pooler. Using either the pooling layer or the averaged representation of the tokens as is might be too biased towards the objective the model was initially trained for. These layers are directly linked to the loss, so they are very prone to high bias (a rough sketch of the averaging approach follows below).
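
For question 2 above, the mechanics of the averaging approach look roughly like this. This is a minimal sketch only: the checkpoint is a placeholder, the sentence_vector helper is made up for illustration, and, as noted in the answer, similarities from a model that has not been fine-tuned may not be reliable.

import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def sentence_vector(text):
    # Average the last-layer token vectors into a single sentence vector.
    input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
    with torch.no_grad():
        sequence_output = model(input_ids)[0]  # (1, seq_len, hidden_size)
    return sequence_output.mean(dim=1)         # (1, hidden_size)

a = sentence_vector("This desk is green.")
b = sentence_vector("This chair is yellow.")
c = sentence_vector("We ate pizza.")

# Without fine-tuning these numbers are only indicative.
print(F.cosine_similarity(a, b).item())
print(F.cosine_similarity(a, c).item())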

@mfuntowicz Great thanks!
