How can I extract embeddings for a sentence or a set of words directly from pre-trained models (standard BERT)? For example, I am using spaCy for this at the moment, where I can do it as follows:
sentence vector:
sentence_vector = bert_model("This is an apple").vector
word_vectors:
words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]
I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).
You can use BertModel; it'll return the hidden states for the input sentence.
Found it, thanks @bkkaggle . Just for others who are looking for the same information.
Using PyTorch:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
Using TensorFlow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
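To connect this back to the spaCy-style usage in the question: the last hidden states already give you one vector per (sub)token, and you can pool them into a sentence vector yourself. A minimal PyTorch sketch, assuming the indexable outputs shown above (mean pooling is just one common choice, not something the library prescribes):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)  # (1, seq_len)
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # (1, seq_len, 768)

word_vectors = last_hidden_states[0]                 # one 768-dim vector per (sub)token
sentence_vector = last_hidden_states[0].mean(dim=0)  # naive sentence vector: average over tokens
```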
This is a bit different for ...ForSequenceClassification models. I've found that outputs[0] contains the logits, and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states, which are located at outputs[1].
Example:
inputs = {
"input_ids": batch[0],
"attention_mask": batch[1]
}
output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]
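For reference, one way to set that flag when loading a ...ForSequenceClassification model is through an explicit config; a rough sketch (the model name is just an example, and the classifier head will be randomly initialized):

```python
import torch
from transformers import BertTokenizer, BertConfig, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]          # (1, num_labels)
hidden_states = outputs[1]   # tuple: embedding output + one tensor per layer, each (1, seq_len, 768)
```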
By using this code, you can obtain a PyTorch tensor of shape (1, N, 768), where _N_ is the number of tokens produced by BertTokenizer. If you want to build a sentence vector from these N token vectors, how do you do that? @engrsfi
I am interested in the last hidden states, which can be seen as a kind of embedding. I think you are referring to all hidden states, including the output of the embedding layer.
"**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
You can take an average of them. However, I think the embedding at the first position ([CLS]) is usually treated as a kind of sentence vector, because that is the one fed to a further classifier (if any) for downstream tasks. Disclaimer: I am not sure about it.
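Both options, side by side, in a small helper (just a sketch; neither is an "official" sentence embedding, and which works better is task-dependent):

```python
import torch

def sentence_vectors(last_hidden_states: torch.Tensor):
    """last_hidden_states: (batch_size, seq_len, hidden_size), e.g. outputs[0] from BertModel."""
    cls_vectors = last_hidden_states[:, 0, :]       # hidden state at the [CLS] position
    mean_vectors = last_hidden_states.mean(dim=1)   # plain average over all token positions
    return cls_vectors, mean_vectors
```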
Should be as simple as grabbing the last element in the list:
last_layer = hidden_states[-1]
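For the plain BertModel this checks out: when output_hidden_states=True, the last entry of that list is the same tensor as last_hidden_state. A quick sanity check, assuming the 2.x-style tuple outputs used throughout this thread (pass an explicit BertConfig instead if your version doesn't forward config kwargs):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
with torch.no_grad():
    outputs = model(input_ids)

last_hidden_state = outputs[0]   # (1, seq_len, 768)
hidden_states = outputs[2]       # 13 tensors for bert-base: embeddings + 12 layers
assert torch.equal(hidden_states[-1], last_hidden_state)
```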
@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.
https://huggingface.co/transformers/_modules/transformers/modeling_bert.html#BertModel
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
Sequence of hidden-states at the output of the last layer of the model.
**pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
Last layer hidden-state of the first token of the sequence (classification token)
further processed by a Linear layer and a Tanh activation function. The Linear
layer weights are trained from the next sentence prediction (classification)
objective during Bert pretraining. This output is usually *not* a good summary
of the semantic content of the input, you're often better with averaging or pooling
the sequence of hidden-states for the whole input sequence.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
BTW, for me, the shape of hidden_states in the code below is (batch_size, 768) when I set this flag to True; not sure if I can extract the last hidden states from that.
output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]
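For what it's worth, that (batch_size, 768) shape is exactly what you'd expect if output[1] is the pooler output of the plain BertModel rather than the hidden states: with output_hidden_states=True, the standard BertModel's tuple is (last_hidden_state, pooler_output, hidden_states), so the hidden states sit one index further. A sketch of the indices, reusing the bertmodel/inputs names from the snippet above:

```python
outputs = bertmodel(**inputs)    # plain BertModel loaded with output_hidden_states=True
last_hidden_state = outputs[0]   # (batch_size, seq_len, 768)
pooler_output = outputs[1]       # (batch_size, 768) -- the shape observed above
hidden_states = outputs[2]       # tuple of (batch_size, seq_len, 768) tensors: embeddings + each layer
```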
I believe your comment is in reference to the standard models, but it's hard to tell without a link. Can you link to where in the documentation the pasted doc string is from?
I don't know if you saw my original comment, but I was providing an example of how to get hidden_states from the ...ForSequenceClassification models, not the standard ones. The ...ForSequenceClassification models do not output hidden_states by default: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification
Sorry, I missed that part :) I am referring to the standard BertModel. Doc link:
https://huggingface.co/transformers/model_doc/bert.html#bertmodel
@engrsfi @maxzzze @bkkaggle
Please, look here. I hope it can help :)
@TheEdoardo93 is this example taking the first element in each of the hidden_states?
@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.
Most people only take the hidden state of the [CLS] token from the last layer; using the hidden states for all tokens or from multiple layers usually doesn't help you that much.
If you want to get the embeddings for classification, just do something like:
input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]
Do you have any reference as to "people usually only take the hidden states of the [CLS] token of the last layer"?
There is some clarification about the use of the last hidden states in the BERT Paper.
According to the paper, the last hidden state for [CLS] is mainly used for classification tasks and the last hidden states for all tokens are used for token level tasks such as sequence tagging or question answering.
From the paper:
At the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.
Reference:
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805.pdf)
What about ALBERT? The output of the last hidden state isn't the same as the embedding, because the doc says the embeddings have a size of 128 for every model (https://arxiv.org/pdf/1909.11942.pdf, page 6).
But I'm not sure if the 128-dimensional embedding referenced in the table is something used internally to represent words or the final word embedding.
128 is used internally by ALBERT. The output of the model (the last hidden state) gives you the actual word embeddings. To understand this better, you should read the following blog post from Google.
https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html
Quote:
"The key to optimizing performance, captured in the design of ALBERT, is to allocate the model’s capacity more efficiently. Input-level embeddings (words, sub-tokens, etc.) need to learn context-independent representations, a representation for the word “bank”, for example. In contrast, hidden-layer embeddings need to refine that into context-dependent representations, e.g., a representation for “bank” in the context of financial transactions, and a different representation for “bank” in the context of river-flow management.
This is achieved by factorization of the embedding parametrization — the embedding matrix is split between input-level embeddings with a relatively-low dimension (e.g., 128), while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). With this step alone, ALBERT achieves an 80% reduction in the parameters of the projection block, at the expense of only a minor drop in performance — 80.3 SQuAD2.0 score, down from 80.4; or 67.9 on RACE, down from 68.2 — with all other conditions the same as for BERT."
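You can see both sizes directly in the config; a quick check (assuming a transformers version with ALBERT support and the albert-base-v2 checkpoint):

```python
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained('albert-base-v2')
print(config.embedding_size)  # 128 -> factorized input embedding size (internal)
print(config.hidden_size)     # 768 -> size of the hidden states the model actually outputs
```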
If the batch size is N, how do I convert this?
If I understand you correctly, you are asking how to get the last hidden states for all entries in a batch of size N. If that's the case, here is the explanation.
Your model expects input of the following shape:
(batch_size, sequence_length)
and returns last hidden states of the following shape:
(batch_size, sequence_length, hidden_size)
You can just index into the last hidden states to get the individual last hidden state for each input in the batch of size N.
Reference:
https://huggingface.co/transformers/model_doc/bert.html
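Concretely, after a single forward pass over a padded batch you just index the first dimension; a rough sketch with manual padding (pad id 0 happens to be correct for bert-base-uncased, but in general take it from the tokenizer):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentences = ["Hello, my dog is cute", "how are you"]
encoded = [tokenizer.encode(s, add_special_tokens=True) for s in sentences]
max_len = max(len(ids) for ids in encoded)
input_ids = torch.tensor([ids + [0] * (max_len - len(ids)) for ids in encoded])  # (N, max_len)
attention_mask = (input_ids != 0).long()

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)[0]  # (N, max_len, 768)

per_sentence = [last_hidden_states[i] for i in range(len(sentences))]  # one (max_len, 768) tensor each
```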
@engrsfi: What if I want to use the BERT embedding vector of each token as input to an LSTM network? Can I get the embedding of each token of the sentence from the last hidden layer of the BERT model? In this case I think I can't just use the embedding of the [CLS] token, since I need the word embedding of each token.
I used the code below to get BERT's word embeddings for all tokens of my sentences. I padded all my sentences to a maximum length of 80 and used an attention mask to ignore the padded elements. In this case the last_hidden_states element has shape (batch_size, 80, 768). However, when I inspect the embeddings, the vectors for the padded elements are not all the same: I get a vector of size 768 for each token of the sentence (most of them padding tokens), but the vectors for the padded elements are not equal. Is that expected?
import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenized = x_train['token'].apply((lambda x: bert_tokenizer.encode(x, add_special_tokens=True, max_length=80)))
padded = np.array([i + [0]*(80-len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
input_ids = tf.constant(padded)
attention_mask = tf.constant(attention_mask)
output= bert_model(input_ids, attention_mask=attention_mask)
last_hidden_states=output[0]
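That is expected: the attention mask only stops the real tokens from attending to the padded positions; the padded positions themselves still flow through the network (with different position embeddings), so their output vectors are neither zero nor identical. The usual workaround is to ignore them via the mask when pooling, e.g. something along these lines on top of the snippet above:

```python
import tensorflow as tf

# `last_hidden_states`: (batch_size, 80, 768) and `attention_mask`: (batch_size, 80) from the code above
mask = tf.cast(attention_mask, tf.float32)[:, :, None]     # (batch_size, 80, 1)
summed = tf.reduce_sum(last_hidden_states * mask, axis=1)  # zero out padded positions, then sum
counts = tf.reduce_sum(mask, axis=1)                       # number of real tokens per sentence
mean_pooled = summed / counts                              # (batch_size, 768) sentence vectors
```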
Hi, could I ask how you would use spaCy to do this? Is there a link? Thanks a lot.
Here is the link:
https://spacy.io/usage/vectors-similarity
Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:
cls_embeddings = embeddings_of_last_layer[0][0]
This is because embeddings_of_last_layer has dimensions 1 × #tokens × #hidden-units. Then, since [CLS] is the first token (and usually has 101 as its id), we want the embedding corresponding to just [CLS]. embeddings_of_last_layer[0] has shape #tokens × #hidden-units and contains the embeddings of all the tokens.
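For a batch of more than one sequence, the more general form is to slice the token dimension instead of indexing only the first sequence; a small sketch:

```python
# embeddings_of_last_layer: (batch_size, seq_len, hidden_size)
cls_embeddings = embeddings_of_last_layer[:, 0, :]  # (batch_size, hidden_size): one [CLS] vector per sequence
```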
@sahand91
sequence_output, pooled_output = bert_model(input_)
pooled_output.shape = (1, 768), one vector of 768 entries (representing the whole sentence)
sequence_output.shape = (batch_size, max_len, dim), e.g. (1, 256, 768) for bs = 1, n_tokens = 256
The sequence output gives the vector for each token of the sentence.
I have used the sequence output for classification tasks like sentiment analysis. Since the docs mention that the pooled output is not a good representation of the whole sentence, we use the sequence output and feed it into a CNN or LSTM.
So I don't see any problem in using the sequence output for classification tasks, as we get to see the actual vector representation of a word, say "bank", in both contexts: "commercial" and "location" (bank of a river).
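As a concrete illustration of that setup (and of the LSTM question above), the per-token sequence output can be fed into an LSTM like any other embedded sequence. A minimal sketch with arbitrary layer sizes:

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')
lstm = nn.LSTM(input_size=768, hidden_size=256, batch_first=True)

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
with torch.no_grad():
    sequence_output = bert(input_ids)[0]      # (1, seq_len, 768): one vector per token

lstm_out, (h_n, c_n) = lstm(sequence_output)  # h_n[-1]: (1, 256) summary you could feed to a classifier
```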
Yes, I think the same. @sumitsidana
embeddings_of_last_layer[0][0].shape
Out[179]: torch.Size([144]) # where 144 in my case is the hidden_size
Can anyone confirm that embeddings_of_last_layer[0][0] is the embedding related to the [CLS] token for each sequence?
Yes, it is, but only for the first sequence in the batch. You will have to loop through the batch and take the first element ([CLS]) of each sentence.
Yes gotcha. Thanks
logits = output[0] means the word embedding. So, does hidden_states = output[1] mean the sentence-level embedding?
outputs[0] is the sentence embedding for "Hello, my dog is cute", right?
Then what is outputs[1]?
If I want to encode a list of strings,
input_ids = torch.tensor(tokenizer.encode(["Hello, my dog is cute", "how are you"])).unsqueeze(0)
does not really give me a 2×768 array. The only way I found is
input_ids = [torch.tensor([tokenizer.encode(text) for text in ["Hello, my dog is cute", "how are you"]]).unsqueeze(0)]
Anything to make it faster?
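With a recent transformers version (roughly 3.x and later), the tokenizer itself can pad a whole list in one call, which avoids the per-sentence loop; a hedged sketch:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

batch = tokenizer(["Hello, my dog is cute", "how are you"], padding=True, return_tensors="pt")
with torch.no_grad():
    last_hidden_states = model(**batch)[0]  # (2, max_len, 768): one row of token vectors per sentence
```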
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.