Hi,
Suppose we have an utterance of length 24 (counting special tokens) and we right-pad it with 0 to a max length of 64.
If we use a pretrained BERT model to get the last hidden states, the output will be of size [1, 64, 768].
Can we use just the first 24 positions as the hidden states of the utterance? I mean, is it right to say that output[0, :24, :] contains all the required information?
I noticed that the outputs at indices 24:64 are non-zero float values as well.
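For reference, roughly what I mean (just a sketch; the lengths 24/64 are from my example above, and bert-base-uncased is assumed):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

token_ids = tokenizer.encode("some utterance here")   # say this gives 24 ids including [CLS]/[SEP]
n = len(token_ids)
input_ids = torch.tensor(token_ids + [0] * (64 - n)).unsqueeze(0)   # right-pad with 0 to max length 64

last_hidden_states = model(input_ids)[0]               # [1, 64, 768]
utterance_states = last_hidden_states[0, :n, :]        # only the real (non-padded) positions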
Hello! I believe that you are currently computing values for your padding indices, which is causing your confusion. There is an attention_mask parameter that can be passed to the forward/__call__ method, which will prevent values from being computed for the padded indices!
@LysandreJik thanks for replying.
Consider the example given in the modeling_bert.py script:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

token_ids = tokenizer.encode("Hello, my dog is cute")
token_ids = token_ids + [0] * (128 - len(token_ids))   # right-pad the id list with 0 ([PAD]) to length 128
input_ids = torch.tensor(token_ids).unsqueeze(0)        # Batch size 1
attn_mask = input_ids.ne(0)                             # I added this to create a mask for the padded indices
outputs = model(input_ids, attention_mask=attn_mask)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
Even when passing the attention_mask parameter, it still computes values for the padded indices.
Am I doing something wrong?
Can we use just the first 24 positions as the hidden states of the utterance? I mean, is it right to say that output[0, :24, :] contains all the required information?
I noticed that the outputs at indices 24:64 are non-zero float values as well.
Yes, the remaining indices hold the values of the padding embeddings; you can try it out and prove it to yourself by using different padding lengths.
Take a look at posts #1013 (XLNet) and #278 (BERT).
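for example, something like this (just a sketch with bert-base-uncased, comparing a padded run against an unpadded one):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

token_ids = tokenizer.encode("Hello, my dog is cute")
n = len(token_ids)

with torch.no_grad():
    # no padding at all
    plain = model(torch.tensor(token_ids).unsqueeze(0))[0]
    # right-padded to 64, with an attention mask over the padding
    padded_ids = torch.tensor(token_ids + [0] * (64 - n)).unsqueeze(0)
    padded = model(padded_ids, attention_mask=padded_ids.ne(0))[0]

print((plain[0] - padded[0, :n]).abs().max())   # tiny: the real positions barely change
print(padded[0, n:].abs().max())                # clearly non-zero: outputs at the [PAD] positions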
@cherepanovic Thanks for your reply.
Oh I see, I tried padding with and without passing the attention mask, and I realized the outputs are completely different at all indices.
So I understand that when we use padding we must pass the attention mask; this way the output (at the non-padded indices) would be equal (not exactly, but almost) to the output when we don't use padding at all, right?
would be equal (not exactly, but almost)
right
@cherepanovic My main question is just this: do the output values at the padded indices create noise, or in other words, are they misleading? Or can we make use of the whole output without worrying that, for example, the last 20 indices of the output correspond to padded tokens?
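For example, if I mean-pool the hidden states into one utterance vector, I guess I would have to exclude the padded positions myself, something like this (just a sketch; last_hidden_states and attn_mask as in my snippet above)?

# last_hidden_states: [1, 128, 768], attn_mask: [1, 128] with 1 for real tokens, 0 for padding
mask = attn_mask.unsqueeze(-1).float()              # [1, 128, 1]
summed = (last_hidden_states * mask).sum(dim=1)     # padded positions contribute zero
utterance_vec = summed / mask.sum(dim=1)            # mean over the real tokens only, [1, 768]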
@ehsan-soe can you describe your intent more precisely?