Transformers: Bert output last hidden state

Created on 8 Sep 2019 · 8 comments · Source: huggingface/transformers

โ“ Questions & Help

Hi,

Suppose we have an utterance of length 24 (counting special tokens) and we right-pad it with 0 to a max length of 64.
If we use a pretrained BERT model to get the last hidden states, the output will be of size [1, 64, 768].
Can we use just the first 24 as the hidden states of the utterance? I mean, is it right to say that output[0, :24, :] has all the required information?
I noticed that the outputs at indices 24:64 contain float values as well.
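
(For concreteness, a minimal sketch of the setup described above; the sentence, the 24-token length, bert-base-uncased, and 0 as the [PAD] id are just assumptions for illustration.)

        import torch
        from transformers import BertTokenizer, BertModel

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertModel.from_pretrained('bert-base-uncased')

        token_ids = tokenizer.encode("some example utterance")  # hypothetical; suppose this gives 24 ids
        seq_len = len(token_ids)
        token_ids += [0] * (64 - seq_len)                        # right-pad with the [PAD] id (0) to length 64
        input_ids = torch.tensor(token_ids).unsqueeze(0)         # shape [1, 64]
        attention_mask = input_ids.ne(0).long()

        last_hidden = model(input_ids, attention_mask=attention_mask)[0]  # shape [1, 64, 768]
        utterance_states = last_hidden[0, :seq_len, :]                    # only the real (non-padded) tokens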

wontfix

All 8 comments

Hello! I believe that you are currently computing values for your padding indices, which is the source of your confusion. There is an attention_mask parameter to be passed to the forward/__call__ method, which will prevent values from being computed for the padded indices!

@LysandreJik thanks for replying.
Consider the example given in the modeling_bert.py script:

        import torch
        from transformers import BertTokenizer, BertModel

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertModel.from_pretrained('bert-base-uncased')
        token_ids = tokenizer.encode("Hello, my dog is cute")
        token_ids += [0] * (128 - len(token_ids))          # right-pad with the [PAD] id (0) to max length 128
        input_ids = torch.tensor(token_ids).unsqueeze(0)   # Batch size 1

        attn_mask = input_ids.ne(0)  # I added this to create a mask for the padded indices
        outputs = model(input_ids, attention_mask=attn_mask)
        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Even when passing the attention_mask parameter, it still computes values for the padded indices.
Am I doing something wrong?

Can we use just the first 24 as the hidden states of the utterance? I mean, is it right to say that output[0, :24, :] has all the required information?
I noticed that the outputs at indices 24:64 contain float values as well.

Yes, the remaining indices hold the values of the padding embeddings; you can verify it yourself by trying different padding lengths.

Take a look at these posts: #1013 (XLNet) and #278 (BERT).
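
(Not from this thread, just a hedged sketch of that check, assuming bert-base-uncased: pad the same sentence to two different lengths and pass the attention mask; the real-token positions should barely change, while the padded positions still hold non-zero values coming from the padding embedding.)

        import torch
        from transformers import BertTokenizer, BertModel

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertModel.from_pretrained('bert-base-uncased')

        ids = tokenizer.encode("Hello, my dog is cute")
        n = len(ids)

        def run(max_len):
            padded = torch.tensor([ids + [0] * (max_len - n)])  # right-pad with the [PAD] id (0)
            mask = padded.ne(0).long()
            with torch.no_grad():
                return model(padded, attention_mask=mask)[0]

        out_64, out_128 = run(64), run(128)
        # Real-token positions should agree up to numerical noise, regardless of the padding length...
        print(torch.allclose(out_64[0, :n], out_128[0, :n], atol=1e-4))
        # ...while the padded positions still contain non-zero values.
        print(out_64[0, n:].abs().max())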

@cherepanovic Thanks for your reply.
Oh, I see. I tried padding with and without passing the attention mask, and the outputs were completely different at all indices.
So I understand that when we use padding we must pass the attention mask; that way the output (at the non-padded indices) would be equal (not exactly, but almost) to the output when we don't use padding at all, right?

would be equal (not exactly, but almost)

right
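
(A small hedged sketch of exactly that comparison, again assuming bert-base-uncased: with the attention mask passed, the non-padded positions of the padded run should differ from the unpadded run only by numerical noise.)

        import torch
        from transformers import BertTokenizer, BertModel

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertModel.from_pretrained('bert-base-uncased')

        ids = tokenizer.encode("Hello, my dog is cute")
        n = len(ids)
        unpadded = torch.tensor([ids])                           # no padding at all
        padded = torch.tensor([ids + [0] * (64 - n)])            # right-padded to 64
        mask = padded.ne(0).long()

        with torch.no_grad():
            out_unpadded = model(unpadded)[0]                    # [1, n, 768]
            out_padded = model(padded, attention_mask=mask)[0]   # [1, 64, 768]

        # Maximum difference on the real-token positions -- expected to be tiny (on the order of 1e-6)
        print((out_unpadded[0] - out_padded[0, :n]).abs().max())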

@cherepanovic My main question is just whether the output values at the padded indices create noise, in other words are misleading, or whether we can make use of the whole output without worrying that, for example, the last 20 indices of the output correspond to padded tokens.
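
(One common way to keep the padded positions from adding noise downstream, not specific to this thread, is to zero them out with the attention mask before pooling; a minimal sketch, reusing the attn_mask and last_hidden_states names from the code above.)

        import torch

        def masked_mean(last_hidden_states, attention_mask):
            """Average hidden states over the real tokens only; padded positions contribute nothing."""
            mask = attention_mask.unsqueeze(-1).float()      # [batch, seq_len, 1]
            summed = (last_hidden_states * mask).sum(dim=1)  # zero out padded positions, then sum
            counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per example
            return summed / counts                           # [batch, hidden_size]

        # e.g. utterance_vector = masked_mean(last_hidden_states, attn_mask)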

@ehsan-soe can you describe your intent more precisely?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
