Hi,
Suppose we have an utterance of length 24 (counting special tokens) and we right-pad it with 0 to a max length of 64.
If we use a pretrained BERT model to get the last hidden states, the output will be of size [1, 64, 768].
Can we use just the first 24 positions as the hidden states of the utterance? I mean, is it right to say that output[0, :24, :] contains all the required information?
I noticed that the outputs at indices 24:64 are non-zero float values as well.
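For reference, roughly what I mean (just a sketch; the lengths 24/64 are from my example above, and bert-base-uncased is assumed):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

token_ids = tokenizer.encode("some utterance here")   # say this gives 24 ids including [CLS]/[SEP]
n = len(token_ids)
input_ids = torch.tensor(token_ids + [0] * (64 - n)).unsqueeze(0)   # right-pad with 0 to max length 64

last_hidden_states = model(input_ids)[0]               # [1, 64, 768]
utterance_states = last_hidden_states[0, :n, :]        # only the real (non-padded) positions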
Hello! I believe that you are currently computing values for your padding indices, which is causing your confusion. There is an attention_mask parameter that can be passed to the forward/__call__ method, which will prevent values from being computed for the padded indices!
@LysandreJik thanks for replying.
Consider the example given in the modeling_bert.py script:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

token_ids = tokenizer.encode("Hello, my dog is cute")
token_ids = token_ids + [0] * (128 - len(token_ids))   # right-pad the id list with 0 ([PAD]) to length 128
input_ids = torch.tensor(token_ids).unsqueeze(0)        # Batch size 1
attn_mask = input_ids.ne(0)                             # I added this to create a mask for the padded indices
outputs = model(input_ids, attention_mask=attn_mask)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
Even when passing the attention_mask parameter, it still computes values for the padded indices.
Am I doing something wrong?
Can we use just the first 24 positions as the hidden states of the utterance? I mean, is it right to say that output[0, :24, :] contains all the required information?
I noticed that the outputs at indices 24:64 are non-zero float values as well.
Yes, the remaining indices hold the values of the padding embeddings; you can try it out and prove it to yourself by using different padding lengths.
Take a look at posts #1013 (XLNet) and #278 (BERT).
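for example, something like this (just a sketch with bert-base-uncased, comparing a padded run against an unpadded one):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

token_ids = tokenizer.encode("Hello, my dog is cute")
n = len(token_ids)

with torch.no_grad():
    # no padding at all
    plain = model(torch.tensor(token_ids).unsqueeze(0))[0]
    # right-padded to 64, with an attention mask over the padding
    padded_ids = torch.tensor(token_ids + [0] * (64 - n)).unsqueeze(0)
    padded = model(padded_ids, attention_mask=padded_ids.ne(0))[0]

print((plain[0] - padded[0, :n]).abs().max())   # tiny: the real positions barely change
print(padded[0, n:].abs().max())                # clearly non-zero: outputs at the [PAD] positions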
@cherepanovic Thanks for your reply.
Oh I see, I tried padding with and without passing the attention mask, and I realized the outputs are completely different at all indices.
So I understand that when we use padding we must pass the attention mask; this way the output (at the non-padded indices) would be equal (not exactly, but almost) to the output when we don't use padding at all, right?
would be equal (not exactly, but almost)
right
@cherepanovic My main question is just this: do the output values at the padded indices create noise, or in other words, are they misleading? Or can we make use of the whole output without worrying that, for example, the last 20 indices of the output correspond to padded tokens?
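For example, if I mean-pool the hidden states into one utterance vector, I guess I would have to exclude the padded positions myself, something like this (just a sketch; last_hidden_states and attn_mask as in my snippet above)?

# last_hidden_states: [1, 128, 768], attn_mask: [1, 128] with 1 for real tokens, 0 for padding
mask = attn_mask.unsqueeze(-1).float()              # [1, 128, 1]
summed = (last_hidden_states * mask).sum(dim=1)     # padded positions contribute zero
utterance_vec = summed / mask.sum(dim=1)            # mean over the real tokens only, [1, 768]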
@ehsan-soe can you describe your intent more precisely?