I don't quite understand the attention mask in the way that it's implemented.
Here is the relevant line: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L312 :
...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Apply the attention mask (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
...
So it seems the proper way to use attention_mask is to set the positions you want to keep to 1's, and positions you want to mask out to 0's.
Curious why we don't simply multiply instead of add and then normalize? Is it for stability reasons?
The reason a classic binary attention mask won't work here is that the Softmax activation includes an exponential, and so an input of 0 can still yield quite a large softmax weight (since e^0 = 1).
The mask can't be applied after the softmax, because then the resulting values will not sum to 1. So the best solution is to add (not multiply!) a large negative value to the indices you want to mask. That means they will be 0 or almost 0 after the softmax step (because as you make x more negative, e^x becomes closer and closer to 0).
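Here's a tiny PyTorch sketch illustrating the point (just an illustration with made-up scores, not code from the repo):

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 0.3])  # raw attention scores; last two positions are padding
keep = torch.tensor([1.0, 1.0, 0.0, 0.0])    # binary mask: 1 = keep, 0 = mask out

# Multiplying before the softmax does NOT remove the padded positions:
# their scores become 0, and e^0 = 1 still contributes weight.
print(F.softmax(scores * keep, dim=-1))
# ~ tensor([0.6103, 0.2245, 0.0826, 0.0826])  -- padding still gets probability mass

# Adding a large negative value instead drives e^x toward 0, so the padded
# positions get (almost) zero weight and the kept positions still sum to 1.
additive_mask = (1.0 - keep) * -10000.0
print(F.softmax(scores + additive_mask, dim=-1))
# ~ tensor([0.7311, 0.2689, 0.0000, 0.0000])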
So you're recommending using a large negative value for the inputs you want to mask. It makes sense to me, though it seems the documentation ought to be updated, since it currently reads:
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
That said, I've been testing with 0s and it seems to produce the same vectors as when I only pass in a tensor of exactly the size I need. I understand this may not always be the case, however.
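A quick way to check this for yourself (a sketch using the current transformers API rather than the old pytorch-pretrained-BERT one; it downloads the pretrained weights on first run):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = "hello world"

# Exact-length input, no padding.
exact = tokenizer(text, return_tensors="pt")

# The same input padded out to length 12, with attention_mask marking the padding.
padded = tokenizer(text, padding="max_length", max_length=12, return_tensors="pt")

with torch.no_grad():
    out_exact = model(**exact).last_hidden_state
    out_padded = model(**padded).last_hidden_state

# Compare the hidden states of the real (non-padding) tokens only.
n = exact["input_ids"].shape[1]
print(torch.allclose(out_exact, out_padded[:, :n], atol=1e-5))

Because the mask adds -10000 rather than true negative infinity, there can be a tiny residual difference, but the outputs for the real tokens should match to within numerical tolerance.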
Note this code chunk: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L722-L728
Thank you, that clarifies everything.
@Rocketknight1 Hi, I would like to check that code chunk, but the URL you provided is outdated. Could you show the code here again? Thanks.
Hi, sorry! The repo code has changed massively since last year, so I don't know if there's a single chunk corresponding to that link anymore. However, if I recall, all it showed was a short code snippet where the attention_mask tensor was converted into the additive pre-softmax mask by first inverting it and then multiplying it by -10,000. Feel free to ask questions and @tag me if you're still uncertain.
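Roughly, the conversion that chunk performed looked like this (a sketch from memory of the old BertModel.forward() logic, not the exact current code):

# attention_mask: LongTensor of shape [batch_size, seq_len], 1 for real tokens, 0 for padding
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)         # -> [batch, 1, 1, seq_len]
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)  # so it can be added to the scores
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0       # invert: 1 -> 0.0, 0 -> -10000.0
# extended_attention_mask is then added to the raw attention scores before the softmax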
@Rocketknight1 Thank you for your reply. Yes, I understand how to turn attention_mask into a large negative value and why. But in the modeling_bert.py file, there doesn't seem to be any such code chunk that converts attention_mask into the proper format. Check this out: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L274
I found the corresponding source code: https://github.com/huggingface/transformers/issues/542