Transformers: Clarifying attention mask

Created on 26 Apr 2019 · 8 comments · Source: huggingface/transformers

I don't quite understand the way the attention mask is implemented.

Here is the relevant line: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L312

...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Apply the attention mask (precomputed for all layers in the BertModel forward() function)
attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
...

So it seems the proper way to use attention_mask is to set the positions you want to keep to 1's, and positions you want to mask out to 0's.

I'm curious why we don't simply multiply by the mask instead of adding it and then normalizing. Is it for stability reasons?

All 8 comments

The reason a classic binary attention mask won't work here is that the Softmax activation includes an exponential, and so an input of 0 can still yield quite a large softmax weight (since e^0 = 1).

The mask can't be applied after the softmax, because then the resulting values will not sum to 1. So the best solution is to add (not multiply!) a large negative value to the indices you want to mask. That means they will be 0 or almost 0 after the softmax step (because as you make x more negative, e^x becomes closer and closer to 0).
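
To make this concrete, here is a minimal toy sketch (my own example, not the library's code) comparing a multiplicative binary mask with the additive large-negative mask:

import torch
import torch.nn.functional as F

# Toy attention scores for one query over four key positions.
scores = torch.tensor([2.0, 1.0, 0.5, 3.0])

# Binary mask: 1 = keep, 0 = ignore (say the last two positions are padding).
keep = torch.tensor([1.0, 1.0, 0.0, 0.0])

# Multiplying before the softmax does not actually mask anything:
# the zeroed scores still contribute e^0 = 1 to the denominator.
print(F.softmax(scores * keep, dim=-1))            # masked positions still get weight

# Adding a large negative number to the masked positions does work:
# e^(-10000) is effectively 0, so those positions end up with ~0 probability.
additive_mask = (1.0 - keep) * -10000.0
print(F.softmax(scores + additive_mask, dim=-1))   # masked positions get ~0 weight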

So you're recommending using a large negative value for the inputs you want to mask. It makes sense to me, though it seems the documentation ought to be updated, since it currently reads:

`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
    selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
    input sequence length in the current batch. It's the mask that we typically use for attention when
    a batch has varying length sentences.

That said, I've been testing with 0s in the mask, and it seems to produce the same vectors as when I only pass in a tensor of exactly the size I need. I understand this may not always be the case, however.
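
For what it's worth, here is a minimal usage sketch of that user-facing mask (assuming a recent version of transformers, where the tokenizer builds the 1/0 mask automatically when padding a batch; this is not code from the thread):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["A short sentence.", "A somewhat longer sentence with more tokens."]
# padding=True pads to the longest sentence in the batch and returns the
# matching attention_mask of 1s (real tokens) and 0s (padding).
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(batch["attention_mask"])

with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
print(outputs.last_hidden_state.shape)  # (2, max_len_in_batch, 768)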

Thank you, that clarifies everything.

@Rocketknight1 Hi, I would like to check the code chunk, but the URL you provided is outdated. Could you show the code here again? Thanks.

Hi, sorry! The repo code has changed massively since last year, so I don't know if there's a single chunk corresponding to that link anymore. However, if I recall correctly, all it showed was a short code snippet where the attention_mask tensor was converted into the additive pre-softmax mask by first inverting it and then multiplying it by -10,000. Feel free to ask questions and @tag me if you're still uncertain.
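
A rough sketch of what that conversion looks like (the helper name is mine, and the -10,000 matches the value described above; the exact library code may differ):

import torch

def to_additive_mask(attention_mask, dtype=torch.float32):
    # attention_mask: [batch, seq_len] with 1 = keep, 0 = padding.
    # Insert head and query dimensions so the result broadcasts over
    # attention scores of shape [batch, heads, query_len, key_len].
    extended = attention_mask[:, None, None, :].to(dtype)
    # Invert and scale: kept positions add 0, padded positions add -10000.
    return (1.0 - extended) * -10000.0

mask = torch.tensor([[1, 1, 1, 0, 0]])  # one sentence with two padding tokens
print(to_additive_mask(mask))           # tensor([[[[0., 0., 0., -10000., -10000.]]]])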

@Rocketknight1 Thank you for your reply. Yes, I understand how attention_mask gets converted into a large negative value, and why. But in the modeling_bert.py file, there doesn't seem to be any such code chunk converting attention_mask into the proper format. Check this out: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L274

I found the corresponding source code: https://github.com/huggingface/transformers/issues/542

