Potential Bug(?)
Reading the codebase, I see that attention masks are ignored for many of the pretrained model configs, such as 'bert-base-uncased'. We can see here that the attention mask is simply cleared out. Is this intentional?
```python
from transformers import BertModel

config_path = 'bert-base-uncased'
config = BertModel.config_class.from_pretrained(config_path)
print(f'is_decoder: {config.is_decoder}')
```

This outputs `is_decoder: False`.
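For reference, the guard in question behaves roughly like the hypothetical helper below (a sketch mirroring the logic in `modeling_bert.py`, not the library's actual code; names are made up for illustration):

```python
from typing import Optional
import torch

def build_cross_attention_mask(config, encoder_attention_mask: torch.Tensor,
                               encoder_hidden_states: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
    # Hypothetical helper mirroring the guard in BertModel.forward: the
    # cross-attention mask is only built when the model is configured as a
    # decoder; otherwise it is dropped entirely.
    if config.is_decoder and encoder_hidden_states is not None:
        # broadcast to (batch, 1, 1, seq_len); 1 -> 0, 0 -> large negative
        return (1.0 - encoder_attention_mask[:, None, None, :]) * -10000.0
    return None  # whatever mask the caller passed in is discarded here
```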
The `encoder_attention_mask` is only relevant if BERT is used as an Encoder-Decoder model via the `EncoderDecoderModel` wrapper class. In this case the decoder should be able to accept an `encoder_attention_mask` for its cross-attention layers.
In all other cases this mask is not relevant and should be set to None.
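As a minimal sketch of that setup (checkpoint names and dummy inputs are placeholders), the `attention_mask` given to the wrapper is forwarded to the decoder as its `encoder_attention_mask`:

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Tie two BERT checkpoints together as encoder and decoder (sketch only).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-uncased', 'bert-base-uncased'
)

batch = tokenizer(['a short input', 'a somewhat longer input sentence'],
                  padding=True, return_tensors='pt')

# attention_mask masks encoder padding; inside the wrapper it is also passed
# to the decoder as encoder_attention_mask for its cross-attention layers.
outputs = model(
    input_ids=batch['input_ids'],
    attention_mask=batch['attention_mask'],
    decoder_input_ids=batch['input_ids'],  # dummy decoder inputs for the sketch
)
```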
I agree that the check `if self.is_decoder` is probably not the best one here; it should rather be `if self.is_encoder_decoder and self.is_decoder`. Will update this soon.
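Roughly, that would tighten the condition to something like the following (illustrative only, not the final patch):

```python
def should_build_cross_attention_mask(config) -> bool:
    # Illustrative version of the suggested guard: only build the mask when
    # BERT is the decoder half of an encoder-decoder setup.
    return getattr(config, 'is_encoder_decoder', False) and config.is_decoder
```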
Feel free to reopen if this does not answer your question
Hi Patrick, thanks for the swift response. I'm not sure if I understand: shouldn't we always want to mask the padded tokens, even in the encoder?
In fact, the canonical BERT implementation suggests this; it has no such check: https://github.com/google-research/bert/blob/master/modeling.py#L200
@patrickvonplaten Sorry for the noise. Noticed you said to reopen the issue but I think only maintainers have this permission :)
This `encoder_attention_mask` is only relevant for a BERT EncoderDecoder model. It is not the same as the usual `attention_mask`.
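To make the distinction concrete, here is a minimal sketch (inputs are placeholders): the usual `attention_mask` still masks padded tokens in a plain `BertModel`, while `encoder_attention_mask` is a separate argument that only a decoder's cross-attention layers consume.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

batch = tokenizer(['a short sentence', 'a much longer sentence that forces padding'],
                  padding=True, return_tensors='pt')

# The usual attention_mask: masks padding in self-attention and is always honoured.
outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
print(outputs[0].shape)  # (batch, seq_len, hidden_size)

# encoder_attention_mask, by contrast, masks *encoder* outputs in cross-attention
# and is only consulted when the model is configured as a decoder.
```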
Ah, I see. Looking again at the code I definitely misunderstood that. Thanks a ton.