Potential Bug(?)
Reading the codebase, I see that attention masks are ignored for many of the pretrained model configs, such as 'bert-base-uncased'. We can see here that the attention mask is simply cleared out. Is this intentional?
```python
from transformers import BertModel

config_path = 'bert-base-uncased'
config = BertModel.config_class.from_pretrained(config_path)
print(f'is_decoder: {config.is_decoder}')
```

This outputs `is_decoder: False`.
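For reference, the guard in question behaves roughly like the hypothetical helper below (a sketch mirroring the logic in `modeling_bert.py`, not the library's actual code; names are made up for illustration):

```python
from typing import Optional
import torch

def build_cross_attention_mask(config, encoder_attention_mask: torch.Tensor,
                               encoder_hidden_states: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
    # Hypothetical helper mirroring the guard in BertModel.forward: the
    # cross-attention mask is only built when the model is configured as a
    # decoder; otherwise it is dropped entirely.
    if config.is_decoder and encoder_hidden_states is not None:
        # broadcast to (batch, 1, 1, seq_len); 1 -> 0, 0 -> large negative
        return (1.0 - encoder_attention_mask[:, None, None, :]) * -10000.0
    return None  # whatever mask the caller passed in is discarded here
```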
The `encoder_attention_mask` is only relevant if BERT is used as an Encoder-Decoder model via the `EncoderDecoderModel` wrapper class. In this case the decoder should be able to accept an `encoder_attention_mask` for its cross-attention layers.
In all other cases this mask is not relevant and should be set to None.
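As a minimal sketch of that setup (checkpoint names and dummy inputs are placeholders), the `attention_mask` given to the wrapper is forwarded to the decoder as its `encoder_attention_mask`:

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Tie two BERT checkpoints together as encoder and decoder (sketch only).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-uncased', 'bert-base-uncased'
)

batch = tokenizer(['a short input', 'a somewhat longer input sentence'],
                  padding=True, return_tensors='pt')

# attention_mask masks encoder padding; inside the wrapper it is also passed
# to the decoder as encoder_attention_mask for its cross-attention layers.
outputs = model(
    input_ids=batch['input_ids'],
    attention_mask=batch['attention_mask'],
    decoder_input_ids=batch['input_ids'],  # dummy decoder inputs for the sketch
)
```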
I agree that the check `if self.is_decoder` is probably not the best one here; it should rather be `if self.is_encoder_decoder and self.is_decoder`. Will update this soon.
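Roughly, that would tighten the condition to something like the following (illustrative only, not the final patch):

```python
def should_build_cross_attention_mask(config) -> bool:
    # Illustrative version of the suggested guard: only build the mask when
    # BERT is the decoder half of an encoder-decoder setup.
    return getattr(config, 'is_encoder_decoder', False) and config.is_decoder
```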
Feel free to reopen if this does not answer your question
Hi Patrick, thanks for the swift response. I'm not sure if I understand: shouldn't we always want to mask the padded tokens, even in the encoder?
In fact, the canonical BERT implementation suggests this; it has no such check: https://github.com/google-research/bert/blob/master/modeling.py#L200
@patrickvonplaten Sorry for the noise. Noticed you said to reopen the issue but I think only maintainers have this permission :)
This `encoder_attention_mask` is only relevant for a BERT EncoderDecoder model. It is not the same as the usual `attention_mask`.
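To make the distinction concrete, here is a minimal sketch (inputs are placeholders): the usual `attention_mask` still masks padded tokens in a plain `BertModel`, while `encoder_attention_mask` is a separate argument that only a decoder's cross-attention layers consume.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

batch = tokenizer(['a short sentence', 'a much longer sentence that forces padding'],
                  padding=True, return_tensors='pt')

# The usual attention_mask: masks padding in self-attention and is always honoured.
outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
print(outputs[0].shape)  # (batch, seq_len, hidden_size)

# encoder_attention_mask, by contrast, masks *encoder* outputs in cross-attention
# and is only consulted when the model is configured as a decoder.
```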
Ah, I see. Looking again at the code I definitely misunderstood that. Thanks a ton.