Model I am using (Bert, XLNet ...): GPT-2
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The tasks I am working on are:
Steps to reproduce the behavior:
attention_mask should reflect causal attention masking for the LM objective when fine-tuning GPT-2, so that the output at time step t only attends to inputs at previous time steps (1, ..., t-1). If it can also see the input at the same time step t, GPT-2 can simply learn to copy the input as the output to optimize the LM objective.
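For reference, the fine-tuning setup I have in mind is roughly the sketch below (written against a recent transformers API; the training text is just a placeholder). The question is whether the prediction at position t can attend to the input token at position t itself:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

# toy example; in practice this would be a batch from the fine-tuning dataset
enc = tokenizer("an example training sentence", return_tensors="pt")

# labels == input_ids is the usual LM fine-tuning setup; the model shifts the
# labels internally, so position t is trained to predict token t+1
outputs = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=enc["input_ids"])
loss = outputs[0]
loss.backward()
```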
transformers version: 2.5.1

Hi @alvinchangw,
GPT2 always uses causal masking no matter what kind of attention_mask you give it.
This is easy to see when you print out the computed attentions for each layer (by setting output_attentions=True); see also #2975.
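As a quick check, something like the following sketch (written against a recent transformers version; on 2.5.1 the attentions come back as the last element of the output tuple instead of `outputs.attentions`) shows that every layer's attention matrix stays lower triangular even though the attention_mask passed in is all ones:

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# output_attentions=True makes the model return the per-layer attention probabilities
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

enc = tokenizer("GPT-2 always applies a causal mask", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])

# one tensor per layer, each of shape (batch, num_heads, seq_len, seq_len)
for layer_idx, attn in enumerate(outputs.attentions):
    # everything strictly above the diagonal should be (numerically) zero
    upper = attn[0].triu(diagonal=1)
    print(layer_idx, bool(torch.allclose(upper, torch.zeros_like(upper))))
```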
In the code this is done in this line:
https://github.com/huggingface/transformers/blob/298bed16a841fae3608d334441ccae4d9043611f/src/transformers/modeling_gpt2.py#L146
I admit it is very cryptic and probably should have better naming. Essentially what happens here is the following:
self.bias is defined as a lower triangular mask (created with torch.tril). According to the sequence length (params nd and ns), we derive b, which is then a lower triangular mask of shape sequence length x sequence length. Using this mask, we subtract 10^4 from all values in w that should be masked, which effectively sets their attention weights to 0 after the softmax.
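As a toy illustration of that line (with made-up scores rather than the real model internals), the mask and the subtraction look roughly like this:

```python
import torch

seq_len = 5  # toy value; in the model this corresponds to nd == ns
# lower triangular matrix of ones, playing the role of self.bias / b
b = torch.tril(torch.ones(seq_len, seq_len))

# fake raw attention scores, playing the role of w before masking
w = torch.randn(seq_len, seq_len)

# keep scores where b == 1, push all masked (future) positions to a large negative value
w = w * b - 1e4 * (1 - b)

# after the softmax, the masked positions get essentially zero attention probability
attn = torch.softmax(w, dim=-1)
print(attn.triu(diagonal=1).abs().max())  # ~0
```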
Hi @patrickvonplaten,
Thank you for pointing this out and for the detailed explanation!