Model I am using (Bert, XLNet ...): GPT-2
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The tasks I am working on are:
Steps to reproduce the behavior:
attention_mask should reflect causal attention masking for the LM objective when fine-tuning GPT-2, so that the output at time step t only attends to inputs at previous time steps (1, ..., t-1). If it can also see the input at the same time step t, GPT-2 can simply learn to copy the input as the output to optimize the LM objective.
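For reference, the fine-tuning setup I have in mind is roughly the sketch below (written against a recent transformers API; the training text is just a placeholder). The question is whether the prediction at position t can attend to the input token at position t itself:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

# toy example; in practice this would be a batch from the fine-tuning dataset
enc = tokenizer("an example training sentence", return_tensors="pt")

# labels == input_ids is the usual LM fine-tuning setup; the model shifts the
# labels internally, so position t is trained to predict token t+1
outputs = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=enc["input_ids"])
loss = outputs[0]
loss.backward()
```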
transformers version: 2.5.1

Hi @alvinchangw,
GPT2 always uses causal masking no matter what kind of attention_mask you give it.
This is easy to see when you print out the computed attentions for each layer (by setting output_attentions=True); see also #2975.
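As a quick check, something like the following sketch (written against a recent transformers version; on 2.5.1 the attentions come back as the last element of the output tuple instead of `outputs.attentions`) shows that every layer's attention matrix stays lower triangular even though the attention_mask passed in is all ones:

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# output_attentions=True makes the model return the per-layer attention probabilities
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

enc = tokenizer("GPT-2 always applies a causal mask", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])

# one tensor per layer, each of shape (batch, num_heads, seq_len, seq_len)
for layer_idx, attn in enumerate(outputs.attentions):
    # everything strictly above the diagonal should be (numerically) zero
    upper = attn[0].triu(diagonal=1)
    print(layer_idx, bool(torch.allclose(upper, torch.zeros_like(upper))))
```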
In the code this is done in this line:
https://github.com/huggingface/transformers/blob/298bed16a841fae3608d334441ccae4d9043611f/src/transformers/modeling_gpt2.py#L146
I admit it is very cryptic and probably should have better naming. Essentially what happens here is the following:
self.bias is defined as a lower triangular mask (created with torch.tril). According to the sequence length (params nd and ns), we derive b, which is then a lower triangular mask of shape sequence length x sequence length. Using this mask, we subtract 10^4 from all values in w that should be masked, which effectively sets their attention weights to 0 after the softmax.
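As a toy illustration of that line (with made-up scores rather than the real model internals), the mask and the subtraction look roughly like this:

```python
import torch

seq_len = 5  # toy value; in the model this corresponds to nd == ns
# lower triangular matrix of ones, playing the role of self.bias / b
b = torch.tril(torch.ones(seq_len, seq_len))

# fake raw attention scores, playing the role of w before masking
w = torch.randn(seq_len, seq_len)

# keep scores where b == 1, push all masked (future) positions to a large negative value
w = w * b - 1e4 * (1 - b)

# after the softmax, the masked positions get essentially zero attention probability
attn = torch.softmax(w, dim=-1)
print(attn.triu(diagonal=1).abs().max())  # ~0
```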
Hi @patrickvonplaten,
Thank you for pointing this out and for the detailed explanation!