Transformers: No Causal Attention Masking in GPT-2 LM Finetuning Script

Created on 1 Mar 2020 · 2 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): GPT-2

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [X] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [X] an official GLUE/SQuAD task: (give the name)
    running run_language_modeling.py on the WikiText-2 dataset
  • [ ] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Run run_language_modeling.py and observe that attention_mask is None at each forward step of the GPT-2 model (GPT2LMHeadModel)

Expected behavior

attention_mask should reflect causal attention masking for the LM objective when finetuning GPT-2, so that the output at time step t attends only to inputs at previous time steps (1, ..., t-1). If the output at step t could also rely on the input at the same time step t, GPT-2 could simply learn to copy the input as output to optimize the LM objective.
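
To illustrate, here is a minimal sketch (my own illustration, not code from run_language_modeling.py) of the visibility pattern I expected, where row t marks which input positions the output at step t may attend to:

```python
import torch

seq_len = 5
# Strictly lower-triangular matrix: row t has ones only in columns < t,
# i.e. the output at step t would attend only to inputs 1, ..., t-1.
expected_visibility = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long), diagonal=-1)
print(expected_visibility)
```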

Environment info

  • transformers version: 2.5.1
  • Platform: Linux-4.15.0-76-generic-x86_64-with-glibc2.10
  • Python version: 3.8.1
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

All 2 comments

Hi @alvinchangw,

GPT-2 always applies causal masking, no matter what kind of attention_mask you pass it.
You can easily verify this by printing the computed attentions of each layer (by setting output_attentions=True); see also #2975 for this.
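
For example, something like this quick check (a sketch of my own, assuming transformers 2.5.x where the attention probabilities come back as the last element of the output tuple; the checkpoint and sentence are arbitrary) shows that the upper triangle is zero even though no attention_mask is passed:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)

input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
outputs = model(input_ids)      # note: no attention_mask is given
attentions = outputs[-1]        # one tensor per layer, shape (batch, heads, seq, seq)

# Everything above the diagonal should be (numerically) zero in every layer.
for i, layer_attention in enumerate(attentions):
    above_diagonal = layer_attention.triu(diagonal=1)
    print(f"layer {i}: max attention weight above the diagonal = {above_diagonal.max().item():.2e}")
```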

In the code, this is done on this line:
https://github.com/huggingface/transformers/blob/298bed16a841fae3608d334441ccae4d9043611f/src/transformers/modeling_gpt2.py#L146

I admit it is quite cryptic and could probably use better naming. Essentially, the following happens:
self.bias is defined as a lower-triangular mask (built with torch.tril). According to the sequence length (params nd and ns), we derive b, which is then a lower-triangular mask of shape sequence length x sequence length. Using this mask, we subtract 10^4 from all values in w that should be masked, which sets their attention weights to (effectively) 0 after the softmax.
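
If it helps, here is a small standalone sketch of that masking step (paraphrased, not copied verbatim from modeling_gpt2.py; the variable names w, b, nd, ns mirror the ones above, and the random scores are just for illustration):

```python
import torch

n_ctx = 5                                    # context size the buffer was built for
# "self.bias": a lower-triangular matrix of ones, registered once per attention layer
bias = torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)

w = torch.randn(1, 1, n_ctx, n_ctx)          # raw attention scores (query x key)
nd, ns = w.size(-2), w.size(-1)              # number of query and key positions
b = bias[:, :, ns - nd:ns, :ns]              # slice of the triangular mask for this sequence
w = w * b - 1e4 * (1 - b)                    # masked scores end up at -10^4 before the softmax

probs = torch.nn.functional.softmax(w, dim=-1)
print(probs[0, 0])                           # everything above the diagonal is ~0
```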

Hi @patrickvonplaten,

Thank you for pointing this out and for the detailed explanation!
