Hello, in the docstring of the GPT2 model, it says there is an optional input called attention_mask to avoid computing attention on padding. But I cannot actually find the implementation, and there is no such argument either.
Indeed, I will remove this docstring; there is no attention_mask on GPT-2.
But what should I do if I want to avoid computing attention on the padding in the input sequences?
@Saner3 @thomwolf I have the same question. Don't we need that for padding?
GPT-2 is a model with absolute position embeddings (like BERT), so you should always pad on the right to get the best performance with this model (I will add this information to the docstring).
As it's a causal model (it only attends to the left context), this also means that the model will never attend to the padding tokens (which are on the right) from any real token anyway.
So in conclusion, there is no need to take special care to avoid attention on padding.
Just don't use the outputs of the padded tokens for anything, as they don't contain any reliable information (which is obvious, I hope).
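To make that concrete, here is a minimal sketch of the recommended usage: right-pad a batch, run GPT2Model without any mask, and then read hidden states only at real token positions. It assumes a transformers version whose tokenizer can batch with `padding=True` and PyTorch; the variable names are illustrative, not from the library.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# GPT-2 has no pad token by default; reuse the EOS token just so batching works.
tokenizer.pad_token = tokenizer.eos_token

texts = ["Hello world", "A longer sentence that does not need any padding"]
# Pad on the right (the tokenizer default), as recommended above for a model
# with absolute position embeddings.
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"])  # no attention mask passed

hidden_states = outputs[0]  # shape: (batch, seq_len, hidden)

# Only read positions that hold real tokens; the states at padded positions
# carry no reliable information.
lengths = batch["attention_mask"].sum(dim=1)   # number of real tokens per row
batch_idx = torch.arange(hidden_states.size(0))
last_real_token_state = hidden_states[batch_idx, lengths - 1]
```

Note that `attention_mask` here comes from the tokenizer and is only used to locate the real tokens; per the explanation above, it never needs to be fed to the model, since a causal model with right padding cannot attend to the padding from any real token.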
@thomwolf thanks much, and great job!