transformers version: 3.3.1
@cccntu @patrickvonplaten @LysandreJik
Model I am using (Bert, XLNet ...): DialoGPT-large
Steps to reproduce the behavior:
Checkout 121dd43.
Run the DialoGPT "How to use" code given here, but change DialoGPT-medium to DialoGPT-large:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a tensor in PyTorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # pretty print the last output tokens from the bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
>> User:Hello, how are you?
DialoGPT: 're you a fan of the show?
Note: this problem is still present in the current version of master (5148f43).
With the previous commit, 0c64b18, I get this output:
>> User:Hello, how are you?
DialoGPT: I'm good, you?
The issue seems to be related to the <|endoftext|> token, which is used at the end of every utterance. It is being treated as a padding token and therefore attention-masked, which also seems to affect the position ids.
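A minimal sketch of what I think is happening (reusing the tokenizer from the snippet above; the position-id computation is only an approximation of what the model does, not the exact library code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")

# one user turn followed by <|endoftext|>, exactly as built in the chat loop above
input_ids = tokenizer.encode("Hello, how are you?" + tokenizer.eos_token, return_tensors='pt')

# this mirrors the mask that generate() builds when pad_token_id == eos_token_id
# and the eos token is present in the input: the eos position gets masked out
attention_mask = input_ids.ne(tokenizer.eos_token_id).long()
print(attention_mask)  # ends in 0, i.e. the eos token is treated as padding

# rough sketch of how position ids get derived from the attention mask during
# generation; the masked eos token also perturbs the positions of later tokens
position_ids = attention_mask.cumsum(-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 1)
print(position_ids)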
Hi @abisee, sorry for the inconvenience.
Even though you did not pass in an attention mask, one is created here (see the first two lines of this snippet):
https://github.com/huggingface/transformers/blob/5148f433097915f30864bf0ca6090656fecefbb8/src/transformers/generation_utils.py#L352-L363
changing this
chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
to
chat_history_ids = model.generate(bot_input_ids, max_length=1000)
seems to solve the problem. (pad_token_id will still end up set to tokenizer.eos_token_id, but only after attention_mask has already been set to all ones.)
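Another option that should also work (untested sketch, reusing bot_input_ids, model and tokenizer from the chat loop above): keep pad_token_id but pass an explicit all-ones attention_mask, so generate() never derives one from the pad token:

# explicit mask so generate() does not build one from pad_token_id
attention_mask = torch.ones_like(bot_input_ids)
chat_history_ids = model.generate(
    bot_input_ids,
    max_length=1000,
    pad_token_id=tokenizer.eos_token_id,
    attention_mask=attention_mask,
)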
Here is how the bug can happen: if someone passes pad_token_id=tokenizer.eos_token_id to generate() and the input_ids already contain the eos token (as every DialoGPT chat history does), the automatically created attention_mask masks out those eos positions.
Don't have a better solution for now, will think about it.
@patrickvonplaten @LysandreJik What do you think?
Maybe generate() should not create the attention mask for users at all, but that could break other code, too.
Thanks for the response @cccntu!
My understanding is that both GPT2 and DialoGPT were trained without a pad token; i.e. neither model has a pad token embedding. In that case, why does the DialoGPT example code contain pad_token_id=tokenizer.eos_token_id? What's the purpose of doing this, if a pad token isn't needed for generation, and the EOS token was never used as a pad token during training?
For generation, it seems that attention masks are created automatically (if there's an assigned pad token that appears in the input). See GenerationMixin.generate():
# create attention mask if necessary
# TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140
if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):
    attention_mask = input_ids.ne(pad_token_id).long()
elif attention_mask is None:
    attention_mask = input_ids.new_ones(input_ids.shape)
However for training (at least for GPT2 models), as far as I can tell, the attention mask is _not_ created automatically, even if there's an assigned pad token that appears in the input.
This seems like an unexpected discrepancy, and another reason to put the attention mask creation in the model's forward as proposed by @thomwolf in PR 3140.
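For comparison, here is roughly what training code has to do by hand today (just a sketch, reusing eos as the pad token like in the generation setup above; masking the pad positions out of the labels is omitted for brevity):

# the mask is built explicitly (here by the tokenizer) and passed to forward();
# nothing in the model creates it automatically from pad_token_id
tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(["Hello, how are you?", "Hi"], padding=True, return_tensors='pt')
outputs = model(
    input_ids=batch['input_ids'],
    attention_mask=batch['attention_mask'],
    labels=batch['input_ids'],
)
loss = outputs[0]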
That's a super interesting issue! Thanks for posting it here!
So in short, in order to be able to do batch generation with GPT2 (or beam search), we have to use some kind of token as the pad_token_id in case one sequence in the batch finishes early. We decided a while back that for GPT2 we will just use the eos_token_id as the pad_token_id in this case.
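For example, a batched call looks roughly like this (a quick sketch, not taken from the docs; left padding plus an explicit mask so the padding is handled correctly):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# reuse eos as pad, pad on the left so all prompts end at the same position
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

batch = tokenizer(["Hello, how are you?", "Hi"], padding=True, return_tensors='pt')
outputs = model.generate(
    batch['input_ids'],
    attention_mask=batch['attention_mask'],
    max_length=30,
    pad_token_id=tokenizer.eos_token_id,  # fills sequences that finish early
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))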
Just as you noticed, the problem lies in generate() automatically creating the attention_mask and falsely assuming the eos_token_id is a pad_token_id.
IMO, it was a mistake to automatically create the attention_mask in generate(), as it can lead to unexpected problems such as this one!
I'm currently doing a big generate() refactor and in this refactor the problem should be solved (see comment in PR linked below).
I hope that I'll be able to merge the PR in ~1 week.