Transformers: Can GPT2LMHeadModel do batch inference with variable sentence lengths?

Created on 25 Feb 2020 · 39 Comments · Source: huggingface/transformers

Given that the GPT2 tokenizer does not have an internal pad_token_id, how do I pad sentences and do batch inference using GPT2LMHeadModel?
Specifically, my code is:

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

prompt_text = [
    'in this paper we',
    'we are trying to',
    'The purpose of this workshop is to check whether we can', ]

tokens = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(x, add_prefix_space=True)) for x in prompt_text]

inputs = pad_sequence([torch.LongTensor(x) for x in tokens], batch_first=True, padding_value=tokenizer.eos_token_id)

outputs, past = model(input_ids=inputs, attention_mask=None)

This will return irrelevant predictions, since GPT2 will consider the eos_tokens and start a new sentence in the batch.

Can anyone please share sample code that uses GPT2LMHeadModel to do batch inference with variable sentence lengths?

Thanks!


All 39 comments

It seems possible to bypass this issue by setting an appropriate attention_mask so that no tokens attend to the positions that are supposed to be padding; this way you can use whatever token you like as padding. I'm working on this issue too and will follow up if it works out.

I tried a rough version: basically, mask out the padding positions with the attention mask and keep updating this mask as generation grows. One thing worth noting is that in the first step, instead of extracting the -1-th position's output for each sample, we need to keep track of the real prompt ending position; otherwise the output from a padding position may be extracted and produce random results.

Code snippet:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

prompt_text = [
    'in this paper we',
    'we are trying to',
    'The purpose of this workshop is to check whether we can', ]
batch_size = len(prompt_text)
max_length = 30
eos_token_id = tokenizer.eos_token_id

model = model.cuda()

token_ids = [tokenizer.encode(s, add_special_tokens=False) for s in prompt_text]
prompt_lengths = [len(s) for s in token_ids]
max_prompt_len = max(prompt_lengths)

# use 0 as padding id, shouldn't matter
padded_tokens = [tok_ids + [0] * (max_prompt_len - len(tok_ids)) for tok_ids in token_ids]
input_ids = torch.LongTensor(padded_tokens).cuda()
attn_mask = torch.zeros(input_ids.shape).long().cuda()
for ix, tok_ids in enumerate(token_ids):
    attn_mask[ix][:len(tok_ids)] = 1

unfinished_sents = input_ids.new(batch_size).fill_(1)
past = None
cur_len = input_ids.shape[1]

def post_processing(input_ids, attn_mask):
    """Remove padding tokens in the middle of the sequence."""
    input_ids_proc = []
    for ix, seq in enumerate(input_ids):
        input_ids_proc.append([tok_id for tok_id, mask in zip(seq, attn_mask[ix]) if mask != 0])
    return input_ids_proc


# index of the last real (non-pad) token per prompt, expanded over the vocab dimension (50257 = GPT-2 vocab size)
input_lengths_index = torch.tensor([x - 1 for x in prompt_lengths]).cuda()
input_lengths_index = input_lengths_index.view(-1, 1).repeat(1, 50257).unsqueeze(1)

while cur_len < max_length:
    model_inputs = model.prepare_inputs_for_generation(input_ids, past=past, attention_mask=attn_mask)
    outputs = model(**model_inputs)
    if cur_len == max_prompt_len:
        # at first step we can't directly extract the -1-th position's
        # prediction for next word, since for some samples the -1-th
        # token is PAD. Instead we keep track of the real prompt ending.
        next_token_logits = outputs[0].gather(1, input_lengths_index).squeeze(1)
    else:
        next_token_logits = outputs[0][:, -1, :]
    past = outputs[1]
    next_token = torch.argmax(next_token_logits, dim=-1)
    tokens_to_add = next_token * unfinished_sents + 0 * (1 - unfinished_sents)
    input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
    attn_mask = torch.cat([attn_mask, torch.ones((batch_size, 1)).long().cuda()], dim=1)

    unfinished_sents.mul_(tokens_to_add.ne(eos_token_id).long())
    cur_len += 1

    if unfinished_sents.max() == 0:
        break

input_ids = post_processing(input_ids, attn_mask)
for item in input_ids:
    print(tokenizer.decode(item))

Also a minor change to src/transformers/modeling_gpt2.py:

line 422: attention_mask = attention_mask.view(-1, input_shape[-1])

change to attention_mask = attention_mask.view(input_shape[0], -1)

(not sure if this change will break other things)

Output:

in this paper we have a very good idea of how to use the data to make predictions about the future. We
we are trying to get the best possible deal for the best price. We are not going to be able to offer
The purpose of this workshop is to check whether we can make a difference in the lives of people who are struggling with mental illness.

@schizism LM inference on batches of different lengths is actually a problem we are currently looking at. Ideally, you should be able to simply pass your input_ids (and an attention_mask) to model.generate() to make it work.

@XinyuHua thanks for your great contribution to making LM inference work on batches with different lengths. It also seems like you found a bug when using the past and attention_mask variables as inputs to GPT2. That's great! I will open a new issue for that and take a look :-)

Below, I am adding a simplified code snippet using simpler tokenization functions.
In this code, no past variable is used, because of the bug found by @XinyuHua.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<PAD>')
# IMPORTANT: Note that setting the <PAD> token like this in the constructor gives the
# pad_token the pad_token_id = 50256, which normally belongs to the <BOS> token in GPT2.
# This is a very ugly way of setting the pad_token_id to a token that is already included in the vocab size, but it works at the moment. This will be updated in the coming weeks!  # noqa: E501

prompt_text = [
    'in this paper we',
    'we are trying to',
    'The purpose of this workshop is to check whether we can']

# batch_encode_plus handles a batch of sequences and automatically creates the attention_masks
seq_len = 11
encodings_dict = tokenizer.batch_encode_plus(prompt_text, max_length=seq_len, pad_to_max_length=True)

# ideally we should be able to just input the following two variables to the function model.generate() ... => to be implemented soon!  # noqa: E501
input_ids = torch.tensor(encodings_dict['input_ids'])
attn_mask = torch.tensor(encodings_dict['attention_mask'])

num_tokens_to_produce = 20
pad_token_id = tokenizer.pad_token_id
eos_token_id = tokenizer.eos_token_id
eos_not_in_sents = torch.ones(input_ids.shape[0]).long()

# index of the last non-padded token in each sequence
last_non_masked_idx = torch.sum(attn_mask, dim=1) - 1
start_idx = last_non_masked_idx.view(-1, 1).repeat(1, tokenizer.vocab_size).unsqueeze(1)
past = None

# get correct position ids
position_ids = torch.tensor([list(range(seq_len)) for i in range(input_ids.shape[0])])
for i, position_ids_slice in enumerate(position_ids):
    position_ids_slice[last_non_masked_idx[i]:] = position_ids_slice[last_non_masked_idx[i]]

for step in range(num_tokens_to_produce):
    outputs = model(input_ids, attention_mask=attn_mask, position_ids=position_ids)

    # in the first decoding step, we want to use the 'real' last position for each sentence
    if step == 0:
        next_token_logits = outputs[0].gather(1, start_idx).squeeze(1)
    else:
        next_token_logits = outputs[0][:, -1, :]

    next_tokens = torch.argmax(next_token_logits, dim=-1)

    # this updates which sentences have not seen an <EOS> token so far
    # if one <EOS> token was seen the sentence is finished
    eos_not_in_sents.mul_(next_tokens.ne(eos_token_id).long())

    # either append a padding token here if <EOS> has been seen or append next token
    tokens_to_add = next_tokens * (eos_not_in_sents) + pad_token_id * (1 - eos_not_in_sents)

    # Update input_ids, attn_mask and position_ids
    input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
    attn_mask = torch.cat([attn_mask, torch.ones((attn_mask.shape[0], 1)).long()], dim=1)
    position_ids = torch.cat([position_ids, (position_ids[:, -1] + 1).unsqueeze(-1)], dim=1)

[print(tokenizer.decode(output, skip_special_tokens=True)) for output in input_ids]

Thanks for this much cleaner version @patrickvonplaten! Just one quick issue: I forgot to modify the position ids for each sample, so the padding adds up in the position ids and future tokens get wrong position ids. This might cause issues when the prompt lengths in a batch are very different.

Fixed the issue #3033 regarding the attention mask with your proposed solution @XinyuHua - thanks!


added the correct position ids. Feel free to review and comment!

Thank you @XinyuHua @patrickvonplaten! These are very helpful!

@patrickvonplaten It looks like tokens_to_add in your script is unused, should that be used in place of next_tokens in the line input_ids = torch.cat([input_ids, next_tokens.unsqueeze(-1)], dim=-1)?

Uups! Yeah definitely - thanks a lot for pointing this out. Edited the script :-)

Hi, padding still seems to be an issue with LM heads when only computing perplexity (not generating). I am trying to run examples/run_language_modeling.py and having a hard time using GPT2LMHeadModel; the same is the case with Transformer-XL. I am running it in evaluation mode only (by setting --do_eval).

That example code uses training.py and data/data_collator.py, which throws the following error while batching sentences:
"ValueError: You are attempting to pad samples but the tokenizer you are using (TransfoXLTokenizer) does not have one."

Any idea where I could be going wrong?
Thanks

@bajajahsaas Are you using the --line_by_line flag? Can you post the exact command you're running?

@julien-c I just ran into the exact same issue and I am indeed using the --line_by_line flag. The exact command I'm using:

python run_language_modeling.py \
    --output_dir='/content/drive/My Drive/finetuned_models/run1' \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --save_total_limit=5 \
    --num_train_epochs=1.0 \
    --overwrite_output_dir \
    --do_train \
    --evaluate_during_training \
    --logging_steps=1000 \
    --save_steps=1000 \
    --train_data_file=/content/train.txt \
    --line_by_line \
    --do_eval \
    --eval_data_file=/content/valid.txt \
    --per_gpu_train_batch_size=2 \
    --per_gpu_eval_batch_size=2 \

If I take the --line_by_line flag out, the command executes fine.

Hi @julien-c, thanks for checking this. I am using --line_by_line and my exact command is as below:

python run_lm.py --model_type gpt2 --model_name_or_path gpt2 --do_eval --eval_data_file ../../data/wikitext-103/valid.txt --line_by_line --output_dir logslm

I am just running inference on the wikitext-103 dataset, and both XLNet and Transformer-XL throw this error. However, since the error is caused by: https://github.com/huggingface/transformers/blob/4e817ff41885063e08bb3bcd63e5adfd835b9911/src/transformers/data/data_collator.py#L106
I tried a simple workaround: tokenizer.pad_token = "<pad>". I am not sure if this is a correct fix, and even the perplexity scores do not match on standard datasets. Note: I am not doing any training, just perplexity calculation.

Yes GPT2 is not compatible with the LineByLineDataset, because it doesn't have a padding token out of the box.

Feel free to propose an update to the error's wording if you think of a clearer way to express that.
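A minimal sketch (not part of the original thread) of two common ways to give GPT-2 a padding token before using such a collator; note that the perplexity you get still depends on how the padded positions are handled in the loss:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Option A: reuse the existing <|endoftext|> token as the pad token (no new embedding rows needed)
tokenizer.pad_token = tokenizer.eos_token

# Option B: add a dedicated [PAD] token and resize the embedding matrix to the new vocab size
# tokenizer.add_special_tokens({"pad_token": "[PAD]"})
# model.resize_token_embeddings(len(tokenizer))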

Sure, thanks for looking into this. Moreover, how should we use this example code (run_language_modeling.py) for such models? I tried removing --line_by_line for the wikitext-103 dataset, but that screws up the data processing in my opinion.

This is not a real fix, more of a hack, but if you change the code in transformers.data.data_collator.DataCollatorForLanguageModeling._tensorize_batch
from:
from:

if self.tokenizer._pad_token is None:
    raise ValueError(...)

to:

if self.tokenizer._pad_token is None:
    return pad_sequence(examples, batch_first=True)

The language modeling script will then run fine with --line_by_line. In practice, this pads with zeros, which is the default padding_value.

This "error" was introduced a week ago with the commit to master dd9d483d03962fea127f59661f3ae6156e7a91d2 by @julien-c that refactored the LM train script. I was using the LM script with the same data before that and it was working.

I am not sure how "wrong" this is, but I'm using a dataset of relatively short texts (up to 400 words each, often shorter), and I'm getting decent results. I get a bunch of "!" (the token 0) at the end of the generation sometimes, but other than that, it looks good.

I tried an alternative of separating the short texts with <|endoftext|> tokens, and training without the --line_by_line option, but the results I get in generation are qualitatively much worse.

Hi @jorgemcgomes, thanks for checking. However, see this issue: it seems the tokenizers point index 0 to some vocabulary token.

How about using left-side padding for GPT-2, and using the attention mask to avoid attending to those padded tokens? Of course, position_ids must be set properly to avoid impacting the position embeddings. This approach can also work with the past state, since the padding tokens are not in the middle of the sequence after appending a generated token.


@tianleiwu this worked for me! Saved me HOURS in compute time, thank you!

tokenizer.padding_side = "left"
encoded_prompt_dict = tokenizer.batch_encode_plus(input, return_tensors="pt", pad_to_max_length=True)
encoded_prompt = encoded_prompt_dict['input_ids'].to(args.device)
encoded_mask = encoded_prompt_dict['attention_mask'].to(args.device)
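For completeness, a minimal sketch (not from the thread) of how the position_ids that @tianleiwu mentions could be derived from the attention mask when padding on the left; variable names are illustrative:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = GPT2LMHeadModel.from_pretrained("gpt2")

enc = tokenizer.batch_encode_plus(
    ["in this paper we", "we are trying to"], return_tensors="pt", pad_to_max_length=True)
attention_mask = enc["attention_mask"]

# position ids count only the real tokens; padded positions get a dummy value of 0
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 0)

outputs = model(enc["input_ids"], attention_mask=attention_mask, position_ids=position_ids)
next_token_logits = outputs[0][:, -1, :]  # with left padding, the last position is always a real token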


@patrickvonplaten Thanks for sharing this. I wonder if passing input_ids and attn_mask to model.generate is possible now; is this feature available?
I've tried it, and I think there are some concerns regarding the positional embeddings, since I don't get meaningful results.

On the other hand, when I try setting tokenizer.padding_side = "left" as suggested/tried by @tianleiwu and @AADeLucia, I get the same output for different hyperparameters (top-k, top-p, length, ...).
@AADeLucia @tianleiwu, have you been successful with this? Did you take any action regarding position_ids?

Would appreciate any pointer.

@fabrahman I realize I have transformers version 2.8 installed, which was not working with generate(). I used left-side padding with top-p sampling and it worked for me (i.e. the outputs were reasonable for the settings and I was not getting the same issues as when I did not use left-side padding). I took no action regarding position_ids and only provided the attention mask. Maybe the newest version implements generate() correctly?

What do you mean you get the same output? Can you post your code?

@AADeLucia thanks for your quick reply. When you say it was not working with generate(), does that mean you got errors when passing encoded_prompt and encoded_mask to the generate function?

Actually, I resolved the issue of getting the same output with different decoding settings, but now I get identical outputs when I sample 5 times (num_return_sequences=5). That is, the returned sequences are the same.
This is the code I am trying as an example:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<PAD>')
prompt_text = [
    'in this paper we',
    'we are trying to',
    'The purpose of this workshop is to check whether we can']

# encode plus batch handles multiple batches and automatically creates attention_masks
seq_len = 11
tokenizer.padding_side = "left"
encodings_dict = tokenizer.batch_encode_plus(prompt_text, max_length=seq_len, pad_to_max_length=True)

input_ids = torch.tensor(encodings_dict['input_ids'])
attn_mask = torch.tensor(encodings_dict['attention_mask'])

outputs = model.generate(input_ids, attention_mask=attn_mask, do_sample=True, max_length=40, top_k=10, num_return_sequences=5)
outputs = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
outputs = [text[:text.find(".")+1] for text in outputs if "." in text]
outputs

and here are the output results:

['in this paper we present a new approach to the problem of the "unconscious" and the "conscious" in the study of the unconscious.',
 'in this paper we present a new approach to the problem of the "unconscious" and the "conscious" in the study of the unconscious.',
 'in this paper we present a new approach to the problem of the "unconscious" and the "conscious" in the study of the unconscious.',
 'in this paper we present a new approach to the problem of the "unconscious" and the "conscious" in the study of the unconscious.',
 'in this paper we present a new approach to the problem of the "unconscious" and the "conscious" in the study of the unconscious.',
 'we are trying to get a new version of the game to work on the PC.',
 'we are trying to get a new version of the game to work on the PC.',
 'we are trying to get a new version of the game to work on the PC.',
 'we are trying to get a new version of the game to work on the PC.',
 'we are trying to get a new version of the game to work on the PC.',
 'The purpose of this workshop is to check whether we can make a difference in the lives of people who are struggling with mental illness.',
 'The purpose of this workshop is to check whether we can make a difference in the lives of people who are struggling with mental illness.',
 'The purpose of this workshop is to check whether we can make a difference in the lives of people who are struggling with mental illness.',
 'The purpose of this workshop is to check whether we can make a difference in the lives of people who are struggling with mental illness.',
 'The purpose of this workshop is to check whether we can make a difference in the lives of people who are struggling with mental illness.']

@faiazrahman By "not working" I mean I would pass in padded prompts and masks and the model would generate as if the mask was not there. So the padded prompts were like

<|startoftext|>Hello there<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>

(I padded with <|endoftext|> but it shouldn't matter as long as the attention mask is working)
And then the output would see the multiple <|endoftext|> padding tokens and start generating <|startoftext|> instead of continuing from the prompts!

Hmm, I only generated 1 sequence for each input, but I just tried generating multiple outputs as a test. I run into the same repetition issue as you with top-k, but not with top-p.

I believe Alexandra meant to tag @fabrahman :)


@AADeLucia I actually found the issue. It was because I was passing both top_p=0 and top_k=10. When I removed top_p for top-k sampling, the problem was resolved. I updated my code snippet.
BTW, my transformers version is 2.11.0 in case you want to try it.

@patrickvonplaten Would you please confirm whether this is the right approach and doesn't break anything?

@fabrahman,

I did not use the generate() method, but batch inference works for me in the following way:
(1) Get input_ids and attention_mask from tokenizer.batch_encode_plus directly. The padding strategy does not matter.

       position_ids = (attention_mask.long().cumsum(-1) - 1)
       position_ids.masked_fill_(position_ids < 0, 0)
       past = None

(2) Use the model to do inference and get outputs including past. For example, we can construct the new inputs like this:

  • update past tensor from the outputs
  • input_ids is the generated tokens with shape (batch_size, 1)
       position_ids = (position_ids[:,-1] + 1).reshape(batch_size,1)
       attention_mask = torch.cat([attention_mask, torch.ones([self.batch_size, 1]).type_as(attention_mask)], 1).to(device)

Loop this step until the exit condition is satisfied.

I have a notebook that shows an example of batch generation.
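A minimal greedy-decoding sketch (not from the linked notebook) of the loop described above, assuming left-side padding and a transformers version from around the time of this thread, where the forward accepts a past keyword and returns (logits, past):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

enc = tokenizer.batch_encode_plus(
    ["in this paper we", "we are trying to"], return_tensors="pt", pad_to_max_length=True)
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]
batch_size = input_ids.shape[0]

# step (1): position ids that skip over the left padding
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(position_ids < 0, 0)
past = None
generated = input_ids

with torch.no_grad():
    for _ in range(20):
        outputs = model(input_ids, attention_mask=attention_mask,
                        position_ids=position_ids, past=past)
        next_tokens = outputs[0][:, -1, :].argmax(dim=-1)  # greedy pick
        past = outputs[1]                                  # updated past from the outputs
        generated = torch.cat([generated, next_tokens.unsqueeze(-1)], dim=-1)

        # step (2): feed only the newly generated token, with updated positions and mask
        input_ids = next_tokens.unsqueeze(-1)
        position_ids = (position_ids[:, -1] + 1).reshape(batch_size, 1)
        attention_mask = torch.cat(
            [attention_mask, torch.ones(batch_size, 1, dtype=attention_mask.dtype)], dim=1)

print([tokenizer.decode(g, skip_special_tokens=True) for g in generated])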

Sorry for the wrong tag! And @fabrahman , glad you found the bug!

For GPT2LMHeadModel, I think we can do this:

def prepare_inputs_for_generation(self, input_ids, past=None, **kwargs):
    # only last token for inputs_ids if past is defined in kwargs
    if past:
        input_ids = input_ids[:, -1].unsqueeze(-1)

    attention_mask = kwargs.get("attention_mask", None)
    if attention_mask is not None:
        position_ids = (attention_mask.long().cumsum(-1) - 1)
        position_ids.masked_fill_(attention_mask==0, 0) # can be filled with anything >= 0
        if past:
            position_ids = position_ids[:, -1].unsqueeze(-1)
    else:
        position_ids = None
    return {
            "input_ids": input_ids,
            "past_key_values": past,
            "use_cache": kwargs.get("use_cache"),
            "position_ids": position_ids,
            "attention_mask": attention_mask, # I forgot to add this line and it took me hours debugging.
            }

here:
https://github.com/huggingface/transformers/blob/4bd7be9a4268221d2a0000c7e8033aaeb365c03b/src/transformers/modeling_gpt2.py#L665-L674

So we don't need to care about position ids in generate(), since it calls prepare_inputs_for_generation.
https://github.com/huggingface/transformers/blob/4bd7be9a4268221d2a0000c7e8033aaeb365c03b/src/transformers/generation_utils.py#L534-L536

And in examples/text-generation/run_generation.py,
use tokenizer.padding_side = "left" to avoid this:

for step in range(num_tokens_to_produce):
    outputs = model(input_ids, attention_mask=attn_mask, position_ids=position_ids)

    # in the first decoding step, we want to use the 'real' last position for each sentence
    if step == 0:
        next_token_logits = outputs[0].gather(1, start_idx).squeeze(1)
    else:
        next_token_logits = outputs[0][:, -1, :]

    next_tokens = torch.argmax(next_token_logits, dim=-1)

and use tokenizer.batch_encode_plus to get the attention_mask and pass it to generate().

@patrickvonplaten What do you think? I see you are working on this. :)
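A minimal sketch (not from the thread) of how one might try this override without editing the library source, by monkey-patching the method on a model instance. It assumes the prepare_inputs_for_generation function from the comment above is already defined in scope and a transformers version contemporary with this thread:

import types
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # as suggested above
model = GPT2LMHeadModel.from_pretrained("gpt2")

# bind the patched prepare_inputs_for_generation (defined in the comment above) to this instance
model.prepare_inputs_for_generation = types.MethodType(prepare_inputs_for_generation, model)

enc = tokenizer.batch_encode_plus(
    ["in this paper we", "we are trying to"], return_tensors="pt", pad_to_max_length=True)
output_ids = model.generate(enc["input_ids"], attention_mask=enc["attention_mask"], max_length=30)
print([tokenizer.decode(o, skip_special_tokens=True) for o in output_ids])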

@cccntu have you tested these changes? Do they work?

If it works, I think it would be very useful to many folks out there (including me :)). If so, maybe just send a pull request.

@andreamad8 I haven't tried it yet. :) Maybe I will try it next week. Feel free to try it yourself and let me know the results!

I hope to be able to tackle the problem of batch generation soon. @cccntu your approach looks very interesting. Before we add this feature, though, I think the generate function needs a bigger refactoring => see https://github.com/huggingface/transformers/pull/6949

@andreamad8 I tried it, and after some debugging it seems to work! The code above is updated.
@patrickvonplaten My approach does not involve generation_utils.py, so I guess I will submit a PR later this week.
Note: I only tested it with greedy search, verifying that the results are the same for batch sizes 1 and 2.

Hi @cccntu . Do you have a pull request you could share with the above working?

Also, out of curiosity, did you find that being able to do batch inference sped things up a lot for you? I'm wondering what batch size you managed to fit in memory.

Thanks!

Hi @daphnei. My code above works, but I think I need a minimal example to show how it works, so I haven't submitted a PR yet.

I am able to fit 16-64 lines (probably depending on their lengths) on an 11 GB 2080 Ti (model_name="gpt2"), with at least a 20x speed-up at batch size = 64. I did not measure the speed-up precisely, just a rough estimate from memory.

Hi, thank you so much for your solution for batch inference with the GPT-2 model @XinyuHua @patrickvonplaten.
After reading your code, I see that the main idea of the solution is to use the attention_mask to ignore the [PAD] tokens during generation.
Before finding the solutions in this issue, I used a similar approach for batch inference myself, but I am not sure about it. Can you help me check it?

The main idea of my solution is to pad at the front of the sequences instead of at the end, for example:

sentences = [
    'I have a dog',
    'My dog is very cute and good looking', 
    'A good boy'
]
tokens_after_padding = [
    [0, 0, 0, 0, 1045, 2031, 1037, 3899],
    [2026, 3899, 2003, 2200, 10140, 1998, 2204, 2559],
    [0, 0, 0, 0, 0, 1037, 2204, 2879],
]
attention_mask = [
    [0, 0, 0, 0, 1, 1, 1, 1], 
    [1, 1, 1, 1, 1, 1, 1, 1], 
    [0, 0, 0, 0, 0, 1, 1, 1]
]

After this processing, all the sentences have the same length, and batch inference works the same way as batch training. Besides, I think this way is easier than yours. In my testing this approach seems okay, but I am not sure about it. Can you give me some suggestions?
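One way to check it empirically (a sketch, not from the thread): compare the next-token logits of the left-padded batch against running each prompt on its own; with a correct attention mask and position_ids, the differences should be at floating-point noise level:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompts = ["I have a dog", "My dog is very cute and good looking", "A good boy"]
enc = tokenizer.batch_encode_plus(prompts, return_tensors="pt", pad_to_max_length=True)
attention_mask = enc["attention_mask"]
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 0)

with torch.no_grad():
    batched_logits = model(enc["input_ids"], attention_mask=attention_mask,
                           position_ids=position_ids)[0][:, -1, :]
    for i, prompt in enumerate(prompts):
        single_ids = tokenizer.encode(prompt, return_tensors="pt")
        single_logits = model(single_ids)[0][:, -1, :]
        # print the largest absolute difference per prompt; it should be tiny
        print(prompt, (batched_logits[i] - single_logits[0]).abs().max().item())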

Thanks @cccntu for the response! I will try it out soon.

It would be super if something like this became the default eventually.

Thanks for this answer! It was very helpful for batching with variable sequence lengths.

Reopening with the question:
Can GPT2LMHeadModel do batch inference with variable sentence lengths AND with usage of past_key_values?

i.e.:

batch item 1 inputs:
input_ids = [1, 2, 3]
past_key_values = None

batch item 2 inputs:
input_ids = [1]
past_key_values = (tensor of shape (2, batch_size = 1, num_heads, sequence_length, embed_size_per_head))

As you explained above (thanks), input ids for batching would combine to be:
[[1, 2, 3],
[1, 0, 0]]
and attention_mask:
[[1, 1, 1],
[1, 0, 0]]

Is it possible to combine past_key_values in the same way?
Or is batching only possible with same-sized past_key_values?

Two considerations to torch.cat() the individual past_key_values together:
1) representing the None past_key_values as a tensor
2) padding all past_key_values so that they all have the same sequence_length (dim=3) size

I searched deeply through the source code and couldn't find what None becomes represented as.

Thanks for the help!
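An untested sketch (not from the thread) of the tensor bookkeeping for the two considerations above: represent a missing past as a zero-length past, left-pad every past along the sequence dimension (dim=3), and stack along the batch dimension. Whether GPT-2 then generates correctly from such a combined past still needs to be verified:

import torch

num_layers, num_heads, head_dim = 12, 12, 64  # gpt2-small shapes, assumed

def left_pad_past(past, target_len):
    # past: tuple (one entry per layer) of tensors shaped (2, 1, num_heads, past_len, head_dim)
    padded = []
    for layer in past:
        pad_len = target_len - layer.shape[3]
        pad = torch.zeros(2, 1, num_heads, pad_len, head_dim, dtype=layer.dtype)
        padded.append(torch.cat([pad, layer], dim=3))  # pad along the sequence dimension
    return tuple(padded)

# consideration 1): represent "past_key_values = None" as a zero-length past of the right shape
empty_past = tuple(torch.zeros(2, 1, num_heads, 0, head_dim) for _ in range(num_layers))
# a real past of length 5 (random stand-in values here)
real_past = tuple(torch.randn(2, 1, num_heads, 5, head_dim) for _ in range(num_layers))

# consideration 2): pad both to the same sequence_length, then stack along the batch dimension
max_len = max(empty_past[0].shape[3], real_past[0].shape[3])
batched_past = tuple(
    torch.cat([a, b], dim=1)
    for a, b in zip(left_pad_past(empty_past, max_len), left_pad_past(real_past, max_len)))

# the attention mask passed to the model must then cover max_len + new tokens,
# with zeros over the positions that were padded into the past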

CC @patrickvonplaten: is there anything obvious I can do to make the above "batching with variable pasts" inference work?

If I get something functional I could add it in a PR for the prepare_inputs_for_generation() function

Hey @erik-dunteman could you add a code snippet that currently fails, describing your problem in more detail?
