Transformers: GPT2 Sentence Probability: Necessary to Prepend "<|endoftext|>"?

Created on 12 Aug 2019 · 13 comments · Source: huggingface/transformers

When computing sentence probability, do we need to prepend the sentence with a dummy start token (e.g. <|endoftext|>) to get the full sentence probability? I am currently using the following implementation (from https://github.com/huggingface/pytorch-transformers/issues/473):

import torch
# 2019-era API: the library was called pytorch-transformers at the time
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    # in current transformers the argument is labels= rather than lm_labels=
    loss = model(tensor_input, lm_labels=tensor_input)
    # loss[0] is the mean loss per token, so multiply by the length to undo the averaging
    return -loss[0] * len(tokenize_input)

a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])

With this implementation, say for the sentence "there is a book on the desk", is it taking into consideration all the words when computing the full sentence probability (i.e. it's computing P(there|<|endoftext|>) * P(is|there,<|endoftext|>) * ... * P(desk|the,...))? If not, what's the right way to prepend the dummy start token?

All 13 comments

I dug into this a little, and it looks like the answer is yes:

text = "the book is on the desk."
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
tokenize_input = tokenizer.tokenize(text)
#50256 is the token_id for <|endoftext|>
tensor_input = torch.tensor([ [50256]  +  tokenizer.convert_tokens_to_ids(tokenize_input)])
with torch.no_grad():
    outputs = model(tensor_input, labels=tensor_input)
    loss, logits = outputs[:2]
print("a=", loss*len(tokenize_input))

lp = 0.0
for i in range(len(tokenize_input)):
    masked_index = i
    predicted_score = logits[0, masked_index]
    predicted_prob = softmax(np.array(predicted_score))
    lp += np.log(predicted_prob[tokenizer.convert_tokens_to_ids([tokenize_input[i]])[0]])

print("b=", lp)

produces:
a= tensor(32.5258)
b= -32.52579879760742

Without prepending [50256]:
a= tensor(30.4421)
b= -59.90513229370117

(The mismatch in the second case is expected: without the prepended token, logits[0, i] is the model's prediction for the token at position i + 1, so the manual loop scores each token against the wrong distribution.)

@jhlau Hello, out of curiosity, why are you multiplying the loss by the length of tokenize_input?

The loss returned is the average loss (i.e. it is already divided by the length); since I am interested in getting the sentence probability, I need to revert that.
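In equation form (assuming the reported loss is the mean negative log-likelihood over the N scored tokens):

\log P(w_1, \dots, w_N) = \sum_{i=1}^{N} \log P(w_i \mid w_{<i}) = -N \cdot \text{loss}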

Instead of hard-coding 50256, it is better to use:

tokenizer.convert_tokens_to_ids(tokenizer.special_tokens_map['eos_token'])

You can also use tokenizer.eos_token_id (doc).
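For reference, both lookups resolve to the same id with the stock gpt2 tokenizer:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.convert_tokens_to_ids(tokenizer.special_tokens_map["eos_token"]))  # 50256
print(tokenizer.eos_token_id)                                                      # 50256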

I hope this question is simple to answer: how can I run the probability calculation entirely on the GPU? As soon as I switch to numpy in the for loop I have to move my data back to the CPU, right? I'd like to avoid that if possible.
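A minimal sketch of one way to keep the whole calculation in torch (and therefore on the GPU), assuming a CUDA device is available and using torch's log_softmax instead of numpy:

import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()

def sentence_log_prob(text):
    # prepend <|endoftext|> so the first real token is scored as well
    ids = tokenizer.encode(tokenizer.eos_token + text, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(ids)[0]                              # [1, seq_len, vocab]
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)       # predictions for positions 1..seq_len-1
    targets = ids[0, 1:]                                    # the tokens those positions should predict
    return log_probs[torch.arange(targets.size(0), device=device), targets].sum().item()

print(sentence_log_prob("the book is on the desk."))

The only transfer back to the CPU is the final .item() on the summed log probability.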

@jhlau your code does not seem to be correct to me. Refer to this or #2026 for a (hopefully) correct implementation.

You can also try lm-scorer, a tiny wrapper around transformers that I wrote, which lets you get sentence probabilities using models that support it (only GPT2 models are implemented at the time of writing).

I included this here because this issue is still the first result when searching GitHub/Google for how to use transformers models to get sentence probabilities, and I think it might be useful to many.

I see. So I should be using self.tokenizer.bos_token and self.tokenizer.eos_token to start and end a sentence properly (instead of the hardcoded 50256 <|endoftext|> token). I'll give it a run and see if I find much difference.

The loss returned is the average loss (i.e. it is already divided by the length); since I am interested in getting the sentence probability, I need to revert that.

I think this is incorrect. If you multiply by length, you will get higher probability for long sentences even if they make no sense. The average aims to normalize so that the probability is independent of the number of tokens. Does that make sense?

I understand that, of course. I need the full sentence probability because I intend to do other types of normalisation myself (e.g. based on unigram frequencies). I am not saying returning the average loss is wrong - I was just clarifying to another user why I multiplied the average loss by the length (because I need the full sentence probability).

AAAAh I see. Thanks

sent_probability = math.exp(-1.0 * loss * (num_of_word_piece - 1))

num_of_word_piece is the number of token ids the tokenizer encodes the text into (including the prepended <|endoftext|>).
When calculating the sentence probability, it is appropriate to prepend "<|endoftext|>" in front of the sentence text.
The tokenizer will turn "<|endoftext|>" into a single token id, namely tokenizer.eos_token_id.

The loss is calculated from the cross-entropy of shift_logits and shift_labels. By default, cross_entropy uses the mean reduction, and in this case it is the mean over num_of_word_piece - 1 word pieces.
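Putting that together, a minimal sketch of the scoring described above (assuming a transformers version where model(input_ids, labels=input_ids) returns the mean loss as its first output):

import math
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sent_probability(text):
    # prepend <|endoftext|> so that the first real token is scored as well
    input_ids = tokenizer.encode(tokenizer.eos_token + text, return_tensors="pt")
    num_of_word_piece = input_ids.size(1)
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids)[0]
    # loss is the mean cross-entropy over (num_of_word_piece - 1) predictions
    return math.exp(-1.0 * loss.item() * (num_of_word_piece - 1))

print(sent_probability("there is a book on the desk"))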

For anyone who's interested in batching the above process, here's the code:

# Assumes `tokenizer` and `gpt2_model` are the GPT-2 tokenizer and GPT2LMHeadModel
# loaded as above, that `lines` is the list of sentences to score, and that
# tokenizer.pad_token has been set (e.g. to tokenizer.eos_token), since GPT-2
# defines no pad token by default.
import torch
import torch.nn.functional as F

lines = [tokenizer.eos_token + line for line in lines]

tok_res = tokenizer.batch_encode_plus(lines, return_tensors='pt', pad_to_max_length=True)
input_ids = tok_res['input_ids']
attention_mask = tok_res['attention_mask']
lines_len = torch.sum(tok_res['attention_mask'], dim=1)

with torch.no_grad():
    outputs = gpt2_model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    loss, logits = outputs[:2]

for line_ind in range(len(lines)):
    line_log_prob = 0.0
    for token_ind in range(lines_len[line_ind] - 1):
        # logits at position token_ind predict the token at position token_ind + 1
        token_prob = F.softmax(logits[line_ind, token_ind], dim=0)
        token_id = input_ids[line_ind, token_ind + 1]
        line_log_prob += torch.log(token_prob[token_id])
    print(f'line_log_prob: {line_log_prob}')

One caveat: the token_type_ids returned by tokenizer.batch_encode_plus should not be passed to gpt2_model, otherwise the results will not match the line-by-line inference.
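A vectorized sketch of the same batched scoring (an illustrative alternative, not the exact code from the comment above; it assumes a transformers version with the callable tokenizer API): log_softmax plus gather replaces the per-token Python loop.

import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_model.eval()

lines = ["there is a book on the desk", "there is a plane on the desk"]
lines = [tokenizer.eos_token + line for line in lines]

enc = tokenizer(lines, return_tensors="pt", padding=True)
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

with torch.no_grad():
    logits = gpt2_model(input_ids=input_ids, attention_mask=attention_mask)[0]

log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)          # predictions for positions 1..L-1
targets = input_ids[:, 1:].unsqueeze(-1)                      # the tokens those positions should predict
token_log_probs = log_probs.gather(-1, targets).squeeze(-1)   # log P(token | prefix)
mask = attention_mask[:, 1:].float()                          # ignore padded targets
print((token_log_probs * mask).sum(dim=1))                    # one log probability per line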
