Based on my understanding, XLNet can compute sentence probability/perplexity. Is there an example that illustrates how to do this? I saw one for GPT-2 (https://github.com/huggingface/pytorch-transformers/issues/473), but I don't think it will work exactly the same way...
Hi, I want to ask that question too.
Below is my implementation
import torch
from pytorch_transformers import XLNetTokenizer, XLNetLMHeadModel

def xlnet_score(text, model, tokenizer):
    # Tokenized input
    tokenized_text = tokenizer.tokenize(text)
    length = len(tokenized_text)
    sentence_prob = 0.0
    for masked_index in range(len(tokenized_text)):
        # Mask one token and try to predict it back from the rest of the sentence
        masked_word = tokenized_text[masked_index]
        if masked_word == "<sep>":
            continue
        tokenized_text[masked_index] = '<mask>'
        input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokenized_text)).unsqueeze(0)
        index = torch.tensor(tokenizer.convert_tokens_to_ids(masked_word))
        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
        perm_mask[:, :, masked_index] = 1.0  # no token may attend to the masked position
        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # shape [1, 1, seq_length] => predict one token
        target_mapping[0, 0, masked_index] = 1.0  # the single prediction is the masked token
        input_ids = input_ids.to('cuda')
        perm_mask = perm_mask.to('cuda')
        target_mapping = target_mapping.to('cuda')
        index = index.to('cuda')
        with torch.no_grad():
            outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=index)
        loss = outputs[0]  # with labels given, outputs[0] is the cross-entropy loss (negative log-likelihood) of the masked token
        sentence_prob += loss.item()
        tokenized_text[masked_index] = masked_word
    return sentence_prob / length  # note: this is an average loss per token, not a probability

a = ['there is a book on the desk',
     'there is a rocket on the desk',
     'he put an elephant into the fridge',
     'he put an apple into the fridge']
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
model.to('cuda')
model.eval()
print([xlnet_score(i, model, tokenizer) for i in a])
The results, however, do not seem to make much sense to me.
So I also want to ask whether there is a better way to implement this.
This is how I did it in the end. The important thing is that you need to pad it with a long context beforehand (discussed here), and you need to iterate through the sentence one word at a time to collect the conditional word probabilities.
import torch
from pytorch_transformers import XLNetTokenizer, XLNetLMHeadModel
import numpy as np
from scipy.special import softmax
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> """
text = "The dog is very cute."
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
tokenize_input = tokenizer.tokenize(PADDING_TEXT + text)
tokenize_text = tokenizer.tokenize(text)
sum_lp = 0.0
for max_word_id in range((len(tokenize_input) - len(tokenize_text)), len(tokenize_input)):
    sent = tokenize_input[:]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(sent)])
    perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
    perm_mask[:, :, max_word_id:] = 1.0
    target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)
    target_mapping[0, 0, max_word_id] = 1.0
    with torch.no_grad():
        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
    next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
    word_id = tokenizer.convert_tokens_to_ids([tokenize_input[max_word_id]])[0]
    predicted_prob = softmax(np.array(next_token_logits[0][-1]))
    lp = np.log(predicted_prob[word_id])
    sum_lp += lp
print("sentence logprob =", sum_lp)
@jhlau Hi, thanks for sharing your solution. Just wondering if the padded text beforehand is very important for evaluating the sentence scores? What if you use a different text?
Yes, it is very important. Without the padded text, the sentence probability is pretty much useless. Pretty sure you can use any text, as long as you include the <eod> tag.
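For instance, a different context would just be swapped in like this (the alternative text below is made up and untested; only the trailing <eod> tag matters):

# Hypothetical alternative context; any long-ish dummy text ending in <eod> should do.
ALT_PADDING_TEXT = """It was a bright cold day in April, and the clocks were
striking thirteen. Winston Smith slipped quickly through the glass doors of
Victory Mansions. <eod> """
tokenize_input = tokenizer.tokenize(ALT_PADDING_TEXT + text)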
Hey @jhlau , thank you for sharing this with us!
I have been trying to speed up the function by using mems, i.e. caching of the hidden states. The only changes I made are these:
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased', mem_len=1024)
, and
with torch.no_grad():
    outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, mems=mems)
mems = outputs[1]  # the mems passed in is None during the first iteration of the for loop
next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
predicted_prob = torch.softmax(next_token_logits[0][-1], dim=-1)
However, the probabilities for the tokens appear different between the cached and the non-cached version. Do you know if this is actually correct and what could be wrong? Does it actually make sense to cache the intermediate states?
Thanks!
I don't think you can cache it, since the hidden states are different for every step (which has a different masked word).
hi @jhlau , wondering if you have a batch-processing version of your script such that people can use as an off-the-shelf tool for evaluating a (big) list of sentences? Thanks very much!
Unfortunately not. Haven't had the time to look into processing sentences in batch.
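One possible direction, sketched here as a rough, untested idea (the function name xlnet_sentence_logprob_batched is made up), is to batch over the target positions of a single sentence rather than over sentences, so that all conditional log probabilities come out of one forward pass instead of a Python loop around the model call:

import torch
import torch.nn.functional as F

def xlnet_sentence_logprob_batched(text, model, tokenizer, padding_text):
    # One (perm_mask, target_mapping) pair per target position, stacked along the batch dim.
    tokens = tokenizer.tokenize(padding_text + text)
    n_pad = len(tokens) - len(tokenizer.tokenize(text))
    ids = tokenizer.convert_tokens_to_ids(tokens)
    seq_len = len(ids)
    targets = list(range(n_pad, seq_len))        # positions of the sentence tokens
    bsz = len(targets)

    input_ids = torch.tensor([ids] * bsz)        # same sequence in every row
    perm_mask = torch.zeros(bsz, seq_len, seq_len)
    target_mapping = torch.zeros(bsz, 1, seq_len)
    for row, pos in enumerate(targets):
        perm_mask[row, :, pos:] = 1.0            # row `row` may not attend to position `pos` onwards
        target_mapping[row, 0, pos] = 1.0        # and predicts exactly position `pos`

    with torch.no_grad():
        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
    log_probs = F.log_softmax(outputs[0][:, 0, :], dim=-1)   # [bsz, vocab_size]
    gold = torch.tensor([ids[pos] for pos in targets])
    return log_probs[torch.arange(bsz), gold].sum().item()

print(xlnet_sentence_logprob_batched(text, model, tokenizer, PADDING_TEXT))

This keeps the same left-to-right conditioning as the loop above; batching over many sentences would additionally require padding sequences to a common length.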
@jhlau I followed the link you mentioned, but it doesn't talk about the long padding text. Could you please explain why it is needed, or where you found it?
Hmm, I should have cited the GitHub link. Anyway, it's explained in the README of his implementation: https://github.com/rusiaaman/XLNet-gen#methodology
(and you can see it in the code, in the dummy text he uses)
@jhlau Do you think this same reasoning could be applied to extract sentence probabilities from BERT?
@ruanchaves: you can, and I tried it with BERT (left context only for prediction), but the results aren't as good as XLNet's (no surprise, I suppose, since BERT is used to seeing both left and right context during training).
I just found a paper where they use BERT for sentence probabilities (https://arxiv.org/abs/1905.06655). It states that one must train BERT on the masked LM task (without NSP) before reasonable results can be achieved.
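For anyone curious, a minimal sketch of that masked-LM scoring idea (mask each position in turn and sum the log probabilities of the original tokens under bidirectional context) might look like this; it uses the off-the-shelf BertForMaskedLM rather than the MLM-only retraining the paper calls for, and the function name is made up here:

import torch
import torch.nn.functional as F
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForMaskedLM.from_pretrained('bert-base-cased')
model.eval()

def bert_sentence_logprob(text):
    # Mask one position at a time and sum log P(original token | bidirectional context).
    tokens = ['[CLS]'] + tokenizer.tokenize(text) + ['[SEP]']
    total = 0.0
    for i in range(1, len(tokens) - 1):          # skip [CLS] and [SEP]
        masked = tokens[:]
        masked[i] = '[MASK]'
        input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(masked)])
        with torch.no_grad():
            logits = model(input_ids)[0]         # [1, seq_len, vocab_size]
        log_probs = F.log_softmax(logits[0, i], dim=-1)
        total += log_probs[tokenizer.convert_tokens_to_ids([tokens[i]])[0]].item()
    return total

print(bert_sentence_logprob("The dog is very cute."))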
Looks like they found that scoring sentences based on bidirectional context is better than unidirectional context for speech recognition, and that's a result similar to what we found for scoring sentences for naturalness/fluency: https://arxiv.org/pdf/2004.00881.pdf
(in summary we found that sentence probability (not true probability) computed with bidirectional context with simple normalisation (PenLP in table 2) correlates strongly with human perception of sentence naturalness/fluency)
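If it helps, my (possibly imperfect) recollection is that PenLP divides the sentence log probability by the length penalty of Wu et al. (2016); a hedged sketch, with the constants best double-checked against the paper:

# Hedged sketch of a PenLP-style normalisation; alpha and the "+5" constant follow my
# recollection of the Wu et al. (2016) length penalty and should be verified against the paper.
def pen_lp(sum_lp, num_tokens, alpha=0.8):
    penalty = ((5 + num_tokens) ** alpha) / ((5 + 1) ** alpha)
    return sum_lp / penalty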