Based on my understanding, XLNet can compute sentence probability/perplexity. Is there an example that illustrates how to do this? I saw one for GPT-2 (https://github.com/huggingface/pytorch-transformers/issues/473), but I don't think it will work exactly the same way...
Hi, I want to ask that question too.
Below is my implementation
import torch
from pytorch_transformers import XLNetTokenizer, XLNetLMHeadModel

def xlnet_score(text, model, tokenizer):
    # Tokenized input
    tokenized_text = tokenizer.tokenize(text)
    length = len(tokenized_text)
    sentence_prob = 0.0
    for masked_index in range(len(tokenized_text)):
        # Mask one token and try to predict it back from the rest of the sentence
        masked_word = tokenized_text[masked_index]
        if masked_word == "<sep>":
            continue
        tokenized_text[masked_index] = '<mask>'
        input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokenized_text)).unsqueeze(0)
        index = torch.tensor(tokenizer.convert_tokens_to_ids(masked_word))
        perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
        perm_mask[:, :, masked_index] = 1.0  # no token may attend to the masked position
        target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # shape [1, 1, seq_length] => predict one token
        target_mapping[0, 0, masked_index] = 1.0  # the single prediction is the masked token
        input_ids = input_ids.to('cuda')
        perm_mask = perm_mask.to('cuda')
        target_mapping = target_mapping.to('cuda')
        index = index.to('cuda')
        with torch.no_grad():
            outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=index)
        loss = outputs[0]  # with labels given, outputs[0] is the cross-entropy loss (negative log-likelihood) of the masked token
        sentence_prob += loss.item()
        tokenized_text[masked_index] = masked_word
    return sentence_prob / length  # note: this is an average loss per token, not a probability

a = ['there is a book on the desk',
     'there is a rocket on the desk',
     'he put an elephant into the fridge',
     'he put an apple into the fridge']
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
model.to('cuda')
model.eval()
print([xlnet_score(i, model, tokenizer) for i in a])
The results, however, do not seem to make much sense to me.
So I also want to ask whether there is a better way to implement this.
This is how I did it in the end. The important thing is that you need to pad it with a long context beforehand (discussed here), and you need to iterate through the sentence one word at a time to collect the conditional word probabilities.
import torch
from pytorch_transformers import XLNetTokenizer, XLNetLMHeadModel
import numpy as np
from scipy.special import softmax
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> """
text = "The dog is very cute."
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
tokenize_input = tokenizer.tokenize(PADDING_TEXT + text)
tokenize_text = tokenizer.tokenize(text)
sum_lp = 0.0
for max_word_id in range((len(tokenize_input) - len(tokenize_text)), len(tokenize_input)):
    sent = tokenize_input[:]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(sent)])
    perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
    perm_mask[:, :, max_word_id:] = 1.0
    target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)
    target_mapping[0, 0, max_word_id] = 1.0
    with torch.no_grad():
        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
    next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
    word_id = tokenizer.convert_tokens_to_ids([tokenize_input[max_word_id]])[0]
    predicted_prob = softmax(np.array(next_token_logits[0][-1]))
    lp = np.log(predicted_prob[word_id])
    sum_lp += lp
print("sentence logprob =", sum_lp)
@jhlau Hi, thanks for sharing your solution. Just wondering if the padded text beforehand is very important for evaluating the sentence scores? What if you use a different text?
Yes, it is very important. Without the padded text, the sentence probability is pretty much useless. Pretty sure you can use any text, as long as you include the <eod> tag.
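For instance, a different context would just be swapped in like this (the alternative text below is made up and untested; only the trailing <eod> tag matters):

# Hypothetical alternative context; any long-ish dummy text ending in <eod> should do.
ALT_PADDING_TEXT = """It was a bright cold day in April, and the clocks were
striking thirteen. Winston Smith slipped quickly through the glass doors of
Victory Mansions. <eod> """
tokenize_input = tokenizer.tokenize(ALT_PADDING_TEXT + text)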
Hey @jhlau , thank you for sharing this with us!
I have been trying to speed up the function by using mems, i.e. caching of the hidden states. The only changes I made are these:
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased', mem_len=1024)
, and
with torch.no_grad():
    outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, mems=mems)
mems = outputs[1]  # the mems passed in is None during the first iteration of the for loop
next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
predicted_prob = torch.softmax(next_token_logits[0][-1], dim=-1)
However, the probabilities for the tokens appear different between the cached and the non-cached version. Do you know if this is actually correct and what could be wrong? Does it actually make sense to cache the intermediate states?
Thanks!
I don't think you can cache it, since the hidden states are different for every step (which has a different masked word).
hi @jhlau , wondering if you have a batch-processing version of your script such that people can use as an off-the-shelf tool for evaluating a (big) list of sentences? Thanks very much!
Unfortunately not. Haven't had the time to look into processing sentences in batch.
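One possible direction, sketched here as a rough, untested idea (the function name xlnet_sentence_logprob_batched is made up), is to batch over the target positions of a single sentence rather than over sentences, so that all conditional log probabilities come out of one forward pass instead of a Python loop around the model call:

import torch
import torch.nn.functional as F

def xlnet_sentence_logprob_batched(text, model, tokenizer, padding_text):
    # One (perm_mask, target_mapping) pair per target position, stacked along the batch dim.
    tokens = tokenizer.tokenize(padding_text + text)
    n_pad = len(tokens) - len(tokenizer.tokenize(text))
    ids = tokenizer.convert_tokens_to_ids(tokens)
    seq_len = len(ids)
    targets = list(range(n_pad, seq_len))        # positions of the sentence tokens
    bsz = len(targets)

    input_ids = torch.tensor([ids] * bsz)        # same sequence in every row
    perm_mask = torch.zeros(bsz, seq_len, seq_len)
    target_mapping = torch.zeros(bsz, 1, seq_len)
    for row, pos in enumerate(targets):
        perm_mask[row, :, pos:] = 1.0            # row `row` may not attend to position `pos` onwards
        target_mapping[row, 0, pos] = 1.0        # and predicts exactly position `pos`

    with torch.no_grad():
        outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
    log_probs = F.log_softmax(outputs[0][:, 0, :], dim=-1)   # [bsz, vocab_size]
    gold = torch.tensor([ids[pos] for pos in targets])
    return log_probs[torch.arange(bsz), gold].sum().item()

print(xlnet_sentence_logprob_batched(text, model, tokenizer, PADDING_TEXT))

This keeps the same left-to-right conditioning as the loop above; batching over many sentences would additionally require padding sequences to a common length.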
@jhlau I followed the link you mentioned, but it doesn't talk about the long padding text. Could you please explain why it is needed, or where you found it?
Hmm, I should have cited the GitHub link. Anyway, it's explained in the README of his implementation: https://github.com/rusiaaman/XLNet-gen#methodology
(and you can see it in the code, in the dummy text he uses)
@jhlau Do you think this same reasoning could be applied to extract sentence probabilities from BERT?
@ruanchaves: you can, and I tried it with BERT (left context only for prediction), but the results aren't as good as XLNet's (no surprise, I suppose, since BERT is used to seeing both left and right context during training).
I just found a paper where they use BERT for sentence probabilities (https://arxiv.org/abs/1905.06655). It states that one must train BERT on the masked LM task (without NSP) before reasonable results can be achieved.
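For anyone curious, a minimal sketch of that masked-LM scoring idea (mask each position in turn and sum the log probabilities of the original tokens under bidirectional context) might look like this; it uses the off-the-shelf BertForMaskedLM rather than the MLM-only retraining the paper calls for, and the function name is made up here:

import torch
import torch.nn.functional as F
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForMaskedLM.from_pretrained('bert-base-cased')
model.eval()

def bert_sentence_logprob(text):
    # Mask one position at a time and sum log P(original token | bidirectional context).
    tokens = ['[CLS]'] + tokenizer.tokenize(text) + ['[SEP]']
    total = 0.0
    for i in range(1, len(tokens) - 1):          # skip [CLS] and [SEP]
        masked = tokens[:]
        masked[i] = '[MASK]'
        input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(masked)])
        with torch.no_grad():
            logits = model(input_ids)[0]         # [1, seq_len, vocab_size]
        log_probs = F.log_softmax(logits[0, i], dim=-1)
        total += log_probs[tokenizer.convert_tokens_to_ids([tokens[i]])[0]].item()
    return total

print(bert_sentence_logprob("The dog is very cute."))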
Looks like they found that scoring sentences based on bidirectional context is better than unidirectional context for speech recognition, and that's a result similar to what we found for scoring sentences for naturalness/fluency: https://arxiv.org/pdf/2004.00881.pdf
(in summary we found that sentence probability (not true probability) computed with bidirectional context with simple normalisation (PenLP in table 2) correlates strongly with human perception of sentence naturalness/fluency)
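If it helps, my (possibly imperfect) recollection is that PenLP divides the sentence log probability by the length penalty of Wu et al. (2016); a hedged sketch, with the constants best double-checked against the paper:

# Hedged sketch of a PenLP-style normalisation; alpha and the "+5" constant follow my
# recollection of the Wu et al. (2016) length penalty and should be verified against the paper.
def pen_lp(sum_lp, num_tokens, alpha=0.8):
    penalty = ((5 + num_tokens) ** alpha) / ((5 + 1) ** alpha)
    return sum_lp / penalty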