Can we use this project to calculate the probability that an input text is a real/reasonable sentence, based on the corpus we trained on?
@frankniujc That is helpful, but a better way might be to score all the tokens of the sentence together, rather than only predicting the next token.
The probability of a sentence factorises by the chain rule: P(s0 s1 s2 ... sn) = P(s1 | s0) * P(s2 | s0 s1) * P(s3 | s0 s1 s2) * ... * P(sn | s0 s1 ... sn-1). In practice you sum log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow, which is what the code below does.
So you can do something like this:

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').to('cuda')

def sentence_probability(sent):
    # Prepend the BOS token so the first real token is also scored.
    bos = tokenizer.encode('<|endoftext|>')
    tokens = bos + tokenizer.encode(sent)
    input_ids = torch.tensor(tokens).unsqueeze(0).to('cuda')
    sent_probs = []
    for i, next_word in enumerate(tokens[1:]):
        # Score the prefix up to position i; take the logits at its last position.
        next_word_logits = model(input_ids[:, :i + 1])[0][0, -1].detach()
        # Log-probability the model assigns to the actual next token.
        next_word_prob = F.log_softmax(next_word_logits, dim=0)[next_word].item()
        sent_probs.append(next_word_prob)
    # Sum of per-token log-probabilities = log P(sentence).
    return sum(sent_probs)
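As a quick sanity check (assuming the GPT-2 model and tokenizer loaded above), a fluent sentence should get a higher, i.e. less negative, total log-probability than a scrambled one; dividing by the token count gives a length-normalised score if you want to compare sentences of different lengths:

fluent = sentence_probability('The cat sat on the mat.')
scrambled = sentence_probability('Mat the on cat sat the.')
print(fluent, scrambled)  # fluent is expected to be less negative than scrambled

# Length-normalised (average per-token) log-probability for fairer comparisons.
print(fluent / len(tokenizer.encode('The cat sat on the mat.')))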
@loveJasmine Have a look at lm-scorer.
It is a tiny wrapper around transformers I wrote that allows you to get sentence probabilities using models that support it (only GPT2 models are implemented at the time of writing).
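For reference, basic usage is roughly the sketch below; the class and method names (AutoLMScorer, sentence_score, the reduce argument) are recalled from the lm-scorer README and may differ in newer versions:

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=1)

# Sentence score as the product of per-token probabilities (use log=True for log-probabilities).
print(scorer.sentence_score("I like this package.", reduce="prod"))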