Hi,
I am trying to use the fill-mask pipeline:
```
from transformers import pipeline
nlp_fm = pipeline('fill-mask')
nlp_fm('Hugging Face is a French company based in <mask>')
```
And get the output:
```
[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',
  'score': 0.23106734454631805,
  'token': 2201},
 {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',
  'score': 0.08198195695877075,
  'token': 12790},
 {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',
  'score': 0.04769458621740341,
  'token': 11559},
 {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',
  'score': 0.04762236401438713,
  'token': 6497},
 {'sequence': '<s> Hugging Face is a French company based in France</s>',
  'score': 0.041305914521217346,
  'token': 1470}]
```
But suppose I want the score and rank of a different word, such as London. Is this possible?
Hi, the pipeline doesn't offer this functionality yet. You're better off using the model directly. Here's an example of how you would replicate the pipeline's behavior and get a token score at the end:
```
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")

sequence = f"Hugging Face is a French company based in {tokenizer.mask_token}"
input_ids = tokenizer.encode(sequence, return_tensors="pt")

# Locate the mask position and turn its logits into probabilities
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
mask_token_logits = torch.softmax(mask_token_logits, dim=1)

# Print the five most likely fillers, as the pipeline does
top_5 = torch.topk(mask_token_logits, 5, dim=1)
top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())
for token, score in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])), f"(score: {score})")

# Get the score of token_id
sought_after_token = "London"
sought_after_token_id = tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[0]  # 928
token_score = mask_token_logits[:, sought_after_token_id]
print(f"Score of {sought_after_token}: {token_score}")
```
Outputs:
```
Hugging Face is a French company based in Paris (score: 0.2310674488544464)
Hugging Face is a French company based in Lyon (score: 0.08198253810405731)
Hugging Face is a French company based in Geneva (score: 0.04769456014037132)
Hugging Face is a French company based in Brussels (score: 0.047622524201869965)
Hugging Face is a French company based in France (score: 0.04130581393837929)
Score of London: tensor([0.0343], grad_fn=<SelectBackward>)
```
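The example prints the score but not the rank that the question also asked about. Here is a minimal sketch of one way to derive it, assuming the rank is defined as one plus the number of vocabulary entries with a strictly higher probability. The helper name `rank_of` and the toy five-entry distribution are hypothetical and only stand in for the real row of `mask_token_logits`:

```python
def rank_of(probs, token_id):
    """Return the 1-based rank of token_id within probs (1 = most likely)."""
    target = probs[token_id]
    # Count the entries that beat the target; tied entries share the better rank.
    return 1 + sum(1 for p in probs if p > target)


# Toy distribution over a hypothetical 5-token vocabulary (not model output).
toy_probs = [0.10, 0.40, 0.05, 0.25, 0.20]
print(rank_of(toy_probs, 3))  # → 2, only token 1 has a higher probability
```

With the tensors from the example, the same count should be expressible as `int((mask_token_logits[0] > mask_token_logits[0, sought_after_token_id]).sum()) + 1`.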
Let me know if it helps.
@lavanyashukla Great, thanks!
And if I want the predictability of a whole sentence, would the best way be to just average the scores of all its words?
Yes, that's one way to do it.
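A minimal sketch of that averaging, assuming you have already collected one probability per word by masking each position in turn. The numbers below reuse two scores from this thread plus one made-up value, purely for illustration, and `sentence_scores` is a hypothetical helper name. It also computes the mean log-probability, a common alternative that penalizes a single very unlikely word more strongly than the plain average does:

```python
import math


def sentence_scores(word_probs):
    """Return (arithmetic mean, mean log-probability) of per-word scores."""
    if not word_probs:
        raise ValueError("word_probs must not be empty")
    mean_prob = sum(word_probs) / len(word_probs)
    # Log-probabilities add up across words, so their mean is length-normalized.
    mean_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return mean_prob, mean_log_prob


# Illustrative per-word probabilities, not real model output for one sentence.
probs = [0.2311, 0.0343, 0.5]
mean_prob, mean_log_prob = sentence_scores(probs)
print(f"mean probability: {mean_prob:.4f}")
print(f"mean log-probability: {mean_log_prob:.4f}")
```

Either number is only comparable between sentences scored with the same model and tokenizer.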
@LysandreJik I get an error:
```
"NLP_engine.py", line 120, in _word_in_sentence_prob
    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
TypeError: where(): argument 'condition' (position 1) must be Tensor, not bool
```
For the code:
```
def _word_in_sentence_prob(self, sentence, word):
    sequence = f"{sentence} {bert_tokenizer.mask_token}"
    input_ids = bert_tokenizer.encode(sequence, bert_tokenizer="pt")
    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
    token_logits = bert_model(input_ids)[0]
    mask_token_logits = token_logits[0, mask_token_index, :]
    mask_token_logits = torch.softmax(mask_token_logits, dim=1)
    top_5 = torch.topk(mask_token_logits, 5, dim=1)
    top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())
    for token, score in top_5_tokens:
        print(sequence.replace(bert_tokenizer.mask_token, bert_tokenizer.decode([token])), f"(score: {score})")
    # Get the score of token_id
    sought_after_token = word
    sought_after_token_id = bert_tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[0]  # 928
    token_score = mask_token_logits[:, sought_after_token_id]
    print(f"Score of {sought_after_token}: {mask_token_logits[:, sought_after_token_id]}")
    return token_score
```
Any idea why?
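One likely cause, judging from the traceback: `encode` is called with `bert_tokenizer="pt"` rather than `return_tensors="pt"` as in the earlier example, so `input_ids` is probably a plain Python list of ids instead of a tensor. Comparing a list with an integer yields a single `bool`, which would match the `must be Tensor, not bool` message. A small sketch of that behavior without any model (the ids are made-up placeholders):

```python
# Made-up token ids standing in for what encode() returns without return_tensors.
input_ids = [0, 101, 102, 50264, 2]
mask_token_id = 50264  # placeholder value, used only for this illustration

# list == int compares the whole list to the number, giving one bool.
condition = input_ids == mask_token_id
print(type(condition).__name__, condition)  # → bool False
```

Passing `return_tensors="pt"` should make `input_ids` a tensor, so the comparison becomes element-wise and `torch.where` accepts it.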