Transformers: Use fill-mask pipeline to get probability of specific token

Created on 27 May 2020 · 5 comments · Source: huggingface/transformers

Hi,
I am trying to use the fill-mask pipeline:

```
from transformers import pipeline

nlp_fm = pipeline('fill-mask')
nlp_fm('Hugging Face is a French company based in <mask>')
```

And get the output:

```
[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',
  'score': 0.23106734454631805,
  'token': 2201},
 {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',
  'score': 0.08198195695877075,
  'token': 12790},
 {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',
  'score': 0.04769458621740341,
  'token': 11559},
 {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',
  'score': 0.04762236401438713,
  'token': 6497},
 {'sequence': '<s> Hugging Face is a French company based in France</s>',
  'score': 0.041305914521217346,
  'token': 1470}]
```

But let's say I want to get the score & rank of another word, such as London - is this possible?

wontfix


All 5 comments

Hi, the pipeline doesn't offer such functionality yet. You're better off using the model directly. Here's an example of how you would replicate the pipeline's behavior and get a token score at the end:

```
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")

sequence = f"Hugging Face is a French company based in {tokenizer.mask_token}"

# Encode the sequence and locate the position of the mask token
input_ids = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Forward pass, keep only the logits at the masked position,
# and turn them into probabilities over the vocabulary
token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
mask_token_logits = torch.softmax(mask_token_logits, dim=1)

# Top-5 candidates, as the pipeline would return them
top_5 = torch.topk(mask_token_logits, 5, dim=1)
top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())

for token, score in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])), f"(score: {score})")

# Get the score of a specific token ("London")
sought_after_token = "London"
sought_after_token_id = tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[0]  # 928

token_score = mask_token_logits[:, sought_after_token_id]
print(f"Score of {sought_after_token}: {mask_token_logits[:, sought_after_token_id]}")
```

Outputs:

```
Hugging Face is a French company based in  Paris (score: 0.2310674488544464)
Hugging Face is a French company based in  Lyon (score: 0.08198253810405731)
Hugging Face is a French company based in  Geneva (score: 0.04769456014037132)
Hugging Face is a French company based in  Brussels (score: 0.047622524201869965)
Hugging Face is a French company based in  France (score: 0.04130581393837929)
Score of London: tensor([0.0343], grad_fn=<SelectBackward>)
```

Let me know if it helps.
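Since the original question also asked for the rank of a word like London, here is a small follow-up sketch (not part of the reply above, but reusing its `mask_token_logits` and `sought_after_token_id` variables) that counts how many vocabulary tokens receive a higher probability at the masked position:

```
# Rank of the sought-after token at the masked position (1 = most probable).
# Reuses mask_token_logits and sought_after_token_id from the snippet above.
rank = int((mask_token_logits[0] > mask_token_logits[0, sought_after_token_id]).sum().item()) + 1
print(f"Rank of {sought_after_token}: {rank}")
```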

@lavanyashukla Great thanks!
And if I want the predictability of a whole sentence, is the best way just to average all the word scores?

Yes, that's one way to do it.
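For completeness, a minimal sketch of that averaging approach, assuming the same distilroberta-base checkpoint as above: mask each token in turn, take the probability the model assigns to the original token at that position, and average the results (averaging log-probabilities, i.e. a pseudo-log-likelihood, is a common variant):

```
import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")
model.eval()

def sentence_score(sentence):
    # Mask each token in turn and average the probability the model
    # assigns to the original token at that position.
    input_ids = tokenizer.encode(sentence, return_tensors="pt")[0]
    scores = []
    for i in range(1, len(input_ids) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0))[0]
        probs = torch.softmax(logits[0, i], dim=-1)
        scores.append(probs[input_ids[i]].item())
    return sum(scores) / len(scores)

print(sentence_score("Hugging Face is a French company based in Paris"))
```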

@LysandreJik I get an error:

"NLP_engine.py", line 120, in _word_in_sentence_prob
    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
TypeError: where(): argument 'condition' (position 1) must be Tensor, not bool

For the code:
```
def _word_in_sentence_prob(self, sentence, word):

    sequence = f"{sentence} {bert_tokenizer.mask_token}"

    input_ids = bert_tokenizer.encode(sequence, bert_tokenizer="pt")
    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]

    token_logits = bert_model(input_ids)[0]
    mask_token_logits = token_logits[0, mask_token_index, :]
    mask_token_logits = torch.softmax(mask_token_logits, dim=1)

    top_5 = torch.topk(mask_token_logits, 5, dim=1)
    top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())

    for token, score in top_5_tokens:
        print(sequence.replace(bert_tokenizer.mask_token, bert_tokenizer.decode([token])), f"(score: {score})")

    # Get the score of token_id
    sought_after_token = word
    sought_after_token_id = bert_tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[
        0]  # 928

    token_score = mask_token_logits[:, sought_after_token_id]
    print(f"Score of {sought_after_token}: {mask_token_logits[:, sought_after_token_id]}")
    return token_score

```

Any idea why?
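The traceback is most likely caused by the `encode` call above: it passes `bert_tokenizer="pt"` instead of `return_tensors="pt"`, so `input_ids` comes back as a plain Python list rather than a tensor. The comparison `input_ids == bert_tokenizer.mask_token_id` then evaluates to a single bool, which `torch.where()` rejects. A sketch of the corrected lines:

```
# Ask the tokenizer for a PyTorch tensor so the element-wise comparison works
input_ids = bert_tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
```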

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
