Transformers: Use fill-mask pipeline to get probability of specific token

Created on 27 May 2020 · 5 comments · Source: huggingface/transformers

Hi,
I am trying to use the fill-mask pipeline:

```
from transformers import pipeline

nlp_fm = pipeline('fill-mask')
nlp_fm('Hugging Face is a French company based in <mask>')
```

And get the output:

```
[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',
  'score': 0.23106734454631805,
  'token': 2201},
 {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',
  'score': 0.08198195695877075,
  'token': 12790},
 {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',
  'score': 0.04769458621740341,
  'token': 11559},
 {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',
  'score': 0.04762236401438713,
  'token': 6497},
 {'sequence': '<s> Hugging Face is a French company based in France</s>',
  'score': 0.041305914521217346,
  'token': 1470}]
```

But let's say I want to get the score & rank of another word, such as London - is this possible?

wontfix


All 5 comments

Hi, the pipeline doesn't offer such functionality yet. You're better off using the model directly. Here's an example of how you would replicate the pipeline's behavior and get a token score at the end:

```
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")

sequence = f"Hugging Face is a French company based in {tokenizer.mask_token}"

# Encode the sequence and locate the position of the mask token
input_ids = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Forward pass, keep only the logits at the masked position,
# and turn them into probabilities over the vocabulary
token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
mask_token_logits = torch.softmax(mask_token_logits, dim=1)

# Top-5 candidates, as the pipeline would return them
top_5 = torch.topk(mask_token_logits, 5, dim=1)
top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())

for token, score in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])), f"(score: {score})")

# Get the score of a specific token ("London")
sought_after_token = "London"
sought_after_token_id = tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[0]  # 928

token_score = mask_token_logits[:, sought_after_token_id]
print(f"Score of {sought_after_token}: {mask_token_logits[:, sought_after_token_id]}")
```

Outputs:

```
Hugging Face is a French company based in  Paris (score: 0.2310674488544464)
Hugging Face is a French company based in  Lyon (score: 0.08198253810405731)
Hugging Face is a French company based in  Geneva (score: 0.04769456014037132)
Hugging Face is a French company based in  Brussels (score: 0.047622524201869965)
Hugging Face is a French company based in  France (score: 0.04130581393837929)
Score of London: tensor([0.0343], grad_fn=<SelectBackward>)
```

Let me know if it helps.
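Since the original question also asked for the rank of a word like London, here is a small follow-up sketch (not part of the reply above, but reusing its `mask_token_logits` and `sought_after_token_id` variables) that counts how many vocabulary tokens receive a higher probability at the masked position:

```
# Rank of the sought-after token at the masked position (1 = most probable).
# Reuses mask_token_logits and sought_after_token_id from the snippet above.
rank = int((mask_token_logits[0] > mask_token_logits[0, sought_after_token_id]).sum().item()) + 1
print(f"Rank of {sought_after_token}: {rank}")
```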

@lavanyashukla Great thanks!
And if I want the predictability of a whole sentence, is the best way just to average all the word scores?

Yes, that's one way to do it.
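For completeness, a minimal sketch of that averaging approach, assuming the same distilroberta-base checkpoint as above: mask each token in turn, take the probability the model assigns to the original token at that position, and average the results (averaging log-probabilities, i.e. a pseudo-log-likelihood, is a common variant):

```
import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")
model.eval()

def sentence_score(sentence):
    # Mask each token in turn and average the probability the model
    # assigns to the original token at that position.
    input_ids = tokenizer.encode(sentence, return_tensors="pt")[0]
    scores = []
    for i in range(1, len(input_ids) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0))[0]
        probs = torch.softmax(logits[0, i], dim=-1)
        scores.append(probs[input_ids[i]].item())
    return sum(scores) / len(scores)

print(sentence_score("Hugging Face is a French company based in Paris"))
```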

@LysandreJik I get an error:

"NLP_engine.py", line 120, in _word_in_sentence_prob
    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
TypeError: where(): argument 'condition' (position 1) must be Tensor, not bool

For the code:
```
def _word_in_sentence_prob(self, sentence, word):

    sequence = f"{sentence} {bert_tokenizer.mask_token}"

    input_ids = bert_tokenizer.encode(sequence, bert_tokenizer="pt")
    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]

    token_logits = bert_model(input_ids)[0]
    mask_token_logits = token_logits[0, mask_token_index, :]
    mask_token_logits = torch.softmax(mask_token_logits, dim=1)

    top_5 = torch.topk(mask_token_logits, 5, dim=1)
    top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())

    for token, score in top_5_tokens:
        print(sequence.replace(bert_tokenizer.mask_token, bert_tokenizer.decode([token])), f"(score: {score})")

    # Get the score of token_id
    sought_after_token = word
    sought_after_token_id = bert_tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[
        0]  # 928

    token_score = mask_token_logits[:, sought_after_token_id]
    print(f"Score of {sought_after_token}: {mask_token_logits[:, sought_after_token_id]}")
    return token_score

```

Any idea why?
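The traceback is most likely caused by the `encode` call above: it passes `bert_tokenizer="pt"` instead of `return_tensors="pt"`, so `input_ids` comes back as a plain Python list rather than a tensor. The comparison `input_ids == bert_tokenizer.mask_token_id` then evaluates to a single bool, which `torch.where()` rejects. A sketch of the corrected lines:

```
# Ask the tokenizer for a PyTorch tensor so the element-wise comparison works
input_ids = bert_tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
```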

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
