I'm trying to do QA with Longformer in a pipeline. First, I build the pipeline:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

MODEL_STR = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_STR)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_STR)
QA = pipeline('question-answering', model=model, tokenizer=tokenizer)
Then I get the paper text the answer should come from, my_article, a string containing the full body of the article (around 3000 words), and try:
with torch.no_grad():
    answer = QA(question=question, context=articles_abstract.body_text.iloc[0])
And it throws the following error:
KeyError                                  Traceback (most recent call last)
      1 with torch.no_grad():
----> 2     answer = QA(question=question, context=articles_abstract.body_text.iloc[0])

~/miniconda/envs/transformers_env/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
   1225                 ),
   1226             }
-> 1227             for s, e, score in zip(starts, ends, scores)
   1228         ]
   1229

~/miniconda/envs/transformers_env/lib/python3.7/site-packages/transformers/pipelines.py in <listcomp>(.0)
   1225                 ),
   1226             }
-> 1227             for s, e, score in zip(starts, ends, scores)
   1228         ]
   1229

KeyError: 382
How can I solve this issue? More importantly, what do you think is causing the issue?
Thanks in advance! :)
It seems that I have the same (or at least a very similar) issue, but with the ner pipeline.
My model is a fine-tuned RoBERTa (xlm-roberta-base).
I can produce different predictions with different inputs, but all are way outside the range of the actual label IDs.
The error shows that the predicted label ID can't be found in the id2label map in the model config:
~/projects/env/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    920             filtered_labels_idx = [
    921                 (idx, label_idx)
--> 922                 for idx, label_idx in enumerate(labels_idx)
    923                 if self.model.config.id2label[label_idx] not in self.ignore_labels
    924             ]

~/projects/env/lib/python3.7/site-packages/transformers/pipelines.py in <listcomp>(.0)
    921                 (idx, label_idx)
    922                 for idx, label_idx in enumerate(labels_idx)
--> 923                 if self.model.config.id2label[label_idx] not in self.ignore_labels
    924             ]
    925

KeyError: 741
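In case it helps narrow this down, one quick check is whether the label mapping in the config actually covers the indices the classification head can predict. A small diagnostic sketch; the checkpoint path below is a placeholder for the fine-tuned xlm-roberta-base model above:
# Diagnostic sketch: a KeyError like the one above usually means the predicted
# index has no entry in config.id2label.
from transformers import AutoConfig, AutoModelForTokenClassification

checkpoint = "path/to/finetuned-xlm-roberta-base"  # placeholder for your checkpoint
config = AutoConfig.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

print("num_labels in config:   ", config.num_labels)
print("entries in id2label:    ", len(config.id2label))
print("id2label keys:          ", sorted(config.id2label))
print("classifier output size: ", model.classifier.out_features)
# If the classifier can emit more indices than id2label contains (or the keys
# don't line up with the predicted indices), the ner pipeline hits this KeyError.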
Longformer isn't yet supported in the pipeline. For now you'll need to do this manually, as shown in the examples and docs.
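In the meantime, something along these lines should work with Longformer outside the pipeline. This is a minimal sketch, assuming an older transformers release where the model returns a (start_scores, end_scores) tuple; newer releases return an output object with .start_logits / .end_logits instead:
# Minimal sketch of extractive QA with Longformer, no pipeline involved.
import torch
from transformers import LongformerTokenizer, LongformerForQuestionAnswering

MODEL_STR = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(MODEL_STR)
model = LongformerForQuestionAnswering.from_pretrained(MODEL_STR)

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text, return_tensors="pt")

with torch.no_grad():
    # The QA head sets global attention on the question tokens itself when no
    # global_attention_mask is passed.
    start_scores, end_scores = model(**encoding)  # tuple in older releases

all_tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
answer_tokens = all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
answer = tokenizer.convert_tokens_to_string(answer_tokens)
print(answer)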
@patrickvonplaten
That's correct, adding Longformer to the QA pipeline is on the ToDo List :-)
Actually, Longformer isn't the only model that fails inside the pipeline. I'm now trying to use 'ktrapeznikov/biobert_v1.1_pubmed_squad_v2' and it throws the same KeyError.
Does anyone have an example of how to do QA without the pipeline? That would be really helpful for checking whether the models work, regardless of whether they've been added to the pipeline yet.
@alexvaca0
Please check which architecture you are using, then go to the docs and find the documentation for its QA model; it contains an example of how to use it without the pipeline. So if your architecture is BERT, there will be a model BertForQuestionAnswering, and you'll find the example in that model's doc. Basically, what you'll need to do is this:
# import your model class, you can also use AutoModelForQuestionAnswering and AutoTokenizer
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
# load the model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
# encode the question and text
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
# do the forward pass, each qa model returns start_scores, end_scores
start_scores, end_scores = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
# extract the span
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
assert answer == "a nice puppet"
Hope this helps you.
Also https://huggingface.co/transformers/usage.html#extractive-question-answering
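For what it's worth, the same pattern with the Auto classes and the biobert checkpoint mentioned above looks roughly like this. Again a sketch, assuming an older release that returns a (start_scores, end_scores) tuple rather than an output object:
# Sketch: pipeline-free QA with the Auto classes and the biobert checkpoint above.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "ktrapeznikov/biobert_v1.1_pubmed_squad_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text, return_tensors="pt")

with torch.no_grad():
    start_scores, end_scores = model(**encoding)  # tuple in older releases

all_tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
answer = tokenizer.convert_tokens_to_string(
    all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
)
print(answer)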
Actually, Longformer isn't the only model that fails inside the pipeline. I'm now trying to use 'ktrapeznikov/biobert_v1.1_pubmed_squad_v2' and it throws the same KeyError.
Feel free to open a separate issue on this so that we can investigate more :-)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.