This is my first issue posted here, so first off thank you for building this library, it's really pushing NLP forward.
The current QuestionAnsweringPipeline relies on squad_convert_examples_to_features to convert question/context pairs to SquadFeatures. Reviewing this method, it looks like it spawns a process for each example.
This causes performance issues when trying to support near real-time or bulk queries. As a workaround, I can issue the queries directly against the model, but the pipeline has a lot of nice logic to format answers properly and to pull the best answer rather than just the start/end argmax.
Please see the results of a rudimentary performance test to demonstrate:
```python
import time

from transformers import pipeline

context = r"""
The extractive question answering process took an average of 36.555 seconds using pipelines and about 2 seconds when
queried directly using the models.
"""

question = "How long did the process take?"

nlp = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased-distilled-squad")

start = time.time()
for x in range(100):
    answer = nlp(question=question, context=context)

print("Answer", answer)
print("Time", time.time() - start, "s")
```

```
Answer {'score': 0.8029816785368773, 'start': 62, 'end': 76, 'answer': '36.555 seconds'}
Time 36.703474044799805 s
```
Querying the model directly for the same question/context (code not shown above) gives:

```
Answer 36 . 555 seconds
Time 2.1718859672546387 s
```
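The direct-query code itself isn't included in the thread; a rough sketch of that approach (my assumption: `AutoModelForQuestionAnswering` with a plain argmax over the start/end logits, using a recent transformers API, not the exact code timed above) would look something like:

```python
# Sketch of querying the model directly, skipping the squad processor entirely.
# Assumes `question` and `context` are defined as in the first snippet.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end token positions and decode the span.
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits) + 1

answer = tokenizer.decode(inputs["input_ids"][0][start_index:end_index])
print("Answer", answer)
```

This skips the answer-formatting logic the pipeline provides, which is exactly the trade-off described above.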
I believe the bulk of the slowdown is that the first example had to spawn a process for each of the 100 calls. I also tried passing a list of 100 question/context pairs (roughly as sketched below) to see if that was better; that took ~28s. But for this use case, all 100 questions wouldn't be available at once.
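For reference, a rough sketch of that batched call (assuming the QA pipeline accepts parallel lists of questions and contexts):

```python
# Sketch of the batched variant: all 100 pairs go through feature conversion
# in a single call, amortizing the per-call setup cost.
questions = [question] * 100
contexts = [context] * 100

answers = nlp(question=questions, context=contexts)
print(answers[0])
```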
The additional logic for answer extraction doesn't come for free but it doesn't add much overhead. The third test below uses a [custom pipeline component](https://github.com/neuml/cord19q/blob/master/src/python/cord19q/pipeline.py) to demonstrate.
```python
from cord19q.pipeline import Pipeline

pipeline = Pipeline("distilbert-base-cased-distilled-squad", False)

start = time.time()
for x in range(100):
    answer = pipeline([question], [context])

print("\nAnswer", answer)
print("Time", time.time() - start, "s")
```

```
Answer [{'answer': '36.555 seconds', 'score': 0.8029860216482803}]
Time 2.219379186630249 s
```
It would be great if the QuestionAnsweringPipeline could either bypass the squad processor or if the processor gained an argument to disable process spawning.
Hi! Thanks for the detailed report. Indeed, it would be nice to keep the performance high, especially when the slowdown comes from something other than pure inference. I'm looking into it.
Great, thank you for the quick response!
After looking into it, it seems that the threading is only part of the problem. Removing it results in 24 seconds instead of 36 seconds, which is still 10x slower than pure inference.
I believe this is mostly due to squad_convert_example_to_features, which is built to be very robust; that robustness slows things down by quite a large factor.
There are probably a few things in it that are overkill for the pipeline compared to a SQuAD training run.
Thanks once again for the quick response. I did notice that the tokenizer in squad_convert_example_to_features was also padding to the max sequence length, which makes sense for batch inputs but is wasted work for a single query (illustrated below). My guess is that the real value added is in how robustly the squad processor extracts answers; it's tricky to find the match in the original text when all you have are model tokens.
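A small illustration of the padding point (my sketch, assuming a SQuAD-style max sequence length of 384; not code from the thread):

```python
# Padding a single short question/context pair to a fixed max length means the
# model processes mostly [PAD] tokens. Uses `question`/`context` from the first snippet.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

unpadded = tokenizer(question, context)
padded = tokenizer(question, context, padding="max_length", max_length=384)

print(len(unpadded["input_ids"]))  # a few dozen tokens for the example above
print(len(padded["input_ids"]))    # 384 tokens, most of them padding
```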
The custom example referenced above builds a regular expression by joining the tokens on \s? and handling BERT subwords (rough sketch below), but I'm not sure how well that would work for all models.
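A minimal sketch of that token-to-text matching idea (my simplification, not the exact cord19q implementation):

```python
# Merge WordPiece "##" subwords, then join the words with \s? so the pattern
# tolerates whitespace differences between the tokens and the raw context text.
import re

def find_in_context(tokens, context):
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # merge BERT subword into the previous word
        else:
            words.append(token)

    pattern = r"\s?".join(re.escape(word) for word in words)
    match = re.search(pattern, context)
    return match.group(0) if match else " ".join(words)

print(find_in_context(["36", ".", "555", "seconds"], "took an average of 36.555 seconds"))
# -> "36.555 seconds"
```

This recovers the original surface form ("36.555 seconds") from the space-separated tokens ("36 . 555 seconds"), but it depends on BERT-style subword markers, which is why it may not generalize to every tokenizer.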
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @davidmezzetti, just to let you know we're working towards a bigger pipeline refactor, with a strong focus on performance. Let's keep this issue open while it's still in the works in case more is to be said on the matter.
Thank you for following up, that sounds great!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@LysandreJik has there been any update in the library with respect to this issue?
I know this is an old issue, but just to close the loop: v4.0.0 improved pipeline QA performance to be on par with the methods referenced above. Thank you!
Glad to hear it!