The summarization task is returning unexpected results. For an input of
"We have a telephony partner who is very interested in this program and may be able to help identify pilot customers."
the result is
[{'summary_text': '"We have a telephony partner who is very interested in this program and may be able to help identify pilot customers," the company says. "We are looking at a number of different ways to get people talking to each other," it adds. "It's a very exciting time for us," says the company's chief operating officer.'}]
Model I am using (Bert, XLNet ...): Summarization pipeline
Language I am using the model on (English, Chinese ...): Eng
Steps to reproduce the behavior:
```python
!pip install -q transformers --upgrade
from transformers import pipeline

summarizer = pipeline(task="summarization")
data = "We have a telephony partner who is very interested in this program and may be able to help identify pilot customers."
print(summarizer(data))
```
I would expect the summary to 1) not add contextual information that doesn't exist, and 2) not be longer than the input.
Arguably the input is short but still...
Colab
You can pass `summarizer(data, min_length=10, max_length=20)` to get a summary whose length is between 10 and 20 tokens. By default, summaries will be between 56 and 142 tokens.
Thanks @sshleifer. Interestingly, with a `max_length` set the summary is now just arbitrarily cut off, which is not great either. Is there a way to constrain the summary length while actually preserving the sense?
[{'summary_text': '"We have a telephony partner who is very interested in this program and may be'}]
The logic of the program is "generate the most likely summary" of between `min_length` and `max_length`. So it's not programmed to cut the summary in a rules-based way.
With that in mind, I've also seen poor results summarizing documents that are very different than the finetuning distribution (news articles of ~1024 tokens).
You might get better results with `summarizer = pipeline(task="summarization", model='bart-large-xsum')`.
> The logic of the program is "generate the most likely summary" of between min_length and max_length. So it's not programmed to cut the summary in a rules based way.
Thanks for confirming - seems to be the right approach :)!
> You might get better results with summarizer = pipeline(task="summarization", model='bart-large-xsum').
Ok, will give it a try then!
> With that in mind, I've also seen poor results summarizing documents that are very different than the finetuning distribution (news articles of ~1024 tokens).
So you want to keep it open as a bug or should we close?
As a side request, it would be awesome to have metrics associated with each model that is part of transformers, to help users choose the right one for their job (cc: @julien-c).
Hi @sshleifer, can we increase the token length beyond 1024 for generating a summary?
I got the following message while generating a summary of a 20,000-word document:

```
Your max_length is set to 1300, but you input_length is only 1024. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
```
Unfortunately, Bart can only process 1024 tokens at once, so your best bet would be to split your doc into chunks, summarize each one, and concatenate the summaries.
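A minimal sketch of that chunk-summarize-concatenate approach (the names `chunk_tokens` and `summarize_long` are hypothetical helpers, not part of transformers; the whitespace split is only a rough token proxy, and you would use the pipeline's own tokenizer for exact 1024-token counts):

```python
def chunk_tokens(tokens, max_tokens=1024):
    """Split a list of tokens into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

def summarize_long(text, summarizer, max_tokens=1024):
    """Summarize a long document by summarizing each chunk and joining the results.

    `summarizer` is assumed to be an already-constructed summarization
    pipeline (or any callable with the same return shape).
    """
    words = text.split()  # rough proxy; use the model tokenizer for real token counts
    chunks = [" ".join(c) for c in chunk_tokens(words, max_tokens)]
    partial_summaries = [summarizer(c)[0]["summary_text"] for c in chunks]
    return " ".join(partial_summaries)
```

Note that this loses cross-chunk context, so summaries of chunks that split mid-topic can be disjointed; splitting on paragraph boundaries instead of a fixed token count usually helps.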
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.