I'm currently using the sshleifer/distill-pegasus-xsum-16-8 model to perform abstractive text summarization, and I've found this particular model to be the most useful for my desired application. However, when summarizing input source text, the output contains tokens that appear nowhere in the source text. I suspect Pegasus is returning tokens from the dataset it was trained on. That said, is fine-tuning needed? Should hyperparameter tweaking solve this?
I also wonder if combining PEGASUS with a GAN could help teach the model to abstract from the tokens in the input text.
_Here's an example_
Source Text:
German shares suffered their weakest day since early June on Wednesday as the government agreed on an emergency lockdown to combat surging COVID-19 cases, with other European markets following suit on fears of more curbs around the continent. The German DAX sank as much as 5% before cutting some losses to close down 4.2% at its lowest in five months. The precise measures were still subject to negotiation, with sources saying the government had agreed to shut bars and restaurants from Nov. 2. The pan-European STOXX 600 index fell 3% in its sharpest one-day drop in five weeks. France's main index dropped 3.4% ahead of a televised address by President Emmanuel Macron at 8:00 pm when he is expected to issue stay-at-home orders.
```python
# XSUM 16-8
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "sshleifer/distill-pegasus-xsum-16-8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_pegasus_distill_xsum_16_8 = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(torch_device)

# src_text holds the source article shown above
batch = tokenizer([src_text], truncation=True, padding="longest", return_tensors="pt").to(torch_device)
translated = model_pegasus_distill_xsum_16_8.generate(**batch, num_beams=9, num_return_sequences=3, temperature=1, length_penalty=5, max_length=256, min_length=0)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
```
Output Text:
Shares in Europe have fallen sharply after the German government agreed to shut down bars and restaurants in a bid to curb the spread of carbon monoxide (CO) in the country's capital, Berlin. The pan-European STOXX 600 index fell 3% in its sharpest one-day drop in five weeks, while the FTSE 100 index closed down 3.7% in its sharpest one-day fall in five weeks.
From the output text, one can see that carbon monoxide (CO), Berlin, and the FTSE 100 are mentioned nowhere in the input text.
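For reference, here is a rough way to check this programmatically; a sketch reusing the `src_text` and `tgt_text` variables from the snippet above:
```python
# Crude "novel word" check (sketch): list summary words that never appear
# in the source text. A rough hallucination indicator, not a proper metric.
src_words = set(src_text.lower().split())
novel_words = [w for w in tgt_text[0].lower().split() if w not in src_words]
print(novel_words)  # expect words like "carbon", "berlin", "ftse" (modulo punctuation)
```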
Not an expert in summarization, but abstractive text summarization does not extract sequences/tokens from the initial text to produce a summary; that would be extractive text summarization. Abstractive summarization instead rephrases, and may introduce words not in the source, as seems to be the case here.
On a second note, I believe the Pegasus checkpoints were trained on very long sequences, so I'm not entirely sure how the model would deal with much shorter sequences such as the one you used here; a quick length check is sketched below.
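Something like this (an untested sketch reusing the variables from your snippet) would show how short your input is relative to the model's configured maximum:
```python
# Sketch: compare the tokenized input length against the model's maximum
# input length, to see how short this example is relative to training data.
n_tokens = tokenizer(src_text, return_tensors="pt")["input_ids"].shape[-1]
print(n_tokens, model_pegasus_distill_xsum_16_8.config.max_position_embeddings)
```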
On a third note, we try to keep the GitHub issues reserved for issues/feature requests; you would have more luck asking this over on the forum.
@patrickvonplaten or @patil-suraj can chime in if I'm wrong.
The hyperparameters seem very extreme to me. `temperature=1` does not do anything (it's the default), and `length_penalty=5` is very high; note that a `length_penalty` > 1 actually incentivizes longer sequences. @sshleifer's model already has good hyperparameters set as default values, which you can see here:
https://huggingface.co/sshleifer/distill-pegasus-xsum-16-8/blob/main/config.json
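For instance, you can inspect the generation defaults that were loaded from that config.json (a quick sketch):
```python
# Sketch: the generation defaults picked up from config.json at load time.
cfg = model_pegasus_distill_xsum_16_8.config
print(cfg.num_beams, cfg.length_penalty, cfg.max_length, cfg.min_length)
```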
If you just use those, e.g.:
```python
translated = model_pegasus_distill_xsum_16_8.generate(**batch)
```
you get this summary:
European shares fell sharply on Wednesday as investors remained cautious ahead of a speech by France's president later in the day.
My conclusion would be that it's just the hyperparameters that are badly chosen - not sure if @sshleifer has something to add...
"sshleifer/distill-pegasus-xsum-16-4" is better and faster. See Table 6 of the best paper in AI history ;). num_beams=4 if I cared at all about speed.