Transformers: Pegasus Xsum Returning Tokens Not In Source Text

Created on 20 Nov 2020 · 3 comments · Source: huggingface/transformers

I'm currently using the sshleifer/distill-pegasus-xsum-16-8 model to perform abstractive text summarization, and I've found this particular model to be the most useful for my application. However, when I summarize my source text, the output contains tokens that appear nowhere in the source. I suspect Pegasus is returning tokens from the dataset it was trained on. Given that, is fine-tuning needed, or should hyperparameter tweaking solve this?

I wonder if PEGASUS + GAN could help teach the model to abstract from tokens in the input text?

_Here's an example_

Source Text:
German shares suffered their weakest day since early June on Wednesday as the government agreed on an emergency lockdown to combat surging COVID-19 cases, with other European markets following suit on fears of more curbs around the continent. The German DAX sank as much as 5% before cutting some losses to close down 4.2% at its lowest in five months. The precise measures were still subject to negotiation, with sources saying the government had agreed to shut bars and restaurants from Nov. 2. The pan-European STOXX 600 index fell 3% in its sharpest one-day drop in five weeks. France's main index dropped 3.4% ahead of a televised address by President Emmanuel Macron at 8:00 pm when he is expected to issue stay-at-home orders.

```
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
src_text = "German shares suffered their weakest day since early June on Wednesday ..."  # full source text quoted above

# XSUM 16-8
model_name = "sshleifer/distill-pegasus-xsum-16-8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_pegasus_distill_xsum_16_8 = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch([src_text], truncation=True, padding="longest").to(torch_device)
translated = model_pegasus_distill_xsum_16_8.generate(**batch, num_beams=9, num_return_sequences=3, temperature=1, length_penalty=5, max_length=256, min_length=0)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
```

Output Text:
Shares in Europe have fallen sharply after the German government agreed to shut down bars and restaurants in a bid to curb the spread of carbon monoxide (CO) in the country's capital, Berlin. The pan-European STOXX 600 index fell 3% in its sharpest one-day drop in five weeks, while the FTSE 100 index closed down 3.7% in its sharpest one-day fall in five weeks.

From the output, one can see that carbon monoxide (CO), Berlin, and the FTSE 100 are mentioned nowhere in the input text.
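For what it's worth, a quick word-overlap check makes those hallucinated tokens easy to list. This is a rough sketch of my own, reusing `src_text` and `tgt_text` from the snippet above; word-level overlap is only a crude proxy for hallucination:

```
import re

def words(text: str) -> set:
    """Lower-cased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Words in the generated summary that never occur in the source document.
print(words(tgt_text[0]) - words(src_text))
# e.g. includes 'carbon', 'monoxide', 'berlin', 'ftse' for the summary above
```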

All 3 comments

I'm not an expert in summarization, but abstractive text summarization does not extract sequences/tokens from the input text to produce a summary; that would be extractive text summarization. Abstractive summarization can instead rephrase, as seems to be the case here.

On a second note, I believe the Pegasus checkpoints were trained on very long sequences, so I'm not entirely sure how the model deals with shorter inputs such as the one you used here.

On a third note, we try to keep the GitHub issues reserved for bug reports and feature requests; you would have more luck asking this over on the forum.

@patrickvonplaten or @patil-suraj can chime in if I'm wrong.

The hyperparameters look very extreme to me. temperature=1 does nothing (it is the default and only matters when sampling), and length_penalty=5 is very high; note that a length_penalty > 1 actually incentivizes longer sequences. @sshleifer's model already ships with good default hyperparameters, which you can see here:
https://huggingface.co/sshleifer/distill-pegasus-xsum-16-8/blob/main/config.json
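For reference, here is a small sketch to print those defaults locally (assuming only that `transformers` is installed); `generate` picks them up automatically whenever you don't override them:

```
from transformers import AutoConfig

# Inspect the generation defaults that ship with the checkpoint.
config = AutoConfig.from_pretrained("sshleifer/distill-pegasus-xsum-16-8")
for name in ("num_beams", "length_penalty", "max_length", "min_length"):
    print(name, getattr(config, name, None))
```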

If you just use those, e.g.:

```
translated = model_pegasus_distill_xsum_16_8.generate(**batch)
```

you get this summary:

European shares fell sharply on Wednesday as investors remained cautious ahead of a speech by France's president later in the day.

You can try it yourself here:
https://huggingface.co/sshleifer/distill-pegasus-xsum-16-8?text=German+shares+suffered+their+weakest+day+since+early+June+on+Wednesday+as+the+government+agreed+on+an+emergency+lockdown+to+combat+surging+COVID-19+cases%2C+with+other+European+markets+following+suit+on+fears+of+more+curbs+around+the+continent.+The+German+DAX+sank+as+much+as+5%25+before+cutting+some+losses+to+close+down+4.2%25+at+its+lowest+in+five+months.+The+precise+measures+were+still+subject+to+negotiation%2C+with+sources+saying+the+government+had+agreed+to+shut+bars+and+restaurants+from+Nov.+2.+The+pan-European+STOXX+600+index+fell+3%25+in+its+sharpest+one-day+drop+in+five+weeks.+France%27s+main+index+dropped+3.4%25+ahead+of+a+televised+address+by+President+Emmanuel+Macron+at+8%3A00+pm+when+he+is+expected+to+issue+stay-at-home+orders.

My conclusion would be that it's just the hyperparameters that are badly chosen - not sure if @sshleifer has something to add...

  • Lysandre is correct about abstractive vs. extractive.
  • Hallucination is a known issue with neural text generation. It will happen more often if you generate summaries that are more than ~30% of the length of the input document (which your length_penalty and max_length encourage).
  • "sshleifer/distill-pegasus-xsum-16-4" is better and faster. See Table 6 of the best paper in AI history ;).
  • I would set num_beams=4 if I cared at all about speed (see the sketch after this list for how that combines with a length cap).
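A rough sketch of those two suggestions together, reusing `batch`, `model_pegasus_distill_xsum_16_8`, and `tokenizer` from the question; this is my own illustration, and the 30% cap is a rule of thumb rather than a tuned value:

```
# Cap the summary at roughly 30% of the input length and use a cheaper beam size.
input_len = batch["input_ids"].shape[1]
summary_ids = model_pegasus_distill_xsum_16_8.generate(
    **batch,
    num_beams=4,                               # cheaper than 9, usually good enough
    max_length=max(16, int(0.3 * input_len)),  # keep the summary well under the input length
    # length_penalty is left at the checkpoint default instead of forcing 5
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```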