Scores in the "Authors" column are taken from the mixed & stochastic column of this table:
| dataset | Authors (R1/R2/RL) | This Repo (R1/R2/RL) | best bart | best bart name |
| ---- | ---- | ---- | ---- | ---- |
| xsum | 47.60/24.83/39.64 | 46.87/24.46/39.15 | 22.32/37.39 | distilbart-xsum-12-6 |
| cnn_dailymail | 44.16/21.56/41.30 | see comment | 21.26/30.59 | distilbart-cnn-12-6 |
| newsroom | 45.07/33.39/41.28 | 41.03/29.83/36.96 | | |
| multi_news | 47.65/18.75/24.95 | 47.58/19.0/24.77 | | |
| gigaword | 39.65/20.47/36.76 | 39.79/20.56/36.80 | | |
| wikihow | 46.39/22.12/38.41 * | 46.85/23.64/28.73 | | |
| reddit_tifu | 27.99/9.81/22.94 | 32.75/11.68/24.97 | | |
| big_patent | 52.29/33.08/41.66 * | | | |
| arxiv | 44.21/16.95/25.67 | 44.83/17.34/25.60 | | |
| pubmed | 45.97/20.15/28.25 | 45.40/19.42/26.93 | | |
| aeslc | 37.68/21.25/36.51 | 37.09/21.40/35.93 | | |
| billsum | 59.67/41.58/47.59 | 56.18/39.94/45.39 | | |
Mission accomplished, thanks to the work of @patil-suraj and @stas00!
The above table now shows that our results are close enough.
We suspect the remaining differences are due to the treatment of the `<n>` character that Pegasus generates and to slightly different beam search implementations.
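To see where `<n>` comes from, here is a minimal generation sketch (the input text and `num_beams=8` are illustrative assumptions, not the exact settings behind the table):

```python
# Minimal sketch: decode a Pegasus CNN/DM summary and inspect the raw output.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

article = "Marseille, France (CNN) The French prosecutor leading an investigation ..."  # any article text
batch = tokenizer([article], truncation=True, padding="longest", return_tensors="pt")
generated = model.generate(**batch, num_beams=8)  # beam settings also affect truncation
summary = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(summary)  # for this checkpoint, sentences are typically joined by the literal <n> separator
```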
Link to Spreadsheet with timing data
Questions about specific results should be asked on the forums/separate issues with @stas00, @patil-suraj, and @sshleifer tagged.
If anyone wants to help, evaluate on a dataset where the third column ("This Repo") is not filled in.
Steps:
First, download the data from the nlp package and save it to disk in the format described in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
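A minimal sketch of that step, assuming the xsum config and its "document"/"summary" field names (other datasets use different field names, so adjust accordingly; the helper name is hypothetical):

```python
# Hypothetical helper: dump an nlp summarization split to the
# test.source / test.target layout that run_eval.py expects (one example per line).
from pathlib import Path
import nlp  # the package was later renamed to `datasets`

def save_split(dataset_name, split, out_dir, src_field="document", tgt_field="summary"):
    ds = nlp.load_dataset(dataset_name, split=split)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / f"{split}.source", "w") as src, open(out / f"{split}.target", "w") as tgt:
        for ex in ds:
            # newlines inside an example would break the one-example-per-line format
            src.write(ex[src_field].replace("\n", " ") + "\n")
            tgt.write(ex[tgt_field].replace("\n", " ") + "\n")

save_split("xsum", "test", "xsum")
```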
Helper function for `run_eval.py`:
```bash
gen_test_hub_summ () {
    # Usage: gen_test_hub_summ <model> <data_dir> <save_dir> [extra run_eval.py flags]
    # Extra flags (e.g. --fp16 and --bs <batch_size>) are forwarded to run_eval.py.
    model=$1
    DATA_DIR=$2
    echo "$DATA_DIR"
    save_dir=$3
    mkdir -p "$save_dir"
    shift 3  # drop the three positional args; "$@" now holds the extra flags
    python run_eval.py "$model" "$DATA_DIR/test.source" "$save_dir/test_gens.txt" \
        --reference_path "$DATA_DIR/test.target" \
        --score_path "$save_dir/test_rouge.json" \
        --task summarization "$@"
}
```
Then, roughly:

```bash
cd examples/seq2seq
gen_test_hub_summ google/pegasus-{dataset} dataset {dataset}_results --bs 4
```
Leave the results, as well as any observations about truncation in the produced summaries, as a comment on this issue!
One possible reason for the replication gap is that our beam search logic differs from the original, causing 16% of the summaries to be truncated.
Fine-tuning with our finetuning code and `--max_target_length=142` partially fixes this issue:
- 43.23/21.29/31.3, 0.436 s/sample (released at sshleifer/dpx-cnn-16-4)
- 44.13/21.37/30.94, 1.4 s/sample, 0.2 ROUGE-2 behind published (sshleifer/pegasus-cnn-ft-v2)

sshleifer/distill-pegasus-xsum-16-4:

```json
{"rouge1": 44.942, "rouge2": 23.0412, "rougeL": 37.8579,
 "n_obs": 11333, "seconds_per_sample": 0.1972, "batch_size": 16}
```

Teacher metrics (I don't remember batch size):

```json
{"rouge1": 46.8773, "rouge2": 24.46, "rougeL": 39.1507,
 "n_obs": 11328, "seconds_per_sample": 0.3308}
```
I intend to post a writeup on distillation techniques at some point before Oct 15!
Re: replication, the best download strategy may be to start with https://github.com/google-research/pegasus/blob/master/pegasus/data/public_datasets_test.py and modify it.
CNN update:
The original Pegasus outputs have the `<n>` token at the beginning of sentences, whereas ours do not. The original Pegasus code replaces the newline symbol with `<n>`; PegasusTokenizer should probably do this too: https://github.com/huggingface/transformers/issues/7327

For CNNDM, I can get this score with the google/pegasus-cnn_dailymail model:
```
ROUGE-1:
rouge_1_f_score: 0.4436 with confidence interval (0.4413, 0.4459)
rouge_1_recall: 0.4825 with confidence interval (0.4797, 0.4853)
rouge_1_precision: 0.4368 with confidence interval (0.4339, 0.4395)
ROUGE-2:
rouge_2_f_score: 0.2145 with confidence interval (0.2120, 0.2170)
rouge_2_recall: 0.2323 with confidence interval (0.2297, 0.2350)
rouge_2_precision: 0.2124 with confidence interval (0.2097, 0.2150)
ROUGE-L:
rouge_l_f_score: 0.4141 with confidence interval (0.4118, 0.4165)
rouge_l_recall: 0.4501 with confidence interval (0.4474, 0.4530)
rouge_l_precision: 0.4079 with confidence interval (0.4051, 0.4106)
```
Script I run:
```bash
./run_eval.py google/pegasus-cnn_dailymail /home/ffajri/Data/huggingface/cnn_dm/test.source pred_cnndm_pegasus.txt \
    --reference_path /home/ffajri/Data/huggingface/cnn_dm/test.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 512 \
    --max_target_length 128 \
    --bs 4
```
I notice the initial R1 output from the transformers script is 43.xx, but I recalculate ROUGE (to get the scores above) as follows:
1. First, I replace `<n>` with `\n` in the decoding results (as you said above).
2. I don't use the gold summary provided by huggingface because its sentences are not separated by the newline character. I think it's necessary to separate sentences in the gold summary, so I use the original gold test set from See et al., 2017 to compute ROUGE.
3. I lowercase all decoded and gold summaries (though I'm not sure whether it really affects the ROUGE score).
4. I calculate ROUGE with pyrouge (not the ROUGE implementation in transformers); see the sketch after this list.
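A rough sketch of those steps (the file names and directory layout are made up; assumes pyrouge plus the underlying ROUGE-1.5.5 perl toolkit are installed and configured, and that predictions/references are stored one summary per line):

```python
# Hypothetical preprocessing + pyrouge scoring, following the steps above.
from pathlib import Path
from pyrouge import Rouge155

def to_rouge_format(summary: str) -> str:
    # steps 1 and 3: one sentence per line, lowercased
    # (handles either the <n> separator or a literal "\n" marker)
    return summary.replace("<n>", "\n").replace("\\n", "\n").lower()

pred_dir, ref_dir = Path("rouge_tmp/pred"), Path("rouge_tmp/ref")
pred_dir.mkdir(parents=True, exist_ok=True)
ref_dir.mkdir(parents=True, exist_ok=True)

preds = Path("pred_cnndm_pegasus.txt").read_text().splitlines()
golds = Path("see_etal_test.target").read_text().splitlines()  # step 2: See et al. gold summaries
for i, (pred, gold) in enumerate(zip(preds, golds)):
    (pred_dir / f"pred.{i}.txt").write_text(to_rouge_format(pred))
    (ref_dir / f"ref.{i}.txt").write_text(to_rouge_format(gold))

r = Rouge155()
r.system_dir = str(pred_dir)                  # decoded summaries, one file per example
r.model_dir = str(ref_dir)                    # gold summaries, one file per example
r.system_filename_pattern = r"pred.(\d+).txt"
r.model_filename_pattern = "ref.#ID#.txt"
print(r.convert_and_evaluate())               # prints ROUGE scores with confidence intervals
```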
Hope this helps with the fix.
Would you be willing to share a few lines of cnn_dm/test.source, pred_cnndm_pegasus.txt, and cnn_dm/test.target?
Thanks!
Hi, for inference I use the same test set from huggingface:
test.source
Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." ............
test.target
Marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports . Journalists at Bild and Paris Match are "very confident" the video clip is real, an editor says . Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says .
pred_cnndm_pegasus.txt (Result)
"A person who has such a video needs to immediately give it to the investigators," prosecutor says .<n>"It is a very disturbing scene," editor-in-chief of Bild online tells "Erin Burnett: Outfront"
Then I got R1 = 43.xx (as output by ./run_eval.py).
To get the R1 = 44.xx, I separately calculate ROUGE (pyrouge) with:
test.target from See et al., 2017
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .\njournalists at bild and paris match are '' very confident '' the video clip is real , an editor says .\nandreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .
_updated_ pred_cnndm_pegasus.txt
"a person who has such a video needs to immediately give it to the investigators," prosecutor says .\n"it is a very disturbing scene," editor-in-chief of bild online tells "erin burnett: outfront"
Both now have \n, which I think is necessary for calculating ROUGE.
We fixed our calculate_rouge_score to address the \n issue and now we are getting
44.31/21.53/41.15 for sshleifer/pegasus-cnn-ft-v2! Thanks for the help!
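For background, here is a small illustration (toy sentences, not the actual calculate_rouge patch) of why the \n handling matters when scoring with the rouge_score package: rougeLsum treats each newline-separated line as a sentence and takes a union LCS per sentence, so the same text scores differently depending on whether sentences are newline-separated.

```python
# Illustration of rougeLsum's newline sensitivity with the rouge_score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)

reference = "The prosecutor says no video was used.\nThe editor says the clip is real."
flat_pred = "The prosecutor says the clip is real. The editor says no video was used."
split_pred = flat_pred.replace(". ", ".\n")  # put each sentence on its own line

print(scorer.score(reference, flat_pred)["rougeLsum"].fmeasure)
print(scorer.score(reference, split_pred)["rougeLsum"].fmeasure)  # higher for the newline-split version here
```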
Updated the table in the issue description with the most recent results after the calculate_rouge fix.
Moving forward, questions about specific results should be asked on the forums or in a separate issue with @stas00, @patil-suraj, and @sshleifer tagged.