Transformers: Pegasus: replication and distillation results

Created on 31 Aug 2020 · 10 comments · Source: huggingface/transformers

Replication

link

mixed & stochastic column of this table

| dataset | Authors | This Repo | best bart | best bart name |
| ---- | ---- | ---- | ---- | ---- |
| xsum | 47.60/24.83/39.64 | 46.87/24.46/39.15 | 22.32/37.39 | distilbart-xsum-12-6 |
| cnn_dailymail | 44.16/21.56/41.30 | see comment | 21.26/30.59 | distilbart-cnn-12-6 |
| newsroom | 45.07/33.39/41.28 | 41.03/29.83/36.96 | | |
| multi_news | 47.65/18.75/24.95 | 47.58/19.0/24.77 | | |
| gigaword | 39.65/20.47/36.76 | 39.79/20.56/36.80 | | |
| wikihow | 46.39/22.12/38.41 * | 46.85/23.64/28.73 | | |
| reddit_tifu | 27.99/9.81/22.94 | 32.75/11.68/24.97 | | |
| big_patent | 52.29/33.08/41.66 * | | | |
| arxiv | 44.21/16.95/25.67 | 44.83/17.34/25.60 | | |
| pubmed | 45.97/20.15/28.25 | 45.40/19.42/26.93 | | |
| aeslc | 37.68/21.25/36.51 | 37.09/21.40/35.93 | | |
| billsum | 59.67/41.58/47.59 | 56.18/39.94/45.39 | | |

  • * (authors' footnote): the wikihow and big_patent numbers are not comparable because of changes in tokenization and data.

Final Update (2020-10-16)

Mission accomplished, thanks to the work of @patil-suraj and @stas00!

The table above now shows that our results are close enough.
We suspect the remaining differences are due to the treatment of the <n> character that Pegasus generates and to slightly different beam search implementations.

Link to Spreadsheet with timing data

Questions about specific results should be asked on the forums or in separate issues with @stas00, @patil-suraj, and @sshleifer tagged.

Labels: Help wanted, Replication


All 10 comments

If anyone wants to help, evaluate on a dataset where the third column is not filled in.
Steps:
First, download the data from the nlp package and save it to disk in the format described in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
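
As a rough sketch of that download step (assuming the xsum dataset; its columns are document/summary, and other datasets use different column names):

# Sketch: download a summarization dataset with the nlp library and save it in the
# one-example-per-line layout run_eval.py expects ({split}.source / {split}.target).
from pathlib import Path
import nlp  # the package was later renamed to `datasets`

save_dir = Path("xsum")
save_dir.mkdir(exist_ok=True)

test_set = nlp.load_dataset("xsum", split="test")
with open(save_dir / "test.source", "w") as src, open(save_dir / "test.target", "w") as tgt:
    for example in test_set:
        # strip internal newlines so each example stays on a single line
        src.write(example["document"].replace("\n", " ") + "\n")
        tgt.write(example["summary"].replace("\n", " ") + "\n")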

Helper function for run_eval

gen_test_hub_summ () {
    # Usage: gen_test_hub_summ <model> <data_dir> <save_dir> [extra run_eval.py args]
    # Pass --fp16 and --bs <batch_size> as extra args if desired.
    model=$1
    DATA_DIR=$2
    echo "$DATA_DIR"
    save_dir=$3
    mkdir -p "$save_dir"
    shift 3  # everything left is forwarded to run_eval.py
    python run_eval.py "$model" "$DATA_DIR/test.source" "$save_dir/test_gens.txt" \
        --reference_path "$DATA_DIR/test.target" \
        --score_path "$save_dir/test_rouge.json" \
        --task summarization "$@"
}

Then, roughly:

cd examples/seq2seq
gen_test_hub_summ google/pegasus-{dataset} {dataset} {dataset}_results --bs 4

Post the results, as well as any observations about truncation in the produced summaries, as a comment on this issue!

CNN Dailymail

One possible reason for the replication gap is that our beam search logic differs from the original's, causing 16% of the summaries to be truncated.

Finetuning with our finetuning code and --max_target_length=142 partially fixes this issue:

  • A distilled version (16-4) gets 43.23/21.29/31.3 at 0.436 s/sample (released as sshleifer/dpx-cnn-16-4).
  • Finetuning the 16-16 pegasus-cnn checkpoint gets 44.13/21.37/30.94 at 1.4 s/sample, 0.2 ROUGE-2 behind published (sshleifer/pegasus-cnn-ft-v2).
  • The original google/pegasus-cnn_dailymail scored 20.73 ROUGE-2.
  • For both of these finetuned models, >99.8% of generations end in punctuation.
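
As a quick qualitative check of the checkpoints above, a minimal sketch of loading sshleifer/pegasus-cnn-ft-v2 and generating with the model's default settings (the article string is just a placeholder; the scores above come from run_eval.py):

# Sketch: summarize one article with a finetuned Pegasus checkpoint.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "sshleifer/pegasus-cnn-ft-v2"  # or sshleifer/dpx-cnn-16-4 for the distilled model
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

article = "PG&E stated it scheduled the blackouts in response to forecasts for high winds."
batch = tokenizer(article, truncation=True, padding="longest", return_tensors="pt")
generated = model.generate(**batch)
summary = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
# cnn_dailymail checkpoints emit <n> between sentences; map it back to newlines before scoring
print(summary.replace("<n>", "\n"))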

XSUM

sshleifer/distill-pegasus-xsum-16-4

{"rouge1": 44.942, "rouge2": 23.0412, "rougeL": 37.8579,
 "n_obs": 11333, "seconds_per_sample": 0.1972, "batch_size": 16}

Teacher metrics (I don't remember batch size):

{"rouge1": 46.8773, "rouge2": 24.46, "rougeL": 39.1507, 
"n_obs": 11328,  "seconds_per_sample": 0.3308}

I intend to post a writeup on distillation techniques at some point before Oct 15!

Re: replication, the best download strategy may be to start with
https://github.com/google-research/pegasus/blob/master/pegasus/data/public_datasets_test.py and modify it.

CNN update:

  • I believe we have a preprocessing issue: ported models generate the <n> token at the beginning of sentences, whereas ours do not. The original pegasus code replaces the newline symbol with <n>; PegasusTokenizer should probably do this as well: https://github.com/huggingface/transformers/issues/7327

For CNN/DM, I can get this score with the google/pegasus-cnn_dailymail model.

ROUGE-1:
rouge_1_f_score: 0.4436 with confidence interval (0.4413, 0.4459)
rouge_1_recall: 0.4825 with confidence interval (0.4797, 0.4853)
rouge_1_precision: 0.4368 with confidence interval (0.4339, 0.4395)

ROUGE-2:
rouge_2_f_score: 0.2145 with confidence interval (0.2120, 0.2170)
rouge_2_recall: 0.2323 with confidence interval (0.2297, 0.2350)
rouge_2_precision: 0.2124 with confidence interval (0.2097, 0.2150)

ROUGE-L:
rouge_l_f_score: 0.4141 with confidence interval (0.4118, 0.4165)
rouge_l_recall: 0.4501 with confidence interval (0.4474, 0.4530)
rouge_l_precision: 0.4079 with confidence interval (0.4051, 0.4106)

Script I run:

./run_eval.py google/pegasus-cnn_dailymail /home/ffajri/Data/huggingface/cnn_dm/test.source pred_cnndm_pegasus.txt \
    --reference_path /home/ffajri/Data/huggingface/cnn_dm/test.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 512 \
    --max_target_length 128 \
    --bs 4

I notice the R1 output from the transformers script is 43.xx, but I recalculate ROUGE (to get the scores above) as follows (a sketch of the normalization follows the list):
1) First, I replace <n> with \n in the decoding results (as you said above).
2) I don't use the gold summaries provided by huggingface because their sentences are not separated by the newline character, and I think separating sentences in the gold summary is necessary. So I use the original gold test set from See et al., 2017 to compute ROUGE.
3) I lowercase all decoded and gold summaries (but I'm not sure whether this really affects the ROUGE score).
4) I calculate ROUGE with the pyrouge code (not the ROUGE in transformers).
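
A rough sketch of the normalization in steps 1–3 (the pyrouge call in step 4 is omitted; the example string is abbreviated from the prediction shown later in this thread):

# Sketch: normalize a generated summary before ROUGE is computed.
def normalize_summary(text: str) -> str:
    text = text.replace("<n>", "\n")  # step 1: <n> -> newline, one sentence per line
    return text.lower()               # step 3: lowercase (effect on ROUGE unclear)

# step 2 concerns the reference side: use gold summaries whose sentences are newline-separated
pred = normalize_summary('"A person who has such a video needs to immediately give it to the '
                         'investigators," prosecutor says .<n>"It is a very disturbing scene," '
                         'editor-in-chief of Bild online tells "Erin Burnett: Outfront"')
print(pred)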

Hope this helps with the fix.

Would you be willing to share a few lines of

cnn_dm/test.source, pred_cnndm_pegasus.txt, and cnn_dm/test.target?

Thanks!

Hi, for inference I use the same test set from huggingface:

test.source
Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." ............

test.target
Marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports . Journalists at Bild and Paris Match are "very confident" the video clip is real, an editor says . Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says .

pred_cnndm_pegasus.txt (Result)
"A person who has such a video needs to immediately give it to the investigators," prosecutor says .<n>"It is a very disturbing scene," editor-in-chief of Bild online tells "Erin Burnett: Outfront"

Then, I got R1 = 43.xx (as the ./run_eval.py output)

To get the R1 = 44.xx, I separately calculate ROUGE (pyrouge) with:

test.target from See et al., 2017
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .\njournalists at bild and paris match are '' very confident '' the video clip is real , an editor says .\nandreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .

_updated_ pred_cnndm_pegasus.txt
"a person who has such a video needs to immediately give it to the investigators," prosecutor says .\n"it is a very disturbing scene," editor-in-chief of bild online tells "erin burnett: outfront"

Both now have \n which I think is necessary for calculating ROUGE.

We fixed our calculate_rouge_score to address the \n issue, and now we are getting 44.31/21.53/41.15 for sshleifer/pegasus-cnn-ft-v2! Thanks for the help!
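
For reference, a minimal sketch of what newline-aware scoring looks like with the rouge_score package (an illustration, not the repo's exact implementation; the strings are abbreviated from the example above):

# Sketch: rougeLsum splits summaries on newlines, so restoring <n> -> \n changes ROUGE-L.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
prediction = '"a person who has such a video needs to immediately give it to the investigators," prosecutor says .<n>"it is a very disturbing scene," editor-in-chief of bild online tells "erin burnett: outfront"'
reference = 'marseille prosecutor says no videos were used in the crash investigation .\njournalists at bild and paris match are very confident the video clip is real .'
scores = scorer.score(reference, prediction.replace("<n>", "\n"))
print({k: round(v.fmeasure, 4) for k, v in scores.items()})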

Updated the table in the issue description with the most recent results after the calculate_rouge fix.
Moving forward, questions about specific results should be asked on the forums or in a separate issue with @stas00, @patil-suraj, and @sshleifer tagged.
