Scores in the "Authors" column are taken from the mixed & stochastic column of this table:
| dataset | Authors (R1/R2/RL) | This Repo (R1/R2/RL) | best bart | best bart name |
| ---- | ---- | ---- | ---- | ---- |
| xsum | 47.60/24.83/39.64 | 46.87/24.46/39.15 | 22.32/37.39 | distilbart-xsum-12-6 |
| cnn_dailymail | 44.16/21.56/41.30 | see comment | 21.26/30.59 | distilbart-cnn-12-6 |
| newsroom | 45.07/33.39/41.28 | 41.03/29.83/36.96 | | |
| multi_news | 47.65/18.75/24.95 | 47.58/19.0/24.77 | | |
| gigaword | 39.65/20.47/36.76 | 39.79/20.56/36.80 | | |
| wikihow | 46.39/22.12/38.41 * | 46.85/23.64/28.73 | | |
| reddit_tifu | 27.99/9.81/22.94 | 32.75/11.68/24.97 | | |
| big_patent | 52.29/33.08/41.66 * | | | |
| arxiv | 44.21/16.95/25.67 | 44.83/17.34/25.60 | | |
| pubmed | 45.97/20.15/28.25 | 45.40/19.42/26.93 | | |
| aeslc | 37.68/21.25/36.51 | 37.09/21.40/35.93 | | |
| billsum | 59.67/41.58/47.59 | 56.18/39.94/45.39 | | |
Mission accomplished, thanks to the work of @patil-suraj and @stas00!
The above table now shows that our results are close enough.
We suspect the remaining differences are due to the treatment of the `<n>` character that Pegasus generates and to slightly different beam search implementations.
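To see where `<n>` comes from, here is a minimal generation sketch (the input text and `num_beams=8` are illustrative assumptions, not the exact settings behind the table):

```python
# Minimal sketch: decode a Pegasus CNN/DM summary and inspect the raw output.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

article = "Marseille, France (CNN) The French prosecutor leading an investigation ..."  # any article text
batch = tokenizer([article], truncation=True, padding="longest", return_tensors="pt")
generated = model.generate(**batch, num_beams=8)  # beam settings also affect truncation
summary = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(summary)  # for this checkpoint, sentences are typically joined by the literal <n> separator
```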
Link to Spreadsheet with timing data
Questions about specific results should be asked on the forums/separate issues with @stas00, @patil-suraj, and @sshleifer tagged.
If anyone wants to help, evaluate on a dataset where the third column ("This Repo") is not filled in.
Steps:
First, download the data from the nlp package and save it to disk in the format described in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/download_wmt.py
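A minimal sketch of that step, assuming the xsum config and its "document"/"summary" field names (other datasets use different field names, so adjust accordingly; the helper name is hypothetical):

```python
# Hypothetical helper: dump an nlp summarization split to the
# test.source / test.target layout that run_eval.py expects (one example per line).
from pathlib import Path
import nlp  # the package was later renamed to `datasets`

def save_split(dataset_name, split, out_dir, src_field="document", tgt_field="summary"):
    ds = nlp.load_dataset(dataset_name, split=split)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / f"{split}.source", "w") as src, open(out / f"{split}.target", "w") as tgt:
        for ex in ds:
            # newlines inside an example would break the one-example-per-line format
            src.write(ex[src_field].replace("\n", " ") + "\n")
            tgt.write(ex[tgt_field].replace("\n", " ") + "\n")

save_split("xsum", "test", "xsum")
```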
Helper function for `run_eval.py`:
```bash
gen_test_hub_summ () {
    # Usage: gen_test_hub_summ <model> <data_dir> <save_dir> [extra run_eval.py flags]
    # Extra flags (e.g. --fp16 and --bs <batch_size>) are forwarded to run_eval.py.
    model=$1
    DATA_DIR=$2
    echo "$DATA_DIR"
    save_dir=$3
    mkdir -p "$save_dir"
    shift 3  # drop the three positional args; "$@" now holds the extra flags
    python run_eval.py "$model" "$DATA_DIR/test.source" "$save_dir/test_gens.txt" \
        --reference_path "$DATA_DIR/test.target" \
        --score_path "$save_dir/test_rouge.json" \
        --task summarization "$@"
}
```
Then, roughly:

```bash
cd examples/seq2seq
gen_test_hub_summ google/pegasus-{dataset} dataset {dataset}_results --bs 4
```
Leave the results, as well as any observations about truncation in the produced summaries, as a comment on this issue!
One possible reason for the replication gap is that our beam search logic differs from the original, causing 16% of the summaries to be truncated.
Fine-tuning with our finetuning code and `--max_target_length=142` partially fixes this issue:
- 43.23/21.29/31.3, 0.436 s/sample (released at sshleifer/dpx-cnn-16-4)
- 44.13/21.37/30.94, 1.4 s/sample, 0.2 ROUGE-2 behind published (sshleifer/pegasus-cnn-ft-v2)

sshleifer/distill-pegasus-xsum-16-4:

```json
{"rouge1": 44.942, "rouge2": 23.0412, "rougeL": 37.8579,
 "n_obs": 11333, "seconds_per_sample": 0.1972, "batch_size": 16}
```

Teacher metrics (I don't remember batch size):

```json
{"rouge1": 46.8773, "rouge2": 24.46, "rougeL": 39.1507,
 "n_obs": 11328, "seconds_per_sample": 0.3308}
```
I intend to post a writeup on distillation techniques at some point before Oct 15!
Re: replication, the best download strategy may be to start with https://github.com/google-research/pegasus/blob/master/pegasus/data/public_datasets_test.py and modify it.
CNN update:
The original Pegasus outputs have the `<n>` token at the beginning of sentences, whereas ours do not. The original Pegasus code replaces the newline symbol with `<n>`; PegasusTokenizer should probably do this too: https://github.com/huggingface/transformers/issues/7327

For CNNDM, I can get this score with the google/pegasus-cnn_dailymail model:
```
ROUGE-1:
rouge_1_f_score: 0.4436 with confidence interval (0.4413, 0.4459)
rouge_1_recall: 0.4825 with confidence interval (0.4797, 0.4853)
rouge_1_precision: 0.4368 with confidence interval (0.4339, 0.4395)
ROUGE-2:
rouge_2_f_score: 0.2145 with confidence interval (0.2120, 0.2170)
rouge_2_recall: 0.2323 with confidence interval (0.2297, 0.2350)
rouge_2_precision: 0.2124 with confidence interval (0.2097, 0.2150)
ROUGE-L:
rouge_l_f_score: 0.4141 with confidence interval (0.4118, 0.4165)
rouge_l_recall: 0.4501 with confidence interval (0.4474, 0.4530)
rouge_l_precision: 0.4079 with confidence interval (0.4051, 0.4106)
```
Script I run:
```bash
./run_eval.py google/pegasus-cnn_dailymail /home/ffajri/Data/huggingface/cnn_dm/test.source pred_cnndm_pegasus.txt \
    --reference_path /home/ffajri/Data/huggingface/cnn_dm/test.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --device cuda \
    --max_source_length 512 \
    --max_target_length 128 \
    --bs 4
```
I notice the initial R1 output from the transformers script is 43.xx, but I recalculate ROUGE (to get the scores above) as follows:
1. First, I replace `<n>` with `\n` in the decoding results (as you said above).
2. I don't use the gold summary provided by huggingface because its sentences are not separated by the newline character. I think it's necessary to separate sentences in the gold summary, so I use the original gold test set from See et al., 2017 to compute ROUGE.
3. I lowercase all decoded and gold summaries (though I'm not sure whether it really affects the ROUGE score).
4. I calculate ROUGE with pyrouge (not the ROUGE implementation in transformers); see the sketch after this list.
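A rough sketch of those steps (the file names and directory layout are made up; assumes pyrouge plus the underlying ROUGE-1.5.5 perl toolkit are installed and configured, and that predictions/references are stored one summary per line):

```python
# Hypothetical preprocessing + pyrouge scoring, following the steps above.
from pathlib import Path
from pyrouge import Rouge155

def to_rouge_format(summary: str) -> str:
    # steps 1 and 3: one sentence per line, lowercased
    # (handles either the <n> separator or a literal "\n" marker)
    return summary.replace("<n>", "\n").replace("\\n", "\n").lower()

pred_dir, ref_dir = Path("rouge_tmp/pred"), Path("rouge_tmp/ref")
pred_dir.mkdir(parents=True, exist_ok=True)
ref_dir.mkdir(parents=True, exist_ok=True)

preds = Path("pred_cnndm_pegasus.txt").read_text().splitlines()
golds = Path("see_etal_test.target").read_text().splitlines()  # step 2: See et al. gold summaries
for i, (pred, gold) in enumerate(zip(preds, golds)):
    (pred_dir / f"pred.{i}.txt").write_text(to_rouge_format(pred))
    (ref_dir / f"ref.{i}.txt").write_text(to_rouge_format(gold))

r = Rouge155()
r.system_dir = str(pred_dir)                  # decoded summaries, one file per example
r.model_dir = str(ref_dir)                    # gold summaries, one file per example
r.system_filename_pattern = r"pred.(\d+).txt"
r.model_filename_pattern = "ref.#ID#.txt"
print(r.convert_and_evaluate())               # prints ROUGE scores with confidence intervals
```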
Hope this helps with the fix.
Would you be willing to share a few lines of cnn_dm/test.source, pred_cnndm_pegasus.txt, and cnn_dm/test.target?
Thanks!
Hi, for inference I use the same test set from huggingface:
test.source
Marseille, France (CNN)The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that "so far no videos were used in the crash investigation." He added, "A person who has such a video needs to immediately give it to the investigators." ............
test.target
Marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports . Journalists at Bild and Paris Match are "very confident" the video clip is real, an editor says . Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says .
pred_cnndm_pegasus.txt (Result)
"A person who has such a video needs to immediately give it to the investigators," prosecutor says .<n>"It is a very disturbing scene," editor-in-chief of Bild online tells "Erin Burnett: Outfront"
Then I got R1 = 43.xx (as output by ./run_eval.py).
To get the R1 = 44.xx, I separately calculate ROUGE (pyrouge) with:
test.target from See et al., 2017
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .\njournalists at bild and paris match are '' very confident '' the video clip is real , an editor says .\nandreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .
_updated_ pred_cnndm_pegasus.txt
"a person who has such a video needs to immediately give it to the investigators," prosecutor says .\n"it is a very disturbing scene," editor-in-chief of bild online tells "erin burnett: outfront"
Both now have \n, which I think is necessary for calculating ROUGE.
We fixed our calculate_rouge_score to address the \n issue and now we are getting
44.31/21.53/41.15 for sshleifer/pegasus-cnn-ft-v2! Thanks for the help!
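For background, here is a small illustration (toy sentences, not the actual calculate_rouge patch) of why the \n handling matters when scoring with the rouge_score package: rougeLsum treats each newline-separated line as a sentence and takes a union LCS per sentence, so the same text scores differently depending on whether sentences are newline-separated.

```python
# Illustration of rougeLsum's newline sensitivity with the rouge_score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)

reference = "The prosecutor says no video was used.\nThe editor says the clip is real."
flat_pred = "The prosecutor says the clip is real. The editor says no video was used."
split_pred = flat_pred.replace(". ", ".\n")  # put each sentence on its own line

print(scorer.score(reference, flat_pred)["rougeLsum"].fmeasure)
print(scorer.score(reference, split_pred)["rougeLsum"].fmeasure)  # higher for the newline-split version here
```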
Updated the table in the issue description with the most recent results after the calculate_rouge fix.
Moving forward, questions about specific results should be asked on the forums or in a separate issue with @stas00, @patil-suraj, and @sshleifer tagged.