Dear @stas00 and whoever else is willing to help!
So far I have only checked pegasus' rouge scores on 2/12 datasets for which we have checkpoints.
For the other 10 datasets I either haven't tried or have tried briefly and gotten stuck.
The full scope of the project is that:
for each dataset:
1) There is an automated way to download the data, either from S3 or source. (To the extent possible, much of the logic in this script should eventually live in the datasets package).
2) we know our pegasus implementation's rouge score
2b) if our score is very different from the authors', we know whether that difference is due to data preprocessing, and if it is, we can preprocess the dataset similarly to the pegasus authors.
3) Our rouge score is within 0.3 Rouge2 of the reported (Authors) column below.
By far the most difficult part for each dataset is getting the data. If you can't, give up quickly and write a github issue somewhere.
I tried one approach to getting data: this script.
It worked for gigaword (I just haven't run the evaluation yet), but it failed for aeslc, and then I gave up.
Another, complementary approach would be to try to directly use the pegasus dataset code.
This will likely push preprocessing issues towards the back of the project (to when we try to send PRs to the datasets repo), but it might be better than using my script.
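To make that concrete, here is a rough sketch (not the actual pegasus dataset code) of dumping a TFDS summarization dataset into the `test.source`/`test.target` layout the eval script below expects; the dataset and feature names are assumptions and will vary per dataset:

```python
# Rough sketch, not the pegasus code itself. The dataset name and the
# "document"/"summary" feature names are assumptions and differ between datasets.
import os

import tensorflow_datasets as tfds


def tfds_to_seq2seq_files(name, split, out_dir, src_key="document", tgt_key="summary"):
    """Write one example per line into {split}.source / {split}.target."""
    os.makedirs(out_dir, exist_ok=True)
    ds = tfds.load(name, split=split)
    with open(os.path.join(out_dir, f"{split}.source"), "w") as src_f, open(
        os.path.join(out_dir, f"{split}.target"), "w"
    ) as tgt_f:
        for ex in tfds.as_numpy(ds):
            # keep each example on a single line
            src_f.write(ex[src_key].decode("utf-8").replace("\n", " ") + "\n")
            tgt_f.write(ex[tgt_key].decode("utf-8").replace("\n", " ") + "\n")


# e.g. tfds_to_seq2seq_files("multi_news", "test", "multi_news")
```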
When you have gotten a dataset, you can sanity-check it with (see note 1 about `--model_name`):

```bash
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py \
    --model_name google/pegasus-large \
    --save_dir xsum_generations \
    --data_dir xsum \
    --prefix test \
    --n_obs 100
```
Note 1: To avoid downloading all the checkpoints, you can just keep running pegasus-large and expect a rouge2 score in the high single digits or better; or you can change this to the relevant checkpoint.
Note 2: I am happy to run all the evals on newer hardware, very easy for me.
Note 3: We can do data sharing by getting you aws creds, or some other solution. Key is that I can download from command line, e.g. Google Drive + gdown.
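For example, something like this could fetch a shared Drive tarball (a minimal sketch; `<FILE_ID>` and the output name are placeholders, not a real share):

```python
# Minimal sketch of fetching a shared Google Drive tarball via gdown's Python API.
# <FILE_ID> and the output filename are placeholders.
import gdown

gdown.download("https://drive.google.com/uc?id=<FILE_ID>", "dataset.tar.gz", quiet=False)
```

and then untar it into a dir containing test.source / test.target.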
(arxiv and pubmed are available as `scientific_papers` in the datasets package.)
Here is a copy of the table we are trying to fill out in #6844 (I made a new issue to avoid spamming that one):
| dataset | Authors| This Repo|
| ---- | ----|----|
| xsum | 47.60/24.83/39.64| 46.87/24.46/39.15|
| cnn_dailymail | 44.16/21.56/41.30| see 1|
| newsroom | 45.07/33.39/41.28 | have .tar file|
| multi_news | 47.65/18.75/24.95| |
| gigaword | 39.65/20.47/36.76| 39.79/20.56/36.80|
| wikihow | 46.39/22.12/38.41 *| Asked Authors |
| reddit_tifu | 27.99/9.81/22.94|32.75/11.68/24.97|
| big_patent |52.29/33.08/41.66 *| |
| arxiv | 44.21/16.95/25.67| |
| pubmed | 45.97/20.15/28.25| |
| aeslc | 37.68/21.25/36.51|37.1/21.4/35.94|
| billsum | 59.67/41.58/47.59|54.99/37.43/43.07|
Originally from the mixed & stochastic column of this table.
This was really long, and probably disorganized, so feel free to ask clarifying questions here or on slack!
cc @stas00
1) I got similar scores on cnn-dailymail by finetuning the authors' model on our dataset for a bit.
2) reddit_tifu: added --min_length 32
Yes, please!
I could work on getting the datasets; replicating will be hard (compute!!!). I have shared wikihow and arxiv on the forum.
I will start working on this over the next few days, so let's not duplicate efforts; let's claim here which ones we are working on.
@stas00
The following remaining datasets are available in the datasets lib:
- multi_news
- reddit_tifu
- billsum
- aeslc
I could write a script to download and process these.
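A rough sketch of what such a conversion could look like, using aeslc as an example (the column names are my guess and should be verified; splits/preprocessing may also not match what pegasus used):

```python
# Rough sketch for dumping an HF dataset into the test.source / test.target layout.
# Column names ("email_body" / "subject_line") are assumed for aeslc and will differ
# for the other datasets; some may also need the pegasus splits recreated.
import os

from datasets import load_dataset


def dump_test_split(out_dir="aeslc"):
    os.makedirs(out_dir, exist_ok=True)
    ds = load_dataset("aeslc", split="test")
    with open(f"{out_dir}/test.source", "w") as src, open(f"{out_dir}/test.target", "w") as tgt:
        for ex in ds:
            # one example per line, newlines flattened
            src.write(ex["email_body"].replace("\n", " ").strip() + "\n")
            tgt.write(ex["subject_line"].replace("\n", " ").strip() + "\n")
```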
Do you mean to say that these 4 you listed are already in hf's datasets, and so we only need to download and convert these, right?
So the others that you haven't listed and Sam hasn't already processed still need to be sorted out from scratch, correct?
My plan was to start with wikihow as you shared some instructions at https://discuss.huggingface.co/t/wikihow-dataset-preprocessing/1413
> And so we only need to download and convert these, right?
Yes, these 4 are already in hf's datasets; we just need to convert them and do some pre-processing first.
I have shared arxiv as well, but that needs to be pre-processed.
For newsroom we need to request it from the author, so I'm not sure if we are allowed to share it directly.
If it's very heavy compute- and disk-space-wise, we could write scripts for small samples and then ask Sam or somebody at HF to run on the full data, since they probably have access to better hardware than us.
arxiv is huge (3.9 GB or so); the rest we can handle on Colab, I guess.
OK, I will start with wikihow and in parallel will inquire w/ the author of newsroom wrt permission, since the latter could take time.
And then do arxiv afterwards.
So do you want to work on the 4 you listed, meanwhile? Either way works for me, so please don't hesitate to choose what works best for you.
Yes, I'll take those 4 :)
newsroom can also be consumed through datasets, but it needs a manual download.
yes, I was just looking at https://huggingface.co/datasets/newsroom but the information is wrong:
```python
from datasets import load_dataset
dataset = load_dataset("newsroom")
```

```
Downloading: 5.21kB [00:00, 1.45MB/s]
Downloading: 2.68kB [00:00, 844kB/s]
Using custom data configuration default
Downloading and preparing dataset newsroom/default (download: Unknown size, generated: 4.94 GiB, post-processed: Unknown size, total: 4.94 GiB) to /home/stas/.cache/huggingface/datasets/newsroom/default/1.0.0/4b405ccd64e15f685065870ea563a1e6a034d1bd269a5427f40146d81549095e...
Traceback (most recent call last):
  File "x", line 3, in <module>
    dataset = load_dataset("newsroom")
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/builder.py", line 453, in download_and_prepare
    assert (
AssertionError: The dataset newsroom with config default requires manual data.
Please follow the manual download instructions: You should download the dataset from http://lil.datasets.cornell.edu/newsroom/
The webpage requires registration.
To unzip the .tar file run `tar -zxvf complete.tar`. To unzip the .gz files
run `gunzip train.json.gz` , ...
After downloading, please put the files under the following names
dev.jsonl, test.jsonl and train.jsonl in a dir of your choice,
which will be used as a manual_dir, e.g. `~/.manual_dirs/newsroom`
Newsroom can then be loaded via:
`datasets.load_dataset("newsroom", data_dir="~/.manual_dirs/newsroom")`.
.
Manual data can be loaded with `datasets.load_dataset(newsroom, data_dir='<path/to/manual/data>')
```
There is no such thing as http://lil.datasets.cornell.edu/newsroom/ - it returns a 404.
This is not the first bogus dataset in datasets.
We need to request it from here http://lil.nlp.cornell.edu/newsroom/download/index.html
Geesh, this one too: https://github.com/lil-lab/newsroom also links to a 404: https://summari.es/download/
Hmm, it looks like perhaps somebody at HF should file this form then, correct?
http://lil.nlp.cornell.edu/newsroom/download/index.html -> https://cornell.qualtrics.com/jfe/form/SV_6YA3HQ2p75XH4IR
We can't use our names to ask for permission for the dataset to be used by an open source project.
@sshleifer?
scraping newsroom is hard! Better to request it.
I had requested it; I got the link after a month, and by the time I saw the mail it had already expired 😂
So it would be better if someone from HF requests it; they will probably receive it faster.
We definitely shouldn't scrape it, since we won't be able to use it anyway w/o their permission. So yes, @sshleifer, please help us out here.
Helper scripts for pubmed:
- https://github.com/armancohan/long-summarization
- https://github.com/kedz/summarization-datasets
Here are the results of the eval on the wikihow data you shared, @patil-suraj.
This is on dual Titan X:

```
{'rouge1': 23.7695, 'rouge2': 5.3349, 'rougeL': 15.6991, 'rougeLsum': 16.7567, 'n_obs': 100, 'seconds_per_sample': 2.433, 'n_gpus': 2}
{'rouge1': 24.6291, 'rouge2': 5.7999, 'rougeL': 15.6812, 'rougeLsum': 16.6907, 'n_obs': 11996, 'seconds_per_sample': 2.505, 'n_gpus': 2}
```

So that gives us 24.63/5.80/16.69, which is far, far away from 46.39/22.12/38.41.
The command was:

```bash
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py --model_name google/pegasus-large \
    --save_dir xsum_generations --data_dir /hf/wikihow/wikihow --prefix test --bs 4
```
That's scarily low. Do you think there is an issue with the dataset?
@stas00, @sshleifer
I wrote a helper script to download and save summarization datasets:
https://github.com/patil-suraj/summarization_datasets
Currently includes aeslc, billsum and reddit_tifu, rest should be easy to add.
Processing scripts are taken from the official dataset repos; split information is copied from the pegasus repo.
Enjoy!
@stas00 Try using google/pegasus-wikihow as the model; you can do --n_obs 100 now that we are calibrated. I should have specified that in the spec. We want to test the fine-tuned model.
Would also be interested in knowing whether --max_source_length 512 changes anything.
(You can see the expected params that should be checked into each config here.) In those triples, length_penalty and max_length are generation params that should be reflected in model.config; max_position_embeddings should only be reflected in tokenizer.model_max_length (we didn't save static pos embeddings, I don't think).
```bash
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py --model_name google/pegasus-wikihow \
    --save_dir xsum_generations --data_dir /hf/wikihow/wikihow --prefix test --n_obs 100 --bs 4
```

```
{'rouge1': 21.4782, 'rouge2': 8.7003, 'rougeL': 18.9314, 'rougeLsum': 18.8476, 'n_obs': 100, 'seconds_per_sample': 1.1432, 'n_gpus': 2}
```
There is a slight improvement on all but rouge1 w/ google/pegasus-wikihow
It also appears to be much faster!
On 1000 objects the performance drops:

```
{'rouge1': 20.7939, 'rouge2': 8.4804, 'rougeL': 18.12, 'rougeLsum': 18.0778, 'n_obs': 1000, 'seconds_per_sample': 0.3459, 'n_gpus': 2}
```

My intuition tells me that either the dataset has some broken data in it, or all of it has some issues, since we aren't getting above the score from 100 objects.
```bash
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py --model_name google/pegasus-wikihow \
    --save_dir xsum_generations --data_dir /hf/wikihow/wikihow --prefix test --n_obs 100 --bs 4 \
    --max_source_length 512
```

```
{'rouge1': 21.5527, 'rouge2': 8.6861, 'rougeL': 18.9145, 'rougeLsum': 18.9772, 'n_obs': 100, 'seconds_per_sample': 0.5674, 'n_gpus': 2}
```

Looks worse on 2 scores, better on the other 2.
> Do you think there is an issue with the dataset?
I didn't get a chance to study it yet - just had the time to run the eval.
We need a little script to convert the json dumps into a nice md table so that it's easier to read the results, like run_eval_search.py does.
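Something along these lines would probably do (a minimal sketch; it assumes one metric dict per line, as in the outputs pasted above):

```python
# Minimal sketch: turn per-run metric dicts (one per line on stdin) into a markdown table.
import ast
import sys


def to_md_table(lines):
    # ast.literal_eval handles both the python-repr dicts pasted above and plain JSON-ish dicts
    rows = [ast.literal_eval(line) for line in lines if line.strip()]
    cols = sorted({key for row in rows for key in row})
    out = ["| " + " | ".join(cols) + " |", "|" + " --- |" * len(cols)]
    for row in rows:
        out.append("| " + " | ".join(str(row.get(col, "")) for col in cols) + " |")
    return "\n".join(out)


if __name__ == "__main__":
    # usage: python results_to_md.py < results.txt
    print(to_md_table(sys.stdin.readlines()))
```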
- newsroom: filled out the form
- wikihow: asked the authors (https://github.com/google-research/pegasus/issues/111). If @stas00 could paste 1 article, 1 target, and 1 generation as a comment on that issue, it would be helpful.
- gigaword: Done
@patil-suraj if you have preprocessed links you want me to run evaluation on, feel free to post/slack them and I can run the eval. My preference would be to gdown/unzip a directory that includes only:

```
data/test.source
data/test.target
```
I started a sub-section of my porting repo to gather scripts and instructions for building these datasets:
https://github.com/stas00/porting/tree/master/datasets/pegasus
So for completed things please either submit a PR or send me the files and I will add them there. Whatever is more efficient for you.
p.s. I'm doing it in a separate repo, since @sshleifer doesn't think they should go into the main repo (I think they should, but this can be fixed later as long as we have them).
Here is a little helper util that shows the differences between two strings - useful when matching preprocessed data.

```python
import difflib


def str_compare(a, b):
    """
    If strings are mismatched, print the diff with context.
    Returns True if strings match, False otherwise.
    Adapted from https://stackoverflow.com/a/17904977/9201239
    """
    match = True
    if len(a) != len(b):
        print(f"length mismatch: a={len(a)}, b={len(b)}")

    def context(s, i):
        # up to 10 chars of context on either side of position i
        start = max(i - 10, 0)
        end = min(i + 10, len(s) - 1)
        return s[start:end]

    for i, s in enumerate(difflib.ndiff(a, b)):
        if s[0] == " ":
            continue
        elif s[0] == "-":
            match = False
            print(f'Delete "{s[-1]}" from position {i}, ctx=[{context(a, i)}]')
        elif s[0] == "+":
            match = False
            print(f'Add "{s[-1]}" to position {i}, ctx=[{context(a, i)}]')
    return match
```
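For example (made-up strings):

```python
# reports the deleted space with its surrounding context and returns False
str_compare("NEWLINE_CHAR is replaced", "NEWLINE_CHARis replaced")
```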
I'm trying to reproduce the multi-news results, but it seems the ROUGE scores are not even in the ballpark of the original report or the ones here.
The command I used was:

```bash
python -m torch.distributed.launch --nproc_per_node=4 run_distributed_eval.py --model_name google/pegasus-multi_news --data_dir multi-news/processed/hf/ --save_dir output_data/ --bs 6
```

```
{"rouge1": 44.7752, "rouge2": 16.1437, "rougeL": 22.7593, "rougeLsum": 40.5531, "n_obs": 5622, "seconds_per_sample": 0.6931, "n_gpus": 4}
```
I downloaded the data from the original authors of Multi-News: link.
I'm not sure if the discrepancy is due to the preprocessing, but to my understanding, pegasus only replaces NEWLINE_CHAR with \n. Could someone give some hints?
Hi @kylie-box ,
pegasus used the datasets provided by the tfds library. There is some discrepancy between the processing of the original data and the data provided by tfds. We also faced a similar problem initially.
Use the scripts in this repo (which @stas00 built for reproducing the results) to download and process the datasets.
Following up on @patil-suraj's comment - specifically, this is for multi_news:
https://github.com/stas00/porting/tree/master/datasets/pegasus/multi_news
Follow the instructions under process.txt.
The key is not to do any preprocessing other than handling newlines.
Thanks, @stas00 and @patil-suraj! I was able to reproduce the results using data from tfds and their preprocessing.
Awesome - glad to hear it worked. We uploaded the eval tarballs as well; see https://github.com/stas00/porting/tree/master/datasets/pegasus/