Transformers: Project: Gather summarization datasets and try to replicate pegasus results on them

Created on 7 Oct 2020 · 34 comments · Source: huggingface/transformers

Dear @stas00 and whoever else is willing to help!

So far I have only checked pegasus' rouge scores on 2/12 datasets for which we have checkpoints.
For the other 10 datasets I either haven't tried or have tried briefly and gotten stuck.

The full scope of the project is that, for each dataset:

1) There is an automated way to download the data, either from S3 or source. (To the extent possible, much of the logic in this script should eventually live in the datasets package).
2) We know our pegasus implementation's rouge score.
2b) If our score is very different from the authors', we know whether that difference is due to data preprocessing, and if it is, we can preprocess the dataset similarly to the pegasus authors.
3) Our rouge score is within 0.3 Rouge2 of the reported score (the Authors column below).

Steps

Getting Data

By far the most difficult part of each project is getting the dataset (and, if you can't, giving up quickly and filing a github issue somewhere).
I tried one approach to getting data: this script.
It worked for gigaword (I just haven't run the evaluation), but it failed for aeslc and I gave up there.

Another complementary approach would be to try to directly use the pegasus dataset code

This will likely push preprocessing issues towards the back of the project (i.e. to when we try to send PRs to the datasets repo), but might be better than using my script.

After you get data

Once you have a dataset, you can sanity check it with (--model_name per note 1 below):

python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name google/pegasus-large \
    --save_dir xsum_generations \
    --data_dir xsum \
    --prefix test \
    --n_obs 100

Note 1: to avoid downloading all the checkpoints, you can just keep running pegasus-large and expect a high-single-digits or better rouge2 score, or you can change this to the relevant checkpoint.
Note 2: I am happy to run all the evals on newer hardware, very easy for me.
Note 3: We can do data sharing by getting you aws creds, or some other solution. The key is that I can download from the command line, e.g. Google Drive + gdown.

Misc thoughts:

  • arxiv and pubmed are listed under scientific_papers in the datasets package (see the load sketch after this list).
  • This is really 10 projects (one per dataset, 2 of which I've started). If I were you, I would ignore those 2 and start on a few of the others.
  • If a dataset only has train/test or train/val or some other splits, see how the pegasus authors did the split.
  • Partial credit is valuable!
  • this could easily have been an issue for the datasets project rather than the transformers project.
  • There is no reason to merge PRs quickly for this project, but eventually we want a (much better) download_summ_dataset.py script or instructions for using other libs to accomplish the same outcome.
  • Will be good for both of us to learn the datasets internals.
  • Raw billsum has multi-line articles, which breaks everything :( (we could try to support raw nlp datasets in our DataLoader).
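Re the scientific_papers note above, a minimal load sketch (config names per the datasets package; the field names in the comment are from memory, so double-check them):

from datasets import load_dataset

# arxiv and pubmed are configs of the scientific_papers dataset
arxiv = load_dataset("scientific_papers", "arxiv", split="test")
pubmed = load_dataset("scientific_papers", "pubmed", split="test")

# expecting fields along the lines of 'article', 'abstract', 'section_names'
print(arxiv[0].keys())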

Here is a copy of the table we are trying to fill out in #6844 : (I made a new issue to avoid spamming that one)

| dataset | Authors| This Repo|
| ---- | ----|----|
| xsum | 47.60/24.83/39.64| 46.87/24.46/39.15|
| cnn_dailymail | 44.16/21.56/41.30| see 1|
| newsroom | 45.07/33.39/41.28 | have .tar file|
| multi_news | 47.65/18.75/24.95| |
| gigaword | 39.65/20.47/36.76| 39.79/20.56/36.80|
| wikihow | 46.39/22.12/38.41 *| Asked Authors |
| reddit_tifu | 27.99/9.81/22.94|32.75/11.68/24.97|
| big_patent |52.29/33.08/41.66 *| |
| arxiv | 44.21/16.95/25.67| |
| pubmed | 45.97/20.15/28.25| |
| aeslc | 37.68/21.25/36.51|37.1/21.4/35.94|
| billsum | 59.67/41.58/47.59|54.99/37.43/43.07|

Originally from the mixed & stochastic column of this table.

This was really long, and probably disorganized, so feel free to ask clarifying questions here or on slack!
cc @stas00

1) I got similar scores on cnn-dailymail by finetuning the authors' model on our dataset for a bit.
2) reddit_tifu: added --min_length 32

Labels: Help wanted, pegasus

All 34 comments

yes, please

I could work on getting the datasets; replicating will be hard (compute!!!). I have shared wikihow and arxiv on the forum.

I will start working on this over the next few days, so let's not duplicate the efforts and claim here which ones we are working on.

@stas00

The following remaining datasets are available in the datasets lib:

- multi_news
- reddit_tifu
- billsum
- aeslc

I could write a script to download and process these.
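Roughly, something like this could be a starting point (the source/target field names per dataset are from memory and need checking against the datasets docs, reddit_tifu likely needs its "long" config, and the pegasus split handling still has to be layered on top):

from pathlib import Path
from datasets import load_dataset

# assumed (dataset -> (source field, target field)) mapping - verify before use
FIELDS = {
    "multi_news": ("document", "summary"),
    "billsum": ("text", "summary"),
    "aeslc": ("email_body", "subject_line"),
    "reddit_tifu": ("documents", "tldr"),
}

def dump_split(name, split="test", out_dir="data", config=None):
    """Write a datasets split into the <split>.source / <split>.target layout the eval scripts expect."""
    ds = load_dataset(name, config, split=split)
    src_field, tgt_field = FIELDS[name]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / f"{split}.source", "w") as src, open(out / f"{split}.target", "w") as tgt:
        for ex in ds:
            # one example per line; embedded newlines would break the format (cf. the billsum note above)
            src.write(ex[src_field].replace("\n", " ").strip() + "\n")
            tgt.write(ex[tgt_field].replace("\n", " ").strip() + "\n")

# e.g. dump_split("billsum"); reddit_tifu needs config="long" and its own split handling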

Do you mean to say that these 4 you listed are already in hf's datasets, and so we only need to download and convert these, right?

So the others that you haven't listed and Sam hasn't already processed still need to be sorted out from scratch, correct?

My plan was to start with wikihow as you shared some instructions at https://discuss.huggingface.co/t/wikihow-dataset-preprocessing/1413

And so we only need to download and convert these, right?

Yes, these 4 are already in hf's datasets; we just need to convert them and do some pre-processing first.

I have shared arxiv as well but that needs to be pre-processed.

for newsroom we need to request it from the author, so I'm not sure if we are allowed to share it directly.

If it's very heavy compute- and disk-space-wise, we could write scripts for small samples and then ask Sam or somebody at HF to run them on the full data, since they probably have access to better hardware than us.

arxiv is huge (3.9 GB or so); the rest we can handle on colab, I guess.

OK, I will start with wikihow and in parallel will inquire w/ the author of newsroom wrt permission, since the latter could take time.

And then do arxiv afterwards.

So do you want to work on the 4 you listed, meanwhile? Either way works for me so please don't hesitate to choose what works the best for you.

Yes, I'll take those 4 :)

newsroom can also be consumed through datasets but needs a manual download.

yes, I was just looking at https://huggingface.co/datasets/newsroom but the information is wrong:

from datasets import load_dataset
dataset = load_dataset("newsroom")
Downloading: 5.21kB [00:00, 1.45MB/s]
Downloading: 2.68kB [00:00, 844kB/s]
Using custom data configuration default
Downloading and preparing dataset newsroom/default (download: Unknown size, generated: 4.94 GiB, post-processed: Unknown size, total: 4.94 GiB) to /home/stas/.cache/huggingface/datasets/newsroom/default/1.0.0/4b405ccd64e15f685065870ea563a1e6a034d1bd269a5427f40146d81549095e...
Traceback (most recent call last):
  File "x", line 3, in <module>
    dataset = load_dataset("newsroom")
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/builder.py", line 453, in download_and_prepare
    assert (
AssertionError: The dataset newsroom with config default requires manual data.
 Please follow the manual download instructions:   You should download the dataset from http://lil.datasets.cornell.edu/newsroom/
  The webpage requires registration.
  To unzip the .tar file run `tar -zxvf complete.tar`. To unzip the .gz files
  run `gunzip train.json.gz` , ...
  After downloading, please put the files under the following names
  dev.jsonl, test.jsonl and train.jsonl in a dir of your choice,
  which will be used as a manual_dir, e.g. `~/.manual_dirs/newsroom`
  Newsroom can then be loaded via:
  `datasets.load_dataset("newsroom", data_dir="~/.manual_dirs/newsroom")`.
  .
 Manual data can be loaded with `datasets.load_dataset(newsroom, data_dir='<path/to/manual/data>')

No such thing as http://lil.datasets.cornell.edu/newsroom/ - getting 404.

This is not the first bogus dataset in datasets.

Hmm, it looks like somebody at HF should file this form then, correct?
http://lil.nlp.cornell.edu/newsroom/download/index.html -> https://cornell.qualtrics.com/jfe/form/SV_6YA3HQ2p75XH4IR
We can't use our own names to ask for permission for the dataset to be used by an open source project.
@sshleifer?

scraping newsroom is hard! Better to request it.

I had requested it, I got the link after a month and by the time I saw the mail it was already expired 😂

So it would be better if someone from HF requests it; they will probably receive it faster.

We definitely shouldn't scrape it, since we won't be able to use it anyway w/o their permission. So yes, @sshleifer, please help us out here.

here are the results of eval on the wikihow data you shared, @patil-suraj

This is on dual Titan X cards:

  • sample of 100, run time: 0:03:05
    {'rouge1': 23.7695, 'rouge2': 5.3349, 'rougeL': 15.6991, 'rougeLsum': 16.7567, 'n_obs': 100, 'seconds_per_sample': 2.433, 'n_gpus': 2}
  • full, run time: 8:19:35
    {'rouge1': 24.6291, 'rouge2': 5.7999, 'rougeL': 15.6812, 'rougeLsum': 16.6907, 'n_obs': 11996, 'seconds_per_sample': 2.505, 'n_gpus': 2}

So that gives us 24.63/5.80/16.69 which is far far away from 46.39/22.12/38.41

The command was:

python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py --model_name google/pegasus-large \
--save_dir xsum_generations --data_dir /hf/wikihow/wikihow --prefix test --bs 4

That's scarily low. Do you think there is an issue with the dataset?

@stas00 , @sshleifer
Wrote a helper script to download and save summ datasets
https://github.com/patil-suraj/summarization_datasets

Currently includes aeslc, billsum and reddit_tifu, rest should be easy to add.

Processing scripts are taken from the official dataset repos; split information is copied from the pegasus repo.

Enjoy!

@stas00 Try using google/pegasus-wikihow as the model; you can do --n_obs 100 now that we are calibrated. I should have specified that in the spec. We want to test the fine-tuned model.

I would also be interested in knowing whether --max_source_length 512 changes anything.
(You can see the expected params that should be checked into each config here. In those triples, length_penalty and max_length are generation params that should be reflected in model.config, while max_position_embeddings should only be reflected in tokenizer.model_max_length; I don't think we saved static positional embeddings.)
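For reference, a quick way to eyeball those values on a checkpoint (just AutoConfig/AutoTokenizer):

from transformers import AutoConfig, AutoTokenizer

name = "google/pegasus-wikihow"
config = AutoConfig.from_pretrained(name)
tok = AutoTokenizer.from_pretrained(name)

# generation params live in the model config, the input budget in the tokenizer
print("max_length:", config.max_length)
print("length_penalty:", config.length_penalty)
print("tokenizer.model_max_length:", tok.model_max_length)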

google/pegasus-wikihow

python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py --model_name google/pegasus-wikihow \
--save_dir xsum_generations --data_dir /hf/wikihow/wikihow --prefix test  --n_obs 100 --bs 4
{'rouge1': 21.4782, 'rouge2': 8.7003, 'rougeL': 18.9314, 'rougeLsum': 18.8476, 'n_obs': 100, 'seconds_per_sample': 1.1432, 'n_gpus': 2}

There is a slight improvement on all but rouge1 w/ google/pegasus-wikihow

It also appears to be much faster!

On 1000 objects the performance drops:

{'rouge1': 20.7939, 'rouge2': 8.4804, 'rougeL': 18.12, 'rougeLsum': 18.0778, 'n_obs': 1000, 'seconds_per_sample': 0.3459, 'n_gpus': 2}

My intuition tells me that either the dataset has some broken data in it, or all of it has some issue, since we aren't getting above the score from 100 objects.

--max_source_length 512

python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py --model_name google/pegasus-wikihow \
--save_dir xsum_generations --data_dir /hf/wikihow/wikihow --prefix test  --n_obs 100 --bs 4 \
--max_source_length 512
{'rouge1': 21.5527, 'rouge2': 8.6861, 'rougeL': 18.9145, 'rougeLsum': 18.9772, 'n_obs': 100, 'seconds_per_sample': 0.5674, 'n_gpus': 2}

Looks worse on 2 scores, better on the other 2.

Do you think there is an issue with the dataset?

I didn't get a chance to study it yet - just had the time to run the eval.

We need a little script to convert the json dumps into a nice md table so that it's easier to read the results, like run_eval_search.py does.
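A throwaway version of that might look like this (it assumes each run saved a flat metrics dict like the ones pasted above; the file naming is hypothetical):

import json
import sys

def to_md_table(paths):
    """Render one or more metrics JSON dumps as a markdown table."""
    keys = ["rouge1", "rouge2", "rougeL", "rougeLsum", "n_obs", "seconds_per_sample"]
    lines = ["| run | " + " | ".join(keys) + " |", "|----|" + "----|" * len(keys)]
    for path in paths:
        with open(path) as f:
            metrics = json.load(f)
        lines.append("| " + path + " | " + " | ".join(str(metrics.get(k, "")) for k in keys) + " |")
    return "\n".join(lines)

if __name__ == "__main__":
    # e.g. python metrics_to_md.py wikihow_100/metrics.json wikihow_full/metrics.json
    print(to_md_table(sys.argv[1:]))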

newsroom: filled out the form.
wikihow: asked the authors at https://github.com/google-research/pegasus/issues/111. If @stas00 could paste 1 article, 1 target and 1 generation as a comment on that issue, it would be helpful.

gigaword: Done

@patil-suraj if you have links to preprocessed data you want me to evaluate, feel free to post/slack them and I can run the eval. My preference would be to gdown/unzip a directory that includes only:

data/test.source
data/test.target
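For the fetching side, a possible shape of that (gdown's Python API; the file id and archive name are placeholders for whatever gets shared):

import tarfile
import gdown  # pip install gdown

file_id = "FILE_ID"  # placeholder Google Drive id
gdown.download(f"https://drive.google.com/uc?id={file_id}", "dataset.tgz", quiet=False)
with tarfile.open("dataset.tgz", "r:gz") as tar:
    tar.extractall(".")  # should leave data/test.source and data/test.target in place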

I started a sub-section of my porting repo to gather script and instructions for building these datasets:
https://github.com/stas00/porting/tree/master/datasets/pegasus

So for completed things please either submit a PR or send me the files and I will add them there. Whatever is more efficient for you.

p.s. I'm doing it in a separate repo, since @sshleifer doesn't think they should go into the main repo (I think they should, but this can be fixed later as long as we have them).

Here is a little helper util that shows the differences between two strings - useful when matching pre-processed data.

import difflib


def str_compare(a, b):
    """
    If the strings mismatch, print each differing character with some context around it.
    Returns True if the strings match, False otherwise.
    Adapted from https://stackoverflow.com/a/17904977/9201239
    """
    match = True
    if len(a) != len(b):
        print(f"length mismatch: a={len(a)}, b={len(b)}")

    def context(s, i):
        # up to 10 characters on either side of position i, clamped to the string bounds
        start = max(i - 10, 0)
        end = min(i + 10, len(s) - 1)
        return s[start:end]

    for i, s in enumerate(difflib.ndiff(a, b)):
        if s[0] == " ":
            continue
        elif s[0] == "-":
            match = False
            print(f'Delete "{s[-1]}" from position {i}, ctx=[{context(a, i)}]')
        elif s[0] == "+":
            match = False
            print(f'Add "{s[-1]}" to position {i}, ctx=[{context(a, i)}]')

    return match
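For example:

# spotting a pre-processing difference between two variants of the same text
a = "He said: NEWLINE_CHAR the end."
b = "He said: \n the end."
print(str_compare(a, b))  # prints the per-character diffs with context, then False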

I'm trying to reproduce the multi-news results. But it seems the ROUGE scores are not even in the ballpark of the original report or the ones in here.

The command I used was
python -m torch.distributed.launch --nproc_per_node=4 run_distributed_eval.py --model_name google/pegasus-multi_news --data_dir multi-news/processed/hf/ --save_dir output_data/ --bs 6

{"rouge1": 44.7752, "rouge2": 16.1437, "rougeL": 22.7593, "rougeLsum": 40.5531, "n_obs": 5622, "seconds_per_sample": 0.6931, "n_gpus": 4}

I downloaded the data from the original authors of Multi-News: link.

I'm not sure if the discrepancy is due to the preprocessing, but to my understanding, pegasus only replaces NEWLINE_CHAR with \n. Could someone give some hints?

Hi @kylie-box ,
pegasus used the datasets provided by the tfds library. There is some discrepancy in processing between the original data and the data provided by tfds. We also faced a similar problem initially.

Use the scripts in this repo (which @stas00 built for reproducing the results) to download and process the datasets.

Following up on @patil-suraj's comment - specifically, this is the one for multi_news:
https://github.com/stas00/porting/tree/master/datasets/pegasus/multi_news
Follow the instructions in process.txt.

The key is not to do any preprocessing, other than newlines.
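i.e. roughly just this (NEWLINE_CHAR being the separator token mentioned in the question above; this is only an illustration, see process.txt for the exact recipe):

def clean_line(text: str) -> str:
    # newline-only cleanup: restore real newlines, leave everything else untouched
    return text.replace("NEWLINE_CHAR", "\n").strip()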

Thanks, @stas00 and @patil-suraj! I was able to reproduce the results using data from tfds and their preprocessing.

Awesome - glad to hear it worked. We uploaded the eval tarballs as well, see https://github.com/stas00/porting/tree/master/datasets/pegasus/

