Thanks for the great effort toward releasing BART 😃
I'm currently having some difficulties reproducing BART results on the CNN/DM dataset.
I followed the README to test the bart.large.cnn model.
I obtained the following results:
1 ROUGE-1 Average_R: 0.50475 (95%-conf.int. 0.50197 - 0.50743)
1 ROUGE-1 Average_P: 0.39349 (95%-conf.int. 0.39102 - 0.39598)
1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)
1 ROUGE-2 Average_R: 0.23609 (95%-conf.int. 0.23327 - 0.23884)
1 ROUGE-2 Average_P: 0.18497 (95%-conf.int. 0.18268 - 0.18712)
1 ROUGE-2 Average_F: 0.20195 (95%-conf.int. 0.19956 - 0.20425)
1 ROUGE-L Average_R: 0.46458 (95%-conf.int. 0.46175 - 0.46720)
1 ROUGE-L Average_P: 0.36244 (95%-conf.int. 0.35999 - 0.36480)
1 ROUGE-L Average_F: 0.39678 (95%-conf.int. 0.39445 - 0.39898)
This is more than 1 point lower than the expected results.
Any advice on how to reproduce the results is welcome.
@ngoyal2707 @yinhanliu
I personally think it comes from the dataset processing.
I think it's commendable to use the script from abisee's repository so that results are comparable, but we still need to modify that script to make it work for BART, as mentioned here.
_And these modifications are still obscure._
If it helps, here are the first 3 _tokenized_ predictions:
French prosecutor says he is not aware of any video footage from on board the plane . French gendarmerie spokesman says cell phones have been collected at the crash site , but have n't been exploited . Two magazines claim to have found a cell phone video showing the harrowing final seconds of the crash .
Palestine officially becomes the 123rd member of the international criminal court . The move gives the court jurisdiction over alleged crimes in palestinian territories . israel and the united states , neither of which is an icc member , opposed the palestinians ' efforts to join the body .
Amnesty International releases its annual report on the use of the death penalty around the world . The report finds `` positive developments '' worldwide , with most regions seeming to show reductions in the number of executions . It also highlights a marked increase in the number of people sentenced to death in 2014 , an increase of 28 % .
And here are the first 3 _tokenized_ gold summaries:
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .
journalists at bild and paris match are '' very confident '' the video clip is real , an editor says . andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .
membership gives the icc jurisdiction over alleged crimes committed in palestinian territories since last june . israel and the united states opposed the move , which could open the door to war crimes investigations against israelis .
amnesty 's annual death penalty report catalogs encouraging signs , but setbacks in numbers of those sentenced to death . organization claims that governments around the world are using the threat of terrorism to advance executions . the number of executions worldwide has gone down by almost 22 % compared with 2013 , but death sentences up by 28 % .
I finally managed to reproduce the results:
1 ROUGE-1 Average_R: 0.51378 (95%-conf.int. 0.51119 - 0.51657)
1 ROUGE-1 Average_P: 0.40167 (95%-conf.int. 0.39915 - 0.40421)
1 ROUGE-1 Average_F: 0.44017 (95%-conf.int. 0.43799 - 0.44250)
1 ROUGE-2 Average_R: 0.24613 (95%-conf.int. 0.24322 - 0.24898)
1 ROUGE-2 Average_P: 0.19281 (95%-conf.int. 0.19035 - 0.19519)
1 ROUGE-2 Average_F: 0.21093 (95%-conf.int. 0.20841 - 0.21338)
1 ROUGE-L Average_R: 0.47614 (95%-conf.int. 0.47347 - 0.47884)
1 ROUGE-L Average_P: 0.37256 (95%-conf.int. 0.37004 - 0.37511)
1 ROUGE-L Average_F: 0.40812 (95%-conf.int. 0.40581 - 0.41042)
My mistake was indeed in the data processing: I was lowercasing the data, but you should not.
Still slightly lower than the official results (44.16 / 21.28 / 40.90), but nothing abnormal!
Hi @Colanim ,
Thank you so much for sharing details!
I wonder how long it takes to fine-tune BART on the CNN/DM dataset.
Would you share some details on the training time and hardware?
Thanks!!
I didn't train BART: the authors haven't released the training code (yet).
I just ran the checkpoint finetuned on CNN/DM and got the results on the test set ^^
I could run the inference on a single GPU (8 GB).
Thanks Colanim!
@Colanim Thanks for your efforts on reproduction.
Yes, we train models on cased data, so having cased input is important at test time.
Some subtle differences that could explain the small gap you are seeing:
1) Change here so that there is no space.
2) We remove the '(CNN)' keyword from the article by adding the following lines here:

```python
if cnn and article[:5] == '(CNN)':
    article = article[5:]
```
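Putting the two preprocessing points together (keep the original casing, strip the leading '(CNN)' tag), a minimal sketch might look like this; the function name and `cnn` flag are hypothetical, not part of the fairseq code:

```python
def clean_article(article: str, cnn: bool = True) -> str:
    """Strip a leading '(CNN)' tag when present; keep the original casing."""
    if cnn and article[:5] == '(CNN)':
        article = article[5:]
    # Note: no .lower() here -- BART was trained on cased data,
    # so lowercasing the input hurts the scores.
    return article.strip()
```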
@wonjininfo Training should take around 2 hours.
With the changes pointed out by @ngoyal2707, here are my results:
1 ROUGE-1 Average_R: 0.51395 (95%-conf.int. 0.51137 - 0.51660)
1 ROUGE-1 Average_P: 0.40532 (95%-conf.int. 0.40290 - 0.40792)
1 ROUGE-1 Average_F: 0.44235 (95%-conf.int. 0.44009 - 0.44458)
1 ROUGE-2 Average_R: 0.24649 (95%-conf.int. 0.24376 - 0.24923)
1 ROUGE-2 Average_P: 0.19483 (95%-conf.int. 0.19248 - 0.19710)
1 ROUGE-2 Average_F: 0.21227 (95%-conf.int. 0.20990 - 0.21459)
1 ROUGE-L Average_R: 0.47664 (95%-conf.int. 0.47399 - 0.47927)
1 ROUGE-L Average_P: 0.37619 (95%-conf.int. 0.37373 - 0.37863)
1 ROUGE-L Average_F: 0.41043 (95%-conf.int. 0.40808 - 0.41257)
Results reproduced! 😄
Hi @Colanim, I'm having some trouble obtaining the output. I wonder whether you could share the BART-generated summaries with me (my email: [email protected]).
Thank you,
Rui
@Colanim Thank you so much for sharing it!
May I ask for some details about how to reproduce the results? I have my own data pipeline implemented with the Huggingface tokenizer, and I was trying to generate the CNN/DM summaries with the released BART checkpoint, but my summaries are poor (ROUGE-1 = 0.35). So my questions are:
Do I need to add `<s>` or `</s>` to the input? My generated summaries often start with two dots ('. .'), and I think it's a bias due to the input. Thank you!
Rui
To reproduce the results, I used fairseq.
If you use the code provided in their README, you should have a file test.source containing one article per line, without `<s>` or `</s>`.
Inside the method for generating summaries with beam search, yes, they add `<s>` and `</s>`.
About file preprocessing, the process is described in #1391.
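As a quick sanity check on the test.source layout described above (one article per line, no special tokens), a small helper can flag lines that still carry the markers; the function name is hypothetical:

```python
def lines_with_special_tokens(lines):
    """Return 1-based indices of lines that still contain <s> or </s> markers."""
    return [i for i, line in enumerate(lines, 1)
            if '<s>' in line or '</s>' in line]
```

Running it over the file (`lines_with_special_tokens(open('test.source'))`) should return an empty list if the preprocessing is right.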
@Colanim Thanks for sharing your steps to get it to work! I think I got the correct summaries, but I'm having trouble getting the right ROUGE scores.
ROUGE-1:
rouge_1_f_score: 0.4433 with confidence interval (0.4411, 0.4454)
rouge_1_recall: 0.5132 with confidence interval (0.5106, 0.5160)
rouge_1_precision: 0.4063 with confidence interval (0.4040, 0.4086)
ROUGE-2:
rouge_2_f_score: 0.2114 with confidence interval (0.2090, 0.2139)
rouge_2_recall: 0.2447 with confidence interval (0.2418, 0.2475)
rouge_2_precision: 0.1942 with confidence interval (0.1919, 0.1965)
ROUGE-L:
rouge_l_f_score: 0.3077 with confidence interval (0.3054, 0.3099)
rouge_l_recall: 0.3575 with confidence interval (0.3547, 0.3602)
rouge_l_precision: 0.2813 with confidence interval (0.2790, 0.2835)
I'm using the Perl ROUGE-1.5.5 and pyrouge, and I suspect the problem has to do with tokenization, since the outputs are not tokenized. How did you run ROUGE on the output summaries?
Never mind, I realized I had missed some instructions on https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md. The new ROUGE scores I got are pretty close to the reported results.
1 ROUGE-1 Average_R: 0.50978 (95%-conf.int. 0.50701 - 0.51246)
1 ROUGE-1 Average_P: 0.40405 (95%-conf.int. 0.40146 - 0.40674)
1 ROUGE-1 Average_F: 0.43987 (95%-conf.int. 0.43759 - 0.44218)
1 ROUGE-2 Average_R: 0.24421 (95%-conf.int. 0.24125 - 0.24706)
1 ROUGE-2 Average_P: 0.19417 (95%-conf.int. 0.19174 - 0.19667)
1 ROUGE-2 Average_F: 0.21090 (95%-conf.int. 0.20840 - 0.21338)
1 ROUGE-L Average_R: 0.47298 (95%-conf.int. 0.47030 - 0.47566)
1 ROUGE-L Average_P: 0.37514 (95%-conf.int. 0.37256 - 0.37772)
1 ROUGE-L Average_F: 0.40828 (95%-conf.int. 0.40600 - 0.41056)
@Colanim, I think I have preprocessed the data just the way described in the related issues (pytorch/fairseq#1401, pytorch/fairseq#1391, pytorch/fairseq#1364), but somehow the result is just not as high as expected:
1 ROUGE-1 Average_R: 0.49419 (95%-conf.int. 0.49146 - 0.49701)
1 ROUGE-1 Average_P: 0.39219 (95%-conf.int. 0.38966 - 0.39487)
1 ROUGE-1 Average_F: 0.42676 (95%-conf.int. 0.42444 - 0.42909)
---------------------------------------------
1 ROUGE-2 Average_R: 0.23836 (95%-conf.int. 0.23533 - 0.24117)
1 ROUGE-2 Average_P: 0.18979 (95%-conf.int. 0.18733 - 0.19225)
1 ROUGE-2 Average_F: 0.20602 (95%-conf.int. 0.20344 - 0.20846)
---------------------------------------------
1 ROUGE-L Average_R: 0.46045 (95%-conf.int. 0.45774 - 0.46303)
1 ROUGE-L Average_P: 0.36564 (95%-conf.int. 0.36313 - 0.36821)
1 ROUGE-L Average_F: 0.39775 (95%-conf.int. 0.39538 - 0.40009)
Elapsed time: 242.342 seconds
My steps are as follows (my test.source contains one article per line, without `<s>` or `</s>`); I would really appreciate it if you could take a glance and maybe spot what detail I have overlooked.

```python
import torch

bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()
count = 1
bsz = 32
with open('test.source') as source, open('test.hypo', 'w') as fout:
    sline = source.readline().strip()
    slines = [sline]
    for sline in source:
        if count % bsz == 0:
            with torch.no_grad():
                hypotheses_batch = bart.sample(slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
            for hypothesis in hypotheses_batch:
                fout.write(hypothesis + '\n')
                fout.flush()
            slines = []
        slines.append(sline.strip())
        count += 1
    if slines != []:
        hypotheses_batch = bart.sample(slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
        for hypothesis in hypotheses_batch:
            fout.write(hypothesis + '\n')
            fout.flush()
```
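For what it's worth, the counter-based batching in the loop above can be expressed as a small generator; this is just a sketch for readability, not fairseq code:

```python
def batched(lines, bsz):
    """Yield stripped lines in batches of at most bsz items."""
    batch = []
    for line in lines:
        batch.append(line.strip())
        if len(batch) == bsz:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Each yielded batch would then be passed to `bart.sample(...)` exactly as in the loop above.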
```shell
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

# Tokenize hypothesis and target files.
cat test.hypo | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.tokenized
cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.target

files2rouge test.hypo.tokenized test.hypo.target
```
One more ask: if it's convenient for you, could you please share your test.source with me? I really want to know whether my data preprocessing caused the score gap. My email address is is.[email protected], thanks!