Thanks for the great effort toward releasing BART 😃
I'm currently having some difficulties reproducing BART results on the CNN/DM dataset.
I followed the README to test the bart.large.cnn model.
I obtained the following results:
1 ROUGE-1 Average_R: 0.50475 (95%-conf.int. 0.50197 - 0.50743)
1 ROUGE-1 Average_P: 0.39349 (95%-conf.int. 0.39102 - 0.39598)
1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)
1 ROUGE-2 Average_R: 0.23609 (95%-conf.int. 0.23327 - 0.23884)
1 ROUGE-2 Average_P: 0.18497 (95%-conf.int. 0.18268 - 0.18712)
1 ROUGE-2 Average_F: 0.20195 (95%-conf.int. 0.19956 - 0.20425)
1 ROUGE-L Average_R: 0.46458 (95%-conf.int. 0.46175 - 0.46720)
1 ROUGE-L Average_P: 0.36244 (95%-conf.int. 0.35999 - 0.36480)
1 ROUGE-L Average_F: 0.39678 (95%-conf.int. 0.39445 - 0.39898)
This is more than 1 point lower than the expected results.
Any advice on how to reproduce the results is welcome.
@ngoyal2707 @yinhanliu
I personally think it comes from the dataset processing.
I think it's commendable to use the script from abisee's repository so that results are comparable, but we still need to modify that script to make it work for BART, as mentioned here.
_And these modifications are still obscure._
If it helps, here are the first 3 _tokenized_ predictions:
French prosecutor says he is not aware of any video footage from on board the plane . French gendarmerie spokesman says cell phones have been collected at the crash site , but have n't been exploited . Two magazines claim to have found a cell phone video showing the harrowing final seconds of the crash .
Palestine officially becomes the 123rd member of the international criminal court . The move gives the court jurisdiction over alleged crimes in palestinian territories . israel and the united states , neither of which is an icc member , opposed the palestinians ' efforts to join the body .
Amnesty International releases its annual report on the use of the death penalty around the world . The report finds `` positive developments '' worldwide , with most regions seeming to show reductions in the number of executions . It also highlights a marked increase in the number of people sentenced to death in 2014 , an increase of 28 % .
And here are the first 3 _tokenized_ gold summaries:
marseille prosecutor says '' so far no videos were used in the crash investigation '' despite media reports .
journalists at bild and paris match are '' very confident '' the video clip is real , an editor says . andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .
membership gives the icc jurisdiction over alleged crimes committed in palestinian territories since last june . israel and the united states opposed the move , which could open the door to war crimes investigations against israelis .
amnesty 's annual death penalty report catalogs encouraging signs , but setbacks in numbers of those sentenced to death . organization claims that governments around the world are using the threat of terrorism to advance executions . the number of executions worldwide has gone down by almost 22 % compared with 2013 , but death sentences up by 28 % .
I finally managed to reproduce the results:
1 ROUGE-1 Average_R: 0.51378 (95%-conf.int. 0.51119 - 0.51657)
1 ROUGE-1 Average_P: 0.40167 (95%-conf.int. 0.39915 - 0.40421)
1 ROUGE-1 Average_F: 0.44017 (95%-conf.int. 0.43799 - 0.44250)
1 ROUGE-2 Average_R: 0.24613 (95%-conf.int. 0.24322 - 0.24898)
1 ROUGE-2 Average_P: 0.19281 (95%-conf.int. 0.19035 - 0.19519)
1 ROUGE-2 Average_F: 0.21093 (95%-conf.int. 0.20841 - 0.21338)
1 ROUGE-L Average_R: 0.47614 (95%-conf.int. 0.47347 - 0.47884)
1 ROUGE-L Average_P: 0.37256 (95%-conf.int. 0.37004 - 0.37511)
1 ROUGE-L Average_F: 0.40812 (95%-conf.int. 0.40581 - 0.41042)
My mistake was indeed in the data processing: I was lowercasing the data, but you should not.
Still slightly lower than the official results (44.16 / 21.28 / 40.90), but nothing abnormal!
Hi @Colanim ,
Thank you so much for sharing details!
I wonder how long it takes to fine-tune BART on the CNN/DM dataset.
Would you share some details on the training time and hardware?
Thanks!!
I didn't train BART: the authors haven't released the training code (yet).
I just ran the checkpoint finetuned on CNN/DM and got the results on the test set ^^
I could run the inference on a single GPU (8 GB).
Thanks Colanim!
@Colanim Thanks for your efforts on reproduction.
Yes, we train models on cased data, so having cased input is important at test time.
Some subtle differences that could explain the small gap you are seeing:
1) Change here so that there is no space.
2) We remove the '(CNN)' keyword from the article by adding the following lines here:

```python
if cnn and article[:5] == '(CNN)':
    article = article[5:]
```
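Putting the two preprocessing points together (keep the original casing, strip the leading '(CNN)' tag), a minimal sketch might look like this; the function name and `cnn` flag are hypothetical, not part of the fairseq code:

```python
def clean_article(article: str, cnn: bool = True) -> str:
    """Strip a leading '(CNN)' tag when present; keep the original casing."""
    if cnn and article[:5] == '(CNN)':
        article = article[5:]
    # Note: no .lower() here -- BART was trained on cased data,
    # so lowercasing the input hurts the scores.
    return article.strip()
```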
@wonjininfo Training should take around 2 hours.
With the changes pointed out by @ngoyal2707, here are my results:
1 ROUGE-1 Average_R: 0.51395 (95%-conf.int. 0.51137 - 0.51660)
1 ROUGE-1 Average_P: 0.40532 (95%-conf.int. 0.40290 - 0.40792)
1 ROUGE-1 Average_F: 0.44235 (95%-conf.int. 0.44009 - 0.44458)
1 ROUGE-2 Average_R: 0.24649 (95%-conf.int. 0.24376 - 0.24923)
1 ROUGE-2 Average_P: 0.19483 (95%-conf.int. 0.19248 - 0.19710)
1 ROUGE-2 Average_F: 0.21227 (95%-conf.int. 0.20990 - 0.21459)
1 ROUGE-L Average_R: 0.47664 (95%-conf.int. 0.47399 - 0.47927)
1 ROUGE-L Average_P: 0.37619 (95%-conf.int. 0.37373 - 0.37863)
1 ROUGE-L Average_F: 0.41043 (95%-conf.int. 0.40808 - 0.41257)
Results reproduced! 😄
Hi @Colanim, I'm having some trouble obtaining the output. I wonder whether you could share the BART-generated summaries with me (my email: [email protected]).
Thank you,
Rui
@Colanim Thank you so much for sharing it!
May I ask for some details about how to reproduce the results? I have my own data pipeline implemented with the Huggingface tokenizer, and I was trying to generate the CNN/DM summaries with the released BART checkpoint, but my summaries are poor (ROUGE-1 = 0.35). So my questions are:
Do I need to add `<s>` or `</s>` to the input? My generated summaries often start with two dots ('. .'), and I think it's a bias due to the input. Thank you!
Rui
To reproduce the results, I used fairseq.
If you use the code provided in their README, you should have a file test.source containing one article per line, without `<s>` or `</s>`.
Inside the method for generating summaries with beam search, yes, they add `<s>` and `</s>`.
About file preprocessing, the process is described in #1391.
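As a quick sanity check on the test.source layout described above (one article per line, no special tokens), a small helper can flag lines that still carry the markers; the function name is hypothetical:

```python
def lines_with_special_tokens(lines):
    """Return 1-based indices of lines that still contain <s> or </s> markers."""
    return [i for i, line in enumerate(lines, 1)
            if '<s>' in line or '</s>' in line]
```

Running it over the file (`lines_with_special_tokens(open('test.source'))`) should return an empty list if the preprocessing is right.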
@Colanim Thanks for sharing your steps to get it to work! I think I got the correct summaries, but I'm having trouble getting the right ROUGE scores.
ROUGE-1:
rouge_1_f_score: 0.4433 with confidence interval (0.4411, 0.4454)
rouge_1_recall: 0.5132 with confidence interval (0.5106, 0.5160)
rouge_1_precision: 0.4063 with confidence interval (0.4040, 0.4086)
ROUGE-2:
rouge_2_f_score: 0.2114 with confidence interval (0.2090, 0.2139)
rouge_2_recall: 0.2447 with confidence interval (0.2418, 0.2475)
rouge_2_precision: 0.1942 with confidence interval (0.1919, 0.1965)
ROUGE-L:
rouge_l_f_score: 0.3077 with confidence interval (0.3054, 0.3099)
rouge_l_recall: 0.3575 with confidence interval (0.3547, 0.3602)
rouge_l_precision: 0.2813 with confidence interval (0.2790, 0.2835)
I'm using the Perl ROUGE-1.5.5 and pyrouge, and I suspect the problem has to do with tokenization, since the outputs are not tokenized. How did you run ROUGE on the output summaries?
Never mind, I realized I had missed some instructions on https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md. The new ROUGE scores I got are pretty close to the reported results.
1 ROUGE-1 Average_R: 0.50978 (95%-conf.int. 0.50701 - 0.51246)
1 ROUGE-1 Average_P: 0.40405 (95%-conf.int. 0.40146 - 0.40674)
1 ROUGE-1 Average_F: 0.43987 (95%-conf.int. 0.43759 - 0.44218)
1 ROUGE-2 Average_R: 0.24421 (95%-conf.int. 0.24125 - 0.24706)
1 ROUGE-2 Average_P: 0.19417 (95%-conf.int. 0.19174 - 0.19667)
1 ROUGE-2 Average_F: 0.21090 (95%-conf.int. 0.20840 - 0.21338)
1 ROUGE-L Average_R: 0.47298 (95%-conf.int. 0.47030 - 0.47566)
1 ROUGE-L Average_P: 0.37514 (95%-conf.int. 0.37256 - 0.37772)
1 ROUGE-L Average_F: 0.40828 (95%-conf.int. 0.40600 - 0.41056)
@Colanim, I think I have preprocessed the data just the way described in the related issues (pytorch/fairseq#1401, pytorch/fairseq#1391, pytorch/fairseq#1364), but somehow the result is just not as high as expected:
1 ROUGE-1 Average_R: 0.49419 (95%-conf.int. 0.49146 - 0.49701)
1 ROUGE-1 Average_P: 0.39219 (95%-conf.int. 0.38966 - 0.39487)
1 ROUGE-1 Average_F: 0.42676 (95%-conf.int. 0.42444 - 0.42909)
---------------------------------------------
1 ROUGE-2 Average_R: 0.23836 (95%-conf.int. 0.23533 - 0.24117)
1 ROUGE-2 Average_P: 0.18979 (95%-conf.int. 0.18733 - 0.19225)
1 ROUGE-2 Average_F: 0.20602 (95%-conf.int. 0.20344 - 0.20846)
---------------------------------------------
1 ROUGE-L Average_R: 0.46045 (95%-conf.int. 0.45774 - 0.46303)
1 ROUGE-L Average_P: 0.36564 (95%-conf.int. 0.36313 - 0.36821)
1 ROUGE-L Average_F: 0.39775 (95%-conf.int. 0.39538 - 0.40009)
Elapsed time: 242.342 seconds
My steps are as follows (my test.source contains one article per line, without `<s>` or `</s>`); I would really appreciate it if you could take a glance and maybe spot what detail I have overlooked.

```python
import torch

bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()
count = 1
bsz = 32
with open('test.source') as source, open('test.hypo', 'w') as fout:
    sline = source.readline().strip()
    slines = [sline]
    for sline in source:
        if count % bsz == 0:
            with torch.no_grad():
                hypotheses_batch = bart.sample(slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
            for hypothesis in hypotheses_batch:
                fout.write(hypothesis + '\n')
                fout.flush()
            slines = []
        slines.append(sline.strip())
        count += 1
    if slines != []:
        hypotheses_batch = bart.sample(slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
        for hypothesis in hypotheses_batch:
            fout.write(hypothesis + '\n')
            fout.flush()
```
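For what it's worth, the counter-based batching in the loop above can be expressed as a small generator; this is just a sketch for readability, not fairseq code:

```python
def batched(lines, bsz):
    """Yield stripped lines in batches of at most bsz items."""
    batch = []
    for line in lines:
        batch.append(line.strip())
        if len(batch) == bsz:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Each yielded batch would then be passed to `bart.sample(...)` exactly as in the loop above.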
```shell
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

# Tokenize hypothesis and target files.
cat test.hypo | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.tokenized
cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.target

files2rouge test.hypo.tokenized test.hypo.target
```
One more ask: if it's convenient for you, could you please share your test.source with me? I really want to know whether my data preprocessing caused the score gap. My email address is is.[email protected], thanks!