From the README of BART for reproducing CNN/DM results:
Follow the instructions here to download and process the data into files such that
test.source and test.target have one line for each non-tokenized sample.
After following the instructions, I don't have files like test.source and test.target.
Instead, I have test.bin and a chunked version of this file
(chunked/test_000.bin ~ chunked/test_011.bin).
How can I process test.bin into test.source and test.target?
@ngoyal2707 @yinhanliu
Thanks for the interest. You need to remove https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L235
and comment out all the TF code in the write_to_bin function.
You need to keep the data raw (no tokenization) and feed it to the GPT-2 BPE.
_Note_
I also had to modify this line, in order to remove <s> and </s> from the target file.
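For reference, the marker removal being described can be sketched like this (roughly how make_datafiles.py assembles the abstract; the exact line may differ):

```python
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'
abstract_sentences = ['the first summary sentence .', 'the second one .']

# Original: wrap every summary sentence in <s> ... </s> markers.
with_markers = ' '.join("%s %s %s" % (SENTENCE_START, sent, SENTENCE_END)
                        for sent in abstract_sentences)

# Modified: join the sentences directly, so no markers end up in test.target.
without_markers = ' '.join(abstract_sentences)
```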
_Note 2_
To get better results, I also had to keep the text cased. To do this, I removed this line:
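The lowercasing step being removed looks roughly like this (a sketch; the exact line in make_datafiles.py may differ):

```python
lines = ['John Smith visited London .', 'He met the Queen .']

# Original preprocessing lowercases every line of the story:
lowercased = [line.lower() for line in lines]

# Keeping the text cased simply means skipping this step and
# passing `lines` through unchanged.
```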
I followed these instructions but I'm getting .bin files instead of .source and .target files. Am I missing something? I'm also trying to reproduce these results.
I modified the write_to_bin function to the following. Is this the correct data format?
def write_to_bin(url_file, out_file, makevocab=False):
  """Reads the tokenized .story files corresponding to the urls listed in the url_file and writes them to a out_file."""
  print "Making bin file for URLs listed in %s..." % url_file
  url_list = read_text_file(url_file)
  url_hashes = get_url_hashes(url_list)
  story_fnames = [s + ".story" for s in url_hashes]
  num_stories = len(story_fnames)

  if makevocab:
    vocab_counter = collections.Counter()

  with open('%s.target' % out_file, 'wb') as target_file:
    with open('%s.source' % out_file, 'wb') as source_file:
      for idx, s in enumerate(story_fnames):
        if idx % 1000 == 0:
          print "Writing story %i of %i; %.2f percent done" % (idx, num_stories, float(idx)*100.0/float(num_stories))

        # Look in the tokenized story dirs to find the .story file corresponding to this url
        if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)):
          story_file = os.path.join(cnn_tokenized_stories_dir, s)
        elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)):
          story_file = os.path.join(dm_tokenized_stories_dir, s)
        else:
          print "Error: Couldn't find tokenized story file %s in either tokenized story directories %s and %s. Was there an error during tokenization?" % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
          # Check again if the tokenized stories directories contain the correct number of files
          print "Checking that the tokenized stories directories %s and %s contain correct number of files..." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
          check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories)
          check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories)
          raise Exception("Tokenized stories directories %s and %s contain correct number of files but story file %s found in neither." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s))

        # Get the strings to write to the .source/.target files
        article, abstract = get_art_abs(story_file)

        target_file.write(abstract + '\n')
        source_file.write(article + '\n')
There are many details; here is my code.
I fixed the over-length lines in train.bpe.source, caused by ASCII '0D' (carriage return) characters in articles, by splitting and rejoining the text.
I summarized several notes here:
code: https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
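The carriage-return fix described above can be a simple split-and-join (a sketch):

```python
# A stray '\r' (ASCII 0D) inside an article would break one sample across
# two lines of train.bpe.source; splitting on all whitespace and rejoining
# with single spaces keeps each sample on a single line.
article = 'First part of the article.\rSecond part after a stray CR.'
cleaned = ' '.join(article.split())
```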
@zhaoguangxiang Thank you!
Here's a version for Python 3 if anyone is interested:
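A minimal self-contained Python 3 sketch of write_to_bin along those lines (assumptions: read_text_file, get_url_hashes, and get_art_abs are simplified stand-ins here; the real helpers in make_datafiles.py parse .story files and hash URLs in more detail, and the two tokenized-story directories are passed in as a list):

```python
import hashlib
import os

# Simplified stand-ins for helpers defined in make_datafiles.py.
def read_text_file(path):
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def get_url_hashes(url_list):
    return [hashlib.sha1(url.encode('utf-8')).hexdigest() for url in url_list]

def get_art_abs(story_file):
    # Placeholder parse: article = all lines joined, abstract = first line.
    with open(story_file, encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    return ' '.join(lines), lines[0]

def write_to_bin(url_file, out_file, story_dirs):
    """Write <out_file>.source / <out_file>.target with one raw
    (untokenized) sample per line, as the BART README expects."""
    story_fnames = [h + '.story' for h in get_url_hashes(read_text_file(url_file))]
    with open(out_file + '.target', 'w', encoding='utf-8') as target_file, \
         open(out_file + '.source', 'w', encoding='utf-8') as source_file:
        for fname in story_fnames:
            # Look through the story directories for this .story file.
            for d in story_dirs:
                candidate = os.path.join(d, fname)
                if os.path.isfile(candidate):
                    story_file = candidate
                    break
            else:
                raise Exception("Story file %s not found in %s" % (fname, story_dirs))
            article, abstract = get_art_abs(story_file)
            # Collapse internal newlines/CRs so each sample stays on one line.
            source_file.write(' '.join(article.split()) + '\n')
            target_file.write(' '.join(abstract.split()) + '\n')
```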