From the README of BART for reproducing CNN/DM results:
Follow the instructions here to download and process the data into files such that
test.source and test.target have one line for each non-tokenized sample.
After following the instructions, I don't have files like test.source and test.target.
Instead, I have test.bin and a chunked version of this file
(chunked/test_000.bin ~ chunked/test_011.bin).
How can I process test.bin into test.source and test.target?
@ngoyal2707 @yinhanliu
Thanks for the interest. You need to remove https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L235
and comment out all the TF code in the write_to_bin function.
You need to keep the data raw (no tokenization) and feed it to the GPT-2 BPE.
_Note_
I also had to modify this line, in order to remove <s> and </s> from the target file.
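For reference, the marker removal being described can be sketched like this (roughly how make_datafiles.py assembles the abstract; the exact line may differ):

```python
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'
abstract_sentences = ['the first summary sentence .', 'the second one .']

# Original: wrap every summary sentence in <s> ... </s> markers.
with_markers = ' '.join("%s %s %s" % (SENTENCE_START, sent, SENTENCE_END)
                        for sent in abstract_sentences)

# Modified: join the sentences directly, so no markers end up in test.target.
without_markers = ' '.join(abstract_sentences)
```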
_Note 2_
To get better results, I also had to keep the text cased. To do this, I removed this line:
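The lowercasing step being removed looks roughly like this (a sketch; the exact line in make_datafiles.py may differ):

```python
lines = ['John Smith visited London .', 'He met the Queen .']

# Original preprocessing lowercases every line of the story:
lowercased = [line.lower() for line in lines]

# Keeping the text cased simply means skipping this step and
# passing `lines` through unchanged.
```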
I followed these instructions but I'm getting .bin files instead of .source and .target files. Am I missing something? I'm also trying to reproduce these results.
I modified the write_to_bin function to the following. Is this the correct data format?
def write_to_bin(url_file, out_file, makevocab=False):
  """Reads the tokenized .story files corresponding to the urls listed in the url_file and writes them to a out_file."""
  print "Making bin file for URLs listed in %s..." % url_file
  url_list = read_text_file(url_file)
  url_hashes = get_url_hashes(url_list)
  story_fnames = [s + ".story" for s in url_hashes]
  num_stories = len(story_fnames)

  if makevocab:
    vocab_counter = collections.Counter()

  with open('%s.target' % out_file, 'wb') as target_file:
    with open('%s.source' % out_file, 'wb') as source_file:
      for idx, s in enumerate(story_fnames):
        if idx % 1000 == 0:
          print "Writing story %i of %i; %.2f percent done" % (idx, num_stories, float(idx)*100.0/float(num_stories))

        # Look in the tokenized story dirs to find the .story file corresponding to this url
        if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)):
          story_file = os.path.join(cnn_tokenized_stories_dir, s)
        elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)):
          story_file = os.path.join(dm_tokenized_stories_dir, s)
        else:
          print "Error: Couldn't find tokenized story file %s in either tokenized story directories %s and %s. Was there an error during tokenization?" % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
          # Check again if the tokenized stories directories contain the correct number of files
          print "Checking that the tokenized stories directories %s and %s contain correct number of files..." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
          check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories)
          check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories)
          raise Exception("Tokenized stories directories %s and %s contain correct number of files but story file %s found in neither." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s))

        # Get the strings to write to the .source/.target files
        article, abstract = get_art_abs(story_file)

        target_file.write(abstract + '\n')
        source_file.write(article + '\n')
There are many details; here is my code.
I fixed the over-length lines in train.bpe.source, caused by ASCII '0D' (carriage return) characters in articles, by splitting and rejoining the text.
I summarized several notes here:
code: https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
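The carriage-return fix described above can be a simple split-and-join (a sketch):

```python
# A stray '\r' (ASCII 0D) inside an article would break one sample across
# two lines of train.bpe.source; splitting on all whitespace and rejoining
# with single spaces keeps each sample on a single line.
article = 'First part of the article.\rSecond part after a stray CR.'
cleaned = ' '.join(article.split())
```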
@zhaoguangxiang Thank you!
Here's a version for Python 3 if anyone is interested:
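A minimal self-contained Python 3 sketch of write_to_bin along those lines (assumptions: read_text_file, get_url_hashes, and get_art_abs are simplified stand-ins here; the real helpers in make_datafiles.py parse .story files and hash URLs in more detail, and the two tokenized-story directories are passed in as a list):

```python
import hashlib
import os

# Simplified stand-ins for helpers defined in make_datafiles.py.
def read_text_file(path):
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def get_url_hashes(url_list):
    return [hashlib.sha1(url.encode('utf-8')).hexdigest() for url in url_list]

def get_art_abs(story_file):
    # Placeholder parse: article = all lines joined, abstract = first line.
    with open(story_file, encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    return ' '.join(lines), lines[0]

def write_to_bin(url_file, out_file, story_dirs):
    """Write <out_file>.source / <out_file>.target with one raw
    (untokenized) sample per line, as the BART README expects."""
    story_fnames = [h + '.story' for h in get_url_hashes(read_text_file(url_file))]
    with open(out_file + '.target', 'w', encoding='utf-8') as target_file, \
         open(out_file + '.source', 'w', encoding='utf-8') as source_file:
        for fname in story_fnames:
            # Look through the story directories for this .story file.
            for d in story_dirs:
                candidate = os.path.join(d, fname)
                if os.path.isfile(candidate):
                    story_file = candidate
                    break
            else:
                raise Exception("Story file %s not found in %s" % (fname, story_dirs))
            article, abstract = get_art_abs(story_file)
            # Collapse internal newlines/CRs so each sample stays on one line.
            source_file.write(' '.join(article.split()) + '\n')
            target_file.write(' '.join(abstract.split()) + '\n')
```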