Fairseq: RoBERTa RACE preprocessing doesn't handle newlines

Created on 12 Aug 2019 · 4 comments · Source: pytorch/fairseq

Hi!

I'm trying out some of the RoBERTa code, thanks for releasing it! When trying to fine-tune on RACE, I got the following error from train.py after preprocessing the data as instructed:

| [input] dictionary: 50265 types
| loaded 4887 examples from: RACE_preprocessed_bpe/input0/valid
| loaded 4892 examples from: RACE_preprocessed_bpe/input1/valid
| loaded 4891 examples from: RACE_preprocessed_bpe/input2/valid
| loaded 4891 examples from: RACE_preprocessed_bpe/input3/valid
| loaded 4896 examples from: RACE_preprocessed_bpe/input4/valid
Traceback (most recent call last):
  File "train.py", line 325, in <module>
    cli_main()
  File "train.py", line 321, in cli_main
    main(args)
  File "train.py", line 46, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=0)
  File "/home/nfliu/git/fairseq/fairseq/tasks/sentence_ranking.py", line 119, in load_dataset
    src_token = ConcatSentencesDataset(input_option, input0)
  File "/home/nfliu/git/fairseq/fairseq/data/concat_sentences_dataset.py", line 17, in __init__
    'datasets must have the same length'
AssertionError: datasets must have the same length

Digging deeper, it seems that some of the answers in the RACE dataset contain newlines (\n). These aren't removed in preprocess_RACE.py, so the resulting files have different line counts:

     4887 dev.input0
     4892 dev.input1
     4891 dev.input2
     4891 dev.input3
     4896 dev.input4
     4887 dev.label

(they should all be 4887)
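The mismatch above can be caught before training with a quick line-count check over the preprocessed files. This is a hypothetical sanity-check helper (`check_lengths` is not part of fairseq), sketched under the assumption that the files follow the `<split>.input<N>` / `<split>.label` naming shown above:

```python
from pathlib import Path


def check_lengths(data_dir, split="dev", num_options=4):
    # Hypothetical sanity check: count lines in each preprocessed
    # input file plus the label file, so a mismatch is caught
    # before train.py hits the ConcatSentencesDataset assertion.
    counts = {}
    for i in range(num_options + 1):
        path = Path(data_dir) / f"{split}.input{i}"
        counts[path.name] = sum(1 for _ in path.open())
    label_path = Path(data_dir) / f"{split}.label"
    counts[label_path.name] = sum(1 for _ in label_path.open())
    return counts


def all_lengths_match(data_dir, split="dev", num_options=4):
    # True only when every file has the same number of lines.
    return len(set(check_lengths(data_dir, split, num_options).values())) == 1
```

With the counts reported above, `all_lengths_match` would return False, flagging the broken preprocessing run early.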

An example question with a newline:

[". What is the article mainly about? The main cause of the Titanic's sinking.",
'. What is the article mainly about? The 100\nthanniversary of the Titanic.',
". What is the article mainly about? The moon's great influence on the Earth's tides.",
". What is the article mainly about? The moon's role in the sinking of the Titanic."]

It seems the fix would be to replace the newlines with another character. Happy to submit a PR; just let me know which character you replaced newlines with in your experiments (perhaps a space would work?). Alternatively, the newline could be escaped so the literal \n is written out to the preprocessed data. Let me know which is better.
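The space-replacement option can be sketched as a small normalization pass applied to each answer before it is written out. The `clean_text` helper name is hypothetical (not from preprocess_RACE.py); it collapses newlines and any redundant whitespace into single spaces so every example stays on one line:

```python
import re


def clean_text(text):
    # Hypothetical helper: collapse newlines and runs of
    # whitespace into single spaces, and strip the ends,
    # so each example occupies exactly one output line.
    return re.sub(r"\s+", " ", text).strip()


# The broken answer from the example above:
print(clean_text("The 100\nth anniversary of the Titanic."))
# -> The 100 th anniversary of the Titanic.
```

Collapsing all whitespace (rather than only replacing `\n`) also avoids double spaces when the newline is adjacent to existing spaces.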

All 4 comments

Sorry, known problem, fix incoming.

There are a few other changes in there, e.g., adding the --truncate-sequence and --dropout options to the training command, and cleaning up redundant spaces in preprocessing. This should give slightly better results than reported in the paper.

Thanks for the quick turnaround, looks great!
