Hi, thanks for making this code base available!
I have two questions: one on the input format for fine-tuning the language model on a custom dataset, and one on unreasonably long data preprocessing time. Thanks in advance for any help!
I'm trying to fine-tune the BERT model on an extra dataset, and am using the run_lm_finetuning.py script in the examples/ directory. However, I'm having trouble locating instructions on the proper format of the input data. There used to be some instructions in the examples/lm_finetuning/ directory, but they seem deprecated now.
As a start, I followed the run_lm_finetuning.py example and changed nothing but the --train_data_file argument, pointing it at a bigger text file (arbitrary format). The training, however, has been stuck in the data preprocessing step for about 10 hours, and the last standard output is shown below.
pytorch_transformers.tokenization_utils - Token indices sequence length is longer than the specified maximum sequence length for this model (164229992 > 512). Running this sequence through the model will result in indexing errors
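For reference, here is a minimal sketch of what I suspect is happening, namely that the whole file is read and tokenized as a single sequence, which would explain both the huge number in the warning and the very long preprocessing time (the file name is a placeholder; only the BertTokenizer calls are actual pytorch_transformers APIs):

```python
from pytorch_transformers import BertTokenizer

# Hypothetical file name; the tokenizer calls are real pytorch_transformers APIs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("my_corpus.txt", encoding="utf-8") as f:
    text = f.read()  # the whole file is read as one long string

tokens = tokenizer.tokenize(text)              # single pass over the full corpus
ids = tokenizer.convert_tokens_to_ids(tokens)  # this call logs the max-length warning
print(len(ids))                                # roughly the 164229992 from the warning
```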
I am facing the same issue, since there is no documented format for defining the train and test datasets.
Normally I use a .csv file with (UID, Text, Labels) columns, but judging from the wiki .txt file it's more of an arbitrary plain-text format.
Any help would be appreciated.
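In case it helps anyone, this is roughly how I flatten my CSV into a plain text file for --train_data_file (file names and the UID/Text/Labels columns are from my own data, so treat them as placeholders; the Labels column is simply dropped since the LM objective doesn't use it):

```python
import csv

# Hypothetical file names; (UID, Text, Labels) are the columns of my own CSV.
with open("train.csv", newline="", encoding="utf-8") as src, \
        open("train.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        text = row["Text"].strip()
        if text:
            dst.write(text + "\n")  # one document per line, label dropped
```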
I'm having the same issue. I think it's counting the total length of the tokenized corpus, not just the length of each tokenized document. I ran the wiki raw files as mentioned in the README and still get this warning about the total tokenized corpus length.
I tried the following formats with no success:
Update:
After looking at the code again, it seems that even though the warning says the sequence length is longer than 512, the script still chunks the corpus into 512-token blocks and trains on those. This raises the question of whether it is problematic to split the corpus on token length alone, especially since BERT, for example, is also trained to predict the next sentence. What happens in the (probably common) case where the data gets chunked mid-sentence?
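For illustration, here is a rough sketch of what I believe the chunking boils down to (the block size and file name are placeholders, and the real TextDataset also adds special tokens per block); the point is that block boundaries fall wherever 512 tokens happen to end, regardless of sentence boundaries:

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
block_size = 512  # the model max length from the warning

with open("my_corpus.txt", encoding="utf-8") as f:
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(f.read()))

# Cut the whole-corpus token id list into fixed-size blocks; the trailing
# partial block is dropped, and a block can start or end mid-sentence.
examples = [ids[i:i + block_size]
            for i in range(0, len(ids) - block_size + 1, block_size)]
print(len(examples), "blocks of", block_size, "tokens")
```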