Hi, thanks for making this code base available!
I have two questions: one on the input format for fine-tuning the language model on a custom dataset, and one on unreasonably long data preprocessing time. Thanks in advance for any help!
I'm trying to fine-tune the BERT model on an extra dataset, and am using the run_lm_finetuning.py script in the examples/ directory. However, I'm having trouble locating instructions on the proper format of the input data. There used to be some instructions in the examples/lm_finetuning/ directory, but they seem deprecated now.
As a start, I followed the run_lm_finetuning.py example and changed nothing but the --train_data_file argument, pointing it at a bigger text file (arbitrary format). The training, however, has been stuck in the data preprocessing step for about 10 hours, and the last standard output is shown below.
pytorch_transformers.tokenization_utils - Token indices sequence length is longer than the specified maximum sequence length for this model (164229992 > 512). Running this sequence through the model will result in indexing errors
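For reference, here is a minimal sketch of what I suspect is happening, namely that the whole file is read and tokenized as a single sequence, which would explain both the huge number in the warning and the very long preprocessing time (the file name is a placeholder; only the BertTokenizer calls are actual pytorch_transformers APIs):

```python
from pytorch_transformers import BertTokenizer

# Hypothetical file name; the tokenizer calls are real pytorch_transformers APIs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("my_corpus.txt", encoding="utf-8") as f:
    text = f.read()  # the whole file is read as one long string

tokens = tokenizer.tokenize(text)              # single pass over the full corpus
ids = tokenizer.convert_tokens_to_ids(tokens)  # this call logs the max-length warning
print(len(ids))                                # roughly the 164229992 from the warning
```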
I am facing the same issue, since there is no documented format for defining the train and test datasets.
Normally I use a .csv file with (UID, Text, Labels) columns, but judging from the wiki .txt file it's more of an arbitrary plain-text format.
Any help would be appreciated.
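In case it helps anyone, this is roughly how I flatten my CSV into a plain text file for --train_data_file (file names and the UID/Text/Labels columns are from my own data, so treat them as placeholders; the Labels column is simply dropped since the LM objective doesn't use it):

```python
import csv

# Hypothetical file names; (UID, Text, Labels) are the columns of my own CSV.
with open("train.csv", newline="", encoding="utf-8") as src, \
        open("train.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        text = row["Text"].strip()
        if text:
            dst.write(text + "\n")  # one document per line, label dropped
```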
I'm having the same issue. I think it's counting the total length of the tokenized corpus, not just the length of each tokenized document. I ran the wiki raw files as mentioned in the README and still get this warning about the total tokenized corpus length.
I tried the following formats with no success:
Update:
After looking at the code again, it seems that even though the warning says the sequence length is longer than 512, the script still chunks the corpus into 512-token blocks and trains on those. This raises the question of whether it is problematic to split the corpus on token length alone, especially since BERT, for example, is also trained to predict the next sentence. What happens in the (probably common) case where the data gets chunked mid-sentence?
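For illustration, here is a rough sketch of what I believe the chunking boils down to (the block size and file name are placeholders, and the real TextDataset also adds special tokens per block); the point is that block boundaries fall wherever 512 tokens happen to end, regardless of sentence boundaries:

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
block_size = 512  # the model max length from the warning

with open("my_corpus.txt", encoding="utf-8") as f:
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(f.read()))

# Cut the whole-corpus token id list into fixed-size blocks; the trailing
# partial block is dropped, and a block can start or end mid-sentence.
examples = [ids[i:i + block_size]
            for i in range(0, len(ids) - block_size + 1, block_size)]
print(len(examples), "blocks of", block_size, "tokens")
```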