I am trying to fine-tune the BERT model on a custom dataset (similar to MRPC).
I am running this command:
python run_classifier.py \
--task_name=mrpc \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR \
--use_gpu=False \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/tmp/mrpc_output/
and getting the following error:
Traceback (most recent call last):
  File "run_classifier.py", line 981, in <module>
    tf.app.run()
  File "/home/kddilabs/miniconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_classifier.py", line 842, in main
    train_examples = processor.get_train_examples(FLAGS.data_dir)
  File "run_classifier.py", line 302, in get_train_examples
    self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
  File "run_classifier.py", line 326, in _create_examples
    text_b = tokenization.convert_to_unicode(line[4])
IndexError: list index out of range
My other custom datasets ran without any issue; I only get this error after increasing the size of the dataset. What could be the cause, and how can I fix it?
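Since the failing statement indexes `line[4]`, at least one row of train.tsv must have fewer than five columns. A minimal sketch for locating such rows follows. It assumes (this is not confirmed in the thread) that the script splits the file on physical lines and tabs with quoting disabled; `find_short_rows` is a hypothetical helper name, not part of run_classifier.py.

```python
import csv


def find_short_rows(path, min_columns=5):
    """Return (line_number, column_count) for every row with too few columns.

    Quoting is disabled, so a line break inside a quoted field starts a new
    row here. This is an assumption about how the training script reads the file.
    """
    short_rows = []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < min_columns:
                # reader.line_num is the physical line just consumed.
                short_rows.append((reader.line_num, len(row)))
    return short_rows
```

Calling `find_short_rows("data/train.tsv")` and inspecting the reported line numbers should show whether stray line breaks (or missing tabs) are responsible.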
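Hi,
I had the same error. It went away once I removed the line breaks inside the fields of the file.
In Python I do:
import pandas as pd
# Read the TSV; quotechar='"' lets pandas parse quoted fields that span several lines.
df_train = pd.read_csv("data/train.tsv", header=None, sep="\t", encoding="UTF-8", quotechar='"')
# Keep the five columns and replace line breaks in the last one with spaces.
df_bert_train = pd.DataFrame({'0': df_train[0],
                              '1': df_train[1],
                              '2': df_train[2],
                              '3': df_train[3],
                              '4': df_train[4].replace(r'\n', ' ', regex=True)})
# Overwrite the original file, without index or header.
df_bert_train.to_csv('data/train.tsv', sep='\t', index=False, header=False, encoding="UTF-8")
Hope this helps
L.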
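For anyone without pandas, the same idea can be sketched with the standard `csv` module. This is a swapped-in variant, not the code from the answer above: it cleans every column rather than only the last one, and it writes to a separate output file instead of overwriting the input. `strip_embedded_newlines` and the output path are hypothetical names chosen for illustration.

```python
import csv


def strip_embedded_newlines(src_path, dst_path):
    """Copy a TSV file, replacing line breaks inside fields with spaces."""
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # With a quote character set, a quoted field may span several lines.
        reader = csv.reader(src, delimiter="\t", quotechar='"')
        writer = csv.writer(dst, delimiter="\t", quotechar='"',
                            lineterminator="\n")
        for row in reader:
            cleaned = [
                field.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
                for field in row
            ]
            writer.writerow(cleaned)
```

For example, `strip_embedded_newlines("data/train.tsv", "data/train_clean.tsv")` leaves the original file untouched so the two can be compared before swapping them.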
Hey, your advice really worked. It solved my problem perfectly.
Your advice worked! Thanks!