Description
The question-answering pipeline crashes for RoBERTa models.
It loads the model and tokenizer correctly, but the SQuAD preprocessing produces a wrong p_mask, which leaves no possible prediction and raises the error message below.
The observed p_mask for a roberta model is
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
while it should mask only the question tokens, like this:
[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ...]
I think the deeper root cause is that RoBERTa's token_type_ids returned from encode_plus are now all zeros (introduced in https://github.com/huggingface/transformers/pull/2432), while the creation of p_mask in squad_convert_example_to_features still relies on this information:
https://github.com/huggingface/transformers/blob/520e7f211926e07b2059bc8e21b668db4372e4db/src/transformers/data/processors/squad.py#L189-L202
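For illustration, here is a minimal sketch of the failure; the 1 - token_type_ids step is a simplification of the linked squad.py logic, not the exact code:

from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
enc = tokenizer.encode_plus(
    "What is roberta?",
    "Roberta is a language model.",
    return_token_type_ids=True,
)

# RoBERTa now returns all-zero token_type_ids for both segments.
print(enc["token_type_ids"])  # [0, 0, 0, ..., 0]

# Simplified version of the p_mask construction: 1 - token_type_ids masks
# segment A (the question) for BERT-style tokenizers, but for RoBERTa it
# masks everything; squad.py later resets the CLS position to 0, which
# yields exactly the broken [0, 1, 1, 1, ...] observed above.
p_mask = 1 - np.minimum(np.array(enc["token_type_ids"]), 1)
print(p_mask)  # all ones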
Haven't checked yet, but this might also affect training/eval if p_mask is used there.
How to reproduce?
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
res = nlp({
    'question': 'What is roberta?',
    'context': 'Roberta is a language model that was trained for a longer time, on more data, without NSP'
})
results in
File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in __call__
for s, e, score in zip(starts, ends, scores)
File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in <listcomp>
for s, e, score in zip(starts, ends, scores)
KeyError: 0
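For context, the KeyError falls out of the span-selection step: with the broken p_mask every context token is suppressed, so no (start, end) candidate survives. A simplified illustration of the mechanism (assumed logic, not the pipeline's exact code):

import numpy as np

p_mask = np.array([0, 1, 1, 1, 1, 1])   # only the CLS position is unmasked
start_logits = np.random.randn(len(p_mask))
start_logits[p_mask == 1] = -np.inf     # suppress all "impossible" positions
valid_starts = [i for i in range(1, len(p_mask)) if np.isfinite(start_logits[i])]
print(valid_starts)  # [] -> no answer span can be selected downstream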
I think I have a related problem regarding training/evaluation with run_squad.py.
I wanted to train a roberta model on my own Q&A dataset mixed with the SQuAD dataset by running:
python ./examples/run_squad.py \
  --output_dir=/home/jupyter/sec_roberta/roberta-base-mixed-quad \
  --model_type=roberta \
  --model_name_or_path=roberta-large \
  --do_train \
  --train_file=../sec_roberta/financial_and_squad2_train.json \
  --do_eval \
  --predict_file=../sec_roberta/financial_and_squad2_dev.json \
  --learning_rate=1.5e-5 \
  --num_train_epochs=2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --overwrite_output_dir \
  --per_gpu_train_batch_size=6 \
  --per_gpu_eval_batch_size=6 \
  --warmup_steps 500 \
  --weight_decay 0.01 \
  --version_2_with_negative
I ran into this error:
02/12/2020 08:22:38 - INFO - __main__ - Creating features from dataset file at .
0%|          | 0/542 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_squad.py", line 853, in <module>
    main()
  File "./examples/run_squad.py", line 791, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
  File "./examples/run_squad.py", line 474, in load_and_cache_examples
    examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 501, in get_train_examples
    return self._create_examples(input_data, "train")
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 559, in _create_examples
    answers=answers,
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 633, in __init__
    self.start_position = char_to_word_offset[start_position_character]
IndexError: list index out of range
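That IndexError means an answer's answer_start character offset points beyond the end of its context. A quick sanity check over a SQuAD-style JSON file; the helper below is illustrative, not part of the repo:

import json

def check_squad_offsets(path):
    # Flag answers whose answer_start falls outside the context, which is
    # what triggers char_to_word_offset[start_position_character] above.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    for article in data:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa.get("answers", []):
                    start = ans["answer_start"]
                    if start >= len(context):
                        print(f"qas id {qa['id']}: answer_start {start} "
                              f"exceeds context length {len(context)}")

check_squad_offsets("../sec_roberta/financial_and_squad2_train.json")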
I tested my dataset on roberta-base and it works, so I don't think my dataset itself is the issue.
Also, I ran the same code with the SQuAD 2.0 dataset on roberta-large and on an LM-finetuned version of roberta-large, and both work, so this is all very mysterious to me.
I thought it could be related.
Update: a fresh install of transformers fixed it for me...
I run into a similar error when trying to use the run_squad.py example to train roberta-large on SQuAD 2.0.
When I run
export DATA_DIR=./data
python ./transformers/examples/run_squad.py \
--model_type roberta \
--model_name_or_path roberta-large \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $DATA_DIR/squad2/train-v2.0.json \
--predict_file $DATA_DIR/squad2/dev-v2.0.json \
--per_gpu_eval_batch_size=6 \
--per_gpu_train_batch_size=6 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--overwrite_output_dir \
--overwrite_cache \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 100000 \
--output_dir ./roberta_squad/
I get the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/joshua_wagner/.local/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 198, in squad_convert_example_to_features
    p_mask = np.array(span["token_type_ids"])
KeyError: 'token_type_ids'
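This KeyError happens because RoBERTa's tokenizer stopped returning token_type_ids by default, while squad.py still indexes the key directly. A hedged sketch of a defensive fallback for the failing line (the actual patch in #4049 may differ):

import numpy as np

span = {"input_ids": [0, 2264, 16, 2]}  # stand-in for the encoded span dict
# Fall back to all-zero token_type_ids when the tokenizer does not return them.
token_type_ids = span.get("token_type_ids", [0] * len(span["input_ids"]))
p_mask = np.array(token_type_ids)
print(p_mask)
# No KeyError, but note the all-zero fallback still produces the broken p_mask
# described at the top of this issue, so a real fix also has to derive the
# mask differently for tokenizers without segment ids.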
Same error as @joshuawagner93
@joshuawagner93 @HenrykBorzymowski, this issue should have been patched with #3439. Could you install the latest release and let me know if it fixes your issue?
@LysandreJik works perfectly fine! Thx
@LysandreJik reinstall fixed the issue, thank you
@LysandreJik Unfortunately, we still face the same issue when we try to use roberta in the pipeline for inference. #3439 didn't seem to help for this.
Hi @tholor, indeed, it seems I thought this issue was resolved when it really wasn't. I just opened #4049 which should fix the issue.
Awesome, thanks for working on this @LysandreJik!
@tholor, the PR should be merged soon, thank you for your patience!
Great, thank you! Looking forward to it :)