Description
The question-answering pipeline crashes for RoBERTa models.
It loads the model and tokenizer correctly, but the SQuAD preprocessing produces a wrong p_mask, which leaves no possible prediction and raises the error message below.
The observed p_mask for a roberta model is
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]
while it should mask only the question tokens, like this:
[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ...]
I think the deeper root cause is that RoBERTa's token_type_ids returned from encode_plus are now all zeros (introduced in https://github.com/huggingface/transformers/pull/2432), while the creation of p_mask in squad_convert_example_to_features still relies on this information:
https://github.com/huggingface/transformers/blob/520e7f211926e07b2059bc8e21b668db4372e4db/src/transformers/data/processors/squad.py#L189-L202
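For illustration, here is a minimal sketch of the failure; the 1 - token_type_ids step is a simplification of the linked squad.py logic, not the exact code:

from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
enc = tokenizer.encode_plus(
    "What is roberta?",
    "Roberta is a language model.",
    return_token_type_ids=True,
)

# RoBERTa now returns all-zero token_type_ids for both segments.
print(enc["token_type_ids"])  # [0, 0, 0, ..., 0]

# Simplified version of the p_mask construction: 1 - token_type_ids masks
# segment A (the question) for BERT-style tokenizers, but for RoBERTa it
# masks everything; squad.py later resets the CLS position to 0, which
# yields exactly the broken [0, 1, 1, 1, ...] observed above.
p_mask = 1 - np.minimum(np.array(enc["token_type_ids"]), 1)
print(p_mask)  # all ones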
Haven't checked yet, but this might also affect training/eval if p_mask is used there.
How to reproduce?
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
res = nlp({
    'question': 'What is roberta?',
    'context': 'Roberta is a language model that was trained for a longer time, on more data, without NSP'
})
results in
File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in __call__
for s, e, score in zip(starts, ends, scores)
File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in <listcomp>
for s, e, score in zip(starts, ends, scores)
KeyError: 0
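For context, the KeyError falls out of the span-selection step: with the broken p_mask every context token is suppressed, so no (start, end) candidate survives. A simplified illustration of the mechanism (assumed logic, not the pipeline's exact code):

import numpy as np

p_mask = np.array([0, 1, 1, 1, 1, 1])   # only the CLS position is unmasked
start_logits = np.random.randn(len(p_mask))
start_logits[p_mask == 1] = -np.inf     # suppress all "impossible" positions
valid_starts = [i for i in range(1, len(p_mask)) if np.isfinite(start_logits[i])]
print(valid_starts)  # [] -> no answer span can be selected downstream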
I think I have a related problem regarding training/evaluation with run_squad.py.
I wanted to train a roberta model on my own Q&A dataset mixed with the SQuAD dataset by running:
python ./examples/run_squad.py \
  --output_dir=/home/jupyter/sec_roberta/roberta-base-mixed-quad \
  --model_type=roberta \
  --model_name_or_path=roberta-large \
  --do_train \
  --train_file=../sec_roberta/financial_and_squad2_train.json \
  --do_eval \
  --predict_file=../sec_roberta/financial_and_squad2_dev.json \
  --learning_rate=1.5e-5 \
  --num_train_epochs=2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --overwrite_output_dir \
  --per_gpu_train_batch_size=6 \
  --per_gpu_eval_batch_size=6 \
  --warmup_steps 500 \
  --weight_decay 0.01 \
  --version_2_with_negative
I ran into this error:
02/12/2020 08:22:38 - INFO - __main__ - Creating features from dataset file at .
0%|          | 0/542 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_squad.py", line 853, in <module>
    main()
  File "./examples/run_squad.py", line 791, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
  File "./examples/run_squad.py", line 474, in load_and_cache_examples
    examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 501, in get_train_examples
    return self._create_examples(input_data, "train")
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 559, in _create_examples
    answers=answers,
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 633, in __init__
    self.start_position = char_to_word_offset[start_position_character]
IndexError: list index out of range
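That IndexError means an answer's answer_start character offset points beyond the end of its context. A quick sanity check over a SQuAD-style JSON file; the helper below is illustrative, not part of the repo:

import json

def check_squad_offsets(path):
    # Flag answers whose answer_start falls outside the context, which is
    # what triggers char_to_word_offset[start_position_character] above.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    for article in data:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa.get("answers", []):
                    start = ans["answer_start"]
                    if start >= len(context):
                        print(f"qas id {qa['id']}: answer_start {start} "
                              f"exceeds context length {len(context)}")

check_squad_offsets("../sec_roberta/financial_and_squad2_train.json")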
I tested my dataset on roberta-base and it works, so I don't think my dataset itself is the issue.
Also, I ran the same code with the SQuAD 2.0 dataset on roberta-large and on an LM-finetuned version of roberta-large, and both work, so this is all very mysterious to me.
I thought it could be related.
Update: a fresh install of transformers fixed it for me...
I run into a similar error when trying to use the run_squad.py example to train roberta-large on SQuAD 2.0.
When I run
export DATA_DIR=./data
python ./transformers/examples/run_squad.py \
--model_type roberta \
--model_name_or_path roberta-large \
--do_train \
--do_eval \
--version_2_with_negative \
--train_file $DATA_DIR/squad2/train-v2.0.json \
--predict_file $DATA_DIR/squad2/dev-v2.0.json \
--per_gpu_eval_batch_size=6 \
--per_gpu_train_batch_size=6 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--overwrite_output_dir \
--overwrite_cache \
--max_seq_length 384 \
--doc_stride 128 \
--save_steps 100000 \
--output_dir ./roberta_squad/
I get the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/joshua_wagner/.local/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 198, in squad_convert_example_to_features
    p_mask = np.array(span["token_type_ids"])
KeyError: 'token_type_ids'
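This KeyError happens because RoBERTa's tokenizer stopped returning token_type_ids by default, while squad.py still indexes the key directly. A hedged sketch of a defensive fallback for the failing line (the actual patch in #4049 may differ):

import numpy as np

span = {"input_ids": [0, 2264, 16, 2]}  # stand-in for the encoded span dict
# Fall back to all-zero token_type_ids when the tokenizer does not return them.
token_type_ids = span.get("token_type_ids", [0] * len(span["input_ids"]))
p_mask = np.array(token_type_ids)
print(p_mask)
# No KeyError, but note the all-zero fallback still produces the broken p_mask
# described at the top of this issue, so a real fix also has to derive the
# mask differently for tokenizers without segment ids.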
Same error as @joshuawagner93
@joshuawagner93 @HenrykBorzymowski, this issue should have been patched with #3439. Could you install the latest release and let me know if it fixes your issue?
@LysandreJik works perfectly fine! Thx
@LysandreJik reinstall fixed the issue, thank you
@LysandreJik Unfortunately, we still face the same issue when we try to use roberta in the pipeline for inference. #3439 didn't seem to help for this.
Hi @tholor, indeed, it seems I thought this issue was resolved when it really wasn't. I just opened #4049 which should fix the issue.
Awesome, thanks for working on this @LysandreJik!
@tholor, the PR should be merged soon, thank you for your patience!
Great, thank you! Looking forward to it :)