Transformers: KeyError in GLUE data tokenization with RoBERTa

Created on 17 Mar 2020 · 6 comments · Source: huggingface/transformers

🐛 Bug

I'm getting a KeyError when using RoBERTa in examples/run_glue.py: the data preprocessing tries to access 'token_type_ids', which the tokenizer no longer returns, possibly because of a recent commit that removed 'token_type_ids' from RoBERTa (and DistilBERT).

I get the error when fine-tuning RoBERTa on CoLA and RTE. I haven't tried other tasks, but I think you'd get the same error.
I don't get the error when fine-tuning XLNet (presumably because XLNet does use 'token_type_ids'), and I don't get the error when I do pip install transformers instead of pip install ., which suggests the issue comes from a recent commit.

Here's the full error message:

03/17/2020 11:53:58 - INFO - transformers.data.processors.glue -   Writing example 0/13997
Traceback (most recent call last):
  File "examples/run_glue.py", line 731, in <module>
    main()
  File "examples/run_glue.py", line 679, in main
    train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
  File "examples/run_glue.py", line 419, in load_and_cache_examples
    pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
  File "/home/ejp416/cmv/transformers/src/transformers/data/processors/glue.py", line 94, in glue_convert_examples_to_features
    input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
KeyError: 'token_type_ids'
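
For reference, the failing pattern can be reproduced outside run_glue.py with a minimal sketch (assuming a source install where the RoBERTa tokenizer no longer returns 'token_type_ids' by default; the input strings are illustrative):

# Minimal reproduction sketch (assumption: recent source install of transformers).
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
inputs = tokenizer.encode_plus("premise text", "hypothesis text", max_length=32)
print(inputs.keys())  # "token_type_ids" may be missing from this dict

# This is the same access that glue.py performs, and it raises the KeyError:
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]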

Information

Model I am using (Bert, XLNet ...): RoBERTa. I think DistilBERT may run into the same issue as well.

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

I've made slight modifications to the training loop in the official examples/run_glue.py, but I did not touch the data pre-processing, which is where the error occurs (before any training).

The tasks I am working on are:

  • [x] an official GLUE/SQuAD task: (give the name)
  • [ ] my own task or dataset: (give details below)

I've run into the error on CoLA and RTE, though I expect it happens on all GLUE tasks.

To reproduce

Steps to reproduce the behavior:

  1. Install transformers from the latest clone of the repo (use pip install ., not pip install transformers)
  2. Download the RTE data (e.g., into data/RTE using the GLUE download scripts in this repo)
  3. Run a command to train RoBERTa (base or large). I'm using:
python examples/run_glue.py --model_type roberta --model_name_or_path roberta-base --output_dir models/debug --task_name rte --do_train --evaluate_during_training --data_dir data/RTE --max_seq_length 32 --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.1 --logging_steps 874 --save_steps 874 --num_train_epochs 10 --warmup_steps 874 --per_gpu_train_batch_size 1 --per_gpu_eval_batch_size 2 --learning_rate 1e-5 --seed 12 --gradient_accumulation_steps 16 --overwrite_output_dir

Expected behavior

load_and_cache_examples (specifically, the call to convert_examples_to_features) in examples/run_glue.py should run without error and load, preprocess, and tokenize the dataset.
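
As an illustration of the expected behavior only (not the library's actual fix), the preprocessing could tolerate a tokenizer that omits 'token_type_ids' by falling back to all-zero segment ids; "roberta-base" and the input strings below are illustrative:

# Sketch of a tolerant preprocessing step.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
inputs = tokenizer.encode_plus("premise text", "hypothesis text", max_length=32)
input_ids = inputs["input_ids"]
# Fall back to zeroed segment ids when the key is absent.
token_type_ids = inputs.get("token_type_ids", [0] * len(input_ids))
assert len(token_type_ids) == len(input_ids)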

Environment info

  • transformers version: 2.5.1
  • Platform: Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-centos-7.7.1908-Core
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Error happens with both GPU and CPU
  • Using distributed or parallel set-up in script?: No

Most helpful comment

I also have this issue when I run run_multiple_choice.py on the RACE data with RoBERTa.

All 6 comments

I also have this issue when I run run_multiple_choice.py on the RACE data with RoBERTa.

I get the same error when I try to fine-tune on SQuAD.

Tagging @LysandreJik

I also have this issue when I run run_multiple_choice.py on the RACE data with RoBERTa.

Same here. Any solution?

@nielingyun @orena1 @Onur90 maybe try pulling the latest version of the repo again and see if it works? The error went away after I pulled recently; I'm not sure whether that fixed it or whether it was something else I did. Let me know if that works.

@ethanjperez by latest version, do you mean the latest commit or the latest release (v2.6.0)? It is still not working with the latest commit.
