Transformers: XLNet evaluation fails if the size of the evaluation set is not divisible by the evaluation batch size

Created on 5 Oct 2020  ·  11 comments  ·  Source: huggingface/transformers

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.15.0-117-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): XLNet-base-cased

The problem arises when using:

  • the official example scripts: run_glue.py

The task I am working on is:

  • an official GLUE/SQuAD task: SST-2

To reproduce

Steps to reproduce the behavior:

  1. Install transformers from master and download SST-2 data using download_glue_data.py

  2. Create the following script:

GLUE_DIR=~/glue
CUDA_VISIBLE_DEVICES=0
TASK_NAME=SST-2

python3 ~/applications/transformers/examples/text-classification/run_glue.py \
  --model_name_or_path ~/xlnet \
  --task_name $TASK_NAME \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 64 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 64 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ~/result/$TASK_NAME/ \
  --overwrite_output_dir \
  --eval_steps 100

  3. Run the script.

Expected behavior

The Trainer should return appropriate evaluation results. Here are the logs from evaluating bert-base with the hyperparameters given above.

10/05/2020 22:28:47 - INFO - filelock -   Lock 140392033291808 acquired on /data/home/liusishun/glue/SST-2/cached_dev_BertTokenizer_64_sst-2.lock
10/05/2020 22:28:47 - INFO - filelock -   Lock 140392033291808 released on /data/home/liusishun/glue/SST-2/cached_dev_BertTokenizer_64_sst-2.lock
10/05/2020 22:28:50 - INFO - __main__ -   *** Evaluate ***
Evaluation: 100%|██████████| 14/14 [00:01<00:00,  7.22it/s]
{'eval_loss': 0.6916399122378148, 'eval_acc': 0.49770642201834864, 'step': 0}
/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py:1168: FutureWarning: This method is deprecated, use `Trainer.is_world_process_zero()` instead.
  warnings.warn("This method is deprecated, use `Trainer.is_world_process_zero()` instead.", FutureWarning)
10/05/2020 22:28:52 - INFO - __main__ -   ***** Eval results sst-2 *****
10/05/2020 22:28:52 - INFO - __main__ -     eval_loss = 0.6916399122378148
10/05/2020 22:28:52 - INFO - __main__ -     eval_acc = 0.49770642201834864

Observed behavior

10/05/2020 22:30:05 - INFO - filelock -   Lock 139928226197216 acquired on /data/home/liusishun/glue/SST-2/cached_dev_XLNetTokenizer_64_sst-2.lock
10/05/2020 22:30:05 - INFO - filelock -   Lock 139928226197216 released on /data/home/liusishun/glue/SST-2/cached_dev_XLNetTokenizer_64_sst-2.lock
10/05/2020 22:30:09 - INFO - __main__ -   *** Evaluate ***
Evaluation:  93%|█████████▎| 13/14 [00:02<00:00,  4.44it/s]
Traceback (most recent call last):
  File "/data/home/liusishun/applications/transformers/examples/text-classification/run_glue.py", line 247, in <module>
    main()
  File "/data/home/liusishun/applications/transformers/examples/text-classification/run_glue.py", line 197, in main
    eval_result = trainer.evaluate(eval_dataset=eval_dataset)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1297, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1382, in prediction_loop
    preds = logits if preds is None else nested_concat(preds, logits, dim=0)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer_utils.py", line 151, in nested_concat
    return type(tensors)(nested_concat(t, n, dim) for t, n in zip(tensors, new_tensors))
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer_utils.py", line 151, in <genexpr>
    return type(tensors)(nested_concat(t, n, dim) for t, n in zip(tensors, new_tensors))
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer_utils.py", line 151, in nested_concat
    return type(tensors)(nested_concat(t, n, dim) for t, n in zip(tensors, new_tensors))
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer_utils.py", line 151, in <genexpr>
    return type(tensors)(nested_concat(t, n, dim) for t, n in zip(tensors, new_tensors))
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer_utils.py", line 152, in nested_concat
    return torch.cat((tensors, new_tensors), dim=dim)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 40 and 64 in dimension 1 at /opt/conda/conda-bld/pytorch_1579061855666/work/aten/src/THC/generic/THCTensorMath.cu:71

All 11 comments

The XLNet model outputs some past states, called mems, at index 2. Those can't be concatenated together because their sequence length varies. You should pass --past_index 2 to your script so that:

  1. those mems are used
  2. they are discarded from the predictions, and thus evaluation should work.

We will have something easier to use in the future, but for now it should work around your problem.
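
To make the mismatch concrete, here is a minimal sketch (shapes are illustrative only, assuming mems laid out roughly as (mem_len, batch_size, hidden_size)). Any dimension other than the one being concatenated has to match, whether it's a varying sequence length or, as in the traceback above, the 40-example final batch against the 64-example full batches:

import torch

# Illustrative shapes only; the Trainer concatenates eval outputs along dim=0,
# so every other dimension must agree across batches.
full_batch_mems = torch.zeros(64, 64, 768)   # mems from a full batch of 64
last_batch_mems = torch.zeros(64, 40, 768)   # mems from the final batch of 40

try:
    torch.cat((full_batch_mems, last_batch_mems), dim=0)
except RuntimeError as err:
    print(err)  # "Sizes of tensors must match except in dimension 0 ..."

The logits themselves don't hit this, because their batch dimension is the one being concatenated; only the extra mems outputs cause trouble.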

Thanks for your fast reply. Unfortunately, --past_index 2 doesn't work for me.
New error logs:

10/05/2020 22:55:40 - INFO - filelock -   Lock 140417916796544 acquired on /data/home/liusishun/glue/SST-2/cached_dev_XLNetTokenizer_64_sst-2.lock
10/05/2020 22:55:41 - INFO - filelock -   Lock 140417916796544 released on /data/home/liusishun/glue/SST-2/cached_dev_XLNetTokenizer_64_sst-2.lock
10/05/2020 22:55:44 - INFO - __main__ -   *** Evaluate ***
Evaluation:  93%|█████████▎| 13/14 [00:09<00:00,  1.41it/s]
Traceback (most recent call last):
  File "/data/home/liusishun/applications/transformers/examples/text-classification/run_glue.py", line 247, in <module>
    main()
  File "/data/home/liusishun/applications/transformers/examples/text-classification/run_glue.py", line 197, in main
    eval_result = trainer.evaluate(eval_dataset=eval_dataset)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1297, in evaluate
    output = self.prediction_loop(eval_dataloader, description="Evaluation")
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1377, in prediction_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1459, in prediction_step
    outputs = model(**inputs)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/modeling_xlnet.py", line 1499, in forward
    transformer_outputs = self.transformer(
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/modeling_xlnet.py", line 1226, in forward
    new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
  File "/data/home/liusishun/.conda/envs/myenv/lib/python3.8/site-packages/transformers/modeling_xlnet.py", line 1011, in cache_mem
    new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:]
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 40 and 64 in dimension 1 at /opt/conda/conda-bld/pytorch_1579061855666/work/aten/src/THC/generic/THCTensorMath.cu:71

Current script:

GLUE_DIR=~/glue
CUDA_VISIBLE_DEVICES=0
TASK_NAME=SST-2

python3 ~/applications/transformers/examples/text-classification/run_glue.py \
  --model_name_or_path ~/xlnet \
  --task_name $TASK_NAME \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 64 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 64 \
  --past_index 2 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ~/result/$TASK_NAME/ \
  --overwrite_output_dir \
  --eval_steps 100

Any ideas?

Asking the XLNet specialists on our internal Slack. I think the main problem is that the model returns those mems, which can't be used for anything (and can't be concatenated). The fact that you get an error with past_index shows they can't really be used to speed up sequence classification.
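
For reference, a quick way to see those mems in the output (a rough sketch, assuming the public xlnet-base-cased checkpoint and the 3.3.x behaviour shown in the tracebacks; other versions may simply return None for mems):

import torch
from transformers import XLNetForSequenceClassification, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased")
model.eval()

inputs = tokenizer("a short sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

print(outputs.logits.shape)  # (1, num_labels)
if outputs.mems is not None:
    # One cached memory tensor per layer; these are what the Trainer tries
    # (and fails) to concatenate across differently sized batches.
    print(len(outputs.mems), outputs.mems[0].shape)
else:
    print("no mems returned by this version")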

Thanks for your response. Do you have any temporary workarounds or further suggestions for this problem?

Use another model...

Hi @StepinSilence and @sgugger! Any updates on this issue?
@StepinSilence, were you able to find a workaround to use XLNet?

Hi, @adhithyaarun. I remember that this issue occurred when the batch size didn't evenly divide the dataset size, so setting the batch size to a factor of your dataset size may work. However, I can't confirm this right now because our server's data disk died several days ago.
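
For what it's worth, the numbers in the tracebacks fit that theory: 13 full batches of 64 plus a final batch of 40 is 872 examples, so an eval batch size that divides 872 avoids the partial final batch entirely. A trivial check (the 872 is only inferred from the logs above):

dataset_size = 13 * 64 + 40   # 872, inferred from the 13/14 progress bar and the 40-vs-64 error
for batch_size in (8, 32, 64, 109):
    remainder = dataset_size % batch_size
    print(batch_size, "no partial batch" if remainder == 0 else f"partial final batch of {remainder}")
# 8 and 109 divide 872 evenly; 32 and 64 leave partial batches of 8 and 40 examples.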

Hello. I encountered the same problem using a CamemBERT model with transformers 3.4.0. The issue seems to arise when using dynamic padding. Is there any workaround for this other than padding to max length?

You should update to 3.5.0, which contains a fix for this in Trainer, to be able to do evaluation with dynamic padding.
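
If upgrading isn't an option, padding to a fixed length remains the fallback mentioned above; for completeness, a rough sketch (the camembert-base checkpoint is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")  # example checkpoint only
batch = tokenizer(
    ["une phrase courte", "une phrase un peu plus longue pour l'exemple"],
    padding="max_length",   # fixed-length padding instead of dynamic padding
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, 64): every batch has the same sequence length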

From reading the paper (especially the experiment sections on SQuAD, RACE, ...) I originally thought that the cached memory was also used during fine-tuning and not just during pre-training, but from this description here: https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587 it seems like the cached memory is actually not used during fine-tuning. So I'd suggest that we disable it for all models except XLNetLMHeadModel, where it obviously makes sense to use it. I'll add a PR to fix it.

Thank you all for fixing this issue!
