Transformers: RuntimeError: cublas runtime error : resource allocation failed

Created on 7 Oct 2019 · 13 Comments · Source: huggingface/transformers

🐛 Bug

Model I am using: Bert

Language I am using the model on: English

The task I am working on is:

  • [ ] Fine-tuning a Bert model with my own dataset
  • [ ] run_lm_finetuning.py

To Reproduce

Steps to reproduce the behavior:

  1. I followed this issue: https://github.com/huggingface/transfer-learning-conv-ai/issues/10
  2. I tried reducing batch_size to 1.
  3. I tried CUDA_LAUNCH_BLOCKING=1, which throws:
     RuntimeError: CUDA error: out of memory

CUDA_VISIBLE_DEVICES=2 python run_lm_finetuning.py --output_dir=output --model_type=roberta --model_name_or_path=roberta-base --do_train --train_data_file=$TRAIN_FILE --do_eval --eval_data_file=$TEST_FILE --mlm --per_gpu_train_batch_size 1 --per_gpu_eval_batch_size 1
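
Before launching the full script, a quick way to check whether cuBLAS can initialize at all on the selected GPU is to run a tiny matrix multiplication on it. This is a minimal sketch, not part of the original report; with CUDA_VISIBLE_DEVICES=2 set, the single visible device shows up as cuda:0 inside the process.

import torch

# The one GPU exposed by CUDA_VISIBLE_DEVICES=2 is visible as cuda:0 here.
device = torch.device("cuda:0")

# A small matmul goes through cuBLAS, just like the F.linear call in the
# traceback below, so a failure here reproduces the initialization problem.
a = torch.randn(8, 16, device=device)
b = torch.randn(16, 4, device=device)
print((a @ b).shape)                          # expected: torch.Size([8, 4])
print(torch.cuda.memory_allocated(device))    # rough sanity check on GPU memory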

Expected behavior

Training should run without errors. Instead, it fails with the following traceback:
Traceback (most recent call last):
  File "run_lm_finetuning.py", line 497, in <module>
    main()
  File "run_lm_finetuning.py", line 451, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 189, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 237, in forward
    head_mask=head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 177, in forward
    head_mask=head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 625, in forward
    head_mask=head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 346, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 324, in forward
    attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 281, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 200, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:216
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|

Environment

  • OS: Linux
  • Python version: 3.6
  • PyTorch version: 1.2.0
  • PyTorch Transformers version: latest
  • Using GPU: yes, CUDA 10
  • Distributed or parallel setup: yes
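
For reference, version and GPU visibility can be confirmed with a small snippet like the following (a sketch, not part of the original report):

import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    print("GPU", i, "->", torch.cuda.get_device_name(i))
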
Label: wontfix

Most helpful comment

@YDYordanov @Hadjer13 I found the solution. In my case, my input example has two sentences, so I used token_type_ids the way I would with Bert, but it turns out I was passing the wrong token_type_ids to the RobertaModel. According to the transformers docs, RoBERTa does not make use of token type ids. So using [0,0,..0,1,1..1,0,0,..] as token_type_ids for RoBERTa is wrong; after I changed it to all zeros, i.e. [0,0,...,0,0], the error was fixed. Hope it can help someone!

All 13 comments

What GPU do you have?

Thanks for your reply and support, sir :)

NVIDIA TITAN RTX: 4 × 24 GB GPUs

Looks like your batch size may be too big?

Thank you so much for your support sir.

I set batch size = 1. Maybe the issue is in the latest branch; I will check out a previous master and then try again, sir.

Hi, I have the same error. Did you get this problem resolved?

I have the same error too

It may be because of this nn.Embedding issue in PyTorch. I had the same error. Check whether you have padded correctly, or whether you have included some invalid token.
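
One way to rule that out is to verify, on the CPU, that every input ID fits inside the model's embedding table before the batch is moved to the GPU. A minimal sketch, assuming roberta-base and a toy input (substitute your own batch):

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

input_ids = torch.tensor([tokenizer.encode("Hello world", add_special_tokens=True)])

# An id outside [0, vocab_size) corrupts the embedding lookup on the GPU and
# often only surfaces later as a cryptic CUDA/cuBLAS error, so check it here.
vocab_size = model.config.vocab_size
assert input_ids.min().item() >= 0 and input_ids.max().item() < vocab_size, \
    "token id out of range"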

Very similar issue with roberta-base (but not bert-base-cased/uncased):

RuntimeError: cublas runtime error : library not initialized at /opt/conda/conda-bld/pytorch_1573049306803/work/aten/src/THC/THCGeneral.cpp:216

I have checked, and it isn't a problem with nn.Embedding, nor a memory issue.

Very similar issue when using the CamemBERT model, which is based on RoBERTa.
Could you solve the issue? Any thoughts about it, please?

@YDYordanov Same as you when using roberta-base. Have you resolved it?

@YDYordanov @Hadjer13 I found the solution. In my case, my input example has two sentences, so I used token_type_ids the way I would with Bert, but it turns out I was passing the wrong token_type_ids to the RobertaModel. According to the transformers docs, RoBERTa does not make use of token type ids. So using [0,0,..0,1,1..1,0,0,..] as token_type_ids for RoBERTa is wrong; after I changed it to all zeros, i.e. [0,0,...,0,0], the error was fixed. Hope it can help someone!
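
In code, the fix described above amounts to something like the following (a minimal sketch using the standard roberta-base checkpoint, not the original poster's exact setup):

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Two sentences in one example, encoded as a pair.
input_ids = torch.tensor([tokenizer.encode("First sentence.", "Second sentence.",
                                           add_special_tokens=True)])

# BERT-style segment ids ([0,...,0,1,...,1]) index past RoBERTa's token type
# embedding (which has size 1) and can surface as the cuBLAS error above.
# For RoBERTa, pass all zeros or simply omit token_type_ids.
token_type_ids = torch.zeros_like(input_ids)

outputs = model(input_ids, token_type_ids=token_type_ids)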

Thank you!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
