Transformers: RuntimeError: cublas runtime error : resource allocation failed

Created on 7 Oct 2019 · 13 Comments · Source: huggingface/transformers

🐛 Bug

Model I am using: Bert

Language I am using the model on: English

The task I am working on is:

  • [ ] Fine-tuning a Bert model with my own dataset
  • [ ] run_lm_finetuning.py

To Reproduce

Steps to reproduce the behavior:

  1. I followed this issue: https://github.com/huggingface/transfer-learning-conv-ai/issues/10
  2. I tried reducing batch_size to 1.
  3. I tried CUDA_LAUNCH_BLOCKING=1, which throws:
     RuntimeError: CUDA error: out of memory

CUDA_VISIBLE_DEVICES=2 python run_lm_finetuning.py --output_dir=output --model_type=roberta --model_name_or_path=roberta-base --do_train --train_data_file=$TRAIN_FILE --do_eval --eval_data_file=$TEST_FILE --mlm --per_gpu_train_batch_size 1 --per_gpu_eval_batch_size 1
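
Before launching the full script, a quick way to check whether cuBLAS can initialize at all on the selected GPU is to run a tiny matrix multiplication on it. This is a minimal sketch, not part of the original report; with CUDA_VISIBLE_DEVICES=2 set, the single visible device shows up as cuda:0 inside the process.

import torch

# The one GPU exposed by CUDA_VISIBLE_DEVICES=2 is visible as cuda:0 here.
device = torch.device("cuda:0")

# A small matmul goes through cuBLAS, just like the F.linear call in the
# traceback below, so a failure here reproduces the initialization problem.
a = torch.randn(8, 16, device=device)
b = torch.randn(16, 4, device=device)
print((a @ b).shape)                          # expected: torch.Size([8, 4])
print(torch.cuda.memory_allocated(device))    # rough sanity check on GPU memory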

Expected behavior

Training should run without errors. Instead, it fails with the following traceback:
Traceback (most recent call last):
  File "run_lm_finetuning.py", line 497, in <module>
    main()
  File "run_lm_finetuning.py", line 451, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 189, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 237, in forward
    head_mask=head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 177, in forward
    head_mask=head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 625, in forward
    head_mask=head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 346, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 324, in forward
    attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 281, in forward
    self_outputs = self.self(input_tensor, attention_mask, head_mask)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/transformers/modeling_bert.py", line 200, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/media/user1/storage-1/Ashok_AI/mask_env/lib/python3.6/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:216
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|

Environment

  • OS: Linux
  • Python version: 3.6
  • PyTorch version: 1.2.0
  • PyTorch Transformers version: latest
  • Using GPU: yes, CUDA 10
  • Distributed or parallel setup: yes
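
For reference, version and GPU visibility can be confirmed with a small snippet like the following (a sketch, not part of the original report):

import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    print("GPU", i, "->", torch.cuda.get_device_name(i))
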
Label: wontfix

Most helpful comment

@YDYordanov @Hadjer13 I found the solution. In my case, my input example has two sentences, so I used token_type_ids the way I would with Bert, but it turns out I was passing the wrong token_type_ids to the RobertaModel. According to the transformers docs, RoBERTa does not make use of token type ids. So using [0,0,..0,1,1..1,0,0,..] as token_type_ids for RoBERTa is wrong; after I changed it to all zeros, i.e. [0,0,...,0,0], the error was fixed. Hope it can help someone!

All 13 comments

What GPU do you have?

Thanks for your reply and support, sir :)

NVIDIA TITAN RTX: 4 × 24 GB GPUs

Looks like your batch size may be too big?

Thank you so much for your support sir.

I set batch size = 1. Maybe the issue is in the latest branch; I will check out a previous master and then try again, sir.

Hi, I have the same error. Did you get this problem resolved?

I have the same error too

It may be because of this nn.Embedding issue in PyTorch. I had the same error. Check whether you have padded correctly, or whether you have included some invalid token.
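
One way to rule that out is to verify, on the CPU, that every input ID fits inside the model's embedding table before the batch is moved to the GPU. A minimal sketch, assuming roberta-base and a toy input (substitute your own batch):

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

input_ids = torch.tensor([tokenizer.encode("Hello world", add_special_tokens=True)])

# An id outside [0, vocab_size) corrupts the embedding lookup on the GPU and
# often only surfaces later as a cryptic CUDA/cuBLAS error, so check it here.
vocab_size = model.config.vocab_size
assert input_ids.min().item() >= 0 and input_ids.max().item() < vocab_size, \
    "token id out of range"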

Very similar issue with roberta-base (but not bert-base-cased/uncased):

RuntimeError: cublas runtime error : library not initialized at /opt/conda/conda-bld/pytorch_1573049306803/work/aten/src/THC/THCGeneral.cpp:216

I have checked, and it isn't a problem with nn.Embedding, nor a memory issue.

Very similar issue when using the CamemBERT model, which is based on RoBERTa.
Could you solve the issue? Any thoughts about it, please?

@YDYordanov Same as you when using roberta-base. Have you resolved it?

@YDYordanov @Hadjer13 I found the solution. In my case, my input example has two sentences, so I used token_type_ids the way I would with Bert, but it turns out I was passing the wrong token_type_ids to the RobertaModel. According to the transformers docs, RoBERTa does not make use of token type ids. So using [0,0,..0,1,1..1,0,0,..] as token_type_ids for RoBERTa is wrong; after I changed it to all zeros, i.e. [0,0,...,0,0], the error was fixed. Hope it can help someone!
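
In code, the fix described above amounts to something like the following (a minimal sketch using the standard roberta-base checkpoint, not the original poster's exact setup):

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Two sentences in one example, encoded as a pair.
input_ids = torch.tensor([tokenizer.encode("First sentence.", "Second sentence.",
                                           add_special_tokens=True)])

# BERT-style segment ids ([0,...,0,1,...,1]) index past RoBERTa's token type
# embedding (which has size 1) and can surface as the cuBLAS error above.
# For RoBERTa, pass all zeros or simply omit token_type_ids.
token_type_ids = torch.zeros_like(input_ids)

outputs = model(input_ids, token_type_ids=token_type_ids)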

Thank you!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
