File "run_lm_finetuning.py", line 551, in
main()
File "run_lm_finetuning.py", line 503, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 240, in train
loss.backward()
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 5.93 GiB total capacity; 4.64 GiB already allocated; 54.94 MiB free; 233.05 MiB cached)
I encounter the above error on my Nvidia GTX 1060 6 GB with the GPT-2 small model. The training configs are:
batch size = 1
gradient accumulation steps = 1024 (I started without gradient accumulation, then tried it based on an old issue from this repo, working up from small values to this one, but the error always occurs).
If I run with no gradient accumulation, I get this instead:
File "run_lm_finetuning.py", line 228, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 549, in forward
inputs_embeds=inputs_embeds)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 460, in forward
head_mask=head_mask[i])
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 232, in forward
head_mask=head_mask)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 182, in forward
x = self.c_attn(x)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 488, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.93 GiB total capacity; 4.77 GiB already allocated; 12.81 MiB free; 154.93 MiB cached)
Can you please give me a hint on how to overcome this error, or a little hope of running GPT-2 small on 6 GB of GPU memory?
Thanks
If your GPU is out of memory, try decreasing the batch size. To keep the same effective batch size, use gradient accumulation.
Basically, call loss.backward() at every step, but only every, say, 10 steps call optimizer.step() followed by optimizer.zero_grad().
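A minimal sketch of that pattern, assuming model, optimizer, and dataloader are already set up (this is an illustration, not the exact loop in run_lm_finetuning.py):

```python
accumulation_steps = 10  # effective batch size = accumulation_steps * per-step batch size

model.train()
optimizer.zero_grad()
for step, inputs in enumerate(dataloader):
    labels = inputs  # causal LM fine-tuning: the model shifts the labels internally
    outputs = model(inputs, labels=labels)
    loss = outputs[0]
    # Scale the loss so the accumulated gradient matches one large batch
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that the script already does this via --gradient_accumulation_steps, and accumulation does not reduce the peak memory of a single forward/backward pass; it only makes a batch size of 1 behave like a larger batch for the optimizer.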
Hi @aced125, batch size is already set to 1 and I have tried gradient accumulation values from 1 to 1024; it always gives me a CUDA out of memory error.
Hijacking this issue to say that I have the exact same problem with xlm-mlm-17-1280. Even with batch size = 1, Apex enabled, and two 1080 Tis, it always gives a memory error in the first loss.backward() call.
Which optimizer are you using? If you're using the default (AdamW), that may be part of your problem. Different optimizers have different memory requirements, and Adam is one of the worst offenders. Give RMSprop a try, since it has much less memory overhead. Every additional feature, such as momentum, increases the memory overhead.
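For a sense of scale: Adam/AdamW keeps two extra fp32 state tensors per parameter (exp_avg and exp_avg_sq), while RMSprop without momentum keeps one (square_avg), so on GPT-2 small (~124M parameters) the optimizer state shrinks by roughly 500 MB. A minimal sketch of the swap, assuming the argument names follow the script's flags and ignoring the script's weight-decay parameter groups:

```python
import torch
from transformers import AdamW  # the script's default optimizer

# AdamW: two extra fp32 state tensors per parameter (exp_avg, exp_avg_sq)
# optimizer = AdamW(model.parameters(), lr=args.learning_rate, eps=args.adam_epsilon)

# RMSprop without momentum: one extra fp32 state tensor per parameter (square_avg)
optimizer = torch.optim.RMSprop(model.parameters(), lr=args.learning_rate)
```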
Hi @dvaltchanov, thanks, but using RMSprop led me to the same errors.
@antocapp Which block_size are you using? The default (512) or something else? Using a smaller block size (e.g. 256) will also use less memory. On a smaller card like yours, you basically need a batch size of 1, a small block size, and a memory-efficient optimizer for the model to fit into GPU memory. An alternative is to run your experiment on Google Colab or another cloud service where you can get 12+ GB of GPU memory.
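To make the block_size point concrete: the script chops the training text into fixed-length chunks of block_size tokens, and activation memory grows with that length (the attention matrices grow quadratically in it). A rough sketch of the data side, using a placeholder train.txt file (the actual chunking lives in the script's dataset class):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
block_size = 256  # instead of 512; shorter sequences mean smaller activations per sample

with open("train.txt", encoding="utf-8") as f:  # hypothetical training file
    tokens = tokenizer.encode(f.read())

# Chop the token stream into fixed-length training examples of block_size tokens
examples = [tokens[i:i + block_size]
            for i in range(0, len(tokens) - block_size + 1, block_size)]
```

With block_size 256 instead of 512, each example and its activations are roughly half the size or less, which can be the difference between fitting and OOM on a 6 GB card.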
Hijacking this issue to say that I have the exact same problem with xlm-mlm-17-1280. Even with batch size = 1, Apex enabled, and two 1080 Tis, it always gives a memory error in the first loss.backward() call.
Hi, I have the exact same problem with xlm-mlm-17-1280. Have you solved this issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.