File "run_lm_finetuning.py", line 551, in
main()
File "run_lm_finetuning.py", line 503, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 240, in train
loss.backward()
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 5.93 GiB total capacity; 4.64 GiB already allocated; 54.94 MiB free; 233.05 MiB cached)
I encounter the above error on my Nvidia GTX 1060 6 GB with the GPT-2 small model. The training configs are:
batch size = 1
gradient accumulation steps = 1024 (I started without gradient accumulation, then tried it based on an old issue from this repo, working up from small values to this one, but the error always occurs).
If I run with no gradient accumulation, I get this instead:
File "run_lm_finetuning.py", line 228, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 549, in forward
inputs_embeds=inputs_embeds)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 460, in forward
head_mask=head_mask[i])
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 232, in forward
head_mask=head_mask)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 182, in forward
x = self.c_attn(x)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/antonio/anaconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 488, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.93 GiB total capacity; 4.77 GiB already allocated; 12.81 MiB free; 154.93 MiB cached)
Can you please give me a hint on how to overcome this error, or a little hope of running GPT-2 small on 6 GB of GPU memory?
Thanks
If your GPU is out of memory, try decreasing the batch size. To keep the same effective batch size, use gradient accumulation.
Basically, call loss.backward() at every step, but only every, say, 10 steps call optimizer.step() followed by optimizer.zero_grad().
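A minimal sketch of that pattern, assuming model, optimizer, and dataloader are already set up (this is an illustration, not the exact loop in run_lm_finetuning.py):

```python
accumulation_steps = 10  # effective batch size = accumulation_steps * per-step batch size

model.train()
optimizer.zero_grad()
for step, inputs in enumerate(dataloader):
    labels = inputs  # causal LM fine-tuning: the model shifts the labels internally
    outputs = model(inputs, labels=labels)
    loss = outputs[0]
    # Scale the loss so the accumulated gradient matches one large batch
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that the script already does this via --gradient_accumulation_steps, and accumulation does not reduce the peak memory of a single forward/backward pass; it only makes a batch size of 1 behave like a larger batch for the optimizer.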
Hi @aced125, batch size is already set to 1 and I have tried gradient accumulation values from 1 to 1024; it always gives me a CUDA out of memory error.
Hijacking this issue to say that I have the exact same problem with xlm-mlm-17-1280. Even with batch size = 1, Apex enabled, and two 1080 Tis, it always gives a memory error in the first loss.backward() call.
Which optimizer are you using? If you're using the default (AdamW), that may be part of your problem. Different optimizers have different memory requirements, and Adam is one of the worst offenders. Give RMSprop a try, since it has much less memory overhead. Every additional feature, such as momentum, increases the memory overhead.
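For a sense of scale: Adam/AdamW keeps two extra fp32 state tensors per parameter (exp_avg and exp_avg_sq), while RMSprop without momentum keeps one (square_avg), so on GPT-2 small (~124M parameters) the optimizer state shrinks by roughly 500 MB. A minimal sketch of the swap, assuming the argument names follow the script's flags and ignoring the script's weight-decay parameter groups:

```python
import torch
from transformers import AdamW  # the script's default optimizer

# AdamW: two extra fp32 state tensors per parameter (exp_avg, exp_avg_sq)
# optimizer = AdamW(model.parameters(), lr=args.learning_rate, eps=args.adam_epsilon)

# RMSprop without momentum: one extra fp32 state tensor per parameter (square_avg)
optimizer = torch.optim.RMSprop(model.parameters(), lr=args.learning_rate)
```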
Hi @dvaltchanov, thanks, but using RMSprop led me to the same errors.
@antocapp Which block_size are you using? The default (512) or something else? Using a smaller block size (e.g. 256) will also use less memory. On a smaller card like yours, you basically need a batch size of 1, a small block size, and a memory-efficient optimizer for the model to fit into GPU memory. An alternative is to run your experiment on Google Colab or another cloud service where you can get 12+ GB of GPU memory.
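To make the block_size point concrete: the script chops the training text into fixed-length chunks of block_size tokens, and activation memory grows with that length (the attention matrices grow quadratically in it). A rough sketch of the data side, using a placeholder train.txt file (the actual chunking lives in the script's dataset class):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
block_size = 256  # instead of 512; shorter sequences mean smaller activations per sample

with open("train.txt", encoding="utf-8") as f:  # hypothetical training file
    tokens = tokenizer.encode(f.read())

# Chop the token stream into fixed-length training examples of block_size tokens
examples = [tokens[i:i + block_size]
            for i in range(0, len(tokens) - block_size + 1, block_size)]
```

With block_size 256 instead of 512, each example and its activations are roughly half the size or less, which can be the difference between fitting and OOM on a 6 GB card.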
Hijacking this issue to say that I have the exact same problem with xlm-mlm-17-1280. Even with batch size = 1, Apex enabled, and two 1080 Tis, it always gives a memory error in the first loss.backward() call.
Hi, I have the exact same problem with xlm-mlm-17-1280. Have you solved this issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.