transformers version: 3.5.1
albert, bert, GPT2, XLM: @LysandreJik
Trainer: @sgugger
Model I am using: GPT2
The problem arises when using the official example script. I'm running the provided example:
python run_clm.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-clm
and getting this error:
RuntimeError: CUDA out of memory.
on the first pass through Trainer.training_step()
Full traceback:
2020-11-22 22:02:22.921355: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
11/22/2020 22:02:24 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
11/22/2020 22:02:24 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/tmp/test-clm', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Nov22_22-02-24_f7d2e15228b7', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/tmp/test-clm', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None)
Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91)
[INFO|configuration_utils.py:413] 2020-11-22 22:02:24,711 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/torch/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:449] 2020-11-22 22:02:24,711 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"vocab_size": 50257
}
[INFO|configuration_utils.py:413] 2020-11-22 22:02:24,791 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/torch/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:449] 2020-11-22 22:02:24,791 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"vocab_size": 50257
}
[INFO|tokenization_utils_base.py:1650] 2020-11-22 22:02:25,081 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /root/.cache/torch/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1650] 2020-11-22 22:02:25,081 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /root/.cache/torch/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1650] 2020-11-22 22:02:25,082 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /root/.cache/torch/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:940] 2020-11-22 22:02:25,230 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /root/.cache/torch/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
[INFO|modeling_utils.py:1056] 2020-11-22 22:02:30,168 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1065] 2020-11-22 22:02:30,168 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-e3061a317d13eb90.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-a948c1d62c014b03.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-ea170b0cdcba7aa4.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-38ad73a52a8ec98e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-dd6364e0f6a6c9eb.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-c40818aaf33935e0.arrow
[INFO|trainer.py:388] 2020-11-22 22:02:35,382 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:388] 2020-11-22 22:02:35,382 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:693] 2020-11-22 22:02:35,385 >> ***** Running training *****
[INFO|trainer.py:694] 2020-11-22 22:02:35,385 >> Num examples = 2318
[INFO|trainer.py:695] 2020-11-22 22:02:35,385 >> Num Epochs = 3
[INFO|trainer.py:696] 2020-11-22 22:02:35,385 >> Instantaneous batch size per device = 8
[INFO|trainer.py:697] 2020-11-22 22:02:35,386 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:698] 2020-11-22 22:02:35,386 >> Gradient Accumulation steps = 1
[INFO|trainer.py:699] 2020-11-22 22:02:35,386 >> Total optimization steps = 870
0% 0/870 [00:00<?, ?it/s]Traceback (most recent call last):
File "run_clm.py", line 351, in <module>
main()
File "run_clm.py", line 321, in main
trainer.train(model_path=model_path)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 775, in train
tr_loss += self.training_step(model, inputs)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1112, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1136, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 787, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 659, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 295, in forward
output_attentions=output_attentions,
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 239, in forward
attn_outputs = self._attn(query, key, value, attention_mask, head_mask, output_attentions)
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 181, in _attn
w = self.attn_dropout(w)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 983, in dropout
else _VF.dropout(input, p, training))
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 14.73 GiB total capacity; 13.50 GiB already allocated; 137.81 MiB free; 13.55 GiB reserved in total by PyTorch)
0% 0/870 [00:00<?, ?it/s]
Steps to reproduce the behavior:
I have a minimal reproduction on this Colab notebook
I traced the problem to the Trainer.training_step() method. PR 6999 appears to have been an attempt to fix a similar problem; in my case, however, the CUDA OOM error happens before the loss.detach(), on the first pass through training_step().
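Roughly, the flow I'm seeing is the following (a paraphrase pieced together from the traceback above, not the actual transformers 3.5.1 source):

    def training_step(self, model, inputs):
        # forward pass -- this is where the OOM is raised (see traceback above)
        loss = self.compute_loss(model, inputs)
        # the backward pass and the detach mentioned above are never reached
        loss.backward()
        return loss.detach()

So the allocation fails during the model forward itself, before the detach ever runs.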
This is similar to issue 7169, except I'm not doing distributed training.
I've tested this both in Google Colab (1x GPU) and on an AWS EC2 g4dn.12xlarge instance (4x GPU), which rules out the obvious possibility of the Colab GPU simply being too small. Both max out with a "CUDA out of memory" error.
I also tried using the TPU launcher script, which hit an error, but that's a separate issue.
I also tried using the legacy run_language_modeling.py script with the same arguments on Colab (a friend had done so successfully a few months ago). I got this error there:
AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'
but that's a separate issue.
The docs say the expected behavior is a trained model written to the path given by the --output_dir flag:
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches a score of ~20 perplexity once fine-tuned on the dataset.
How do we fix this to make run_clm.py work?
The comment you are mentioning was about the old run_language_modeling script, and probably with different options for a K80 than what you are running the script with (we should probably remove it or update it with a proper command that gives those results). This doesn't look like a memory leak problem; you just don't have enough GPU memory to run this large model with its full sequence length of 1,024. You could try:
- --per_device_train_batch_size 4 or even 2 (or use gradient accumulation)
- --block_size 512 or even 256
- --model_name_or_path gpt2-medium or even distilgpt2
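For instance, something like this (untested off the top of my head, starting from the command you posted; adjust the numbers to your GPU):

python run_clm.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--per_device_train_batch_size 2 \
--block_size 512 \
--output_dir /tmp/test-clm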
The smaller --per_device_train_batch_size 2 batch size seems to be working for me. Just started the training process. Thank you very much for the extremely quick response, and for being an OSS maintainer @sgugger!
I'll likely drop one more update in this thread to confirm that it worked all the way through.
Can confirm - your advice works for me.
In fact, I managed to fine-tune even gpt2-xl on A100 GPUs on the new p4d.24xlarge instances. Definitely high memory requirements, but doable with --model_name_or_path gpt2-xl --per_device_train_batch_size 1 --block_size 512.
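For reference, the full command was essentially the original one with those flags swapped in (reconstructing from memory, so the remaining flags are just the defaults from the example above):

python run_clm.py \
--model_name_or_path gpt2-xl \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--per_device_train_batch_size 1 \
--block_size 512 \
--output_dir /tmp/test-clm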
Thanks, team! Do y'all have a https://buymeacoffee.com account I can send some brews to? I appreciate your work.