Transformers: run_clm.py training script failing with CUDA out of memory error, using gpt2 and arguments from docs.

Created on 22 Nov 2020 · 3 comments · Source: huggingface/transformers

Environment info

  • transformers version: 3.5.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.0+cu101 (True)
  • Tensorflow version (GPU?): 2.3.0 (True)
  • Using GPU in script?: Yes, via official run_clm.py script
  • Using distributed or parallel set-up in script?: No

Who can help

albert, bert, GPT2, XLM: @LysandreJik
Trainer: @sgugger

Information

Model I am using: GPT2

The problem arises when using:

  • [x] the official example scripts: language-modeling/run_clm.py
  • [ ] my own modified scripts: (give details below)

I'm running the provided example:

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

and getting this error:

RuntimeError: CUDA out of memory.

on the first pass through Trainer.training_step()

Full traceback:

2020-11-22 22:02:22.921355: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
11/22/2020 22:02:24 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
11/22/2020 22:02:24 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/tmp/test-clm', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Nov22_22-02-24_f7d2e15228b7', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/tmp/test-clm', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None)
Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91)
[INFO|configuration_utils.py:413] 2020-11-22 22:02:24,711 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/torch/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:449] 2020-11-22 22:02:24,711 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}

[INFO|configuration_utils.py:413] 2020-11-22 22:02:24,791 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/torch/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:449] 2020-11-22 22:02:24,791 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1650] 2020-11-22 22:02:25,081 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /root/.cache/torch/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1650] 2020-11-22 22:02:25,081 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /root/.cache/torch/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1650] 2020-11-22 22:02:25,082 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /root/.cache/torch/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:940] 2020-11-22 22:02:25,230 >> loading weights file https://huggingface.co/gpt2/resolve/main/pytorch_model.bin from cache at /root/.cache/torch/transformers/752929ace039baa8ef70fe21cdf9ab9445773d20e733cf693d667982e210837e.323c769945a351daa25546176f8208b3004b6f563438a7603e7932bae9025925
[INFO|modeling_utils.py:1056] 2020-11-22 22:02:30,168 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1065] 2020-11-22 22:02:30,168 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-e3061a317d13eb90.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-a948c1d62c014b03.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-ea170b0cdcba7aa4.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-38ad73a52a8ec98e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-dd6364e0f6a6c9eb.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-c40818aaf33935e0.arrow
[INFO|trainer.py:388] 2020-11-22 22:02:35,382 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:388] 2020-11-22 22:02:35,382 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:693] 2020-11-22 22:02:35,385 >> ***** Running training *****
[INFO|trainer.py:694] 2020-11-22 22:02:35,385 >>   Num examples = 2318
[INFO|trainer.py:695] 2020-11-22 22:02:35,385 >>   Num Epochs = 3
[INFO|trainer.py:696] 2020-11-22 22:02:35,385 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:697] 2020-11-22 22:02:35,386 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:698] 2020-11-22 22:02:35,386 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:699] 2020-11-22 22:02:35,386 >>   Total optimization steps = 870
  0% 0/870 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_clm.py", line 351, in <module>
    main()
  File "run_clm.py", line 321, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 775, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1112, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1136, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 787, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 659, in forward
    output_attentions=output_attentions,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 295, in forward
    output_attentions=output_attentions,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 239, in forward
    attn_outputs = self._attn(query, key, value, attention_mask, head_mask, output_attentions)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 181, in _attn
    w = self.attn_dropout(w)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 983, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 14.73 GiB total capacity; 13.50 GiB already allocated; 137.81 MiB free; 13.55 GiB reserved in total by PyTorch)
  0% 0/870 [00:00<?, ?it/s]

The tasks I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)
    The wikitext dataset from the transformers example. I also attempted this with my own corpus.txt file; same issue with both.

To reproduce

Steps to reproduce the behavior:
I have a minimal reproduction on this Colab notebook

What I've checked so far:

I traced the problem to the `Trainer.training_step()` method. PR #6999 appears to have been an attempt to fix a similar problem; however, in my case the CUDA OOM error happens before the `loss.detach()` call, on the first pass of `training_step()`.

This is similar to issue #7169, except I'm not doing distributed training.

I've tested this both in Google Colab (1x GPU) and on an AWS EC2 g4dn.12xlarge instance (4x GPU), to rule out the obvious possibility of the Colab GPU simply being too small. Both max out with a "CUDA out of memory" error.
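
For anyone reproducing: a quick way to confirm how much GPU memory is actually free before launching (these are standard nvidia-smi query flags):

# print per-GPU total/used/free memory
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv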

I also tried using the TPU launcher script, which hit an error, but that's a separate issue.

I also tried the legacy run_language_modeling.py script with the same arguments on Colab (a friend had run it there successfully a few months ago). That failed with a different error:
AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'
but that's a separate issue.

Expected behavior

The docs say the expected behavior is a trained model written to the path given by the --output_dir flag:
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches a score of ~20 perplexity once fine-tuned on the dataset.

How do we fix this to make run_clm.py work?

All 3 comments

The comment you are mentioning was about the old run_language_modeling script, and probably with some more options for a K80 than what you are running the script with (we should probably remove it or update it with a proper command that gives those results). This doesn't look like a memory-leak problem; you just don't have enough GPU memory to run this large model with its full sequence length of 1,024. You could try one of the following (see the combined example after the list):

  • a smaller batch size with --per_device_train_batch_size 4 or even 2 (or use gradient accumulation)
  • a smaller sequence length with --block_size 512 or even 256
  • a smaller model with --model_name_or_path distilgpt2.
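
For example, a command combining these suggestions might look like the following (an untested sketch; --per_device_train_batch_size 2 with --gradient_accumulation_steps 4 keeps the effective batch size at 8):

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --block_size 512 \
    --output_dir /tmp/test-clm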

The smaller batch size of --per_device_train_batch_size 2 seems to be working for me. Just started the training process. Thank you very much for the extremely quick response, and for being an OSS maintainer @sgugger!

I'll likely drop one more update in this thread to confirm that it worked all the way through.

Can confirm - your advice works for me.

In fact, I managed to retrain even the XL model on A100 GPUs on the new p4d.24xlarge instances. Definitely high memory requirements, but doable with --model_name_or_path gpt2-xl --per_device_train_batch_size 1 --block_size 512
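
Spelled out, that run looks something like this (a sketch based on the flags above, assuming the same wikitext dataset arguments as the original example):

python run_clm.py \
    --model_name_or_path gpt2-xl \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 1 \
    --block_size 512 \
    --output_dir /tmp/test-clm-xl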

Thanks, team! Y'all have a https://buymeacoffee.com account I can send some brews to? I appreciate your work.
