transformers version: 3.1.0
@LysandreJik, @sgugger, @patrickvonplaten
Model I am using: BERT (bert-base) and GPT2 (gpt2-large)
The problem arises when using: the official example script (run_language_modeling.py)
The task I am working on is: language modeling (pretraining / fine-tuning on my own dataset)
When I pretrain or fine-tune a model (in my case BERT and GPT2) using torch.distributed.launch, the CPU memory usage keeps growing until it hits the memory limit (>500 GB) and the first process is killed. If I train bert-base, it takes around 30 epochs until the first process is killed; when I train gpt2-large, it only takes about 3 epochs. Below is the command line I use to train/fine-tune bert-base (the GPT2 run is similar). The script run_language_modeling.py is a copy of transformers/examples/language-modeling/run_language_modeling.py (version 3.1.0).
python -m torch.distributed.launch --nproc_per_node=8 \
../run_language_modeling.py \
--output_dir $model_target \
--model_name_or_path $model_source \
--config_name $model_source \
--tokenizer_name $model_source \
--train_data_file $target_train \
--eval_data_file $target_test \
--save_total_limit 5 \
--block_size 128 \
--overwrite_output_dir \
--fp16 \
--num_train_epochs 50 \
--do_train --do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--mlm
I would expect the distributed training to run to completion without any memory issues.
Thanks for checking it.
This looks to be a duplicate of #7169
But I think my problem is running out of CPU memory, not GPU memory.
Ah my bad, I misread one letter ;-)
To fully understand your error, what's the dataset (particularly its size) you are training on?
The dataset (Indonesian Wikipedia) is around 522 MB.
Just some additional info: running the script in a single process doesn't have this issue. In my case, the memory usage is stable and stays at 16 GB after a few epochs (see the memory-check sketch below).
But I want to run it on multiple GPUs; it is just too slow with only one :-)
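As a side note, one way to confirm whether the resident CPU memory of each worker stays flat or keeps growing is to log the process RSS at the end of every epoch. The helper below is only a hypothetical illustration (it assumes psutil is installed, which the reporter does not mention):

import os
import psutil  # assumption: psutil is available in the environment

def log_cpu_rss(tag=""):
    # Print the resident set size (CPU memory) of the current process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"{tag} pid={os.getpid()} rss={rss_gb:.2f} GB")

Calling this from each rank after every epoch would make the difference visible: a steady ~16 GB in the single-process run versus unbounded growth under torch.distributed.launch.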
I tried the fix from #6999 manually (it is just a one-liner, changing return loss to return loss.detach()), and it seems to solve my memory leak issue. The fix has actually been available since version 3.2.0, but when I used version 3.2.0 with multi-GPU, the process just got stuck after 500 steps; maybe there is a deadlock among the processes? I may open another ticket for that issue.
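For context, here is a minimal, self-contained sketch of the pattern behind that fix (an illustration of the detach idea, not the actual transformers Trainer code): if the per-step loss is returned while still attached to the autograd graph and the caller accumulates it for logging, every step's graph stays reachable and memory keeps growing.

import torch
import torch.nn as nn

def training_step(model, batch, targets, optimizer, loss_fn):
    # One optimization step; returns the loss value for logging.
    model.train()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Returning `loss` as-is keeps its whole computation graph alive once
    # the caller accumulates it (tr_loss += loss); detaching lets PyTorch
    # free the graph after each step.
    return loss.detach()

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
tr_loss = torch.tensor(0.0)
for _ in range(100):
    batch, targets = torch.randn(32, 10), torch.randn(32, 1)
    tr_loss += training_step(model, batch, targets, optimizer, loss_fn)

With the accumulated loss detached, the memory held by each step's graph can be released, which matches the behavior change described in #6999.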