Pytorch-lightning: Memory leaks: processes still remain in the background even after the code has finished.

Created on 12 Jul 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

Hi!
I'm new to Lightning and have only been using it for a day. However, I've found some critical issues, especially with memory leaks in multi-GPU training:

(1) Even after the code finishes and exits, the processes remain in the background.
(2) After I kill those processes manually one by one, some processes still seem to occupy GPU memory. For example:
[screenshot]

By the way, there are some other issues in multi-GPU settings as well.

Priority P0 bug / fix

All 9 comments

The following steps reproduce the memory leak in my code reliably (100% of the time), for your reference:

  • Clone the code from https://github.com/ShomyLiu/pytorch_bert_elmo_example
  • Some third-party packages are needed:

    • fire, transformers, and so on

    • the environment variable export TOKENIZERS_PARALLELISM=False may need to be set to suppress the tokenizer warnings.

  • Go to the data directory, then download and unzip the dataset from the Google Drive link in data/README.md:
cd data
unzip bert_elmo_glove.weight.zip
  • Check out the pl branch ("pl" means PyTorch Lightning):
git checkout pl
  • Run the code in a multi-GPU setting, which leads to the memory leak, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5

Each run leaves behind a process that does not exit:
[screenshot]

My environment:

  • python3.6.8
  • NVIDIA-SMI: 418.39
  • CUDA: 10.0
  • pytorch: 1.5.1+cu101
  • pytorch-lightning: 0.8.5

I'm not sure whether the cause is in PyTorch or in Lightning.
@awaelchli

Does it also happen with distributed_backend="ddp_spawn"?

Hi, I have tried the different backend settings:

  • distributed_backend="ddp_spawn" with num_workers=0: in this setting there is no memory leak. However, there is a warning:
 You are using `distributed_backend=ddp_spawn` with num_workers=0. For much faster performance, switch to `distributed_backend=ddp` and set `num_workers>0`
  • distributed_backend="dp": this directly raises an error:
RuntimeError: arguments are located on different GPUs

In addition, with distributed_backend="ddp", if the code runs to completion, the memory leak happens; but if I interrupt the program manually with Ctrl-C while it is running, the memory leak does not occur. Hope this helps.
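
For context, here is a minimal, self-contained sketch of how these backend settings are selected with a Lightning 0.8.x-style Trainer (the ToyModel below is a placeholder for illustration, not the model from the repro repository):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    # tiny stand-in model; the real model lives in the repro repository
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": torch.nn.functional.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
        # num_workers=0 was needed to avoid the leak with ddp_spawn
        return DataLoader(data, batch_size=8, num_workers=0)

if __name__ == "__main__":
    # "ddp_spawn": no leak in my tests (with num_workers=0), but slower.
    # "dp": raised "RuntimeError: arguments are located on different GPUs".
    # "ddp": one child process per GPU; these are the processes left behind.
    trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp_spawn", max_epochs=5)
    trainer.fit(ToyModel())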

I found this in the code:

def forward(self, x, device):
        self.device = device

this does not look right. LightningModule also has a self.device attribute; these assignments could leave data on the wrong device and maybe cause the memory leak?

https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#init-tensors-using-type-as-and-register-buffer
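
For reference, a minimal sketch of the pattern that documentation page recommends, so new tensors follow the input instead of an explicitly passed device (illustrative only, not the code from the repro repository):

import torch
import torch.nn as nn

class Net(nn.Module):
    # illustrative submodule: new tensors follow the input's device and dtype
    def __init__(self, hidden=16):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        # register_buffer: moved automatically when the module is moved with .to()/.cuda()
        self.register_buffer("scale", torch.ones(hidden))

    def forward(self, x):
        # type_as puts the new tensor on the same device (and dtype) as x,
        # so no device argument needs to be passed in
        noise = torch.randn(x.size()).type_as(x)
        return self.proj(x + noise) * self.scale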

I just closed another issue about how to put new tensors on the right device: #2585
Since new tensors are created in a submodule of the Model (i.e. the Net), I passed the device down to the Net.

def forward(self, x, device):
        self.device = device

Here self is a submodule, not the main pl.LightningModule, so I think this is just a variable holding the device information, regardless of the variable's name.
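
If it helps, the pattern being discussed looks roughly like this (a simplified reconstruction for illustration, not the exact code from the repository); the device attribute here lives on a plain nn.Module, not on the pl.LightningModule:

import torch
import torch.nn as nn
import pytorch_lightning as pl

class Net(nn.Module):
    # plain submodule: self.device here is just a stored value,
    # unrelated to LightningModule.device
    def forward(self, x, device):
        self.device = device
        # new tensor created explicitly on the passed-in device
        mask = torch.ones(x.size(0), device=self.device)
        return x * mask.unsqueeze(-1)

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = Net()
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        # the LightningModule hands its current device down to the submodule
        return self.head(self.net(x, x.device))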

  • Run the code in a multi-GPU setting, which leads to the memory leak, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5

There is no --gpu_id argument; you should use --gpus.

@Borda Hi, --gpu_id is an argument defined in my own code, and I pass config.gpu_id into the gpus argument of the Trainer.
That said, maybe this issue has been resolved in the current version; I will check again ASAP.
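
To make the wiring concrete, the setup is roughly like this (a simplified sketch, not the exact main.py; the function body here is only illustrative):

import fire
import pytorch_lightning as pl

def train(gpu_id=(0,), epochs=5):
    # fire parses --gpu_id=[0,1] on the command line into a Python list,
    # which is passed straight through as the Trainer's `gpus` argument
    trainer = pl.Trainer(gpus=list(gpu_id), max_epochs=epochs,
                         distributed_backend="ddp")
    print("Trainer ready: gpus=%s, epochs=%d" % (list(gpu_id), epochs))
    # trainer.fit(model) would follow here with the real model from the repo

if __name__ == "__main__":
    # `python3 main.py train --gpu_id=[0,1] --epochs=5` dispatches to train()
    fire.Fire()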

I can confirm this is still an issue. I also run into it very often when I kill ddp training. The problem is that the kill signal (e.g. a keyboard interrupt) is not sent to the child processes in ddp, so they keep running.
I promise I will get back to #2165 soon to fix it.
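
In the meantime, as a stopgap only (this is not the planned fix, just a cleanup sketch that assumes nvidia-smi is on PATH), the orphaned children still holding GPU memory can be found and terminated like this:

import os
import signal
import subprocess

def kill_orphaned_gpu_processes(sig=signal.SIGTERM):
    # ask nvidia-smi for every PID that currently holds GPU compute memory
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        universal_newlines=True,
    )
    for line in out.splitlines():
        pid = line.strip()
        if pid.isdigit():
            # careful: this kills *all* compute processes on the node,
            # including jobs you may still want to keep
            print("killing PID %s" % pid)
            os.kill(int(pid), sig)

if __name__ == "__main__":
    kill_orphaned_gpu_processes()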

@awaelchli Thanks for your great effort, and it's indeed a critical issue.
