Hi!
I'm new to Lightning and have only used it for one day. However, I have found some critical issues, especially memory leaks in multi-GPU training:
(1) Even after the code finishes and exits, the processes are still running in the background.
(2) After I kill those processes manually one by one, some processes still seem to occupy GPU memory. For example:

BTW, there are some other issues in multi-GPU settings.
The following steps reproduce the memory leak in my code every time (100%), for your reference:
1. export TOKENIZERS_PARALLELISM=False to suppress the tokenizers warning.
2. Download and unzip the dataset from the Google Drive link in data/README.md:
   cd data
   unzip bert_elmo_glove.weight.zip
3. Check out the pl branch ('pl' means pytorch lightning):
   git checkout pl
4. Run:
   python3 main.py train --gpu_id=[0,1] --epochs=5
Each run leaves behind a process that does not exit:

My environment:
I'm not sure whether the cause is in PyTorch or in Lightning.
@awaelchli
Does it also happen with distributed_backend="ddp_spawn"?
Hi, I have tried different backend settings:
- distributed_backend="ddp_spawn" with num_workers=0: in this setting there is no memory leak. However, there is a warning: You are using `distributed_backend=ddp_spawn` with num_workers=0. For much faster performance, switch to `distributed_backend=ddp` and set `num_workers>0`.
- distributed_backend="dp": this directly raises an error: RuntimeError: arguments are located on different GPUs.
In addition, with distributed_backend="ddp", if the code runs to completion, the memory leak happens. But if I interrupt the program manually with ctrl-c while it is running, the memory leak does not happen. Hope this helps.
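For reference, here is a minimal, self-contained sketch of how these three backend settings map onto the Trainer; the ToyModel below is only an illustration, not the model from my repo.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    train_loader = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)),
                              batch_size=8, num_workers=0)

    # (1) ddp_spawn: no leftover processes for me, but warns when num_workers=0
    trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp_spawn", max_epochs=5)

    # (2) dp: with my real model this raised
    #     "RuntimeError: arguments are located on different GPUs"
    # trainer = pl.Trainer(gpus=[0, 1], distributed_backend="dp", max_epochs=5)

    # (3) ddp: runs to completion but leaves processes holding GPU memory
    # trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp", max_epochs=5)

    trainer.fit(ToyModel(), train_loader)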
I found this in the code:
def forward(self, x, device):
self.device = device
This does not look right. LightningModule also has a self.device attribute; these calls could leave data on the wrong device and maybe cause the memory leak?
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#init-tensors-using-type-as-and-register-buffer
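Something like the pattern from that page avoids passing the device around at all (just a sketch, not your actual model):

import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 8)
        # registered buffers are moved to the right device together with the module
        self.register_buffer("scale", torch.ones(8))

    def forward(self, x):
        # type_as makes the new tensor follow x's device and dtype,
        # so no explicit device argument is needed
        noise = torch.randn(x.size()).type_as(x)
        return self.layer(x + noise) * self.scale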
I just closed another issue about how to put new tensors onto the right device: #2585
Since new tensors are created in a submodule of the Model (i.e. the Net), I passed the device to the Net.
def forward(self, x, device):
self.device = device
Here self is a submodule rather than the main pl.LightningModule, so I think this is just a variable that holds the device information, regardless of the variable's name.
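If it helps, here is a simplified sketch of how the Net could create its new tensors without receiving a device argument at all (this Net is illustrative, not the real module from my repo):

import torch


class Net(torch.nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x):
        # x.new_zeros allocates on x's device and dtype,
        # so the submodule never needs to store a device
        extra = x.new_zeros(x.size())
        return self.proj(x) + extra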
- Running the code in a multi-GPU setting leads to memory leaks, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5
There is no --gpu_id attribute; you should use --gpus.
@Borda Hi, --gpu_id is an argparse argument in my own code, and I pass config.gpu_id to the gpus argument of the Trainer (roughly as sketched below).
That said, maybe this issue has already been resolved in the current version; I will check again ASAP.
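Roughly, the wiring looks like this; the parsing below is only an illustration of the mapping, not the actual code in main.py:

import argparse
import pytorch_lightning as pl

parser = argparse.ArgumentParser()
# --gpu_id=[0,1] arrives as the string "[0,1]"; eval here is just for illustration
parser.add_argument("--gpu_id", type=eval, default=[0])
parser.add_argument("--epochs", type=int, default=5)
config = parser.parse_args()

trainer = pl.Trainer(gpus=config.gpu_id, max_epochs=config.epochs,
                     distributed_backend="ddp")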
I can confirm this is still an issue. I also run into it very often when I kill a ddp training run. The problem is that the kill signal (a keyboard interrupt, for example) is not sent to the child processes in ddp, and they keep running.
I promise I will get back to #2165 soon to fix it.
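Until that fix lands, a rough stopgap is to clean up the orphaned workers yourself; the sketch below assumes the training command line contains main.py train (adjust the pattern to your own entry point) and that the third-party psutil package is installed:

import psutil

PATTERN = "main.py train"  # adjust to your own training command line

for proc in psutil.process_iter(["pid", "cmdline"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        if PATTERN in cmdline and proc.pid != psutil.Process().pid:
            print(f"killing leftover process {proc.pid}: {cmdline}")
            proc.kill()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass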
@awaelchli Thanks for your great effort, and it's indeed a critical issue.