Hi!
I'm new to Lightning and have only used it for one day. However, I have found some critical issues, especially memory leaks in multi-GPU training:
(1) Even after the code finishes and exits, the processes are still running in the background.
(2) After I kill those processes manually one by one, some processes still seem to occupy GPU memory. For example:

BTW, there are some other issues in multi-GPU settings.
The following steps reproduce the memory leak in my code every time (100%), for your reference:
1. export TOKENIZERS_PARALLELISM=False to suppress the tokenizers warning.
2. Download and unzip the dataset from the Google Drive link in data/README.md:
   cd data
   unzip bert_elmo_glove.weight.zip
3. Check out the pl branch ('pl' means pytorch lightning):
   git checkout pl
4. Run:
   python3 main.py train --gpu_id=[0,1] --epochs=5
Each run leaves behind a process that does not exit:

My environment:
I'm not sure whether the cause is in PyTorch or in Lightning.
@awaelchli
Does it also happen with distributed_backend="ddp_spawn"?
Hi, I have tried different backend settings:
- distributed_backend="ddp_spawn" with num_workers=0: in this setting there is no memory leak. However, there is a warning: You are using `distributed_backend=ddp_spawn` with num_workers=0. For much faster performance, switch to `distributed_backend=ddp` and set `num_workers>0`.
- distributed_backend="dp": this directly raises an error: RuntimeError: arguments are located on different GPUs.
In addition, with distributed_backend="ddp", if the code runs to completion, the memory leak happens. But if I interrupt the program manually with ctrl-c while it is running, the memory leak does not happen. Hope this helps.
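For reference, here is a minimal, self-contained sketch of how these three backend settings map onto the Trainer; the ToyModel below is only an illustration, not the model from my repo.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    train_loader = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)),
                              batch_size=8, num_workers=0)

    # (1) ddp_spawn: no leftover processes for me, but warns when num_workers=0
    trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp_spawn", max_epochs=5)

    # (2) dp: with my real model this raised
    #     "RuntimeError: arguments are located on different GPUs"
    # trainer = pl.Trainer(gpus=[0, 1], distributed_backend="dp", max_epochs=5)

    # (3) ddp: runs to completion but leaves processes holding GPU memory
    # trainer = pl.Trainer(gpus=[0, 1], distributed_backend="ddp", max_epochs=5)

    trainer.fit(ToyModel(), train_loader)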
I found this in the code:
def forward(self, x, device):
self.device = device
This does not look right. LightningModule also has a self.device attribute; these calls could leave data on the wrong device and maybe cause the memory leak?
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#init-tensors-using-type-as-and-register-buffer
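Something like the pattern from that page avoids passing the device around at all (just a sketch, not your actual model):

import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 8)
        # registered buffers are moved to the right device together with the module
        self.register_buffer("scale", torch.ones(8))

    def forward(self, x):
        # type_as makes the new tensor follow x's device and dtype,
        # so no explicit device argument is needed
        noise = torch.randn(x.size()).type_as(x)
        return self.layer(x + noise) * self.scale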
I just closed another issue about how to put new tensors onto the right device: #2585
Since new tensors are created in a submodule of the Model (i.e. the Net), I passed the device to the Net.
def forward(self, x, device):
self.device = device
Here self is a submodule rather than the main pl.LightningModule, so I think this is just a variable that holds the device information, regardless of the variable's name.
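If it helps, here is a simplified sketch of how the Net could create its new tensors without receiving a device argument at all (this Net is illustrative, not the real module from my repo):

import torch


class Net(torch.nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, x):
        # x.new_zeros allocates on x's device and dtype,
        # so the submodule never needs to store a device
        extra = x.new_zeros(x.size())
        return self.proj(x) + extra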
- Running the code in a multi-GPU setting leads to memory leaks, for example:
python3 main.py train --gpu_id=[0,1] --epochs=5
There is no --gpu_id attribute; you should use --gpus.
@Borda Hi, --gpu_id is an argparse argument in my own code, and I pass config.gpu_id to the gpus argument of the Trainer (roughly as sketched below).
That said, maybe this issue has already been resolved in the current version; I will check again ASAP.
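Roughly, the wiring looks like this; the parsing below is only an illustration of the mapping, not the actual code in main.py:

import argparse
import pytorch_lightning as pl

parser = argparse.ArgumentParser()
# --gpu_id=[0,1] arrives as the string "[0,1]"; eval here is just for illustration
parser.add_argument("--gpu_id", type=eval, default=[0])
parser.add_argument("--epochs", type=int, default=5)
config = parser.parse_args()

trainer = pl.Trainer(gpus=config.gpu_id, max_epochs=config.epochs,
                     distributed_backend="ddp")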
I can confirm this is still an issue. I also run into it very often when I kill a ddp training run. The problem is that the kill signal (a keyboard interrupt, for example) is not sent to the child processes in ddp, and they keep running.
I promise I will get back to #2165 soon to fix it.
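Until that fix lands, a rough stopgap is to clean up the orphaned workers yourself; the sketch below assumes the training command line contains main.py train (adjust the pattern to your own entry point) and that the third-party psutil package is installed:

import psutil

PATTERN = "main.py train"  # adjust to your own training command line

for proc in psutil.process_iter(["pid", "cmdline"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        if PATTERN in cmdline and proc.pid != psutil.Process().pid:
            print(f"killing leftover process {proc.pid}: {cmdline}")
            proc.kill()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass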
@awaelchli Thanks for your great effort, and it's indeed a critical issue.