Pytorch-lightning: gan.py multi-gpu running problems

Created on 24 Mar 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

Running the gan.py example with Trainer(ngpus=2) causes two types of error:

1. With Trainer(ngpus=2, distributed_backend='dp'):

       Exception has occurred: AttributeError
       'NoneType' object has no attribute 'detach'
         File "/home/user/gan.py", line 146, in training_step
           self.discriminator(self.generated_imgs.detach()), fake)

2. With Trainer(ngpus=2, distributed_backend='ddp'):
   - in ./lightning_logs a single run creates two folders: version_0 and version_1
   - an exception is raised:

         File "/opt/miniconda3/envs/ctln-gan/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 122, in _del_model
           os.remove(filepath)
         FileNotFoundError: [Errno 2] No such file or directory: '/home/user/pyproj/DCGAN/lightning_logs/version_1/checkpoints/epoch=0.ckpt'

It seems that each subprocess tries to create its own checkpoints and then deletes a checkpoint it did not create.
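The 'dp' failure above can be reproduced without any GPUs: DataParallel runs each forward/step on a *copy* of the module, so attributes the step writes onto `self` land on the replica, not on the original module. A minimal pure-Python sketch (the `Generator` class and `training_step` here are stand-ins, not the real gan.py code; `copy.deepcopy` stands in for torch's replicate):

```python
import copy

class Generator:
    """Stand-in for a LightningModule that buffers a value on self."""
    def __init__(self):
        self.generated_imgs = None  # buffered attribute, as in gan.py

    def training_step(self, batch):
        # The step stores its output on self...
        self.generated_imgs = [x * 2 for x in batch]
        return sum(self.generated_imgs)

# DataParallel-style execution: the replica is a copy of the module,
# so the attribute write happens on the copy, not on the original.
model = Generator()
replica = copy.deepcopy(model)   # mimics torch's replicate()
replica.training_step([1, 2, 3])

print(replica.generated_imgs)    # set on the replica only
print(model.generated_imgs)      # still None on the original module,
                                 # hence 'NoneType' object has no attribute 'detach'
```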

Environment:

python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1

bug / fix help wanted

All 9 comments

Hi! Thanks for your contribution, great first issue!

The problem is that the gan.py example is supposed to use the buffered values self.generated_imgs and self.last_imgs; however, during replication and gathering in https://github.com/PyTorchLightning/pytorch-lightning/blob/22a7264e9a77ef70154e3ad7c926133c9f2205cd/pytorch_lightning/overrides/data_parallel.py#L64
the buffered values are neither replicated nor gathered back to the main LightningModule.

@armavox good catch, mind drafting a PR? :robot:

yep, I'll try

@armavox how is it going?

@Borda I expect to fix it by May

@armavox Any updates on this? Having the same issue...

Made some updates. Sorry for the wait.

There is an official warning about the use of local (here, buffered) variables during distributed training: https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel

So I didn't try to work around this in the Lightning code and only fixed the example to work with dp and ddp.

The problem from point 2 in the opening post seems to have been fixed by someone. But an unused folder for the parallel process is still created during ddp training. The problem is in https://github.com/PyTorchLightning/pytorch-lightning/blob/fdbbe968256f6c68a5dbb840a2004b77a618ef61/pytorch_lightning/trainer/callback_config.py#L66, which doesn't use the rank_zero_only decorator or anything similar.
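For reference, a minimal self-contained sketch of the decorator pattern being suggested (Lightning ships its own rank_zero_only helper; this version, and the assumption that the ddp launcher exports a LOCAL_RANK environment variable, are for illustration only):

```python
import os
from functools import wraps

def rank_zero_only(fn):
    """Run fn only on the process with rank 0; other ranks return None,
    skipping side effects such as creating log/checkpoint folders."""
    @wraps(fn)
    def wrapped(*args, **kwargs):
        rank = int(os.environ.get("LOCAL_RANK", 0))  # assumed to be set by the launcher
        if rank == 0:
            return fn(*args, **kwargs)
        return None
    return wrapped

@rank_zero_only
def init_ckpt_dir(path):
    # hypothetical helper: only rank 0 would actually create the folder
    return f"created {path}"

os.environ["LOCAL_RANK"] = "1"
print(init_ckpt_dir("lightning_logs/version_0"))  # None -- skipped on rank 1
os.environ["LOCAL_RANK"] = "0"
print(init_ckpt_dir("lightning_logs/version_0"))  # created lightning_logs/version_0
```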

I would propose a fix, but I don't know how to do it elegantly.

Thanks for your work!
Best regards, Artem.

