Pytorch-lightning: gan.py multi-gpu running problems

Created on 24 Mar 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

Running the gan.py example with Trainer(ngpus=2) causes two types of error:

1. With Trainer(ngpus=2, distributed_backend='dp'):

       Exception has occurred: AttributeError
       'NoneType' object has no attribute 'detach'
         File "/home/user/gan.py", line 146, in training_step
           self.discriminator(self.generated_imgs.detach()), fake)

2. With Trainer(ngpus=2, distributed_backend='ddp'):
   - in ./lightning_logs a single run creates two folders: version_0 and version_1
   - an exception is raised:

         File "/opt/miniconda3/envs/ctln-gan/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 122, in _del_model
           os.remove(filepath)
         FileNotFoundError: [Errno 2] No such file or directory: '/home/user/pyproj/DCGAN/lightning_logs/version_1/checkpoints/epoch=0.ckpt'

It seems that each subprocess tries to create its own checkpoints and then deletes a checkpoint it did not create.
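The 'dp' failure above can be reproduced without any GPUs: DataParallel runs each forward/step on a *copy* of the module, so attributes the step writes onto `self` land on the replica, not on the original module. A minimal pure-Python sketch (the `Generator` class and `training_step` here are stand-ins, not the real gan.py code; `copy.deepcopy` stands in for torch's replicate):

```python
import copy

class Generator:
    """Stand-in for a LightningModule that buffers a value on self."""
    def __init__(self):
        self.generated_imgs = None  # buffered attribute, as in gan.py

    def training_step(self, batch):
        # The step stores its output on self...
        self.generated_imgs = [x * 2 for x in batch]
        return sum(self.generated_imgs)

# DataParallel-style execution: the replica is a copy of the module,
# so the attribute write happens on the copy, not on the original.
model = Generator()
replica = copy.deepcopy(model)   # mimics torch's replicate()
replica.training_step([1, 2, 3])

print(replica.generated_imgs)    # set on the replica only
print(model.generated_imgs)      # still None on the original module,
                                 # hence 'NoneType' object has no attribute 'detach'
```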

Environment:

python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1

bug / fix help wanted

All 9 comments

Hi! Thanks for your contribution, great first issue!

The problem is that the gan.py example is supposed to use the buffered values self.generated_imgs and self.last_imgs; however, during replication and gathering in https://github.com/PyTorchLightning/pytorch-lightning/blob/22a7264e9a77ef70154e3ad7c926133c9f2205cd/pytorch_lightning/overrides/data_parallel.py#L64
the buffered values are neither replicated nor gathered back to the main LightningModule.

@armavox good catch, mind drafting a PR? :robot:

yep, I'll try

@armavox how is it going?

@Borda I expect to fix it by May

@armavox Any updates on this? Having the same issue...

Made some updates. Sorry for the wait.

There is an official warning about the use of local (here, buffered) variables during distributed training: https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel

So I didn't try to work around this in the Lightning code and only fixed the example to work with dp and ddp.

The problem from point 2 in the opening post seems to have been fixed by someone. But an unused folder for the parallel process is still created during ddp training. The problem is in https://github.com/PyTorchLightning/pytorch-lightning/blob/fdbbe968256f6c68a5dbb840a2004b77a618ef61/pytorch_lightning/trainer/callback_config.py#L66, which doesn't use the rank_zero_only decorator or anything similar.
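For reference, a minimal self-contained sketch of the decorator pattern being suggested (Lightning ships its own rank_zero_only helper; this version, and the assumption that the ddp launcher exports a LOCAL_RANK environment variable, are for illustration only):

```python
import os
from functools import wraps

def rank_zero_only(fn):
    """Run fn only on the process with rank 0; other ranks return None,
    skipping side effects such as creating log/checkpoint folders."""
    @wraps(fn)
    def wrapped(*args, **kwargs):
        rank = int(os.environ.get("LOCAL_RANK", 0))  # assumed to be set by the launcher
        if rank == 0:
            return fn(*args, **kwargs)
        return None
    return wrapped

@rank_zero_only
def init_ckpt_dir(path):
    # hypothetical helper: only rank 0 would actually create the folder
    return f"created {path}"

os.environ["LOCAL_RANK"] = "1"
print(init_ckpt_dir("lightning_logs/version_0"))  # None -- skipped on rank 1
os.environ["LOCAL_RANK"] = "0"
print(init_ckpt_dir("lightning_logs/version_0"))  # created lightning_logs/version_0
```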

I would propose a fix, but I don't know how to do it elegantly.

Thanks for your work!
Best regards, Artem.

