Running the gan.py example with Trainer(ngpus=2) causes two types of errors:
1. Trainer(ngpus=2, distributed_backend='dp')

Exception has occurred: AttributeError
'NoneType' object has no attribute 'detach'
  File "/home/user/gan.py", line 146, in training_step
    self.discriminator(self.generated_imgs.detach()), fake)
2. Trainer(ngpus=2, distributed_backend='ddp')

In ./lightning_logs, one run creates two folders: version_0 and version_1. It seems that each subprocess tries to create its own checkpoints and delete the ones it did not create.
python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1
Hi! Thanks for your contribution! Great first issue!
The problem is that the gan.py example is supposed to use the buffered values self.generated_imgs and self.last_imgs; however, during replication and gathering in https://github.com/PyTorchLightning/pytorch-lightning/blob/22a7264e9a77ef70154e3ad7c926133c9f2205cd/pytorch_lightning/overrides/data_parallel.py#L64 the buffered values are neither replicated nor gathered back to the main LightningModule.
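A minimal sketch of the pitfall (the Demo module is hypothetical, and it assumes two visible GPUs): attributes set inside forward land on per-device replicas, which nn.DataParallel throws away after gathering the outputs, so the original module never sees them.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)
        self.last_out = None  # buffered value, analogous to self.generated_imgs in gan.py

    def forward(self, x):
        out = self.layer(x)
        self.last_out = out  # set on a *replica*, not on the original module
        return out

model = nn.DataParallel(Demo().cuda(), device_ids=[0, 1])
model(torch.randn(8, 4).cuda())
print(model.module.last_out)  # still None: replica attributes are discarded after gather
```

This is exactly why the discriminator step in gan.py sees self.generated_imgs as None and raises the AttributeError above.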
@armavox good catch, mind drafting a PR? :robot:
yep, I'll try
@armavox how is it going?
@Borda I expect to fix it by May.
@armavox Any updates on this? Having the same issue...
Made some updates. Sorry for the wait.
There is an official warning about the use of local (here, buffered) variables during distributed training: https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel
So I didn't try to add workarounds in the Lightning code and only fixed the example to work with dp and ddp.
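A simplified sketch of that kind of fix (not the exact patch; the loss function and attribute names are placeholders modeled on the example): the discriminator step regenerates the fake images locally instead of reading an attribute cached by the generator step, so nothing depends on state that dp replicas cannot share.

```python
import torch
import torch.nn.functional as F

# Hypothetical simplified training_step; assumes self.generator,
# self.discriminator and self.latent_dim exist as in the gan.py example.
def training_step(self, batch, batch_idx, optimizer_idx):
    imgs, _ = batch
    z = torch.randn(imgs.size(0), self.latent_dim, device=imgs.device)
    valid = torch.ones(imgs.size(0), 1, device=imgs.device)

    if optimizer_idx == 0:
        # Generator step: everything it needs is computed locally.
        generated = self.generator(z)
        g_loss = F.binary_cross_entropy(self.discriminator(generated), valid)
        return {'loss': g_loss}

    # Discriminator step: regenerate images locally instead of reading
    # self.generated_imgs, which is None on dp replicas.
    generated = self.generator(z).detach()
    fake = torch.zeros(imgs.size(0), 1, device=imgs.device)
    real_loss = F.binary_cross_entropy(self.discriminator(imgs), valid)
    fake_loss = F.binary_cross_entropy(self.discriminator(generated), fake)
    return {'loss': (real_loss + fake_loss) / 2}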
The problem from point 2 in the opening post seems to have been fixed by someone already, but the unused folder for the parallel run is still created during ddp training. The problem is in https://github.com/PyTorchLightning/pytorch-lightning/blob/fdbbe968256f6c68a5dbb840a2004b77a618ef61/pytorch_lightning/trainer/callback_config.py#L66, which doesn't use the rank_zero_only decorator or a similar guard.
I would propose a fix, but I don't know how to do it elegantly; a rough sketch of the idea is below.
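A minimal sketch of such a guard (Lightning ships its own rank_zero_only decorator; the env-var names here are assumptions about how the ddp launcher exposes the process rank, and configure_checkpoint_dir is a hypothetical example): filesystem side effects run only on rank 0, so the extra version_* folder is never created by the other processes.

```python
import os
from functools import wraps

def rank_zero_only(fn):
    """Run fn only on global rank 0; other ranks get None."""
    @wraps(fn)
    def wrapped(*args, **kwargs):
        # Assumption: the ddp launcher exposes the rank via RANK / LOCAL_RANK.
        rank = int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", 0)))
        if rank == 0:
            return fn(*args, **kwargs)
        return None
    return wrapped

@rank_zero_only
def configure_checkpoint_dir(path):
    os.makedirs(path, exist_ok=True)  # only rank 0 creates the folder
    return path
```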
Thanks for your work!
Best regards, Artem.