import torch
import pytorch_lightning as pl

# U2NET, muti_bce_loss_fusion, get_loader, size, and lr are defined
# elsewhere in the training script (U-2-Net code base).

class LightningModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # not the best model...
        self.model = U2NET(3, 1)

    def forward(self, x):
        # called with self(x)
        return self.model(x)

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x = batch['image'].float()
        labels_v = batch['label'].double()
        d0, d1, d2, d3, d4, d5, d6 = self(x)
        loss2, loss = muti_bce_loss_fusion(d0, d1, d2, d3, d4, d5, d6, labels_v)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # self.parameters(), not the global `model`, which does not
        # exist yet when this module is constructed
        opt = torch.optim.Adam(self.parameters(), lr=lr,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
        # scheduler = CombineCos(opt, 0.2, lr, lr, 5e-4, len(self.train_dataloader()), 100)
        return opt

    def train_dataloader(self):
        loaders = get_loader([size, size])
        return loaders['train']

train_loader = get_loader([size, size])['train']
model = LightningModel()
trainer = pl.Trainer(gpus=1)
trainer.fit(model, train_loader)
I want to do distributed training with 2 GPUs using data parallel.
This works just fine with one GPU, but when I do trainer = pl.Trainer(gpus=2) or trainer = pl.Trainer(gpus=[0,1]) I get an error:

@bluesky314 I am not able to reproduce this. Can you provide the Lightning version and some code to reproduce it exactly?
I just installed Lightning via the pip install command given in the docs. My PyTorch version is 1.6. The code worked when gpus was set to 1. As you can see in the messages above and below, the error traces internally to multiprocessing files when trainer.fit is called and has little to do with my code; the RuntimeError also cannot be caused by any of my definitions.
Before this I was using wandb to log experiments with a single GPU; after I changed to multiple GPUs I got a wandb error saying wandb.init was not called. (wandb was not actually used to log anything, just to view the terminal log from the wandb page, so it did not interact with Lightning.) I think copying the program to multiple processes is leading to a problem.
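For what it's worth, that wandb symptom is consistent with the process-copying theory. Here is a minimal sketch (an assumption, not code from this thread; the project name and main() helper are illustrative) of keeping wandb.init inside the entry-point guard so that worker processes, which re-import the module, do not expect an initialized run:

import wandb

def main():
    # runs only when the script is executed directly, not when
    # multiprocessing re-imports this module in each worker
    wandb.init(project="u2net-ddp")  # hypothetical project name
    # ... build the model and trainer, then call trainer.fit ...

if __name__ == "__main__":
    main()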
Here is the extended error:

I just installed Lightning via the pip install command given in the docs
You will probably get 0.8.5. Could you try installing the 0.9.0rc16 release, or confirm that you already have it?
pip install pytorch-lightning==0.9.0rc16 --upgrade
I updated it as above but am getting the same error. The difference is that the print commands at the top of the files are printed twice, meaning that it is making 2 copies of the code, but the RuntimeError remains.
I think you need to put
if __name__ == "__main__":
    ...
around the top-level code of your script. This is a Python multiprocessing thing: it imports the module before running it, so you need to guard the entry point of the program. Closing this, as my suspicion is very high that this is the problem here. Let me know if this fixes it for you.
Thanks, adding:
def main():
    train_loader = get_loader([size, size])['train']
    model = LightningModel()
    trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
    trainer.fit(model, train_loader)

if __name__ == '__main__':
    main()
has removed the previous error but now the process gets stuck at:

The training progress does not start showing, but 1.6 GB gets used up on the 2 GPUs.
And what about distributed_backend="ddp_spawn"?
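For reference, a minimal sketch of that suggestion, assuming a 0.9.x-era Trainer where the backend is selected via the distributed_backend argument:

# same script as above, but Lightning spawns the worker processes itself
# via torch.multiprocessing.spawn instead of relaunching the script
trainer = pl.Trainer(gpus=2, distributed_backend='ddp_spawn')
trainer.fit(model, train_loader)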
What is dso_loader.cc? Make sure this actually supports running on multi-GPU. It seems to come from the TensorBoard library.
Have you trained this code before in plain PyTorch on multiple GPUs?
I am using PyTorch, so I am not sure why TensorBoard is getting involved, but it seems those operations were successful anyway. dso_loader.cc is not my file. This is an AWS instance with 2 GPUs.
Yes, I have used this code with nn.DataParallel and it worked there. To remove the single-process (CPU) bottleneck I wanted to use DistributedDataParallel, so someone recommended I try Lightning as it is easier to set up.
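For comparison, a minimal sketch of the plain-PyTorch nn.DataParallel setup being described (an assumed reconstruction, not code from this thread; the batch shape is illustrative):

import torch
import torch.nn as nn

# a single driver process replicates the model and splits each batch
# across GPUs along dim 0 -- that driver is the bottleneck mentioned above
net = nn.DataParallel(U2NET(3, 1)).cuda()     # U2NET as in the script above
images = torch.randn(8, 3, 320, 320).cuda()   # dummy batch
d0, d1, d2, d3, d4, d5, d6 = net(images)      # outputs gathered back on GPU 0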
OK, is distributed_backend="ddp_spawn" also getting stuck?
Let me try it and get back to you.
Yeah, the original issue was a script not using the if __name__ == "__main__" guard, which is a PyTorch multiprocessing requirement.
I am having an issue with multi-GPU training:
mnist_model = MNISTModel()
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))
This does not start the training, but setting gpus=1 trains on a single GPU.
Frozen output image attached.

As the docs state, multi-GPU training is not supported on Jupyter or Colab; this is a limitation of those platforms, not Lightning.
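As a workaround sketch for notebook users (again assuming a 0.9.x-era Trainer): the single-process 'dp' backend wraps the model in DataParallel and so does run inside a notebook, at the cost of the driver-process bottleneck discussed above:

# 'dp' runs in one process, so no __main__ guard or worker relaunch is needed
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='dp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))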
@awaelchli @williamFalcon I have created a new issue for the new error: https://github.com/PyTorchLightning/pytorch-lightning/issues/3117