import torch
import pytorch_lightning as pl

# U2NET, muti_bce_loss_fusion, get_loader, size, and lr are defined
# elsewhere in the training script (U-2-Net code base).

class LightningModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # not the best model...
        self.model = U2NET(3, 1)

    def forward(self, x):
        # called with self(x)
        return self.model(x)

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x = batch['image'].float()
        labels_v = batch['label'].double()
        d0, d1, d2, d3, d4, d5, d6 = self(x)
        loss2, loss = muti_bce_loss_fusion(d0, d1, d2, d3, d4, d5, d6, labels_v)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # self.parameters(), not the global `model`, which does not
        # exist yet when this module is constructed
        opt = torch.optim.Adam(self.parameters(), lr=lr,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
        # scheduler = CombineCos(opt, 0.2, lr, lr, 5e-4, len(self.train_dataloader()), 100)
        return opt

    def train_dataloader(self):
        loaders = get_loader([size, size])
        return loaders['train']

train_loader = get_loader([size, size])['train']
model = LightningModel()
trainer = pl.Trainer(gpus=1)
trainer.fit(model, train_loader)
I want to do distributed training with 2 GPUs using data parallel.
This works just fine with one GPU, but when I do trainer = pl.Trainer(gpus=2) or trainer = pl.Trainer(gpus=[0,1]) I get an error:

@bluesky314 I am not able to reproduce this. Can you provide the Lightning version and some code to reproduce it exactly?
I just installed Lightning via the pip install command given in the docs. My PyTorch version is 1.6. The code worked when gpus was set to 1. As you can see in the messages above and below, the error traces internally to multiprocessing files when trainer.fit is called and has little to do with my code; the RuntimeError also cannot be caused by any of my definitions.
Before this I was using wandb to log experiments with a single GPU; after I changed to multiple GPUs I got a wandb error saying wandb.init was not called. (wandb was not actually used to log anything, just to view the terminal log from the wandb page, so it did not interact with Lightning.) I think copying the program to multiple processes is leading to a problem.
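For what it's worth, that wandb symptom is consistent with the process-copying theory. Here is a minimal sketch (an assumption, not code from this thread; the project name and main() helper are illustrative) of keeping wandb.init inside the entry-point guard so that worker processes, which re-import the module, do not expect an initialized run:

import wandb

def main():
    # runs only when the script is executed directly, not when
    # multiprocessing re-imports this module in each worker
    wandb.init(project="u2net-ddp")  # hypothetical project name
    # ... build the model and trainer, then call trainer.fit ...

if __name__ == "__main__":
    main()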
Here is the extended error:

I just installed Lightning via the pip install command given in the docs
You will probably get 0.8.5. Could you try installing the 0.9.0rc16 release, or confirm that you already have it?
pip install pytorch-lightning==0.9.0rc16 --upgrade
I updated it as above but am getting the same error. The difference is that the print commands at the top of the files are printed twice, meaning that it is making 2 copies of the code, but the RuntimeError remains.
I think you need to put
if __name__ == "__main__":
    ...
around the top-level code of your script. This is a Python multiprocessing thing: it imports the module before running it, so you need to guard the entry point of the program. Closing this, as my suspicion is very high that this is the problem here. Let me know if this fixes it for you.
Thanks, adding:
def main():
    train_loader = get_loader([size, size])['train']
    model = LightningModel()
    trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
    trainer.fit(model, train_loader)

if __name__ == '__main__':
    main()
has removed the previous error but now the process gets stuck at:

The training progress does not start showing, but 1.6 GB gets used up on the 2 GPUs.
And what about distributed_backend="ddp_spawn"?
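For reference, a minimal sketch of that suggestion, assuming a 0.9.x-era Trainer where the backend is selected via the distributed_backend argument:

# same script as above, but Lightning spawns the worker processes itself
# via torch.multiprocessing.spawn instead of relaunching the script
trainer = pl.Trainer(gpus=2, distributed_backend='ddp_spawn')
trainer.fit(model, train_loader)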
What is dso_loader.cc? Make sure this actually supports running on multi-GPU. It seems to come from the TensorBoard library.
Have you trained this code before in plain PyTorch on multiple GPUs?
I am using PyTorch, so I am not sure why TensorBoard is getting involved, but it seems those operations were successful anyway. dso_loader.cc is not my file. This is an AWS instance with 2 GPUs.
Yes, I have used this code with nn.DataParallel and it worked there. To remove the single-process (CPU) bottleneck I wanted to use DistributedDataParallel, so someone recommended I try Lightning as it is easier to set up.
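For comparison, a minimal sketch of the plain-PyTorch nn.DataParallel setup being described (an assumed reconstruction, not code from this thread; the batch shape is illustrative):

import torch
import torch.nn as nn

# a single driver process replicates the model and splits each batch
# across GPUs along dim 0 -- that driver is the bottleneck mentioned above
net = nn.DataParallel(U2NET(3, 1)).cuda()     # U2NET as in the script above
images = torch.randn(8, 3, 320, 320).cuda()   # dummy batch
d0, d1, d2, d3, d4, d5, d6 = net(images)      # outputs gathered back on GPU 0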
OK, is distributed_backend="ddp_spawn" also getting stuck?
Let me try it and get back to you.
Yeah, the original issue was a script not using the if __name__ == "__main__" guard, which is a PyTorch multiprocessing requirement.
I am having an issue with multi-GPU training:
mnist_model = MNISTModel()
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))
This does not start the training, but setting gpus=1 trains on a single GPU.
Frozen output image attached.

As the docs state, multi-GPU training is not supported on Jupyter or Colab; this is a limitation of those platforms, not Lightning.
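As a workaround sketch for notebook users (again assuming a 0.9.x-era Trainer): the single-process 'dp' backend wraps the model in DataParallel and so does run inside a notebook, at the cost of the driver-process bottleneck discussed above:

# 'dp' runs in one process, so no __main__ guard or worker relaunch is needed
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='dp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))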
@awaelchli @williamFalcon I have created a new issue for the new error: https://github.com/PyTorchLightning/pytorch-lightning/issues/3117