Pytorch-lightning: Output not reduced in DP

Created on 20 Apr 2020 · 15 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I believe there is a discrepancy between the training and validation epoch ends when training on multiple GPUs with DP. I cannot release the code I'm working with as it is private, but I'll try to put together a reproduction.
The bug is that the outputs collected during validation are not reduced. This can be verified in the source of the evaluation loop: all references to the reduction function sit inside an if train branch.
This can lead to bugs that are hard to detect without reading the source code. For example, computing the mean of the loss in training_epoch_end works, but the same code in validation_epoch_end may fail depending on the number of GPUs, the number of validation samples, and the batch size.

To Reproduce

Given 4 GPUs, 10 validation samples, and a batch size of 4, the last validation batch has only two samples, so only two of the four GPUs receive data. Then:
Running torch.stack([x['loss'] for x in outputs]).mean() works in training_epoch_end, because the outputs are reduced before reaching training_epoch_end.
Running torch.stack([x['loss'] for x in outputs]).mean() fails in validation_epoch_end, because the outputs are not reduced and the first dimension of the output coming from the last batch differs, so torch.stack fails (see the sketch below).
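
To make the shape mismatch concrete, here is a minimal standalone sketch (not the actual Lightning code) of what the un-reduced DP outputs look like and why torch.stack fails on them, assuming each per-batch 'loss' is a 1-D tensor with one value per GPU that received data:

import torch

# Hypothetical un-reduced DP outputs: one loss value per GPU that got data.
# Full batches of 4 samples on 4 GPUs yield 4 values each, but the last
# batch (2 samples) yields only 2, so the shapes no longer match.
full_batch_loss = torch.rand(4)
last_batch_loss = torch.rand(2)

try:
    torch.stack([full_batch_loss, full_batch_loss, last_batch_loss]).mean()
except RuntimeError as err:
    print(err)  # sizes of the stacked tensors do not match

# Reducing each per-batch output first (as the training loop does) avoids the error.
loss = torch.stack([t.mean() for t in (full_batch_loss, full_batch_loss, last_batch_loss)]).mean()
print(loss)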

Expected behavior

I expect validation and training epoch ends to have the same behaviour; if they do not, there should be a clear warning.

Labels: DP · Priority P0 · bug / fix · help wanted

Most helpful comment

@Borda and @Deanamic : I can have a look at this if you do not mind :-)

All 15 comments

They should have the same behavior. Mind putting together a Colab to illustrate? MNIST should be enough, no?

This code illustrates the issue.
If you run with

  • 2 GPUs,
  • batch size: 2,
  • dataloader size: 3,

the outputs of the two training/validation steps will have sizes 2 and 1 in each epoch.
For training this is fine, since the outputs are reduced. With the sanity run disabled, we can verify that the exception occurs in validation_epoch_end.
On the other hand, if we run with a batch size of 3 the problem does not happen. I believe this bug only occurs when a GPU receives no data in a run, i.e. len(val_dataloader) % batch_size < n_gpus (see the sketch after the traceback below), which makes it very hard to diagnose and easy to miss.
The error is the following:

Traceback (most recent call last):
  File "dpbug.py", line 75, in <module>  
    main()  
  File "dpbug.py", line 72, in main  
    val_dataloaders = val_dl)  
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer  /trainer.py", line 702, in fit  
    self.dp_train(model)  
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer  /distrib_parts.py", line 538, in dp_train  
    self.run_pretrain_routine(model)  
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer  /trainer.py", line 865, in run_pretrain_routine  
    self.train()
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
    self.run_training_epoch()
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 457, in run_training_epoch
    self.run_evaluation(test_mode=self.testing)
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 372, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 316, in _evaluate
    eval_results = model.validation_epoch_end(outputs)
  File "dpbug.py", line 44, in validation_epoch_end
    loss = torch.stack([x['loss'] for x in outputs]).mean()
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 2 in dimension 1 at /tmp/pip-req-build-4baxydiv/aten/src/THC/generic/THCTensorMath.cu:71
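
As a quick reference, the empty-GPU condition described above can be checked with a few lines of arithmetic (a sketch with a hypothetical helper, assuming batch_size >= n_gpus and that DP splits each batch along dimension 0 across the GPUs):

def can_hit_empty_gpu(num_val_samples, batch_size, n_gpus):
    # Size of the last (possibly partial) validation batch.
    last_batch = num_val_samples % batch_size or batch_size
    # If the last batch has fewer samples than GPUs, some GPU receives no data
    # and the per-batch loss tensors end up with different lengths.
    return last_batch < n_gpus

print(can_hit_empty_gpu(num_val_samples=3, batch_size=2, n_gpus=2))  # True  -> exception above
print(can_hit_empty_gpu(num_val_samples=3, batch_size=3, n_gpus=2))  # False -> runs fine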

QQ: Why are the calls to self.reduce_distributed_output() made only for training?
I think the fix is probably adding a call to self.reduce_distributed_output() here, but I haven't looked too deeply into the issue. I have quite a bit of work right now, so hopefully someone else can look into it.
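
For illustration, the kind of reduction meant here would look roughly like this (a sketch only, not Lightning's actual reduce_distributed_output implementation): average any tensor in a step output that still carries a per-GPU dimension, so validation_epoch_end always sees scalars.

import torch

def reduce_dp_output(output):
    # Recursively collapse per-GPU tensors (shape (n_gpus_used,)) to scalars.
    if isinstance(output, dict):
        return {k: reduce_dp_output(v) for k, v in output.items()}
    if torch.is_tensor(output) and output.dim() > 0:
        return output.float().mean()
    return output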

@Deanamic mind sending a PR? :]

I won't be able to work on this until approximately mid-May, so hopefully someone else can pick it up. Otherwise I'll do it when I'm available.

@Borda and @Deanamic : I can have a look at this if you do not mind :-)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@jbschiratti @Borda

I am curious about the status of this issue. Could this be fixed in the near future?

I am in the process of migrating from PyTorch to Lightning, and I ran into the same issue outlined above when I changed batch_size.

Thanks so much!

@junwen-austin would you be interested in sending a fix for this issue? :]

@Borda I'd love to, but my background is data science and I don't have much coding experience with DDP. I am using Lightning for my data science project at work and am really impressed by the design and philosophy of Lightning 👍

I didn't know this bug was still open. The fix shouldn't involve coding with DDP, as PyTorch should handle everything. I could try to fix it now, but I wonder whether there is any way to test on multiple GPUs, as I don't have access to them.

I think the fix is probably adding a call to self.reduce_distributed_output() here, but I haven't looked too deeply into the issue. I have quite a bit of work right now, so hopefully someone else can look into it.

This should probably fix the issue, but it would require further testing.

@jbschiratti @junwen-austin @Deanamic does anyone want to take this over?

This appears to be an issue in both training_epoch_end() and validation_epoch_end(), as there is now no reduction for either. This means DP basically doesn't work; looking into it.

Using Result objects seems to fix the issue, but it is still a problem when returning dicts. Not sure if you are done with refactoring DP, @williamFalcon.
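
For context, "Result objects" refers to the structured return types of the Lightning version in use at the time; a sketch of a validation_step using them (assuming the 0.9-era pl.EvalResult API, dropped into the Model class from the script below) would look roughly like:

    def validation_step(self, batch, batch_idx):
        output = self.forward(batch['x'].view(-1, 1))
        loss = self.criterion(batch['y'].view(-1), output.view(-1))
        # Per the comment above, logging through EvalResult appears to sidestep
        # the un-reduced-dict problem.
        result = pl.EvalResult(checkpoint_on=loss)
        result.log('val_loss', loss)
        return result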

ok... this is not a bug.

The batch sizes are too small and you're not dropping the last batch.
What ends up happening is that one GPU processes 2 items and the second only 1 item (the dataset has 3 items with a batch size of 2, so the last batch has a single sample).

As a result your outputs aren't the same length... this is not something we can automate... so you either have to check for that in your code, drop the incomplete last batch, or make sure your batch size is always a multiple of the number of GPUs.

from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer

class Data(Dataset):
    def __init__(self, sz):
        self.sz = sz

    def __len__(self):
        return self.sz

    def __getitem__(self, idx):
        return {'x' : torch.tensor(idx, dtype = torch.float32),
                'y': torch.tensor(0, dtype = torch.float32)}

class Model(pl.LightningModule):
    def __init__(self):
        super(Model, self).__init__()
        self.criterion = nn.MSELoss()
        self.layer = nn.Linear(1,1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        output = self.forward(batch['x'].view(-1, 1))
        loss = self.criterion(batch['y'].view(-1), output.view(-1))
        return {'loss': loss}

    def training_epoch_end(self, outputs):
        ## This is ok, outputs have been reduced
        loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'log': {'train_loss' : loss}}

    def validation_step(self, batch, batch_idx):
        output = self.forward(batch['x'].view(-1, 1))
        loss = self.criterion(batch['y'].view(-1), output.view(-1))

        r = {'loss': loss}
        print(r)
        return r

    def validation_epoch_end(self, outputs):
        import pdb; pdb.set_trace()
        ## This is not ok, outputs have not been reduced
        loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'log': {'val_loss' : loss}}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(),
                                    lr = 0.001)
        return optimizer



def main(backend, ngpu):
    b_size = 2
    gpus = list(range(ngpu))
    train_dl = DataLoader(Data(3), batch_size = b_size)

    # Batch size 2 on 2 gpus -> the second (partial) batch leaves one gpu empty.
    # The output is reduced before training_epoch_end, so training is fine, but
    # validation_epoch_end will run into an exception.
    # A batch size of 3 will run properly.
    val_dl = DataLoader(Data(3), batch_size = b_size)

    net = Model()
    trainer = Trainer(gpus = gpus,
                      max_epochs = 2,
                      distributed_backend = backend,
                      num_sanity_val_steps=0)
    trainer.fit(net,
                train_dataloader = train_dl,
                val_dataloaders = val_dl)


if __name__ == '__main__':
    main('dp', 2)
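
For anyone hitting this with dict returns in the meantime, two workarounds follow from the discussion above: pass drop_last=True to the validation DataLoader so every batch splits evenly across the GPUs, or make validation_epoch_end robust to un-reduced outputs. A minimal sketch of the latter, against the Model class above (each x['loss'] may be a scalar or a 1-D per-GPU tensor):

    def validation_epoch_end(self, outputs):
        # Reduce each per-batch loss to a scalar first: .mean() is a no-op on a
        # scalar tensor and averages the per-GPU values when DP left them
        # un-reduced, so torch.stack always sees matching shapes.
        loss = torch.stack([x['loss'].mean() for x in outputs]).mean()
        return {'log': {'val_loss': loss}}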
