Pytorch-lightning: Output not reduced in DP

Created on 20 Apr 2020 · 15 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I believe there is a discrepancy between the training and validation epoch ends when training on multiple GPUs with DP. I cannot release the code I'm working with as it is private, but I'll try to put together a reproduction.
The bug is that the outputs collected during validation are not reduced. This can be verified in the source of the evaluation loop: all references to the reduction function sit inside an if train branch.
This can lead to bugs that are hard to detect without reading the source code. For example, computing the mean of the loss in training_epoch_end works, but the same code in validation_epoch_end may fail depending on the number of GPUs, the number of validation samples, and the batch size.

To Reproduce

Given 4 GPUs, 10 validation samples, and a batch size of 4, the last validation batch has only two samples, so only two of the four GPUs receive data. Then:
Running torch.stack([x['loss'] for x in outputs]).mean() works in training_epoch_end, because the outputs are reduced before reaching training_epoch_end.
Running torch.stack([x['loss'] for x in outputs]).mean() fails in validation_epoch_end, because the outputs are not reduced and the first dimension of the output coming from the last batch differs, so torch.stack fails (see the sketch below).
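
To make the shape mismatch concrete, here is a minimal standalone sketch (not the actual Lightning code) of what the un-reduced DP outputs look like and why torch.stack fails on them, assuming each per-batch 'loss' is a 1-D tensor with one value per GPU that received data:

import torch

# Hypothetical un-reduced DP outputs: one loss value per GPU that got data.
# Full batches of 4 samples on 4 GPUs yield 4 values each, but the last
# batch (2 samples) yields only 2, so the shapes no longer match.
full_batch_loss = torch.rand(4)
last_batch_loss = torch.rand(2)

try:
    torch.stack([full_batch_loss, full_batch_loss, last_batch_loss]).mean()
except RuntimeError as err:
    print(err)  # sizes of the stacked tensors do not match

# Reducing each per-batch output first (as the training loop does) avoids the error.
loss = torch.stack([t.mean() for t in (full_batch_loss, full_batch_loss, last_batch_loss)]).mean()
print(loss)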

Expected behavior

I expect validation and training epoch ends to have the same behaviour; if they do not, there should be a clear warning.

Labels: DP · Priority P0 · bug / fix · help wanted

Most helpful comment

@Borda and @Deanamic : I can have a look at this if you do not mind :-)

All 15 comments

They should have the same behavior. Mind putting together a Colab to illustrate? MNIST should be enough, no?

This code illustrates the issue.
If you run with

  • 2 GPUs,
  • batch size: 2,
  • dataloader size: 3,

the outputs of the two training/validation steps will have sizes 2 and 1 in each epoch.
For training this is fine, since the outputs are reduced. With the sanity run disabled, we can verify that the exception occurs in validation_epoch_end.
On the other hand, if we run with a batch size of 3 the problem does not happen. I believe this bug only occurs when a GPU receives no data in a run, i.e. len(val_dataloader) % batch_size < n_gpus (see the sketch after the traceback below), which makes it very hard to diagnose and easy to miss.
The error is the following:

Traceback (most recent call last):
  File "dpbug.py", line 75, in <module>  
    main()  
  File "dpbug.py", line 72, in main  
    val_dataloaders = val_dl)  
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer  /trainer.py", line 702, in fit  
    self.dp_train(model)  
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer  /distrib_parts.py", line 538, in dp_train  
    self.run_pretrain_routine(model)  
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer  /trainer.py", line 865, in run_pretrain_routine  
    self.train()
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
    self.run_training_epoch()
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 457, in run_training_epoch
    self.run_evaluation(test_mode=self.testing)
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 372, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/cluster/home/dzhu/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 316, in _evaluate
    eval_results = model.validation_epoch_end(outputs)
  File "dpbug.py", line 44, in validation_epoch_end
    loss = torch.stack([x['loss'] for x in outputs]).mean()
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 1 and 2 in dimension 1 at /tmp/pip-req-build-4baxydiv/aten/src/THC/generic/THCTensorMath.cu:71
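
As a quick reference, the empty-GPU condition described above can be checked with a few lines of arithmetic (a sketch with a hypothetical helper, assuming batch_size >= n_gpus and that DP splits each batch along dimension 0 across the GPUs):

def can_hit_empty_gpu(num_val_samples, batch_size, n_gpus):
    # Size of the last (possibly partial) validation batch.
    last_batch = num_val_samples % batch_size or batch_size
    # If the last batch has fewer samples than GPUs, some GPU receives no data
    # and the per-batch loss tensors end up with different lengths.
    return last_batch < n_gpus

print(can_hit_empty_gpu(num_val_samples=3, batch_size=2, n_gpus=2))  # True  -> exception above
print(can_hit_empty_gpu(num_val_samples=3, batch_size=3, n_gpus=2))  # False -> runs fine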

QQ: Why are the calls to self.reduce_distributed_output() made only for training?
I think the fix is probably adding a call to self.reduce_distributed_output() here, but I haven't looked too deeply into the issue. I have quite a bit of work right now, so hopefully someone else can look into it.
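
For illustration, the kind of reduction meant here would look roughly like this (a sketch only, not Lightning's actual reduce_distributed_output implementation): average any tensor in a step output that still carries a per-GPU dimension, so validation_epoch_end always sees scalars.

import torch

def reduce_dp_output(output):
    # Recursively collapse per-GPU tensors (shape (n_gpus_used,)) to scalars.
    if isinstance(output, dict):
        return {k: reduce_dp_output(v) for k, v in output.items()}
    if torch.is_tensor(output) and output.dim() > 0:
        return output.float().mean()
    return output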

@Deanamic mind sending a PR? :]

I won't be able to work on this until approximately mid-May, so hopefully someone else can pick it up. Otherwise I'll do it when I'm available.

@Borda and @Deanamic : I can have a look at this if you do not mind :-)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@jbschiratti @Borda

I am curious about the status of this issue. Could this be fixed in the near future?

I am in the process of migrating from PyTorch to Lightning, and I ran into the same issue outlined above when I changed batch_size.

Thanks so much!

@junwen-austin would you be interested in sending a fix for this issue? :]

@Borda I'd love to, but my background is data science and I don't have much coding experience with DDP. I am using Lightning for my data science project at work and am really impressed by the design and philosophy of Lightning 👍

I didn't know this bug was still open. The fix shouldn't involve coding with DDP, as PyTorch should handle everything. I could try to fix it now, but I wonder whether there is any way to test on multiple GPUs, as I don't have access to them.

I think the fix is probably adding a call to self.reduce_distributed_output() here, but I haven't looked too deeply into the issue. I have quite a bit of work right now, so hopefully someone else can look into it.

This should probably fix the issue, but it would require further testing.

@jbschiratti @junwen-austin @Deanamic does anyone want to take this over?

This appears to be an issue in both training_epoch_end() and validation_epoch_end(), as there is now no reduction for either. This means DP basically doesn't work; looking into it.

Using Result objects seems to fix the issue, but it is still a problem when returning dicts. Not sure if you are done with refactoring DP, @williamFalcon.
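
For context, "Result objects" refers to the structured return types of the Lightning version in use at the time; a sketch of a validation_step using them (assuming the 0.9-era pl.EvalResult API, dropped into the Model class from the script below) would look roughly like:

    def validation_step(self, batch, batch_idx):
        output = self.forward(batch['x'].view(-1, 1))
        loss = self.criterion(batch['y'].view(-1), output.view(-1))
        # Per the comment above, logging through EvalResult appears to sidestep
        # the un-reduced-dict problem.
        result = pl.EvalResult(checkpoint_on=loss)
        result.log('val_loss', loss)
        return result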

ok... this is not a bug.

The batch sizes are too small and you're not dropping the last batch.
What ends up happening is that one GPU processes 2 items and the second only 1 item (the dataset has 3 items with a batch size of 2, so the last batch has a single sample).

As a result your outputs aren't the same length... this is not something we can automate... so you either have to check for that in your code, drop the incomplete last batch, or make sure your batch size is always a multiple of the number of GPUs.

from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer

class Data(Dataset):
    def __init__(self, sz):
        self.sz = sz

    def __len__(self):
        return self.sz

    def __getitem__(self, idx):
        return {'x' : torch.tensor(idx, dtype = torch.float32),
                'y': torch.tensor(0, dtype = torch.float32)}

class Model(pl.LightningModule):
    def __init__(self):
        super(Model, self).__init__()
        self.criterion = nn.MSELoss()
        self.layer = nn.Linear(1,1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        output = self.forward(batch['x'].view(-1, 1))
        loss = self.criterion(batch['y'].view(-1), output.view(-1))
        return {'loss': loss}

    def training_epoch_end(self, outputs):
        ## This is ok, outputs have been reduced
        loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'log': {'train_loss' : loss}}

    def validation_step(self, batch, batch_idx):
        output = self.forward(batch['x'].view(-1, 1))
        loss = self.criterion(batch['y'].view(-1), output.view(-1))

        r = {'loss': loss}
        print(r)
        return r

    def validation_epoch_end(self, outputs):
        import pdb; pdb.set_trace()
        ## This is not ok, outputs have not been reduced
        loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'log': {'val_loss' : loss}}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(),
                                    lr = 0.001)
        return optimizer



def main(backend, ngpu):
    b_size = 2
    gpus = list(range(ngpu))
    train_dl = DataLoader(Data(3), batch_size = b_size)

    # Batch size 2 on 2 gpus -> the second (partial) batch leaves one gpu empty.
    # The output is reduced before training_epoch_end, so training is fine, but
    # validation_epoch_end will run into an exception.
    # A batch size of 3 will run properly.
    val_dl = DataLoader(Data(3), batch_size = b_size)

    net = Model()
    trainer = Trainer(gpus = gpus,
                      max_epochs = 2,
                      distributed_backend = backend,
                      num_sanity_val_steps=0)
    trainer.fit(net,
                train_dataloader = train_dl,
                val_dataloaders = val_dl)


if __name__ == '__main__':
    main('dp', 2)
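
For anyone hitting this with dict returns in the meantime, two workarounds follow from the discussion above: pass drop_last=True to the validation DataLoader so every batch splits evenly across the GPUs, or make validation_epoch_end robust to un-reduced outputs. A minimal sketch of the latter, against the Model class above (each x['loss'] may be a scalar or a 1-D per-GPU tensor):

    def validation_epoch_end(self, outputs):
        # Reduce each per-batch loss to a scalar first: .mean() is a no-op on a
        # scalar tensor and averages the per-GPU values when DP left them
        # un-reduced, so torch.stack always sees matching shapes.
        loss = torch.stack([x['loss'].mean() for x in outputs]).mean()
        return {'log': {'val_loss': loss}}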
