Pytorch-lightning: Error while training on multiple GPUs

Created on 30 Aug 2020 · 7 comments · Source: PyTorchLightning/pytorch-lightning

I get the following error when training on multiple GPUs. Training on a single GPU works fine.

avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [1] at entry 343
    def validation_step(self, batch, batch_idx):
        logits, softmax_logits = self.forward(**batch)
        loss, prediction_label_count = self.loss_function(logits, batch["labels"])

        accuracy = self.compute_accuracy(logits, batch["labels"])

        return {
            "val_loss": loss,
            "accuracy": accuracy,
            "prediction_label_count": prediction_label_count,
        }

    def validation_epoch_end(self, outputs_of_validation_steps):
        avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
        val_accuracy = torch.stack([x['accuracy'] for x in outputs_of_validation_steps]).mean()

        log = {'val_loss': avg_loss, "val_accuracy": val_accuracy}

        return {'val_loss': avg_loss, "val_accuracy": val_accuracy, 'log': log}
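
The size mismatch in the traceback comes from the final validation batch being smaller than the others. A minimal sketch (not from the original report; it assumes val_loss is a per-sample tensor of shape [batch_size], which the reported sizes [2] and [1] suggest) showing why torch.stack fails here while torch.cat does not:

    import torch

    # Per-batch loss tensors: the last batch has fewer samples than the others.
    full_batch_loss = torch.tensor([0.4, 0.6])   # shape [2]
    last_batch_loss = torch.tensor([0.5])        # shape [1]

    try:
        torch.stack([full_batch_loss, last_batch_loss])
    except RuntimeError as err:
        print(err)  # stack expects each tensor to be equal size, but got [2] ... and [1] ...

    # torch.cat joins along the existing dimension, so unequal batch sizes are fine.
    avg_loss = torch.cat([full_batch_loss, last_batch_loss]).mean()
    print(avg_loss)  # tensor(0.5000)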


Labels: question, won't fix


All 7 comments

Using drop_last=True is not an acceptable workaround.

Hi, I think I have solved that recently in #3020. Which version are you on? Please try to upgrade and let me know.

This issue persists in PyTorch Lightning v0.9.0.

@RahulSajnani are you using the Results object or the same kind of manual reduction shown in @nrjvarshney's code?
In the latter case the failure is expected: because you do the reduction manually, you need to use torch.cat instead of torch.stack.
However, I recommend using the Results API: https://pytorch-lightning.readthedocs.io/en/latest/results.html

@awaelchli I am using the same kind of manual reduction as @nrjvarshney. The reduction is as shown here:

epoch_train_loss = torch.stack([x['val_epoch_logger']['train_val_loss'] for x in outputs]).mean()

Yes, then it makes sense that it fails: torch.stack requires all tensors to have the same shape, so it breaks when the last batch has a different size.
Solution: use torch.cat, or use a Results object to do the reduction. Example using the code above:

    def validation_step(self, batch, batch_idx):
        # EvalResult is available as `from pytorch_lightning import EvalResult` in v0.9.
        logits, softmax_logits = self(**batch)
        loss, prediction_label_count = self.loss_function(logits, batch["labels"])

        accuracy = self.compute_accuracy(logits, batch["labels"])

        result = EvalResult()
        result.log('val_accuracy', accuracy, reduce_fx=torch.mean)  # mean is the default, so reduce_fx can be omitted
        result.log('val_loss', loss)
        return result

    # validation_epoch_end is not needed! The result object collects all the
    # accuracies/losses across steps, reduces them, and logs them.
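
If you prefer to keep the manual reduction rather than switch to the Results object, a minimal sketch of the torch.cat variant (assuming val_loss and accuracy are per-sample tensors of shape [batch_size], as in the error above):

    def validation_epoch_end(self, outputs_of_validation_steps):
        # torch.cat concatenates along the existing batch dimension, so a
        # smaller final batch no longer causes a size mismatch.
        avg_loss = torch.cat([x['val_loss'] for x in outputs_of_validation_steps]).mean()
        val_accuracy = torch.cat([x['accuracy'] for x in outputs_of_validation_steps]).mean()

        log = {'val_loss': avg_loss, 'val_accuracy': val_accuracy}
        return {'val_loss': avg_loss, 'val_accuracy': val_accuracy, 'log': log}

Note that this averages over samples rather than over batches, which is usually what you want when the batch sizes differ.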

Hope this helps.

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
