Pytorch-lightning: Error while training on multiple GPUs

Created on 30 Aug 2020 · 7 comments · Source: PyTorchLightning/pytorch-lightning

I get the following error when training on multiple GPUs. Training on a single GPU works fine.

avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [1] at entry 343
    def validation_step(self, batch, batch_idx):
        logits, softmax_logits = self.forward(**batch)
        loss, prediction_label_count = self.loss_function(logits, batch["labels"])

        accuracy = self.compute_accuracy(logits, batch["labels"])

        return {
            "val_loss": loss,
            "accuracy": accuracy,
            "prediction_label_count": prediction_label_count,
        }

    def validation_epoch_end(self, outputs_of_validation_steps):
        avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
        val_accuracy = torch.stack([x['accuracy'] for x in outputs_of_validation_steps]).mean()

        log = {'val_loss': avg_loss, "val_accuracy": val_accuracy}

        return {'val_loss': avg_loss, "val_accuracy": val_accuracy, 'log': log}
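
The size mismatch in the traceback comes from the final validation batch being smaller than the others. A minimal sketch (not from the original report; it assumes val_loss is a per-sample tensor of shape [batch_size], which the reported sizes [2] and [1] suggest) showing why torch.stack fails here while torch.cat does not:

    import torch

    # Per-batch loss tensors: the last batch has fewer samples than the others.
    full_batch_loss = torch.tensor([0.4, 0.6])   # shape [2]
    last_batch_loss = torch.tensor([0.5])        # shape [1]

    try:
        torch.stack([full_batch_loss, last_batch_loss])
    except RuntimeError as err:
        print(err)  # stack expects each tensor to be equal size, but got [2] ... and [1] ...

    # torch.cat joins along the existing dimension, so unequal batch sizes are fine.
    avg_loss = torch.cat([full_batch_loss, last_batch_loss]).mean()
    print(avg_loss)  # tensor(0.5000)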


Labels: question, won't fix


All 7 comments

Using drop_last=True is not an acceptable workaround.

Hi, I think I have solved that recently in #3020. Which version are you on? Please try to upgrade and let me know.

This issue persists in PyTorch Lightning v0.9.0.

@RahulSajnani are you using the Results object or the same kind of manual reduction shown in @nrjvarshney's code?
In the latter case the failure is expected: because you do the reduction manually, you need to use torch.cat instead of torch.stack.
However, I recommend using the Results API: https://pytorch-lightning.readthedocs.io/en/latest/results.html

@awaelchli I am using the same kind of manual reduction as @nrjvarshney. The reduction is as shown here:

epoch_train_loss = torch.stack([x['val_epoch_logger']['train_val_loss'] for x in outputs]).mean()

Yes, then it makes sense that it fails: torch.stack requires all tensors to have the same shape, so it breaks when the last batch has a different size.
Solution: use torch.cat, or use a Results object to do the reduction. Example using the code above:

    def validation_step(self, batch, batch_idx):
        # EvalResult is available as `from pytorch_lightning import EvalResult` in v0.9.
        logits, softmax_logits = self(**batch)
        loss, prediction_label_count = self.loss_function(logits, batch["labels"])

        accuracy = self.compute_accuracy(logits, batch["labels"])

        result = EvalResult()
        result.log('val_accuracy', accuracy, reduce_fx=torch.mean)  # mean is the default, so reduce_fx can be omitted
        result.log('val_loss', loss)
        return result

    # validation_epoch_end is not needed! The result object collects all the
    # accuracies/losses across steps, reduces them, and logs them.
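
If you prefer to keep the manual reduction rather than switch to the Results object, a minimal sketch of the torch.cat variant (assuming val_loss and accuracy are per-sample tensors of shape [batch_size], as in the error above):

    def validation_epoch_end(self, outputs_of_validation_steps):
        # torch.cat concatenates along the existing batch dimension, so a
        # smaller final batch no longer causes a size mismatch.
        avg_loss = torch.cat([x['val_loss'] for x in outputs_of_validation_steps]).mean()
        val_accuracy = torch.cat([x['accuracy'] for x in outputs_of_validation_steps]).mean()

        log = {'val_loss': avg_loss, 'val_accuracy': val_accuracy}
        return {'val_loss': avg_loss, 'val_accuracy': val_accuracy, 'log': log}

Note that this averages over samples rather than over batches, which is usually what you want when the batch sizes differ.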

Hope this helps.

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
