Pytorch-lightning: Train loss vs loss on progress bar

Created on 25 Mar 2020 · 12 Comments · Source: PyTorchLightning/pytorch-lightning

❓ Questions and Help

What is your question?

I don't understand why train_loss is different than loss even though I assign the same value. Perhaps one loss is calculated over the whole dataset and the other one is only for a recent batch? But if that's the case, which one is which?

Epoch 1:  79%|███████▉  | 691/870 [00:07<00:01, 100.85batch/s, accuracy=0.7, batch_idx=690, gpu=0, loss=0.442, train_loss=0.535, v_num=2]

Code

    def training_step(self, batch, batch_idx):
        ...
        loss = self.loss(...)
        tqdm_dict = {'train_loss': loss}
        outputs = {
            'loss': loss,
            'progress_bar': tqdm_dict,
            'log': tqdm_dict
        }
        return outputs

What's your environment?

  • OS: Ubuntu
  • Packaging: pip
  • Version: 0.6.0

All 12 comments

Hi! Thanks for your contribution! Great first issue!

Just to rephrase your question: you are not sure why the loss is there twice (once in outputs and a second time in tqdm), right?
The difference is that outputs are needed, e.g., for EarlyStopping, while tqdm is used just for visualization in the progress bar :]

@Borda No, I understand why there are 2 losses (loss=0.442, train_loss=0.535). I don't understand why they have different values even though they come from the same variable loss = self.loss(...)

That sounds strange. Do you have a minimal example we can run?
Would you mind using our test modules, so we can add it to the tests and make sure it doesn't happen again?

@Borda Well, I don't really need to, because even when I run the minimal example from https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html#minimal-example I get the discrepancy. So either there is a bug or I don't understand something :cry:

Initially, I thought that one of them could be mean loss over the whole training set, whereas the other one could be the loss from the last batch. That's why I asked.

import os

import pytorch_lightning as pl
import torch
import torchvision.transforms as transforms
from torch import nn
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST


class LitModel(pl.LightningModule):

    def __init__(self):
        super(LitModel, self).__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)

        loss = self.loss(y_hat, y)
        tqdm_dict = {'train_loss': loss}
        outputs = {
            'loss': loss,
            'progress_bar': tqdm_dict,
            'log': tqdm_dict
        }
        return outputs

    def train_dataloader(self):
        return DataLoader(
            dataset=MNIST(
                root=os.getcwd(),
                train=True,
                download=True,
                transform=transforms.ToTensor()
            ),
            batch_size=32
        )

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.01)


if __name__ == "__main__":
    trainer = pl.Trainer(
        checkpoint_callback=False,
        early_stop_callback=False,
        show_progress_bar=True,
        max_epochs=2,
    )

    model = LitModel()
    trainer.fit(model)

Result:

Epoch 2: : 1900it [00:06, 288.43it/s, loss=1.019, train_loss=0.836, v_num=27]

The loss on the progress bar is a running average. What you return (train_loss) is not.

@Borda can we note this in the docs?
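
To illustrate the difference, here is a minimal sketch of a windowed running average next to the raw per-batch value. It is not Lightning's actual code: running_mean is a hypothetical helper, the loss values are made up, and the window of 100 is an assumption suggested by the -100 magic number mentioned further down the thread.

```python
from collections import deque


def running_mean(values, window=100):
    """Yield (latest, mean of the last `window` values) for each new value."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` items
    for value in values:
        recent.append(value)
        yield value, sum(recent) / len(recent)


# Hypothetical per-batch losses that decrease during training.
batch_losses = [1.0, 0.8, 0.6, 0.4]
for latest, averaged in running_mean(batch_losses):
    # `latest` plays the role of train_loss, `averaged` the role of loss.
    print(f"train_loss={latest:.3f}  loss={averaged:.3f}")
```

The two values coincide only on the first batch. After that the average trails the latest value, so the displayed numbers can differ in either direction depending on whether the most recent batch was better or worse than the recent average.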

@williamFalcon Thanks for the clarification.

@williamFalcon What about the magic number, i.e. -100? Do you think it is worth exposing as a Trainer.__init__ argument? Frankly speaking, I already think that the number of arguments for Trainer.__init__ is overwhelming.

So which loss (averaged or non-averaged) is used for backpropagation and the parameter updates?
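
The update is driven by the per-batch value returned under 'loss'; the running average is only computed from past scalar values for display. A dependency-free sketch of that separation, in plain Python with a hand-written gradient instead of autograd (train and its arguments are hypothetical names, not Lightning API, and the data is made up):

```python
from collections import deque


def train(batches, lr=0.1, window=100):
    """Fit y = w * x with SGD; return w and a (batch_loss, shown_loss) history."""
    w = 0.0
    recent = deque(maxlen=window)  # display-only history of scalar losses
    history = []
    for xs, ys in batches:
        n = len(xs)
        # Loss and gradient are computed from the current batch only.
        batch_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # the update never reads `recent`
        recent.append(batch_loss)
        shown_loss = sum(recent) / len(recent)  # what a progress bar might show
        history.append((batch_loss, shown_loss))
    return w, history


# Hypothetical data generated by y = 2x, repeated for 20 steps.
batches = [([1.0, 2.0], [2.0, 4.0])] * 20
w, history = train(batches)
print(f"w={w:.4f}  last batch loss={history[-1][0]:.2e}  shown={history[-1][1]:.3f}")
```

Removing the `recent` bookkeeping would leave `w` unchanged, which is the point: the averaged number is a smoothed view for the reader, not an input to the optimizer.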

I also face the same problem.

During training, I'm using a custom loss function to train my model. However, the loss is displayed as 0.000, but when I display the same value under a different variable, it shows 4.73e-5 (a value in exponential format).
Epoch 80: 10%|██▌ | 100/1013 [01:33<14:11, 1.07it/s, loss=0.000, v_num=None, train_loss=4.73e-5]

Both loss and train_loss display the same value. Why does one display in exponential format and the other not?

Could this prevent the model from converging? When I train the model with the same parameters in the normal way, it converges; with PyTorch Lightning, however, the model doesn't converge beyond a certain limit.
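
A likely explanation for the 0.000 vs 4.73e-5 display is formatting rather than a difference in the underlying value: a fixed three-decimal format rounds small numbers to 0.000, while a three-significant-digit format switches to exponent notation. That the progress bar formats loss and the other entries this way is an assumption here, not something confirmed in the thread; fixed_3 and general_3 are hypothetical helpers that only demonstrate the two Python format specs.

```python
def fixed_3(value):
    """Fixed-point with three decimals; tiny values round to 0.000."""
    return "{:.3f}".format(value)


def general_3(value):
    """Three significant digits; uses exponent notation for tiny values."""
    return "{:.3g}".format(value)


loss = 4.73e-5
# prints: loss=0.000, train_loss=4.73e-05
print(f"loss={fixed_3(loss)}, train_loss={general_3(loss)}")
```

Either way, the format affects only the displayed string. The value used for the gradient step is unchanged, so the display alone would not explain a convergence problem.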
