Pytorch-lightning: Train loss vs loss on progress bar

Created on 25 Mar 2020 · 12 Comments · Source: PyTorchLightning/pytorch-lightning

❓ Questions and Help

What is your question?

I don't understand why train_loss is different than loss even though I assign the same value. Perhaps one loss is calculated over the whole dataset and the other one is only for a recent batch? But if that's the case, which one is which?

Epoch 1:  79%|███████▉  | 691/870 [00:07<00:01, 100.85batch/s, accuracy=0.7, batch_idx=690, gpu=0, loss=0.442, train_loss=0.535, v_num=2]

Code

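    def training_step(self, batch, batch_idx):
        ...
        loss = self.loss(...)
        # Shown on the progress bar and sent to the logger under 'train_loss'.
        tqdm_dict = {'train_loss': loss}
        outputs = {
            'loss': loss,  # value used for the optimization step
            'progress_bar': tqdm_dict,
            'log': tqdm_dict
        }
        return outputs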

What's your environment?

OS: Ubuntu
Packaging: pip
Version: 0.6.0

mateuszpieniak

👍3

All 12 comments

Hi! Thanks for your contribution! Great first issue!

github-actions[bot] on 25 Mar 2020

Just to rephrase your question: you are not sure why the loss appears twice (once in outputs and once in tqdm), right?
The difference is that outputs are needed, e.g., for EarlyStopping, while tqdm is used only for visualization in the progress bar :]

Borda on 26 Mar 2020

@Borda No, I understand why there are 2 losses (loss=0.442, train_loss=0.535). I don't understand why they have different values even though they come from the same variable loss = self.loss(...)

mateuszpieniak on 26 Mar 2020

That sounds strange. Do you have a minimal example we can run?
Would you mind using our test modules? Then we could add it to the tests so it won't happen again.

Borda on 26 Mar 2020

@Borda Well, I don't really need to, because even when I run the minimal example from https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html#minimal-example I get the discrepancy. So either there is a bug or I am misunderstanding something :cry:

Initially, I thought that one of them could be mean loss over the whole training set, whereas the other one could be the loss from the last batch. That's why I asked.

    import os

    import pytorch_lightning as pl
    import torch
    import torchvision.transforms as transforms
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision.datasets import MNIST


    class LitModel(pl.LightningModule):

        def __init__(self):
            super(LitModel, self).__init__()
            self.l1 = torch.nn.Linear(28 * 28, 10)
            self.loss = nn.CrossEntropyLoss()

        def forward(self, x):
            return torch.relu(self.l1(x.view(x.size(0), -1)))

        def training_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self.forward(x)

            loss = self.loss(y_hat, y)
            tqdm_dict = {'train_loss': loss}
            outputs = {
                'loss': loss,
                'progress_bar': tqdm_dict,
                'log': tqdm_dict
            }
            return outputs

        def train_dataloader(self):
            return DataLoader(
                dataset=MNIST(
                    root=os.getcwd(),
                    train=True,
                    download=True,
                    transform=transforms.ToTensor()
                ),
                batch_size=32
            )

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=0.01)


    if __name__ == "__main__":
        trainer = pl.Trainer(
            checkpoint_callback=False,
            early_stop_callback=False,
            show_progress_bar=True,
            max_epochs=2,
        )

        model = LitModel()
        trainer.fit(model)

Result:

Epoch 2: : 1900it [00:06, 288.43it/s, loss=1.019, train_loss=0.836, v_num=27]

mateuszpieniak on 26 Mar 2020

The loss on the progress bar is a running average over recent batches. The value you return (train_loss) is not; it is the loss of the current batch only.

williamFalcon on 26 Mar 2020
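
A minimal sketch of the distinction described above, assuming the bar's loss is the mean of the most recent 100 batch losses. The window size is inferred from the -100 mentioned later in this thread, and the helper name progress_bar_values is hypothetical, not part of Lightning:

```python
from collections import deque


def progress_bar_values(batch_losses, window=100):
    """Return, per step, the smoothed 'loss' and the raw 'train_loss'.

    Assumption: 'loss' is the mean of the last `window` batch losses,
    while 'train_loss' is the current batch loss passed through unchanged.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` losses
    rows = []
    for batch_loss in batch_losses:
        recent.append(batch_loss)
        rows.append({
            "loss": sum(recent) / len(recent),  # running average (display only)
            "train_loss": batch_loss,           # value of this batch alone
        })
    return rows


if __name__ == "__main__":
    for row in progress_bar_values([1.0, 0.5, 0.25], window=2):
        print(row)
```

With a window of 2 and batch losses 1.0, 0.5, 0.25, the last row shows loss=0.375 next to train_loss=0.25, so the two keys can differ even though both come from the same variable.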

👎3

@Borda can we note this in the docs?

williamFalcon on 26 Mar 2020

👍4

https://github.com/PyTorchLightning/pytorch-lightning/blob/3be81cb54ebf2b5425cae09327e852bea0e7c492/pytorch_lightning/trainer/training_loop.py#L598

williamFalcon on 26 Mar 2020

@williamFalcon Thanks for the clarification.

mateuszpieniak on 26 Mar 2020

@williamFalcon What about the magic number, i.e. -100? Do you think it is worth exposing as a Trainer.__init__ argument? Frankly speaking, I already find the number of Trainer.__init__ arguments overwhelming.

mateuszpieniak on 5 May 2020

So which loss (averaged or non-averaged) is used for backpropagation and the parameter updates?

JonnyD1117 on 27 Jun 2020

I am facing the same problem.

During training, I use a custom loss function. The loss is displayed as 0.000, but when I display the same value under a different name it shows 4.73e-5 (exponential format).
Epoch 80: 10%|██▌ | 100/1013 [01:33<14:11, 1.07it/s, loss=0.000, v_num=None, train_loss=4.73e-5]

Both loss and train_loss display the same value. Why is one shown in exponential format and the other not?

Could this prevent the model from converging? When I train the model with the same parameters in the normal way it converges, but with PyTorch Lightning it does not converge beyond a certain point.
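
One possible explanation for the display difference reported above (loss=0.000 vs train_loss=4.73e-5), offered as an assumption rather than something confirmed in this thread: the bar may render loss with a fixed three decimal places, while other entries are rendered with three significant digits. The sketch below uses only Python's built-in format specifiers, and the variable names are illustrative:

```python
# The same tiny value rendered two ways.
tiny_loss = 4.73e-5

# Fixed-point with 3 decimals: anything below 0.0005 rounds to "0.000".
fixed_point = format(tiny_loss, ".3f")

# General format with 3 significant digits: switches to exponent notation.
significant = format(tiny_loss, ".3g")

print(f"loss={fixed_point}, train_loss={significant}")
```

Python's own general format prints 4.73e-05 here, whereas the bar in the report shows 4.73e-5, so the exact formatting call used by the progress bar evidently differs slightly. Under this assumption the underlying value is identical and only the rendering differs, so a displayed 0.000 would not by itself indicate a zero loss.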