I don't understand why train_loss is different from loss even though I assign the same value. Perhaps one loss is calculated over the whole dataset and the other only over a recent batch? But if that's the case, which one is which?
Epoch 1:  79%|███████▉  | 691/870 [00:07<00:01, 100.85batch/s, accuracy=0.7, batch_idx=690, gpu=0, loss=0.442, train_loss=0.535, v_num=2]
def training_step(self, batch, batch_idx):
    ...
    loss = self.loss(...)
    tqdm_dict = {'train_loss': loss}
    outputs = {
        'loss': loss,
        'progress_bar': tqdm_dict,
        'log': tqdm_dict
    }
    return outputs
Hi! Thanks for your contribution, great first issue!
Just to rephrase your question: you are not sure why the loss is there twice (once in outputs and once in tqdm), right?
The difference is that outputs are needed e.g. for EarlyStopping, while tqdm is used just for visualization in the progress bar :]
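To make those roles concrete, here is a toy sketch of the three keys returned from training_step and what consumes each of them (an illustration only, not the actual Lightning internals):

```python
import torch

# Toy illustration (not Lightning internals) of the three keys returned
# from training_step and what each of them is used for.
def fake_training_step():
    loss = torch.tensor(0.5, requires_grad=True)
    tqdm_dict = {'train_loss': loss}
    return {'loss': loss, 'progress_bar': tqdm_dict, 'log': tqdm_dict}

output = fake_training_step()
output['loss'].backward()        # 'loss' drives backprop and callbacks such as EarlyStopping
print(output['progress_bar'])    # 'progress_bar' entries are only rendered in the tqdm bar
print(output['log'])             # 'log' entries are forwarded to the logger (e.g. TensorBoard)
```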
@Borda No, I understand why there are two losses (loss=0.442, train_loss=0.535). What I don't understand is why they have different values even though they come from the same variable, loss = self.loss(...).
that sounds strange, do you have a minimal example we can run?
Mind using our test modules? That way we could add it to the tests so it won't happen again.
@Borda Well, I don't really need to, because even when I run the minimal example from https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html#minimal-example I get the discrepancy. That said, there is either a bug or I don't understand something :cry:
Initially, I thought that one of them could be the mean loss over the whole training set, whereas the other one could be the loss from the last batch. That's why I asked.
import os

import pytorch_lightning as pl
import torch
import torchvision.transforms as transforms
from torch import nn
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST


class LitModel(pl.LightningModule):
    def __init__(self):
        super(LitModel, self).__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        tqdm_dict = {'train_loss': loss}
        outputs = {
            'loss': loss,
            'progress_bar': tqdm_dict,
            'log': tqdm_dict
        }
        return outputs

    def train_dataloader(self):
        return DataLoader(
            dataset=MNIST(
                root=os.getcwd(),
                train=True,
                download=True,
                transform=transforms.ToTensor()
            ),
            batch_size=32
        )

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.01)


if __name__ == "__main__":
    trainer = pl.Trainer(
        checkpoint_callback=False,
        early_stop_callback=False,
        show_progress_bar=True,
        max_epochs=2,
    )
    model = LitModel()
    trainer.fit(model)
Result:
Epoch 2: : 1900it [00:06, 288.43it/s, loss=1.019, train_loss=0.836, v_num=27]
The loss on the progress bar is a running average; what you return (train_loss) is not.
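To make the distinction concrete, here is a minimal sketch of a windowed running average over recent batches (an illustration only, with an assumed window of 100 batches; this is not the exact Lightning implementation):

```python
from collections import deque

# Illustration only: a simple windowed running mean, similar in spirit to the
# smoothed `loss` shown in the progress bar. The window size of 100 is an
# assumption for this sketch.
class RunningMean:
    def __init__(self, window: int = 100):
        self.values = deque(maxlen=window)

    def update(self, value: float) -> float:
        self.values.append(float(value))
        return sum(self.values) / len(self.values)

running = RunningMean()
for batch_loss in [0.9, 0.7, 0.5, 0.3]:
    smoothed = running.update(batch_loss)

print(smoothed)     # smoothed value, analogous to `loss` in the bar
print(batch_loss)   # last batch value, analogous to what `train_loss` reports
```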
@Borda can we note this in the docs?
@williamFalcon Thanks for the clarification.
@williamFalcon What about the magic number, i.e. -100? Do you think it is worth parametrizing as a Trainer.__init__ argument? Frankly speaking, I already think the number of arguments for Trainer.__init__ is overwhelming.
So which loss (averaged or non-averaged) is used to perform the backprop updates?
I also face the same problem.
During training, I'm using a custom loss function to train my model. The loss is displayed as 0.000, but when I display the same value through a different variable it shows 4.73e-5 (a value in exponential format).
Epoch 80:  10%|█▉        | 100/1013 [01:33<14:11, 1.07it/s, loss=0.000, v_num=None, train_loss=4.73e-5]
Both loss and train_loss are given the same value to display. Why does one show in exponential format and the other doesn't?
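One likely explanation is formatting rather than the values themselves: the smoothed loss appears to be printed with fixed-point formatting, while the extra metric goes through a general/scientific format. Which format each field actually uses is an assumption here, but the display effect is easy to reproduce:

```python
# Illustration of the display difference only: the same tiny value looks like
# 0.000 under fixed-point formatting but keeps its magnitude under %g-style
# formatting. The choice of formats here is an assumption, not taken from the
# Lightning source.
value = 4.73e-5

print('{:.3f}'.format(value))   # 0.000     (fixed-point with 3 decimals)
print('{:.3g}'.format(value))   # 4.73e-05  (general format, switches to scientific)
```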
Is it possible that this prevents the model from converging? When I train the model with the same parameters the normal way it converges, but with PyTorch Lightning the model doesn't converge beyond a certain point.