Pytorch-lightning: training_epoch_end's outputs doesn't have 'loss' key

Created on 26 Jun 2020  ·  13 Comments  ·  Source: PyTorchLightning/pytorch-lightning

pytorch-lightning: built from master

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    main(hparams)
  File "main.py", line 72, in main
    trainer.fit(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 881, in fit
    self.ddp_train(task, model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 539, in ddp_train
    self.run_pretrain_routine(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1091, in run_pretrain_routine
    self.train()
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 376, in train
    self.run_training_epoch()
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 510, in run_training_epoch
    self.run_training_epoch_end(epoch_output)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 535, in run_training_epoch_end
    epoch_output = model.training_epoch_end(epoch_output)
  File "/mnt/lustre/maxiao1/PVM/models/baseline.py", line 335, in training_epoch_end
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
  File "/mnt/lustre/maxiao1/PVM/models/baseline.py", line 335, in <listcomp>
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
KeyError: 'loss'

This is my code:

    def training_step(self, batch, batch_idx):
        ...
        return {'loss': loss, "train_acc": acc}

    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['train_acc'] for x in outputs]).mean()
        logs = {'loss': avg_loss, 'train_acc': avg_acc}
        progress_bar = {'train_loss': avg_loss, 'train_acc': avg_acc}
        results = {
            'log': logs,
            'progress_bar': progress_bar
        }
        return results
Labels: Priority P0 · bug / fix · help wanted

All 13 comments

Try: avg_loss = torch.stack([x['batch_loss'] for x in outputs]).mean()

Thanks, that works,
but the 'train_acc' key doesn't exist either, and neither does batch_train_acc. How can I access the other keys returned from training_step?

As of now in Lightning you can access them using x['callback_metrics']['loss'] and x['callback_metrics']['train_acc'], but I think it should be handled the same way we handle it in validation_epoch_end and test_epoch_end.
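
For reference, a minimal sketch of that workaround applied to the snippet above (based only on the comments in this thread, not documented API; it assumes the values stored under callback_metrics are still tensors that torch.stack accepts):

    def training_epoch_end(self, outputs):
        # Workaround: read the per-step values through 'callback_metrics'
        # instead of the top-level keys returned by training_step.
        avg_loss = torch.stack([x['callback_metrics']['loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['callback_metrics']['train_acc'] for x in outputs]).mean()
        return {
            'log': {'loss': avg_loss, 'train_acc': avg_acc},
            'progress_bar': {'train_loss': avg_loss, 'train_acc': avg_acc},
        }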

Hi! One hint: for me it works with "loss" under Windows but not under Ubuntu.

Weird!! Why is this thing platform dependent?? :thinking:

@Pet222, are you sure the versions on Ubuntu and Windows are the same?

Hey @williamFalcon, is this intended behaviour? I was surprised to see this breaking change introduced with no warning.
If it is intended, why not have behaviour consistent with validation_epoch_end and test_epoch_end?

If it is not intended, as the "bug / fix" label suggests, are you working on it or should I make a PR for this?

what is the behavior? that the "loss" key is not in training_epoch_end? If so, that's a bug because it should be there

@williamFalcon, on the latest version the loss key was changed to batch_loss. I think it was changed here

Yes, the fact that you need to access it through 'callback_metrics'.
Got it!


@captainvera would love a PR :)

@captainvera @xiadingZ sorry about that! it was a bad bug.

Made a PR #2428 and added tests to make sure this doesn't happen again!
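
For context, here is a rough sketch of the kind of regression check that could guard against this. It is not the actual test from PR #2428, and it assumes the 0.8.x-era API in which training_epoch_end receives the list of dicts returned by training_step (the hook has since been replaced in newer Lightning versions):

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            loss = self(batch[0]).sum()
            return {'loss': loss}

        def training_epoch_end(self, outputs):
            # The point of the fix: every per-step dict must still carry 'loss'.
            assert all('loss' in x for x in outputs)
            avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
            return {'log': {'avg_loss': avg_loss}}

        def train_dataloader(self):
            return DataLoader(TensorDataset(torch.rand(64, 32)), batch_size=8)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    trainer = pl.Trainer(max_epochs=1, logger=False, checkpoint_callback=False)
    trainer.fit(BoringModel())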

try master now!
we’ll push a new minor again since this is a key bug (and we have a few other key bugs)

Well, that was fast, thanks!

