PyTorch Lightning: How to log train and validation loss in the same figure?

Created on 6 Jan 2020 · 14 comments · Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

What is your question?

How can we log the train and validation loss in the same plot and view them in TensorBoard?
Having both in the same plot is useful for spotting overfitting visually.

Code

    def training_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}

What have you tried?

Using the tags Loss/train and Loss/valid groups them under the same section, but they still appear in separate plots.

    def training_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss, 'log': {'Loss/train': loss}}

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'Loss/valid': avg_loss}}

I tried to use self.logger.experiment.add_scalars(), but I am confused about how to access the train loss inside the validation loop.

What's your environment?

  • OS: macOS
  • Packaging: conda
  • Version: 0.5.3.2
Label: question

All 14 comments

You can use

    def training_step(self, batch, batch_idx):
        tensorboard_logs = {'acc': {'train': some_value}, 'loss': {'train': some_value}}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_end(self, outputs):
        tensorboard_logs = {'acc': {'val': some_value}, 'loss': {'val': some_value}}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

nested dictionary works!
Thank you @44REAM

Got NotImplementedError: Got <class 'dict'>, but numpy array, torch tensor, or caffe2 blob name are expected. when trying to use nested dict...


def training_step(self, batch, batch_index):
    loss = self.model.loss(batch)
    # tensorboard_logs = {'train_loss': loss}
    tensorboard_logs = {'loss': {'train': loss}}

    return {'loss': loss, 'log': tensorboard_logs}

Traceback (most recent call last):
  File "bert_ner.py", line 252, in <module>
    trainer.fit(system)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in fit
    self.run_pretrain_routine(model)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 444, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.log_metrics(scalar_metrics, step=step)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 122, in log_metrics
    [logger.log_metrics(metrics, step) for logger in self._logger_iterable]
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 122, in <listcomp>
    [logger.log_metrics(metrics, step) for logger in self._logger_iterable]
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 18, in wrapped_fn
    fn(self, *args, **kwargs)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/tensorboard.py", line 126, in log_metrics
    self.experiment.add_scalar(k, v, step)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 342, in add_scalar
    scalar(tag, scalar_value), global_step, walltime)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 196, in scalar
    scalar = make_np(scalar)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/_convert_np.py", line 30, in make_np
    'Got {}, but numpy array, torch tensor, or caffe2 blob name are expected.'.format(type(x)))
NotImplementedError: Got <class 'dict'>, but numpy array, torch tensor, or caffe2 blob name are expected.

@isolet mind opening a new issue?

@isolet I have the same issue; it must be due to bumping the pytorch-lightning version up to 0.7.1 (the original issue was on 0.5.3.2).

I have the same issue. How can this be fixed?

@huyvnphan Until this gets resolved properly, here's a _really terrible_ workaround...

import torch

def log_metrics(self, metrics, step=None):
    for k, v in metrics.items():
        if isinstance(v, dict):
            # dict values are grouped into a single chart via add_scalars
            self.experiment.add_scalars(k, v, step)
        else:
            if isinstance(v, torch.Tensor):
                v = v.item()
            self.experiment.add_scalar(k, v, step)

def monkeypatch_tensorboardlogger(logger):
    import types
    # rebind log_metrics on this specific logger instance
    logger.log_metrics = types.MethodType(log_metrics, logger)

# ...

monkeypatch_tensorboardlogger(trainer.logger)

Again, this is a terrible idea, but it works. Note that the example above assumes you only have the default TensorBoardLogger wired up. Adjust accordingly if you have multiple loggers.
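
(types.MethodType binds the replacement function to that particular logger instance, so self inside log_metrics resolves to the TensorBoardLogger and self.experiment to its underlying SummaryWriter; any other loggers are left untouched.)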

I began working on a PR to fix this properly, but given the current situation with the pandemic I simply have not found the time to put in the effort required to finish it. My hope is that the snippet above might inspire someone to continue where I stopped...


Ref: https://github.com/PyTorchLightning/pytorch-lightning/blob/af621f8590b2f2ba046b508da2619cfd4995d876/pytorch_lightning/loggers/tensorboard.py#L121-L126

@chiragraman @huyvnphan @thomasjo mind opening a new issue?

I have the same issue with pytorch 1.5.0 and pytorch-lightning 0.7.6.

Has anyone solved this?

@Borda Can we reopen this issue? There is no solution to it as of now, and the error is the same.

I'm getting this error too

See my comment here.
You can do this right now in your validation_epoch_end and get the plots in one figure.
I think in the future we could also support this as part of the output of training/validation_epoch_end, but I would wait for the structured results to be finished first. Let me know if that helps.
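
For reference, a minimal sketch of that approach, reusing the question's model and loss (the buffer self._train_losses is not Lightning API, just a plain list assumed to be initialized to [] in __init__):

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.nll_loss(self.forward(images), labels)
        # buffer the step losses so validation_epoch_end can average them
        self._train_losses.append(loss.detach())
        return {'loss': loss}

    def validation_epoch_end(self, outputs):
        avg_val = torch.stack([x['loss'] for x in outputs]).mean()
        # the sanity check runs before any training step, so guard the empty case
        avg_train = torch.stack(self._train_losses).mean() if self._train_losses else avg_val
        self._train_losses = []
        # add_scalars writes both curves under one main tag, i.e. into one figure
        self.logger.experiment.add_scalars(
            'loss', {'train': avg_train, 'val': avg_val}, self.current_epoch)
        return {'val_loss': avg_val}

Here self.current_epoch is provided by the LightningModule itself, so no hand-rolled counter is needed for the epoch axis.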

@awaelchli very cool, thanks for sharing!!!

@awaelchli This way I have to keep track of the global_step associated with the training steps, validation steps, validation_epoch_end steps, etc. Is there a way to access those counters in a LightningModule?

To make this point clearer:

Suppose a training_step method like this:

    def training_step(self, batch, batch_idx):
        features, _ = batch
        reconstructed_batch, mu, log_var = self(features)
        reconstruction_loss, kld_loss = self.loss_function(reconstructed_batch, features, mu, log_var)
        train_loss = reconstruction_loss + kld_loss
        logger_losses = {'train_loss': train_loss,
                         'train_reconstruction_loss': reconstruction_loss,
                         'train_kld_loss': kld_loss}
        # _train_step_counter is a hand-maintained counter, initialized in __init__
        self.logger.experiment.add_scalars('losses', logger_losses, global_step=self._train_step_counter)
        self._train_step_counter += 1
        return {'loss': train_loss}

So here I have to keep track of the _train_step_counter variable myself. The same would apply to separate counters for validation_step and validation_epoch_end if we cannot use the nested

return {'log': logger_losses}

approach, which apparently takes care of all of that.
I wonder whether there is a way to avoid keeping track of all those global_step counters manually.
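
One way around the manual bookkeeping, assuming the LightningModule exposes the trainer's step counter as self.global_step (it mirrors self.trainer.global_step; fall back to the latter if your version lacks the property), would be a sketch like this:

    def training_step(self, batch, batch_idx):
        features, _ = batch
        reconstructed_batch, mu, log_var = self(features)
        reconstruction_loss, kld_loss = self.loss_function(reconstructed_batch, features, mu, log_var)
        train_loss = reconstruction_loss + kld_loss
        logger_losses = {'train_loss': train_loss,
                         'train_reconstruction_loss': reconstruction_loss,
                         'train_kld_loss': kld_loss}
        # the Trainer maintains global_step, so no hand-rolled counter is needed
        self.logger.experiment.add_scalars('losses', logger_losses, global_step=self.global_step)
        return {'loss': train_loss}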

