PyTorch Lightning: How to log train and validation loss in the same figure?

Created on 6 Jan 2020 · 14 comments · Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

What is your question?

How can we log the train and validation loss in the same plot and view them in TensorBoard?
Having both in the same plot is useful for spotting overfitting visually.

Code

    def training_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}

What have you tried?

Using the tags Loss/train and Loss/valid groups them under the same section, but they still appear in separate plots.

    def training_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss, 'log': {'Loss/train': loss}}

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        output = self.forward(images)
        loss = F.nll_loss(output, labels)
        return {"loss": loss}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'Loss/valid': avg_loss}}

I tried to use self.logger.experiment.add_scalars(), but I am confused about how to access the train loss inside the validation loop.

What's your environment?

  • OS: macOS
  • Packaging: conda
  • Version: 0.5.3.2
Label: question

All 14 comments

You can use

    def training_step(self, batch, batch_idx):
        tensorboard_logs = {'acc': {'train': some_value}, 'loss': {'train': some_value}}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_end(self, outputs):
        tensorboard_logs = {'acc': {'val': some_value}, 'loss': {'val': some_value}}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

nested dictionary works!
Thank you @44REAM

Got NotImplementedError: Got <class 'dict'>, but numpy array, torch tensor, or caffe2 blob name are expected. when trying to use nested dict...


def training_step(self, batch, batch_index):
    loss = self.model.loss(batch)
    # tensorboard_logs = {'train_loss': loss}
    tensorboard_logs = {'loss': {'train': loss}}

    return {'loss': loss, 'log': tensorboard_logs}

Traceback (most recent call last):
  File "bert_ner.py", line 252, in <module>
    trainer.fit(system)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in fit
    self.run_pretrain_routine(model)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 444, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.log_metrics(scalar_metrics, step=step)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 122, in log_metrics
    [logger.log_metrics(metrics, step) for logger in self._logger_iterable]
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 122, in <listcomp>
    [logger.log_metrics(metrics, step) for logger in self._logger_iterable]
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 18, in wrapped_fn
    fn(self, *args, **kwargs)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/tensorboard.py", line 126, in log_metrics
    self.experiment.add_scalar(k, v, step)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 342, in add_scalar
    scalar(tag, scalar_value), global_step, walltime)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 196, in scalar
    scalar = make_np(scalar)
  File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/_convert_np.py", line 30, in make_np
    'Got {}, but numpy array, torch tensor, or caffe2 blob name are expected.'.format(type(x)))
NotImplementedError: Got <class 'dict'>, but numpy array, torch tensor, or caffe2 blob name are expected.

@isolet mind opening a new issue?

@isolet I have the same issue; it must be due to bumping the pytorch-lightning version up to 0.7.1 (the original issue was on 0.5.3.2).

I have the same issue. How can this be fixed?

@huyvnphan Until this gets resolved properly, here's a _really terrible_ workaround...

import torch

def log_metrics(self, metrics, step=None):
    for k, v in metrics.items():
        if isinstance(v, dict):
            # dict values are grouped into a single chart via add_scalars
            self.experiment.add_scalars(k, v, step)
        else:
            if isinstance(v, torch.Tensor):
                v = v.item()
            self.experiment.add_scalar(k, v, step)

def monkeypatch_tensorboardlogger(logger):
    import types
    # rebind log_metrics on this specific logger instance
    logger.log_metrics = types.MethodType(log_metrics, logger)

# ...

monkeypatch_tensorboardlogger(trainer.logger)

Again, this is a terrible idea, but it works. Note that the example above assumes you only have the default TensorBoardLogger wired up. Adjust accordingly if you have multiple loggers.
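
(types.MethodType binds the replacement function to that particular logger instance, so self inside log_metrics resolves to the TensorBoardLogger and self.experiment to its underlying SummaryWriter; any other loggers are left untouched.)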

I began working on a PR to fix this properly, but given the current situation with the pandemic I simply have not found the time to put in the effort required to finish it. My hope is that the snippet above might inspire someone to continue where I stopped...


Ref: https://github.com/PyTorchLightning/pytorch-lightning/blob/af621f8590b2f2ba046b508da2619cfd4995d876/pytorch_lightning/loggers/tensorboard.py#L121-L126

@chiragraman @huyvnphan @thomasjo mind opening a new issue?

I have the same issue with pytorch 1.5.0 and pytorch-lightning 0.7.6.

Has anyone solved this?

@Borda Can we reopen this issue? There is no solution to it as of now, and the error is the same.

I'm getting this error too

See my comment here.
You can do this right now in your validation_epoch_end and get the plots in one figure.
I think in the future we could also support this as part of the output of training/validation_epoch_end, but I would wait for the structured results to be finished first. Let me know if that helps.
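
For reference, a minimal sketch of that approach, reusing the question's model and loss (the buffer self._train_losses is not Lightning API, just a plain list assumed to be initialized to [] in __init__):

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.nll_loss(self.forward(images), labels)
        # buffer the step losses so validation_epoch_end can average them
        self._train_losses.append(loss.detach())
        return {'loss': loss}

    def validation_epoch_end(self, outputs):
        avg_val = torch.stack([x['loss'] for x in outputs]).mean()
        # the sanity check runs before any training step, so guard the empty case
        avg_train = torch.stack(self._train_losses).mean() if self._train_losses else avg_val
        self._train_losses = []
        # add_scalars writes both curves under one main tag, i.e. into one figure
        self.logger.experiment.add_scalars(
            'loss', {'train': avg_train, 'val': avg_val}, self.current_epoch)
        return {'val_loss': avg_val}

Here self.current_epoch is provided by the LightningModule itself, so no hand-rolled counter is needed for the epoch axis.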

@awaelchli very cool, thanks for sharing!!!

@awaelchli This way I have to keep track of the global_step associated with the training steps, validation steps, validation_epoch_end steps, etc. Is there a way to access those counters in a LightningModule?

To make this point clearer:

Suppose a training_step method like this:

    def training_step(self, batch, batch_idx):
        features, _ = batch
        reconstructed_batch, mu, log_var = self(features)
        reconstruction_loss, kld_loss = self.loss_function(reconstructed_batch, features, mu, log_var)
        train_loss = reconstruction_loss + kld_loss
        logger_losses = {'train_loss': train_loss,
                         'train_reconstruction_loss': reconstruction_loss,
                         'train_kld_loss': kld_loss}
        # _train_step_counter is a hand-maintained counter, initialized in __init__
        self.logger.experiment.add_scalars('losses', logger_losses, global_step=self._train_step_counter)
        self._train_step_counter += 1
        return {'loss': train_loss}

So here I have to keep track of the _train_step_counter variable myself. The same would apply to separate counters for validation_step and validation_epoch_end if we cannot use the nested

return {'log': logger_losses}

approach, which apparently takes care of all of that.
I wonder whether there is a way to avoid keeping track of all those global_step counters manually.
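
One way around the manual bookkeeping, assuming the LightningModule exposes the trainer's step counter as self.global_step (it mirrors self.trainer.global_step; fall back to the latter if your version lacks the property), would be a sketch like this:

    def training_step(self, batch, batch_idx):
        features, _ = batch
        reconstructed_batch, mu, log_var = self(features)
        reconstruction_loss, kld_loss = self.loss_function(reconstructed_batch, features, mu, log_var)
        train_loss = reconstruction_loss + kld_loss
        logger_losses = {'train_loss': train_loss,
                         'train_reconstruction_loss': reconstruction_loss,
                         'train_kld_loss': kld_loss}
        # the Trainer maintains global_step, so no hand-rolled counter is needed
        self.logger.experiment.add_scalars('losses', logger_losses, global_step=self.global_step)
        return {'loss': train_loss}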

