How can we log the train and validation loss in the same plot and preview them in TensorBoard?
Having both in the same plot is useful to identify overfitting visually.
def training_step(self, batch, batch_idx):
    images, labels = batch
    output = self.forward(images)
    loss = F.nll_loss(output, labels)
    return {"loss": loss, 'log': {'train_loss': loss}}
def validation_step(self, batch, batch_idx):
    images, labels = batch
    output = self.forward(images)
    loss = F.nll_loss(output, labels)
    return {"loss": loss}
def validation_end(self, outputs):
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
    return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}
Using Loss/train and Loss/valid puts them in the same section, but still in separate plots.
def training_step(self, batch, batch_idx):
    images, labels = batch
    output = self.forward(images)
    loss = F.nll_loss(output, labels)
    return {"loss": loss, 'log': {'Loss/train': loss}}
def validation_step(self, batch, batch_idx):
    images, labels = batch
    output = self.forward(images)
    loss = F.nll_loss(output, labels)
    return {"loss": loss}
def validation_end(self, outputs):
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
    return {'val_loss': avg_loss, 'log': {'Loss/valid': avg_loss}}
I tried to use self.logger.experiment.add_scalars(), but I'm confused about how to access the training loss in the validation loop.
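One way to get the training loss into the validation hook is to cache it on the module during training_step and then log both values with a single add_scalars call. The sketch below assumes that self.logger.experiment is a torch.utils.tensorboard.SummaryWriter (as with the default TensorBoardLogger). To keep it runnable without PyTorch, RecordingWriter is a hypothetical stand-in that only records add_scalars(main_tag, tag_scalar_dict, global_step) calls, and LossPlotter with its on_train_step / on_validation_end methods is a hypothetical helper you would call from training_step and validation_end.

```python
from statistics import mean


class RecordingWriter:
    """Hypothetical stand-in for torch.utils.tensorboard.SummaryWriter.

    Mirrors only add_scalars(main_tag, tag_scalar_dict, global_step) and
    records each call so the logic can run without TensorBoard.
    """

    def __init__(self):
        self.calls = []

    def add_scalars(self, main_tag, tag_scalar_dict, global_step=None):
        self.calls.append((main_tag, dict(tag_scalar_dict), global_step))


class LossPlotter:
    """Hypothetical helper: caches training losses between validation runs."""

    def __init__(self, writer):
        self.writer = writer  # in Lightning this would be self.logger.experiment
        self._train_losses = []  # losses seen since the last validation run

    def on_train_step(self, loss):
        # Call from training_step; float() also works on a 0-d torch tensor.
        self._train_losses.append(float(loss))

    def on_validation_end(self, val_losses, step):
        # Call from validation_end: both averages go under the same main tag,
        # so TensorBoard draws 'train' and 'val' as two curves in one chart.
        scalars = {"val": mean(val_losses)}
        if self._train_losses:
            scalars["train"] = mean(self._train_losses)
        self.writer.add_scalars("loss", scalars, global_step=step)
        self._train_losses.clear()  # start a fresh window for the next run
```

The training value logged here is the mean over the steps since the previous validation run, which is a design choice; logging only the most recent training loss would work the same way.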
You can use a nested dictionary:
def training_step(self, batch, batch_idx):
    # some_value and loss are placeholders for your own metric and loss tensors
    tensorboard_logs = {'acc': {'train': some_value}, 'loss': {'train': some_value}}
    return {"loss": loss, 'log': tensorboard_logs}
def validation_end(self, outputs):
    tensorboard_logs = {'acc': {'val': some_value}, 'loss': {'val': some_value}}
    return {"loss": loss, 'log': tensorboard_logs}
nested dictionary works!
Thank you @44REAM
I got NotImplementedError: Got <class 'dict'>, but numpy array, torch tensor, or caffe2 blob name are expected.
when trying to use a nested dict:
def training_step(self, batch, batch_index):
    loss = self.model.loss(batch)
    # tensorboard_logs = {'train_loss': loss}
    tensorboard_logs = {'loss': {'train': loss}}
    return {'loss': loss, 'log': tensorboard_logs}
Traceback (most recent call last):
File "bert_ner.py", line 252, in <module>
trainer.fit(system)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 630, in fit
self.run_pretrain_routine(model)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
self.train()
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
self.run_training_epoch()
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 444, in run_training_epoch
self.log_metrics(batch_step_metrics, grad_norm_dic)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
self.logger.log_metrics(scalar_metrics, step=step)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 122, in log_metrics
[logger.log_metrics(metrics, step) for logger in self._logger_iterable]
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 122, in <listcomp>
[logger.log_metrics(metrics, step) for logger in self._logger_iterable]
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 18, in wrapped_fn
fn(self, *args, **kwargs)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/tensorboard.py", line 126, in log_metrics
self.experiment.add_scalar(k, v, step)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 342, in add_scalar
scalar(tag, scalar_value), global_step, walltime)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 196, in scalar
scalar = make_np(scalar)
File "/Users/user/.pyenv/versions/env-mkwPXnF--py3.7/lib/python3.7/site-packages/torch/utils/tensorboard/_convert_np.py", line 30, in make_np
'Got {}, but numpy array, torch tensor, or caffe2 blob name are expected.'.format(type(x)))
NotImplementedError: Got <class 'dict'>, but numpy array, torch tensor, or caffe2 blob name are expected.
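The traceback shows the cause: TensorBoardLogger.log_metrics passes every value to SummaryWriter.add_scalar, which cannot convert a dict. If you only need the stock logger to stop crashing, one hypothetical option is to flatten the nested dict before returning it under 'log'. Note that this does not give a single chart: tags such as loss/train and loss/val end up in the same section but as separate plots, just like the Loss/train approach earlier in the thread. flatten_metrics is an illustrative name, not a Lightning API.

```python
def flatten_metrics(metrics, sep="/"):
    """Hypothetical helper: turn {'loss': {'train': x}} into {'loss/train': x}.

    The stock TensorBoardLogger calls add_scalar() on every value, so each
    value must be a number or tensor, never a dict.
    """
    flat = {}
    for key, value in metrics.items():
        if isinstance(value, dict):
            # Recurse so arbitrarily deep nesting is joined with `sep`.
            for sub_key, sub_value in flatten_metrics(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat
```

In training_step this would be used as return {'loss': loss, 'log': flatten_metrics({'loss': {'train': loss}})}.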
@isolet mind opening a new issue?
@isolet I have the same issue. It must be due to bumping pytorch-lightning up to 0.7.1 (the original issue was on 0.5.3.2).
I have the same issue. How to fix this?
@huyvnphan Until this gets resolved properly, here's a _really terrible_ workaround...
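import types

import torch


def log_metrics(self, metrics, step=None):
    # Replacement for TensorBoardLogger.log_metrics: nested dicts go to
    # add_scalars (one chart, one curve per key), everything else to add_scalar.
    for k, v in metrics.items():
        if isinstance(v, dict):
            self.experiment.add_scalars(k, v, step)
        else:
            if isinstance(v, torch.Tensor):
                v = v.item()
            self.experiment.add_scalar(k, v, step)


def monkeypatch_tensorboardlogger(logger):
    # Bind the function above as a method on this logger instance only.
    logger.log_metrics = types.MethodType(log_metrics, logger)

# ...
monkeypatch_tensorboardlogger(trainer.logger)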
Again, this is a terrible idea, but it works. Note that the example above assumes you only have the default TensorBoardLogger wired up. Adjust accordingly if you have multiple loggers.
I began working on a PR to fix this properly, but given the current situation with the pandemic, I simply have not found the time to put in the required effort to finish it. My hope is that the snippet above might inspire someone to continue where I stopped...
@chiragraman @huyvnphan @thomasjo mind opening a new issue?
I have the same issue with
pytorch 1.5.0
pytorch-lightning 0.7.6
Has anyone solved this?
@Borda Can we reopen this issue? There's still no solution, and I get the same error.
I'm getting this error too
See my comment here.
You can do this right now in your validation_epoch_end and get the plots in one figure.
I think in the future we could also support this as part of the output of training_epoch_end/validation_epoch_end, but I would wait until the structured results are finished first. Let me know if that helps.
@awaelchli very cool, thanks for sharing!!!
@awaelchli This way I have to keep track of the global_step associated with the training steps, validation steps, validation_epoch_end steps etc. Is there a way to access those counters in a lightning module?
To make this point clearer, suppose a training_step method like this:
def training_step(self, batch, batch_idx):
    features, _ = batch
    reconstructed_batch, mu, log_var = self(features)
    reconstruction_loss, kld_loss = self.loss_function(reconstructed_batch, features, mu, log_var)
    train_loss = reconstruction_loss + kld_loss
    logger_losses = {'train_loss': train_loss,
                     'train_reconstruction_loss': reconstruction_loss,
                     'train_kld_loss': kld_loss}
    self.logger.experiment.add_scalars('losses', logger_losses, global_step=self._train_step_counter)
    self._train_step_counter += 1
    return {'loss': train_loss}
So here I have to keep track of the _train_step_counter variable. The same would apply to validation_step and validation_epoch_end counters if we cannot use the nested
return {'log': logger_losses}
approach, which apparently takes care of all of that.
I wonder whether there is a way to avoid keeping track of all those global_step counters.
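The Lightning trainer already keeps a single step counter, trainer.global_step, so one option is to read that from every hook instead of maintaining private counters; training and validation points then share one x-axis. The sketch below is a runnable toy under stated assumptions: MiniTrainer and Module are hypothetical stand-ins for the Lightning trainer and LightningModule, and RecordingWriter is a hypothetical stand-in for SummaryWriter that records add_scalars calls. Whether your module can reach the counter as self.trainer.global_step (or, in newer releases, self.global_step) depends on the installed Lightning version, so please check.

```python
class RecordingWriter:
    """Hypothetical stand-in for SummaryWriter; records add_scalars calls."""

    def __init__(self):
        self.calls = []

    def add_scalars(self, main_tag, tag_scalar_dict, global_step=None):
        self.calls.append((main_tag, dict(tag_scalar_dict), global_step))


class Module:
    """Hypothetical module: logs with the trainer's counter, keeps none of its own."""

    trainer = None  # attached by MiniTrainer

    def training_step(self, loss):
        self.trainer.writer.add_scalars(
            "losses", {"train": loss}, global_step=self.trainer.global_step
        )

    def validation_epoch_end(self, val_losses):
        avg = sum(val_losses) / len(val_losses)
        self.trainer.writer.add_scalars(
            "losses", {"val": avg}, global_step=self.trainer.global_step
        )


class MiniTrainer:
    """Hypothetical trainer that owns the one global_step counter."""

    def __init__(self, module, writer):
        self.module = module
        self.writer = writer
        self.global_step = 0
        module.trainer = self  # lets the module read trainer.global_step

    def fit(self, train_losses, val_runs, val_every):
        val_iter = iter(val_runs)
        for loss in train_losses:
            self.module.training_step(loss)
            self.global_step += 1  # advanced once per training step
            if self.global_step % val_every == 0:
                self.module.validation_epoch_end(next(val_iter))
```

In this toy run, each validation point is logged at the step reached by the training step that triggered it, so the train and val curves line up on the same x-axis without any per-hook counters.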