I have been trying out pytorch-lightning 1.0.0rc5 and wanted to log only at epoch end for both training and validation, with the epoch number on the x-axis. I noticed that `training_epoch_end` no longer allows returning anything. However, I noticed that for training I can achieve what I want by doing:
```python
def training_epoch_end(self, outputs):
    loss = compute_epoch_loss_from_outputs(outputs)
    self.log('step', self.trainer.current_epoch)
    self.log('loss', {'train': loss})
```
This sets the `step` to the epoch number, which is then used for the x-axis just as I wanted. I have not found in the documentation whether this is how it is intended to be logged. I am also a bit confused about the result objects. Nevertheless, this code seems quite simple and logical, so I thought it could be one of the intended ways of logging per epoch.
I tried to do the same for validation as follows:
```python
def validation_epoch_end(self, outputs):
    loss = compute_epoch_loss_from_outputs(outputs)
    self.log('step', self.trainer.current_epoch)
    self.log('loss', {'valid': loss})
```
However, in the case of validation the x-axis is the number of batches in validation, and an additional `step` graph appears in TensorBoard.
Based on this I have some questions. Is this an intended way of logging per epoch? If yes, is the idea that the same behavior is obtained for both training and validation? If this is not the intended way of logging per epoch, where can I read about how this is planned for version 1.0.0?
I think that to log one value per epoch you can simply call `self.log('metric_name', metric_value, on_step=False, on_epoch=True)` at each training step. This should automatically accumulate over the epoch and output the averaged value at epoch end. That said, the x-axis will then show the current step (not the epoch number).
I'm not sure you can override that from the `LightningModule.log` API. If that's very important, maybe you can directly access the logger via `self.logger.experiment` and use that?
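Something along these lines might work (assuming the default `TensorBoardLogger`, whose `.experiment` is a `SummaryWriter`, and assuming your `validation_step` returns a dict with a `val_loss` key; both are just for illustration):

```python
import torch

def validation_epoch_end(self, outputs):
    # average whatever validation_step returned over the epoch
    loss = torch.stack([o['val_loss'] for o in outputs]).mean()
    # bypass self.log and write straight to TensorBoard,
    # using the epoch number as the step
    self.logger.experiment.add_scalar('loss/valid', loss, self.current_epoch)
```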
Using `self.log('metric_name', metric_value, on_step=False, on_epoch=True)` both in `training_step` and `training_epoch_end` will log the metric against the `global_step`. It will also log the epoch values separately, so you can create a new panel in your logger UI (e.g. WandB) and put epoch on the x-axis and `metric_name` on the y-axis.
@ndrplz @rohitgr7 thank you for your responses. In the documentation I have read that `on_step` and `on_epoch` are automatically set depending on the context, so from my understanding logging from `*_epoch_end` is equivalent to that.
The `self.logger.experiment` workaround certainly could work. However, one of my motivations for creating this issue was to figure out whether this is unintended behaviour and perhaps help make pytorch-lightning more consistent. If there is code that treats `step` in a special way when logging from `training_epoch_end`, I would figure it would make sense for `validation_epoch_end` to have this behaviour as well.
> treats `step` in a special way when logging from `training_epoch_end`

you will get an error in such a case.
@rohitgr7 I am not sure what you mean by getting an error. I have tried this and I don't get any errors. By default pytorch-lightning logs to TensorBoard, so I am using that. When I do this I get both losses in the same graph. For training the value used for the x-axis is the epoch (as I want), but for validation it is the number of batches, so the two curves don't align. I have also tried with v1.0.0, which was released today, and I get the same behaviour.
This is related to https://github.com/PyTorchLightning/pytorch-lightning/blob/1.0.0/pytorch_lightning/trainer/connectors/logger_connector.py#L84-L90. When `log_metrics` is called for training, the value of the `step` argument is `None`, so the if statement evaluates to `True` and the step gets assigned the value that I gave. But for validation the `step` argument is the number of batches, so the value of `step` is not overridden.
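Roughly, the behaviour boils down to something like this (a simplified sketch with illustrative names, not the actual logger_connector code):

```python
def log_metrics(scalar_metrics, step=None, global_step=0):
    if step is None:
        # training path: the user-logged 'step' metric (the epoch) is used,
        # falling back to global_step when no 'step' metric was logged
        step = scalar_metrics.pop('step', global_step)
    # validation path: step already holds the number of batches,
    # so the user-logged 'step' metric is ignored
    return step
```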
By step, I thought you meant `self.log(..., on_step=True)` in `*_epoch_end`.
By default, all the logs are logged with `step=global_step` for consistency, although the epoch is also logged alongside. So you can alter or create a new log frame in your logger UI (e.g. WandB) and put epoch on the x-axis and `metric_value` on the y-axis.
> But for validation the step argument has as value the number of batches so the value of step is not overridden.

You want to log by `step=epoch` in `train_epoch_end` and `step=number_of_batches` in `validation_epoch_end`?
> you want to log by `step=epoch` in `train_epoch_end` and `step=number_of_batches` in `validation_epoch_end`?
No, I want `step=epoch` for both `training_epoch_end` and `validation_epoch_end`. What you describe is the unexpected behaviour that I am getting and don't want.
Also, I expect this to work with the defaults (TensorBoard) and without needing to select what the x-axis should be.
ok got it, let me check.
Maybe there is a bug in line logger_connector.py#L202. The `step` argument should not be given; it will get the `global_step` value anyway because of line logger_connector.py#L90.
Also, in logger_connector.py#L90 it does not make sense to have `step = step if step is not None ...` since it is inside an if that checks whether step is None.
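To sketch the redundancy (the surrounding variable names and the else branch are illustrative, not the exact library code):

```python
step = None                                   # as in the training path
scalar_metrics = {'step': 3, 'loss': 0.25}    # illustrative logged values

if step is None:
    # step is always None inside this branch, so the conditional
    # can never keep the original value of step ...
    step = step if step is not None else scalar_metrics.pop('step', None)

# ... and could simply be written as:
# step = scalar_metrics.pop('step', None)
```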
Yeah, good catch, verified this is a bug. Mind sending a PR?

> `step = step if step is not None ...`

Yeah, this can be improved.
Yes, I can create a pull request for this.
```python
def training_step(self, batch, batch_idx):
    ...
    return loss

def training_epoch_end(self, outs):
    self.log('avg_loss', outs.mean())
```

is the same as:

```python
def training_step(self, batch, batch_idx):
    loss = ...
    self.log('avg_loss', loss, on_step=False, on_epoch=True)
    return loss

def training_epoch_end(self, outs):
    some_val = ...
    self.log('some_val', some_val)
```
@williamFalcon thank you for the response. Please note that I am not interested in logging validation at each step. I completely agree, that does not make sense. I only want to log validation values in `validation_epoch_end`. In my example it is the loss, but that is not important; the same question holds for `some_val`.
Furthermore, if for both training and validation values are only logged at epoch end as in the example, then both can be plotted on the same graph, precisely showing the change in distribution over time. Both can be plotted in the same graph because the values correspond to the same points in time (epoch end). This is already done automatically by PL with my example snippets at the top if the `self.log('step', ...)` line is removed, but as you say that does not make sense. For the plot to make sense I want to override the step to be `batch` instead of `global_step`. This overriding of step works for training but not for validation. If users are allowed to override `step` for training, for consistency it makes sense that it can also be overridden for validation.
To clarify a bit further, I want to do
```python
def training_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'train': some_val})

def validation_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'valid': some_val})
```
Expecting to get a graph where I see `some_val` for both training and validation.
It is useful for me to observe in a single graph the same value for both training and validation at comparable time intervals. I also want the x-axis to be `epoch` for several reasons. One of them is that I want to use `GradientAccumulationScheduler`, which means that the number of steps in each epoch can differ. If I used the number of steps, the points on the x-axis would be unevenly distributed.
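For illustration, a minimal sketch of such a schedule (the epoch boundaries and factors are made up): with it, epochs 0-3 take one optimizer step per batch, epochs 4-7 accumulate 2 batches per step, and epoch 8 onwards accumulate 4, so the number of optimizer steps per epoch changes.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import GradientAccumulationScheduler

# keys are epochs, values are how many batches to accumulate per optimizer step
accumulator = GradientAccumulationScheduler(scheduling={0: 1, 4: 2, 8: 4})
trainer = Trainer(callbacks=[accumulator])
```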
@mauvilsa I guess you wanted to write that you want the x-axis to mark the `epoch`, not the `batch` :)
I've been running into the same problem since I updated to 1.0.0. For some reason, the metrics I log in my validation step no longer have an epoch associated with them. I am using the `CSVLogger`, and the epoch value for all of my validation metrics is empty while the step value is present. I essentially want to do the same as @mauvilsa and plot my metrics against the epoch number.
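For context, roughly the setup I mean (the save_dir and name are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger

logger = CSVLogger(save_dir='logs', name='my_experiment')
trainer = Trainer(logger=logger)
# metrics end up in logs/my_experiment/version_0/metrics.csv, which has
# 'step' and 'epoch' columns; the problem described above is that 'epoch'
# is empty for the rows coming from validation
```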
This has now come up multiple times and I have a strong opinion here. To me, it is 100% clear that logging with epoch on the "x-axis" makes no sense. Note the emphasis on logging, which I see as separate from "visualization".
These are just a couple of reasons, I could probably give you 10 more.
I see two options:
1. Make a feature request in TensorBoard.
2. Let TensorBoard go.

EDIT: I originally wrote "logging on epoch makes no sense" but what I mean is "logging with epoch on the x-axis makes no sense".
@awaelchli thank you very much for your comment, it certainly adds value to the discussion. I am not much concerned about plotting or TensorBoard in particular; I added the plot just to clarify the issue. For me, logging is just about storing values at certain points during training. Surely, with a huge dataset people might want to validate more than once per epoch. I also have no issue with associating values to steps; that certainly makes sense in many cases. However, as I mentioned in a previous comment, I am using `GradientAccumulationScheduler`, which means that each epoch does not have the same number of steps, so a simple thing to do is to associate values to epochs.
Could I do something different while using `GradientAccumulationScheduler`? Probably I could. But for me this is not the main point of this discussion. The main point is that pytorch-lightning should give users the freedom to do what they need depending on the case. Being able to override the step when logging is a nice feature that provides flexibility. The issue is that right now the behaviour of pytorch-lightning is inconsistent: the step can be overridden for training, but it does not work for validation.
Set `val_check_interval < 1` and your plot now shows multiple values per epoch.
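For example (the fraction is illustrative):

```python
from pytorch_lightning import Trainer

# run validation four times per training epoch; the validation points
# then no longer line up with epoch boundaries
trainer = Trainer(val_check_interval=0.25)
```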
good point.
But still, if someone adds `step` in the `.log`, IMO we should still replace the step with this value on the x-axis? WDYT?
> But still, if someone adds step in the .log, IMO we should still replace the step with this value on the x-axis? WDYT?

As long as the default behaviour we have right now is not changed, I have no objections.
@awaelchli by default, it does not allow changing the step in `validation_epoch_end`; that's what the PR is trying to solve, but one test is failing there.
> The issue is that right now the behavior of pytorch-lightning is inconsistent. The step can be overridden for training, but it does not work for validation.

Yes, this is because in validation we typically don't want to increase the step and accumulate instead. If this is an optional feature to be added, I have no objections.
@awaelchli yes, this is just an optional feature. Maybe you could comment in pull request #4130 saying that you don't have objections, or better yet review it? In that pull request we still need feedback to decide what to do with a line already in the code that automatically logs `epoch`, which with the change makes the unit tests fail.