Pytorch-lightning: How to log by epoch for both training and validation on 1.0.0rc4 / 1.0.0rc5 / 1.0.0

Created on 12 Oct 2020  ·  23 Comments  ·  Source: PyTorchLightning/pytorch-lightning

What is your question?

I have been trying out pytorch-lightning 1.0.0rc5 and wanted to log only at epoch end for both training and validation, with the epoch number on the x-axis. I noticed that training_epoch_end no longer allows returning anything. However, I noticed that for training I can achieve what I want by doing:

def training_epoch_end(self, outputs):
    loss = compute_epoch_loss_from_outputs(outputs)
    self.log('step', self.trainer.current_epoch)
    self.log('loss', {'train': loss})

It sets the step to be the epoch number, which is then used for the x-axis just as I wanted. I have not found in the documentation whether this is how it is intended to be logged. I am also a bit confused about the result objects. Nevertheless, this code seems quite simple and logical, so I thought it could be one of the intended ways of logging per epoch.

I tried to do the same for validation as follows:

def validation_epoch_end(self, outputs):
    loss = compute_epoch_loss_from_outputs(outputs)
    self.log('step', self.trainer.current_epoch)
    self.log('loss', {'valid': loss})

However, in the case of validation the x-axis is the number of validation batches, and an additional step graph appears in TensorBoard.

Based on this I have some questions. Is this an intended way of logging per epoch? If yes, is the idea that the same behavior is obtained for both training and validation? If this is not the intended way of logging per epoch, where can I read about how this is planned for version 1.0.0?

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 1.0.0rc5
question

Most helpful comment

To clarify a bit further, I want to do

def training_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'train': some_val})

def validation_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'valid': some_val})

Expecting to get a graph where I see some_val for both training and validation which would look like

[Screenshot: some_val plotted against the epoch for both train and valid]

It is useful for me to observe the same value for both training and validation in a single graph at comparable time intervals. I also want the x-axis to be the epoch for several reasons. One of them is that I want to use GradientAccumulationScheduler, which means that the number of steps can differ between epochs. If I used the number of steps, the points on the x-axis would be unevenly distributed.

All 23 comments

I think that to log one value per epoch you can simply call

self.log('metric_name', metric_value, on_step=False, on_epoch=True)

at each training step. This should automatically accumulate over the epoch and output the averaged value at epoch end. But it is true that the x-axis will then show the current step (not the epoch number).

I'm not sure you can override that from the LightningModule.log API. If that's very important maybe you can directly access the logger in self.logger.experiment and use that?
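For concreteness, a minimal sketch combining both suggestions, assuming the default TensorBoard logger (so that self.logger.experiment is a SummaryWriter); compute_loss is a hypothetical helper, and it assumes validation_step returns the per-batch loss tensor:

import torch

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    # accumulated over the epoch; a single averaged value is written at epoch end
    self.log('train_loss', loss, on_step=False, on_epoch=True)
    return loss

def validation_epoch_end(self, outputs):
    # assumes validation_step returns the per-batch loss tensor
    loss = torch.stack(outputs).mean()
    # workaround: bypass self.log and write directly to the underlying
    # TensorBoard SummaryWriter, using the current epoch as the x-axis value
    self.logger.experiment.add_scalar('valid_loss', loss, self.trainer.current_epoch)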

using

self.log('metric_name', metric_value, on_step=False, on_epoch=True)

in both training_step and training_epoch_end, it will log the metric against global_step. It will also log the epoch separately, so you can create a new panel in your logger UI (e.g. Wandb) and put epoch on the x-axis and metric_name on the y-axis.
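For example, this remapping can also be set up programmatically with the wandb client; a minimal sketch, assuming a wandb version that provides define_metric (the project and metric names are placeholders):

import wandb

wandb.init(project='my-project')  # placeholder project name
# tell the wandb UI to use the logged 'epoch' value as the x-axis for 'metric_name'
wandb.define_metric('epoch')
wandb.define_metric('metric_name', step_metric='epoch')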

@ndrplz @rohitgr7 thank you for your responses. In the documentation I have read that on_step and on_epoch are automatically set depending on the context, so from my understanding logging from *_epoch_end is equivalent to that.

The self.logger.experiment workaround could certainly work. However, one of my motivations for creating this issue was to figure out whether this is unintended behaviour and perhaps help make pytorch-lightning more consistent. If there is code that treats step in a special way when logging from training_epoch_end, it would make sense for validation_epoch_end to have the same behaviour.

treats step in a special way when logging from training_epoch_end

you will get an error in that case.

@rohitgr7 I am not sure what you mean by saying I will get an error. I have tried this and I don't get any errors. By default pytorch-lightning logs to TensorBoard, so I am using that. When I do this I get both losses in the same graph. For training the value used for the x-axis is the epoch (as I want), but for validation it is the number of batches, so the two curves don't align. I have also tried v1.0.0, which was released today, and I get the same behaviour.

This is related to https://github.com/PyTorchLightning/pytorch-lightning/blob/1.0.0/pytorch_lightning/trainer/connectors/logger_connector.py#L84-L90 . When log_metrics is called for training, the step argument is None, so the if statement evaluates to True and step gets assigned the value that I gave. But for validation the step argument already holds the number of batches, so it is not overridden.
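Paraphrasing the linked code (a rough sketch of the logic, not the verbatim PL 1.0.0 source):

# inside LoggerConnector.log_metrics(self, metrics, grad_norm_dic, step=None)
scalar_metrics = self.trainer.metrics_to_scalars(metrics)

if step is None:
    if 'step' in scalar_metrics:
        # a user-logged 'step' metric overrides the x-axis value
        step = scalar_metrics.pop('step')
    else:
        scalar_metrics['epoch'] = self.trainer.current_epoch
        step = self.trainer.global_step
# when the caller already passes a step (as happens for validation),
# the user-logged 'step' metric is never consulted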

by step, I thought you meant self.log(..., on_step=True) in *_epoch_end.

By default, all the logs are logged with step=global_step for consistency, although epoch is also logged alongside. So you can alter or create a new log frame in your logger UI (e.g. Wandb) and put epoch on the x-axis and metric_value on the y-axis.

But for validation the step argument has as value the number of batches so the value of step is not overridden.

you want to log with step=epoch in training_epoch_end and step=number_of_batches in validation_epoch_end?

No, I want step=epoch for both training_epoch_end and validation_epoch_end. What you describe is the unexpected behaviour I am getting and don't want.

Also I expect this to work with the defaults (tensorboard) and without needing to select what the x-axis should be.

ok got it, let me check.

Maybe there is a bug at logger_connector.py#L202: the step argument should not be passed there. It will get the global_step value anyway because of logger_connector.py#L90.

Also, in logger_connector.py#L90 it does not make sense to have step = step if step is not None ... since it is inside an if that already checks that step is None.
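In other words (again paraphrasing rather than quoting the source), the branch in question boils down to:

if step is None:
    ...
    # the inner check is redundant: step is already known to be None here
    step = step if step is not None else self.trainer.global_step
    # so it could simply be
    step = self.trainer.global_step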

yeah, good catch, verified this is a bug. Mind sending a PR?

step = step if step is not None ..
yeah this can be improved.

Yes, I can create a pull request for this.

  1. This:

def training_step(self, batch, batch_idx):
    loss = ...
    return loss

def training_epoch_end(self, outs):
    self.log('avg_loss', outs.mean())

is the same as:

def training_step(self, batch, batch_idx):
    loss = ...
    self.log('avg_loss', loss, on_step=False, on_epoch=True)
    return loss

  2. If you still need to log something at epoch end, then just call self.log:

def training_epoch_end(self, outs):
    some_val = ...
    self.log('some_val', some_val)

  3. Logging steps in validation makes no sense lol... the x-axis would be the batch idx, not time, so the curve means nothing. This is why PL makes a separate graph for each... because when done this way, it can be viewed as a change in distribution over time.

@williamFalcon thank you for the response. Please note that I am not interested in logging validation at each step. I completely agree, that does not make sense. I only want to log validation values in validation_epoch_end. In my example it is the loss, but that is not important; the same question holds for some_val.

Furthermore, if for both training and validation values are only logged at epoch end, as in the example, then both can be plotted on the same graph, precisely showing the change in distribution over time. Both can be plotted in the same graph because the values correspond to the same points in time (epoch end). PL already does this automatically with my example snippets at the top if I remove the self.log('step', ...) call, but as you say this does not make sense. For the plot to make sense I want to override the step to be batch instead of global_step. This overriding of step works for training but not for validation. If users are allowed to override step for training, then for consistency it makes sense that it can also be overridden for validation.

To clarify a bit further, I want to do

def training_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'train': some_val})

def validation_epoch_end(self, outputs):
    some_val = ...
    self.log('step', self.trainer.current_epoch)
    self.log('some_val', {'valid': some_val})

Expecting to get a graph where I see some_val for both training and validation which would look like

[Screenshot: some_val plotted against the epoch for both train and valid]

It is useful for me to observe the same value for both training and validation in a single graph at comparable time intervals. I also want the x-axis to be the epoch for several reasons. One of them is that I want to use GradientAccumulationScheduler, which means that the number of steps can differ between epochs. If I used the number of steps, the points on the x-axis would be unevenly distributed.

@mauvilsa I guess you wanted to write that you want the x-axis to mark the epoch not the batch :)

I've been running into the same problem since I updated to 1.0.0. For some reason, the metrics I log in my validation step no longer have an epoch associated with them. I am using the CSVLogger, and the epoch value for any of my validation metrics is empty while the step value is present. I essentially want to do the same as @mauvilsa and plot my metrics against the epoch number.

This has now come up multiple times and I have a strong opinion here. To me, it is 100% clear that logging with epoch on the "x-axis" makes no sense. Note the emphasis on logging, which I see as separate from "visualization".

  • Do not confuse a logger with a plotting/graphing tool.
  • Set val_check_interval < 1 and your plot now shows multiple values per epoch. Is that what you want? "But why would I want to validate more than once per epoch?" You almost ALWAYS want to do that, because your datasets are huge and an epoch takes a long time.
  • Changing the abscissa and ordinate can make sense, but it SHOULD NOT be something a user has to decide before the training. Modern loggers (not TensorBoard) solved this problem a long time ago: you log your metric as a function of step, and you log the epoch as a function of step. Then in the UI, you can map the abscissa of your metric plot to the epoch. In fact, you can do that with ANY metric that is monotonically increasing. You can find this in e.g. wandb.
  • If I run 2 experiments, where the difference is the dataset, and the datasets are not equal size, there are two ways to compare: 1. compare the validation losses at epoch intervals. 2. compare validation losses after n steps. Both ways of comparing are valid, only the interpretation changes. With your proposed change, you eliminate the 2nd.
  • ...

These are just a couple of reasons; I could probably give you 10 more.
I see two options: 1. make a feature request in TensorBoard, or 2. let TensorBoard go.

EDIT: I originally wrote "logging on epoch makes no sense" but what I mean is "logging with epoch on x-axis makes no sense"

@awaelchli thank you very much for your comment, it certainly adds value to the discussion. I am not much concerned about plotting or TensorBoard in particular; I added the plot just to clarify the issue. For me logging is just about storing values at certain points during training. Surely with a huge dataset people might want to validate more than once per epoch. I also have no issue with associating values with steps; certainly that makes sense in many cases. However, as I mentioned in a previous comment, I am using GradientAccumulationScheduler, which means that each epoch does not have the same number of steps, so a simple thing to do is to associate values with epochs.

Could I do something different while using GradientAccumulationScheduler? Probably I can. But for me this is not the main point of this discussion. The main point is that pytorch-lightning should give freedom to the user to do as they need depending on the case. Being able to override step when logging is a nice feature to have to provide flexibility to the users. The issue is that right now the behavior of pytorch-lightning is inconsistent. The step can be overridden for training, but it does not work for validation.

Set val_check_interval < 1 and your plot now shows multiple values per one epoch in the plot.

good point.

But still, if someone adds step in the .log, IMO we should still replace the step with this value on the x-axis? WDYT?

But still, if someone adds step in the .log, IMO we should still replace the step with this value on the x-axis? WDYT?

as long as the default behaviour we have right now is not changed, I have no objections.

@awaelchli by default, it does not allow changing step in validation_epoch_end, that's what the PR is trying to solve but 1 test is failing there.

The issue is that right now the behavior of pytorch-lightning is inconsistent. The step can be overridden for training, but it does not work for validation.

Yes, this is because in validation we typically don't want to increase the step but accumulate instead. If this is an optional feature to be added, I have no objections.

@awaelchli yes, this is just an optional feature. Maybe you could comment in pull request #4130 saying that you have no objections, or better yet review it? In that pull request we still need feedback on what to do with a line already in the code that automatically logs epoch, which with the change makes the unit tests fail.
