Pytorch-lightning: on_train_end seems to get called before logging of last epoch has finished

Created on 2 Apr 2020 · 5 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

Maybe not a bug, but unexpected behavior. When using the on_train_end method either to upload the model's latest metrics.csv file created by TestTube to Neptune, or to print the last numeric channel value of a metric sent to Neptune, the values from the final epoch have not yet been logged. When training has finished, the last line of metrics.csv is 2020-04-02 17:23:16.029189,0.04208208369463682,30.0, but see the code below for the outputs/uploads produced by on_train_end:

Code sample

def on_epoch_end(self):
    # Logging loss per epoch
    train_loss_mean = np.mean(self.training_losses)
    # Saves loss of final epoch for later visualization
    self.final_loss = train_loss_mean
    self.logger[0].experiment.log_metric('epoch/mean_absolute_loss', y=train_loss_mean, x=self.current_epoch)
    self.logger[1].experiment.log({'epoch/mean_absolute_loss': train_loss_mean, 'epoch': self.current_epoch}, global_step=self.current_epoch)
    self.training_losses = []  # reset for next epoch

def on_train_end(self):
    # Upload the metrics.csv written by the TestTube logger to Neptune
    save_dir = Path(self.logger[1].experiment.get_logdir()).parent/'metrics.csv'
    self.logger[0].experiment.log_artifact(save_dir)



The metrics.csv uploaded this way does not yet contain the final epoch's row. The same is true when reading the last logged values back from Neptune:



def on_train_end(self):
    # Print the most recent values logged to the Neptune experiment
    log_last = self.logger[0].experiment.get_logs()
    print('Last logged values: ', log_last)

Output: Last logged values: {'epoch/mean_absolute_loss': Channel(channelType='numeric', id='b00cd0e5-a427-4a3c-a10c-5033808a930e', lastX=29.0, name='epoch/mean_absolute_loss', x=29.0, y='0.04208208404108882')}

When printing self.final_loss in on_train_end I get the correct last value though.

Expected behavior


The on_train_end method should only be called after the values from the final epoch have been logged.

Labels: logger, priority: P0, bug / fix, help wanted

All 5 comments

@Dunrar could you link to a Colab notebook to reproduce this? I checked the training loop code and we only call on_train_end after training is complete:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L671

If you can reproduce it in a notebook, I can dig deeper into why you're experiencing this problem.

Hey @jeremyjordan, sorry, just got around to it. Here is the notebook:
https://colab.research.google.com/drive/1WH5GmyvrSevWPp2_C2wfSMakIDgpy477

@Dunrar Had a little look at this and your code. on_train_end is not being called before the epoch has finished; it only looks that way because the logs aren't finalised/saved until after on_train_end has been called, so the values you read inside on_train_end are stale.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L693

Adding a self.logger[1].save() to the beginning of on_train_end() (or the end of on_epoch_end()) yields the result you'd expect for me with the test_tube logger. I'm not familiar with Neptune, but based on the structure of pytorch-lightning the result should be the same if you add self.logger[0].save() as well.
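For reference, a minimal sketch of this workaround applied to the original on_train_end, assuming the same logger ordering as in the issue (Neptune at index 0, TestTube at index 1) and Path imported from pathlib as in the original snippet. The .save() calls simply flush any pending log writes before the artifact is uploaded:

def on_train_end(self):
    # Workaround: flush the loggers manually so the final epoch's metrics
    # are written out before on_train_end reads or uploads them.
    self.logger[1].save()  # TestTube: writes pending rows to metrics.csv
    self.logger[0].save()  # Neptune: flush as suggested above

    # metrics.csv now includes the last epoch's values
    save_dir = Path(self.logger[1].experiment.get_logdir()).parent/'metrics.csv'
    self.logger[0].experiment.log_artifact(save_dir)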

@HenryJia Just tried it, thank you! I'll close this then :)

Glad I could help :)

