Should we have training accuracy calculation automated?
Currently I'm handling it like this:
import torch
import pytorch_lightning as ptl

class Model(ptl.LightningModule):
    def __init__(self):
        super(Model, self).__init__()
        self.training_correct_counter = 0  # running count of correct predictions

    def training_step(self, batch, batch_nb):
        # ... forward pass producing y_hat (and y_adv_hat) and the targets y ...
        # reset the counter on the first batch of the epoch, otherwise accumulate
        if batch_nb == 0:
            self.training_correct_counter = (torch.max(y_hat, 1)[1].view(y.size()) == y).sum()
        else:
            self.training_correct_counter += (torch.max(y_hat, 1)[1].view(y.size()) == y).sum()
        return {'loss': self.my_loss(y_adv_hat, y)}

    def validation_end(self, outputs):
        # ...
        train_avg_acc = 100 * self.training_correct_counter / len(self.tng_dataloader.dataset)
        return {'Training/_accuracy': train_avg_acc}
Just calculate accuracy in training_step. You can do whatever in there, it's not just for the loss.
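For example, a minimal sketch of per-batch accuracy computed directly in training_step (assuming a classification setup where the batch is (x, y), the model has a my_loss method like the snippet above, and a Lightning version that accepts a 'log' dict in the returned output):

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    loss = self.my_loss(y_hat, y)
    # per-batch accuracy: fraction of argmax predictions that match the targets
    preds = torch.max(y_hat, 1)[1]
    acc = (preds == y).float().mean()
    return {'loss': loss, 'log': {'train_acc': acc}}

Note this only reports the accuracy of the current batch, which is exactly the limitation discussed below.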
I think the problem here is that if metrics are calculated in training_step, they are only calculated for one batch. I need to tweak the code as @rcmalli did to aggregate over the whole epoch.
Can we have a function called training_end where we can calculate metrics for the whole epoch? (Something similar to validation_end, but for training.)
@minhptx Did you implement this? I also want to collect my training metrics after each epoch, but as far as I understood, the new method training_end just collects the outputs for a single batch, not for all batches in an epoch.
I'm also interested in such a feature. It took me a little while to understand that training_end and validation_end did not have the same behavior, which is a bit misleading. It may be clearer to have training_end be whatever happens at the end of an epoch, and maybe rename the current training_end to training_step_end.
@Jonathan-LeRoux I'm in the same boat. It is super misleading that validation_end and training_end have different behaviour. It took me a while to understand what was going on.
Continuing this discussion @williamFalcon, I think this thread's name is misleading. There's absolutely no reason for Lightning to automatically calculate accuracy. On the other hand, it would be super useful if Lightning could keep the list of outputs of training_step, just like it does for validation_step with validation_end.
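(For reference, later Lightning versions added a training_epoch_end hook that does roughly this. A minimal sketch, assuming a 1.x release where training_epoch_end receives the list of training_step outputs and self.log exists; the n_correct/n_total keys are just names chosen for this example:)

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    preds = torch.max(y_hat, 1)[1]
    return {'loss': self.my_loss(y_hat, y),
            'n_correct': (preds == y).sum(),
            'n_total': y.size(0)}

def training_epoch_end(self, outputs):
    # outputs is the list of dicts returned by training_step, one per batch of the epoch
    n_correct = sum(out['n_correct'] for out in outputs)
    n_total = sum(out['n_total'] for out in outputs)
    self.log('train_acc_epoch', n_correct.float() / n_total)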
Correct me if I'm wrong, but the only way to calculate these metrics is for me to save the state of (y_hat, target) throughout the entire epoch and calculate the metrics at certain points. My point is: if I am not supposed to keep state to track validation metrics, why would we break that philosophy with the training metrics?
edit:
There are metrics we can calculate per batch, such as accuracy, and just save a running average; for those we could use external loggers. On the other hand, metrics like F1 need to be calculated over the entirety of the dataset, so pumping out values to the loggers at each training step seems useless for this purpose (of course, we could keep running averages of precision, etc., but you get the point). A sketch of that workaround follows.
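(One way to do the "save state for the whole epoch" workaround described above; this is a sketch, not an official Lightning API: the _train_preds/_train_targets buffers and the use of sklearn.metrics.f1_score are my own choices, and it assumes a version that has the on_train_epoch_start/on_train_epoch_end hooks and self.log:)

from sklearn.metrics import f1_score

def on_train_epoch_start(self):
    # hypothetical buffers, reset once per epoch
    self._train_preds, self._train_targets = [], []

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    self._train_preds.append(torch.max(y_hat, 1)[1].detach().cpu())
    self._train_targets.append(y.detach().cpu())
    return {'loss': self.my_loss(y_hat, y)}

def on_train_epoch_end(self):
    preds = torch.cat(self._train_preds).numpy()
    targets = torch.cat(self._train_targets).numpy()
    # F1 needs the complete set of predictions, so it can only be computed here
    self.log('train_f1', f1_score(targets, preds, average='macro'))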
@captainvera have you checked the recent changes in #776 #889 #950?
Anyway, a PR with suggestions is welcome :robot:
@captainvera May I ask how you compute metrics like F1 in the current version? I tried to do it in validation_epoch_end, but it seemed that to access the data loader via val_dataloader I would need to handle things like moving tensors to the correct devices manually...
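(In case it helps, a sketch under the assumption of a version where validation_epoch_end receives the list of validation_step outputs and self.log exists: return the predictions and targets from validation_step instead of re-reading val_dataloader, so the trainer has already placed the batch on the right device and no manual device handling is needed.)

from sklearn.metrics import f1_score

def validation_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    return {'val_loss': self.my_loss(y_hat, y),
            'preds': torch.max(y_hat, 1)[1],
            'targets': y}

def validation_epoch_end(self, outputs):
    preds = torch.cat([out['preds'] for out in outputs]).cpu().numpy()
    targets = torch.cat([out['targets'] for out in outputs]).cpu().numpy()
    self.log('val_f1', f1_score(targets, preds, average='macro'))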