Should we have training accuracy calculation automated?
Currently I'm handling it like this:
import torch
import pytorch_lightning as ptl

class Model(ptl.LightningModule):
    def __init__(self):
        super(Model, self).__init__()
        self.training_correct_counter = 0  # running count of correct predictions

    def training_step(self, batch, batch_nb):
        # ... forward pass producing y_hat (and y_adv_hat) and the targets y ...
        # reset the counter on the first batch of the epoch, otherwise accumulate
        if batch_nb == 0:
            self.training_correct_counter = (torch.max(y_hat, 1)[1].view(y.size()) == y).sum()
        else:
            self.training_correct_counter += (torch.max(y_hat, 1)[1].view(y.size()) == y).sum()
        return {'loss': self.my_loss(y_adv_hat, y)}

    def validation_end(self, outputs):
        # ...
        train_avg_acc = 100 * self.training_correct_counter / len(self.tng_dataloader.dataset)
        return {'Training/_accuracy': train_avg_acc}
Just calculate accuracy in training_step. You can do whatever in there, it's not just for the loss.
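For example, a minimal sketch of per-batch accuracy computed directly in training_step (assuming a classification setup where the batch is (x, y), the model has a my_loss method like the snippet above, and a Lightning version that accepts a 'log' dict in the returned output):

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    loss = self.my_loss(y_hat, y)
    # per-batch accuracy: fraction of argmax predictions that match the targets
    preds = torch.max(y_hat, 1)[1]
    acc = (preds == y).float().mean()
    return {'loss': loss, 'log': {'train_acc': acc}}

Note this only reports the accuracy of the current batch, which is exactly the limitation discussed below.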
I think the problem here is that if metrics are calculated in training_step, they are only calculated for one batch. I need to tweak the code as @rcmalli did to aggregate over the whole epoch.
Can we have a function called training_end where we can calculate metrics for the whole epoch? (Something similar to validation_end, but for training.)
@minhptx Did you implement this? I also want to collect my training metrics after each epoch, but as far as I understood, the new method training_end just collects the outputs for a single batch, not for all batches in an epoch.
I'm also interested in such a feature. It took me a little while to understand that training_end and validation_end did not have the same behavior, which is a bit misleading. It may be clearer to have training_end be whatever happens at the end of an epoch, and maybe rename the current training_end to training_step_end.
@Jonathan-LeRoux I'm in the same boat. It is super misleading that validation_end and training_end have different behaviour. It took me a while to understand what was going on.
Continuing this discussion @williamFalcon, I think this thread's name is misleading. There's absolutely no reason for Lightning to automatically calculate accuracy. On the other hand, it would be super useful if Lightning could keep the list of outputs of training_step, just like it does for validation_step with validation_end.
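(For reference, later Lightning versions added a training_epoch_end hook that does roughly this. A minimal sketch, assuming a 1.x release where training_epoch_end receives the list of training_step outputs and self.log exists; the n_correct/n_total keys are just names chosen for this example:)

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    preds = torch.max(y_hat, 1)[1]
    return {'loss': self.my_loss(y_hat, y),
            'n_correct': (preds == y).sum(),
            'n_total': y.size(0)}

def training_epoch_end(self, outputs):
    # outputs is the list of dicts returned by training_step, one per batch of the epoch
    n_correct = sum(out['n_correct'] for out in outputs)
    n_total = sum(out['n_total'] for out in outputs)
    self.log('train_acc_epoch', n_correct.float() / n_total)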
Correct me if I'm wrong, but the only way to calculate these metrics is for me to save the state of (y_hat, target) throughout the entire epoch and calculate the metrics at certain points. My point is: if I am not supposed to keep state to track validation metrics, why would we break that philosophy with the training metrics?
edit:
There are metrics we can calculate per batch, such as accuracy, and just save a running average; for those we could use external loggers. On the other hand, metrics like F1 need to be calculated over the entirety of the dataset, so pumping out values to the loggers at each training step seems useless for this purpose (of course, we could keep running averages of precision, etc., but you get the point). A sketch of that workaround follows.
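(One way to do the "save state for the whole epoch" workaround described above; this is a sketch, not an official Lightning API: the _train_preds/_train_targets buffers and the use of sklearn.metrics.f1_score are my own choices, and it assumes a version that has the on_train_epoch_start/on_train_epoch_end hooks and self.log:)

from sklearn.metrics import f1_score

def on_train_epoch_start(self):
    # hypothetical buffers, reset once per epoch
    self._train_preds, self._train_targets = [], []

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    self._train_preds.append(torch.max(y_hat, 1)[1].detach().cpu())
    self._train_targets.append(y.detach().cpu())
    return {'loss': self.my_loss(y_hat, y)}

def on_train_epoch_end(self):
    preds = torch.cat(self._train_preds).numpy()
    targets = torch.cat(self._train_targets).numpy()
    # F1 needs the complete set of predictions, so it can only be computed here
    self.log('train_f1', f1_score(targets, preds, average='macro'))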
@captainvera have you checked the recent changes in #776 #889 #950?
Anyway, a PR with suggestions is welcome :robot:
@captainvera May I ask how you compute metrics like F1 in the current version? I tried to do it in validation_epoch_end, but it seemed that to access the data loader via val_dataloader I would need to handle things like moving tensors to the correct devices manually...
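(In case it helps, a sketch under the assumption of a version where validation_epoch_end receives the list of validation_step outputs and self.log exists: return the predictions and targets from validation_step instead of re-reading val_dataloader, so the trainer has already placed the batch on the right device and no manual device handling is needed.)

from sklearn.metrics import f1_score

def validation_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    return {'val_loss': self.my_loss(y_hat, y),
            'preds': torch.max(y_hat, 1)[1],
            'targets': y}

def validation_epoch_end(self, outputs):
    preds = torch.cat([out['preds'] for out in outputs]).cpu().numpy()
    targets = torch.cat([out['targets'] for out in outputs]).cpu().numpy()
    self.log('val_f1', f1_score(targets, preds, average='macro'))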