Given the new metrics in 1.0.0 and later (which I really like!), I have three accuracy metrics for training, validation and test initialized in the __init__ method. Do I need to reset them at the end of the training and validation epochs, given that they will be used multiple times?
Depends on how you are using the metrics. In general, if the .compute() method is called, the internal state is reset. This means that if you call .compute() at the end of the epoch you should be fine. If you are using metrics in combination with self.log, then setting on_epoch=True will also internally call .compute() at the end of the epoch.
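For reference, a minimal sketch of the two patterns described above; the module, layer, and metric names here are placeholders (not from this thread), and in practice you would pick one pattern or the other:

import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)
        self.train_acc = pl.metrics.Accuracy()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        # Pattern A: hand the metric object to self.log; with on_epoch=True,
        # Lightning calls .compute() (which, as described above, resets the
        # internal state) at the end of the epoch.
        self.train_acc(logits.argmax(dim=1), y)
        self.log('train_acc', self.train_acc, on_step=False, on_epoch=True)
        return torch.nn.functional.cross_entropy(logits, y)

    def training_epoch_end(self, outputs):
        # Pattern B: call .compute() yourself at the end of the epoch;
        # as described above, this also resets the internal state.
        epoch_acc = self.train_acc.compute()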
@SkafteNicki Thanks for the information. This has been very useful.
I'd like to get the validation accuracy over the entire validation set, and I have seen some strange results with DDP. I want to make sure I am doing the right thing with metrics under DDP. Below is my pseudo-code:
def __init__(self):
    super().__init__()
    # accumulate across batches; compute only at the end of the epoch
    self.val_acc = Accuracy(compute_on_step=False)

def validation_step(self, batch, batch_idx):
    input, y = batch[0], batch[1]
    logits = self(input)
    _, pred = torch.max(logits, dim=1)
    self.val_acc.update(pred, y)
    self.log('val_acc', self.val_acc, on_step=False, on_epoch=True)
Could someone tell me if this implementation will give me the validation accuracy over the entire validation set? Thanks.
Yes that is the correct way.
Could you explain the strange results you are seeing in ddp mode?
I am facing similar issues with ddp. When computing F1 using Fbeta, the results with ddp are not the same as with dp (on a single machine). To compare, I use the same saved checkpoint. In my case the metric is computed in test_step_end:
def test_step_end(self, test_step_outputs):
    self.fbeta_test(test_step_outputs['y_hat'], test_step_outputs['y'])
without logging. And then in test_epoch_end:
def test_epoch_end(self, outputs: list):
    fbeta_test = self.fbeta_test.compute()
So I am not sure whether ddp is just averaging the metric across ranks instead of computing the metric over the whole dataset. It is worth saying that when the metric is computed in test_epoch_end, the value of fbeta_test is the same across ranks. With dp the result is .5670 and with ddp it returns .6154. Running it on a single gpu with no dp returns the same value as dp (.5670), which I assume is the correct one.
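To illustrate why that would matter, here is a small standalone sketch with made-up numbers and a hand-rolled binary F1 (not Lightning's Fbeta): averaging a per-rank F1 generally differs from computing F1 once over the pooled predictions, because F1 is a ratio of counts rather than a mean over samples.

import torch

def f1_binary(preds, target):
    # plain binary F1 from true/false positive and false negative counts
    tp = ((preds == 1) & (target == 1)).sum().float()
    fp = ((preds == 1) & (target == 0)).sum().float()
    fn = ((preds == 0) & (target == 1)).sum().float()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# pretend these are the predictions/targets seen by two DDP ranks
preds_0, target_0 = torch.tensor([1, 0, 0, 0]), torch.tensor([1, 1, 0, 0])
preds_1, target_1 = torch.tensor([1, 1, 1, 0]), torch.tensor([0, 1, 1, 0])

rank_average = (f1_binary(preds_0, target_0) + f1_binary(preds_1, target_1)) / 2
pooled = f1_binary(torch.cat([preds_0, preds_1]), torch.cat([target_0, target_1]))
print(rank_average, pooled)  # ~0.733 vs 0.750 -- not the same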
@SkafteNicki I am trying a new model with PyTorch Lightning and also with the new metrics in 1.0.3. The strange metric results in DDP may or may not be related to the new metrics. That's why I'd like to make sure I am doing the right thing with the new metrics. This helps me debug any issue that might be related to the new model. Thanks.
Thanks @LittlePea13 and @junwen-austin for both getting back to me.
It seems to me that there may be a problem with how metrics are aggregated in ddp mode. I will try to identify the issue and get back to you.
No problem @SkafteNicki. Let me know if I can help in any way, I tried to debug a bit but didn't get very far, I am quite new to ddp. And thanks for your work :)
@LittlePea13 could you try setting dist_sync_on_step to True and see if it solves the problem?
If so, then it has something to do with how the metric states are buffered.
@SkafteNicki I just tried and the result was still the same (wrong) by declaring the metric as:
self.fbeta_test = Fbeta(num_classes=1, dist_sync_on_step=True)
Overriding on_epoch_end allows computing metrics over all the data, but there is no way to log those results (or even use them for model selection), because log_train_epoch_end_metrics is called right after the loop through the batches and before on_epoch_end.
https://github.com/PyTorchLightning/pytorch-lightning/blob/b50dd12332bf83209d9535c8516486edc1a6b252/pytorch_lightning/trainer/training_loop.py#L608-L613
@hoanghng instead of using on_epoch_end could you use training_epoch_end? That should work with logging.
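Something along these lines should do it (just a sketch; the metric attribute and key names are illustrative):

def training_epoch_end(self, outputs):
    # logging works from this hook, unlike on_epoch_end
    # (see the ordering discussed above)
    epoch_acc = self.train_acc.compute()
    self.log('train_acc_epoch', epoch_acc)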
@SkafteNicki Thanks a lot. It works like a charm.
@SkafteNicki @hoanghng I feel like I missed something. Are you referring to metrics in ddp? I am using the *_epoch_end methods but still see different results with ddp.
@LittlePea13 this was just a question about which model methods actually support logging.
Metrics are not reset automatically on epoch end in DDP for me.
In the LitModule's __init__() I initialize an MSE metric:
self.train_mse = pl.metrics.MeanSquaredError()
Then in training step:
mse = self.train_mse(pred, target)  # calling the metric returns the batch value while also updating the accumulated state
self.log('train_mse', self.train_mse, on_step=False, on_epoch=True)
In addition I print the mse of each batch manually.
Comparing the two, it seems that the logged MSE is an average over all epochs so far.
PL v1.0.7
pytorch v1.6.0
@itsikad I can confirm that it is not being reset correctly.
Could you open a new issue where you reproduce this using the BoringModel?
@SkafteNicki is the reset not correct in DDP mode for all class metrics, or just for this MSE? If it is for all of them, could one manually reset the metrics after the compute() call at the end of the epoch to fix it for the time being? Thanks.
@SkafteNicki Done #4806
@junwen-austin I have really not investigated this enough yet, so I don't know how deep the rabbit hole goes. I suspect it is the same for all other metrics. Until solved, just call self.metric.reset() in training_epoch_end(). Let's keep further discussion in the new issue.
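For anyone hitting this before a fix lands, the suggested workaround would look roughly like this (the metric attribute name is illustrative):

def training_epoch_end(self, outputs):
    epoch_mse = self.train_mse.compute()
    self.log('train_mse_epoch', epoch_mse)
    # manual workaround until the automatic reset is fixed (see #4806)
    self.train_mse.reset()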
@itsikad @SkafteNicki I took itsikad's colab notebook and made the following changes:
def training_step(self, batch, batch_idx):
    output = self.layer(batch)
    loss = self.loss(batch, output)
    self.metric.update(output, torch.ones_like(output))
    return {"loss": loss, "batch_mse": loss}

def training_epoch_end(self, outputs) -> None:
    avg_mse = self.metric.compute()
    self.log('mse', avg_mse, on_step=False, on_epoch=True)
    print(f'Sum squared error: {avg_mse}, Total samples: {self.metric.total}')
and now it works. Here is the link: https://colab.research.google.com/drive/1-NJKZ1hiXVCCirN7xsLmswf-zEyxVQIU?usp=sharing
Essentially, what I did is update the metric in training_step and then, at the end of the epoch, call compute() explicitly to get the value of the metric and pass that value to self.log. This might be a temporary solution.
@junwen-austin
It works since you added an explicit call to .compute() (as I mentioned in #4806). However, according to the documentation, that should be unnecessary with self.log(..., on_epoch=True).
Edit: missed the last part of your reply, indeed a possible temp solution.