Introduce a metrics package that contains easy-to-use reference implementations of metrics people care about. The package should take care of the following:
- A ConfusionMatrix metric that can use pandas for the terminal (where it will print a nicely tabulated representation) and maybe a color-coded map for notebooks and TensorBoard (see the sketch below).
- A MetricsReporter of some sort that will generate a full report of all the metrics you care about.
This would be a rather large change, and I'm not sure what is the best way to do it. This issue is really meant to spur discussion on the topic. The actual solution might require some stuff to be landed in PyTorch, and that's fine.
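To make the ConfusionMatrix bullet concrete, here is a minimal sketch of what a pandas-backed terminal rendering could look like; the ConfusionMatrix class and its update()/to_dataframe() methods are made up for illustration and are not an existing Lightning or PyTorch API:

import pandas as pd
import torch


class ConfusionMatrix:
    """Illustrative reference metric: accumulates a confusion matrix over batches."""

    def __init__(self, num_classes, class_names=None):
        self.class_names = class_names or [str(i) for i in range(num_classes)]
        self.matrix = torch.zeros(num_classes, num_classes, dtype=torch.long)

    def update(self, preds, targets):
        # preds and targets are 1-D tensors of class indices
        for t, p in zip(targets.view(-1), preds.view(-1)):
            self.matrix[t, p] += 1

    def to_dataframe(self):
        # rows are the true class, columns the predicted class
        return pd.DataFrame(self.matrix.numpy(),
                            index=self.class_names,
                            columns=self.class_names)


cm = ConfusionMatrix(num_classes=3, class_names=["cat", "dog", "bird"])
cm.update(preds=torch.tensor([0, 1, 1, 2]), targets=torch.tensor([0, 1, 2, 2]))
print(cm.to_dataframe())  # nicely tabulated representation in the terminal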
Metrics are a big component of reproducibility. They satisfy all the requirements you can think of to justify standardizing them. Even something as seemingly simple as a standard deviation hides a choice: do you divide by n or by n-1 (aka: do you use Bessel's correction)? NumPy defaults to not using it, while MATLAB and PyTorch default to using it, so it's unsurprising to see threads like this. I think Lightning should take a page from Ignite's book and add a metrics package.
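The n vs. n-1 divergence mentioned above is easy to demonstrate; the snippet below only illustrates the library defaults and is not tied to any Lightning API:

import numpy as np
import torch

x = [1.0, 2.0, 3.0, 4.0]

# NumPy defaults to the biased estimator (divide by n, i.e. ddof=0)
print(np.std(x))                                   # 1.118...

# PyTorch defaults to Bessel's correction (divide by n-1)
print(torch.std(torch.tensor(x)))                  # tensor(1.2910)

# the two agree only once the correction is made explicit
print(np.std(x, ddof=1))                           # 1.291...
print(torch.std(torch.tensor(x), unbiased=False))  # tensor(1.1180)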
I also propose that Lightning take care of the validation_step() and validation_end() steps for each metric, so that they can be computed efficiently per-batch. In this proposal, the API for the LightningModule will be simplified significantly. For example, something like this:
class CoolSystem(pl.LightningModule):

    def __init__(self):
        super(CoolSystem, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx, training_metrics):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        return {'loss': loss}

    def configure_metrics(self):
        # OPTIONAL
        training_metrics = MetricsLog([Accuracy(), Loss()])
        eval_metrics = MetricsLog([F1Score('micro'), F1Score('macro'), F1Score(None), Accuracy(), PrecisionAtRecall(80)])
        return {'train_metrics': training_metrics, 'eval_metrics': eval_metrics}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)
If we find a way to have all metrics computation code done in PyTorch, even for DDP, it would be highly preferable I think. I just don't know if it's possible - maybe if we formulate metrics as a layer of sorts we might be able to do that? All standard layers have state that persists and gets updated across batches (their weights :D ), so maybe we can implement metrics as a sort of nn.Module?
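To sketch the metric-as-module idea (this is only an illustration assuming vanilla PyTorch; the Accuracy class and the update()/compute-on-forward split are made up, not an existing API):

import torch
import torch.nn as nn


class Accuracy(nn.Module):
    """Illustrative metric-as-module: state lives in buffers, like weights do."""

    def __init__(self):
        super().__init__()
        # buffers persist across batches, move with .to(device), and land in state_dict()
        self.register_buffer("correct", torch.tensor(0.0))
        self.register_buffer("total", torch.tensor(0.0))

    def update(self, y_hat, y):
        preds = y_hat.argmax(dim=-1)
        self.correct += (preds == y).sum().float()
        self.total += float(y.numel())

    def forward(self):
        # the compute() step: derive the metric value from the accumulated state
        return self.correct / self.total.clamp(min=1.0)

    def reset(self):
        self.correct.zero_()
        self.total.zero_()

Because the state sits in buffers, it could in principle also be reduced across processes for DDP, which is exactly the hard part discussed later in this thread.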
There is a separate discussion about providing the underlying muscle for this directly in torch (see https://github.com/pytorch/pytorch/issues/22439).
Love it!
@srush thoughts?
So I like the API and agree this would be nice to factor out. However, I don't think that Lightning should supply the metrics. Outside of classification, this is a really tricky thing to get right and often requires lots of domain engineering. Just as with the dataloaders, you don't want to own them, but facilitate a super clean API.
Random thought: why even specify CrossEntropy in training? Isn't that just another metric?
Hi,
I may contribute WER because I am interested in using it.
There is a reference implementation for PyTorch (https://github.com/1ytic/pytorch-edit-distance); I can also easily implement it in numpy or plain Python.
However, I have a few general questions based on the discussion above:
My needs vary:
- edit_distance per batch and numwords per batch
- just sum(edit_distances) / sum(numwords) over the whole dataset
I wanted to demonstrate that WER computation differs (slightly) between a batch and the dataset.
My question:
Based on my understanding, pytorch-lightning does not support events, i.e. end_of_epoch, every_N_iterations, end_of_training, etc.
Does this proposal (or any other feature in PyTorch Lightning) support triggering metrics
1. at different occasions (start_epoch, end_epoch, every 4000 iterations, etc.)
2. on different objects (training example, batch, dataset (loader))?
So I like the API and agree this would be nice to factor out. However, I don't think that Lightning should supply the metrics. Outside of classification, this is a really tricky thing to get right and often requires lots of domain engineering. Just as with the dataloaders, you don't want to own them, but facilitate a super clean API.
I agree with the extensibility principle, but I'd also like each community to have their own standardized metrics to help with reproducibility in research (maybe Lightning supplies the classification ones as an example, and then maybe the image community will build the ones for the tasks they care about, e.g. segmentation, pose estimation, etc. Same for the speech community, which may build WER and so on). I think it's fine if they live in submodules, or even externally, as long as the discoverability problem is tackled. I like submodules more because they are more "official" and this pushes people to collaborate on one reference implementation rather than have too many.
Random thought: why even specify CrossEntropy in training? Isn't that just another metric?
Fair enough :) Yep, it's definitely another metric.
Sure, I think we agree. Like with the dataloaders: torch-text (https://github.com/pytorch/text) is not a submodule, right? It's an external package.
I agree with the extensibility principle, but I'd also like each community to have their own standardized metrics to help with reproducibility in research (maybe Lightning supplies the classification ones as an example, and then maybe the image community will build the ones for the tasks they care about, e.g. segmentation, pose estimation, etc. Same for the speech community, which may build WER and so on).
I am against this for exactly that reason! There are a million implementations of BLEU in NLP exactly because everyone wants to standardize it to "their" version.
Ignite has convinced me that the computation part (DDP included) could be done at the Torch layer (more info in this comment there). Even if people want to reimplement their own BLEU, we'd probably want to land a reference implementation in TorchText. You are free to avoid it if you don't like it, but it's also a pain to write your own, so I think you'd do it only if you either don't trust its code, or it's not flexible enough for you, or it's not fast enough/uses too much memory. I'm willing to bet reimplementations will stop if something better is there. Isn't the whole point of Lightning to stop people from writing their own training loops? :D I see it as very much the same thing.
Assuming the PyTorch API in the proposal happens, then maybe there is no API change that Lightning would have to undergo: just write your own update() and compute() calls in the validation_step() and validation_end() methods and pass the results to the dicts that attach to TensorBoard? That would take care of the pretty printing, and the boilerplate would seem rather minimal:
class CoolSystem(pl.LightningModule):

    def __init__(self):
        super(CoolSystem, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.train_macro_f1 = nn.F1Score("macro")
        self.validation_macro_f1 = nn.F1Score("macro")

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        self.train_macro_f1.update(y_hat, y)  # void method, just updates state
        m = self.train_macro_f1()  # forward calls compute(), takes no args; computes from state
        tensorboard_logs = {'train_loss': loss, 'train_macro_f1': m}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        self.validation_macro_f1.update(y_hat, y)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss, 'val_macro_f1': self.validation_macro_f1()}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)
There is some boilerplate, but it's not the end of the world.
I would not add another object which would hold metric values; it seems like duplicating logger functionality... What about defining just a list of metrics to be computed (functions or callable classes), so that these metrics will be automatically added to the output dictionary?
cc: @PyTorchLightning/core-contributors ^^ opinion about all above?
I like the way Ignite's metrics expose their API, as previously discussed.
Opinion:
Organizing the evaluation of the metric at the right moments is often a bigger issue than implementing the metric itself.
Motivation for Solution:
Metrics "Evaluation" is in a similar situation as "a training loop" before PL.
I suggest mimic the approach of how a LightningModule simplified the training loop.
I really like about PL how PL breaks the messy training loop and defines "callback/event" mechanism in the LightningModule.
Solution: EvaluationPlan (pseudo) implementation
class EvaluationPlan():

    def __init__(self, **kwargs):
        pass  # todo think it through

    def setup_train_epoch(self, trainer, module):
        pass  # e.g. init metric for evaluating the epoch

    def setup_train_batch(self, trainer, module):
        pass  # e.g. init metric for collecting data during the training epoch

    def eval_train_epoch(self, trainer, module):
        pass  # compute final metric value for the epoch and log it

    # ...
    # more useful events
    # ...

    def setup_training(self, trainer, module):
        pass  # init metrics for the whole training - collect data in any method above

    def eval_training(self, trainer, module):
        pass  # compute & log metric for the whole training

    # ...
    # every supported event needs to be implemented in a base class
    # ...
@oplatek do you want to do it in a form like a callback?
@oplatek do you want to do it in a form like a callback?
@Borda Yes. The intended use is like this: (I renamed LightningEvaluation -> EvaluationPlan)
trainer = Trainer(eval_plan=EvaluationPlan())
trainer.fit(model)
We should provide a default EvaluationPlan which accepts any number of metrics and averages each metric:
# I can imagine implementation where these definitions will be identical
trainer = Trainer()
trainer = Trainer(eval_plan=BatchEpochPlan())
trainer = Trainer(eval_plan=BatchEpochPlan(metrics=['loss']))
trainer = Trainer(eval_plan=BatchEpochPlan(metrics=[loss]))
trainer = Trainer(eval_plan=BatchEpochPlan(train_metrics=[loss], valid_metrics=[loss], test_metrics=[loss]))
trainer = Trainer(metrics=['loss'])
trainer = Trainer(metrics=[loss])
# ^ and ^^ will require the EpochPlan base class to have a constructor:
# def __init__(self, metrics=None, ..., **kwargs)
What should BatchEpochPlan do? (a rough sketch follows this list)
-> Set up each metric at the beginning of every epoch: loss = Loss(..)
-> After each batch, update the metric: loss.update(y, y_hat)
-> At epoch end, compute the metric: loss()
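A rough sketch of such a plan, building on the pseudo EvaluationPlan above; the hook names, the eval_train_batch signature, the idea of passing metric factories, and the Loss()/update()/__call__() interface are all assumptions of this thread rather than an existing API:

class BatchEpochPlan(EvaluationPlan):
    """Hypothetical default plan: set up per epoch, update per batch, compute at epoch end."""

    def __init__(self, metrics=None, **kwargs):
        super().__init__(**kwargs)
        # metrics are factories, e.g. [Loss, Accuracy], so fresh state can be created each epoch
        self.metric_factories = metrics or []
        self.metrics = []

    def setup_train_epoch(self, trainer, module):
        # set up each metric at the beginning of every epoch: loss = Loss(..)
        self.metrics = [factory() for factory in self.metric_factories]

    def eval_train_batch(self, trainer, module, y_hat, y):
        # after each batch, update the metric: loss.update(y, y_hat)
        for metric in self.metrics:
            metric.update(y, y_hat)

    def eval_train_epoch(self, trainer, module):
        # at epoch end, compute the metric: loss()
        return {type(m).__name__.lower(): m() for m in self.metrics}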
What about defining just a list of metrics to be computed (functions or callable classes), so that these metrics will be automatically added to the output dictionary?
I absolutely agree, every eval_* method in EvaluationPlan should update the output dictionary (the output dictionaries for batch, validation_epoch_end, etc.).
I agree that there should be a metrics package, but I would differentiate between the metric itself and its integration into the trainer/module.
I like the approach of an EvaluationPlan that can be used for arbitrary metrics, but IMO the metrics should be standalone (as functions or torch.nn.Module). This would enable the user to also use the metric directly within the LightningModule, without having to go through the EvaluationPlan.
I see two main advantages in that case:
1.) We can postpone the EvaluationPlan implementation if necessary
2.) We will pretty soon be able to provide a standardised implementation for many metrics (which increases reproducibility), that can already be used by the user in the module, while we work on a more elegant way to use them (like the EvaluationPlan).
First of all, congratulations for this amazing project. I love Lightning.
Now, when it comes to new features, I agree with @justusschock: while I'm pretty new to Lightning, the main thing I'm missing is metrics. It would be especially helpful in more complex cases, with multiple metrics, etc.
@WSzP @Darktex thanks for the awesome feedback! @justusschock will be leading the implementation here. I suspect we can learn a lot from what others have done before and take a stab at this with those lessons learned.
I do agree with @srush that some of the metrics are very domain specific, but we can take that into consideration. For some metrics like BLEU, etc., we should make sure to point to the reference paper for implementation correctness. But the few times I had to calculate BLEU made me depressed for weeks, haha, because of all the hoops I had to jump through.
There's a related issue #1256 where I proposed that we add a key for metrics in what is returned from the training_step and validation_step (and then discourage use of arbitrary keys). We could use the metrics (defined to be standalone as functions or torch.nn.Module, as @justusschock mentions) and then either report them for each batch or aggregate at the end of an epoch. Essentially, this makes it easy to capture and access metrics while leaving the user in control of how the details are implemented.
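Sketched concretely (the dedicated 'metrics' key is the #1256 proposal, not current behavior; the accuracy computation is only an example):

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (y_hat.argmax(dim=-1) == y).float().mean()
    # proposed: one well-known 'metrics' key instead of arbitrary top-level keys
    return {'loss': loss, 'metrics': {'train_acc': acc, 'train_loss': loss}}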
I like the approach of an EvaluationPlan that can be used for arbitrary metrics, but IMO the metrics should be standalone (as functions or torch.nn.Module). This would enable the user to also use the metric directly within the LightningModule, without having to go through the EvaluationPlan.
@justusschock Could we do a similar wrapping as we did for loggers, where we have LightningLoggerBase and then LoggerCollection which would encapsulate the particular metrics, so that at usage time you do not bother much whether you use a single metric or a whole set...?
I like the idea, but I'm not sure we need this here...
Let's say we have the metrics all defined as standalones; we could probably get away with only a collection-like thing that has something like the following init:
class EvaluationPlan:
    def __init__(self, *metrics: Union[Callable, torch.nn.Module], **kwargs):
        ...
This would then work with any number of metrics.
I would love to have a metrics module. But I think it would be best to have a Metrics abstract class first, which is basically an aggregator of training outputs and validation outputs. Then I can use sklearn or ignite as the computation tool. Here is what I am thinking.
class PLMetric():

    def __init__(self, calc_fn, mode='batch'):
        # mode is either 'batch' or 'epoch'.
        # This controls what the inputs x and y are after every batch:
        # if 'batch', x, y hold the output from the current step only;
        # if 'epoch', x, y accumulate all previous outputs in the current epoch.
        self.calc_fn = calc_fn
        self.mode = mode
        self.prev_outputs = []

    def calc(self, x, y):
        if self.mode == 'epoch':
            self.prev_outputs.append([x, y])
            return self.calc_fn(self.prev_outputs)
        return self.calc_fn(x, y)
The PLMetric will hook up with the training/validation/test steps.
Then I could build my metrics like:
def my_calc_fn(x, y):
    x, y = SOME_PREPROCESSING(x, y)
    return SOME_CALC(x, y)
met = PLMetric(my_calc_fn, 'epoch')
In a way, I think it makes more sense to have a metrics param in fit.
mymodel.fit(model, data_loaders, metrics=[met1, met2])
@shijianjian we are currently working on this, but we will add the aggregator together with all the metrics. We have already implemented the base class for these on the metrics branch.
closed by #1326 #1877
Hi,
I am a data scientist and came across PyTorch Lightning a couple of days ago and really LIKE it so much! I am curious about the progress of metrics in Lightning. As far as I can see from the 0.8.5 version, there are common metrics there, but it seems they were not designed to update in an online fashion and do not solve the DDP issue where the metrics have to be calculated based on y and y_pred from all batches and all GPUs. Did I miss something? Is there a plan to have them soon?
I am using it in my project, but because of the DDP issue I mentioned above, the metrics are calculated over the batches on each GPU separately instead of across all GPUs. I was hoping this could be fixed soon so I can switch completely to Lightning!
Thank you so much!
cc @justusschock
Hi @junwen-austin ,
Actually, we can only compute metrics on each GPU and sync afterwards, but after #2528 we want to introduce a per-metric sync, so that we sync already computed parts but make them behave as if we had computed them over all samples.
@justusschock @williamFalcon Thanks for the reply! I can't say enough what an incredible piece of work you guys are doing!!!!
Do you guys have an estimate when #2528 will be completed?
Also, at the moment, what is the best way of calculating a metric correctly in a multi-GPU DDP scenario, as in the example in #2528:
First machine returns (assume the sum of squared errors is 200):
sqrt(1/10 * 200) = 4.47
Second machine returns (assume the sum of squared errors is 100):
sqrt(1/10 * 100) = 3.16
After the DDP sync, the value that is returned would be
(4.47 + 3.16) / 2 = 3.815
But the correct value is:
sqrt(1/20 * 300) = 3.872
Thanks again.
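For this particular example, a workaround until per-metric sync lands is to all-reduce the raw accumulators (sum of squared errors and sample count) before taking the final square root. A rough sketch, assuming torch.distributed is already initialized by DDP; the ddp_rmse helper name is made up:

import torch
import torch.distributed as dist


def ddp_rmse(sq_err_sum, n_samples, device='cuda'):
    # sync the accumulators, not the per-GPU RMSE values
    stats = torch.tensor([float(sq_err_sum), float(n_samples)], device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    total_sq_err, total_n = stats[0], stats[1]
    return torch.sqrt(total_sq_err / total_n)

# rank 0: sq_err_sum=200, n=10; rank 1: sq_err_sum=100, n=10
# -> sqrt(300 / 20) = 3.872, not (4.47 + 3.16) / 2 = 3.815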
First machine returns (assume the sum of squared errors is 200):
sqrt(1/10 * 200) = 4.47
Second machine returns (assume the sum of squared errors is 100):
sqrt(1/10 * 100) = 3.16
After the DDP sync, the value that is returned would be
(4.47 + 3.16) / 2 = 3.815
But the correct value is:
sqrt(1/20 * 300) = 3.872
Mind sending a PR adding a test to replicate this case? It would help us to fix it...
@Borda the example above was actually taken from #2528, suggested by @justusschock :)