Introduce a metrics package that contains easy-to-use reference implementations of metrics people care about. The package should take care of the following:
- A ConfusionMatrix metric that can use pandas for the terminal (where it will print a nicely tabulated representation) and maybe a color-coded map for notebooks and TensorBoard (see the sketch below).
- A MetricsReporter of some sort that will generate a full report of all the metrics you care about.
This would be a rather large change, and I'm not sure what is the best way to do it. This issue is really meant to spur discussion on the topic. The actual solution might require some stuff to be landed in PyTorch, and that's fine.
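To make the ConfusionMatrix bullet concrete, here is a minimal sketch of what a pandas-backed terminal rendering could look like; the ConfusionMatrix class and its update()/to_dataframe() methods are made up for illustration and are not an existing Lightning or PyTorch API:

import pandas as pd
import torch


class ConfusionMatrix:
    """Illustrative reference metric: accumulates a confusion matrix over batches."""

    def __init__(self, num_classes, class_names=None):
        self.class_names = class_names or [str(i) for i in range(num_classes)]
        self.matrix = torch.zeros(num_classes, num_classes, dtype=torch.long)

    def update(self, preds, targets):
        # preds and targets are 1-D tensors of class indices
        for t, p in zip(targets.view(-1), preds.view(-1)):
            self.matrix[t, p] += 1

    def to_dataframe(self):
        # rows are the true class, columns the predicted class
        return pd.DataFrame(self.matrix.numpy(),
                            index=self.class_names,
                            columns=self.class_names)


cm = ConfusionMatrix(num_classes=3, class_names=["cat", "dog", "bird"])
cm.update(preds=torch.tensor([0, 1, 1, 2]), targets=torch.tensor([0, 1, 2, 2]))
print(cm.to_dataframe())  # nicely tabulated representation in the terminal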
Metrics are a big component of reproducibility. They satisfy all the requirements you can think of to justify standardizing them. Even something as seemingly simple as a standard deviation hides a choice: do you divide by n or by n-1 (aka: do you use Bessel's correction)? NumPy defaults to not using it, while MATLAB and PyTorch default to using it, so it's unsurprising to see threads like this. I think Lightning should take a page from Ignite's book and add a metrics package.
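The n vs. n-1 divergence mentioned above is easy to demonstrate; the snippet below only illustrates the library defaults and is not tied to any Lightning API:

import numpy as np
import torch

x = [1.0, 2.0, 3.0, 4.0]

# NumPy defaults to the biased estimator (divide by n, i.e. ddof=0)
print(np.std(x))                                   # 1.118...

# PyTorch defaults to Bessel's correction (divide by n-1)
print(torch.std(torch.tensor(x)))                  # tensor(1.2910)

# the two agree only once the correction is made explicit
print(np.std(x, ddof=1))                           # 1.291...
print(torch.std(torch.tensor(x), unbiased=False))  # tensor(1.1180)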
I also propose that Lightning take care of the validation_step() and validation_end() steps for each metric, so that they can be computed efficiently per-batch. In this proposal, the API for the LightningModule will be simplified significantly. For example, something like this:
class CoolSystem(pl.LightningModule):

    def __init__(self):
        super(CoolSystem, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx, training_metrics):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        return {'loss': loss}

    def configure_metrics(self):
        # OPTIONAL
        training_metrics = MetricsLog([Accuracy(), Loss()])
        eval_metrics = MetricsLog([F1Score('micro'), F1Score('macro'), F1Score(None), Accuracy(), PrecisionAtRecall(80)])
        return {'train_metrics': training_metrics, 'eval_metrics': eval_metrics}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)
If we find a way to have all metrics computation code done in PyTorch, even for DDP, it would be highly preferable I think. I just don't know if it's possible - maybe if we formulate metrics as a layer of sorts we might be able to do that? All standard layers have state that persists and gets updated across batches (their weights :D ), so maybe we can implement metrics as a sort of nn.Module?
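To sketch the metric-as-module idea (this is only an illustration assuming vanilla PyTorch; the Accuracy class and the update()/compute-on-forward split are made up, not an existing API):

import torch
import torch.nn as nn


class Accuracy(nn.Module):
    """Illustrative metric-as-module: state lives in buffers, like weights do."""

    def __init__(self):
        super().__init__()
        # buffers persist across batches, move with .to(device), and land in state_dict()
        self.register_buffer("correct", torch.tensor(0.0))
        self.register_buffer("total", torch.tensor(0.0))

    def update(self, y_hat, y):
        preds = y_hat.argmax(dim=-1)
        self.correct += (preds == y).sum().float()
        self.total += float(y.numel())

    def forward(self):
        # the compute() step: derive the metric value from the accumulated state
        return self.correct / self.total.clamp(min=1.0)

    def reset(self):
        self.correct.zero_()
        self.total.zero_()

Because the state sits in buffers, it could in principle also be reduced across processes for DDP, which is exactly the hard part discussed later in this thread.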
There is a separate discussion about providing the underlying muscle for this directly in torch (see https://github.com/pytorch/pytorch/issues/22439).
Love it!
@srush thoughts?
So I like the API and agree this would be nice to factor out. However, I don't think that Lightning should supply the metrics. Outside of classification, this is a really tricky thing to get right and often requires lots of domain engineering. Just as with the dataloaders, you don't want to own them, but facilitate a super clean API.
Random thought: why even specify CrossEntropy in training? Isn't that just another metric?
Hi,
I may contribute WER because I am interested in using it.
There is a reference implementation for PyTorch (https://github.com/1ytic/pytorch-edit-distance); I can also easily implement it in numpy or plain Python.
However, I have a few general questions based on the discussion above:
My needs vary:
- edit_distance per batch and numwords per batch
- just sum(edit_distances) / sum(numwords) over the whole dataset
I wanted to demonstrate that WER computation differs (slightly) between a batch and the dataset.
My question:
Based on my understanding, pytorch-lightning does not support events, i.e. end_of_epoch, every_N_iterations, end_of_training, etc.
Does this proposal (or any other feature in PyTorch Lightning) support triggering metrics
1. at different occasions (start_epoch, end_epoch, every 4000 iterations, etc.)
2. on different objects (training example, batch, dataset (loader))?
So I like the API and agree this would be nice to factor out. However, I don't think that Lightning should supply the metrics. Outside of classification, this is a really tricky thing to get right and often requires lots of domain engineering. Just as with the dataloaders, you don't want to own them, but facilitate a super clean API.
I agree with the extensibility principle, but I'd also like each community to have their own standardized metrics to help with reproducibility in research (maybe Lightning supplies the classification ones as an example, and then maybe the image community will build the ones for the tasks they care about, e.g. segmentation, pose estimation, etc. Same for the speech community, which may build WER and so on). I think it's fine if they live in submodules, or even externally, as long as the discoverability problem is tackled. I like submodules more because they are more "official" and this pushes people to collaborate on one reference implementation rather than have too many.
Random thought: why even specify CrossEntropy in training? Isn't that just another metric?
Fair enough :) Yep, it's definitely another metric.
Sure, I think we agree. Like with the dataloaders: torch-text (https://github.com/pytorch/text) is not a submodule, right? It's an external package.
I agree with the extensibility principle, but I'd also like each community to have their own standardized metrics to help with reproducibility in research (maybe Lightning supplies the classification ones as an example, and then maybe the image community will build the ones for the tasks they care about, e.g. segmentation, pose estimation, etc. Same for the speech community, which may build WER and so on).
I am against this for exactly that reason! There are a million implementations of BLEU in NLP exactly because everyone wants to standardize it to "their" version.
Ignite has convinced me that the computation part (DDP included) could be done at the Torch layer (more info in this comment there). Even if people want to reimplement their own BLEU, we'd probably want to land a reference implementation in TorchText. You are free to avoid it if you don't like it, but it's also a pain to write your own, so I think you'd do it only if you either don't trust its code, or it's not flexible enough for you, or it's not fast enough/uses too much memory. I'm willing to bet reimplementations will stop if something better is there. Isn't the whole point of Lightning to stop people from writing their own training loops? :D I see it as very much the same thing.
Assuming the PyTorch API in the proposal happens, then maybe there is no API change that Lightning would have to undergo: just write your own update() and compute() calls in the validation_step() and validation_end() methods and pass the results to the dicts that attach to TensorBoard? That would take care of the pretty printing, and the boilerplate would seem rather minimal:
class CoolSystem(pl.LightningModule):

    def __init__(self):
        super(CoolSystem, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.train_macro_f1 = nn.F1Score("macro")
        self.validation_macro_f1 = nn.F1Score("macro")

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        self.train_macro_f1.update(y_hat, y)  # void method, just updates state
        m = self.train_macro_f1()  # forward calls compute(), takes no args; computes from state
        tensorboard_logs = {'train_loss': loss, 'train_macro_f1': m}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        self.validation_macro_f1.update(y_hat, y)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss, 'val_macro_f1': self.validation_macro_f1()}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)
There is some boilerplate, but it's not the end of the world.
I would not add another object which would hold metric values; it seems like duplicating logger functionality... What about defining just a list of metrics to be computed (functions or callable classes), so that these metrics will be automatically added to the output dictionary?
cc: @PyTorchLightning/core-contributors ^^ opinion about all above?
I like the way Ignite's metrics expose their API, as previously discussed.
Opinion:
Organizing the evaluation of the metric at the right moments is often a bigger issue than implementing the metric itself.
Motivation for Solution:
Metrics "Evaluation" is in a similar situation as "a training loop" before PL.
I suggest mimic the approach of how a LightningModule simplified the training loop.
I really like about PL how PL breaks the messy training loop and defines "callback/event" mechanism in the LightningModule.
Solution: EvaluationPlan (pseudo) implementation
class EvaluationPlan():

    def __init__(self, **kwargs):
        pass  # todo think it through

    def setup_train_epoch(self, trainer, module):
        pass  # e.g. init metric for evaluating the epoch

    def setup_train_batch(self, trainer, module):
        pass  # e.g. init metric for collecting data during the training epoch

    def eval_train_epoch(self, trainer, module):
        pass  # compute final metric value for the epoch and log it

    # ...
    # more useful events
    # ...

    def setup_training(self, trainer, module):
        pass  # init metrics for the whole training - collect data in any method above

    def eval_training(self, trainer, module):
        pass  # compute & log metric for the whole training

    # ...
    # every supported event needs to be implemented in a base class
    # ...
@oplatek do you want to do it in a form like a callback?
@oplatek do you want to do it in a form like a callback?
@Borda Yes. The intended use is like this: (I renamed LightningEvaluation -> EvaluationPlan)
trainer = Trainer(eval_plan=EvaluationPlan())
trainer.fit(model)
We should provide a default EvaluationPlan which accepts any number of metrics and averages each metric:
# I can imagine implementation where these definitions will be identical
trainer = Trainer()
trainer = Trainer(eval_plan=BatchEpochPlan())
trainer = Trainer(eval_plan=BatchEpochPlan(metrics=['loss']))
trainer = Trainer(eval_plan=BatchEpochPlan(metrics=[loss]))
trainer = Trainer(eval_plan=BatchEpochPlan(train_metrics=[loss], valid_metrics=[loss], test_metrics=[loss]))
trainer = Trainer(metrics=['loss'])
trainer = Trainer(metrics=[loss])
# ^ and ^^ will require the EpochPlan base class to have a constructor:
# def __init__(self, metrics=None, ..., **kwargs)
What should BatchEpochPlan do? (a rough sketch follows this list)
-> Set up each metric at the beginning of every epoch: loss = Loss(..)
-> After each batch, update the metric: loss.update(y, y_hat)
-> At epoch end, compute the metric: loss()
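A rough sketch of such a plan, building on the pseudo EvaluationPlan above; the hook names, the eval_train_batch signature, the idea of passing metric factories, and the Loss()/update()/__call__() interface are all assumptions of this thread rather than an existing API:

class BatchEpochPlan(EvaluationPlan):
    """Hypothetical default plan: set up per epoch, update per batch, compute at epoch end."""

    def __init__(self, metrics=None, **kwargs):
        super().__init__(**kwargs)
        # metrics are factories, e.g. [Loss, Accuracy], so fresh state can be created each epoch
        self.metric_factories = metrics or []
        self.metrics = []

    def setup_train_epoch(self, trainer, module):
        # set up each metric at the beginning of every epoch: loss = Loss(..)
        self.metrics = [factory() for factory in self.metric_factories]

    def eval_train_batch(self, trainer, module, y_hat, y):
        # after each batch, update the metric: loss.update(y, y_hat)
        for metric in self.metrics:
            metric.update(y, y_hat)

    def eval_train_epoch(self, trainer, module):
        # at epoch end, compute the metric: loss()
        return {type(m).__name__.lower(): m() for m in self.metrics}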
What about defining just a list of metrics to be computed (functions or callable classes), so that these metrics will be automatically added to the output dictionary?
I absolutely agree, every eval_* method in EvaluationPlan should update the output dictionary (the output dictionaries for batch, validation_epoch_end, etc.).
I agree that there should be a metrics package, but I would differentiate between the metric itself and its integration into the trainer/module.
I like the approach of an EvaluationPlan that can be used for arbitrary metrics, but IMO the metrics should be standalone (as functions or torch.nn.Module). This would enable the user to also use the metric directly within the LightningModule, without having to go through the EvaluationPlan.
I see two main advantages in that case:
1.) We can postpone the EvaluationPlan implementation if necessary
2.) We will pretty soon be able to provide a standardised implementation for many metrics (which increases reproducibility), that can already be used by the user in the module, while we work on a more elegant way to use them (like the EvaluationPlan).
First of all, congratulations for this amazing project. I love Lightning.
Now, when it comes to new features, I agree with @justusschock: while I'm pretty new to Lightning, the main thing I'm missing is metrics. It would be especially helpful in more complex cases, with multiple metrics, etc.
@WSzP @Darktex thanks for the awesome feedback! @justusschock will be leading the implementation here. I suspect we can learn a lot from what others have done before and take a stab at this with those lessons learned.
I do agree with @srush that some of the metrics are very domain specific, but we can take that into consideration. For some metrics like BLEU, etc., we should make sure to point to the reference paper for implementation correctness. But the few times I had to calculate BLEU made me depressed for weeks, haha, because of all the hoops I had to jump through.
There's a related issue #1256 where I proposed that we add a key for metrics in what is returned from the training_step and validation_step (and then discourage use of arbitrary keys). We could use the metrics (defined to be standalone as functions or torch.nn.Module, as @justusschock mentions) and then either report them for each batch or aggregate at the end of an epoch. Essentially, this makes it easy to capture and access metrics while leaving the user in control of how the details are implemented.
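Sketched concretely (the dedicated 'metrics' key is the #1256 proposal, not current behavior; the accuracy computation is only an example):

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (y_hat.argmax(dim=-1) == y).float().mean()
    # proposed: one well-known 'metrics' key instead of arbitrary top-level keys
    return {'loss': loss, 'metrics': {'train_acc': acc, 'train_loss': loss}}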
I like the approach of an EvaluationPlan that can be used for arbitrary metrics, but IMO the metrics should be standalone (as functions or torch.nn.Module). This would enable the user to also use the metric directly within the LightningModule, without having to go through the EvaluationPlan.
@justusschock Could we do a similar wrapping as we did for loggers, where we have LightningLoggerBase and then LoggerCollection which would encapsulate the particular metrics, so that at usage time you do not bother much whether you use a single metric or a whole set...?
I like the idea, but I'm not sure we need this here...
Let's say we have the metrics all defined as standalones; we could probably get away with only a collection-like thing that has something like the following init:
class EvaluationPlan:
    def __init__(self, *metrics: Union[Callable, torch.nn.Module], **kwargs):
        ...
This would then work with any number of metrics.
I would love to have a metrics module. But I think it would be best to have a Metrics abstract class first, which is basically an aggregator of training outputs and validation outputs. Then I can use sklearn or ignite as the computation tool. Here is what I am thinking.
class PLMetric():

    def __init__(self, calc_fn, mode='batch'):
        # mode is either 'batch' or 'epoch'.
        # This controls what the inputs x and y are after every batch:
        # if 'batch', x, y hold the output from the current step only;
        # if 'epoch', x, y accumulate all previous outputs in the current epoch.
        self.calc_fn = calc_fn
        self.mode = mode
        self.prev_outputs = []

    def calc(self, x, y):
        if self.mode == 'epoch':
            self.prev_outputs.append([x, y])
            return self.calc_fn(self.prev_outputs)
        return self.calc_fn(x, y)
The PLMetric will hook up with the training/validation/test steps.
Then I could build my metrics like:
def my_calc_fn(x, y):
    x, y = SOME_PREPROCESSING(x, y)
    return SOME_CALC(x, y)
met = PLMetric(my_calc_fn, 'epoch')
In a way, I think it makes more sense to have a metrics param in fit.
mymodel.fit(model, data_loaders, metrics=[met1, met2])
@shijianjian we are currently working on this, but we will add the aggregator together with all the metrics. We have already implemented the base class for these on the metrics branch.
closed by #1326 #1877
Hi,
I am a data scientist and came across PyTorch Lightning a couple of days ago and really LIKE it so much! I am curious about the progress of metrics in Lightning. As far as I can see from the 0.8.5 version, there are common metrics there, but it seems they were not designed to update in an online fashion and do not solve the DDP issue where the metrics have to be calculated based on y and y_pred from all batches and all GPUs. Did I miss something? Is there a plan to have them soon?
I am using it in my project, but because of the DDP issue I mentioned above, the metrics are calculated over the batches on each GPU separately instead of across all GPUs. I was hoping this could be fixed soon so I can switch completely to Lightning!
Thank you so much!
cc @justusschock
Hi @junwen-austin ,
Actually, we can only compute metrics on each GPU and sync afterwards, but after #2528 we want to introduce a per-metric sync, so that we sync already computed parts but make them behave as if we had computed them over all samples.
@justusschock @williamFalcon Thanks for the reply! I can't say enough what an incredible piece of work you guys are doing!!!!
Do you guys have an estimate when #2528 will be completed?
Also, at the moment, what is the best way of calculating a metric correctly in a multi-GPU DDP scenario, as in the example in #2528:
First machine returns (assume the sum of squared errors is 200):
sqrt(1/10 * 200) = 4.47
Second machine returns (assume the sum of squared errors is 100):
sqrt(1/10 * 100) = 3.16
After the DDP sync, the value that is returned would be
(4.47 + 3.16) / 2 = 3.815
But the correct value is:
sqrt(1/20 * 300) = 3.872
Thanks again.
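For this particular example, a workaround until per-metric sync lands is to all-reduce the raw accumulators (sum of squared errors and sample count) before taking the final square root. A rough sketch, assuming torch.distributed is already initialized by DDP; the ddp_rmse helper name is made up:

import torch
import torch.distributed as dist


def ddp_rmse(sq_err_sum, n_samples, device='cuda'):
    # sync the accumulators, not the per-GPU RMSE values
    stats = torch.tensor([float(sq_err_sum), float(n_samples)], device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    total_sq_err, total_n = stats[0], stats[1]
    return torch.sqrt(total_sq_err / total_n)

# rank 0: sq_err_sum=200, n=10; rank 1: sq_err_sum=100, n=10
# -> sqrt(300 / 20) = 3.872, not (4.47 + 3.16) / 2 = 3.815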
First machine returns (assume the sum of squared errors is 200):
sqrt(1/10 * 200) = 4.47
Second machine returns (assume the sum of squared errors is 100):
sqrt(1/10 * 100) = 3.16
After the DDP sync, the value that is returned would be
(4.47 + 3.16) / 2 = 3.815
But the correct value is:
sqrt(1/20 * 300) = 3.872
Mind sending a PR adding a test to replicate this case? It would help us to fix it...
@Borda the example above was actually taken from #2528, suggested by @justusschock :)