Pytorch-lightning: [Discussion] What Metrics do we want?

Created on 29 Mar 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

As discussed in #973 , we will probably start by implementing metrics as standalones.

This issue aims to discuss what metrics we need and how we can organize them in a package structure.

Suggestions welcome.

My initial thought was to have a metrics package with subpackages for each research area like vision, text, audio etc.
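A rough sketch of that layout (purely illustrative; none of these names are settled):

```
pytorch_lightning/
└── metrics/
    ├── __init__.py   # general-purpose metrics could live here
    ├── vision/
    ├── text/
    └── audio/
```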

CC @srush @Borda @williamFalcon @Darktex

As a start:
For vision I'd like to have the following:

  • [ ] accuracy
  • [ ] precision
  • [ ] recall
  • [ ] f1 score
  • [ ] roc
  • [ ] auc
  • [ ] dice coefficient
  • [ ] panoptic quality
  • [ ] IOU
  • [ ] SSIM
Labels: discussion, enhancement, help wanted

All 15 comments

Hi! Thanks for your contribution, great first issue!

I like the structure...

Dividing only into research areas would mean duplication of some metrics; for example, accuracy is used in more or less all fields. I think it would be better to mainly divide into a regression and a classification subpackage, depending on whether the targets are continuous or discrete. Specific metrics (like BLEU in NLP) could live in research-specific subpackages.

I agree, and these metrics (like accuracy) would not fall into any of the research-specific subpackages but remain in the base package.

I don't want to divide them into regression and classification and also have subpackages for all the research areas, as it may become non-trivial to figure out where to find a desired metric.
Another thing we could consider is not having subpackages at all, but just one metrics package containing them all (just like torch.nn).

Dividing only into research areas would mean duplication of some metrics; for example, accuracy is used in more or less all fields. I think it would be better to mainly divide into a regression and a classification subpackage, depending on whether the targets are continuous or discrete. Specific metrics (like BLEU in NLP) could live in research-specific subpackages.

I would rather avoid deep metric structures; one level is enough... So we can have general-purpose metrics like accuracy and the domain-specific ones =)
And then make them all importable from the root metrics init...
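A minimal sketch of that idea, assuming a hypothetical flat layout (all module and class names here are placeholders, not an agreed-upon API):

```python
# pytorch_lightning/metrics/__init__.py  (hypothetical)
# One level of modules, everything re-exported from the root:
from pytorch_lightning.metrics.classification import Accuracy, F1  # general purpose
from pytorch_lightning.metrics.regression import MSE, MAE          # general purpose
from pytorch_lightning.metrics.nlp import BLEU                     # domain-specific
```

That way users always write `from pytorch_lightning.metrics import Accuracy`, no matter which submodule a metric is defined in.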

CV: panoptic quality and IOU
Augmentation: affinity and diversity

Some metrics may even be dataset-specific, e.g., the F1 score for SQuAD (there is some preprocessing and there are special rules involved). For these kinds of less general metrics, I think there should be a base Metric class for people to inherit from and create their own.

For some reference, this is how I implement mine, and this is from PyTorch Ignite.

Also, should losses be considered as some type of metrics?

@haotongye
I would say that we shouldn't include dataset-specific metrics here.
But I agree, we should have a base Metric class (probably just a torch.nn.Module with some extras). This will, however, be hard for the functional interface.
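To make this concrete, a minimal sketch of such a base class, assuming it really is just a torch.nn.Module with some extras (the `name` attribute and the `Accuracy` example are assumptions, not a settled design):

```python
import torch


class Metric(torch.nn.Module):
    """Hypothetical base class: forward() takes predictions and targets
    and returns the metric value as a tensor."""

    def __init__(self, name: str):
        super().__init__()
        self.name = name

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class Accuracy(Metric):
    """Example subclass; assumes pred contains class indices like target."""

    def __init__(self):
        super().__init__(name="accuracy")

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (pred == target).float().mean()
```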

For now, I wouldn't include losses, as this would really broaden the scope. Maybe we can do this afterwards in a separate effort.

@seandatasci Can you link a paper or reference implementation for the affinity and diversity part? AFAIK there are several ways to calculate these...

Some requested metrics:

  • confusion matrix
  • f1
  • AUC/ROC
  • rouge
  • bleu

Metrics for continuous output (a quick sketch follows this list):

  • mean squared error (MSE) / root mean squared error (RMSE)
  • mean absolute error (MAE)
  • root mean squared logarithmic error (RMSLE)
  • max error
  • cosine similarity
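Most of these reduce to one-liners in plain PyTorch; a sketch of the first few (function names are placeholders):

```python
import torch


def mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean((pred - target) ** 2)


def rmse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(mse(pred, target))


def mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean(torch.abs(pred - target))


def rmsle(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # log1p assumes non-negative inputs and avoids log(0)
    return torch.sqrt(torch.mean((torch.log1p(pred) - torch.log1p(target)) ** 2))
```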

would be nice to have:

  • R2 score (coefficient of determination)
  • Correlation (Pearson/Spearman)
  • Explained variance score

However, as far as I know the last three require access to the full list of targets and predictions at once, so they can only be used for smaller datasets.
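One way to support these anyway is an accumulating metric that buffers predictions across batches and computes only at the end; a sketch of the idea for the R2 score (the `update`/`compute` API is an assumption, not an existing interface):

```python
import torch


class R2Score:
    """Hypothetical accumulating metric: buffers everything, computes once."""

    def __init__(self):
        self._preds, self._targets = [], []

    def update(self, pred: torch.Tensor, target: torch.Tensor) -> None:
        self._preds.append(pred.detach())
        self._targets.append(target.detach())

    def compute(self) -> torch.Tensor:
        pred = torch.cat(self._preds)
        target = torch.cat(self._targets)
        ss_res = torch.sum((target - pred) ** 2)
        ss_tot = torch.sum((target - target.mean()) ** 2)
        return 1 - ss_res / ss_tot
```

The buffering is exactly why these only scale to smaller datasets, as noted above.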

  • As I mentioned in the tweet, this repo could be integrated directly for NLG (?)

  • If we also plan to support vision-and-language tasks such as VQA/VisDial etc., which are mostly posed as discriminative tasks, R@{1,5,10} / MRR / NDCG could be used as well. One nice implementation by Pythia here.

Let me know if I can help! Thanks.

@shubhamagarwal92 We will probably have to adjust the metrics for NLG according to our upcoming metrics interface, but other than that it should be fine. If you want to, you can take this once we have our interface running (probably tomorrow).

Let me know if I can help! Thanks.

Help is always welcome =)

