Pytorch-lightning: [Discussion] What Metrics do we want?

Created on 29 Mar 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

As discussed in #973 , we will probably start by implementing metrics as standalones.

This issue aims to discuss what metrics we need and how we can organize them in a package structure.

Suggestions welcome.

My initial thought was to have a metrics package with subpackages for each research area like vision, text, audio etc.
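A rough sketch of that layout (purely illustrative; none of these names are settled):

```
pytorch_lightning/
└── metrics/
    ├── __init__.py   # general-purpose metrics could live here
    ├── vision/
    ├── text/
    └── audio/
```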

CC @srush @Borda @williamFalcon @Darktex

As a start:
For vision I'd like to have the following:

  • [ ] accuracy
  • [ ] precision
  • [ ] recall
  • [ ] f1 score
  • [ ] roc
  • [ ] auc
  • [ ] dice coefficient
  • [ ] panoptic quality
  • [ ] IOU
  • [ ] SSIM
Labels: discussion, enhancement, help wanted

All 15 comments

Hi! Thanks for your contribution, great first issue!

I like the structure...

Dividing only into research areas would mean duplication of some metrics; for example, accuracy is used in more or less all fields. I think it would be better to mainly divide into a regression and a classification subpackage, depending on whether the targets are continuous or discrete. Specific metrics (like BLEU in NLP) could live in research-specific subpackages.

I agree, and these metrics (like accuracy) would not fall into any of the research-specific subpackages but remain in the base package.

I don't want to divide them into regression and classification and also have subpackages for all the research areas, as it may become non-trivial to figure out where to find a desired metric.
Another thing we could consider is not having subpackages at all, but just one metrics package containing them all (just like torch.nn).

Dividing only into research areas would mean duplication of some metrics; for example, accuracy is used in more or less all fields. I think it would be better to mainly divide into a regression and a classification subpackage, depending on whether the targets are continuous or discrete. Specific metrics (like BLEU in NLP) could live in research-specific subpackages.

I would rather avoid deep metric structures; one level is enough... So we can have general-purpose metrics like accuracy and the domain-specific ones =)
And then make them all importable from the root metrics init...
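A minimal sketch of that idea, assuming a hypothetical flat layout (all module and class names here are placeholders, not an agreed-upon API):

```python
# pytorch_lightning/metrics/__init__.py  (hypothetical)
# One level of modules, everything re-exported from the root:
from pytorch_lightning.metrics.classification import Accuracy, F1  # general purpose
from pytorch_lightning.metrics.regression import MSE, MAE          # general purpose
from pytorch_lightning.metrics.nlp import BLEU                     # domain-specific
```

That way users always write `from pytorch_lightning.metrics import Accuracy`, no matter which submodule a metric is defined in.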

CV: panoptic quality and IOU
Augmentation: affinity and diversity

Some metrics may even be dataset-specific, e.g., the F1 score for SQuAD (there is some preprocessing and there are special rules involved). For these kinds of less general metrics, I think there should be a base Metric class for people to inherit from and create their own.

For some reference, this is how I implement mine, and this is from PyTorch Ignite.

Also, should losses be considered as some type of metrics?

@haotongye
I would say that we shouldn't include dataset-specific metrics here.
But I agree, we should have a base Metric class (probably just a torch.nn.Module with some extras). This will, however, be hard for the functional interface.
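To make this concrete, a minimal sketch of such a base class, assuming it really is just a torch.nn.Module with some extras (the `name` attribute and the `Accuracy` example are assumptions, not a settled design):

```python
import torch


class Metric(torch.nn.Module):
    """Hypothetical base class: forward() takes predictions and targets
    and returns the metric value as a tensor."""

    def __init__(self, name: str):
        super().__init__()
        self.name = name

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class Accuracy(Metric):
    """Example subclass; assumes pred contains class indices like target."""

    def __init__(self):
        super().__init__(name="accuracy")

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (pred == target).float().mean()
```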

For now, I wouldn't include losses, as this would really broaden the scope. Maybe we can do this afterwards in a separate effort.

@seandatasci Can you link a paper or reference implementation for the affinity and diversity part? AFAIK there are several ways to calculate these...

Some requested metrics:

  • confusion matrix
  • f1
  • AUC/ROC
  • rouge
  • bleu

Metrics for continuous output (a quick sketch follows this list):

  • mean squared error (MSE) / root mean squared error (RMSE)
  • mean absolute error (MAE)
  • root mean squared logarithmic error (RMSLE)
  • max error
  • cosine similarity
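Most of these reduce to one-liners in plain PyTorch; a sketch of the first few (function names are placeholders):

```python
import torch


def mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean((pred - target) ** 2)


def rmse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(mse(pred, target))


def mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.mean(torch.abs(pred - target))


def rmsle(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # log1p assumes non-negative inputs and avoids log(0)
    return torch.sqrt(torch.mean((torch.log1p(pred) - torch.log1p(target)) ** 2))
```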

would be nice to have:

  • R2 score (coefficient of determination)
  • Correlation (Pearson/Spearman)
  • Explained variance score

However, as far as I know the last three require access to the full list of targets and predictions at once, so they can only be used for smaller datasets.
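One way to support these anyway is an accumulating metric that buffers predictions across batches and computes only at the end; a sketch of the idea for the R2 score (the `update`/`compute` API is an assumption, not an existing interface):

```python
import torch


class R2Score:
    """Hypothetical accumulating metric: buffers everything, computes once."""

    def __init__(self):
        self._preds, self._targets = [], []

    def update(self, pred: torch.Tensor, target: torch.Tensor) -> None:
        self._preds.append(pred.detach())
        self._targets.append(target.detach())

    def compute(self) -> torch.Tensor:
        pred = torch.cat(self._preds)
        target = torch.cat(self._targets)
        ss_res = torch.sum((target - pred) ** 2)
        ss_tot = torch.sum((target - target.mean()) ** 2)
        return 1 - ss_res / ss_tot
```

The buffering is exactly why these only scale to smaller datasets, as noted above.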

  • As I mentioned in the tweet, this repo could be integrated directly for NLG (?)

  • If we also plan to support vision-and-language tasks such as VQA/VisDial etc., which are mostly posed as discriminative tasks, R@{1,5,10} / MRR / NDCG could be used as well. One nice implementation by Pythia here.

Let me know if I can help! Thanks.

@shubhamagarwal92 We will probably have to adjust the metrics for NLG according to our upcoming metrics interface, but other than that it should be fine. If you want to, you can take this once we have our interface running (probably tomorrow).

Let me know if I can help! Thanks.

Help is always welcome =)

