Pytorch-lightning: Test all metrics against sklearn (with many input trials)

Created on 27 Aug 2020  ·  18 Comments  ·  Source: PyTorchLightning/pytorch-lightning

Hand-chosen values are not enough; we need to test with a large batch of inputs where possible.
Something in this style, maybe with a fixed seed:

import torch
import sklearn.metrics
from pytorch_lightning.metrics.functional import auroc  # functional metric (PL 0.9-style import)

def test_auroc_versus_sklearn():
    torch.manual_seed(0)  # fixed seed keeps the random trials reproducible
    for _ in range(100):
        target = torch.randint(0, 2, size=(10,))
        if target.unique().numel() < 2:
            continue  # roc_auc_score is undefined when only one class is present
        pred = torch.rand(10)  # continuous prediction scores instead of hard 0/1 labels
        score_sk = sklearn.metrics.roc_auc_score(target.numpy(), pred.numpy())  # sklearn
        score_pl = auroc(pred, target)  # PyTorch Lightning
        assert torch.allclose(score_pl.float(), torch.tensor(score_sk).float())
Labels: Metrics, enhancement, good first issue, help wanted, tests / CI

Most helpful comment

Hi! I am not a contributor, but I am a user of a library called hypothesis, which may be more suitable for this specific case. This library lets the user write parameterized tests and then chooses the cases that are most likely to make the program fail; that is, it is very good at exercising edge cases and can really help to find the ones that are problematic for the implementation.

All 18 comments

maybe with a fixed seed

I don't think a seed is required here.

I would agree here that all metrics shall be deterministic, so using different seeds even increases the coverage of cases, and as long as you pass the same values to the two functions it shall always give the same result, right?

Agreed, this is one good point.

Hi! I am not a contributor, but I am a user of a library called hypothesis, which may be more suitable for this specific case. This library lets the user write parameterized tests and then chooses the cases that are most likely to make the program fail; that is, it is very good at exercising edge cases and can really help to find the ones that are problematic for the implementation.

Hi, thank you for your recommendation, just not sure that I follow it; mind writing a bit more about how you would use https://hypothesis.works in PL...

The idea would be to write a hypothesis test that checks the sklearn metrics against the PL metrics, and let the library exercise the corner cases (as well as the "common" ones), and that way assert that both implementations are concordant, without the need to design the cases by hand or search for complicated patterns.

@CamiVasz that sounds cool, mind drafting a small example of how to use it, and eventually we can extend it to more PL cases... 🐰

Here is a small example featuring the mse metric:
https://colab.research.google.com/drive/1Dprqr1nbtgCFwsyUyb6UbXe9FE7X73Q5?usp=sharing
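
For reference, a minimal sketch of what such a hypothesis-based comparison could look like (the notebook's exact contents may differ; the pytorch_lightning.metrics.functional.mse import and the value bounds are assumptions):

import numpy as np
import torch
import sklearn.metrics
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays
from pytorch_lightning.metrics.functional import mse  # assumed location of the functional MSE metric

@settings(deadline=None)  # the first metric call can be slow
@given(
    preds=arrays(np.float64, 10, elements=st.floats(-1e3, 1e3)),
    target=arrays(np.float64, 10, elements=st.floats(-1e3, 1e3)),
)
def test_mse_versus_sklearn(preds, target):
    score_sk = sklearn.metrics.mean_squared_error(target, preds)       # sklearn
    score_pl = mse(torch.from_numpy(preds), torch.from_numpy(target))  # PyTorch Lightning
    assert torch.allclose(score_pl, torch.tensor(score_sk, dtype=score_pl.dtype))

hypothesis will then try many generated input pairs per run and shrink any failure to a minimal counterexample.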

@CamiVasz looks cool, but could you tell me the difference between hypothesis and just creating two random tensors myself (using torch.randn, for example)?

Hypothesis generation is biased towards edge cases, maximizing the probability of failure. When you generate random numbers, the edge cases that you want to find have the same probability of appearing as easy cases.
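
To make that point concrete, an illustrative sketch (not from the thread): hypothesis deliberately probes boundary values and shrinks any failing input to a minimal counterexample, which uniform random sampling with torch.randn would almost never reach:

from hypothesis import given, strategies as st

@given(st.floats(allow_nan=False, allow_infinity=False))
def test_adding_one_increases_a_float(x):
    # Fails for very large floats (e.g. x = 2**53), where x + 1 rounds back to x;
    # hypothesis typically finds and shrinks such a counterexample quickly.
    assert x + 1 > x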

Just found that pytorch is also using hypothesis
https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#unit-testing

shall we open a new issue just for hypothesis testing? :]

Yeah, maybe @CamiVasz can do that and I think it would also be great to see an example of it using one of our tests, to show the motivation.

it would be great to have it as a Hacktoberfest issue :]

It would be great to work on that! Is this still on the table?

@justusschock @SkafteNicki @ananyahjha93 @teddykoker Do you guys need help with testing the new metrics? @CamiVasz wants to help.

Yeah sure. I think the whole functional API would be a good place to start.

And we could then later extend it to the revamped class interface
