Pytorch-lightning: Test all metrics against sklearn (with many input trials)

Created on 27 Aug 2020  ·  18 Comments  ·  Source: PyTorchLightning/pytorch-lightning

Hand-chosen values are not enough; we need to test with a large batch of inputs where possible.
Something in this style, maybe with a fixed seed:

import torch
import sklearn.metrics
from pytorch_lightning.metrics.functional import auroc  # functional metric (PL 0.9-style import)

def test_auroc_versus_sklearn():
    torch.manual_seed(0)  # fixed seed keeps the random trials reproducible
    for _ in range(100):
        target = torch.randint(0, 2, size=(10,))
        if target.unique().numel() < 2:
            continue  # roc_auc_score is undefined when only one class is present
        pred = torch.rand(10)  # continuous prediction scores instead of hard 0/1 labels
        score_sk = sklearn.metrics.roc_auc_score(target.numpy(), pred.numpy())  # sklearn
        score_pl = auroc(pred, target)  # PyTorch Lightning
        assert torch.allclose(score_pl.float(), torch.tensor(score_sk).float())
Labels: Metrics, enhancement, good first issue, help wanted, tests / CI

Most helpful comment

Hi! I am not a contributor, but I am a user of a library called hypothesis, which may be more suitable for this specific case. This library lets the user write parameterized tests and then chooses the cases that are most likely to make the program fail; that is, it is very good at exercising edge cases and can really help to find the ones that are problematic for the implementation.

All 18 comments

maybe with a fixed seed

I don't think a seed is required here.

I would agree here that all metrics shall be deterministic, so using different seeds even increases the coverage of cases, and as long as you pass the same values to the two functions it shall always give the same result, right?

Agreed, this is one good point.

Hi! I am not a contributor, but I am a user of a library called hypothesis, which may be more suitable for this specific case. This library lets the user write parameterized tests and then chooses the cases that are most likely to make the program fail; that is, it is very good at exercising edge cases and can really help to find the ones that are problematic for the implementation.

Hi, thank you for your recommendation, just not sure that I follow it; mind writing a bit more about how you would use https://hypothesis.works in PL...

The idea would be to write a hypothesis test that checks the sklearn metrics against the PL metrics, and let the library exercise the corner cases (as well as the "common" ones), and that way assert that both implementations are concordant, without the need to design the cases by hand or search for complicated patterns.

@CamiVasz that sounds cool, mind drafting a small example of how to use it, and eventually we can extend it to more PL cases... 🐰

Here is a small example featuring the mse metric:
https://colab.research.google.com/drive/1Dprqr1nbtgCFwsyUyb6UbXe9FE7X73Q5?usp=sharing
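
For reference, a minimal sketch of what such a hypothesis-based comparison could look like (the notebook's exact contents may differ; the pytorch_lightning.metrics.functional.mse import and the value bounds are assumptions):

import numpy as np
import torch
import sklearn.metrics
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays
from pytorch_lightning.metrics.functional import mse  # assumed location of the functional MSE metric

@settings(deadline=None)  # the first metric call can be slow
@given(
    preds=arrays(np.float64, 10, elements=st.floats(-1e3, 1e3)),
    target=arrays(np.float64, 10, elements=st.floats(-1e3, 1e3)),
)
def test_mse_versus_sklearn(preds, target):
    score_sk = sklearn.metrics.mean_squared_error(target, preds)       # sklearn
    score_pl = mse(torch.from_numpy(preds), torch.from_numpy(target))  # PyTorch Lightning
    assert torch.allclose(score_pl, torch.tensor(score_sk, dtype=score_pl.dtype))

hypothesis will then try many generated input pairs per run and shrink any failure to a minimal counterexample.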

@CamiVasz looks cool, but could you tell me the difference between hypothesis and just creating two random tensors myself (using torch.randn, for example)?

Hypothesis generation is biased towards edge cases, maximizing the probability of failure. When you generate random numbers, the edge cases that you want to find have the same probability of appearing as easy cases.
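
To make that point concrete, an illustrative sketch (not from the thread): hypothesis deliberately probes boundary values and shrinks any failing input to a minimal counterexample, which uniform random sampling with torch.randn would almost never reach:

from hypothesis import given, strategies as st

@given(st.floats(allow_nan=False, allow_infinity=False))
def test_adding_one_increases_a_float(x):
    # Fails for very large floats (e.g. x = 2**53), where x + 1 rounds back to x;
    # hypothesis typically finds and shrinks such a counterexample quickly.
    assert x + 1 > x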

Just found that pytorch is also using hypothesis
https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#unit-testing

shall we open a new issue just for hypothesis testing? :]

Yeah, maybe @CamiVasz can do that and I think it would also be great to see an example of it using one of our tests, to show the motivation.

it would be great to have it as a Hacktoberfest issue :]

It would be great to work on that! Is this still on the table?

@justusschock @SkafteNicki @ananyahjha93 @teddykoker Do you guys need help with testing the new metrics? @CamiVasz wants to help.

Yeah sure. I think the whole functional API would be a good place to start.

And we could then later extend it to the revamped class interface
