Astropy: Adding property-based tests for Astropy using Hypothesis

Created on 18 Jul 2019  ·  21 Comments  ·  Source: astropy/astropy

At SciPy 2019, I did a talk and tutorial on Hypothesis, and at the sprints I was encouraged to open an issue or PR showing how it could be applied to Astropy. So here it is!

(and for SciPy 2020 I wrote a paper on property-based testing for science, with principles, examples, and advice)

Traditional unit tests check that, for some specific (usually simple) input, you get the expected output. With Hypothesis, your tests instead check that some property holds for any valid input: you describe the domain of valid inputs, and Hypothesis generates examples from it. (See the docs for details - we also spend a lot of time making tests non-flaky.)

For example, the following test demonstrates a variety of ways that floating-point imprecision can ruin your day when working with units. This is unlikely to happen in practice, but careful use of Fraction could prevent it entirely...

from hypothesis import given, strategies as st
import astropy.units as u

# Sort the registry contents so the candidate list is deterministic between runs.
UNITS = sorted(u.get_current_unit_registry().all_units, key=repr)

@given(st.lists(st.sampled_from(UNITS), min_size=3))
def test_intermediate_decomposition_is_no_op(lst):
    # Building up a compound unit factor by factor, decomposing at every
    # intermediate step should agree with decomposing only at the end.
    decomp = compound = u.dimensionless_unscaled
    for unit in lst:
        decomp = (decomp * unit).decompose()
        compound *= unit
        assert decomp == compound.decompose(), (decomp, compound)

Other properties that were suggested:

  • [ ] Check that round-tripping from image to sky coordinates and back is lossless for distortion-free mappings, and otherwise always below 10^-5 px.
  • [x] Take a moment in time, round-trip it through various frames, check it hasn't changed or lost precision. See #9532 and #10373.
  • [ ] Testing that IO routines losslessly round-trip data that they are expected to handle
  • [ ] Check that any optimised routines calculate the same result as unoptimised, within tolerances
  • [ ] Check broadcasting behaviour of custom ufuncs with mutually_broadcastable_shapes() (a minimal sketch follows this list)
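To make the last bullet concrete, here is a minimal sketch of what such a broadcasting check could look like; np.add stands in for a hypothetical custom ufunc, and the test name is made up.

import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import mutually_broadcastable_shapes

@given(mutually_broadcastable_shapes(num_shapes=2))
def test_ufunc_broadcasts_like_numpy(shapes):
    # Hypothesis provides a pair of mutually broadcastable input shapes
    # together with the shape numpy says the result should have.
    (a_shape, b_shape), expected = shapes.input_shapes, shapes.result_shape
    # Swap np.add for the custom ufunc under test.
    out = np.add(np.ones(a_shape), np.ones(b_shape))
    assert out.shape == expected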

pytest-astropy pulls in Hypothesis automatically now, and Astropy has everything configured, so it's just a matter of writing some more tests! While I'm not an astrophysicist, I remain willing to review and otherwise help out with any property-based testing PRs for open science projects - just ping me :grin:

Labels: Refactoring, needs-discussion, testing


All 21 comments

Hello and thank you for all these very useful and practical details.

I heard a lot of good things about Hypothesis at PyCon 2019 too. However, I do worry about how much longer CI would take with this switch. I think Hypothesis is great for regression tests in a nightly/weekly cron, but I am not 100% convinced for a CI that runs on every push event. Please enlighten me. Thanks again!

Thanks @pllim! There are a couple of responses to performance concerns:

  • First, I don't suggest using Hypothesis for everything - we don't do that even for Hypothesis itself. So long as it's used for only a small number of high-value tests, I think the extra time will be negligible.
  • If or when performance becomes a concern, I'd use the settings profiles functionality. This allows the number of examples per test to be decreased from the default of 100 to e.g. 10 (or even lower) for per-push CI, while leaving it at the default, or even increasing it, for scheduled runs - you could even have a separate profile for local development. This would be set from a conftest.py or similar, based on the detected environment (a minimal sketch follows this list).
  • In the extreme case, you can even skip example generation entirely using the phases setting, and rely on passing explicit examples instead. Or skip hypothesis tests with pytest -m "not hypothesis", as we automatically apply that marker.
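As a minimal sketch of the profiles idea above: the profile names and the HYPOTHESIS_PROFILE environment variable below are just illustrative conventions, not anything Astropy has adopted.

# conftest.py (sketch) - register a few profiles and pick one per environment
import os
from hypothesis import settings

settings.register_profile("dev", max_examples=100)       # the Hypothesis default
settings.register_profile("ci", max_examples=10)         # keep per-push CI fast
settings.register_profile("nightly", max_examples=1000)  # dig deeper on cron runs

# Illustrative convention: choose the profile via an environment variable.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))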

So there are lots of options to manage the performance impact, but IMO to apply them before seeing any slowdown would be premature optimisation :smile:

@Zac-HD - thanks very much indeed. While you were answering @pllim's query (I share her desire to minimize testing time/power consumption), I had been looking at your documentation and found not only the nice numpy array strategies but also the option to run only explicit examples (so kudos to your documentation!). The latter might be appropriate for existing tests in newly opened PRs, leaving the random trials for a daily cron (and lots of trials for tests in the module the PR affects, especially new ones).

Anyway, agreed that too much worry about that is premature optimization!

Another practical question: for CIs, a database with previous failures seems a bit less practical. Do you somehow still use it, or just add explicit examples for bugs uncovered? (Which can of course happen somewhat randomly...)

p.s. Did the example above uncover problems? (I'm the units maintainer...)

cc @bsipocz, @astrofrog

Aw, thanks :blush:

My experience with Hypothesis tests in PRs is that it's actually very rare to find a bug for the first time when making an unrelated change, and less rare to discover an unexpected effect of the new change. We had one or two cases where a very rare edge case eluded the tests for a week or so; but our current protocol is that, if we suspect that's possible, the PR developer turns up the max_examples setting to 100K for a local run - if it's going to happen at all, that will find it!

Re example-database-in-CI: I'd suggest adding .hypothesis to the list of directories cached by CI (or equivalently configuring the database setting to use such a directory). This database is basically designed for developer experience though, so for each bug found I'd capture it in an explicit @example(...) or traditional regression test to make sure it doesn't come back.
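For instance, a failure in the units test above could be pinned down roughly like this (the particular units below are made up, not an actual failure):

from hypothesis import example, given, strategies as st
import astropy.units as u

UNITS = sorted(u.get_current_unit_registry().all_units, key=repr)

# The explicit example runs on every test invocation, regardless of what
# Hypothesis chooses to generate, so the old bug can never silently return.
@given(st.lists(st.sampled_from(UNITS), min_size=3))
@example([u.eV, u.kpc, u.s])
def test_intermediate_decomposition_is_no_op(lst):
    decomp = compound = u.dimensionless_unscaled
    for unit in lst:
        decomp = (decomp * unit).decompose()
        compound *= unit
        assert decomp == compound.decompose(), (decomp, compound)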

And yep! The "unit test" (he he he) can find float overflow, underflow, and imprecision - potentially all in a single run. Whether this is a problem I think basically depends on your tradeoff between performance (floats are faster than Fractions) and precision (Fractions are exact, and unlike floats preserve that through scaling) - most of the examples it finds are combinations of six or more units which are probably (?) unlikely to arise in practice.

Thanks again, I now installed it locally at least...

@mhvk - I hope https://github.com/astropy/astropy/compare/master...Zac-HD:add-hypothesis is illustrative, at least :smile:

I already use hypothesis in other packages and it has helped me find so many bugs! So I'm in favor of using it for some cases where it would make sense in the core package 👍

Regarding caching .hypothesis, I think we should go with @Zac-HD's suggestion of instead capturing failing examples explicitly using @example, as any kind of caching on CI often leads to confusion when trying to reproduce failures locally.

Just to clarify, I'd set up caching and capture failures as regression tests.

As to how you reproduce failures locally, Hypothesis will print instructions - as an explicit input where possible, or as a seed argument or decorator if not. That happens regardless of caching, as "user new to Hypothesis needs to reproduce a bug from CI" is one of our core use-cases 🙂
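A minimal sketch of the caching setup, assuming the CI service is configured to cache the .hypothesis directory (the profile name and path are arbitrary):

from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

# Store previously-failing examples somewhere the CI cache preserves between builds.
settings.register_profile(
    "ci", database=DirectoryBasedExampleDatabase(".hypothesis/examples")
)
settings.load_profile("ci")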

Ping @Juanlu001 from EuroScipy :smile:

Let me recommend hypothesis by pointing out bug #9328 which I found using hypothesis on PINT. It is excellent at uncovering numerical accuracy issues; see https://github.com/nanograv/PINT/pull/549 for a beginner's attempt at hypothesizing (?) calculations. The runtime of the hypothesis tests is not noticeable in realtime, there was no need to cache the database, and troublesome examples are recorded in the source code. There is considerable repetition in the examples, but that is because my tests aren't very orthogonal.

I think this is something that would make sense to discuss at the in-person coordination meeting in December, so I'll add it to the agenda.

> My experience with Hypothesis tests in PRs is that it's actually very rare to find a bug for the first time when making an unrelated change, and less rare to discover an unexpected effect of the new change. We had one or two cases where a very rare edge case eluded the tests for a week or so; but our current protocol is that, if we suspect that's possible, the PR developer turns up the max_examples setting to 100K for a local run - if it's going to happen at all, that will find it!

Let me add that if you understand where your corner cases are likely to occur you can encourage the example generator to focus on them. For example, I built a leap_sec_day_mjd() strategy to check days with leap seconds, and a normal_mjd() strategy to check days without - but normal_mjd() chooses between days-adjacent-to-leap-second-days and any-old-day-that-isn't-a-leap-second-day. And reasonable_mjd() chooses between those two, so leap-second-days and their neighbours get tried as part of the exploration process. I have found a few cases that still pop up after only a few hundred examples, but it's easy to propagate those to related tests once found.

I also found that I got more useful debugging information when I used reasonable_mjd() to generate my examples and specified my constraints in real terms (e.g. these two conversion methods should not differ by more than a nanosecond).
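A rough, self-contained sketch of that strategy layering might look like the following; the MJD values and ranges are placeholders, and the real implementations in astropy/time/tests/test_precision.py are more careful.

from hypothesis import strategies as st

# Placeholder values for illustration only - the real tests derive these
# from the actual leap-second table.
LEAP_SEC_DAY_MJDS = [41498, 41682]

def leap_sec_day_mjd():
    # Days that contain a leap second.
    return st.sampled_from(LEAP_SEC_DAY_MJDS)

def normal_mjd():
    # Bias exploration towards the neighbours of leap-second days,
    # while still allowing any ordinary day.
    near_leap = st.sampled_from(
        [d + off for d in LEAP_SEC_DAY_MJDS for off in (-1, 1)]
    )
    any_day = st.integers(min_value=40000, max_value=70000).filter(
        lambda d: d not in LEAP_SEC_DAY_MJDS
    )
    return st.one_of(near_leap, any_day)

def reasonable_mjd():
    # Mix the two, so leap-second days and their neighbours are tried often.
    return st.one_of(leap_sec_day_mjd(), normal_mjd())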

The 2019 coordination meeting included a Hypothesis unconference session - notes here: https://docs.google.com/document/d/14x7B0ykjWPyh2kh2GcbZTQn8PGdizg5zL7t03zDw7uU

CONCLUSION

No objection to using hypothesis in astropy. Next steps:

  • Maintainers can choose to apply it as they see fit. No need to “go crazy.”
  • Sub-packages that can benefit from this: units, time, modeling, …
  • Add your comments to issue #9017
  • Zac H-D willing to help out with our effort (thanks!)

For those following this thread, we now have very nice precision tests for Time using hypothesis (via #10373, thanks to @aarchiba and @Zac-HD) - see https://github.com/astropy/astropy/blob/master/astropy/time/tests/test_precision.py

Those can serve as a good (if perhaps rather complicated!) example of how to set things up; there is also some more general information in the documentation, at https://docs.astropy.org/en/latest/development/testguide.html#property-based-tests.

Just a thought - we might also want to provide strategies for other packages based on astropy, for example to generate random quantities, or random tables and so on.
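To give a flavour of the idea, a downstream-facing strategy for random quantities might look something like this; the quantities() name and its location are hypothetical, not an existing astropy API.

from hypothesis import strategies as st
import astropy.units as u

@st.composite
def quantities(draw, units=(u.m, u.s, u.kg)):
    # Draw a unit from the allowed set and a well-behaved float value,
    # then combine them into a Quantity.
    unit = draw(st.sampled_from(units))
    value = draw(st.floats(allow_nan=False, allow_infinity=False, width=64))
    return value * unit

# Usage in a downstream test:
# @given(quantities())
# def test_something(q): ...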

Indeed, one could think of using our tests module for that (either that of astropy itself or, perhaps more logically, of each module), somewhat in analogy with np.testing - this could also hold things like assert_quantity_allclose and standard fixtures. In fact, I've seen in our own tests that we've started to import handy features from other tests; e.g.,
https://github.com/astropy/astropy/blob/0a46d333a976d3d69c6cd7f9f2e3574298fe5209/astropy/table/tests/test_operations.py#L21-L22

Sounds great! I'd suggest 'dogfooding' them by using the strategies internally as a valuable way to check that the design actually makes sense, but something like an astropy.testing.strategies module would be awesome :smile:

Recommended reading: Hypothesis' guides to API design and efficient shrinking for strategies. They go into considerably more detail than most users should ever find useful, but are hopefully helpful if you're going to write custom strategies for downstream users.

Hi all - I have a draft paper in the SciPy proceedings (https://github.com/scipy-conference/scipy_proceedings/pull/549) about property-based testing for scientific code, which uses some examples from this work.

Feel free to take a look, and let me know if you spot any inaccuracies or have suggestions on how to cite or describe the astro parts. General feedback would also be welcome, of course - I'm aiming to make this useful for both maintainers and users of scientific libraries!

@Zac-HD - thanks for sharing! I enjoyed reading it (but have no useful comments)

Thanks @mhvk! "no comments" still tells me I didn't make any egregious mistakes about astropy 😉
