This was originally discussed at #1240. It's unclear how this would work exactly -- we may first need to design a static type system extension for numpy.
we may first need to design a static type system extension for numpy.
I think we would only need some dependent type to describe a fixed size array Array[np.foat32, n, k], where n and k are integer numbers known statically. Related https://github.com/python/mypy/issues/3062 and https://github.com/python/mypy/issues/3345
This project looks somewhat related: http://datashape.readthedocs.io, https://github.com/blaze/datashape
There is some conversation about type annotations in this NumPy issue: https://github.com/numpy/numpy/issues/7370
Here's my brain dump on static typing for arrays:
TensorFlow (which copies the design of NumPy in many ways) has a form of static typing for array shapes that is useful precedence here. Notably, it distinguishes between three cases (copied from that page):
TensorShape([16, 256])TensorShape([None, 256])TensorShape(None)TensorFlow sets shapes using C++ or Python, mostly with SetShapeFn. Looking through how it's used internally by TensorFlow operations could give hints about the necessary use cases.
In particular, one important use case supported by most NumPy functions is broadcasting, where output shapes are propagated via broadcasting rules. This could allow for identifying broadcasting errors (one of the most frequent user-errors in NumPy) at compile-time instead of run-time.
Finally, a concept like TypeVar for arrays dimensions (call it DimensionVar?) could be extremely powerful. In NumPy, an operation like x + y for inputs x: Vector[N] and y: Vector[M] only can succeed if N=1, M=1 or N=M. Obviously, this is error prone -- N and M can coincide purely by chance, even if they represent semantically distinct axes. For static checking, it could be useful to assume that N and M are always distinct, unless cast to use the same dimension variables. This would also catch many errors.
I think we would only need some dependent type
I think dependent types are needed in very rare cases. What we probably need more, is something like static constraints. So dependent types would mean that some part of the type is based on value of function argument. For example, if function would be:
def foo(array, array_dtype):
...
Where array has dtype of array_dtype, then you would need a dependent type to express that array argument has numpy type with array_dtype as its dtype.
But what we first need is even support to express "numpy with dtype" type. I see this as an extra constraint on numpy type. The issue is that it is not enough just to have support for static constraints, but one needs a language so that you can describe that input arguments to a function can have some relations between them (like one argument array has to be 2x in width than the other argument array). It gets pretty tricky soon. See contracts project for an attempt at this.
Working on a large codebase of scientific python code, a generic array-type annotation would be incredibly useful. An example that would be nice to type-check:
def compute_costs(
predictions, # type: Array(n_samples, n_outputs)[float] # Array of probabilities
labels, # type: Array(n_samples)[int,>=0,<n_outputs] # Correct labels in range 0 to n_outputs-1
): # type: (...) -> Array(n_samples)[float] # The cost per-sample
This documents that:
It would be good to add more flexibility to this example. For example it may be the case that the data points are not flattened along one dimension, e.g.
def compute_costs(
predictions, # type: Array(n_samples, *data_dims, n_outputs)[float] # Array of probabilities
labels, # type: Array(n_samples, *data_dims)[int,>=0,<n_outputs] # Correct labels in range 0 to n_outputs-1
): # type: (...) -> Array(n_samples)[float] # The cost per-sample
This would indicate that the predictions and labels can have any shapes as long as they match.
It would also be great to have some kind of support for Generics Types. For example:
def minibatch_iterator(
data_iterator, # type: Generator[Array(*dims)[<dtype>]],
minibatch_size, # type: int
): # type: (...) -> Generator[Array(minibatch_size, *dims)[<dtype>]],
This annotation would show that
FYI, the https://github.com/numpy/numpy_stubs project is now working on support for typing numpy-using code.
Yes, and we are very much interested in collaboration. See here for ongoing discussion about typing shapes: https://github.com/numpy/numpy_stubs/issues/5
Who is "we" here and who do you want to collaborate with? Is there a
complete proposal for the syntax yet? It would have to aim at becoming a
PEP (a sibling to PEP 484 and PEP 526) and someone should attempt an
implementation for mypy, perhaps in the form of a plugin. The mypy team
doesn't have the resources or priorities to help much, alas.
@gvanrossum
I guess Stephan means collaboration between numpy team and mypy team. There is a (very preliminary) draft proposal here, discussion about it is tracked also in https://github.com/python/typing/issues/513. Just as a coincidence yesterday we discussed with Jukka possible plans for supporting numpy. I am personally very interested in this (taking into account my extensive experience with numpy), and it looks like Jukka is interested as well (mostly due to the impact this will have). _However_, we both indeed don't have bandwidth for this now. In our discussion we agreed that it would be great if either or both of us can work on this in late Q3 or Q4.
@gvanrossum I suppose by "We" I was referring to NumPy fellow developers, and others in the broader numerical Python community. (But to be honest, I wasn't being very carefully with my post.)
There are a few others in the mypy community who have also expressed interest here. I met up with @JelleZijlstra and @ethanhs along with @njsmith (from the NumPy side) a few months ago to discuss the issues in https://github.com/python/typing/issues/513. This prompted the creation of numpy-stubs, but there hasn't been significant progress on the major issues since then (which of course is totally fine, this being open source I have no expectations on anyone else's time).
Is there a complete proposal for the syntax yet? It would have to aim at becoming a
PEP (a sibling to PEP 484 and PEP 526) and someone should attempt an
implementation for mypy, perhaps in the form of a plugin.
My preliminary partial proposal (which @ilevkivskyi linked to above) takes a stab at syntax, and also outlines some of the broader issues for typing. I agree that a PEP will be necessary eventually, but I'm not sure we're quite ready for that yet. A proof of concept implementation would certainly help clarify many of these issues.
@rmcgibbo wrote an experimental mypy plugin for NumPy in https://github.com/rmcgibbo/numpy-mypy, but it only supports checks on the number of array dimensions (ndim), not shapes or dimension identity. I think dimension identity checks could be very powerful (along the same lines as what @petered writes above), but certainly support for ndim would be better than nothing.
Ah, very interesting. Are you coming to PyCon in Cleveland this year? If so
you should make sure to have an in-person meeting with the mypy team (Jukka
and Ivan in particular) and any other folks interested (you mentioned a
few).
On Sun, Mar 25, 2018 at 11:55 PM, Stephan Hoyer notifications@github.com
wrote:
@gvanrossum https://github.com/gvanrossum I suppose by "We" I was
referring to NumPy fellow developers, and others in the broader numerical
Python community. (But to be honest, I wasn't being very carefully with my
post.)There are a few others in the mypy community who have also expressed
interest here. I met up with @JelleZijlstra
https://github.com/JelleZijlstra and @ethanhs
https://github.com/ethanhs along with @njsmith
https://github.com/njsmith (from the NumPy side) a few months ago to
discuss the issues in python/typing#513
https://github.com/python/typing/issues/513. This prompted the creation
of numpy-stubs https://github.com/numpy/numpy_stubs, but there hasn't
been significant progress on the major issues since then (which of course
is totally fine, this being open source I have no expectations on anyone
else's time).Is there a complete proposal for the syntax yet? It would have to aim at
becoming a
PEP (a sibling to PEP 484 and PEP 526) and someone should attempt an
implementation for mypy, perhaps in the form of a plugin.My preliminary partial proposal
https://docs.google.com/document/d/1vpMse4c6DrWH5rq2tQSx3qwP_m_0lyn-Ij4WHqQqRHY/edit#
(which @ilevkivskyi https://github.com/ilevkivskyi linked to above)
takes a stab at syntax, and also outlines some of the broader issues for
typing. I agree that a PEP will be necessary eventually, but I'm not sure
we're quite ready for that yet. A proof of concept implementation would
certainly help clarify many of these issues.@rmcgibbo https://github.com/rmcgibbo wrote an experimental mypy plugin
for NumPy in https://github.com/rmcgibbo/numpy-mypy, but it only supports
checks on the number of array dimensions (ndim), not shapes or
dimension identity. I think dimension identity checks could be very
powerful (along the same lines as what @petered
https://github.com/petered writes above), but certainly support for
ndim would be better than nothing.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/python/mypy/issues/3540#issuecomment-376064538, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACwrMsb8o_LlqbCinfGUfwdz6mH8guW-ks5tiJDpgaJpZM4N5lFE
.
--
--Guido van Rossum (python.org/~guido)
@gvanrossum unfortunately, I don't think I'll be able to make it to PyCon this year (though it is definitely tempting!). But if you could still have a good meeting with @njsmith and @mrocklin from the numpy/scientific python side. I would also be happy to do a video-conference/call with anyone interested in higher bandwidth discussion.
This topic has been subsequently discussed in various other forums (including the Bay Area typing meetups -- and PyCon 2019, I think). Keeping this issue open doesn't seem worthwhile, since it's missing a lot of context.
@JukkaL Is there a link to the discussions you mentioned? How can we get more context?
There has been some recent discussion in the typing-sig@ mailing list. That seems like the best place to find up-to-date information.
Most helpful comment
Here's my brain dump on static typing for arrays:
TensorFlow (which copies the design of NumPy in many ways) has a form of static typing for array shapes that is useful precedence here. Notably, it distinguishes between three cases (copied from that page):
TensorShape([16, 256])TensorShape([None, 256])TensorShape(None)TensorFlow sets shapes using C++ or Python, mostly with
SetShapeFn. Looking through how it's used internally by TensorFlow operations could give hints about the necessary use cases.In particular, one important use case supported by most NumPy functions is broadcasting, where output shapes are propagated via broadcasting rules. This could allow for identifying broadcasting errors (one of the most frequent user-errors in NumPy) at compile-time instead of run-time.
Finally, a concept like
TypeVarfor arrays dimensions (call itDimensionVar?) could be extremely powerful. In NumPy, an operation likex + yfor inputsx: Vector[N]andy: Vector[M]only can succeed ifN=1,M=1orN=M. Obviously, this is error prone --NandMcan coincide purely by chance, even if they represent semantically distinct axes. For static checking, it could be useful to assume thatNandMare always distinct, unless cast to use the same dimension variables. This would also catch many errors.