For context, see also #501 on vectorizing all Shapely functions. More specific context: PyGEOS (https://github.com/pygeos/pygeos/, https://caspervdw.github.io/Introducing-Pygeos/) is a new python package that exposes geospatial operations from GEOS, similarly to Shapely, but through fast, vectorized (numpy-based) functions.
It gives a significant speed-up (5x to 100x speed-up (depending on the kind of operation) for operations on arrays), opens up the possibility of parallelizing calls to pygeos functions, and will provide a needed performance boost to GeoPandas and others who work with arrays of geometries.
It has a different initial focus (performance) but clearly also has a lot of overlap with Shapely. We have been discussing offline with a few people on how a possible integration between both packages could look like, but thought to make an issue here as well to have a public recollection of it.
To quote Sean: The unanswered question is: how do we organize GeoPandas, Shapely, and PyGEOS to maximize the benefits for the projects, their developers, and their users while also distributing the costs in an equitable way?
Can we integrate the PyGEOS functionality fully into Shapely, avoiding the need of a separate library? Or would Shapely want to use PyGEOS as a dependency for its interaction with the GEOS library instead of its ctypes approach? Or if keeping separate libraries, how do ensure a best possible interoperability?
Below, I am trying to summarize some of the discussion points, partly from my (biased) viewpoint as GeoPandas developer.
Stumbling blocks / disadvantages for moving pygeos into Shapely:
Advantages of integrating pygeos into Shapely:
And a few aspects that are less clear advantage/disadvantage:
Or, if not directly integrating the full functionality of PyGEOS into Shapely, are there ways to integrate the Geometry type better and ensure good interoperability / exchangeability ?
__geo_interface__
. However, not all packages will support this to recognize geometries (I suppose many packages expect Shapely objects, and actually Shapely itself also does not support it as input in the Geometry methods), and this also has a lot of overhead. I would argue that this is mainly useful for packages that explicitly do no want to depend on GEOS (through Shapely or PyGEOS).For me, my main concern is to have a clear story for the community of both end user and dependent packages.
Should they use / support shapely or pygeos geometries? What if I am using pygeos geometries but then another package I want to use only supports shapely geometries? For that reason, I am personally hoping we can find some way to integrate both projects.
And a similar viewpoint from the GeoPandas side: we are planning to use PyGEOS under the hood to store the geometries in the geometry column and have faster operations on it. However, we are not planning to "get rid of" Shapely. For now, Shapely will still be the primary user-facing access to elements of a GeoDataFrame: when accessing a single element, users get a Shapely object (the stored pygeos object gets converted) to benefit from the rich functionality of Shapely scalar objects.
This means that whathever inconsistency there is between PyGEOS and Shapely will be confusing for GeoPandas users. It means that there will be two ways of getting the geometries out of a dataframe: as an array of shapely objects or an array of pygeos objects. All that is not an ideal situation from a GeoPandas viewpoint.
cc @sgillies @snorfalorpagus @snowman2 @caspervdw
Same thing as above, just organized by potential directions that could be taken in the integration. Probably missed some benefits/drawbacks for the different options, but I tried to get the general idea.
pygeos
use shapely
geometry object as its geometry objectshapely
would remain a lightweight wrapper to GEOS with no other dependencies (except maybe cython
).shapely
could remove itself from having numpy
as a dependency by no longer needing the vectorized extra dependency.shapely
would be able to separate itself from the numpy deprecation cycle.shapely
installation process .shapely
maintenance burden.shapely
developers to add other new features.shapely
and pygeos
could potentially have different versions of GEOS at runtime.pygeos
and shapely
geometry objects interact well with each other__geo_interface__
pygeos
to create geometries from shapely
shapely
to create geometries from pygeos
shapely
would remain a lightweight wrapper to GEOS with no other dependencies (except maybe cython
).shapely
could remove itself from having numpy
as a dependency by no longer needing the vectorized extra dependency.shapely
would be able to separate itself from the numpy deprecation cycle.shapely
installation process.shapely
maintenance burden.shapely
developers to add other new features.shapely
use pygeos
geometry object as base geometryshapely
and pygeos
could potentially have different versions of GEOS at runtime.shapely
will require numpy as a dependency & have the deprecation cycle attached.shapely
users.shapely
and pygeos
into a single librarynumpy
dependencyc
code (increase maintenance complexity)shapely
developers in the integrationshapely
@snowman2 thanks for that write-up!
On Option 1: Have pygeos use shapely geometry object as its geometry object, some questions on how that would work in practice.
For pygeos, the base geometry type needs to be a Python extension type, so written in C (or cython). So that's something that Shapely will still need to do (and thus have a hard dependency on a compiler for development / packaging).
I suppose then the other geometry classes in Shapely would be subclasses of this? And for the rest Shapely would still keep using the current approach of ctypes to interact with GEOS? (that would still duplicate the effort of interfacing to GEOS; and also keep using ctypes (I don't know to what extent it is desired to move away from ctypes though)).
On Option 1: Have pygeos use shapely geometry object as its geometry object, some questions on how that would work in practice.
If this road were to be taken, I would probably lean towards replacing ctypes with cython
if possible. That would require quite a big refactor for Shapely. Also, not sure how well that would interface with pygeos
. And it does come with the downside as you mentioned it would "duplicate the effort of interfacing to GEOS"
.
Though off of that tangent: If shapely
were to be re-written in cython
, I wonder if you could achieve similar speedups found in pygeos
without needing numpy. :thinking:
If shapely were to be re-written in cython, I wonder if you could achieve similar speedups found in pygeos without needing numpy.
So either you write it in cython for scalars, and then: no, you can't achieve the same speed-up (it's the fact that it works on the full array at once that is important). Or either you write it in cython to work on arrays, but then most practically you need numpy, and then: yes, you can achieve the exact same speed-up (this is basically what we were doing in the geopandas cython branch).
I suppose it would be possible to write array-wise cython functionality without depending on numpy (using python's memoryviews etc), but I think that would be a lot less practical. Numpy is giving you a lot of convenience to work with arrays.
Now, if a numpy dependency is not necessarily the problem, but rather the direct use of C is seen as a bottleneck: I think it is possible to do almost exactly the same in cython what pygeos currently does in C (you can write ufuncs in cython).
A short response from my side. Thanks a lot for the writeup @jorisvandenbossche @snowman2
For me, one of the main issues is not wanting to split the Python GEOS community. For that it is necessary to either merge or make them dependent (options 3 and 4).
Also, we shouldn’t put too much emphasis on the amount of work any transition would take. We should just decide on the “best way” to go forward. It will pay off. In my experience, that approach works best.
More technically, I think that not only speed is an argument for using pygeos’ approach of wrapping GEOS. There is actually not much difference when looking only at scalar geometries. If I try to look objectively to the differences in shapely and pygeos (disclaimer: I wrote pygeos) I come to the following advantages of either using pygeos as dependency of shapely or merging the projects:
I did open the door to extending the pygeos.Geometry type with a pure Python class. That could be a way to actually make that object act as a shapely object. See https://github.com/pygeos/pygeos/pull/72 .
We've had a pair of video calls to discuss these issues. Thank you @jorisvandenbossche and @caspervdw for organizing and thank you @snowman2 @snorfalorpagus and @mwtoews for joining us. Our consensus is to go ahead with the 4th option: replacing the existing vectorized module in shapely with code from pygeos and expanding the scope for vectorized operations. It is yet to be decided if shapely's existing scalar operations will be a special case of more general vectorized operations. Also yet to be determined is a roadmap for developing and releasing.
Next steps: let's spread the word about this and see if anyone else sees a flaw in the plan. Then let's work on the roadmap and try to make some progress on this after the holidays.
Hi all, I started working (mainly thinking, some experimentation) on this again over the last weeks, and tried to put some thoughts together in a discussion document, in an effort to start getting this in motion.
I think one of the important discussion points that we didn't flesh out on the call is how to integrate pygeos' code/concepts in Shapely: what API do we want to provide, and how much can this deviate from the current API? We obviously want to add the vectorized functions (somewhere in a module), but what to do with the Geometry classes?
Since those geometry classes are at the core of Shapely, I think it's good to start discussing the options in that area.
The main difference is that Shapely now provides a set of classes (Point, Linestring, Polygon, MultiPoint, etc), one for each of the basic geometry types. PyGEOS, on the other hand, has only a single Geometry extension type (coded in C). I don't think we "just" want to add the Geometry extension type of PyGEOS to Shapely and further keep the other classes as well (that would give two "types" of objects within Shapely), but rather integrate both in some way.
I put some ideas on the different options here on hackmd: https://hackmd.io/@jorisvandenbossche/ryQs_0AzL (with already some other topics as well)
How do we want to discuss this? Is the text on hackmd fine (there is basic commenting functionality), or would it be better to do a PR with it, so we can do inline commenting on github? Or do we discuss it here, keeping the separate document as reference that can get updated?
On another note: I think we will want to change some behaviour (eg I mentioned mutability in the linked document, but also just the ctypes
interface will go away. The exact items are to be discussed of course), so from a practical point of view, would the following branching / releasing workflow work?
A gentle ping for my post above.
Also, if there are any suggestions on how or where to best discuss this (here, a PR with a textual proposal, ...), that's very welcome.
Hi @jorisvandenbossche, I'm currently drowning in shapely issues in the 1.7 (master) branch, we've got a burst of them, and haven't been able to get to this yet. I'll respond to that comment above later today or tomorrow.
@jorisvandenbossche Thanks for keeping at this! Listing all the stuff that needs to be done makes the project a lot less daunting.
One way to discuss this is is taking your document and place it it a shapely-rfc repo. We could separate issues and iteratively work on a plan.
One thing we should consider is if we are going to work on the existing shapely code or start with a clean sheet, copying stuff from pygeos and shapely into a completely new repo (or branch). It seems to me that starting with a clean sheet might be less work altogether.
One specific thing I’d like to add: I think we will hit a hard wall when trying to subclass pygeos.Geometry into Python. This because we should be able to type check the geometry inside the ufunc inner loop, and as Python is dynamic, you can’t without taking a detour through the interpreter.
Cython in the core is something I’d like to avoid (as oppossed to custom algorithms).
I’ll try to reply faster next time!
Okay, I've left some comments in the HackMD page. Thanks for starting that @jorisvandenbossche !
Tomorrow I'll make a maint-1.7 branch and then the master will become 1.8dev. I think we could make a branch for version 2.0dev almost immediately and periodically merge in the changes from maint-1.7 and the master branch.
@caspervdw let's find out whether that's going to be a blocker as soon as we can.
@sgillies thanks for the comments!
Since the comments in HackMD are not that visible, putting the most important one here:
[@sgillies] I require the geometry type classes. JTS and GEOS both have a rich set of geometry type classes. It's only the GEOS C API that lacks them, and that's because of C language limitations, not because it's the ideal approach. Yes, a 2.0 version allows us to make breaking changes, but breaking every existing application is a step too far. This would best be done in an entirely new package not called shapely. Constructors and isinstance hooks would help but eliminating a Point's x and y attributes goes too far.
On this last sentence: I didn't propose to eliminate Point's x and y attributes, so sorry if that wasn't clear (I think we should preserve all existing attributes and methods, with a few exceptions that we can deprecate first, like .ctypes
).
With the constructors, isinstance hooks, and all familiar methods/attributes in place, I don't think the option of a single class would break "every existing application".
I would assume it will only be a minority that encounters problems (for example, if one is doing type(geom) is shapely.geometry.Point
instead of an isinstance check, that will break).
The main problem of multiple classes is that it will quite complicate the implementation on the C side (as far as I can imagine now), which is a clear downside of that option.
We could have the subclasses in C and have them basically almost the same as the base Geometry class (just ensuring they have the correct geometry type). That might not be too much work (but IMO also doesn't add that much value compared to the single class). A difficulty would in this case also be to match the behaviour of the current constructors.
To have the subclasses in Python, as @caspervdw points out, that complicates the ufuncs (and would probably impact performance as well) as those that create new geometries would need to call from C into python to construct the correct class.
@jorisvandenbossche I thought removal of attributes like x, y from point objects was implied in the proposal for a single class, because why would we want x, y attributes on a polygon object?
Would Capsules help with the ufuncs? https://docs.python.org/3/c-api/capsule.html#capsules. Specifically, could our C base class define a small C API that the ufuncs could use to get the GEOS geometry pointers?
I touched on the topic of methods that are unique to one of the geometry types, but there are only a few, so mentioned that "those can also raise an error if they are not applicable for the geometry type in question", when providing all attributes on the single Geometry class.
It's certainly a slight user interface degradation that you can see those irrelevant attributes (eg in tab completion), but it are only a few (eg x/y/z for Points and exterior/interiors for Polygons on 50-60 attributes/methods when comparing those two geometry types).
Would Capsules help with the ufuncs? Specifically, could our C base class define a small C API that the ufuncs could use to get the GEOS geometry pointers?
What problem do you have in mind for which this could help?
The C base class already provides a way for the ufuncs to get the GEOS geometry pointer, as it stores this pointer as an attribute in the C struct of the python object.
I did look at capsules, but our implementation doesn't need it. It can't solve this issue for us.
However, I do like the idea of pure python subclasses of pygeos.Geometry
. It seems to me that there are two issues that are blocking this:
I had another look at the Python docs (notably: https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_base) and it appears that there is a static attribute that keeps track of a type's baseclass. We can use this in the ufuncs. Then the first issue is resolved (note: I did not benchmark yet, but I think we're absolutely fine with this). See this PR:
https://github.com/pygeos/pygeos/pull/104
The second issue might be not that bad either: we are using the python interpreter anyway as we are creating the objects. We can't release the GIL. Selecting the right subtype might not be that much extra overhead. If necessary, we could populate a struct on the module initialization with pointers to the (python-defined) classes.
Consuming the subclasses in the C ufuncs is no problem I think (I commented on the PR as well), it's mainly producing them that is more complex. It's true that we need python interaction anyway to create the python objects, so it might indeed not be that much slower. It will still make the C code more complex though ..
But anyway, probably something we just have to try out so we can benchmark it and evaluate how complex it actually turns out to be!
Some small updates / other questions on this:
@caspervdw further experimented with the "subclassing in python" approach (https://github.com/pygeos/pygeos/pull/104), which seems to work nicely, also to return those subclasses from the ufuncs. So this might be a good solution for the subclassing question.
In PyGEOS, we also implemented an STRtree. But the API is not compatible with the one that is now in Shapely (but more similar to the API of rtree
): the tree in PyGEOS returns integer indices, while the tree of Shapely returns actual geometry objects (cfr https://github.com/Toblerity/Shapely/issues/615).
The behaviour of shapely can easily be implemented on top of the behaviour of PyGEOS (to avoid a duplicate tree implementation). But the question here will still be how we want to provide this in the public API. Can we somehow unify this? Or just provide two tree objects users can choose from? Or is it acceptable to deprecate the current shapely behaviour? (or provide a keyword to still return geometries instead of integers?)
In Shapely, there is a PreparedGeometry
(and prep
function), which needs to be used explicitly by the user. In PyGEOS, we are however thinking to "just" use this when appropriate under the hood, without the user needing to think about this. For example the STRtree already uses this (when a predicate is specified), and in https://github.com/pygeos/pygeos/pull/92 I am looking at enabling this in the vectorized spatial predicates like contains
etc.
The question here is if we will always know when this is "appropriate", or if there are cases that it can unexpectedly give a large overhead for situations prepared geoms were used unncessarily (feedback on this in the linked PR is very welcome!).
But if we do it under the hood, this could also mean that PreparedGeometry
could be deprecated.
Feedback on the STRTree / prepared geometry questions from the comment above is still very welcome.
I also wanted to raise another topic for discussion: how to expose those vectorized/ufunc functions?
Copying part of the text in the hackmd and Sean's response to it:
How to add the ufuncs to Shapely’s API? This can be just in a submodule of shapely, eg
shapely.geos
? (as the module that exposes the “raw” GEOS functionality on which the other modules are based)
[@sgillies] Yes, I think a submodule is the way to go. It would be okay to use shapely.vectorized for this. I would prefer to not have a module named geos.
Some reasons that I am not really "fond" of shapely.vectorized
:
self
)That last point touches on how we want to integrate this with the other shapely modules (ops
, algorithms
, affinity
, ...).
As an example, the shapely.affinity
module has a affine_transform(geom, matrix)
function. I think we will want a vectorized version of this (we already have the machinery for that in pygeos right now), but instead of having a vectorized shapely.vectorized.affine_transform
handling both single geometries and arrays of geometries and a shapely.affinity.affine_transform
only handling single geometries, I think it would make sense to just have the version in shapely.affinity
handle both cases (this would be a backwards compatible change).
Similar examples can be given for the other functions in affinity
, or for shapely.validation
functions, for shapely.ops.orient
, the functions in the wkb
/wkt
modules, etc.
The question of whether/how to automatically prepare geometry is an interesting one. It depends on the cost of preparing a geometry versus how much is saved by using the prepared geometry multiple times. There's so many factors involved it would be very difficult to cost this exactly.
If you know the number of evaluations a priori, then a simple heuristic is to prepare if # evaluations > N, where N is some small number (perhaps even 1).
@dr-jts Thanks for the input! I will try to respond in a bit more detail at https://github.com/pygeos/pygeos/pull/92 one of the coming days
@sgillies would it help if I do a PR with an RFC-like document that summarizes the changes we would like to do for Shapely 2.0 and the workflow to get there? (similarly how you did one for Fiona 2.0). That might be easier to discuss compared to the external hackmd I made now (which I also need to update with the latest changes/discussions above)
@jorisvandenbossche yes please with an RFC :pray:
Hi all, a more detailed proposal is now available at https://github.com/shapely/shapely-rfc/pull/1. Feedback very welcome!
I read through the RFC. A big paint point for me is that geometries of different sizes cannot be created in a vectorized way in PyGEOS. This is fine if you only have points - but for linestrings, I think it is an unreasonable assumption that all linestrings will contain the same number of coordinates.
In my particular case I want something that creates linestrings if the number of coordinates is >= 2, and points if the number of coordinates is == 1. (And I want the looping-part of this not to happen in Python for performance reasons)
I'd love to see some kind of solution for this considered in Shapely 2.0!
Most helpful comment
We've had a pair of video calls to discuss these issues. Thank you @jorisvandenbossche and @caspervdw for organizing and thank you @snowman2 @snorfalorpagus and @mwtoews for joining us. Our consensus is to go ahead with the 4th option: replacing the existing vectorized module in shapely with code from pygeos and expanding the scope for vectorized operations. It is yet to be decided if shapely's existing scalar operations will be a special case of more general vectorized operations. Also yet to be determined is a roadmap for developing and releasing.
Next steps: let's spread the word about this and see if anyone else sees a flaw in the plan. Then let's work on the roadmap and try to make some progress on this after the holidays.