Geopandas: Performance in Geopandas appears to be slow

Created on 8 Feb 2017 · 4 Comments · Source: geopandas/geopandas

I'd like to confirm the performance I am observing when utilizing distance.

Problem

Time to complete operations appears to be far greater than for comparable operations in PostGIS. I would like to understand whether this is known and, if so, whether there are suggested methods for making Geopandas more performant, particularly for geo operations like buffering.

Goal

Given a set of geometries, I would like to calculate some aggregate (e.g. sum()) for each geometry plus those that fall within a given distance of the reference geometry.

Example

Given a set of geometries, I would like to calculate a quarter mile buffer (402 meters) around each of them and gather the sum of the attribute du (dwelling units).

Current Strategy

Current method, utilizing solely the centroids in an attempt to be performant:

# precompute the centroids
centroids = df['wkb_geometry'].centroid

def test_measure(row):
    center = row['wkb_geometry'].centroid
    return df.loc[centroids.distance(center) < 402, 'du'].sum()

df.apply(test_measure, axis=1)

Precomputing the centroids and then using distance does appear to be more efficient than buffering and using the within operation. Regardless of method, the cost grows as n^2, where n is the row count: test_measure runs once for each of the n rows, and each call compares that row against all n geometries.
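
To make the n^2 growth concrete, here is a minimal pure-Python sketch of the same centroid strategy that counts how many distance checks it performs. It is not the Geopandas code above: plain (x, y) tuples stand in for centroids, and the function name and random test data are made up for illustration.

```python
import math
import random

def sum_du_within_bruteforce(points, du, radius):
    """For each point, sum `du` over every point closer than `radius`.

    Returns (sums, number_of_distance_evaluations).
    """
    sums = []
    evaluations = 0
    for x0, y0 in points:
        total = 0
        # Every row is compared against all n rows, itself included.
        for (x1, y1), value in zip(points, du):
            evaluations += 1
            if math.hypot(x1 - x0, y1 - y0) < radius:
                total += value
        sums.append(total)
    return sums, evaluations

random.seed(0)
for n in (100, 1000):
    pts = [(random.uniform(0, 5000), random.uniform(0, 5000)) for _ in range(n)]
    _, evals = sum_du_within_bruteforce(pts, [1] * n, 402)
    print(n, evals)
# 100 10000
# 1000 1000000
```

Ten times the rows means a hundred times the distance checks, which is roughly in line with the jump from 0.17s to 11.11s reported below.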

Times:
100 rows: 0.17s
1000 rows: 11.11s

Prior strategy

Prior, with buffering and within we took about:
100 rows: 1s
1000 rows: 100s

Prior method, not using centroids:

def test_measure(row):
    buffer_shape = row['wkb_geometry'].buffer(500)
    return df.loc[df['wkb_geometry'].within(buffer_shape), 'du'].sum()

Thoughts

Were I to perform a similar operation (using st_dwithin) in PostGIS, I would be able to run the operation in the following times.

Times:
100 rows: not run
1000 rows: not run
24000 rows: 75s

For reference, here is an example of that sort of SQL query:

CREATE OR REPLACE FUNCTION agg_within_dist(
    in_id int,
    in_geometry geometry,
    OUT id int,
    OUT du float)
AS
$$
    SELECT 
        $1 AS geography_id, 
        SUM(CAST(ref.du AS float)) AS du
    FROM s1.s1_scenario_final AS REF WHERE st_dwithin($2, ref.wkb_geometry, 402);
$$
COST 10000 
LANGUAGE SQL STABLE strict;


SELECT (f).* 
FROM (
        SELECT agg_within_dist(geography_id, wkb_geometry) AS f
        FROM scenario
     ) s


All 4 comments

Profiling could help you find out where the bottlenecks are.

Just my uninformed guess would be that shapely might not be as fast as it could be.

It would be nice if your simpler code resulted in similar performance to your centroid example.

@kuanb The difference in performance is likely due to use of a spatial index (or lack thereof). In postgis, the query planner can use the spatial index to run within only on the subset of geometries which could possibly intersect. In your geopandas code, within is run on every geometry regardless. Take a look at the sjoin code for an example of using a spatial index in geopandas.
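
To illustrate the idea in the comment above (coarse index lookup first, exact test only on the candidates), here is a small pure-Python sketch using a uniform grid. This is only an illustration of the two-step pattern, not how PostGIS or Geopandas implement their indexes (those are tree-based), and the function names and sample points are hypothetical.

```python
import math
from collections import defaultdict

def build_grid_index(points, cell_size):
    """Map each grid cell (col, row) to the indices of the points inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(i)
    return grid

def query_radius(points, grid, cell_size, center, radius):
    """Return sorted indices of points closer than `radius` to `center`."""
    cx, cy = center
    min_col = math.floor((cx - radius) / cell_size)
    max_col = math.floor((cx + radius) / cell_size)
    min_row = math.floor((cy - radius) / cell_size)
    max_row = math.floor((cy + radius) / cell_size)
    hits = []
    # Coarse filter: only visit cells that overlap the search bounding box.
    for col in range(min_col, max_col + 1):
        for row in range(min_row, max_row + 1):
            for i in grid.get((col, row), ()):
                x, y = points[i]
                # Exact filter: true distance test on the candidates only.
                if math.hypot(x - cx, y - cy) < radius:
                    hits.append(i)
    return sorted(hits)

points = [(0, 0), (300, 0), (0, 500), (5000, 5000)]
du = [1, 2, 4, 8]
grid = build_grid_index(points, cell_size=402)
near = query_radius(points, grid, 402, center=(0, 0), radius=402)
print(near, sum(du[i] for i in near))
# prints: [0, 1] 3
```

The far-away point at (5000, 5000) is never distance-tested, which is where the savings come from once the table has thousands of rows.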

@micahcochran thanks. I did break it down further and was able to identify that distance was the costly operation, which is itself a Shapely operation.

I was able to introduce a spatial index with the following:

# build the spatial index once, outside the per-row function
spatial_index = df['wkb_geometry'].sindex

def test_measure(row):
    geom = row['wkb_geometry'].buffer(402)

    # geom.bounds: a Shapely attribute that returns (minx, miny, maxx, maxy)
    bounds = geom.bounds
    # coarse filter: positions of rows whose bounding boxes intersect `bounds`
    possible_indices = list(spatial_index.intersection(bounds))

    # within: a Shapely operation that returns True where a candidate
    # intersects only the interior of `geom`
    possible_matches = df.iloc[possible_indices]
    within_geom = possible_matches['wkb_geometry'].within(geom)

    return possible_matches[within_geom]
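
The function above returns the matching rows rather than the summed du described under Goal. Below is a rough sketch of how the index lookup might be combined with the earlier centroid strategy to produce the sums directly. It assumes geopandas and shapely are installed, that coordinates are in a projected CRS measured in meters, and that the columns are named as in this thread. The function name `sum_du_within` and the three-point sample frame are hypothetical. Because it compares centroid distances instead of testing `within` against a buffer, its results can differ from the buffer version.

```python
import geopandas as gpd
from shapely.geometry import Point

def sum_du_within(df, radius=402, geom_col="wkb_geometry"):
    """For each row, sum `du` over rows whose centroid is closer than `radius`."""
    centroids = df[geom_col].centroid
    index = centroids.sindex  # built once, reused for every row
    du = df["du"].to_numpy()

    totals = []
    for center in centroids:
        # Coarse filter: centroids falling inside the square around `center`.
        box = (center.x - radius, center.y - radius,
               center.x + radius, center.y + radius)
        candidates = list(index.intersection(box))
        # Exact filter: true distance check on the candidates only.
        near = (centroids.iloc[candidates].distance(center) < radius).to_numpy()
        totals.append(du[candidates][near].sum().item())
    return totals

# Hypothetical sample data: two nearby points and one far away.
df = gpd.GeoDataFrame(
    {"du": [1, 2, 4],
     "wkb_geometry": [Point(0, 0), Point(300, 0), Point(5000, 5000)]},
    geometry="wkb_geometry",
)
print(sum_du_within(df))
```

Positional indexing (`iloc` and plain array lookups) is used throughout because the spatial index returns row positions, not index labels.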

Thanks all, I am going to close this issue as the "answer" I believe to my issue is that I was not implementing spatial indices correctly / at all.

The solution to improve the operation speed is to utilize a spatial index, either as @perrygeo mentioned by doing something like what sjoin does, or perhaps via the method I happened to post at the same time, above. The above is a first pass at something like that, so it may not be completely correct.
