One of the main use cases of pandas is for machine learning. An important step in ML is to standardize the dataset by subtracting the mean and dividing by the standard deviation. However, even simple arithmetic operations like this don't work for GeoDataFrames and GeoSeries:
import geopandas
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world = world - world.mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/geopandas/geodataframe.py", line 1140, in __sub__
    return self.geometry.difference(other)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/geopandas/base.py", line 523, in difference
    return _binary_geo("difference", self, other)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/geopandas/base.py", line 60, in _binary_geo
    geoms, index = _delegate_binary_method(op, this, other)
  File "/Users/Adam/.spack/.spack-env/view/lib/python3.8/site-packages/geopandas/base.py", line 49, in _delegate_binary_method
    raise TypeError(type(this), type(other))
TypeError: (<class 'geopandas.geoseries.GeoSeries'>, <class 'pandas.core.series.Series'>)
I would expect all of the same features of pandas to work with geopandas. In my mind, geopandas is a superset of pandas, not a subset.
geopandas.show_versions()
python : 3.8.6 (default, Dec 25 2020, 19:26:42) [Clang 12.0.0 (clang-1200.0.32.28)]
executable : /Users/Adam/.spack/.spack-env/view/bin/python
machine : macOS-10.15.7-x86_64-i386-64bit
GEOS : 3.8.1
GEOS lib : /Users/Adam/spack/opt/spack/darwin-catalina-x86_64/apple-clang-12.0.0/geos-3.8.1-vlrmv4vvnmfcvabpu6t4boks5fxtllko/lib/libgeos_c.dylib
GDAL : 3.2.0
GDAL data dir: /Users/Adam/.spack/.spack-env/view/share/gdal
PROJ : 7.1.0
PROJ data dir: /Users/Adam/.spack/.spack-env/view/share/proj
geopandas : 0.8.1
pandas : 1.2.0
fiona : 1.8.18
numpy : 1.19.4
shapely : 1.7.1
rtree : None
pyproj : 2.6.0
matplotlib : 3.3.3
mapclassify: None
geopy : None
psycopg2 : None
geoalchemy2: None
pyarrow : None
Thanks for raising this. It is true that in a situation like this, a GeoSeries should behave like any other non-numerical column and return NaN. The one thing blocking this now is a custom implementation of __sub__ that assumes geometric difference, but that has already been deprecated and will be removed.
The other thing we'll have to take care of is the __sub__ behaviour of GeometryArray, which should return an array of NaNs in this case.
The vast majority of pandas operations work, so you just found one of the few that cause an issue. Note that in your case, you would probably want to drop the geometry column anyway.
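As a rough illustration of the "drop the geometry column" suggestion, here is a minimal sketch. It assumes a small hand-made GeoDataFrame rather than the naturalearth_lowres dataset; the values are made up, and only the column names pop_est and gdp_md_est are borrowed from that dataset.

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

# Small hand-made GeoDataFrame; the values are illustrative only.
gdf = geopandas.GeoDataFrame(
    {"pop_est": [1.0, 2.0, 3.0], "gdp_md_est": [10.0, 20.0, 60.0]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
)

# Drop the active geometry column and make sure we hold a plain
# pandas DataFrame, so only numeric columns remain.
numeric = pd.DataFrame(gdf.drop(columns=gdf.geometry.name))

# Standardize: subtract the column means, divide by the standard deviations.
standardized = (numeric - numeric.mean()) / numeric.std()
```

The geometry is lost in this version, which is fine when only the numeric features are needed downstream.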
Thanks! The other thing I've noticed is that for many use cases, I would like the "geometry" to be the "index". This would allow the geometry column to be ignored during these kinds of numerical operations, but still be available for indexing later. However, shapely.geometry.Point and friends are no longer hashable: https://github.com/Toblerity/Shapely/issues/209. This means that whenever I use an external library like sklearn, I have to copy the index, pop the geometry, and add both back later. Not sure if this is a known limitation or worth opening a new issue for.
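For anyone following along, the workaround described above might look roughly like the sketch below. The data is made up, and the standardization step merely stands in for an external library call (such as an sklearn transformer) that returns a bare NumPy array; none of this is an official geopandas recipe.

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

# Illustrative data with a non-default index.
gdf = geopandas.GeoDataFrame(
    {"pop_est": [1.0, 2.0, 3.0], "gdp_md_est": [10.0, 20.0, 60.0]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
    index=["a", "b", "c"],
)

# Set the geometry aside; it keeps the index, so rows can be matched later.
geometry = gdf.geometry
features = pd.DataFrame(gdf.drop(columns=geometry.name))

# Stand-in for an external library call that returns a plain NumPy array
# and therefore loses the index and column labels.
scaled = ((features - features.mean()) / features.std()).to_numpy()

# Rebuild a GeoDataFrame from the array, the saved labels, and the geometry.
result = geopandas.GeoDataFrame(
    pd.DataFrame(scaled, index=features.index, columns=features.columns),
    geometry=geometry,
)
```

Because the geometry GeoSeries carries the original index, it is aligned with the rebuilt frame by label rather than by position.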
Regarding 'I would like the "geometry" to be the "index"': that should be possible with the upcoming shapely 2.0, which makes geometries hashable again, so it is just a matter of time.