Geopandas: drop_duplicates functionality

Created on 29 Aug 2017 · 10 Comments · Source: geopandas/geopandas

Currently, applying gdf.drop_duplicates() to a point shapefile gives:

File "pandas/_libs/hashtable_class_helper.pxi", line 1312, in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:22034)
TypeError: unhashable type: 'Point'
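
The traceback comes from pandas building a hash table over the column values, which requires every value to be hashable. As a rough illustration that does not depend on Shapely, the sketch below uses a hypothetical `FakePoint` class (not a real Shapely type) that defines `__eq__` without `__hash__`. Python then treats its instances as unhashable, and any hash-based deduplication fails with the same kind of error. Whether Shapely's `Point` was unhashable for exactly this reason is an assumption here.

```python
class FakePoint:
    """Hypothetical stand-in for a geometry; not a Shapely class."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    # Defining __eq__ without __hash__ sets __hash__ to None,
    # so instances cannot be placed in a set or used as dict keys.
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)


def try_dedupe(values):
    """Attempt a hash-based dedupe; return the error message on failure."""
    try:
        return list(set(values))
    except TypeError as exc:
        return str(exc)


result = try_dedupe([FakePoint(0, 0), FakePoint(0, 0)])
print(result)  # the message names FakePoint as an unhashable type
```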

bug good first issue

All 10 comments

@byezy Thanks, this is a known issue, but certainly would be useful to have this working!

Yes it would be great!

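Idea: something like this (typed off the top of my head, not checked in an IDE):

class GeoDataFrame(DataFrame):
    def drop_duplicates(self):
        # compare geometries through their string representation
        self["geom_str"] = self["geometry"].apply(lambda x: x.__repr__())
        # run the pandas method on every column except geometry
        subset = [c for c in self.columns if c != "geometry"]
        return super().drop_duplicates(subset=subset)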

Anyway just an idea, I expect you're a better pythonist than I ;-)

Cheers

Hack: this looks like it worked for me. It uses the __repr__() return value in a temporary column instead of the geometry object. Anyway, an idea for you.

```
if drop_duplicates:
    c0 = len(base_joined)
    base_joined["geom"] = base_joined["geometry"].apply(lambda x: x.__repr__())
    print("Cols are: {}".format(base_joined.columns.values))
    cmp_cols = list(base_joined.columns.values)
    cmp_cols.remove("geometry")
    print("Comp. Cols are: {}".format(cmp_cols))
    base_joined.drop_duplicates(inplace=True, subset=cmp_cols)
    base_joined.drop("geom", axis=1, inplace=True)
    print("Cols are: {}".format(list(base_joined.columns.values)))
    c1 = len(base_joined)
    print("\n{} duplicate records were dropped".format(c0-c1))

```

It will not be possible to get drop_duplicates from pandas working (at least not with current geopandas; we might be able to do this with the cython refactor).

But one idea would be to override the duplicated method on a GeoSeries with a custom implementation. That might already help for certain cases.

A naive solution would be:

[any(geom1.equals(geom2) for geom2 in gsrs.iloc[:i]) for i, geom1 in enumerate(gsrs)]

That is O(n^2) and would swamp almost immediately.

A solution could be hashing the geometries and doing a set op on the hashes. shapely doesn't provide that feature directly (because strong consistency would require immutability --- Toblerity/Shapely#209), but it wouldn't be hard to implement (since we can throw the hash away afterwards).

@jorisvandenbossche Is this the direction you want to go in? I'm a little confused as to what you mean by "That might already help for certain cases."
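
The hashing idea above can be sketched without touching Shapely's own `__hash__`: serialize each geometry to a hashable key, remember the keys already seen in a set, and flag repeats in a single O(n) pass. This is only a sketch under assumptions: `duplicated_by_key` is a hypothetical helper, not a geopandas or pandas function, and plain coordinate tuples stand in for geometries so the block runs on its own.

```python
def duplicated_by_key(items, key):
    """Flag every item whose key has already been seen (the first is kept).

    Hypothetical helper: one pass with a set instead of pairwise equals().
    """
    seen = set()
    mask = []
    for item in items:
        k = key(item)  # must be hashable, e.g. bytes or str
        mask.append(k in seen)
        seen.add(k)
    return mask


# Coordinate tuples stand in for geometries here; with real Shapely
# objects the key could be `lambda geom: geom.wkb`.
points = [(0.0, 0.0), (1.0, 2.0), (0.0, 0.0), (1.0, 2.0), (3.0, 4.0)]
mask = duplicated_by_key(points, key=lambda p: repr(p).encode())
unique_points = [p for p, dup in zip(points, mask) if not dup]

print(mask)
print(unique_points)
```

Because the comparison is made on the serialized key, two geometries count as duplicates only when they serialize identically.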

My solution to the issue was the following:

import shapely.wkb

# convert to WKB, which is hashable
df["geometry"] = df["geometry"].apply(lambda geom: geom.wkb)

df = df.drop_duplicates(["geometry"])

# convert back to shapely geometry
df["geometry"] = df["geometry"].apply(lambda geom: shapely.wkb.loads(geom))

I do not know whether it is a good solution or workaround. At least the geometry as WKB is hashable. I could maybe try to fix it, but I have not contributed to any project before, so I might need some guidance on that.

Or you can create the WKB and just mask by it; that would be faster than going back and forth ;-)
```
G = df["geometry"].apply(lambda geom: geom.wkb)
df = df.loc[G.drop_duplicates().index]
```

A caveat: this only works if geometries are pointwise equal, not merely topologically equal; these are distinct concepts in Shapely. Two geometries are pointwise equal if and only if their coordinates, strings, and rings are all stored in exactly the same way. Segments are often topologically equal but stored in reverse order: in things like polygon lattices, every polygon is wound clockwise, so a boundary segment shared by two neighbours is traversed in opposite directions and is not pointwise equal.

This can bite you in weird ways, especially since the ordering of components in a multi-* now matters.

Just, fyi.
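
A small sketch of that distinction, assuming Shapely is installed: the same segment stored in opposite directions is reported as equal by `equals()`, yet its WKB bytes differ, so a WKB-based drop_duplicates would keep both rows. The variable names are illustrative only.

```python
from shapely.geometry import LineString

# The same segment stored in opposite directions.
forward = LineString([(0, 0), (1, 1)])
backward = LineString([(1, 1), (0, 0)])

# Topological equality: both describe the same set of points.
topologically_equal = bool(forward.equals(backward))

# Pointwise (byte-level) equality: WKB keeps the coordinate order.
same_bytes = forward.wkb == backward.wkb

print(topologically_equal, same_bytes)
```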

# assumes the default integer index 0..n-1
for i in range(len(df)):
    df.loc[i, "geom_str"] = str(df.loc[i, "geometry"])

Please take a look at the following script:

from geopandas import GeoSeries
from shapely.geometry import Point

count_dropped = 0
count_not_dropped = 0
count_other = 0
for i in range(0,10):
    dups = GeoSeries([Point(0, 0), Point(0, 0)])
    dropped = dups.drop_duplicates()
    print(dups)
    print(dropped)
    print('\n')
    if len(dropped)==1:
        count_dropped = count_dropped+1
    elif len(dropped) == 2:
        count_not_dropped = count_not_dropped+1
    else:
        count_other = count_other+1

print('count dropped ', count_dropped)
print('count not dropped ', count_not_dropped)
print('count other ', count_other)

The output should be:

0    POINT (0 0)
1    POINT (0 0)
dtype: object
0    POINT (0 0)
dtype: object


0    POINT (0 0)
1    POINT (0 0)
dtype: object
0    POINT (0 0)
1    POINT (0 0)
dtype: object


0    POINT (0 0)
1    POINT (0 0)
dtype: object
0    POINT (0 0)
1    POINT (0 0)
dtype: object


0    POINT (0 0)
1    POINT (0 0)
dtype: object
0    POINT (0 0)
1    POINT (0 0)
dtype: object


count dropped  1
count not dropped  3
count other  0

Process finished with exit code 0

As you can see, the "duplicate" is dropped in the first iteration but not in the following iterations. This affects the test test_drop_duplicates_series. I uncovered this in https://github.com/geopandas/geopandas/issues/1010, where I realized that the AppVeyor build was showing xpassed for that test. Running the test suite on my local machine replicates the behavior exhibited in the script above: the runs unpredictably xpass and xfail for that test.

Does this have anything to do with floating point precision? I'm not sure.
