I admit I haven't looked at the code so there may be reasons for this, but I've found myself in the need of squeezing out duplicates from a Series but keeping the results as a Series.
Series.unique() however returns an array, so in my code I have to construct a Series twice:
```python
series = pandas.Series([1, 2, 3, 3])
series = pandas.Series(series.unique())
```
Is this by design? If so, feel free to close this bug.
Well it's a good question. I guess the main issue is what index you should assign (default 0 to N-1 would be the only reasonable one probably, otherwise the index values where the unique values occurred).
On Monday, 17 September 2012 at 05:35:34, Wes McKinney wrote:
assign (default 0 to N-1 would be the only reasonable one probably,
otherwise the index values where the unique values occurred).
I don't have strong opinions on either; any would be a very good improvement over the current behavior IMO.
Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79
I think the second option (indices of the unique entries) would be helpful, since it is easy to reset the index to 0...n-1 afterwards, but much more expensive to recover the indices where the values occur if I'm interested in them.
But if you added this, users would probably then ask for an option to specify whether to get the index of the first, last, or whatever occurrence of the unique values ;-)
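As a sketch of how first-occurrence indices can already be recovered with NumPy (this uses `np.unique(return_index=True)`, not any pandas API proposed in this thread):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 3, 2])

# np.unique with return_index=True returns the sorted unique values
# and the position of the first occurrence of each one.
values, first_pos = np.unique(s.values, return_index=True)
print(values)     # [1 2 3]
print(first_pos)  # [0 1 2]

# Reattaching the original index labels yields a Series indexed by
# where each unique value first appeared.
unique_series = pd.Series(values, index=s.index[first_pos])
```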
I agree it would be helpful, but it's more expensive to compute. I'll have to think about it.
I don't think unique should return a Series with a meaningless integer index, seems harmful/confusing if the original Series also had an integer index.
As for computing the indices of occurrences, how about an optional return_loc parameter that is None by default (returns an ndarray) and can be "all", "first", or "last"?
Like, how about this:

```python
s.unique()               # no index (ndarray, as now)
s.unique(index='first')  # Series
s.unique(index='last')   # Series
```
yeah, exactly what I was thinking
s.unique() --> keep the method as it is, a faster alternative to np.unique(), with no index.
Add drop_duplicates() to Series:
s.drop_duplicates(take_last=...) --> Series, with index behavior like DataFrame.drop_duplicates().
That's not a bad idea either
Maybe drop_duplicates to get the first or last occurrence, and then a separate method to get a reverse mapping of all indices for each unique value?
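As an illustration of the "reverse mapping" idea using the existing groupby machinery (a sketch, not a proposed API):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3], index=['a', 'b', 'c', 'd'])

# Grouping the Series by its own values maps each unique value
# to the index labels where it occurs.
reverse_map = {value: list(idx) for value, idx in s.groupby(s).groups.items()}
print(reverse_map)  # {1: ['a'], 2: ['b'], 3: ['c', 'd']}
```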
See DataFrame.duplicated, which returns a boolean array.
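For reference, Series.duplicated returns a boolean mask, so dropping duplicates while keeping the original index labels can be written as:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3])

mask = s.duplicated()   # False for first occurrences, True for repeats
print(mask.tolist())    # [False, False, False, True]

deduped = s[~mask]      # keeps the original index labels 0, 1, 2
print(deduped.tolist()) # [1, 2, 3]
```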
At the risk of appearing to re-open something that is dead, I thought I'd just summarize for any newcomers to this page:
series.unique was left unchanged (it returns a numpy ndarray).
drop_duplicates and duplicated were added to Series.
So, if you wanted to do something like the OP and wanted to keep the indices, you'd perform:
```python
series = pandas.Series([1, 2, 3, 3]).drop_duplicates(keep='first')
```

(keep='first' retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains none of the duplicates. keep='first' is the default.)
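A quick illustration of all three keep options (assuming a pandas version where keep has replaced the older take_last parameter):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3])

first = s.drop_duplicates(keep='first')  # keeps index labels 0, 1, 2
last = s.drop_duplicates(keep='last')    # keeps index labels 0, 1, 3
none = s.drop_duplicates(keep=False)     # keeps 0, 1 -- both 3s dropped

print(list(first.index), list(last.index), list(none.index))
```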
In case anyone feels like writing a PR: the pandas.Series.unique docs don't mention what @lazarillo says above and could use an extra line referencing pandas.Series.drop_duplicates (and possibly vice versa).