I admit I haven't looked at the code so there may be reasons for this, but I've found myself in the need of squeezing out duplicates from a Series but keeping the results as a Series.
Series.unique() however returns an array, so in my code I have to construct a Series twice:
```python
series = pandas.Series([1, 2, 3, 3])
series = pandas.Series(series.unique())
```
Is this by design? If so, feel free to close this bug.
Well it's a good question. I guess the main issue is what index you should assign (default 0 to N-1 would be the only reasonable one probably, otherwise the index values where the unique values occurred).
On Monday, 17 September 2012 at 05:35:34, Wes McKinney wrote:
assign (default 0 to N-1 would be the only reasonable one probably,
otherwise the index values where the unique values occurred).
I don't have strong opinions on either; any would be a very good improvement over the current behavior IMO.
Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79
I think the second option (indices of the unique entries) would be helpful, since it is easy to reset the index to 0...n-1 afterwards, but much more expensive to recover the indices where the values occur if I'm interested in them.
But if you added this, users would probably then ask for an option to specify whether to get the index of the first, last, or whatever occurrence of the unique values ;-)
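As a sketch of how first-occurrence indices can already be recovered with NumPy (this uses `np.unique(return_index=True)`, not any pandas API proposed in this thread):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 3, 2])

# np.unique with return_index=True returns the sorted unique values
# and the position of the first occurrence of each one.
values, first_pos = np.unique(s.values, return_index=True)
print(values)     # [1 2 3]
print(first_pos)  # [0 1 2]

# Reattaching the original index labels yields a Series indexed by
# where each unique value first appeared.
unique_series = pd.Series(values, index=s.index[first_pos])
```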
I agree it would be helpful, but it's more expensive to compute. I'll have to think about it.
I don't think unique should return a Series with a meaningless integer index, seems harmful/confusing if the original Series also had an integer index.
As for computing the indices of occurrences, how about an optional return_loc parameter that is None by default (returns an ndarray) and can be "all", "first", or "last"?
Like, how about this:

```python
s.unique()               # no index (ndarray, as now)
s.unique(index='first')  # Series
s.unique(index='last')   # Series
```
yeah, exactly what I was thinking
s.unique() --> keep the method as it is, a faster alternative to np.unique(), with no index.
Add drop_duplicates() to Series:
s.drop_duplicates(take_last=...) --> Series, with index behavior like DataFrame.drop_duplicates().
That's not a bad idea either
Maybe drop_duplicates to get the first or last occurrence, and then a separate method to get a reverse mapping of all indices for each unique value?
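As an illustration of the "reverse mapping" idea using the existing groupby machinery (a sketch, not a proposed API):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3], index=['a', 'b', 'c', 'd'])

# Grouping the Series by its own values maps each unique value
# to the index labels where it occurs.
reverse_map = {value: list(idx) for value, idx in s.groupby(s).groups.items()}
print(reverse_map)  # {1: ['a'], 2: ['b'], 3: ['c', 'd']}
```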
See DataFrame.duplicated, which returns a boolean array.
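For reference, Series.duplicated returns a boolean mask, so dropping duplicates while keeping the original index labels can be written as:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3])

mask = s.duplicated()   # False for first occurrences, True for repeats
print(mask.tolist())    # [False, False, False, True]

deduped = s[~mask]      # keeps the original index labels 0, 1, 2
print(deduped.tolist()) # [1, 2, 3]
```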
At the risk of appearing to re-open something that is dead, I thought I'd just summarize for any newcomers to this page:
series.unique was left unchanged (it returns a numpy ndarray).
drop_duplicates and duplicated were added to Series.
So, if you wanted to do something like the OP and wanted to keep the indices, you'd perform:
```python
series = pandas.Series([1, 2, 3, 3]).drop_duplicates(keep='first')
```

(keep='first' retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains none of the duplicates. keep='first' is the default.)
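A quick illustration of all three keep options (assuming a pandas version where keep has replaced the older take_last parameter):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3])

first = s.drop_duplicates(keep='first')  # keeps index labels 0, 1, 2
last = s.drop_duplicates(keep='last')    # keeps index labels 0, 1, 3
none = s.drop_duplicates(keep=False)     # keeps 0, 1 -- both 3s dropped

print(list(first.index), list(last.index), list(none.index))
```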
In case anyone feels like writing a PR: the pandas.Series.unique docs don't mention what @lazarillo says above and could use an extra line referencing pandas.Series.drop_duplicates (and possibly vice versa).