import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 10, 8, 11, 8],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
print('_________')
print(df.nlargest(10,['a','b']))
DataFrame's nlargest returns incorrect results when there are tied ranks. As shown below, the rows with index 2 and 4 appear repeatedly.
    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
4   8  e  4.0
2   8  d  NaN
4   8  e  4.0
0   1  a  1.0
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
What exactly is the problem? Show the full pd.show_versions() output as well.
This is exactly what should be reported.
In [9]: df.nlargest(10, columns=['a', 'b'])
TypeError: Column 'b' has dtype object, cannot use method 'nlargest' with this dtype
Yeah, so it appears that it doesn't handle dtypes that are "objects" since it wasn't built to handle sorting strings.
>>> df.dtypes
a      int64
b     object
c    float64
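For what it's worth, nlargest works here as long as only numeric columns are passed as keys; it is specifically the object-dtype column 'b' that triggers the TypeError. A minimal sketch against the df defined above:
# nlargest restricted to numeric key columns -- no TypeError
print(df.nlargest(10, 'a'))          # int64 key works
print(df.nlargest(10, ['a', 'c']))   # int64 + float64 keys work too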
@jreback @lesley2958
No, the group-by key list contains duplicates; deduplicating the index works around it:
tmp = df.nlargest(10,['a','b']).index.unique()
print(df.loc[tmp])
output:
    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
4   8  e  4.0
0   1  a  1.0
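An equivalent dedup (a sketch, not from the thread; it only applies on 0.19.2, where the call still runs) filters the duplicated index labels directly:
res = df.nlargest(10, ['a', 'b'])
print(res.loc[~res.index.duplicated(keep='first')])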
What might be a reason for confusion here (depending on the version @flystarhe is using) is that there is a difference between 0.19.2 and 0.20:
In [6]: df.nlargest(10,['a','b'])
Out[6]:
    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
4   8  e  4.0
2   8  d  NaN
4   8  e  4.0
0   1  a  1.0
In [7]: pd.__version__
Out[7]: '0.19.2'
In [57]: df.nlargest(10,['a','b'])
...
TypeError: Column 'b' has dtype object, cannot use method 'nlargest' with this dtype
In [58]: pd.__version__
Out[58]: '0.21.0.dev+19.g69a5d6f.dirty'
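Given the behavior change between these versions, a version-agnostic fallback (a sketch, not from the thread) is to catch the TypeError and do a full sort instead; note that on 0.19.2 the nlargest branch would still return the buggy duplicated result:
try:
    res = df.nlargest(10, ['a', 'b'])
except TypeError:
    # newer pandas raises for object-dtype key columns
    res = df.sort_values(['a', 'b'], ascending=False).head(10)
print(res)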
That said, the output of 0.19.2 also seems wrong (even if object columns were allowed), but that appears to have been fixed: https://github.com/pandas-dev/pandas/issues/15297
Given that the following methods that rely on order work for object dtype:
In [62]: pd.Series(['a', 'b', 'd', 'c']).sort_values()
Out[62]:
0    a
1    b
3    c
2    d
dtype: object
In [63]: pd.Series(['a', 'b', 'd', 'c']).max()
Out[63]: 'd'
you could also say nlargest should work for object dtype.
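For illustration, here is what an nlargest over object dtype could amount to, emulated with a full descending sort (less efficient than a true nlargest, which can use a partial sort); a sketch only:
s = pd.Series(['a', 'b', 'd', 'c'])
print(s.sort_values(ascending=False).head(2))   # the two "largest" strings: 'd', 'c'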
@jorisvandenbossche But that doesn't solve the problem, because nlargest has a different efficiency profile and computational approach.
This already raised for Series in 0.19.2, but not for DataFrame. The behavior was unified to disallow object columns generally, adopting the Series behavior (and, of course, fixing the actual duplication issue).
xref #15299
In [3]: Series(list('abc')).nlargest(1)
TypeError: Cannot use method 'nlargest' with dtype object
@flystarhe your question is not clear
These might be what you want
In [5]: df.groupby(['a', 'b']).nth([0, 1, 2])
Out[5]:
         c
a  b
1  a   1.0
8  d   NaN
   e   4.0
10 b   2.0
11 c   3.0
In [6]: df.sort_values(['a', 'b']).groupby(['a', 'b']).head(10)
Out[6]:
    a  b    c
0   1  a  1.0
2   8  d  NaN
4   8  e  4.0
1  10  b  2.0
3  11  c  3.0
But that doesn't solve the problem, because nlargest has a different efficiency profile and computational approach.
It was not meant to solve your problem (which you should try to explain better); I was just giving a possible reason to allow nlargest on object columns. But since we also raise for Series, I don't think we are going to change this.
Closing as the current behavior is intended and correct.
Better still, since it handles string data, use df[''].value_counts().nlargest.
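A sketch of that value_counts() route on the example df: the counts are int64, so nlargest applies even though the underlying column is object dtype.
print(df['b'].value_counts().nlargest(3))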