Is your feature request related to a problem? Please describe.
We need value_counts for StringColumn. The pandas equivalent looks like this:
import pandas as pd
df = pd.DataFrame({"str_col": ['a', 'a', 'b', 'c']})
df["str_col"].value_counts()
Output
a 2
c 1
b 1
Name: str_col, dtype: int64
Describe alternatives you've considered
As a workaround, we can do a count aggregation on a groupby.
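For reference, here is a minimal sketch of that workaround. It uses pandas as a stand-in, since the example above is written against pandas; the assumption is that the same groupby call chain carries over to a cuDF DataFrame, which is not verified here.

```python
import pandas as pd

df = pd.DataFrame({"str_col": ["a", "a", "b", "c"]})

# Count the rows in each group, then order the counts from largest to
# smallest, which is the order value_counts() reports them in.
counts = df.groupby("str_col").size().sort_values(ascending=False)
print(counts)
```

The result is a Series indexed by the distinct strings. The relative order of tied counts (here b and c) is not guaranteed.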
Additional context
The current implementation of .value_counts() is a numba kernel; it may be better to use a groupby under the hood for this in general.
@VibhuJawa a naive (and likely non-optimal) way to do what Keith suggested might be something like:
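import cudf
import numpy as np

def value_counts(data, col):
    # Temporary int8 column so that count() has something to aggregate.
    data['temp'] = cudf.utils.cudautils.zeros(len(data), np.int8)
    res = data.groupby(col).count().sort_values(by='temp', ascending=False)
    res = res.rename({'temp':col})
    # Drop the helper column from the caller's DataFrame again.
    del data['temp']
    return res[col]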
You should be able to do data['temp'] = np.dtype('int8').type(0) instead of manually constructing your device array.
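A small sketch of what that scalar assignment does, shown with pandas as a stand-in (an assumption; the comment above is about a cuDF DataFrame, where the same broadcast is expected but not verified here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"str_col": ["a", "a", "b", "c"]})

# np.dtype('int8').type is the scalar class np.int8, so this is np.int8(0).
zero = np.dtype("int8").type(0)

# Assigning a scalar to a new column broadcasts it to every row, so no
# zero-filled array has to be built by hand.
df["temp"] = zero
print(df)
```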
Assigning to @galipremsagar. Thanks for picking this up.