cuDF: [FEA] value_counts for StringColumn

Created on 7 Jun 2019 · 4 Comments · Source: rapidsai/cudf

Is your feature request related to a problem? Please describe.

Need value_counts for StringColumn

import pandas as pd
df = pd.DataFrame({"str_col":['a','a','b','c']})
df["str_col"].value_counts()

Output

a    2
c    1
b    1
Name: str_col, dtype: int64

Describe alternatives you've considered
As a workaround, we can do a count aggregation on a groupby.
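
A minimal sketch of that groupby-based workaround, for illustration only. It uses pandas as a stand-in so that it runs without a GPU; the assumption (not verified in this thread) is that swapping the import for cudf gives the same result on a string column.

```python
import pandas as pd  # stand-in; assumption: cudf accepts the same calls

df = pd.DataFrame({"str_col": ["a", "a", "b", "c"]})

# Group equal strings together, count the rows in each group,
# then sort by count (largest first) to mimic value_counts().
counts = df.groupby("str_col").size().sort_values(ascending=False)
print(counts)
```

The order of values that share the same count is not guaranteed to match value_counts().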

cuDF (Python) feature request

All 4 comments

The current implementation of .value_counts() uses a numba kernel; it may be better to just use a groupby under the hood for that in general.

@VibhuJawa a naive (and likely non-optimal) way to do what Keith suggested might be something like:

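import cudf
import numpy as np

def value_counts(data, col):
    # Temporary constant column so that count() has something to count per group.
    data['temp'] = cudf.utils.cudautils.zeros(len(data), np.int8)
    res = data.groupby(col).count().sort_values(by='temp', ascending=False)
    # Rename the counts column (not the index labels) to the original column name.
    res = res.rename(columns={'temp': col})
    del data['temp']
    return res[col]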

You should be able to do data['temp'] = np.dtype('int8').type(0) instead of manually constructing your device array.
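
A sketch of what the scalar-assignment variant could look like. The function name value_counts_scalar is hypothetical, and pandas is used as a stand-in so the example runs without a GPU; that cudf broadcasts the scalar in the same way is what the comment above suggests, not something checked here. This version also works on a copy, so the caller's DataFrame is never modified.

```python
import numpy as np
import pandas as pd  # stand-in; assumption: cudf behaves the same here


def value_counts_scalar(data, col):
    """Hypothetical helper: count occurrences of each value in data[col]."""
    # Work on a one-column copy so the input frame is left untouched.
    tmp = data[[col]].copy()
    # Assigning a scalar broadcasts it to every row; no array is built by hand.
    tmp["temp"] = np.dtype("int8").type(0)
    res = tmp.groupby(col).count().sort_values(by="temp", ascending=False)
    # Return the counts as a Series named after the original column.
    return res["temp"].rename(col)


df = pd.DataFrame({"str_col": ["a", "a", "b", "c"]})
result = value_counts_scalar(df, "str_col")
print(result)
```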

Assigning to @galipremsagar. Thanks for picking this up.
