Cudf: [FEA] Improve `unique_count` to count the number of unique rows in a table

Created on 29 May 2020 · 7 comments · Source: rapidsai/cudf

Is your feature request related to a problem? Please describe.

libcudf has a unique_count function, but it only works on individual columns.

Describe the solution you'd like

I want a function that tells me the number of unique rows in a table.

Labels: feature request, libcudf

All 7 comments

So it looks like the internal unique_count function _already_ works on tables, but the top-level API only accepts a column, which is wrapped in a table.

What is driving this need (so I can prioritize)?

> What is driving this need (so I can prioritize)?

This allows computing the output size (in # of rows) of groupby without requiring a special groupby API.
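To make the groupby connection concrete, here is a minimal host-side sketch in plain C++. It does not use libcudf: `Column`, `Table`, `row_at`, `unique_row_count`, and `groupby_sum` are hypothetical stand-ins invented for illustration. It only shows that the number of distinct key rows equals the number of rows a groupby aggregation would produce.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical host-side stand-ins, NOT the libcudf column_view/table_view types.
using Column = std::vector<int>;
using Table  = std::vector<Column>;  // column-major: one vector per key column

// Gathers the values of row r across all key columns.
std::vector<int> row_at(Table const& keys, std::size_t r)
{
  std::vector<int> row;
  for (Column const& col : keys) { row.push_back(col[r]); }
  return row;
}

// Counts distinct rows by inserting every row into an ordered set.
std::size_t unique_row_count(Table const& keys)
{
  if (keys.empty()) { return 0; }
  std::size_t const num_rows = keys.front().size();
  for (Column const& col : keys) { assert(col.size() == num_rows); }

  std::set<std::vector<int>> seen;
  for (std::size_t r = 0; r < num_rows; ++r) { seen.insert(row_at(keys, r)); }
  return seen.size();
}

// A toy groupby-sum: one output entry per distinct key row.
std::map<std::vector<int>, long> groupby_sum(Table const& keys, Column const& values)
{
  for (Column const& col : keys) { assert(col.size() == values.size()); }

  std::map<std::vector<int>, long> sums;
  for (std::size_t r = 0; r < values.size(); ++r) { sums[row_at(keys, r)] += values[r]; }
  return sums;
}
```

With key columns `{1, 1, 2, 2, 1}` and `{10, 10, 20, 30, 10}`, both `unique_row_count(keys)` and the size of the `groupby_sum` result are 3, so the count can be used to size the output before running the aggregation.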

> So it looks like the internal unique_count function _already_ works on tables, but the top-level API only accepts a column, which is wrapped in a table.

A table API was not required earlier: pandas DataFrame.nunique calculates a _per-column unique_count_.

Would simply exposing the detail API and a public API on table be enough?
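The difference between a per-column count and a row-wise count can be sketched on the host in plain C++. Nothing here is libcudf or pandas API: `Column`, `Table`, `per_column_unique_counts`, and `unique_row_count` are hypothetical names used only to illustrate why the per-column numbers do not determine the number of unique rows.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical host-side stand-ins, NOT libcudf types.
using Column = std::vector<int>;
using Table  = std::vector<Column>;  // column-major: one vector per column

// One distinct-value count per column (the per-column behaviour described above).
std::vector<std::size_t> per_column_unique_counts(Table const& table)
{
  std::vector<std::size_t> counts;
  for (Column const& col : table) {
    counts.push_back(std::set<int>(col.begin(), col.end()).size());
  }
  return counts;
}

// A single count of distinct rows across all columns (the requested behaviour).
std::size_t unique_row_count(Table const& table)
{
  if (table.empty()) { return 0; }
  std::size_t const num_rows = table.front().size();

  std::set<std::vector<int>> seen;
  for (std::size_t r = 0; r < num_rows; ++r) {
    std::vector<int> row;
    for (Column const& col : table) {
      assert(col.size() == num_rows);  // all columns must have the same length
      row.push_back(col[r]);
    }
    seen.insert(row);
  }
  return seen.size();
}
```

For the columns `{1, 1, 2, 2}` and `{10, 20, 10, 20}`, each column has 2 distinct values but the table has 4 distinct rows. The columns `{1, 1, 2, 2}` and `{10, 10, 20, 20}` give the same per-column counts with only 2 distinct rows, so the table-level count has to be computed on whole rows.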

> A table API was not required earlier: pandas DataFrame.nunique calculates a _per-column unique_count_.

Remember that libcudf is not Pandas.

> Would simply exposing the detail API and a public API on table be enough?

Changing the existing column_view public API to work on table_view is what I suggest.

@jrhemstad do you need the null_policy::INCLUDE/EXCLUDE and nan_policy::NAN_IS_NULL/NAN_IS_VALID functionality for this public API as well?
(Implementing this for table_view needs some extra code and special cases.)

If they are not required, exposing the detail API would be the better option:

cudf::size_type unique_count(table_view const& keys,
                             null_equality nulls_equal = null_equality::EQUAL)
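As a rough illustration of what the `nulls_equal` parameter in the signature above would control, here is a host-side sketch in plain C++17. It is not libcudf code: `NullEquality`, `NullableRow`, and this `unique_row_count` are hypothetical, and the sketch assumes that with nulls treated as unequal, a row containing a null in any key column never matches another row.

```cpp
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical mirror of a null-equality option; not the libcudf enum itself.
enum class NullEquality { EQUAL, UNEQUAL };

// A row of nullable integer keys; std::nullopt models a null. Rows are stored
// row-major here purely to keep the sketch short.
using NullableRow = std::vector<std::optional<int>>;

std::size_t unique_row_count(std::vector<NullableRow> const& rows,
                             NullEquality nulls_equal = NullEquality::EQUAL)
{
  std::set<NullableRow> seen;         // rows that can match other rows
  std::size_t never_equal_rows = 0;   // rows that can never match any other row

  for (NullableRow const& row : rows) {
    bool has_null = false;
    for (auto const& value : row) {
      if (!value.has_value()) { has_null = true; }
    }
    if (has_null && nulls_equal == NullEquality::UNEQUAL) {
      // Assumed semantics: a null is unequal to everything, including another
      // null, so each such row counts as its own distinct row.
      ++never_equal_rows;
    } else {
      seen.insert(row);  // std::optional compares nullopt == nullopt as equal
    }
  }
  return seen.size() + never_equal_rows;
}
```

For the rows `(1, null)`, `(1, null)`, `(2, 5)`, `(2, 5)`, this yields 2 with `NullEquality::EQUAL` and 3 with `NullEquality::UNEQUAL`. Whether a groupby treats null keys as one group is exactly the kind of option that would need to match for the count to predict the groupby output size.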

> @jrhemstad do you need the null_policy::INCLUDE/EXCLUDE and nan_policy::NAN_IS_NULL/NAN_IS_VALID functionality for this public API as well?
> (Implementing this for table_view needs some extra code and special cases.)
>
> If they are not required, exposing the detail API would be the better option:
>
> cudf::size_type unique_count(table_view const& keys,
>                              null_equality nulls_equal = null_equality::EQUAL)

We need whatever options are required such that unique_count can be used to compute the output size (in number of rows) of any groupby operation.
