Cudf: [BUG] Sort_values should use libcudf for gathering columns in dataframes

Created on 28 Jun 2019  路  5Comments  路  Source: rapidsai/cudf

Sorting a dataframe currently requires looping through the columns in Python. We should replace this loop:

https://github.com/rapidsai/cudf/blob/28888ade95e4d2bc55c7f0002ad674e9c03b1e76/python/cudf/dataframe/dataframe.py#L1520-L1525

with libcuDF functionality to gather all the columns in libcuDF instead of in Python.

bug cuDF (Python)

Most helpful comment

I've also been thinking that the current gdf_order_by should be renamed to something like cudf::sorted_order where it returns the indices of the sorted order.

Then we should add a new API, cudf::sort, that will actually materialize the sorted table (basically cudf::sorted_order + cudf::gather)

All 5 comments

I've also been thinking that the current gdf_order_by should be renamed to something like cudf::sorted_order where it returns the indices of the sorted order.

Then we should add a new API, cudf::sort, that will actually materialize the sorted table (basically cudf::sorted_order + cudf::gather)

I'm happy writing a new test for this, but I was thinking that since this is primarily a refactoring that the existing tests would cover it.

I've been using cudf/tests/test_groupby.py::test_groupby_sort to validate my changes (I introduced a specific breaking change, watched the test fail expectedly, made my changes, watched the test pass again). Does that seem adequate or are there better tests, or even additional tests needed?

I don't know what test_groupby_sort is doing, because we don't currently have a sort-based groupby. Perhaps exercising the legacy groupby?

I'd think that there would be other "sort_values" tests that exercise just the sort functionality.

libcudf doesn't provide any higher level abstraction than columns. It sounds like you're asking for a Matrix or DataFrame level abstraction in libcudf, is that right @beckernick?

We've got a table with is a DF abstraction in libcudf. @beckernick's ask is that cudf::gather is used after calling gdf_order_by when materializing the result for sort_values.

Was this page helpful?
0 / 5 - 0 ratings