Sorting a dataframe currently requires looping through the columns in Python. We should replace this loop:
with libcuDF functionality to gather all the columns in libcuDF instead of in Python.
I've also been thinking that the current gdf_order_by should be renamed to something like cudf::sorted_order where it returns the indices of the sorted order.
Then we should add a new API, cudf::sort, that will actually materialize the sorted table (basically cudf::sorted_order + cudf::gather)
I'm happy writing a new test for this, but I was thinking that since this is primarily a refactoring that the existing tests would cover it.
I've been using cudf/tests/test_groupby.py::test_groupby_sort to validate my changes (I introduced a specific breaking change, watched the test fail expectedly, made my changes, watched the test pass again). Does that seem adequate or are there better tests, or even additional tests needed?
I don't know what test_groupby_sort is doing, because we don't currently have a sort-based groupby. Perhaps exercising the legacy groupby?
I'd think that there would be other "sort_values" tests that exercise just the sort functionality.
libcudf doesn't provide any higher level abstraction than columns. It sounds like you're asking for a Matrix or DataFrame level abstraction in libcudf, is that right @beckernick?
We've got a table with is a DF abstraction in libcudf. @beckernick's ask is that cudf::gather is used after calling gdf_order_by when materializing the result for sort_values.
Most helpful comment
I've also been thinking that the current
gdf_order_byshould be renamed to something likecudf::sorted_orderwhere it returns the indices of the sorted order.Then we should add a new API,
cudf::sort, that will actually materialize the sorted table (basicallycudf::sorted_order+cudf::gather)