Cudf: [BUG] Sort_values should use libcudf for gathering columns in dataframes

Created on 28 Jun 2019 · 5Comments · Source: rapidsai/cudf

Sorting a dataframe currently requires looping through the columns in Python. We should replace this loop:

https://github.com/rapidsai/cudf/blob/28888ade95e4d2bc55c7f0002ad674e9c03b1e76/python/cudf/dataframe/dataframe.py#L1520-L1525

with libcuDF functionality to gather all the columns in libcuDF instead of in Python.

bug cuDF (Python)

Source

beckernick

Most helpful comment

I've also been thinking that the current gdf_order_by should be renamed to something like cudf::sorted_order where it returns the indices of the sorted order.

Then we should add a new API, cudf::sort, that will actually materialize the sorted table (basically cudf::sorted_order + cudf::gather)

jrhemstad on 29 Jun 2019

👍2

All 5 comments

I've also been thinking that the current gdf_order_by should be renamed to something like cudf::sorted_order where it returns the indices of the sorted order.

Then we should add a new API, cudf::sort, that will actually materialize the sorted table (basically cudf::sorted_order + cudf::gather)

jrhemstad on 29 Jun 2019

👍2

I'm happy writing a new test for this, but I was thinking that since this is primarily a refactoring that the existing tests would cover it.

I've been using cudf/tests/test_groupby.py::test_groupby_sort to validate my changes (I introduced a specific breaking change, watched the test fail expectedly, made my changes, watched the test pass again). Does that seem adequate or are there better tests, or even additional tests needed?

rlratzel on 11 Jul 2019

I don't know what test_groupby_sort is doing, because we don't currently have a sort-based groupby. Perhaps exercising the legacy groupby?

I'd think that there would be other "sort_values" tests that exercise just the sort functionality.

jrhemstad on 11 Jul 2019

libcudf doesn't provide any higher level abstraction than columns. It sounds like you're asking for a Matrix or DataFrame level abstraction in libcudf, is that right @beckernick?

thomcom on 11 Jul 2019

We've got a table with is a DF abstraction in libcudf. @beckernick's ask is that cudf::gather is used after calling gdf_order_by when materializing the result for sort_values.

jrhemstad on 11 Jul 2019

Was this page helpful?

0 / 5 - 0 ratings