Cudf: [FEA] Multi-column quantiles

Created on 30 Jan 2020 · 15 Comments · Source: rapidsai/cudf

Is your feature request related to a problem? Please describe.

I would like to be able to compute "quantiles" that consider entire rows of a table.

For example, if I request the 0.5 quantile for a row-level quantile, I want the row that sits at the midpoint between the first and last row when the table is sorted lexicographically.
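To make the request concrete, here is a minimal sketch in plain C++ (standard library only, not libcudf). The `Row` alias and `row_quantile_lower` are hypothetical names for illustration, and rounding the position down is just one possible choice:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical row type standing in for one row of a two-column table.
using Row = std::tuple<int, std::string>;

// Returns the row at quantile q (0 <= q <= 1) of the lexicographically
// sorted table, rounding the fractional position down.
Row row_quantile_lower(std::vector<Row> rows, double q) {
  assert(!rows.empty() && q >= 0.0 && q <= 1.0);
  std::sort(rows.begin(), rows.end());  // std::tuple compares lexicographically
  auto const pos = q * static_cast<double>(rows.size() - 1);
  return rows[static_cast<std::size_t>(pos)];
}
```

The result is always a row that actually exists in the table, which is the property I am after.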

The existing cudf::quantiles API works on a table_view, but this is potentially misleading because it computes the requested quantile _independently_ for each column.
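The following sketch shows why the per-column behavior can mislead. It is plain C++ rather than libcudf, and `column_quantile_lower` is a hypothetical helper that handles one column in isolation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sorts one column on its own and returns the element at quantile q,
// rounding the position down. Other columns are never consulted.
template <typename T>
T column_quantile_lower(std::vector<T> column, double q) {
  assert(!column.empty() && q >= 0.0 && q <= 1.0);
  std::sort(column.begin(), column.end());
  auto const pos = q * static_cast<double>(column.size() - 1);
  return column[static_cast<std::size_t>(pos)];
}
```

For a table with rows (1, "z"), (2, "a"), (3, "m"), the 0.5 quantile of the first column is 2 and of the second column is "m". Together they form (2, "m"), which is not a row of the table.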

Describe the solution you'd like

The existing cudf::quantiles API should be refactored to only work on a single column at a time to prevent confusion:

std::unique_ptr<scalar> quantiles(column_view const& input, double q, interpolation interp = interpolation::LINEAR, order_info column_order = {});

Then, a new table level API should be created that does what I described above:

std::unique_ptr<table> quantiles(table_view const& input, std::vector<double> q, row_interpolation interp, std::vector<order_info> column_order);

This is different from the previous API in that it allows the user to request multiple quantiles at once, and returns a table containing the rows at the specified quantiles. (The column API could also be refactored to allow requesting multiple quantiles at once, but that is a minor issue).

The table-level API probably needs a different set of options for interpolation, because it doesn't really make sense to try to do linear interpolation on rows, especially when one of the columns can be strings. I think only LOWER, HIGHER, MIDPOINT, and NEAREST make sense for rows, but I could be mistaken.
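As a rough sketch of how the row-selecting options could map a quantile to a row index, assuming the fractional position is `q * (num_rows - 1)`: the enum members and `quantile_row_index` below are hypothetical, since `row_interpolation` is not defined yet. MIDPOINT is left out of the sketch because it would need to average two rows, and the tie-breaking for NEAREST (half rounds away from zero) is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical enum; the proposal does not fix the members of row_interpolation.
enum class row_interpolation { LOWER, HIGHER, NEAREST };

// Maps quantile q to the index of a row in the sorted table.
std::size_t quantile_row_index(std::size_t num_rows, double q, row_interpolation interp) {
  assert(num_rows > 0 && q >= 0.0 && q <= 1.0);
  double const pos = q * static_cast<double>(num_rows - 1);
  switch (interp) {
    case row_interpolation::LOWER:   return static_cast<std::size_t>(std::floor(pos));
    case row_interpolation::HIGHER:  return static_cast<std::size_t>(std::ceil(pos));
    case row_interpolation::NEAREST: return static_cast<std::size_t>(std::lround(pos));
  }
  return 0;  // not reached for valid enum values
}
```

With four rows and q = 0.5 the position is 1.5, so LOWER picks row 1 while HIGHER and NEAREST pick row 2.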

Part of the effort of this issue should also be to plumb the new libcudf++ quantiles APIs into Cython/Python.

Performance dask feature request libcudf

All 15 comments

Accidentally closed. Sorry.

Any objections to renaming quantiles(column_view ...) to quantile(column_view ...) (plural vs singular)?

I also thought renaming the function would alleviate the need to change the parameters, but it may be confusing to have quantile and quantiles both accepting a table.

Under the hood, this should still use sorted_order, correct?

Yeah, the implementation shouldn't be too different from the existing one, just doing a multi-column comparison instead of single.
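A host-side sketch of that idea, in plain C++ rather than libcudf: compute a sort permutation with a multi-column comparator, then pick rows from it. `Table` and `sorted_row_order` are hypothetical stand-ins and do not reflect the actual sorted_order signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a two-column table stored column-wise.
struct Table {
  std::vector<int> a;
  std::vector<std::string> b;
};

// Returns the permutation that orders the rows lexicographically (a, then b).
std::vector<std::size_t> sorted_row_order(Table const& t) {
  assert(t.a.size() == t.b.size());
  std::vector<std::size_t> order(t.a.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  // The only change from the single-column case is the comparator,
  // which compares whole rows instead of one value.
  std::stable_sort(order.begin(), order.end(), [&](std::size_t i, std::size_t j) {
    return std::tie(t.a[i], t.b[i]) < std::tie(t.a[j], t.b[j]);
  });
  return order;
}
```

The row at a requested quantile is then `order[index]`, gathered from every column.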

The API you specified uses std::vector<double> q, but do we also need/want support for column_view q (where dtype is floating point)?

I can't imagine a scenario where the requested quantiles reside in device memory, but I'd defer to @kkraus14 or @rjzamora.

To be clear, the question is, for multi-column quantiles, will the information about what quantiles are desired ever reside in device memory?

Your assumption is probably correct here. dask/dask_cudf will be passing in a desired list of quantiles with a size that is proportional to the number of devices (probably). I am guessing that this relatively small/simple calculation is not worth performing on the device.

As mentioned in #4053, we will want to consolidate the quantile and quantiles APIs and enforce non-arithmetic interpolation on tables containing non-arithmetic column types. Arithmetic interpolation _should_ work for tables containing only arithmetic column types.

See https://github.com/rapidsai/cudf/pull/4053#discussion_r375587867, https://github.com/rapidsai/cudf/pull/4053#discussion_r375590447, and https://github.com/rapidsai/cudf/pull/4053#discussion_r375591293

@cwharris are you planning to take on that work in 0.13? Is this issue the one tracking Cython / Python interface to libcudf++ quantiles, or is there a separate issue?

@harrism planning on 0.13 for Cython + multi-column arithmetic support. Both are in progress. I was using this issue for both, but we can split it up if that makes it easier.

I have questions regarding implementation.

The multi-column quantiles described in this issue assume discrete rows are selected, without interpolation. We're discussing adding interpolation support when supplied with only arithmetic columns. The legacy quantiles API, as well as the experimental quantile API, return double results. That doesn't fit the pattern established here.

We may need multiple quantile methods, one which returns columns with the same types as the input (be it discrete interpolation, or arithmetic), and another which returns ~doubles~ floating point.

Thoughts?

I think we should probably check with our users on expected use cases (dask/Python, Spark, BlazingSQL). It would be great if you can gather that information. I think the stakeholders are, respectively @kkraus14 @rjzamora @jlowe @revans2 @felipeblazing

Since the answer isn't immediately obvious, I'm going to get both the quantiles and quantile APIs hooked up in Cython so we can get it integrated. We can consolidate later, but the existing APIs are sufficient for much of this work.

Separate APIs sound reasonable. Something like discrete_quantiles and interpolated_quantiles.

This has the nice advantage of making it clearer you can't interpolate non-arithmetic types by forbidding them from the interpolated_quantiles API.
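A minimal sketch of how that split could look, using single-column host vectors instead of libcudf types. The singular names `discrete_quantile` and `interpolated_quantile` and their signatures are hypothetical. The point is only that the interpolated version can reject non-arithmetic types at compile time:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Works for any ordered type: returns an element that exists in the input.
template <typename T>
T discrete_quantile(std::vector<T> values, double q) {
  assert(!values.empty() && q >= 0.0 && q <= 1.0);
  std::sort(values.begin(), values.end());
  auto const pos = q * static_cast<double>(values.size() - 1);
  return values[static_cast<std::size_t>(pos)];  // rounds the position down
}

// Arithmetic types only: linearly interpolates and always returns double.
template <typename T>
double interpolated_quantile(std::vector<T> values, double q) {
  static_assert(std::is_arithmetic<T>::value, "interpolation requires an arithmetic type");
  assert(!values.empty() && q >= 0.0 && q <= 1.0);
  std::sort(values.begin(), values.end());
  double const pos = q * static_cast<double>(values.size() - 1);
  auto const lo = static_cast<std::size_t>(std::floor(pos));
  auto const hi = std::min(lo + 1, values.size() - 1);
  double const frac = pos - static_cast<double>(lo);
  double const lo_val = static_cast<double>(values[lo]);
  double const hi_val = static_cast<double>(values[hi]);
  return lo_val + frac * (hi_val - lo_val);
}
```

Calling `interpolated_quantile` on a vector of strings fails to compile, while `discrete_quantile` accepts it and keeps the input type.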

I like Jake's suggestion.

4292
