Cudf: [BUG] find_and_replace_all doesn't replace nans

Created on 20 Mar 2020  路  9Comments  路  Source: rapidsai/cudf

Describe the bug
find_and_replace_all should be able to replace nans with other values, but does not.

Steps/Code to reproduce bug

auto input = fixed_width_column_wrapper<float>({ numeric_limits<double>::quiet_NaN() }};
auto values_to_replace = fixed_width_column_wrapper<float>({ numeric_limits<double>::quiet_NaN() }};
auto replacements = fixed_width_column_wrapper<float>({ 0 }};
auto actual = find_and_replace_all(input, values_to_replace, replacement_values);

Expected behavior
The following test should pass.

auto expected = fixed_width_column_wrapper<float>({ 0 });
expect_columns_equal(expected, actual);

Additional context
This treatment of qNaNs may be happening elsewhere in libcudf, including contains.
https://github.com/rapidsai/cudf/issues/4640

There are no cpp unit tests for find_and_replace_all which specifically test nans. There _are_ Java unit tests which will need to be updated.

Spark bug libcudf

All 9 comments

Blocked by larger NaN discussion #4760

Meanwhile, the desired behaviour should be achievable by using nans_to_nulls and then replace_nulls. Or an internal implementation using the two, if column contains both NaNs and nulls.

Based on the discussion #4760, we probably don't want to modify the current behavior, as it is consistent with C++ (NaN != NaN).
As @devavret said, there is a way to achieve the same result with a few extra operations.
@cwharris can we close this issue?

It's not exactly the same result. If the input already contains nulls, and those are _not_ to be replaced, then that won't work out. :/

Personally, I don't feel this has much to do with #4760. The consumer is specifically saying "I want these values replaced", and one of those values is NaN. We're not comparing for the sake of sorting, or computing equality. We're doing a business operation, not a mathematical one.

I think this can be implemented as requested without effecting the conversation in #4760 - if not by passing NaN as an argument, then by passing an argument such as "replace_nan", defaulting to false.

Alternatively, we could write a replace_nans, similar to replace_nulls, which could be run prior to find_and_replace_all.

We're not comparing for the sake of sorting, or computing equality.

But you are comparing for equality. You're saying, "For haystack values that are _equal to_ needle values, replace them with a corresponding value."

replace_nans is definitely something we need to provide.

In fact, when I started looking at this, I thought this should be implemented as replace_nans. That's what led me to think this can be quickly implemented with nans_to_nulls and then replace_nulls. Of course, I'd unpack it first because the column may already have nulls.

So is this a bug or feature request? if it's a bug, I can get this implemented pretty quick, I imagine. If it's a feature, we can push it back.

It's a feature request masquerading as a bug. Marked as P1 for 0.14, and I'll leave it there for anyone to pick up when they don't have a P0.

Was this page helpful?
0 / 5 - 0 ratings