Cudf: [DISCUSSION] Behavior for NaN comparisons in libcudf

Created on 31 Mar 2020 · 45 comments · Source: rapidsai/cudf

Recent issues (https://github.com/rapidsai/cudf/issues/4753 https://github.com/rapidsai/cudf/issues/4752) have called into question how libcudf handles NaN floating point values. We've only ever addressed this issue on an ad hoc basis as opposed to having a larger conversation about the issue.

C++

C++ follows the IEEE 754 standard for floating point values, which for comparisons with NaN has the following behavior:

| Comparison | NaN ≥ x | NaN ≤ x | NaN > x | NaN < x | NaN = x | NaN ≠ x |
|------------|--------------|--------------|--------------|--------------|--------------|-------------|
| Result | Always False | Always False | Always False | Always False | Always False | Always True |
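
This behavior is directly observable in plain C++; a minimal check (illustrative only, `nan_comparisons_follow_ieee754` is a name made up for this sketch):

```cpp
#include <limits>

// Every ordered comparison against NaN is false; only != is true.
// This matches the table above and is what C++ gives you for free.
inline bool nan_comparisons_follow_ieee754() {
  double const nan = std::numeric_limits<double>::quiet_NaN();
  double const x   = 42.0;
  return (nan >= x) == false &&
         (nan <= x) == false &&
         (nan >  x) == false &&
         (nan <  x) == false &&
         (nan == x) == false &&
         (nan != x) == true  &&
         (nan == nan) == false;  // NaN is not even equal to itself
}
```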

https://en.wikipedia.org/wiki/NaN

Spark

Spark is non-conforming with the IEEE 754 standard:

| Comparison | NaN ≥ x | NaN ≤ x | NaN > x | NaN < x | NaN = x | NaN ≠ x |
|------------|-------------|-----------------------|-------------|--------------|------------------|----------------------|
| Result | Always True | False unless x is NaN | Always True | Always False | True only if x is NaN | True unless x is NaN |

See https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics

Python/Pandas

Python is a bit of a grey area because, prior to 1.0, Pandas did not have the concept of "null" values and used NaNs in their stead.

In most regards, Python _does_ respect IEEE 754. For example, see how numpy conforms with the expected IEEE 754 behavior in binary ops https://github.com/rapidsai/cudf/issues/4752#issuecomment-606649251 (where Spark does not).

However, there are some cases where Pandas is non-conforming due to the pseudo-null behavior. For example, sort_values has a na_position argument to control where NaN values are placed. This requires specializing the comparator libcudf uses for sorting to special-case floating point values and deviate from the IEEE 754 behavior where both NaN < x and NaN > x are false. See https://github.com/rapidsai/cudf/issues/2191 and https://github.com/rapidsai/cudf/issues/3226 where this was done previously.
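
As an illustration of the kind of specialization involved (this is a sketch, not libcudf's actual comparator), a less-than that deviates from IEEE 754 so NaNs always sink to the end, emulating pandas' na_position="last":

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical NaN-aware comparator: NaN is never "less", and every
// number is less than NaN, so NaNs deterministically sort to the end.
struct nan_last_less {
  bool operator()(double lhs, double rhs) const {
    if (std::isnan(lhs)) return false;  // NaN sinks to the end
    if (std::isnan(rhs)) return true;   // any number precedes NaN
    return lhs < rhs;                   // plain IEEE < between non-NaNs
  }
};

inline std::vector<double> sort_nans_last(std::vector<double> v) {
  std::sort(v.begin(), v.end(), nan_last_less{});
  return v;
}
```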

That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls, but @shwina @kkraus14 will need to confirm. Prior to Pandas 1.0, it wasn't possible to have both NaN and null values in a floating point column. We should see what the expected behavior of NaNs vs. nulls will be in 1.0.

Discussion

We need to have a conversation and make decisions on what libcudf will and will not do with respect to NaN behavior.

My stance is that libcudf should adhere to IEEE 754. Spark's semantics redefine a core concept of the C++ language/IEEE standard and satisfying those semantics would require extremely invasive changes that negatively impact both performance and code maintainability.

Even worse, because Spark differs from C++/Pandas, we need to provide separate code paths for all comparison based operations: a "Spark" path, and a "C++/Pandas" path. This further increases code bloat and maintenance costs.

Furthermore, for consistency, I think we should roll back the non-conformant changes introduced for comparators in https://github.com/rapidsai/cudf/issues/3226.

In conclusion, we already have special logic for handling NULLs everywhere in libcudf. Users should leverage that logic by converting NaNs to NULLs. I understand that vanilla Spark treats NaNs and NULLs independently, but I believe trying to imitate that behavior in libcudf comes at too high a cost.

Labels: Spark, cuDF (Python), libcudf

Most helpful comment

Besides Spark and Pandas, what do other apps on cudf want with respect to NaN handling? For example, curious what the Blazing group's thoughts are and if they have a proposal we haven't considered. cc: @felipeblazing

All 45 comments

That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls, but @shwina @kkraus14 will need to confirm. Prior to Pandas 1.0, it wasn't possible to have both NaN and null values in a floating point column. We should see what the expected behavior of NaNs vs. nulls will be in 1.0.

They only have nullable integer and boolean columns as of 1.0. For floating point columns they only have NaNs, and there's an ongoing discussion about whether they want to support nulls vs. NaNs for floating point columns. https://github.com/pandas-dev/pandas/issues/32265

I'm strongly in favour of libcudf following IEEE 754 for NaN behaviour, while offering a good deal of flexibility for null behaviour. There should be no question about what the behaviour of NaN should be - it's defined by the standard. However, null can be interpreted differently in different contexts and in different platforms.

That said, I believe Python's requirements could be satisfied by always converting NaN values to nulls

This is definitely accurate for Pandas < 1.0, i.e., before the introduction of pd.NA. To match Pandas behaviour, we can convert NaNs in the input to null and use libcudf's flexibility to accommodate Pandas' quirks with regard to NaNs.

For Pandas >= 1.0, it's still not clear (to me, at least), how Pandas treats NaN differently from pd.NA in algorithms like join, groupby, etc., but I strongly believe that should not influence libcudf handling of NaN.

ORC and Parquet both have very clearly defined behavior regarding NaNs, so IO readers/writers will presumably stick with following the corresponding specification in that regard (forcing NaNs to be NULLs would be a problem).

ORC and Parquet both have very clearly defined behavior regarding NaNs, so IO readers/writers will presumably stick with following the corresponding specification in that regard (forcing NaNs to be NULLs would be a problem).

Do ORC/Parquet have any behavior that doesn't conform with IEEE 754? If not, then there's no issue.

To be clear, I'm not suggesting NaNs _always_ be converted to nulls. Instead, if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.

if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.

This has come up a number of times, and as discussed previously it is not a workable solution. NaNs are distinct from nulls in Spark, and therefore NaNs cannot be converted to nulls without losing information. Consider a column that contains both NaNs and nulls, and I need it to be sorted, aggregated, whatever. As soon as I change the NaNs to nulls and operate on it, I have lost the ability to recover which nulls are really proxies for NaNs so they can be converted back to produce a proper answer. This also assumes that how nulls are handled matches the behavior desired for NaNs, which is incorrect in many cases (e.g., how Spark sorts nulls vs. NaNs).

if a user desires behavior for NaNs to be anything other than what IEEE 754 specifies, then they can convert those NaNs to nulls and use the various null handling logic present in libcudf.

This has come up a number of times, and as discussed previously it is not a workable solution. NaNs are distinct from nulls in Spark, and therefore NaNs cannot be converted to nulls without losing information. Consider a column that contains both NaNs and nulls, and I need it to be sorted, aggregated, whatever. As soon as I change the NaNs to nulls and operate on it, I have lost the ability to recover which nulls are really proxies for NaNs so they can be converted back to produce a proper answer. This also assumes that how nulls are handled matches the behavior desired for NaNs, which is incorrect in many cases (e.g., how Spark sorts nulls vs. NaNs).

Is the fact that Spark NaN semantics differs from the IEEE 754 standard something that you think could be reasonably raised upstream in the Spark community? I understand that isn't a short term path, but from the perspective of libcudf being a C++ library, this feels out of scope and will just bloat the library.

@jrhemstad any thoughts on how hard it would be to allow someone to provide their own unary / binary ops via an additional shared library? I.e., I could imagine in the future that any Spark or Pandas or other library-specific behaviors could be defined within a helper shared library that depends on libcudf and extends things as needed where it doesn't make sense to include them in libcudf directly.

@jrhemstad any thoughts on how hard it would be to allow someone to provide their own unary / binary ops via an additional shared library?

If someone is only interfacing with the precompiled libcudf.so, then this would basically be impossible unless using the PTX UDF kind of stuff we've already experimented with in the past.

Where it _would_ be possible is if a user interfaces directly with libcudf's header-only implementations (currently only in detail::, but that could change). In that case, we could expose function templates that allows the user to specify a custom comparator at compile time.

None of our detail:: APIs are really configured to allow customizing that comparator, but that could change. It would definitely require some decent refactoring work, but it's possible.

However, this solution would require a user to build their own library binary.

Is the fact that Spark NaN semantics differs from the IEEE 754 standard something that you think could be reasonably raised upstream in the Spark community?

No, definitely not. Users do not want sorts of columns that happen to contain NaNs to sort NaNs semi-randomly (as occurs in practice with IEEE 754 NaN comparison behavior). Other databases have sort behaviors similar to Spark's wrt. NaNs, and asking a very long-standing and well-documented database behavior to change is a non-starter.

I understand the desire to not juggle a bunch of conflicting requirements, but we need to find some way to meet them. As I understand it, that means we have to choose between:

  1. libcudf directly provides a way to perform this behavior
  2. libcudf provides extension points so applications can reuse most of the libcudf code but override the necessary behavior in an application-specific way
  3. application forks a custom libcudf

I really, really don't want to choose that last one. 😅

2. libcudf provides extension points so applications can reuse most of the libcudf code but override the necessary behavior in an application-specific way

I think what you're suggesting here is the solution @kkraus14 was hinting at and what I outlined above.

This would take a bit of work to expose the necessary function templates, but still probably less than the complexity of providing two separate code paths that libcudf would currently need to provide in option 1.

No, definitely not. Users do not want sorts of columns that happen to contain NaNs to sort NaNs semi-randomly (as occurs in practice with IEEE 754 NaN comparison behavior).

I'm all for allowing control of NaNs in sorting because that's allowing the ability to define otherwise undefined behavior, but these binary operations are well defined for NaN values.

The problem is that sorting and comparison are expected to be related. If the sort operation orders NaNs as the largest values, then IMO comparing a NaN against other numbers should reflect that sort order. Since IEEE 754 NaNs never compare, I don't see how you can ever sort them. If we want to sort them in a consistent way then there needs to be a way to consistently compare them.
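
The "can't sort with IEEE NaNs" point can be made precise in plain C++ (an illustration, not libcudf code): std::sort requires a strict weak ordering, in which "neither a < b nor b < a" must be a transitive equivalence, and IEEE < breaks that as soon as a NaN is present (so sorting NaN data with raw < is undefined behavior):

```cpp
#include <cmath>

// Two values are "equivalent" under < when neither compares less than the other.
inline bool equiv_under_less(double a, double b) { return !(a < b) && !(b < a); }

// For a valid sort comparator, equivalence must be transitive:
// a ~ b and b ~ c must imply a ~ c. With b = NaN, 1 ~ NaN and NaN ~ 2
// both hold, yet 1 < 2, so the property fails.
inline bool less_is_strict_weak_on(double a, double b, double c) {
  return !(equiv_under_less(a, b) && equiv_under_less(b, c)) || equiv_under_less(a, c);
}
```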

The problem is that sorting and comparison are expected to be related.

I agree, which is why I actually suggested removing the specialization for NaNs in the sort comparator :) That way we're consistently conformant rather than a mix of non-conformance with conformance (as we've seen in Raza's issues).

Note that removing the sort specialization doesn't really solve anything other than allowing us to say, "we're just as useful as IEEE 754 when it comes to sorting with data that may contain NaNs. Make sure you don't have NaNs in your data before calling sort/orderby or it's worthless" 😉

Note that removing the sort specialization doesn't really solve anything other than allowing us to say, "we're just as useful as IEEE 754 when it comes to sorting with data that may contain NaNs. Make sure you don't have NaNs in your data before calling sort/orderby or it's worthless"

Like I always say, "Better to be consistently wrong than inconsistently right." :smile:

Besides Spark and Pandas, what do other apps on cudf want with respect to NaN handling? For example, curious what the Blazing group's thoughts are and if they have a proposal we haven't considered. cc: @felipeblazing

For Blazing it's actually pretty important that NaN comparisons be deterministic. Non-deterministic sorting could cause all kinds of issues for us. I think it's very valid for a user to want to do something like have NaN be included as one of the groups when performing a group by.

Also, how would it look to a user if they sort the list [2, 0, NaN] and see different results each time:

    [ 0 , NaN, 2 ]
    [ 0 , 2, NaN ]
    [ NaN, 0 , 2 ] 

I agree with @jrhemstad when he says it's better to be consistently wrong as opposed to following the spec. The spec seems to suck in this case where users can care if something is NaN. Right now we aren't doing anything different from what cudf is doing, and we have not tested what happens when we put NaN numbers in columns. Something we will have to rectify :).

We care about consistency. We don't really care if they come before or after. I imagine that it might be most convenient to put them at the end assuming people are most likely to sort things in ascending order.

I agree with @jrhemstad when he says it's better to be consistently wrong as opposed to following the spec. The spec seems to suck in this case where users can care if something is NaN

@felipeblazing I think you're gravely misunderstanding my statement 😅

I'm saying we should consistently follow the IEEE 754 standard, which we don't currently do (as evidenced by the special handling of NaNs for sorting). This is the opposite of what you seem to be advocating for.

NaNs are, _by definition_, objects that cannot be ordered. The expectation is that if you have NaNs in your data, you should do something about that (like filtering them out or replacing them with a sentinel value) before trying to do any operation like a sort.

For Blazing its actually pretty important that NaN comparisons be deterministic.

I think this is going to be true for any cudf application that needs to deal directly with NaNs instead of treating them as equivalent to nulls. By adhering to IEEE 754, NaNs become explosive values in the data, ruining sorts, aggregations, lookups, etc. in which they appear. If libcudf strictly adheres to IEEE 754 for NaNs and only exposes function templates, then I would argue most cudf applications needing to wield NaNs will have to build their own libcudf with custom function templates. That's not ideal.

I think function templates are a good way to factor the problem and minimize duplicated code paths in libcudf. However I don't think libcudf's solution should stop there when it comes to NaNs because the NaN issue is likely not limited to a small subset of cudf applications. To me this would be akin to libcudf only sorting nulls first and saying, "Do you need to sort nulls last instead sometimes? Sorry, we don't support that directly. Here's a function template interface to help when you build your own from source."

IMO it comes down to how useful libcudf's build artifacts will be for applications that need to deal with NaNs. If almost everything in libcudf doesn't behave in a useful way when NaNs appear, then there's going to be a lot of libcudf 'flavors' in the wild that are essentially identical just to solve the NaN problem. At that point it would be tempting to fork libcudf, add the necessary function templates to wield NaNs in a useful manner, and publish it as an out-of-the-box solution for apps that must deal with NaNs.

This has been stated implicitly a couple of times, but just to be explicit: applications can still get predictable NaN behaviour with libcudf, just by converting NaNs to nulls.

For applications that need to deal with a combination of NaN and null, e.g., Spark (and also perhaps Pandas in the future), it's not clear to me what the behaviour should be. How does Spark handle sorting a column containing both NaN and null?

This has been stated implicitly a couple of times, but just to be explicit: applications can _still_ get predictable NaN behaviour with libcudf, just by converting NaNs to nulls.

The problem with this is it then becomes impossible to differentiate between NaNs and nulls.

For applications that need to deal with a combination of NaN and null, e.g., Spark (and also perhaps Pandas in the future), it's not clear to me what the behaviour should be. How does Spark handle sorting a column containing both NaN and null?

Spark currently handles it; Pandas hasn't crossed that road yet because it doesn't yet have nullable float columns, but it does handle sorting NaNs.

How does Spark handle sorting a column containing both NaN and null?

NaNs are sorted relative to numeric values as explained in the PR description and documented at https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics. Null sorting relative to non-null values is treated orthogonally and can be controlled by the user per-column (nulls first or nulls last).

For example, if you want to sort the column [NaN, 5, null, +Inf, -Inf, 7, NaN, 2] in ascending order with nulls first, the result would be [null, -Inf, 2, 5, 7, +Inf, NaN, NaN].
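
The ordering described above can be reproduced with an ordinary comparator; a host-side sketch (illustrative only, not libcudf code) using std::optional to stand in for nulls:

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <optional>
#include <vector>

using opt_d = std::optional<double>;

// Sketch of Spark's documented ordering with nulls first: nulls are handled
// orthogonally to values, and NaN sorts as larger than every number
// (including +Inf) and equivalent to itself.
struct spark_less_nulls_first {
  bool operator()(opt_d const& lhs, opt_d const& rhs) const {
    if (!lhs.has_value()) return rhs.has_value();  // null precedes any non-null
    if (!rhs.has_value()) return false;
    if (std::isnan(*lhs)) return false;  // NaN is the largest value
    if (std::isnan(*rhs)) return true;   // any non-NaN number precedes NaN
    return *lhs < *rhs;                  // plain IEEE < between non-NaNs
  }
};

inline std::vector<opt_d> spark_sort(std::vector<opt_d> v) {
  std::sort(v.begin(), v.end(), spark_less_nulls_first{});
  return v;
}
```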

PostgreSQL matches Spark's behavior with respect to NaN handling, see https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-FLOAT.

I'm not sure if Blazing will choose to align, but it's clear they'll need to put NaNs _somewhere_ deterministically and separate from nulls when it comes to sorting and grouping.

If Pandas is going to start looking more like Spark for NaNs, I have to agree that the custom header version is not very useful.

Seems we should look at whether all of our clients might potentially want the same (or close to the same) thing.

And if not, we should look at how JIT might be put to use.

PostgreSQL matches Spark's behavior with respect to NaN handling
Seems we should look at whether all of our clients might potentially want the same (or close to the same) thing.

Java follows IEEE 754

Scala follows IEEE 754.

(While researching Scala, came across a particularly nice quote

NaN is more an abomination than a number. As far as I recall, the IEEE
standard intended it as an error value that's sticky in computations.
It certainly was never intended as a real value -- the only thing you
should ever do on a NaN is ask it with isNaN.

)

(It's wild to me that Spark deviates from its native language's behavior here :arrow_up: )

R follows IEEE 754

Julia follows IEEE 754

Matlab follows IEEE 754

MySQL doesn't even allow NaNs.

It's clear that the majority of languages follow IEEE 754. For those that do _not_, there is no consensus on what the rules are.

Furthermore, considering that all modern hardware architectures adhere to IEEE 754 (mostly; GPUs don't have signaling NaNs), I can't see Spark's NaN behavior as anything other than abhorrent.

If Pandas is going to start looking more like Spark for NaNs

If anything, I believe Pandas is moving _further_ from Spark's NaN behavior, not closer. By adding a proper concept of a NULL, they no longer have to abuse NaN for nulls.

Pandas today is inconsistent in how it follows IEEE 754. For things like sorting, it does break from the standard and allows sorting NaNs to a consistent location (so it agrees with Spark here).

For Join, Pandas treats NaN == NaN (same as Spark).

For element-wise comparisons with NaN (like binary ops), it follows 754 whereas Spark does not.

For GroupBy, Pandas will drop NaNs in both the keys/values column. Spark does not.

It's becoming clear to me that if libcudf were to natively and wholly support Spark's NaN behavior, it would require extremely invasive changes that touch nearly _every_ feature in libcudf. In addition, not all libcudf consumers want Spark's behavior, so we'd have to maintain redundant code paths: one for IEEE 754 behavior and one for Spark behavior.

Consider if someone decided to redefine what 0.0 meant. This means we have to remember to explicitly check for 0.0 in any/all libcudf features. Not only would that make code more complex (and slower), but there will always be situations where C++ developers will forget to check for 0.0 because its contrary to the usual way of thinking and developing C++ code. As such, it will be a constant source of bugs and ongoing maintenance.

Spark's redefinition of NaN is just as bad as this hypothetical redefinition of 0.0.

Just as a tiny example of what I'm talking about, in C++, casting from NaN is undefined, but Spark expects it to be zero. So now any time we do a cast, we have to remember to check for NaN!

need to put NaNs somewhere deterministically and separate from nulls when it comes to sorting and grouping.

I think there's a middle ground here that minimizes the invasiveness of special casing for NaN while allowing Spark to do what it needs to do.

For example, we've identified that to Spark/Pandas/Blazing, deterministically sorting and matching NaNs in join/groupby keys is important (and libcudf already does this today). There's really no other way to emulate this behavior without libcudf natively supporting it.

In contrast, in the casting example, Spark could achieve its desired behavior by first replacing NaNs with 0.0 and _then_ performing the cast. Is this slower? Certainly. However, since Spark's expected behavior is unique to Spark, it makes sense for Spark to pay the cost of the added complexity and decreased performance rather than making _all_ consumers/developers of libcudf pay this cost.

Here's my proposal.

We proceed on a case-by-case basis and identify situations where it is literally impossible for Spark to otherwise achieve their desired behavior via composition of other libcudf features, e.g., sort/join/groupby. However, in cases like casting, Spark _can_ achieve their desired behavior by first replacing NaN with 0.

Binary ops look like another situation where Spark can do extra work to add support for its NaN behavior. One way to do it is to generate a boolean mask from is_nan to indicate where NaN values are; then you can use that to generate the answer you expected. Alternatively, binops support JIT, so Spark can define custom binary operations as CUDA C functions (PTX not required) in the form of a string and pass those in.
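
The mask-and-fixup idea can be sketched on plain vectors (host-side illustration, not real cudf columns; `spark_equal` is a name invented for this sketch): compute the IEEE 754 result, then OR in the positions where both inputs are NaN to get Spark's "NaN equals NaN" semantics.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Elementwise equality with Spark semantics: IEEE == everywhere except
// that two NaNs compare equal. The "fixup" is the OR with the NaN mask.
inline std::vector<bool> spark_equal(std::vector<double> const& a,
                                     std::vector<double> const& b) {
  std::vector<bool> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    bool const ieee_eq  = (a[i] == b[i]);  // false when either side is NaN
    bool const both_nan = std::isnan(a[i]) && std::isnan(b[i]);
    out[i] = ieee_eq || both_nan;
  }
  return out;
}
```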

As is true of all good compromises, I think all parties will find things to be unhappy about with this solution, but I don't see another way to proceed.

It's wild to me that Spark deviates from its native language's behavior here

Funny, it's wild to me that Pandas is not going to be self-consistent when it comes to sorting NaNs vs. comparing them. Sorting is comparing. PostgreSQL and Spark are self-consistent in their treatment of NaNs, and I suspect that's more useful to end-users in practice. Consistent with IEEE 754? No, but IEEE 754 is not the beginning and end of how to deal with NaNs. And I hope we can all agree IEEE 754 is very unhelpful when trying to deal with them in ETL queries. That's why it sounds like we're leaning towards libcudf breaking with that standard at least when it comes to sorting and grouping.

I'm totally cool with it if we can implement the required behaviors with libcudf primitives, even if it takes multiple steps in some cases. For the casting issue, totally agree the Spark plugin should simply scan for NaNs and replace them ahead of the cast operation. We already do something similar for division, where Spark expects division by zero to generate null: all zero divisor values are replaced with null before calling the divide op.

So a big +1 from me if there's a way to move forward without resorting to building a separate libcudf.

Funny, it's wild to me that Pandas is not going to be self-consistent when it comes to sorting NaNs vs. comparing them. Sorting _is comparing_.

Pandas exposes a parameter to control NaN positioning of either first or last. This operates separately from ascending/descending sort order, which implies that the positioning of NaN values is decoupled from comparisons for this very reason.

From my perspective, a vectorized comparison operation is entirely different from the comparison that is an internal implementation detail of sorting, so it doesn't make sense to equate the two, especially when NaN handling in sorting is handled differently by different parties.

Funny, it's wild to me that Pandas is not going to be self-consistent when it comes to sorting NaNs vs. comparing them. Sorting _is comparing_. PostgreSQL and Spark are self-consistent in their treatment of NaNs, and I suspect that's more useful to end-users in practice.

100% agreed. Pandas being wildly annoying and inconsistent is a ship that has long since sailed for me :smile:

Java follows IEEE 754

Just want to point out that in Java, there is a difference between the comparison operators (<, >, ==, etc.) and the method used by sort (compareTo()). The key difference in the case of the Double type is the NaN handling, where the compareTo() method considers NaN larger than everything else and equal to itself.

As such, this might not be true:

Sorting is comparing.

The point I'm making is just to give an argument against doing this:

for consistency, I think we should roll back the non-conformant changes introduced for comparators in #3226.

The behaviour we have in our row operator would produce a consistent order ([ 0, 2, NaN ]), which is not technically wrong. It still is one of the possible outputs from below:

[ 0 , NaN, 2 ]
[ 0 , 2, NaN ]
[ NaN, 0 , 2 ] 

Being able to guarantee this behaviour still conforms with a standard that gives no guarantee. Those who don't want the guarantee can ignore it.

PS: correct me if I'm wrong. I'm no Java expert.

The point I'm making is just to give an argument against doing this:

for consistency, I think we should roll back the non-conformant changes introduced for comparators in #3226.

Yep, I'm not suggesting we do that anymore.

Also, I don't know how we would enable Spark behaviours in libcudf for casting. We'd probably have to go through the same steps:

For the casting issue, totally agree the Spark plugin should simply scan for NaNs and replace them ahead of the cast operation

So it's the same operations, same performance penalty even when done from inside libcudf.

Also, I don't know how we would enable Spark behaviours in libcudf for casting. We'd probably have to go through the same steps

You could define a custom unary operator that has a specialization for floating point types that does something like:

template <typename From, typename To>
std::enable_if_t<std::is_floating_point_v<From>, To> operator()(From v)
{
  if (std::isnan(v))
    return To{0};
  else
    return static_cast<To>(v);
}

But this is the kind of over specialization that I really don't want libcudf to have to do.

I feel like we can have our cake and eat it too. The IEEE standard can and should apply to the standard comparison operators, for example when you do a WHERE clause in SQL. But it need not apply to the comparator we use in something like a sort. The comparator in a sort does not have to be the standard <. It just needs to be a strict weak ordering comparator, which the IEEE 754 < does not provide when NaNs are involved.

So when I say we can have our cake and eat it too: we have IEEE-compliant comparison operators, and then we have sort comparators that are different.

One idea I have been playing with along the same lines as @williamBlazing's comment is that it's possible to keep libcudf IEEE-754 compliant and still achieve desired behaviour for sort/merge/groupby. A sort on a column containing NaNs can be reinterpreted as a sort on _two_ columns, the column itself, and a boolean mask col == NaN (or col != NaN). This will place NaNs before/after other values as desired.

One idea I have been playing with along the same lines as @williamBlazing's comment is that it's possible to keep libcudf IEEE-754 compliant and still achieve desired behaviour for sort/merge/groupby. A sort on a column containing NaNs can be reinterpreted as a sort on _two_ columns, the column itself, and a boolean mask col == NaN (or col != NaN). This will place NaNs before/after other values as desired.

Moreover, this is a generalization that could apply to more than just NaNs. It's an override mask for forcing elements to the back/front of a sort. This could solve the sorting scenario of @jrhemstad's hypothetical case where someone changes the meaning of 0.0.
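
The two-column idea can be sketched with std::pair (host-side illustration, not libcudf code): the boolean key is compared first, pushing NaN rows to the back, and the double key is only ever compared between non-NaN values, so IEEE 754 semantics are never violated inside the comparator.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Sort a column by the compound key (is_nan, value). Non-NaN rows
// (is_nan == false) come first, internally ordered by plain IEEE <;
// NaN rows are all equivalent to each other and land at the back.
inline std::vector<double> sort_with_nan_key(std::vector<double> const& col) {
  std::vector<std::pair<bool, double>> keyed;
  keyed.reserve(col.size());
  for (double v : col) keyed.emplace_back(std::isnan(v), v);

  std::sort(keyed.begin(), keyed.end(),
            [](auto const& l, auto const& r) {
              if (l.first != r.first) return l.first < r.first;  // non-NaN first
              return l.second < r.second;  // IEEE < (false for NaN vs NaN)
            });

  std::vector<double> out;
  out.reserve(col.size());
  for (auto const& kv : keyed) out.push_back(kv.second);
  return out;
}
```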

So when I say we can have our cake and eat it too: we have IEEE-compliant comparison operators, and then we have sort comparators that are different

It may not have been explicit enough, but in my proposal here I called out that we can keep the custom sort/groupby/join comparators libcudf currently has, even though they strictly aren't IEEE 754 compliant.

Though @shwina brings up a good point that perhaps you can emulate this behavior with an appropriate boolean mask.

The IEEE standard can apply to a standard comparison operators, and it should, for example when you do a WHERE clause in SQL.

Note that in PostgreSQL and Spark, NaN is a value that can compare against itself, so you can select rows that are NaNs by just selecting them directly like any other value.
For example, here's a quick Spark shell session showing that behavior:

scala> val df = Seq(Float.NaN, 0.0, 1.0, Float.NaN).toDF("floats")
df: org.apache.spark.sql.DataFrame = [floats: double]

scala> df.selectExpr("floats").where("floats = 'NaN'").collect
res10: Array[org.apache.spark.sql.Row] = Array([NaN], [NaN])

scala> df.selectExpr("floats = 'NaN'").collect
res11: Array[org.apache.spark.sql.Row] = Array([true], [false], [false], [true])

SQL engines that need this type of behavior can still implement this specific case via libcudf by translating the operation into an is_nan call instead. It gets a bit trickier when it's not a query literal but rather a computed scalar or an entire column-to-column compare where either side could contain NaNs. That's when it'd be nice to be able to override the binary comparison operators rather than force a series of "what if NaN" intermediate results and combine those results with is_nan masks and copy_if_else calls to get the desired result. But even those are achievable today with some effort.

Aggregations and joins are where things get really messy, because those perform comparisons internally that are not controllable externally. I think we will need some libcudf help to make those work properly for Spark when NaNs are in the input. Note that this might dovetail nicely with some other asks like conditional join support, where the join condition could be pluggable and provided by the application. We eventually will have similar pluggable asks for aggregation functions.

Aggregations and joins are where things get really messy, because those perform comparisons internally that are not controllable externally. I think we will need some libcudf help to make those work properly for Spark when NaNs are in the input.

Agreed. Concrete example in hash join/groupby: there needs to be a specialization for the hashing function to make sure NaNs are hashed consistently. There's no way I can think of for the user to really emulate this.

Likewise, the comparator needs to be specialized to treat two rows {1, 2, 3, NaN} and {1, 2, 3, NaN} as equal.

I don't see how a boolean mask could help you here because you'd have {1, 2, 3, NaN, true} and {1, 2, 3, NaN, true}. Those two rows will still compare as not equal without specialization.

@shwina pointed out that you technically _can_ make the above work by generating the boolean mask and then converting NaNs to nulls, so the example becomes {1, 2, 3, null, true} and {1, 2, 3, null, true}, which can compare equal. The boolean mask ensures you can keep NaNs distinct from other nulls.

I'm not actually suggesting anyone do this, but it is an interesting exercise to show that it is possible.

As discussed offline with Jake, in theory, you could pre-process to convert NaNs to nulls, in which case you'd be comparing {1, 2, 3, null, true} and {1, 2, 3, null, true}. No information has been lost, as the boolean mask still lets you track where your NaNs have been.

> As discussed offline with Jake, in theory, you could pre-process to convert NaNs to nulls

Again, NaNs and nulls are separate. As soon as I convert NaNs to nulls, I risk thinking a NaN is a null. If I compare a NaN-now-a-null to another always-was-a-null, that's going to be true, not false as it should be. Now I need to fix up the comparison result. Might as well just do the is_nan check on both sides, compare those masks directly, and OR the results of that into the regular column comparison. Less messy IMO.
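That OR-the-masks approach, sketched with numpy standing in for the corresponding libcudf calls:

```python
import numpy as np

a = np.array([np.nan, 1.0, np.nan, 2.0])
b = np.array([np.nan, 1.0, 5.0, np.nan])

# IEEE gives NaN == NaN -> False; OR in the "both sides NaN" mask
# to get the Spark semantics of NaN = NaN -> True
spark_eq = (a == b) | (np.isnan(a) & np.isnan(b))
# -> [True, True, False, False]
```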

@jlowe Agreed. The suggested approach was just for comparing entire rows as equal, not a true elementwise compare.

> Again, NaNs and nulls are separate. As soon as I convert NaNs to nulls, I risk thinking a NaN is a null. If I compare a NaN-now-a-null to another always-was-a-null, that's going to be true, not false as it should be. Now I need to fix up the comparison result. Might as well just do the is_nan check on both sides, compare those masks directly, and OR the results of that into the regular column comparison. Less messy IMO.

@shwina was suggesting that you can preserve the independence of NaNs vs. nulls by using the boolean mask.

Example:
{1, 2, 3, NaN} -> {1, 2, 3, null, true}
{1, 2, 3, null} -> {1, 2, 3, null, false}

The two rows will still compare as not equal.
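A quick sketch of that trick in Python (the `augment_key` helper is hypothetical, purely to show the row comparisons): each row's NaNs become nulls plus per-slot NaN-indicator flags, so two NaN rows compare equal while a NaN row and a true-null row do not.

```python
import math

def augment_key(row):
    # Replace each NaN with None and append per-slot NaN-indicator flags
    is_nan = [isinstance(v, float) and math.isnan(v) for v in row]
    vals = [None if f else v for v, f in zip(row, is_nan)]
    return tuple(vals) + tuple(is_nan)

nan_row = (1, 2, 3, float("nan"))
null_row = (1, 2, 3, None)

# {1, 2, 3, NaN} == {1, 2, 3, NaN} after augmentation...
assert augment_key(nan_row) == augment_key((1, 2, 3, float("nan")))
# ...but {1, 2, 3, NaN} != {1, 2, 3, null}
assert augment_key(nan_row) != augment_key(null_row)
```

The augmented keys are also hashable, so the same trick covers hash-based groupby/join equivalence.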

Ah yes, sorry I misread. That technique could be used to fix equivalence comparisons for joins.


Is removing NaN handling from groupby / join something that's on the table?

> Is removing NaN handling from groupby / join something that's on the table?

Not any more, no. Though @shwina has pointed out that we technically _could_ and still preserve the semantics that Spark/Pandas expect via boolean masks.

I originally proposed removing the NaN specialization for consistency here: https://github.com/rapidsai/cudf/issues/4760#issue-591285176

> Furthermore, for consistency, I think we should roll back the non-conformant changes introduced for comparators in #3226.

But I've since rescinded that statement several times:

https://github.com/rapidsai/cudf/issues/4760#issuecomment-607927081

> For example, we've identified that to Spark/Pandas/Blazing, deterministically sorting and matching NaNs in join/groupby keys is important (and libcudf already does this today). There's really no other way to emulate this behavior without libcudf natively supporting it.

https://github.com/rapidsai/cudf/issues/4760#issuecomment-608078838

> Yep, I'm not suggesting we do that anymore.

https://github.com/rapidsai/cudf/issues/4760#issuecomment-608506766

> It may not have been explicit enough, but in my proposal here I called out that we can keep the custom sort/groupby/join comparators libcudf currently has, even though they strictly aren't IEEE 754 compliant.

This has been resolved.
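For reference, the kind of non-IEEE total order those custom comparators implement (NaNs sorted deterministically after all real values, matching Pandas' `na_position='last'`) can be sketched as:

```python
import math
from functools import cmp_to_key

def nan_last_cmp(x: float, y: float) -> int:
    # Total order: NaN compares equal to NaN and greater than any real value,
    # deliberately deviating from IEEE 754 (where NaN < x and NaN > x are both False)
    xn, yn = math.isnan(x), math.isnan(y)
    if xn and yn:
        return 0
    if xn:
        return 1
    if yn:
        return -1
    return (x > y) - (x < y)

data = [math.nan, 1.0, -2.0, math.nan, 0.5]
result = sorted(data, key=cmp_to_key(nan_last_cmp))
# Real values ascending, NaNs grouped deterministically at the end
```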
