Cudf: [BUG] nan_as_null parameter affects output of sort_values.

Created on 8 Jul 2019 · 48 Comments · Source: rapidsai/cudf

Describe the bug
nan_as_null parameter affects output of sort_values.

Steps/Code to reproduce bug

In [22]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=True)})

In [23]: print(df.sort_values(by='a'))
     a
5  0.0
1  1.0
3  2.0
0
2
4
In [19]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)})

In [20]: print(df.sort_values(by='a'))
     a
0  nan
1  1.0
2  nan
3  2.0
4  nan
5  0.0

Similar issues occur in other methods that use libcudf APIs, e.g. drop_duplicates (which uses sorting):

df = cudf.DataFrame({'a': cudf.Series([1.0, np.nan, 0, np.nan, 0, 1], nan_as_null=False)})

In [10]: print(df)
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0

In [11]: print(df.drop_duplicates())
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0

Expected behavior
For sorting and drop_duplicates, NaN values should be considered equal.
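For reference, this is how pandas behaves: NaNs sort to one end and compare equal for deduplication. A quick check (assuming pandas and numpy are available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 0.0, np.nan, 0.0, 1.0]})

# NaNs sort to the end by default (na_position='last')
s = df.sort_values(by='a')['a']
print(s.tolist())        # [0.0, 0.0, 1.0, 1.0, nan, nan]

# drop_duplicates treats NaNs as equal, so a single NaN row survives
d = df.drop_duplicates()
print(d['a'].tolist())   # [1.0, nan, 0.0]
```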

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: from source
Labels: Spark, bug, libcudf

All 48 comments

One solution is to add a NaN equality check for float and double values in the libcudf C++ comparators.
Another: eliminate nan_as_null and always convert NaN to null.

@karthikeyann can you please include the pandas results for comparison?

@williamBlazing @kkraus14 @revans2 @harrism
We need to decide how to treat nan and null for floating point columns in libcudf.

For pandas, NaN is treated the same as null; cudf Python should follow the same convention.
For R, NaN, NA, and NULL are all different. NULL is always ignored. Sorted results do not contain NA or NaN. unique treats NA and NaN as different. (https://www.r-bloggers.com/difference-between-na-and-nan-in-r/)
libcudf doesn't support R.

For Spark, how are NaN, NA, and null treated in comparisons?
For SQL, how are NaN, NA, and null treated in comparisons?

@karthikeyann Note there's an issue we raised for Spark aggregation specifically that shows what Spark expects: https://github.com/rapidsai/cudf/issues/2500. I just added some information there on how Spark handles nulls and NaN.

In summary:

There's enough variation of how one handles NaN/null that it should be configurable. I'd expect a small struct that has: nulls_are_equal, and at least nans_are_equal, but perhaps other settings for -0.0f == 0.0f, and nulls_equals_nans for the other cases you mentioned.

As far as NaN ordering is concerned, Spark considers them to be after positive infinity when sorting floating point values in ascending order.

There's enough variation of how one handles NaN/null that it should be configurable, I'd expect a small struct that has: nulls_are_equal, and at least nans_are_equal, but perhaps other settings for -0.0f == 0.0f, and nulls_equals_nans for the other cases you mentioned.

In several places we already expose options like if nulls_are_equal. We can possibly add nans_are_equal.

However, I'm resistant to providing all the knobs in libcudf to support the _entire_ cross product of expected behaviors of nulls/nans/special floating point values that the various existing Data Science platforms expect. Providing all of those knobs vastly increases complexity. Increased complexity increases software maintenance, the likelihood of bugs, and reduces performance.

I can support parameterizing NULL behavior, but if we have to start parameterizing special floating point behavior (especially when it deviates from IEEE 754), then that becomes a nightmare for implementation and testing.

@jrhemstad, definitely agree with not providing all of the knobs. That said, I would argue for providing the knobs that can't be handled outside of the group by and sort by:

These could be done in a separate call to find_and_replace_all:

  • -0.0f == 0.0f, just replace -0.0f with 0.0f
  • nulls == NaNs: not an issue in Spark, but it seems find_and_replace_all could be used to replace NaN with null

These can't be done outside of the group/sort kernels, as far as I know:

  • NaN == NaN,
  • NaN > +Infinity

For the performance reduction, could specializations be done in such a way to remove the NaN checks for the clients that don't need this?

We currently have both find_and_replace_all and as of 0.9, nans_to_nulls.

Is it OK to require the end user (e.g. cuDF/Pandas/Spark user) to insert a conversion pass when they know they need it? I can imagine a data scientist getting wrong results and realizing they have NaN entries. They could run nans_to_nulls and then re-run to get correct results. But is that acceptable?

Or would cuDF Python and cuDF Spark be required to insert nans_to_nulls calls on every floating point input column to algorithms that compare values (sort, unique, etc.)?

I think this is the key question.
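As a rough sketch of what such a conversion pass does conceptually (an illustration, not the libcudf implementation): the data buffer is left alone and a validity mask is built that marks NaN entries as null. A plain boolean array stands in for libcudf's bitmask here.

```python
import numpy as np

def nans_to_nulls_mask(col):
    """Build a validity mask where NaN entries become null (invalid).

    Illustration only: libcudf's nans_to_nulls produces a bitmask;
    a boolean array stands in for it here.
    """
    return ~np.isnan(col)

col = np.array([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0])
valid = nans_to_nulls_mask(col)
print(valid)   # [False  True False  True False  True]
```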

I agree with @jrhemstad on not providing too many knobs.

find_and_replace_all for Spark and nans_to_nulls in Python looks to be a good solution.

@abellina For spark,
NaN == NaN: find_and_replace_all NaN's with null and nulls_are_equal=true flag for equality comparisons.
NaN > +Inf: find_and_replace_all NaN's with null and pass nulls_are_smallest=false flag for sorting.

For cudf Python, nans_to_nulls is enough. (I verified it. It doesn't change the result data to null, only the mask used for comparison.)

@abellina @jrhemstad Should this be done by end user or the cudf Python/Spark API itself?

NaN == NaN: find_and_replace_all NaN's with null and nulls_are_equal=true flag for equality comparisons.
NaN > +Inf: find_and_replace_all NaN's with null and pass nulls_are_smallest=false flag for sorting.

Note that Spark and Pandas have different semantics wrt. NaNs. In Spark, NaNs and nulls are not semantically equivalent. NaN is a value that is not a number, null is not a value.

Therefore I don't see how replacing NaNs with nulls would be useful for Spark. This would lead to false positives in equality comparisons when columns containing NaNs are compared with columns containing nulls, and incorrect bucketing of NaNs with nulls in sorting and groupby operations. I only see find_and_replace_all being useful to normalize values, like transforming -0 --> 0 and -NaN --> NaN, before calling the functions for equality/sorting/grouping when normalization is desired.
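A sketch of that normalization pass, using numpy as a stand-in for find_and_replace_all (the helper name is illustrative):

```python
import numpy as np

def normalize(col):
    # -0.0 -> 0.0 (relies on -0.0 == 0.0 under IEEE 754)
    out = np.where(col == 0.0, 0.0, col)
    # -NaN -> NaN (replace any NaN payload with the canonical positive NaN)
    out = np.where(np.isnan(out), np.nan, out)
    return out

col = np.array([-0.0, 0.0, -np.nan, 1.0])
norm = normalize(col)
print(np.signbit(norm))   # [False False False False]
```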

@jlowe what will the results of NaN==null, NaN!=null, NaN>null, NaN<null be in Spark?

The answer depends upon context. If this is for a comparison that results in a boolean column then the answers are all null, since comparing anything with null always results in null.

If we're talking about ordering semantics for sorting a column then the answer to the first two are:
NaN==null : false
NaN!=null : true

The other two depend upon the requested null ordering. If nulls are requested first:
NaN>null : true
NaN<null : false
Requesting nulls last ordering reverses those two comparisons.

For grouping semantics, nulls are grouped only with other nulls, NaNs are grouped only with other NaNs.
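The ordering rules above can be sketched as a Python sort key (an illustration of the semantics, not Spark's implementation):

```python
import math

def spark_sort_key(x, nulls_first=True):
    """Total order: nulls at one end, NaN just after +Inf among values."""
    if x is None:                  # null: not a value, sorts at one end
        return (0 if nulls_first else 2, 0.0, 0)
    if math.isnan(x):              # NaN: a value larger than +Inf
        return (1, math.inf, 1)
    return (1, x, 0)

data = [float('nan'), 1.0, None, math.inf, -math.inf, 0.0]
print(sorted(data, key=lambda x: spark_sort_key(x, nulls_first=True)))
# [None, -inf, 0.0, 1.0, inf, nan]
```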

@williamBlazing @jlowe @revans2
If we sort an array with, -Inf, +Inf, NaN, null, NaN, +0, -0, +ve, -ve numbers, what will be their final positions in ascending order?

@jlowe
For Spark, it should be
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
right?
@jlowe
NaN>+Inf is true always.
0==-0 is true.

@kkraus14
For pandas, null is converted to NaN, so the result is
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, NaN] (for na_position='last')
[ NaN, NaN, NaN, -Inf, -ve, 0, -0, +ve, +Inf] (for na_position='first')

@jrhemstad
In C++, if NaNs are present in an array, sorting produces incorrect results because, per IEEE 754, all comparisons with NaN are false except NaN != NaN. This is the reason to define the NaN behavior for sorting (as we do for null).
0 == -0 (in C++)

@williamBlazing For SQL?
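The IEEE 754 behavior that breaks comparison-based sorting can be seen directly in Python:

```python
nan = float('nan')

# Every ordering comparison involving NaN is false ...
assert not (nan < 1.0)
assert not (nan > 1.0)
assert not (nan == nan)
# ... except inequality, which is true
assert nan != nan

# and -0.0 compares equal to 0.0
assert -0.0 == 0.0

# A comparison sort given such a comparator can silently misplace elements:
print(sorted([3.0, nan, 1.0, 2.0]))
```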

@karthikeyann why do you have -0 larger than 0 in all of your examples?

@karthikeyann why do you have -0 larger than 0 in all of your examples?

0 and -0 are equal. It retains input order (indicating stability).

@karthikeyann yes, the ordering you list is correct. Spark treats NaN larger than +Inf when sorting in ascending order. When sorting in descending order, NaN appears before +Inf (i.e.: at the top of the list when ignoring null placement).

Speaking of null placement, Spark users can specify null ordering per column which cudf currently cannot support in all cases. I filed #2713 to track that separately.

[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)

I am pretty sure that for SQL it's the same as Spark in this case.
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)

NaN is a value and is treated as the largest number, whereas null is not a value and is treated differently. If the order by is set to descending, then the ordering is exactly the opposite. Note that in SQL you don't specify nulls_are_smallest, but NULLS FIRST or NULLS LAST. Which means that for NULLS FIRST, we set nulls_are_smallest=true if the ordering is ascending and nulls_are_smallest=false if the ordering is descending, and vice versa.

gdf_order_by benchmark comparison
Benchmarked for 1k, 10k, 100k, 1M, 10M, 100M double elements, input table with 8 columns.
Output order is ascending.

Comparing nonan.all.json to nan.all.json
Benchmark                                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark/double_reverse_1_null_0/1024/8/manual_time                     +0.0648         +0.0684             1             1             1             1
benchmark/double_reverse_1_null_0/10240/8/manual_time                    +0.0615         +0.0599             1             1             1             1
benchmark/double_reverse_1_null_0/102400/8/manual_time                   +0.0549         +0.0532             1             1             1             1
benchmark/double_reverse_1_null_0/1048576/8/manual_time                  +0.0134         +0.0134             3             3             3             3
benchmark/double_reverse_1_null_0/10485760/8/manual_time                 +0.0054         +0.0054            29            29            29            29
benchmark/double_reverse_1_null_0/104857600/8/manual_time                +0.0138         +0.0139           304           309           304           309
benchmark/double_reverse_1_null_1/1024/8/manual_time                     +0.0278         +0.0274             1             1             1             1
benchmark/double_reverse_1_null_1/10240/8/manual_time                    +0.0226         +0.0224             1             1             1             1
benchmark/double_reverse_1_null_1/102400/8/manual_time                   +0.0185         +0.0183             2             2             2             2
benchmark/double_reverse_1_null_1/1048576/8/manual_time                  +0.0090         +0.0089             6             6             6             6
benchmark/double_reverse_1_null_1/10485760/8/manual_time                 +0.0062         +0.0062            38            38            38            38
benchmark/double_reverse_1_null_1/104857600/8/manual_time                +0.0056         +0.0056           360           362           360           362
benchmark/double_reverse_0_null_0/1024/8/manual_time                     +0.0586         +0.0569             1             1             1             1
benchmark/double_reverse_0_null_0/10240/8/manual_time                    +0.0491         +0.0475             1             1             1             1
benchmark/double_reverse_0_null_0/102400/8/manual_time                   +0.0359         +0.0336             1             1             1             1
benchmark/double_reverse_0_null_0/1048576/8/manual_time                  +0.0156         +0.0155             5             5             5             5
benchmark/double_reverse_0_null_0/10485760/8/manual_time                 +0.0026         +0.0025            72            73            72            73
benchmark/double_reverse_0_null_0/104857600/8/manual_time                +0.0051         +0.0051           923           928           923           928
benchmark/double_reverse_0_null_1/1024/8/manual_time                     +0.0352         +0.0346             1             1             1             1
benchmark/double_reverse_0_null_1/10240/8/manual_time                    +0.0031         +0.0033             1             1             1             1
benchmark/double_reverse_0_null_1/102400/8/manual_time                   +0.0198         +0.0197             2             2             2             2
benchmark/double_reverse_0_null_1/1048576/8/manual_time                  +0.0064         -0.0002             6             6             6             6
benchmark/double_reverse_0_null_1/10485760/8/manual_time                 +0.0060         +0.0060            52            52            52            52
benchmark/double_reverse_0_null_1/104857600/8/manual_time                -0.0330         -0.0330           672           650           672           650

reverse_0 → input order is random
reverse_1 → input order is descending
null_0 → no nulls in the input table
null_1 → nulls present in the input table
Time Unit in ms.

Time = (new - old)/old (REF)

hasnulls=0 with NaN support is 0.26% to 6.48% slower.
hasnulls=1 with NaN support is <3.52% slower.

For large inputs (>1M) with NaN support:
~0.5% to 1.38% slower (without nulls),
~0.5% slower (with nulls)

I'm confused because your formula for Time is not a Time, it's a ratio. Are the things in the Time column ratios, or are they times?

Just realized I can scroll right... It's pretty clear that 1 ms resolution is not precise enough for this benchmark.

Is NaN support something we have to turn on globally and slow everything down, or can it be an option?

The first 2 columns are ratios; the other columns are in ms. Only the reporting is in ms; measurement is in ns or ticks.

Yes. It could be controlled via a boolean template parameter, which could be enabled by a function has_nans() (similar to the nullable boolean template parameter).
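A sketch of that dispatch idea in Python (the C++ version would use a boolean template parameter; has_nans and the comparator names here are illustrative):

```python
import math

def has_nans(values):
    return any(math.isnan(v) for v in values)

def make_less(handle_nans):
    """Pick a comparator once, so the no-NaN fast path pays no NaN branch."""
    if not handle_nans:
        return lambda a, b: a < b          # fast path, plain IEEE compare
    def less(a, b):                        # NaN-aware path: NaN sorts largest
        a_nan, b_nan = math.isnan(a), math.isnan(b)
        if a_nan or b_nan:
            return (not a_nan) and b_nan   # only "value < NaN" is true
        return a < b
    return less

data = [2.0, float('nan'), 1.0]
less = make_less(has_nans(data))
print(less(2.0, float('nan')), less(float('nan'), 2.0))   # True False
```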

Comparing ./OLD_bench3.json to ./NEW_bench2.json
Benchmark                                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------
benchmark/double_hasna_0_null_0/1024/8/manual_time                     +0.0612         +0.0591           575           611           597           632
benchmark/double_hasna_0_null_0/10240/8/manual_time                    +0.0513         +0.0500           720           756           741           778
benchmark/double_hasna_0_null_0/102400/8/manual_time                   +0.0611         +0.0601           866           919           887           940
benchmark/double_hasna_0_null_0/1048576/8/manual_time                  +0.0106         +0.0106          4907          4959          4930          4982
benchmark/double_hasna_0_null_0/10485760/8/manual_time                 +0.0022         +0.0021         72382         72541         72409         72565
benchmark/double_hasna_0_null_0/104857600/8/manual_time                +0.0029         +0.0037        925148        927785        924439        927827
benchmark/double_hasna_1_null_1/1024/8/manual_time                     +0.0095         +0.0096           955           964           975           984
benchmark/double_hasna_1_null_1/10240/8/manual_time                    -0.0011         +0.0016          1285          1284          1303          1305
benchmark/double_hasna_1_null_1/102400/8/manual_time                   +0.0059         +0.0061          1799          1810          1820          1832
benchmark/double_hasna_1_null_1/1048576/8/manual_time                  +0.0061         +0.0061          6244          6283          6265          6304
benchmark/double_hasna_1_null_1/10485760/8/manual_time                 +0.0048         +0.0049         52089         52341         52111         52367
benchmark/double_hasna_1_null_1/104857600/8/manual_time                +0.0531         +0.0531        663715        698931        663764        698981
benchmark/double_hasna_1_null_0/1024/8/manual_time                     +0.3206         +0.3130           959          1266           981          1288
benchmark/double_hasna_1_null_0/10240/8/manual_time                    +0.4990         +0.4910          1321          1980          1342          2002
benchmark/double_hasna_1_null_0/102400/8/manual_time                   +0.7335         +0.7238          1781          3087          1802          3106
benchmark/double_hasna_1_null_0/1048576/8/manual_time                  +1.1249         +1.1214          6324         13438          6345         13460
benchmark/double_hasna_1_null_0/10485760/8/manual_time                 +1.4186         +1.4180         52106        126023         52130        126051
benchmark/double_hasna_1_null_0/104857600/8/manual_time                +1.1277         +1.1276        644368       1370994        644417       1371031

With 50% NaN in random locations, the slowdown ranges from 5.3% to 141%. (The slowdown is the same if NaN is replaced with +Inf.)
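The ratio columns are the relative change computed from the raw timings; checking the last hasna_1_null_0 row of the table above:

```python
def rel_change(old, new):
    # Time = (new - old) / old, as defined earlier in this thread
    return (new - old) / old

# last hasna_1_null_0 row: Time Old = 644368, Time New = 1370994
r = rel_change(644368, 1370994)
print(round(r, 4))   # 1.1277, matching the +1.1277 reported in the table
```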

@karthikeyann can you explain here what you explained in the meeting? How is this different from your previous benchmark?

The previous benchmark did not have NaN in the data; the comparison was only for the code change. The benchmark data had only numbers and nulls, because orderby (thrust::sort) crashes with the following error if the data has NaN (I don't know why):

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered
Aborted (core dumped)

The latest benchmark result has NaN values in the data.
hasna_0_null_0 = only numbers
hasna_1_null_1 = 50% numbers and 50% nulls
hasna_1_null_0 = 50% numbers and 50% NaN (nulls are not present)

With the latest code in PR https://github.com/rapidsai/cudf/pull/2561, there is no slowdown if the data has no NaNs.
If the data has NaN, slowdown is 32% to 141%.

Note: operations on columns can generate NaN. cudf currently does not convert NaN to null.

cudf.DataFrame({'a': [None, -4, 0.0, 4]}).sqrt()
Out[30]: 
      a
0  null
1   NaN
2   0.0
3   2.0

@karthikeyann have you profiled where the extra time is being spent? I would expect the cost of isnan(x) to be constant whether or not x == NaN. So where is the extra time coming from?

I assumed it was because it's something like if (isnan(x))... and there's only a perf penalty if that if diverges within warps.

@jlowe can you comment whether you are OK with this performance cost?

I assumed it was because it's something like if (isnan(x))... and there's only a perf penalty if that if diverges within warps.

I'm not sure. Sort is such a bandwidth bound operation, I wouldn't expect a single branch divergence to cause more than 2x slowdown. I think it's worth profiling with Nsight Compute to get the full story.

can you comment whether you are OK with this performance cost?

Like @jrhemstad I'm a bit surprised the hit is that large when NaNs are present. However I'm very happy to see this:

there is no slowdown if the data has no NaNs.

That's the best outcome. No performance hit for the cases that worked for everyone today, and now NaNs are ordered properly if present. Correctness trumps performance, so I think it's worth integrating even with the corner case performance hit. Hopefully we can tackle it in a followup if not addressed here. I do not believe NaNs will be a common case in Spark, but it's important to handle them properly if they are there to avoid corrupted output.

I'm hoping the fixes for other problematic NaN cases like hashing for groupby and join (see #2500) also don't suffer performance hits when NaNs aren't present.

@karthikeyann a good opportunity to get experience with nsight compute if you haven't yet -- it's a pretty powerful tool, once you get used to it. Please share what you find! I agree with @jlowe we should go ahead and get this fix integrated.

I still think that divergence is the cause of the slowdown. When there is one more parameter to branch on, and considering the synthetic data contains exactly 50% NaNs spread randomly, I'd expect the kernel to take approx twice as long. And that's what seems to be happening here. One more idea for checking whether the slowdown is coming strictly from divergence: Make the synthetic data NaN not at random but with some pattern : like the first 50% values are NaNs and the rest valid. Due to sorting's access pattern, we won't have a strict halving of run time but if it affects it in some significant way, then that's a clue.

When there is one more parameter to branch on, and considering the synthetic data contains exactly 50% NaNs spread randomly, I'd expect the kernel to take approx twice as long.

This is predicated on the idea that sort is a compute bound operation. It's really bandwidth bound, and so even doubling the number of instructions should not cause a 2x slowdown.

@karthikeyann have you profiled with Nsight Compute yet?

Does the divergence affect the load/store pattern or timing? If so, it can affect bandwidth efficiency.

If the branch divergence theory were true, then we'd see the same 2x slowdown for a dataset with 50% random null values.

We've already effectively tested what @devavret described here:

One more idea for checking whether the slowdown is coming strictly from divergence: Make the synthetic data NaN not at random but with some pattern : like the first 50% values are NaNs and the rest valid. Due to sorting's access pattern, we won't have a strict halving of run time but if it affects it in some significant way, then that's a clue.

That said, now that I look at the row_inequality_operator with NaN support, it is extremely branchy and its logic could be significantly simplified.

See https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/table/row_operators.cuh#L163 for the updated equivalent of row_inequality_comparator that has more compact logic (it doesn't have a specialization for floating point yet, but the pattern that is there should be used for the basis of simplifying the logic).

According to @karthikeyann , he's already tested another case where he pre-created a separate null mask for all the nan values and replaced all isnan() calls with validity checks. He left the rest of the logic untouched. That case still caused the same slowdown.

According to @karthikeyann , he's already tested another case where he pre-created a separate null mask for all the nan values and replaced all isnan() calls with validity checks. He left the rest of the logic untouched. That case still caused the same slowdown.

How is that different from the hasna_1_null_1 = 50% numbers and 50% nulls test case @karthikeyann described above? Those numbers show only a ~5% slowdown.

@karthikeyann can you open a PR for this? Still interested in profiling it but the slowdown may not be a problem and we want to get this fixed for 0.11.

Note that this was fixed for cudf::sorted_order here: https://github.com/rapidsai/cudf/pull/3239

gdf_order_by is not yet fixed.

Verified that orderby does not work. Is there an existing issue/PR for this we can track for this bug? Thank u!

Verified that orderby does not work. Is there an existing issue/PR for this we can track for this bug? Thank u!

This one!

Since gdf_order_by is legacy, do we need to fix this? Or can we close? @kuhushukla

Since gdf_order_by is legacy, do we need to fix this? Or can we close? @kuhushukla

I will look at verifying if this is still an issue with the libcudf++ version. If you think closing this and reopening a new one for any issues found on the new API is the way to go, I am happy with a close. Thanks!

Thanks @kuhushukla, I'll wait for your verification. Sameer has this on his spark issue tracker.

Fixed by #3239
