**Describe the bug**
The `nan_as_null` parameter affects the output of `sort_values`.
**Steps/Code to reproduce bug**
```python
In [22]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=True)})

In [23]: print(df.sort_values(by='a'))
     a
5  0.0
1  1.0
3  2.0
0
2
4
```

```python
In [19]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)})

In [20]: print(df.sort_values(by='a'))
     a
0  nan
1  1.0
2  nan
3  2.0
4  nan
5  0.0
```
Similar issues occur with other methods that use libcudf APIs, e.g. `drop_duplicates` (which uses sorting):
```python
df = cudf.DataFrame({'a': cudf.Series([1.0, np.nan, 0, np.nan, 0, 1], nan_as_null=False)})

In [10]: print(df)
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0

In [11]: print(df.drop_duplicates())
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0
```
**Expected behavior**
For sorting and `drop_duplicates`, NaN values should be considered equal.
One solution is to add a NaN equality check for float/double values in the libcudf C++ comparators.
Another is to eliminate `nan_as_null` and always convert NaN to null.
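For illustration, a minimal sketch of the first option (a hypothetical helper, not an actual libcudf API):

```cpp
#include <cmath>

// Hypothetical sketch of the first option: an equality predicate for
// float/double that treats NaNs as equal to each other, deviating from
// IEEE 754 (under which NaN == NaN is false).
template <typename T>
bool nan_aware_equal(T lhs, T rhs)
{
  if (std::isnan(lhs) && std::isnan(rhs)) return true;
  return lhs == rhs;
}
```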
@karthikeyann can you please include the pandas results for comparison?
@williamBlazing @kkraus14 @revans2 @harrism
We need to decide how to treat NaN and null for floating point columns in libcudf.
For pandas, NaN is the same as null; cuDF Python should follow the same behavior.
For R, NaN, NA, and NULL are different. NULL is always ignored. Sorted results do not contain NA or NaN. `unique` treats NA and NaN as different. (https://www.r-bloggers.com/difference-between-na-and-nan-in-r/)
libcudf doesn't support R.
For Spark, how are NaN, NA, and null treated in comparisons?
For SQL, how are NaN, NA, and null treated in comparisons?
@karthikeyann Note there's an issue we raised for Spark aggregation specifically that shows what Spark expects: https://github.com/rapidsai/cudf/issues/2500. I just added some information there on how Spark handles nulls and NaN.
In summary:
There's enough variation in how one handles NaN/null that it should be configurable. I'd expect a small struct that has `nulls_are_equal`, at least `nans_are_equal`, and perhaps other settings for `-0.0f == 0.0f` and `nulls_equals_nans` for the other cases you mentioned.
As far as NaN ordering is concerned, Spark considers them to be after positive infinity when sorting floating point values in ascending order.
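For concreteness, a minimal sketch of such an options struct (the first, second, and fourth field names come from the comment above; `zeros_are_equal` is a hypothetical name for the `-0.0f == 0.0f` setting):

```cpp
#include <cmath>

// Minimal sketch of the proposed configuration struct.
struct compare_options {
  bool nulls_are_equal   = true;
  bool nans_are_equal    = false;
  bool zeros_are_equal   = true;   // -0.0f == 0.0f
  bool nulls_equals_nans = false;
};

// Equality of two valid (non-null) values under the options.
template <typename T>
bool equal_with(compare_options const& opt, T lhs, T rhs)
{
  if (std::isnan(lhs) && std::isnan(rhs)) return opt.nans_are_equal;
  if (!opt.zeros_are_equal && lhs == T{0} && rhs == T{0})
    return std::signbit(lhs) == std::signbit(rhs);  // distinguish -0.0 from +0.0
  return lhs == rhs;
}
```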
> There's enough variation of how one handles NaN/null that it should be configurable, I'd expect a small struct that has: `nulls_are_equal`, and at least `nans_are_equal`, but perhaps other settings for `-0.0f == 0.0f`, and `nulls_equals_nans` for the other cases you mentioned.
In several places we already expose options like `nulls_are_equal`. We can possibly add `nans_are_equal`.
However, I'm resistant to providing all the knobs in libcudf to support the _entire_ cross product of behaviors for nulls/NaNs/special floating point values that the various existing data science platforms expect. Providing all of those knobs vastly increases complexity. Increased complexity increases software maintenance cost and the likelihood of bugs, and reduces performance.
I can support parameterizing null behavior, but if we have to start parameterizing special floating point behavior (especially when it deviates from IEEE 754), then that becomes a nightmare for implementation and testing.
@jrhemstad, definitely agree with not providing all of the knobs. That said, I would argue for providing the knobs that can't be handled outside of the group by and sort by.

These could be done in a separate call to `find_and_replace_all`:

- `-0.0f == 0.0f`: just replace `-0.0f` with `0.0f`
- nulls == NaNs: not an issue in Spark, but it seems `find_and_replace_all` could replace NaN with null

These can't be done outside of the group/sort kernels, as far as I know:

- NaN == NaN, NaN > +Infinity

For the performance reduction, could specializations be done in such a way as to remove the NaN checks for the clients that don't need this?
We currently have both `find_and_replace_all` and, as of 0.9, `nans_to_nulls`.
Is it OK to require the end user (e.g. cuDF/Pandas/Spark user) to insert a conversion pass when they know they need it? I can imagine a data scientist getting wrong results and realizing they have NaN entries. They could run `nans_to_nulls` and then re-run to get correct results. But is that acceptable?
Or would cuDF Python and cuDF Spark be required to insert `nans_to_nulls` calls on every floating point input column to algorithms that compare values (sort, unique, etc.)?
I think this is the key question.
I agree with @jrhemstad on not providing too many knobs.
`find_and_replace_all` for Spark and `nans_to_nulls` in Python looks to be a good solution.
@abellina For Spark:

- NaN == NaN: `find_and_replace_all` NaNs with null and the `nulls_are_equal=true` flag for equality comparisons.
- NaN > +Inf: `find_and_replace_all` NaNs with null and pass the `nulls_are_smallest=false` flag for sorting.

For cuDF Python, `nans_to_nulls` is enough. (I verified it: it doesn't change the result data to null, only the mask used for comparison; a sketch of the idea follows below.)
@abellina @jrhemstad Should this be done by the end user or by the cuDF Python/Spark API itself?
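A host-side sketch of what that `nans_to_nulls` behavior amounts to (hypothetical helper; the real API works on device memory and a packed bitmask): the data is left untouched and NaN positions are simply marked invalid in the null mask.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Conceptual nans_to_nulls pass: mark NaN positions invalid; do not
// modify the values themselves.
std::vector<bool> nans_to_null_mask(std::vector<double> const& data,
                                    std::vector<bool> valid)
{
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (valid[i] && std::isnan(data[i])) valid[i] = false;
  }
  return valid;
}
```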
> NaN == NaN: `find_and_replace_all` NaNs with null and the `nulls_are_equal=true` flag for equality comparisons.
> NaN > +Inf: `find_and_replace_all` NaNs with null and pass the `nulls_are_smallest=false` flag for sorting.
Note that Spark and Pandas have different semantics w.r.t. NaNs. In Spark, NaNs and nulls are not semantically equivalent: NaN is a value that is not a number; null is not a value.
Therefore I don't see how replacing NaNs with nulls would be useful for Spark. This would lead to false positives in equality comparisons when columns containing NaNs are compared with columns containing nulls, and to incorrect bucketing of NaNs with nulls in sorting and groupby operations. I only see `find_and_replace_all` being useful to normalize values, like transforming -0 --> 0 and -NaN --> NaN, before calling the functions for equality/sorting/grouping when normalization is desired.
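A sketch of that normalization idea (hypothetical helper; the real pass would be a `find_and_replace_all`-style kernel over device data):

```cpp
#include <cmath>
#include <limits>

// Canonicalize values up front (-0 --> 0, -NaN --> NaN) instead of
// changing comparator semantics.
template <typename T>
T normalize_value(T x)
{
  if (std::isnan(x)) return std::numeric_limits<T>::quiet_NaN();  // one canonical NaN
  if (x == T{0}) return T{0};  // collapses -0.0 into +0.0
  return x;
}
```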
@jlowe what will the results of NaN==null, NaN!=null, NaN>null, NaN<null be in Spark?
The answer depends upon context. If this is for a comparison that results in a boolean column then the answers are all null, since comparing anything with null always results in null.
If we're talking about ordering semantics for sorting a column then the answer to the first two are:
NaN==null : false
NaN!=null : true
The other two depend upon the requested null ordering. If nulls are requested first:
NaN>null : true
NaN<null : false
Requesting nulls last ordering reverses those two comparisons.
For grouping semantics, nulls are grouped only with other nulls, NaNs are grouped only with other NaNs.
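A sketch of those two contexts, modeling null as an empty `std::optional` (hypothetical helpers, not Spark or libcudf code):

```cpp
#include <optional>

// Boolean-result context: comparing anything with null yields null.
std::optional<bool> eq_for_boolean_result(std::optional<double> a,
                                          std::optional<double> b)
{
  if (!a || !b) return std::nullopt;  // null in, null out
  return *a == *b;  // value equality shown IEEE-style; Spark's NaN-equality rule elided
}

// Ordering context for sort: the comparison is total, and null placement
// is a parameter.
bool less_for_sort(std::optional<double> a, std::optional<double> b,
                   bool nulls_first)
{
  if (!a && !b) return false;   // null == null for ordering purposes
  if (!a) return nulls_first;   // null < x  iff nulls are requested first
  if (!b) return !nulls_first;  // x < null  iff nulls are requested last
  return *a < *b;               // NaN placement elided for brevity
}
```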
@williamBlazing @jlowe @revans2
If we sort an array with -Inf, +Inf, NaN, null, NaN, +0, -0, and positive and negative numbers, what will their final positions be in ascending order?
@jlowe
For Spark, it should be
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
right?
@jlowe
NaN>+Inf is true always.
0==-0 is true.
@kkraus14
For pandas, null is converted to NaN, so it is
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, NaN] (for na_position='last')
[ NaN, NaN, NaN, -Inf, -ve, 0, -0, +ve, +Inf] (for na_position='first')
@jrhemstad
In C++, if NaNs are present in an array, the sort produces incorrect results because, per IEEE 754, all comparisons with NaN are false except NaN != NaN. This is the reason to define the NaN behavior for sorting (and also for null).
0 == -0 is true (in C++).
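A standalone illustration in plain C++ (not libcudf code) of why NaN must be given a defined position before sorting:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
  std::vector<double> v{2.0, NAN, 1.0, NAN, 0.0};

  // operator< is not a strict weak ordering once NaNs are present: NaN < x
  // and x < NaN are both false, so NaN is "equivalent" to every element,
  // which breaks transitivity of equivalence and makes std::sort undefined
  // behavior on this input.
  // std::sort(v.begin(), v.end());  // don't: UB with NaNs

  // A NaN-aware comparator restores a strict weak ordering by defining a
  // place for NaN (here: after +Inf, matching the ascending order above).
  std::sort(v.begin(), v.end(), [](double a, double b) {
    if (std::isnan(a)) return false;  // NaN is never less than anything
    if (std::isnan(b)) return true;   // every non-NaN is less than NaN
    return a < b;
  });

  for (double x : v) std::printf("%g ", x);  // 0 1 2 nan nan
  std::printf("\n");
  return 0;
}
```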
@williamBlazing For SQL?
@karthikeyann why do you have -0 larger than 0 in all of your examples?
> @karthikeyann why do you have `-0` larger than `0` in all of your examples?
0 and -0 are equal; the output retains input order (indicating a stable sort).
@karthikeyann yes, the ordering you list is correct. Spark treats NaN larger than +Inf when sorting in ascending order. When sorting in descending order, NaN appears before +Inf (i.e.: at the top of the list when ignoring null placement).
Speaking of null placement, Spark users can specify null ordering per column which cudf currently cannot support in all cases. I filed #2713 to track that separately.
> [-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
> [null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
I am pretty sure that for SQL it's the same as Spark in this case.
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
NaN is a value and it's treated as the largest number, whereas null is not a value and is treated differently. If the order by is set to descending, then the ordering would be exactly the opposite. Note that in SQL, you don't specify nulls_are_smallest, but NULLS FIRST or NULLS LAST. This means that for NULLS FIRST, we set nulls_are_smallest=true if the ordering is ascending and nulls_are_smallest=false if the ordering is descending, and vice versa.
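A small sketch of that mapping (hypothetical names):

```cpp
// SQL exposes NULLS FIRST / NULLS LAST; the nulls_are_smallest flag is
// derived from the placement combined with the sort direction.
enum class null_placement { FIRST, LAST };

bool nulls_are_smallest(null_placement placement, bool ascending)
{
  // NULLS FIRST + ascending  -> nulls are smallest (true)
  // NULLS FIRST + descending -> nulls are largest  (false), and vice versa
  return (placement == null_placement::FIRST) == ascending;
}
```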
`gdf_order_by` benchmark comparison
Benchmarked for 1k, 10k, 100k, 1M, 10M, 100M double elements, input table with 8 columns.
Output order is ascending.
```
Comparing nonan.all.json to nan.all.json
Benchmark                                                      Time     CPU   Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------------------------
benchmark/double_reverse_1_null_0/1024/8/manual_time        +0.0648 +0.0684        1         1        1        1
benchmark/double_reverse_1_null_0/10240/8/manual_time       +0.0615 +0.0599        1         1        1        1
benchmark/double_reverse_1_null_0/102400/8/manual_time      +0.0549 +0.0532        1         1        1        1
benchmark/double_reverse_1_null_0/1048576/8/manual_time     +0.0134 +0.0134        3         3        3        3
benchmark/double_reverse_1_null_0/10485760/8/manual_time    +0.0054 +0.0054       29        29       29       29
benchmark/double_reverse_1_null_0/104857600/8/manual_time   +0.0138 +0.0139      304       309      304      309
benchmark/double_reverse_1_null_1/1024/8/manual_time        +0.0278 +0.0274        1         1        1        1
benchmark/double_reverse_1_null_1/10240/8/manual_time       +0.0226 +0.0224        1         1        1        1
benchmark/double_reverse_1_null_1/102400/8/manual_time      +0.0185 +0.0183        2         2        2        2
benchmark/double_reverse_1_null_1/1048576/8/manual_time     +0.0090 +0.0089        6         6        6        6
benchmark/double_reverse_1_null_1/10485760/8/manual_time    +0.0062 +0.0062       38        38       38       38
benchmark/double_reverse_1_null_1/104857600/8/manual_time   +0.0056 +0.0056      360       362      360      362
benchmark/double_reverse_0_null_0/1024/8/manual_time        +0.0586 +0.0569        1         1        1        1
benchmark/double_reverse_0_null_0/10240/8/manual_time       +0.0491 +0.0475        1         1        1        1
benchmark/double_reverse_0_null_0/102400/8/manual_time      +0.0359 +0.0336        1         1        1        1
benchmark/double_reverse_0_null_0/1048576/8/manual_time     +0.0156 +0.0155        5         5        5        5
benchmark/double_reverse_0_null_0/10485760/8/manual_time    +0.0026 +0.0025       72        73       72       73
benchmark/double_reverse_0_null_0/104857600/8/manual_time   +0.0051 +0.0051      923       928      923      928
benchmark/double_reverse_0_null_1/1024/8/manual_time        +0.0352 +0.0346        1         1        1        1
benchmark/double_reverse_0_null_1/10240/8/manual_time       +0.0031 +0.0033        1         1        1        1
benchmark/double_reverse_0_null_1/102400/8/manual_time      +0.0198 +0.0197        2         2        2        2
benchmark/double_reverse_0_null_1/1048576/8/manual_time     +0.0064 -0.0002        6         6        6        6
benchmark/double_reverse_0_null_1/10485760/8/manual_time    +0.0060 +0.0060       52        52       52       52
benchmark/double_reverse_0_null_1/104857600/8/manual_time   -0.0330 -0.0330      672       650      672      650
```
- `reverse_0` → input order is random
- `reverse_1` → input order is descending
- `null_0` → no nulls in the input table
- `null_1` → nulls present in the input table

Time unit is ms.
Time = (new - old)/old (REF); for example, +0.0648 means the new code is 6.48% slower.
hasnulls=0 with NaN support is 0.26% to 6.48% slower.
hasnulls=1 with NaN support is <3.52% slower.
For large inputs (>1M) with NaN support:
~0.5% to 1.38% slower (without nulls),
~0.5% slower (with nulls).
I'm confused because your formula for Time is not a Time, it's a ratio. Are the things in the Time column ratios, or are they times?
Just realized I can scroll right... It's pretty clear that 1 ms resolution is not precise enough for this benchmark.
Is NaN support something we have to turn on globally and slow everything down, or can it be an option?
The first 2 columns are ratios; the other columns are times in ms. Only the reporting is in ms; measurement is in ns or ticks.
Yes. It could be controlled via a boolean template parameter, which could be enabled by a function `has_nans()` (similar to the nullable boolean template parameter).
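A sketch of what that could look like (`ascending_less` and the parameter name are hypothetical):

```cpp
#include <cmath>

// NaN handling selected by a boolean template parameter: the has_nans=false
// instantiation compiles down to the plain comparison and pays no runtime
// cost, mirroring the existing nullable template parameter pattern.
template <typename T, bool has_nans>
struct ascending_less {
  bool operator()(T lhs, T rhs) const
  {
    if constexpr (has_nans) {
      if (std::isnan(lhs)) return false;  // NaN sorts after everything
      if (std::isnan(rhs)) return true;
    }
    return lhs < rhs;
  }
};
```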
```
Comparing ./OLD_bench3.json to ./NEW_bench2.json
Benchmark                                                      Time     CPU   Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------------------------
benchmark/double_hasna_0_null_0/1024/8/manual_time          +0.0612 +0.0591      575       611      597      632
benchmark/double_hasna_0_null_0/10240/8/manual_time         +0.0513 +0.0500      720       756      741      778
benchmark/double_hasna_0_null_0/102400/8/manual_time        +0.0611 +0.0601      866       919      887      940
benchmark/double_hasna_0_null_0/1048576/8/manual_time       +0.0106 +0.0106     4907      4959     4930     4982
benchmark/double_hasna_0_null_0/10485760/8/manual_time      +0.0022 +0.0021    72382     72541    72409    72565
benchmark/double_hasna_0_null_0/104857600/8/manual_time     +0.0029 +0.0037   925148    927785   924439   927827
benchmark/double_hasna_1_null_1/1024/8/manual_time          +0.0095 +0.0096      955       964      975      984
benchmark/double_hasna_1_null_1/10240/8/manual_time         -0.0011 +0.0016     1285      1284     1303     1305
benchmark/double_hasna_1_null_1/102400/8/manual_time        +0.0059 +0.0061     1799      1810     1820     1832
benchmark/double_hasna_1_null_1/1048576/8/manual_time       +0.0061 +0.0061     6244      6283     6265     6304
benchmark/double_hasna_1_null_1/10485760/8/manual_time      +0.0048 +0.0049    52089     52341    52111    52367
benchmark/double_hasna_1_null_1/104857600/8/manual_time     +0.0531 +0.0531   663715    698931   663764   698981
benchmark/double_hasna_1_null_0/1024/8/manual_time          +0.3206 +0.3130      959      1266      981     1288
benchmark/double_hasna_1_null_0/10240/8/manual_time         +0.4990 +0.4910     1321      1980     1342     2002
benchmark/double_hasna_1_null_0/102400/8/manual_time        +0.7335 +0.7238     1781      3087     1802     3106
benchmark/double_hasna_1_null_0/1048576/8/manual_time       +1.1249 +1.1214     6324     13438     6345    13460
benchmark/double_hasna_1_null_0/10485760/8/manual_time      +1.4186 +1.4180    52106    126023    52130   126051
benchmark/double_hasna_1_null_0/104857600/8/manual_time     +1.1277 +1.1276   644368   1370994   644417  1371031
```
With 50% NaN in random locations, the slowdown ranges from 5.3% to 141%. (The slowdown is the same if NaN is replaced with +Inf.)
@karthikeyann can you explain here what you explained in the meeting? How is this different from your previous benchmark?
The previous benchmark did not have NaN in the data; the comparison was only for the code change, and the benchmark data had numbers and nulls only. That's because orderby (`thrust::sort`) crashes if the data has NaN, with the following error (I don't know why):
```
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered
Aborted (core dumped)
```
The latest benchmark result has NaN values in the data.
- `hasna_0_null_0` = only numbers
- `hasna_1_null_1` = 50% numbers and 50% nulls
- `hasna_1_null_0` = 50% numbers and 50% NaN (nulls are not present)
With the latest code in PR https://github.com/rapidsai/cudf/pull/2561, there is no slowdown if the data has no NaNs.
If the data has NaNs, the slowdown is 32% to 141%.
Note: operations on columns can generate NaN, and no conversion of NaN to null happens in cuDF today:
```python
In [30]: cudf.DataFrame({'a': [None, -4, 0.0, 4]}).sqrt()
Out[30]:
      a
0  null
1   NaN
2   0.0
3   2.0
```
@karthikeyann have you profiled where the extra time is being spent? I would expect the cost of `isnan(x)` to be constant whether or not x is NaN. So where is the extra time coming from?
I assumed it was because it's something like `if (isnan(x))`... and there's only a perf penalty if that `if` diverges within warps.
@jlowe can you comment whether you are OK with this performance cost?
> I assumed it was because it's something like `if (isnan(x))`... and there's only a perf penalty if that `if` diverges within warps.
I'm not sure. Sort is such a bandwidth bound operation, I wouldn't expect a single branch divergence to cause more than 2x slowdown. I think it's worth profiling with Nsight Compute to get the full story.
> can you comment whether you are OK with this performance cost?
Like @jrhemstad I'm a bit surprised the hit is that large when NaNs are present. However I'm very happy to see this:
> there is no slowdown if the data has no NaNs.
That's the best outcome. No performance hit for the cases that worked for everyone today, and now NaNs are ordered properly if present. Correctness trumps performance, so I think it's worth integrating even with the corner case performance hit. Hopefully we can tackle it in a followup if not addressed here. I do not believe NaNs will be a common case in Spark, but it's important to handle them properly if they are there to avoid corrupted output.
I'm hoping the fixes for other problematic NaN cases like hashing for groupby and join (see #2500) also don't suffer performance hits when NaNs aren't present.
@karthikeyann a good opportunity to get experience with Nsight Compute if you haven't yet -- it's a pretty powerful tool once you get used to it. Please share what you find! I agree with @jlowe that we should go ahead and get this fix integrated.
I still think that divergence is the cause of the slowdown. When there is one more parameter to branch on, and considering the synthetic data contains exactly 50% NaNs spread randomly, I'd expect the kernel to take approximately twice as long. And that's what seems to be happening here. One more idea for checking whether the slowdown is coming strictly from divergence: make the synthetic data NaN not at random but with some pattern, like the first 50% of values NaN and the rest valid. Due to sorting's access pattern, we won't see a strict halving of run time, but if it affects it in some significant way, then that's a clue.
> When there is one more parameter to branch on, and considering the synthetic data contains exactly 50% NaNs spread randomly, I'd expect the kernel to take approx twice as long.
This is predicated on the idea that sort is a compute bound operation. It's really bandwidth bound, and so even doubling the number of instructions should not cause a 2x slowdown.
@karthikeyann have you profiled with Nsight Compute yet?
Does the divergence affect the load/store pattern or timing? If so, it can affect bandwidth efficiency.
If the branch divergence theory were true, then we'd see the same 2x slowdown for a dataset with 50% random null values.
We've already effectively tested what @devavret described here:
> One more idea for checking whether the slowdown is coming strictly from divergence: make the synthetic data NaN not at random but with some pattern, like the first 50% values NaN and the rest valid. Due to sorting's access pattern, we won't have a strict halving of run time, but if it affects it in some significant way, then that's a clue.
That said, now that I look at the row_inequality_operator with NaN support, it is extremely branchy and its logic could be significantly simplified.
See https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/table/row_operators.cuh#L163 for the updated equivalent of row_inequality_comparator that has more compact logic (it doesn't have a specialization for floating point yet, but the pattern that is there should be used as the basis for simplifying the logic).
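As an illustration of the more compact pattern, a hypothetical floating-point specialization of a three-way comparison that gives NaN a fixed position (greatest, per the Spark ordering discussed earlier) instead of branching on `isnan()` at every call site:

```cpp
#include <cmath>

enum class weak_ordering { LESS, EQUIVALENT, GREATER };

// Hypothetical sketch in the style of the row_operators.cuh pattern linked
// above: a single three-way comparison with a total order over NaNs.
template <typename T>
weak_ordering compare_floating(T lhs, T rhs)
{
  bool const lhs_nan = std::isnan(lhs);
  bool const rhs_nan = std::isnan(rhs);
  if (lhs_nan && rhs_nan) return weak_ordering::EQUIVALENT;
  if (lhs_nan) return weak_ordering::GREATER;  // NaN sorts last, ascending
  if (rhs_nan) return weak_ordering::LESS;
  if (lhs < rhs) return weak_ordering::LESS;
  if (lhs > rhs) return weak_ordering::GREATER;
  return weak_ordering::EQUIVALENT;
}
```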
According to @karthikeyann, he's already tested another case where he pre-created a separate null mask for all the NaN values and replaced all `isnan()` calls with validity checks. He left the rest of the logic untouched. That case still caused the same slowdown.
> According to @karthikeyann, he's already tested another case where he pre-created a separate null mask for all the NaN values and replaced all `isnan()` calls with validity checks. He left the rest of the logic untouched. That case still caused the same slowdown.
How is that different from the `hasna_1_null_1` (50% numbers and 50% nulls) test case @karthikeyann described above? Those numbers show only a ~5% slowdown.
@karthikeyann can you open a PR for this? Still interested in profiling it but the slowdown may not be a problem and we want to get this fixed for 0.11.
Note that this was fixed for `cudf::sorted_order` here: https://github.com/rapidsai/cudf/pull/3239
`gdf_order_by` is not yet fixed.
Verified that orderby does not work. Is there an existing issue/PR that we can track for this bug? Thank you!
> Verified that orderby does not work. Is there an existing issue/PR that we can track for this bug? Thank you!
This one!
Since `gdf_order_by` is legacy, do we need to fix this? Or can we close? @kuhushukla
> Since `gdf_order_by` is legacy, do we need to fix this? Or can we close? @kuhushukla
I will look at verifying if this is still an issue with the libcudf++ version. If you think closing this and reopening a new one for any issues found on the new API is the way to go, I am happy with a close. Thanks!
Thanks @kuhushukla, I'll wait for your verification. Sameer has this on his spark issue tracker.
Fixed by #3239