**Describe the bug**
The `nan_as_null` parameter affects the output of `sort_values`.
**Steps/Code to reproduce bug**
```python
In [22]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=True)})

In [23]: print(df.sort_values(by='a'))
     a
5  0.0
1  1.0
3  2.0
0
2
4
```

```python
In [19]: df = cudf.DataFrame({'a': cudf.Series([np.nan, 1.0, np.nan, 2.0, np.nan, 0.0], nan_as_null=False)})

In [20]: print(df.sort_values(by='a'))
     a
0  nan
1  1.0
2  nan
3  2.0
4  nan
5  0.0
```
Similar issues occur with other methods that use libcudf APIs, e.g. `drop_duplicates` (which uses sorting):
```python
df = cudf.DataFrame({'a': cudf.Series([1.0, np.nan, 0, np.nan, 0, 1], nan_as_null=False)})

In [10]: print(df)
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0

In [11]: print(df.drop_duplicates())
     a
0  1.0
1  nan
2  0.0
3  nan
4  0.0
5  1.0
```
**Expected behavior**
For sorting and `drop_duplicates`, NaN values should be considered equal.
One solution is to add a NaN equality check for float/double values in the libcudf C++ comparators.
Another is to eliminate `nan_as_null` and always convert NaN to null.
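For illustration, a minimal sketch of the first option (a hypothetical helper, not an actual libcudf API):

```cpp
#include <cmath>

// Hypothetical sketch of the first option: an equality predicate for
// float/double that treats NaNs as equal to each other, deviating from
// IEEE 754 (under which NaN == NaN is false).
template <typename T>
bool nan_aware_equal(T lhs, T rhs)
{
  if (std::isnan(lhs) && std::isnan(rhs)) return true;
  return lhs == rhs;
}
```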
@karthikeyann can you please include the pandas results for comparison?
@williamBlazing @kkraus14 @revans2 @harrism
We need to decide how to treat NaN and null for floating point columns in libcudf.
For pandas, NaN is the same as null; cuDF Python should follow the same behavior.
For R, NaN, NA, and NULL are different. NULL is always ignored. Sorted results do not contain NA or NaN. `unique` treats NA and NaN as different. (https://www.r-bloggers.com/difference-between-na-and-nan-in-r/)
libcudf doesn't support R.
For Spark, how are NaN, NA, and null treated in comparisons?
For SQL, how are NaN, NA, and null treated in comparisons?
@karthikeyann Note there's an issue we raised for Spark aggregation specifically that shows what Spark expects: https://github.com/rapidsai/cudf/issues/2500. I just added some information there on how Spark handles nulls and NaN.
In summary:
There's enough variation in how one handles NaN/null that it should be configurable. I'd expect a small struct that has `nulls_are_equal`, at least `nans_are_equal`, and perhaps other settings for `-0.0f == 0.0f` and `nulls_equals_nans` for the other cases you mentioned.
As far as NaN ordering is concerned, Spark considers them to be after positive infinity when sorting floating point values in ascending order.
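For concreteness, a minimal sketch of such an options struct (the first, second, and fourth field names come from the comment above; `zeros_are_equal` is a hypothetical name for the `-0.0f == 0.0f` setting):

```cpp
#include <cmath>

// Minimal sketch of the proposed configuration struct.
struct compare_options {
  bool nulls_are_equal   = true;
  bool nans_are_equal    = false;
  bool zeros_are_equal   = true;   // -0.0f == 0.0f
  bool nulls_equals_nans = false;
};

// Equality of two valid (non-null) values under the options.
template <typename T>
bool equal_with(compare_options const& opt, T lhs, T rhs)
{
  if (std::isnan(lhs) && std::isnan(rhs)) return opt.nans_are_equal;
  if (!opt.zeros_are_equal && lhs == T{0} && rhs == T{0})
    return std::signbit(lhs) == std::signbit(rhs);  // distinguish -0.0 from +0.0
  return lhs == rhs;
}
```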
> There's enough variation of how one handles NaN/null that it should be configurable, I'd expect a small struct that has: `nulls_are_equal`, and at least `nans_are_equal`, but perhaps other settings for `-0.0f == 0.0f`, and `nulls_equals_nans` for the other cases you mentioned.
In several places we already expose options like `nulls_are_equal`. We can possibly add `nans_are_equal`.
However, I'm resistant to providing all the knobs in libcudf to support the _entire_ cross product of behaviors for nulls/NaNs/special floating point values that the various existing data science platforms expect. Providing all of those knobs vastly increases complexity. Increased complexity increases software maintenance cost and the likelihood of bugs, and reduces performance.
I can support parameterizing null behavior, but if we have to start parameterizing special floating point behavior (especially when it deviates from IEEE 754), then that becomes a nightmare for implementation and testing.
@jrhemstad, definitely agree with not providing all of the knobs. That said, I would argue for providing the knobs that can't be handled outside of the group by and sort by.

These could be done in a separate call to `find_and_replace_all`:

- `-0.0f == 0.0f`: just replace `-0.0f` with `0.0f`
- nulls == NaNs: not an issue in Spark, but it seems `find_and_replace_all` could replace NaN with null

These can't be done outside of the group/sort kernels, as far as I know:

- NaN == NaN, NaN > +Infinity

For the performance reduction, could specializations be done in such a way as to remove the NaN checks for the clients that don't need this?
We currently have both `find_and_replace_all` and, as of 0.9, `nans_to_nulls`.
Is it OK to require the end user (e.g. cuDF/Pandas/Spark user) to insert a conversion pass when they know they need it? I can imagine a data scientist getting wrong results and realizing they have NaN entries. They could run `nans_to_nulls` and then re-run to get correct results. But is that acceptable?
Or would cuDF Python and cuDF Spark be required to insert `nans_to_nulls` calls on every floating point input column to algorithms that compare values (sort, unique, etc.)?
I think this is the key question.
I agree with @jrhemstad on not providing too many knobs.
`find_and_replace_all` for Spark and `nans_to_nulls` in Python looks to be a good solution.
@abellina For Spark:

- NaN == NaN: `find_and_replace_all` NaNs with null and the `nulls_are_equal=true` flag for equality comparisons.
- NaN > +Inf: `find_and_replace_all` NaNs with null and pass the `nulls_are_smallest=false` flag for sorting.

For cuDF Python, `nans_to_nulls` is enough. (I verified it: it doesn't change the result data to null, only the mask used for comparison; a sketch of the idea follows below.)
@abellina @jrhemstad Should this be done by the end user or by the cuDF Python/Spark API itself?
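A host-side sketch of what that `nans_to_nulls` behavior amounts to (hypothetical helper; the real API works on device memory and a packed bitmask): the data is left untouched and NaN positions are simply marked invalid in the null mask.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Conceptual nans_to_nulls pass: mark NaN positions invalid; do not
// modify the values themselves.
std::vector<bool> nans_to_null_mask(std::vector<double> const& data,
                                    std::vector<bool> valid)
{
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (valid[i] && std::isnan(data[i])) valid[i] = false;
  }
  return valid;
}
```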
> NaN == NaN: `find_and_replace_all` NaNs with null and the `nulls_are_equal=true` flag for equality comparisons.
> NaN > +Inf: `find_and_replace_all` NaNs with null and pass the `nulls_are_smallest=false` flag for sorting.
Note that Spark and Pandas have different semantics w.r.t. NaNs. In Spark, NaNs and nulls are not semantically equivalent: NaN is a value that is not a number; null is not a value.
Therefore I don't see how replacing NaNs with nulls would be useful for Spark. This would lead to false positives in equality comparisons when columns containing NaNs are compared with columns containing nulls, and to incorrect bucketing of NaNs with nulls in sorting and groupby operations. I only see `find_and_replace_all` being useful to normalize values, like transforming -0 --> 0 and -NaN --> NaN, before calling the functions for equality/sorting/grouping when normalization is desired.
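A sketch of that normalization idea (hypothetical helper; the real pass would be a `find_and_replace_all`-style kernel over device data):

```cpp
#include <cmath>
#include <limits>

// Canonicalize values up front (-0 --> 0, -NaN --> NaN) instead of
// changing comparator semantics.
template <typename T>
T normalize_value(T x)
{
  if (std::isnan(x)) return std::numeric_limits<T>::quiet_NaN();  // one canonical NaN
  if (x == T{0}) return T{0};  // collapses -0.0 into +0.0
  return x;
}
```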
@jlowe what will the results of NaN==null, NaN!=null, NaN>null, NaN<null be in Spark?
The answer depends upon context. If this is for a comparison that results in a boolean column then the answers are all null, since comparing anything with null always results in null.
If we're talking about ordering semantics for sorting a column then the answer to the first two are:
NaN==null : false
NaN!=null : true
The other two depend upon the requested null ordering. If nulls are requested first:
NaN>null : true
NaN<null : false
Requesting nulls last ordering reverses those two comparisons.
For grouping semantics, nulls are grouped only with other nulls, NaNs are grouped only with other NaNs.
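A sketch of those two contexts, modeling null as an empty `std::optional` (hypothetical helpers, not Spark or libcudf code):

```cpp
#include <optional>

// Boolean-result context: comparing anything with null yields null.
std::optional<bool> eq_for_boolean_result(std::optional<double> a,
                                          std::optional<double> b)
{
  if (!a || !b) return std::nullopt;  // null in, null out
  return *a == *b;  // value equality shown IEEE-style; Spark's NaN-equality rule elided
}

// Ordering context for sort: the comparison is total, and null placement
// is a parameter.
bool less_for_sort(std::optional<double> a, std::optional<double> b,
                   bool nulls_first)
{
  if (!a && !b) return false;   // null == null for ordering purposes
  if (!a) return nulls_first;   // null < x  iff nulls are requested first
  if (!b) return !nulls_first;  // x < null  iff nulls are requested last
  return *a < *b;               // NaN placement elided for brevity
}
```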
@williamBlazing @jlowe @revans2
If we sort an array with -Inf, +Inf, NaN, null, NaN, +0, -0, and positive and negative numbers, what will their final positions be in ascending order?
@jlowe
For Spark, it should be
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
right?
@jlowe
NaN>+Inf is true always.
0==-0 is true.
@kkraus14
For pandas, null is converted to NaN, so it is
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, NaN] (for na_position='last')
[ NaN, NaN, NaN, -Inf, -ve, 0, -0, +ve, +Inf] (for na_position='first')
@jrhemstad
In C++, if NaNs are present in an array, the sort produces incorrect results because, per IEEE 754, all comparisons with NaN are false except NaN != NaN. This is the reason to define the NaN behavior for sorting (and also for null).
0 == -0 is true (in C++).
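A standalone illustration in plain C++ (not libcudf code) of why NaN must be given a defined position before sorting:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
  std::vector<double> v{2.0, NAN, 1.0, NAN, 0.0};

  // operator< is not a strict weak ordering once NaNs are present: NaN < x
  // and x < NaN are both false, so NaN is "equivalent" to every element,
  // which breaks transitivity of equivalence and makes std::sort undefined
  // behavior on this input.
  // std::sort(v.begin(), v.end());  // don't: UB with NaNs

  // A NaN-aware comparator restores a strict weak ordering by defining a
  // place for NaN (here: after +Inf, matching the ascending order above).
  std::sort(v.begin(), v.end(), [](double a, double b) {
    if (std::isnan(a)) return false;  // NaN is never less than anything
    if (std::isnan(b)) return true;   // every non-NaN is less than NaN
    return a < b;
  });

  for (double x : v) std::printf("%g ", x);  // 0 1 2 nan nan
  std::printf("\n");
  return 0;
}
```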
@williamBlazing For SQL?
@karthikeyann why do you have -0 larger than 0 in all of your examples?
> @karthikeyann why do you have `-0` larger than `0` in all of your examples?
0 and -0 are equal; the output retains input order (indicating a stable sort).
@karthikeyann yes, the ordering you list is correct. Spark treats NaN larger than +Inf when sorting in ascending order. When sorting in descending order, NaN appears before +Inf (i.e.: at the top of the list when ignoring null placement).
Speaking of null placement, Spark users can specify null ordering per column which cudf currently cannot support in all cases. I filed #2713 to track that separately.
> [-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
> [null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
I am pretty sure that for SQL it's the same as Spark in this case.
[-Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN, null] (for nulls_are_smallest=false)
[null, -Inf, -ve, 0, -0, +ve, +Inf, NaN, NaN] (for nulls_are_smallest=true)
NaN is a value and it's treated as the largest number, whereas null is not a value and is treated differently. If the order by is set to descending, then the ordering would be exactly the opposite. Note that in SQL, you don't specify nulls_are_smallest, but NULLS FIRST or NULLS LAST. This means that for NULLS FIRST, we set nulls_are_smallest=true if the ordering is ascending and nulls_are_smallest=false if the ordering is descending, and vice versa.
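A small sketch of that mapping (hypothetical names):

```cpp
// SQL exposes NULLS FIRST / NULLS LAST; the nulls_are_smallest flag is
// derived from the placement combined with the sort direction.
enum class null_placement { FIRST, LAST };

bool nulls_are_smallest(null_placement placement, bool ascending)
{
  // NULLS FIRST + ascending  -> nulls are smallest (true)
  // NULLS FIRST + descending -> nulls are largest  (false), and vice versa
  return (placement == null_placement::FIRST) == ascending;
}
```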
`gdf_order_by` benchmark comparison
Benchmarked for 1k, 10k, 100k, 1M, 10M, 100M double elements, input table with 8 columns.
Output order is ascending.
```
Comparing nonan.all.json to nan.all.json
Benchmark                                                      Time     CPU   Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------------------------
benchmark/double_reverse_1_null_0/1024/8/manual_time        +0.0648 +0.0684        1         1        1        1
benchmark/double_reverse_1_null_0/10240/8/manual_time       +0.0615 +0.0599        1         1        1        1
benchmark/double_reverse_1_null_0/102400/8/manual_time      +0.0549 +0.0532        1         1        1        1
benchmark/double_reverse_1_null_0/1048576/8/manual_time     +0.0134 +0.0134        3         3        3        3
benchmark/double_reverse_1_null_0/10485760/8/manual_time    +0.0054 +0.0054       29        29       29       29
benchmark/double_reverse_1_null_0/104857600/8/manual_time   +0.0138 +0.0139      304       309      304      309
benchmark/double_reverse_1_null_1/1024/8/manual_time        +0.0278 +0.0274        1         1        1        1
benchmark/double_reverse_1_null_1/10240/8/manual_time       +0.0226 +0.0224        1         1        1        1
benchmark/double_reverse_1_null_1/102400/8/manual_time      +0.0185 +0.0183        2         2        2        2
benchmark/double_reverse_1_null_1/1048576/8/manual_time     +0.0090 +0.0089        6         6        6        6
benchmark/double_reverse_1_null_1/10485760/8/manual_time    +0.0062 +0.0062       38        38       38       38
benchmark/double_reverse_1_null_1/104857600/8/manual_time   +0.0056 +0.0056      360       362      360      362
benchmark/double_reverse_0_null_0/1024/8/manual_time        +0.0586 +0.0569        1         1        1        1
benchmark/double_reverse_0_null_0/10240/8/manual_time       +0.0491 +0.0475        1         1        1        1
benchmark/double_reverse_0_null_0/102400/8/manual_time      +0.0359 +0.0336        1         1        1        1
benchmark/double_reverse_0_null_0/1048576/8/manual_time     +0.0156 +0.0155        5         5        5        5
benchmark/double_reverse_0_null_0/10485760/8/manual_time    +0.0026 +0.0025       72        73       72       73
benchmark/double_reverse_0_null_0/104857600/8/manual_time   +0.0051 +0.0051      923       928      923      928
benchmark/double_reverse_0_null_1/1024/8/manual_time        +0.0352 +0.0346        1         1        1        1
benchmark/double_reverse_0_null_1/10240/8/manual_time       +0.0031 +0.0033        1         1        1        1
benchmark/double_reverse_0_null_1/102400/8/manual_time      +0.0198 +0.0197        2         2        2        2
benchmark/double_reverse_0_null_1/1048576/8/manual_time     +0.0064 -0.0002        6         6        6        6
benchmark/double_reverse_0_null_1/10485760/8/manual_time    +0.0060 +0.0060       52        52       52       52
benchmark/double_reverse_0_null_1/104857600/8/manual_time   -0.0330 -0.0330      672       650      672      650
```
- `reverse_0` → input order is random
- `reverse_1` → input order is descending
- `null_0` → no nulls in the input table
- `null_1` → nulls present in the input table

Time unit is ms.
Time = (new - old)/old (REF); for example, +0.0648 means the new code is 6.48% slower.
hasnulls=0 with NaN support is 0.26% to 6.48% slower.
hasnulls=1 with NaN support is <3.52% slower.
For large inputs (>1M) with NaN support:
~0.5% to 1.38% slower (without nulls),
~0.5% slower (with nulls).
I'm confused because your formula for Time is not a Time, it's a ratio. Are the things in the Time column ratios, or are they times?
Just realized I can scroll right... It's pretty clear that 1 ms resolution is not precise enough for this benchmark.
Is NaN support something we have to turn on globally and slow everything down, or can it be an option?
The first 2 columns are ratios; the other columns are times in ms. Only the reporting is in ms; measurement is in ns or ticks.
Yes. It could be controlled via a boolean template parameter, which could be enabled by a function `has_nans()` (similar to the nullable boolean template parameter).
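A sketch of what that could look like (`ascending_less` and the parameter name are hypothetical):

```cpp
#include <cmath>

// NaN handling selected by a boolean template parameter: the has_nans=false
// instantiation compiles down to the plain comparison and pays no runtime
// cost, mirroring the existing nullable template parameter pattern.
template <typename T, bool has_nans>
struct ascending_less {
  bool operator()(T lhs, T rhs) const
  {
    if constexpr (has_nans) {
      if (std::isnan(lhs)) return false;  // NaN sorts after everything
      if (std::isnan(rhs)) return true;
    }
    return lhs < rhs;
  }
};
```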
```
Comparing ./OLD_bench3.json to ./NEW_bench2.json
Benchmark                                                      Time     CPU   Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------------------------
benchmark/double_hasna_0_null_0/1024/8/manual_time          +0.0612 +0.0591      575       611      597      632
benchmark/double_hasna_0_null_0/10240/8/manual_time         +0.0513 +0.0500      720       756      741      778
benchmark/double_hasna_0_null_0/102400/8/manual_time        +0.0611 +0.0601      866       919      887      940
benchmark/double_hasna_0_null_0/1048576/8/manual_time       +0.0106 +0.0106     4907      4959     4930     4982
benchmark/double_hasna_0_null_0/10485760/8/manual_time      +0.0022 +0.0021    72382     72541    72409    72565
benchmark/double_hasna_0_null_0/104857600/8/manual_time     +0.0029 +0.0037   925148    927785   924439   927827
benchmark/double_hasna_1_null_1/1024/8/manual_time          +0.0095 +0.0096      955       964      975      984
benchmark/double_hasna_1_null_1/10240/8/manual_time         -0.0011 +0.0016     1285      1284     1303     1305
benchmark/double_hasna_1_null_1/102400/8/manual_time        +0.0059 +0.0061     1799      1810     1820     1832
benchmark/double_hasna_1_null_1/1048576/8/manual_time       +0.0061 +0.0061     6244      6283     6265     6304
benchmark/double_hasna_1_null_1/10485760/8/manual_time      +0.0048 +0.0049    52089     52341    52111    52367
benchmark/double_hasna_1_null_1/104857600/8/manual_time     +0.0531 +0.0531   663715    698931   663764   698981
benchmark/double_hasna_1_null_0/1024/8/manual_time          +0.3206 +0.3130      959      1266      981     1288
benchmark/double_hasna_1_null_0/10240/8/manual_time         +0.4990 +0.4910     1321      1980     1342     2002
benchmark/double_hasna_1_null_0/102400/8/manual_time        +0.7335 +0.7238     1781      3087     1802     3106
benchmark/double_hasna_1_null_0/1048576/8/manual_time       +1.1249 +1.1214     6324     13438     6345    13460
benchmark/double_hasna_1_null_0/10485760/8/manual_time      +1.4186 +1.4180    52106    126023    52130   126051
benchmark/double_hasna_1_null_0/104857600/8/manual_time     +1.1277 +1.1276   644368   1370994   644417  1371031
```
With 50% NaN in random locations, the slowdown ranges from 5.3% to 141%. (The slowdown is the same if NaN is replaced with +Inf.)
@karthikeyann can you explain here what you explained in the meeting? How is this different from your previous benchmark?
The previous benchmark did not have NaN in the data; the comparison was only for the code change, and the benchmark data had numbers and nulls only. That's because orderby (`thrust::sort`) crashes if the data has NaN, with the following error (I don't know why):
```
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered
Aborted (core dumped)
```
The latest benchmark result has NaN values in the data.
- `hasna_0_null_0` = only numbers
- `hasna_1_null_1` = 50% numbers and 50% nulls
- `hasna_1_null_0` = 50% numbers and 50% NaN (nulls are not present)
With the latest code in PR https://github.com/rapidsai/cudf/pull/2561, there is no slowdown if the data has no NaNs.
If the data has NaNs, the slowdown is 32% to 141%.
Note: operations on columns can generate NaN, and no conversion of NaN to null happens in cuDF today:
```python
In [30]: cudf.DataFrame({'a': [None, -4, 0.0, 4]}).sqrt()
Out[30]:
      a
0  null
1   NaN
2   0.0
3   2.0
```
@karthikeyann have you profiled where the extra time is being spent? I would expect the cost of `isnan(x)` to be constant whether or not x is NaN. So where is the extra time coming from?
I assumed it was because it's something like `if (isnan(x))`... and there's only a perf penalty if that `if` diverges within warps.
@jlowe can you comment whether you are OK with this performance cost?
> I assumed it was because it's something like `if (isnan(x))`... and there's only a perf penalty if that `if` diverges within warps.
I'm not sure. Sort is such a bandwidth bound operation, I wouldn't expect a single branch divergence to cause more than 2x slowdown. I think it's worth profiling with Nsight Compute to get the full story.
> can you comment whether you are OK with this performance cost?
Like @jrhemstad I'm a bit surprised the hit is that large when NaNs are present. However I'm very happy to see this:
> there is no slowdown if the data has no NaNs.
That's the best outcome. No performance hit for the cases that worked for everyone today, and now NaNs are ordered properly if present. Correctness trumps performance, so I think it's worth integrating even with the corner case performance hit. Hopefully we can tackle it in a followup if not addressed here. I do not believe NaNs will be a common case in Spark, but it's important to handle them properly if they are there to avoid corrupted output.
I'm hoping the fixes for other problematic NaN cases like hashing for groupby and join (see #2500) also don't suffer performance hits when NaNs aren't present.
@karthikeyann a good opportunity to get experience with Nsight Compute if you haven't yet -- it's a pretty powerful tool once you get used to it. Please share what you find! I agree with @jlowe that we should go ahead and get this fix integrated.
I still think that divergence is the cause of the slowdown. When there is one more parameter to branch on, and considering the synthetic data contains exactly 50% NaNs spread randomly, I'd expect the kernel to take approximately twice as long. And that's what seems to be happening here. One more idea for checking whether the slowdown is coming strictly from divergence: make the synthetic data NaN not at random but with some pattern, like the first 50% of values NaN and the rest valid. Due to sorting's access pattern, we won't see a strict halving of run time, but if it affects it in some significant way, then that's a clue.
> When there is one more parameter to branch on, and considering the synthetic data contains exactly 50% NaNs spread randomly, I'd expect the kernel to take approx twice as long.
This is predicated on the idea that sort is a compute bound operation. It's really bandwidth bound, and so even doubling the number of instructions should not cause a 2x slowdown.
@karthikeyann have you profiled with Nsight Compute yet?
Does the divergence affect the load/store pattern or timing? If so, it can affect bandwidth efficiency.
If the branch divergence theory were true, then we'd see the same 2x slowdown for a dataset with 50% random null values.
We've already effectively tested what @devavret described here:
> One more idea for checking whether the slowdown is coming strictly from divergence: make the synthetic data NaN not at random but with some pattern, like the first 50% values NaN and the rest valid. Due to sorting's access pattern, we won't have a strict halving of run time, but if it affects it in some significant way, then that's a clue.
That said, now that I look at the row_inequality_operator with NaN support, it is extremely branchy and its logic could be significantly simplified.
See https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/table/row_operators.cuh#L163 for the updated equivalent of row_inequality_comparator that has more compact logic (it doesn't have a specialization for floating point yet, but the pattern that is there should be used as the basis for simplifying the logic).
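As an illustration of the more compact pattern, a hypothetical floating-point specialization of a three-way comparison that gives NaN a fixed position (greatest, per the Spark ordering discussed earlier) instead of branching on `isnan()` at every call site:

```cpp
#include <cmath>

enum class weak_ordering { LESS, EQUIVALENT, GREATER };

// Hypothetical sketch in the style of the row_operators.cuh pattern linked
// above: a single three-way comparison with a total order over NaNs.
template <typename T>
weak_ordering compare_floating(T lhs, T rhs)
{
  bool const lhs_nan = std::isnan(lhs);
  bool const rhs_nan = std::isnan(rhs);
  if (lhs_nan && rhs_nan) return weak_ordering::EQUIVALENT;
  if (lhs_nan) return weak_ordering::GREATER;  // NaN sorts last, ascending
  if (rhs_nan) return weak_ordering::LESS;
  if (lhs < rhs) return weak_ordering::LESS;
  if (lhs > rhs) return weak_ordering::GREATER;
  return weak_ordering::EQUIVALENT;
}
```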
According to @karthikeyann, he's already tested another case where he pre-created a separate null mask for all the NaN values and replaced all `isnan()` calls with validity checks. He left the rest of the logic untouched. That case still caused the same slowdown.
> According to @karthikeyann, he's already tested another case where he pre-created a separate null mask for all the NaN values and replaced all `isnan()` calls with validity checks. He left the rest of the logic untouched. That case still caused the same slowdown.
How is that different from the `hasna_1_null_1` (50% numbers and 50% nulls) test case @karthikeyann described above? Those numbers show only a ~5% slowdown.
@karthikeyann can you open a PR for this? Still interested in profiling it but the slowdown may not be a problem and we want to get this fixed for 0.11.
Note that this was fixed for `cudf::sorted_order` here: https://github.com/rapidsai/cudf/pull/3239
`gdf_order_by` is not yet fixed.
Verified that orderby does not work. Is there an existing issue/PR that we can track for this bug? Thank you!
> Verified that orderby does not work. Is there an existing issue/PR that we can track for this bug? Thank you!
This one!
Since `gdf_order_by` is legacy, do we need to fix this? Or can we close? @kuhushukla
> Since `gdf_order_by` is legacy, do we need to fix this? Or can we close? @kuhushukla
I will look at verifying if this is still an issue with the libcudf++ version. If you think closing this and reopening a new one for any issues found on the new API is the way to go, I am happy with a close. Thanks!
Thanks @kuhushukla, I'll wait for your verification. Sameer has this on his spark issue tracker.
Fixed by #3239