Cudf: [BUG] full_join considers nulls as equal

Created on 23 Jun 2020 · 7Comments · Source: rapidsai/cudf

Describe the bug
When I do a full_join in spark if the key has nulls in it I get the wrong answer because full_join in cudf considers null to be equal, but in the rest of the world it does not.

Steps/Code to reproduce bug
psudo code. I'll try to come up with an actual unit test to reproduce this, but for now...

a = {{null, 1}}
b = {{null, 2}}
full_join(a, b)

in spark on the CPU the returns

{{null, 1, null, null},
 {null, null, null, 2}}

from cudf it returns

{{null, 1, null, 2}}

Spark bug libcudf

Source

revans2

All 7 comments

Oldie but a goodie: https://github.com/rapidsai/cudf/issues/542

The behavior definitely used to be NULL != NULL, but that may have just been an oversight in porting since row_equality_comparator defaults to NULL == NULL.

So I think we should just switch back to NULL != NULL by specifying it in the row comparator.

jrhemstad on 23 Jun 2020

fyi, @revans2 the way things are currently implemented, this doesn't look like it's unique to full join. Inner and left outer would have this same issue.

jrhemstad on 23 Jun 2020

😕1

From our tests other join operators are not showing issues with nulls in the key. It is just full_join. I would need to dig into the details to see if something else is masking it in spark or not.

revans2 on 23 Jun 2020

I think this is a test the reproduces the issue

TEST_F(JoinTest, FullJoinWithNullKeys)
{
  column_wrapper<int32_t> col0_0{{0, 1}, {0, 1}};
  column_wrapper<int32_t> col1_0{{-1, 2}, {0, 1}};

  CVector cols0, cols1;
  cols0.push_back(col0_0.release());
  cols1.push_back(col1_0.release());

  Table t0(std::move(cols0));
  Table t1(std::move(cols1));

  auto result            = cudf::full_join(t0, t1, {0}, {0}, {{0, 0}});
  auto result_sort_order = cudf::sorted_order(result->view());
  auto sorted_result     = cudf::gather(result->view(), *result_sort_order);

  column_wrapper<int32_t> col_gold_0{{0, 1, -1, 2}, {0, 1, 0, 1}};
  CVector cols_gold;
  cols_gold.push_back(col_gold_0.release());
  Table gold(std::move(cols_gold));

  auto gold_sort_order = cudf::sorted_order(gold.view());
  auto sorted_gold     = cudf::gather(gold.view(), *gold_sort_order);
  cudf::test::expect_tables_equal(*sorted_gold, *sorted_result);
}

revans2 on 23 Jun 2020

👍1

I still don't know why this is not showing up in spark for a regular left join, but while trying to add in a right outer join, using left join I am seeing it there too. I need to dig a bit into spark to try and understand why it is not showing up on a left join.

revans2 on 23 Jun 2020

I need to dig a bit into spark to try and understand why it is not showing up on a left join.

I suspect that this might not show up in a right join either (had we one).

This might be happening because SparkSQL likely pushes down an implicit right_table.join_key IS NOT NULL filter, for LEFT OUTER JOIN. So the right-side null records are pre-filtered, and aren't available to produce the wrong results.

With a FULL OUTER JOIN, neither side can be pre-filtered. Without the join predicate pushed down, null-keys on both sides make it to the cudf::full_join(), and we get the incorrect results.

I'll try verify this tomorrow. Very interesting bug, this.