Cudf: [BUG] full_join considers nulls as equal

Created on 23 Jun 2020  路  7Comments  路  Source: rapidsai/cudf

Describe the bug
When I do a full_join in spark if the key has nulls in it I get the wrong answer because full_join in cudf considers null to be equal, but in the rest of the world it does not.

Steps/Code to reproduce bug
psudo code. I'll try to come up with an actual unit test to reproduce this, but for now...

a = {{null, 1}}
b = {{null, 2}}
full_join(a, b)

in spark on the CPU the returns

{{null, 1, null, null},
 {null, null, null, 2}}

from cudf it returns

{{null, 1, null, 2}}
Spark bug libcudf

All 7 comments

Oldie but a goodie: https://github.com/rapidsai/cudf/issues/542

The behavior definitely used to be NULL != NULL, but that may have just been an oversight in porting since row_equality_comparator defaults to NULL == NULL.

So I think we should just switch back to NULL != NULL by specifying it in the row comparator.

fyi, @revans2 the way things are currently implemented, this doesn't look like it's unique to full join. Inner and left outer would have this same issue.

From our tests other join operators are not showing issues with nulls in the key. It is just full_join. I would need to dig into the details to see if something else is masking it in spark or not.

I think this is a test the reproduces the issue

TEST_F(JoinTest, FullJoinWithNullKeys)
{
  column_wrapper<int32_t> col0_0{{0, 1}, {0, 1}};
  column_wrapper<int32_t> col1_0{{-1, 2}, {0, 1}};

  CVector cols0, cols1;
  cols0.push_back(col0_0.release());
  cols1.push_back(col1_0.release());

  Table t0(std::move(cols0));
  Table t1(std::move(cols1));

  auto result            = cudf::full_join(t0, t1, {0}, {0}, {{0, 0}});
  auto result_sort_order = cudf::sorted_order(result->view());
  auto sorted_result     = cudf::gather(result->view(), *result_sort_order);

  column_wrapper<int32_t> col_gold_0{{0, 1, -1, 2}, {0, 1, 0, 1}};
  CVector cols_gold;
  cols_gold.push_back(col_gold_0.release());
  Table gold(std::move(cols_gold));

  auto gold_sort_order = cudf::sorted_order(gold.view());
  auto sorted_gold     = cudf::gather(gold.view(), *gold_sort_order);
  cudf::test::expect_tables_equal(*sorted_gold, *sorted_result);
}

I still don't know why this is not showing up in spark for a regular left join, but while trying to add in a right outer join, using left join I am seeing it there too. I need to dig a bit into spark to try and understand why it is not showing up on a left join.

I need to dig a bit into spark to try and understand why it is not showing up on a left join.

I suspect that this might not show up in a right join either (had we one).

This might be happening because SparkSQL likely pushes down an implicit right_table.join_key IS NOT NULL filter, for LEFT OUTER JOIN. So the right-side null records are pre-filtered, and aren't available to produce the wrong results.

With a FULL OUTER JOIN, neither side can be pre-filtered. Without the join predicate pushed down, null-keys on both sides make it to the cudf::full_join(), and we get the incorrect results.

I'll try verify this tomorrow. Very interesting bug, this.

This was fixed a while ago by the patch from @mythrocks

Was this page helpful?
0 / 5 - 0 ratings