Cudf: [QST] expect_columns_equal() vs expect_columns_equivalent()

Created on 6 Aug 2020  路  5Comments  路  Source: rapidsai/cudf

I seek guidance on which test scenarios we should be using cudf::test:expect_columns_equal() vs cudf::test::expect_columns_equivalent(). I ran into this while writing structs_column_test, as part of #5807. I was advised to put up an issue for discussion here.

As discussed in #5700, when constructing a structs column, the parent column's null-mask will be ANDed with the children's, but the data streams are not modified.
The values in the resulting child column may be checked for correctness with an expected_column (via the appropriate column wrapper).
For primitive column types, both equivalence and equality checks succeed.
For list columns, they fail. :/

It turns out that cudf::test::column_comparator_impl<list_view> compares list columns by recursively comparing its data/offset contents. Constructing a list containing nulls via list_column_wrapper does not produce the same contents as a column whose null-mask has been twiddled.

Should one modify column_comparator_impl<list_view> to check for equivalence differently from equality?
Also, how would the caller of expect_columns_equal() discern that they should instead be calling expect_columns_equivalent()?

I think @nvdbaranec ran into a similar (or "equivalent" :]) issue, when testing list<string>.

libcudf question

All 5 comments

"equal" means two columns are exactly, bitwise equal.

"equivalent" means two columns are functionality equivalent.

Example, given two columns a and b:

  • If a.nullable() == true and a.null_count() == 0, while b.nullable() == false (which implies b.null_count() == ) then:

    • a and b are _equivalent_ if all of their elements are equivalent, but they are not _equal_ because they differ in nullability

This was the original intention of the difference between expect_column_equal and expect_column_equivalent.

Equivalence can also have special meaning for floating point values where elements can be considered _equivalent_ within some units of least precision.

My opinion here is : "equals" is either semantically incorrect or people are using it with the wrong expectations. 99.9% of the time what you care about is "if I were to pass either of these columns along to further computation I would get the same results" and I'm pretty sure that's how people are using it.

But the current definition of equals is much more strict than this, which causes unexpected side-effects and retroactive breakage. The clear example is how it is ok for empty strings columns to either have children or not. Depending on the whims of the function you called, you might get back a column which will pass expect_columns_equal() today but fail it tomorrow.

The concrete instance of this I'd point out is the change I have in flight to empty_like(). With the new change:

  • if you call empty_like() on a strings column that has values in it, you will get back an empty strings column with children
  • if you call make_empty_column(STRING) you will get back an empty strings column with no children

This is valid by the standards of cudf. But when I made this change, a few random tests started failing because they were using equals() when what they needed was equivalent(). The end result is - 6 or 7 random checks in semi_join_tests.cpp mysteriously use expect_columns_equivalent() when all the surrounding ones use expect_columns_equal() https://github.com/rapidsai/cudf/pull/5816/files#diff-f2512c5550f8bf5f410a0c523f856393 If I'm an end-user that seems random and unexplained.

So I guess I would advocate one of two things:

  • Change expect_columns_equal() to expect_columns_bitwise_equal() or something equally strongly worded. And change expect_columns_equivalent() to expect_columns_equal()

or

  • Make all tests use expect_columns_equivalent() by default and only use expect_columns_equal() when they truly want absolute equality across the board.

"if I were to pass either of these columns along to further computation I would get the same results"

That describes _equivalence_, not equality.

I'm pretty sure that's how people are using it.

This is likely because expect_columns_equivalent came along long after expect_columns_equal, so for a long time expect_columns_equal was the only thing.

Change expect_columns_equal() to expect_columns_bitwise_equal() or something equally strongly worded. And change expect_columns_equivalent() to expect_columns_equal()

Equivalence vs. equality is a well-established concept in mathematics, programming languages, and even C++ (e.g., a and b are _equal_ if a==b, a and b are _equivalent_ if !(a < b) && !(b < a)). I think we should just stick with the standard terminology.

Make all tests use expect_columns_equivalent() by default and only use expect_columns_equal() when they truly want absolute equality across the board.

I'd definitely support this. I agree that most places should probably use expect_columns_equivalent. The only reason they are not is because many tests were written before expect_columns_equivalent existed.

I'd definitely support this. I agree that most places should probably use expect_columns_equivalent. The only reason they are not is because many tests were written before expect_columns_equivalent existed.

Thank you for the advice, @jrhemstad, @nvdbaranec. I will switch the struct tests to use expect_columns_equivalent(). (It's not much, but it's a start.)
I'll tackle the equivalence checks for list as soon as it is viable, unless someone beats me to it.

To throw another log on the fire here : I just realized that expect_columns_equal() already returns true in a case that is just "equivalent". Specifically : sliced columns. The following all passes:

cudf::test::fixed_width_column_wrapper<int> a{1, 2, 3, 4, 5, 6};
auto sliced = cudf::split(a, {2});

cudf::test::fixed_width_column_wrapper<int> expected0{1, 2};
cudf::test::expect_columns_equal(sliced[0], expected0);

cudf::test::fixed_width_column_wrapper<int> expected1{3, 4, 5, 6};    
cudf::test::expect_columns_equal(sliced[1], expected1);

But these columns aren't really truly equal. The sliced/split columns have an offset > 0 and their data buffers are different (this difference is more exaggerated in columns that contain offsets such as strings and lists). expected0 and expected1 do not have an offset and their data is different. Equivalent, absolutely, but not the same.

Was this page helpful?
0 / 5 - 0 ratings