Describe the bug
When I call the cudf++ version of apply_boolean_mask on a table with a string column in it, the filtered output of the string column has all of the entries set to valid, even if they should be null. A non-string column in the same table sets it's validity mask correctly.
apply_boolean_mask calls gather under the hood, and if you look at the string implementation of gather it is allocating the validity buffer, but it never populates it.
The detail::gather for strings is not supposed to populate the output bitmask. It is only used here to materialize the data: https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/detail/gather.cuh#L193
The bitmask should be getting populated later here in a code path that is ignorant to whether or not the column is a string column.
https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/detail/gather.cuh#L353
Do you have a test that reproduces this?
I have tried with following test case, and I get proper answer.
TEST_F(ApplyBooleanMask, StringColumnTest1) {
cudf::test::strings_column_wrapper col1{{"This", "is", "the", "a", "k12", "string", "table", "column"}, {0, 1, 0, 1, 1, 0, 1, 0}};
cudf::test::fixed_width_column_wrapper<int32_t> col2{{10, 40, 70, 5, 2, 10, 34, 8}, {1, 1, 0, 1, 1, 0, 1, 1}};
cudf::table_view input {{col1, col2}};
cudf::test::fixed_width_column_wrapper<cudf::experimental::bool8> boolean_mask{{true, true, false, true, false, true, false, true}, {1, 1, 1, 1, 1, 1, 1, 1}};
cudf::test::strings_column_wrapper col1_expected{{"This", "is", "a", "string", "column"}, {0, 1, 1, 0, 0}};
cudf::test::fixed_width_column_wrapper<int32_t> col2_expected{{10, 40, 5, 10, 8}, {1, 1, 1, 0, 1}};
cudf::table_view expected {{col1_expected, col2_expected}};
auto got = cudf::experimental::apply_boolean_mask(input, boolean_mask);
for (int i = 0; i<got->num_columns(); i++){
cudf::test::print(got->view().column(i));
std::cout<<std::endl;
}
cudf::test::expect_tables_equal(expected, got->view());
}
Result:
NULL,is,a,NULL,NULL
10,40,5,NULL,8
@revans2 Can you please try to add this in apply_boolean_mask_tests.cu and let me know if you get same issue or not.
The code I have is in java. The test is very simple. But I'll see if I can come up with a test case to reproduce it in C++.
If possible, please provide java test case that I can run.
The test case is on an internal branch as we try to port the java API over to cudf++, so it is not in a place that you could easily run it right now.
I'll take some more time to dig into this because it looks like the C++ unit tests are checking essentially the same thing that the java test is, which makes me think that it is probably a java issue of some sort.
Sorry I started to check my inputs and it looks like we are not setting validity correctly somehow when it is for a String. I am not 100% sure what is happening but it looks to be unrelated to apply_boolean_mask.
Most helpful comment
Sorry I started to check my inputs and it looks like we are not setting validity correctly somehow when it is for a String. I am not 100% sure what is happening but it looks to be unrelated to apply_boolean_mask.