Cudf: [BUG] cudf++ string gather/apply_boolean_mask does not set validity

Created on 9 Dec 2019  路  8Comments  路  Source: rapidsai/cudf

Describe the bug
When I call the cudf++ version of apply_boolean_mask on a table with a string column in it, the filtered output of the string column has all of the entries set to valid, even if they should be null. A non-string column in the same table sets it's validity mask correctly.

apply_boolean_mask calls gather under the hood, and if you look at the string implementation of gather it is allocating the validity buffer, but it never populates it.

https://github.com/rapidsai/cudf/blob/f8c1c0d5e5336d2890ca1a6f2ec9bedad5d4694d/cpp/include/cudf/strings/detail/gather.cuh#L76-L111

? - Needs Triage Spark bug libcudf++

Most helpful comment

Sorry I started to check my inputs and it looks like we are not setting validity correctly somehow when it is for a String. I am not 100% sure what is happening but it looks to be unrelated to apply_boolean_mask.

All 8 comments

The detail::gather for strings is not supposed to populate the output bitmask. It is only used here to materialize the data: https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/detail/gather.cuh#L193

The bitmask should be getting populated later here in a code path that is ignorant to whether or not the column is a string column.
https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/detail/gather.cuh#L353

Do you have a test that reproduces this?

I have tried with following test case, and I get proper answer.

TEST_F(ApplyBooleanMask, StringColumnTest1) {
    cudf::test::strings_column_wrapper col1{{"This", "is", "the", "a", "k12", "string", "table", "column"}, {0, 1, 0, 1, 1, 0, 1, 0}};
    cudf::test::fixed_width_column_wrapper<int32_t> col2{{10, 40, 70, 5, 2, 10, 34, 8}, {1, 1, 0, 1, 1, 0, 1, 1}};
    cudf::table_view input {{col1, col2}};
    cudf::test::fixed_width_column_wrapper<cudf::experimental::bool8> boolean_mask{{true, true, false, true, false, true, false, true}, {1, 1, 1, 1, 1, 1, 1, 1}};
    cudf::test::strings_column_wrapper col1_expected{{"This", "is", "a", "string", "column"}, {0, 1, 1, 0, 0}};
    cudf::test::fixed_width_column_wrapper<int32_t> col2_expected{{10, 40, 5, 10, 8}, {1, 1, 1, 0, 1}};
    cudf::table_view expected {{col1_expected, col2_expected}};


    auto got = cudf::experimental::apply_boolean_mask(input, boolean_mask);
    for (int i = 0; i<got->num_columns(); i++){
        cudf::test::print(got->view().column(i));
        std::cout<<std::endl;
    }

    cudf::test::expect_tables_equal(expected, got->view());
}

Result:

NULL,is,a,NULL,NULL
10,40,5,NULL,8

@revans2 Can you please try to add this in apply_boolean_mask_tests.cu and let me know if you get same issue or not.

The code I have is in java. The test is very simple. But I'll see if I can come up with a test case to reproduce it in C++.

If possible, please provide java test case that I can run.

The test case is on an internal branch as we try to port the java API over to cudf++, so it is not in a place that you could easily run it right now.

I'll take some more time to dig into this because it looks like the C++ unit tests are checking essentially the same thing that the java test is, which makes me think that it is probably a java issue of some sort.

Sorry I started to check my inputs and it looks like we are not setting validity correctly somehow when it is for a String. I am not 100% sure what is happening but it looks to be unrelated to apply_boolean_mask.

Was this page helpful?
0 / 5 - 0 ratings