cudf: [DISCUSSION] libcudf should not introspect input data to perform error checking

Created on 18 Jun 2020 · 16 comments · Source: rapidsai/cudf

**Is your feature request related to a problem? Please describe.**

A few functions in libcudf optionally verify that input data is valid before performing the operation. For example,
https://github.com/rapidsai/cudf/blob/31d746605fb4a3fd4b24a5732aed16ce7944adf7/cpp/include/cudf/copying.hpp#L62-L66

`cudf::gather` has a `check_bounds` bool parameter that enables verifying that the values in the `gather_map` are within bounds. This requires launching a kernel to introspect the gather map data:
https://github.com/rapidsai/cudf/blob/31d746605fb4a3fd4b24a5732aed16ce7944adf7/cpp/src/copying/gather.cu#L30-L39

The reason for this verification is that cuDF Python expects to throw an exception if any of the values are out of bounds.

However, there is no reason for libcudf to perform this verification inside the gather implementation itself. The cuDF Python bindings for gather can easily add this bounds checking as a pre-processing step, as sketched below.
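As a rough illustration, here is a minimal sketch of what that pre-processing step could look like on the Python side (the `checked_gather` helper is hypothetical, not existing cuDF API; it assumes a cupy-backed gather map):

```python
import cupy as cp
import cudf

def checked_gather(series: cudf.Series, gather_map: cp.ndarray) -> cudf.Series:
    """Hypothetical helper: validate the gather map, then gather."""
    n = len(series)
    if gather_map.size:
        # Introspection happens here, outside the gather primitive itself;
        # valid indices with negative wrapping lie in [-n, n).
        if int(gather_map.max()) >= n or int(gather_map.min()) < -n:
            raise IndexError("gather map index out of bounds")
    return series.iloc[gather_map]
```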

`cudf::repeat` is a similar example of this behavior:

https://github.com/rapidsai/cudf/blob/b72e6476325532ee0ea112f604e23db928a4b24d/cpp/include/cudf/filling.hpp#L118-L122

Having this bounds checking inside the libcudf function is detrimental for a number of reasons:

  1. It complicates the libcudf implementation. The gather implementation is quite complicated because of all of its internal branches, including whether or not bounds need to be checked. It's not as simple as a single if/else, because the `check_bounds` flag interacts with other internal flags such as `allow_negative_indices`. Moving this check out of gather would significantly simplify its implementation.
  2. It violates the single responsibility principle. Verifying the validity of input data should be independent of performing the actual operation.
  3. It is not in line with the semantics of other C++ libraries. For example, `thrust::gather` does not perform any validation of the map data.

**Describe the solution you'd like**

Optional input data validation like in gather or repeat should be eliminated from libcudf features.

Python features that rely on this bounds checking should implement it as a pre-processing step using existing libcudf primitives, or identify new primitives that would enable the necessary validation.

**Additional context**

To be clear, I am _not_ suggesting that libcudf functions perform no error checking whatsoever. I am suggesting we remove any error checking that requires introspection of the input data (i.e., launching a kernel). Checks that data has the right type, size, etc. must still be preserved. An obvious indicator of functions doing this today is an optional boolean flag like the ones in gather and repeat.
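To make the distinction concrete, here is an illustrative contrast between the two kinds of checks (hypothetical code, written in Python for brevity rather than as actual libcudf C++):

```python
import cupy as cp

def validate_gather_map(column, gather_map: cp.ndarray) -> None:
    # Metadata check: O(1), host-side, no kernel launch.
    # Checks like this should be preserved.
    if gather_map.dtype.kind not in "iu":
        raise TypeError("gather map must have an integer dtype")

    # Data introspection: touches every element on the device
    # (a kernel launch). This is the kind of check the proposal
    # moves out of the core primitive.
    if gather_map.size and int(gather_map.max()) >= len(column):
        raise IndexError("gather map index out of bounds")
```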

Labels: cuDF (Python), feature request, libcudf, proposal

Most helpful comment

Sounds good -- I'll benchmark and report here, and we can decide based on that?

All 16 comments

:+1:

I'm fine with this, but it's worth bringing up that gather is critical to performance in some applications, and the pre-processing associated with negative indices was previously found to be a bottleneck: https://github.com/rapidsai/cudf/issues/2675

Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 )

> Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 )

Whether we check bounds in libcudf or Python, either way it's a kernel launch. I wouldn't expect it to be substantively more expensive to do the bounds check outside of the gather call.

@shwina I think we can add this in the Cython layer, as opposed to the Python layer, to amortize some of the typical Python overheads.

Sounds good -- I'll benchmark and report here, and we can decide based on that?

So I ran a quick benchmark. Here are the results:

With libcudf bounds checking:

```
size: 100 :: time: 0.007442206988343969
size: 1000 :: time: 0.006695960997603834
size: 10000 :: time: 0.006788923987187445
size: 100000 :: time: 0.009929071005899459
size: 1000000 :: time: 0.028811413008952513
size: 10000000 :: time: 0.06236411599093117
```

Without any bounds checking:

```
size: 100 :: time: 0.006547413009684533
size: 1000 :: time: 0.005977320979582146
size: 10000 :: time: 0.005807141977129504
size: 100000 :: time: 0.008778560993960127
size: 1000000 :: time: 0.020219083002302796
size: 10000000 :: time: 0.0580876559833996
```

With cudf bounds checking:

```
size: 100 :: time: 0.010622989007970318
size: 1000 :: time: 0.010242449003271759
size: 10000 :: time: 0.009886790998280048
size: 100000 :: time: 0.01461042498704046
size: 1000000 :: time: 0.026591391011606902
size: 10000000 :: time: 0.0630068660248071
```

Benchmark used (basically "reversing" a column by performing a gather):

```python
import timeit
import cupy as cp
import cudf

for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    a = cudf.Series(cp.arange(size))

    start = timeit.default_timer()
    for i in range(10):
        result = a.iloc[cp.arange(size - 1, -1, -1)]
    end = timeit.default_timer()

    print(f"size: {size} :: time: {end - start}")
```

The "cudf bounds checking" is implemented in Cython with as little overhead as possible. Basically it is a max reduction (max(gather_map)).

```cython
# `tid` is the type id of the gather map column (setup not shown here).
cdef data_type c_dtype = data_type(tid)
cdef unique_ptr[aggregation] c_agg = move(make_aggregation("max"))
cdef unique_ptr[scalar] c_reduce_result

with nogil:
    # Reduce the gather map to its maximum value on the device.
    c_reduce_result = move(
        cpp_reduce(
            gather_map_view,
            c_agg,
            c_dtype
        )
    )

py_reduce_result = Scalar.from_unique_ptr(move(c_reduce_result))

# Out of bounds if the largest map value reaches the row count.
if py_reduce_result.value >= source_table._num_rows:
    raise RuntimeError("Index out of bounds")
```

I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease.

I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side.

> I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease.

Agreed. The performance difference is minimal enough to not be concerning to me.

> I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side.

I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

> I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

How silly of me :) Edited with numbers for doing just a max. The difference is even less concerning now (especially for larger problem sizes).

I'll put in a PR to do bounds checking in Cython for both scatter and gather. I'm happy to also throw in bounds checking _removal_ in C++.

> I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

> How silly of me :) Edited with numbers for doing just a max. The difference is even less concerning now (especially for larger problem sizes).

That said, I think you actually need to do both a min and a max reduction to account for the "negative values wrap" logic. I.e., you need to check whether any value in the gather map is outside the bounds [-n, n).

Are you also considering removing support for negative index values?

> Are you also considering removing support for negative index values?

No. I'm just saying that if you want to keep the same bounds-checking logic that exists in libcudf, you need to check for [-n, n).

Got it -- yup makes sense.

We could even add a minmax reduction to do this in a single operation.
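For illustration, a minimal sketch of that check with both reductions (the `check_gather_bounds` helper is hypothetical and written against cupy; a fused minmax reduction would produce both extremes in a single pass over the data):

```python
import cupy as cp

def check_gather_bounds(gather_map: cp.ndarray, n: int) -> None:
    """Hypothetical check: valid indices with negative wrapping lie in [-n, n)."""
    if gather_map.size == 0:
        return
    # Two separate kernel launches today; a single minmax reduction
    # would compute both extremes in one pass.
    lo = int(gather_map.min())
    hi = int(gather_map.max())
    if lo < -n or hi >= n:
        raise IndexError(f"gather map values must be in [-{n}, {n})")
```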
