Cudf: [BUG] reinterpret_cast is misused in many places in cuIO

Created on 18 Aug 2020  路  10Comments  路  Source: rapidsai/cudf

There are (at least) two ways in which the reinterpret_cast is misused:

  1. Used instead of static_cast:
    https://github.com/rapidsai/cudf/blob/f7fbc1160b17b969db2708e0cf6033d3db9fc1cf/cpp/src/io/avro/avro_gpu.cu#L95
    static cast should be used to cast from void*.
  2. The use causes undefined behavior:
    https://github.com/rapidsai/cudf/blob/f7fbc1160b17b969db2708e0cf6033d3db9fc1cf/cpp/src/io/avro/avro_gpu.cu#L112
    https://github.com/rapidsai/cudf/blob/f7fbc1160b17b969db2708e0cf6033d3db9fc1cf/cpp/src/io/avro/avro_gpu.cu#L131-L132
    These uses break the type aliasing rules as they cast between pointers to types that are not _similar_.

Proposed solution:

  • Use static_cast where appropriate.
  • Buffers whose data type is not known at compile time (input/output buffers usually) should be of type void* instead of uint8_t*, as they commonly are now.

    • Addition of a hostdevice_buffer type might be beneficial here.

  • Where the options above are not applicable, use std::memcpy instead of the cast.

TODO list of components:

  • [x] Avro
  • [ ] Compression
  • [ ] CSV
  • [x] ORC
  • [ ] Parquet
  • [ ] Stats
  • [ ] Utilities
bug cuIO tech debt

All 10 comments

omg

For the second one, since it was stated that it represents the int type and we are converting between integral types and the converted-to type (int32) has the appropriate alignment, it's safe and not type-aliasing.

This is the case where:

int32_t data[5] = {};
schemadesc_s schema;
schema.data = reintepret_cast<void*>(data);
schema.type = type_int32;

// in the function context
auto dataptr = reinterpret_cast<uint8_t*>(schema.data);
reinterpret_cast<int32_t *>(dataptr)[row] = static_cast<int32_t>(v); 

converting from T* -> void* -> U* -> T* is not type aliasing, since it wasn't dereferenced in the U* state, only when it was converted back to T*.
But it'd be worth using void* instead of uint8_t*.

But it'd be worth using void* instead of uint8_t*.

Agreed. If it was left as void* then static_cast could be used and it would look less circumspect.

@jrhemstad @vuule
I've made the changes in #6285. I also noticed and fixed the use of implicit pointer-to-bool conversions which is undefined behaviour.

@vuule the changes have been merged, do let me know if you need any additional changes to be made

Hi @lamarrr , there are actually many other places in cudf::io where reinterpret_cast use is problematic. I only listed a couple of examples.
If you want to further help with this (and you are more than welcome), you can look for reinterpret_cast calls in cudf::io that cast between pointer types.

Added the list of components so we can track progress more easily (since not all reinterpret_cast calls are a part of this issue).
JSON code does not seem to have any UB uses, so I did not include it in the list.

Hi @lamarrr , there are actually many other places in cudf::io where reinterpret_cast use is problematic. I only listed a couple of examples.
If you want to further help with this (and you are more than welcome), you can look for reinterpret_cast calls in cudf::io that cast between pointer types.

Yeah, sure! I'll try to help with some cleanups and push them within a day or two.
I also noted some c-style pointer casts which are essentially a combination of const_casts, reinterpret_cast, static_cast, and volatile casts. Essentially, possibly being a reinterpret_cast means it's just as dangerous as reinterpret_cast.

Missed the "closes" tag in #6386. Reopening.

Please note that most code that cooperatively loads structs uses reinterpret_cast to cast between pointer types. These instances will be addressed as part of https://github.com/rapidsai/cudf/issues/6236, please skip such cases when working on this issue.

Was this page helpful?
0 / 5 - 0 ratings