Cudf: [FEATURE] cuio: eliminate off-by-one adjustments in parse utilities

Created on 17 Sep 2020  路  4Comments  路  Source: rapidsai/cudf

Cuio has a few places where we unnecessarily add/subtract 1 from a "field length" or "size" or "end pointer". This is caused by some parse implementations using <= rather than <. If we update parse utilities such that the accept begin/end pointers, offsets, or spans, we can avoid these "adjust-by-one" fixes.

With the the current code, size of input to parse utilities must be calculated as end - begin - 1, and likewise the input must be corrected (if the calling code already has correct offsets) as example(begin, end + 1).

It's possible similar logic is used in parqeut/avro, but the portions I've found so far are in csv/json.

Let's standardize on end - begin == size for all offsets, and use for(auto pos = begin; pos < end; ++pos), for example.

CSV
-
The CSV code loops incorrectly, using <= rather than <, and this logic finds it way down in to the parse utils.
https://github.com/rapidsai/cudf/blob/8bafbb3ed3c58fd6015001731faf413798921d56/cpp/src/io/csv/csv_gpu.cu#L603-L608
https://github.com/rapidsai/cudf/blob/8bafbb3ed3c58fd6015001731faf413798921d56/cpp/src/io/csv/csv_gpu.cu#L468-L477

JSON
-
The JSON code loops correctly, but adjusts by one before calling parse utilities.
https://github.com/rapidsai/cudf/blob/8bafbb3ed3c58fd6015001731faf413798921d56/cpp/src/io/json/json_gpu.cu#L330-L331

Parse Utilities
-
The parse utilities code loops incorrectly, using <= rather than <.
https://github.com/rapidsai/cudf/blob/8bafbb3ed3c58fd6015001731faf413798921d56/cpp/src/io/utilities/parsing_utils.cuh#L162-L173

bug cuIO libcudf tech debt

All 4 comments

@cwharris Do you think this should be addressed together with https://github.com/rapidsai/cudf/issues/6210, given that both are related to iterator use in the utilities?

@vuule sgtm

Or we just have one pr fixing both issues.

Yeah, that what I've been thinking

Was this page helpful?
0 / 5 - 0 ratings