Readers/writers are currently implemented as classes whose member variables add cognitive overhead. The readers could instead be implemented as pure functions that perform the read/write.
Readers also "switch" in various places based on option arguments which are not necessarily shared among all code paths. Breaking a reader's read function into multiple distinct read functions with shared functionality may increase the readability and flexibility of the code.
Example:
https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/csv/reader_impl.hpp#L68-L101
The public cuIO API is made up of pure functions. Only the detail API is OO.
> Readers also "switch" in various places based on option arguments which are not necessarily shared among all code paths.
@cwharris Can you add an example?
@vuule
I consider `if (option) { do this } else { do that }` to be "switching" based on arguments, but I'm sure there's better nomenclature for this.
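To make the distinction concrete, here's a minimal, self-contained sketch (not the actual cuIO API; `read_options`, `parse`, and the toy decompression are all invented for illustration). The "switching" style hides a branch on an option inside one `read()`, while the split style gives each code path its own function and shares the parsing logic:

```cpp
#include <string>

// toy "parse": identity function, standing in for the shared parsing logic
std::string parse(std::string const& s) { return s; }

// toy "decompress": strips a "gz:" prefix, standing in for real decompression
std::string toy_decompress(std::string const& s) { return s.substr(3); }

// hypothetical option struct, for illustration only
struct read_options {
  bool is_compressed;
};

// switching style: one read() that branches on the option internally
std::string read_switching(std::string const& src, read_options const& opts) {
  if (opts.is_compressed) { return parse(toy_decompress(src)); }
  return parse(src);
}

// split style: one function per path, both sharing parse(); no hidden branch
std::string read_plain(std::string const& src) { return parse(src); }
std::string read_compressed(std::string const& src) { return parse(toy_decompress(src)); }
```

In the split style the caller chooses the path explicitly, so each function's preconditions are visible in its signature rather than buried in a branch.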
Here's an example related to decompression:
It looks like range_offset is unsupported when the file is compressed because we can't offset prior to decompression, and it doesn't make much sense for the consumer to know the offset from the beginning of the _uncompressed_ data. If, instead, we handled decompression prior to read(...), then read could support range_offset without checking the compression type (since the document is already uncompressed).
That logic, however, is in a conditional branch that "switches" based on the "emptiness" of _source. Which brings me to datasource and the filepath string argument: they are mutually exclusive, but part of the same read call. If, instead, the read call accepted begin/end iterators (or a single buffer type, if we don't want to instantiate multiple templates) over the source, then the source of the data becomes irrelevant to the read call. This means one "higher-level" read call could accept a filepath argument, and another could accept a datasource. Both would perform opening/reading (buffering, if necessary to avoid template instantiation), optionally perform decompression, and then pass the stream/iterator/buffer to the core read call. That also means range_offset/range_size move away from the read call in favor of begin/end iterators.
hypothetical reader api:

```cpp
template <typename SourceIterator>
read(SourceIterator begin,
     SourceIterator end,
     reader_options options,  // this basically becomes parsing options
     mr = ...,
     cudaStream_t stream = 0);
```
I realize this doesn't account for skip_rows and num_rows, but those are somewhat tangential and could be accounted for in overloads.
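The layering described above could be sketched roughly like this (a hypothetical, self-contained example; `parse_options` and these `read` overloads are invented for illustration and stand in for the real cuIO types). The lowest-level `read` sees only a range of bytes, and the higher-level overloads own ingest:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// hypothetical stand-in for the parsing-only options
struct parse_options {};

// core read: parsing only; here it just counts bytes as a stand-in for real work
int read(char const* begin, char const* end, parse_options const&) {
  return static_cast<int>(end - begin);
}

// convenience overload over an in-memory buffer
int read(std::vector<char> const& buffer, parse_options const& opts) {
  return read(buffer.data(), buffer.data() + buffer.size(), opts);
}

// higher-level overload: owns ingest (file open, buffering); decompression,
// if needed, would also happen here, before the parse-only read()
int read(std::string const& filepath, parse_options const& opts) {
  std::ifstream in(filepath, std::ios::binary);
  std::vector<char> buffer((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
  return read(buffer, opts);
}
```

With this shape, range_offset/range_size reduce to pointer arithmetic on the buffer the caller hands to the core `read`, regardless of where the bytes came from.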
The proposed separation of ingest and parsing makes sense. I'm just not sure we need templates for the read(); wouldn't we have a buffer in memory after the ingest in any case?
This is specific to JSON and CSV readers, right? Input selection works very differently in ORC/Parquet. Probably Avro too, not sure off the top of my head.
Can you either rename the issue to focus on CSV+JSON or add an example+proposal for the other formats? I'm not clear on what the intended scope is.
@vuule You're right. I'm taking a look at the datasource class now, and in light of that abstraction, templates don't make sense.
Yes, what I've mentioned so far is specific to JSON/CSV, but I suspect the case is similar for the other readers, barring "paged" or "chunked" reads (though these could also be abstracted as pure functions if it's beneficial). Still investigating though. If the approach differs significantly, or we could divvy up the work better, separate issues would be useful.
Any thoughts on keeping the classes, but removing all mutable data members? It would effectively leave reader/parser options and maybe source/sink. IMO this would remove the overhead of tracking the state, while eliminating the need to drag the options around. ORC writer is _almost_ implemented like this and it seems clear to me (long functions aside).
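A minimal sketch of that idea (hypothetical names; not the actual cuIO classes): the reader keeps only `const` option members, so every call is effectively a pure function with its options pre-bound, and all intermediate state lives in locals:

```cpp
#include <string>

// hypothetical reader holding only immutable configuration, no parsing state
class csv_reader {
 public:
  explicit csv_reader(char delimiter) : delimiter_{delimiter} {}

  // all intermediate state is local; the object is never mutated
  int count_fields(std::string const& line) const {
    int fields = 1;
    for (char c : line) {
      if (c == delimiter_) { ++fields; }
    }
    return fields;
  }

 private:
  char const delimiter_;  // options only; const, so no state to track
};
```

Because the members are `const`, every member function can be `const`-qualified, which makes the "no hidden state" property compiler-enforced rather than a convention.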
Yeah, classes and structs are great. Making them immutable will make the readers/writers easier to comprehend. Win/win.