Readers/writers are currently implemented as classes whose member variables add cognitive overhead. The readers could instead be implemented as pure functions that perform the read/write.
Readers also "switch" in various places based on option arguments which are not necessarily shared among all code paths. Breaking a reader's read function into multiple distinct read functions with shared functionality may increase the readability and flexibility of the code.
Example:
https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/csv/reader_impl.hpp#L68-L101
The public cuIO API is made up of pure functions. Only the detail API is OO.
> Readers also "switch" in various places based on option arguments which are not necessarily shared among all code paths.
@cwharris Can you add an example?
@vuule
I consider `if (option) { do this } else { do that }` to be "switching" based on arguments, but I'm sure there's better nomenclature for this.
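To make the distinction concrete, here's a minimal, self-contained sketch (not the actual cuIO API; `read_options`, `parse`, and the toy decompression are all invented for illustration). The "switching" style hides a branch on an option inside one `read()`, while the split style gives each code path its own function and shares the parsing logic:

```cpp
#include <string>

// toy "parse": identity function, standing in for the shared parsing logic
std::string parse(std::string const& s) { return s; }

// toy "decompress": strips a "gz:" prefix, standing in for real decompression
std::string toy_decompress(std::string const& s) { return s.substr(3); }

// hypothetical option struct, for illustration only
struct read_options {
  bool is_compressed;
};

// switching style: one read() that branches on the option internally
std::string read_switching(std::string const& src, read_options const& opts) {
  if (opts.is_compressed) { return parse(toy_decompress(src)); }
  return parse(src);
}

// split style: one function per path, both sharing parse(); no hidden branch
std::string read_plain(std::string const& src) { return parse(src); }
std::string read_compressed(std::string const& src) { return parse(toy_decompress(src)); }
```

In the split style the caller chooses the path explicitly, so each function's preconditions are visible in its signature rather than buried in a branch.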
Here's an example related to decompression:
It looks like range_offset is unsupported when the file is compressed because we can't offset prior to decompression, and it doesn't make much sense for the consumer to know the offset from the beginning of the _uncompressed_ data. If, instead, we handled decompression prior to read(...), then read could support range_offset without checking the compression type (since the document is already uncompressed).
That logic, however, is in a conditional branch that "switches" based on the "emptiness" of _source. Which brings me to datasource and the filepath string argument: they are mutually exclusive, but part of the same read call. If, instead, the read call accepted begin/end iterators (or a single buffer type, if we don't want to instantiate multiple templates) over the source, then the source of the data becomes irrelevant to the read call. This means one "higher-level" read call could accept a filepath argument, and another could accept a datasource. Both would perform opening/reading (buffering, if necessary to avoid template instantiation), optionally perform decompression, and then pass the stream/iterator/buffer to the core read call. That also means range_offset/range_size move away from the read call in favor of begin/end iterators.
hypothetical reader api:

```cpp
template <typename SourceIterator>
read(SourceIterator begin,
     SourceIterator end,
     reader_options options,  // this basically becomes parsing options
     mr = ...,
     cudaStream_t stream = 0);
```
I realize this doesn't account for skip_rows and num_rows, but those are somewhat tangential and could be accounted for in overloads.
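The layering described above could be sketched roughly like this (a hypothetical, self-contained example; `parse_options` and these `read` overloads are invented for illustration and stand in for the real cuIO types). The lowest-level `read` sees only a range of bytes, and the higher-level overloads own ingest:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// hypothetical stand-in for the parsing-only options
struct parse_options {};

// core read: parsing only; here it just counts bytes as a stand-in for real work
int read(char const* begin, char const* end, parse_options const&) {
  return static_cast<int>(end - begin);
}

// convenience overload over an in-memory buffer
int read(std::vector<char> const& buffer, parse_options const& opts) {
  return read(buffer.data(), buffer.data() + buffer.size(), opts);
}

// higher-level overload: owns ingest (file open, buffering); decompression,
// if needed, would also happen here, before the parse-only read()
int read(std::string const& filepath, parse_options const& opts) {
  std::ifstream in(filepath, std::ios::binary);
  std::vector<char> buffer((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
  return read(buffer, opts);
}
```

With this shape, range_offset/range_size reduce to pointer arithmetic on the buffer the caller hands to the core `read`, regardless of where the bytes came from.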
The proposed separation of ingest and parsing makes sense. I'm just not sure we need templates for the read(); wouldn't we have a buffer in memory after the ingest in any case?
This is specific to JSON and CSV readers, right? Input selection works very differently in ORC/Parquet. Probably Avro too, not sure off the top of my head.
Can you either rename the issue to focus on CSV+JSON or add an example+proposal for the other formats? I'm not clear on what the intended scope is.
@vuule You're right. I'm taking a look at the datasource class now, and in light of that abstraction, templates don't make sense.
Yes, what I've mentioned so far is specific to JSON/CSV, but I suspect the case is similar for the other readers, barring "paged" or "chunked" reads (though these could also be abstracted as pure functions if it's beneficial). Still investigating though. If the approach differs significantly, or we could divvy up the work better, separate issues would be useful.
Any thoughts on keeping the classes, but removing all mutable data members? It would effectively leave reader/parser options and maybe source/sink. IMO this would remove the overhead of tracking the state, while eliminating the need to drag the options around. ORC writer is _almost_ implemented like this and it seems clear to me (long functions aside).
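A minimal sketch of that idea (hypothetical names; not the actual cuIO classes): the reader keeps only `const` option members, so every call is effectively a pure function with its options pre-bound, and all intermediate state lives in locals:

```cpp
#include <string>

// hypothetical reader holding only immutable configuration, no parsing state
class csv_reader {
 public:
  explicit csv_reader(char delimiter) : delimiter_{delimiter} {}

  // all intermediate state is local; the object is never mutated
  int count_fields(std::string const& line) const {
    int fields = 1;
    for (char c : line) {
      if (c == delimiter_) { ++fields; }
    }
    return fields;
  }

 private:
  char const delimiter_;  // options only; const, so no state to track
};
```

Because the members are `const`, every member function can be `const`-qualified, which makes the "no hidden state" property compiler-enforced rather than a convention.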
Yeah, classes and structs are great. Making them immutable will make the readers/writers easier to comprehend. Win/win.