It would be good to have support for Struct in CUDF, with arbitrarily complex child fields. E.g.
struct <name:string, age:int8, gpa:float, addresses:list<string>>
This issue aims to document how Struct support might be added to CUDF, the behaviour of Struct columns in different scenarios, and possibly any related APIs. (The description of this issue is a work in progress, and will be updated based on discussion and feedback.)
Functionality to be discussed includes:

1. A column factory for Struct columns, with arbitrarily nested fields. E.g.:
   - `struct<name:string, address:struct<street:string, city:string, state:string, po:int8>>`
   - `list<coordinates:struct<x:int8, y:int8, z:int8>>`
   - `struct<name:string, age:int8, gpa:float, addresses:list<string>>`
2. `cudf::gather()`, to gather Struct rows.
3. `cudf::test::column_wrapper` support. E.g.:

```c++
cudf::struct_column_wrapper<string, int8_t, float, vector<string>> city_watch_members {
  {"Carrot Ironfoundersson", 23, 3.75, {"Copperhead Mountains", "Ankh-Morpork"}},
  {"Samuel Vimes", 48, null, {"Cockbill Street, Ankh-Morpork", "Scoone Avenue"}}
};
```
A Struct is a nested data type containing an ordered sequence of child members ("fields"). It is similar to a plain-old-data struct in C++.
Each struct field can have a different data type, and can be arbitrarily complex or nested. E.g.:
struct <name:string, age:int8, gpa:float, addresses:list<string>>
Each of name, age, gpa, and addresses has a different data-type. addresses is itself a nested datatype.
Struct columns might contain lists/structs, or be themselves elements of a list, or members of a struct.
The in-memory layout of Struct columns in Apache Arrow is described in the Apache Arrow Format Specification. It is expected that the in-memory layout for CUDF Structs will be near identical.
Since the cudf::column implementation already provides support for child columns, it is expected that implementing Struct fields as child columns should be straightforward.
cudf::column already accommodates the notion of nested types via child columns. String and List column types already use child columns to store metadata (offset information) to manage the underlying data buffers. Using child columns to implement fields appears straightforward and logical. cudf::gather() to gather Structs will simply involve delegation to the field columns.

Alternatives considered: None, really. The availability of child columns in cudf::column makes this the straightforward approach.
The in-memory layout of Struct columns in Apache Arrow is described in the Apache Arrow Format Specification. The layout of CUDF Structs will be near identical:
- Each Struct field will be stored as a child column, in cudf::column::children.
- Reading a Struct row depends on collecting the corresponding rows of each child column, in their declared field order.
- A Struct column's cudf::column::data will not point to any data.
- Any field (or the Struct value itself) may be null. E.g. for struct<name:string, score:int8>, the following rows are valid:

```
{"Alpha", 1}   // (Non-null name and score)
{"Bravo", null} // (Null score)
{null, 3}       // (Null name)
null            // (Null Struct value)
```

Example 1: Struct<f:float, i:int>:

```
Data: [ {1.0, 2},
        {4.0, 5},
        null,
        {8.0, null} ]
```

```
{
  type = struct
  validity = [1, 1, 0, 1]
  children = {
    {
      type = float
      values = [1.0, 4.0, X, 8.0]
      validity = [1, 1, 1, 1]
      children = null
    },
    {
      type = int
      values = [2, 5, X, X]
      validity = [1, 1, 1, 0]  // Row 3's int is null; row 2 is a don't-care.
      children = null
    }
  }
}
```
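How a logical Struct row is assembled from this layout can be modeled with a few lines of Python (a sketch with hypothetical names; plain Python lists stand in for device buffers and bitmasks):

```python
# Illustrative model: read logical rows of a Struct column. The parent's
# validity takes priority; each child then contributes its value (or null)
# in declared field order. None stands in for X/null.

def struct_row(parent_validity, children, i):
    """Assemble logical row i of the struct from its child columns."""
    if not parent_validity[i]:
        return None  # Null struct row, regardless of child validity.
    return tuple(values[i] if validity[i] else None
                 for values, validity in children)

# Struct<f:float, i:int> from the example above.
parent_validity = [1, 1, 0, 1]
f_child = ([1.0, 4.0, None, 8.0], [1, 1, 1, 1])  # (values, validity)
i_child = ([2, 5, None, None], [1, 1, 1, 0])

rows = [struct_row(parent_validity, [f_child, i_child], i) for i in range(4)]
# rows == [(1.0, 2), (4.0, 5), None, (8.0, None)]
```

Note that row 2 comes back null purely from the parent's validity bit; the children's entries at that index are never consulted.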
Example 2: Struct<f:float, ilist:list<int>>:

```
Data: [ {1.0, (1,2,3)},
        {4.0, (5,6,null)},
        null,
        {8.0, null} ]
```

```
{
  type = struct
  validity = [1, 1, 0, 1] // Third row is null.
  children = {
    {
      type = float
      values = [1.0, 4.0, X, 8.0]
      validity = [1, 1, 1, 1]
      children = null
    },
    {
      type = list
      values = null
      validity = [1, 1, 1, 0]
      offsets = [0, 3, 6, 6, 6]
      children = {
        {
          type = int
          values = [1, 2, 3, 5, 6, X]
          validity = [1, 1, 1, 1, 1, 0]
          children = null
        }
      }
    }
  }
}
```
This example illustrates how the Struct's validity buffer overrides that of the child columns:

- Row 1 reads as {4.0, (5,6,null)} because:
  - The float field has a valid value: 4.0.
  - The list field has a valid value: offsets 3-6 in the underlying int column.
  - The int sub-column has an invalid value at offset 5, so the list reads as (5,6,null).
- At row 2, the child columns hold don't-care values (marked X). But the Struct value at row 2 still returns null, because the Struct's own validity marks the row invalid.
- At row 3, the list field's validity indicates a null value at index 3, while the parent Struct does not. As a result, the Struct row at index 3 is read as {8.0, null}.

Example 3: List<Struct<f:float, i:int>>:

```
Data: [ [{1.0, 1}, {2.0, 2}],
        [{3.0, 3}],
        [{4.0, null}],
        [null],
        null ]
```
```
{
  type = list
  values = null
  validity = [1, 1, 1, 1, 0]
  offsets = [0, 2, 3, 4, 5, 5]  // Row 3 holds one (null) struct element.
  children = {
    {
      type = struct
      values = null
      validity = [1, 1, 1, 1, 0]  // Element 4 is the null struct in row 3.
      children = {
        {
          type = float
          values = [1.0, 2.0, 3.0, 4.0, 5.0]
          validity = [1, 1, 1, 1, 1]
          children = null
        },
        {
          type = int
          values = [1, 2, 3, X, 5]
          validity = [1, 1, 1, 0, 1]
          children = null
        }
      }
    }
  }
}
```
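Decoding this layout can be sketched in Python (hypothetical names; plain lists stand in for device buffers): each List row slices a range of element rows out of the child Struct column, which in turn gathers from its field columns.

```python
# Illustrative model: decode List<Struct<f:float, i:int>>.

def struct_row(validity, fields, j):
    """Read element j of the struct child; struct validity takes priority."""
    if not validity[j]:
        return None
    return tuple(vals[j] if valid[j] else None for vals, valid in fields)

def list_rows(list_validity, offsets, elem_reader):
    """Materialize each list row by slicing element rows via offsets."""
    out = []
    for i in range(len(list_validity)):
        if not list_validity[i]:
            out.append(None)  # Null list row.
        else:
            out.append([elem_reader(j)
                        for j in range(offsets[i], offsets[i + 1])])
    return out

list_validity = [1, 1, 1, 1, 0]
offsets = [0, 2, 3, 4, 5, 5]
struct_validity = [1, 1, 1, 1, 0]
f_field = ([1.0, 2.0, 3.0, 4.0, 5.0], [1, 1, 1, 1, 1])
i_field = ([1, 2, 3, None, 5], [1, 1, 1, 0, 1])

rows = list_rows(list_validity, offsets,
                 lambda j: struct_row(struct_validity, [f_field, i_field], j))
# rows == [[(1.0, 1), (2.0, 2)], [(3.0, 3)], [(4.0, None)], [None], None]
```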
This example illustrates how list<struct<int,...>> is analogous to list<int> columns. A List column's offset vector applies to a Struct element just as it does to any primitive element. For instance, when composing the first list row from offsets 0-2 of the child column, the Struct column in turn collects elements 0-2 of each of its child columns (i.e. fields). Thus, the first List row is:
[ { 1.0, 1}, {2.0, 2} ].
A map data-type could be defined in terms of a list of struct<key:KeyType, value:ValueType>. A Map column might simply have a single child list column, which holds a single struct column, which in turn has two child columns: key and value.

Something I've thought about in the past is that a struct column is really just a table with its own row-wise null mask. It seems like it would be advantageous to lean on that similarity for reducing redundant code.
@jrhemstad but that means nesting applies equally to tables as well as columns, rather than just columns. e.g. you would need to be able to have a column of tables. Or a column of lists of tables. Right?
Yeah, perhaps the similarity is shallower than I originally thought.
I should probably ask this on Slack.
@jrhemstad: Might I solicit your advice on the following?
```c++
cudf::struct_column_wrapper<string, int8_t, float, vector<string>> city_watch_members {
  {"Carrot Ironfoundersson", 23, 3.75, {"Copperhead Mountains", "Ankh-Morpork"}},
  {"Samuel Vimes", 48, null, {"Cockbill Street, Ankh-Morpork", "Scoone Avenue"}}
};
```
My C++ skills... aren't, evidently. std::initializer_list<> requires a single type T. So what might the signature for this constructor look like? The type-list might be user-defined. I've been referring to std::tuple as a guide. I'd really appreciate your advice.
As an aside, @nvdbaranec, the third template parameter does not gel well with lists.
cudf::test::list_column_wrapper<T> takes a single type argument, regardless of the nesting level. (I found that very clever.) Should the third type argument in struct_column_wrapper be string, not vector<string>?
So what might the signature for this constructor look like? The type-list might be user-defined. I've been referring to std::tuple as a guide. I'd really appreciate your advice.
Drawing again on the similarities between struct columns and tables, you basically want the table_wrapper I described here: https://github.com/rapidsai/cudf/issues/3203
It may be easier to just construct a struct_column_wrapper from other *_column_wrappers and construct each member column individually in the same way we do for tables today in tests.
Your example is a nice goal, and should be possible in theory, but it will be quite complex to make it work in practice, especially to allow indicating individual members of the struct are null in addition to indicating the entire struct is null.
It may be easier to just construct a struct_column_wrapper from other *_column_wrappers and construct each member column individually in the same way we do for tables today in tests.
I was afraid that might be the case, @jrhemstad. I'm afraid this will imply that constructing list<struct<...>> will be verbose.
I'll try to continue. Thank you for confirming.
Another question, regarding a Struct's validity/null-mask. (This pertains to something that @jlowe mentioned in a prior discussion.)
In systems like Apache Spark SQL or Apache Hive, when a Struct's members are accessed, the value is deemed null if either the Struct or that specific member is null.
In CUDF, a Struct may be constructed from child columns whose validity may be set independently of the Struct's validity.
Are there any circumstances where a child column needs to be read using its original validity setting? Or should we always mask out the child columns at the indices where the parent Struct is null?
Or should we always mask out the child columns at the indices where the parent Struct is null?
What does Arrow do? AFAIK, the children and parent have independent null masks.
This pertains to something that @jlowe mentioned
My comment referred to operating on a field within a struct. For example, take a struct S of three fields: x, y, z, and I want to do a SQL SELECT S.x + S.y. One way to support this is to update libcudf binops so it understands how to properly access structure fields (and somehow specify in the input parameters which field in the struct column is being referenced for each input). It would not be correct to pass the x and y columns directly to the binop function since some structure rows may be null and not reflected in the validity masks of either the x or y columns.
At least in the short term, I think it would be nice if libcudf had a method to manifest a "struct field view" that allows a structure field column to be passed to existing, struct-oblivious libcudf functions. Building such a view would build a new validity mask by merging the validity of the field column with the validity of the parent struct column (via bitwise-AND), so the libcudf operation knows which rows in the column are null. I'm hoping we're able to re-use the data (not validity) of the original field column since we'd be passing a column_view for the input which doesn't need to own memory. But even if for some reason we have to make a full copy of the field column data to pull this off, at least we can open up a lot of libcudf functionality for queries that operate on structure fields.
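The "struct field view" boils down to a bitwise AND of the parent's and the field's validity bitmasks. A minimal sketch, assuming word-packed masks (all names here are hypothetical; in libcudf this would be a bitmask kernel over device memory):

```python
# Sketch of the validity merge behind a "struct field view".
# Bitmasks are modeled as lists of 32-bit words; LSB = lowest row index.

def merge_validity(struct_mask, field_mask):
    """A field-view row is valid only if both the struct row and the field
    value are valid: bitwise-AND the two masks, word by word."""
    return [s & f for s, f in zip(struct_mask, field_mask)]

# One word covering 4 rows:
struct_mask = [0b1011]  # rows 0, 1, 3 valid; row 2 is a null struct
field_mask  = [0b1110]  # the field's value at row 0 is null
merged = merge_validity(struct_mask, field_mask)
assert merged == [0b1010]  # only rows 1 and 3 are valid in the field view
```

The field's data buffer is untouched; only the view's mask differs, which is why this can plausibly remain zero-copy for the data itself.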
It would not be correct to pass the x and y columns directly to the binop function since some structure rows may be null and not reflected in the validity masks of either the x or y columns.
Thanks, @jlowe. Permit me to rephrase: In the binop case (maybe even most cases), the column needs to be evaluated in the context of its parent struct column. But am I missing a counterpoint?
I think it would be nice if libcudf had a method to manifest a "struct field view" that allows a structure field column to be passed to existing, struct-oblivious libcudf functions.
This is brilliant. A struct field-view would be an elegant solution. (I'll have to figure out how to handle the column lifetimes, if I can't superimpose the struct's null-mask.)
But I wonder whether this would be required at all, if there are no circumstances where the struct's underlying columns need to be evaluated independently of the struct's validity.
What does Arrow do? AFAIK, the children and parent have independent null masks.
The relevant section from the Arrow spec:
Any of the child field arrays can have null values according to their respective independent validity bitmaps.
This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot
when one or more of its child arrays has a non-null value in their corresponding slot.
When reading the struct array the parent validity bitmap takes priority.
This is illustrated in the example above: the child arrays have valid entries for the null struct
but are "hidden" from the consumer by the parent array's validity bitmap.
However, when treated independently corresponding values of the children array will be non-null.
In the context of a struct row, the struct's validity takes precedence. The examples listed in this issue's description illustrate the case.
Re-reading the last sentence last night gave me pause. I can't think of a situation where a Struct's children are "treated independently". It is akin to extracting the child column of a CUDF list column.
In implementation, I am considering superimposing the struct's validity on top of that of the child columns, at construction. This obviates having to do the same on each read. I'd like to be sure this wouldn't run afoul of what the CUDF team had in mind.
I have been advised to also seek guidance from @trxcllnt and @andygrove, who are more knowledgeable about Apache Arrow than I am. The question at hand is whether it would be incorrect to superimpose the parent (i.e. Struct) validity buffer over all its child columns, at construction:
```c++
// For each child of the struct:
child.validity &= struct.validity;
```
Effect: A child column's row-value is null if either parent struct or the child column itself has that index set to null.
Advantage: The null mask does not need recomputation every time it is read. Also, all the rest of CUDF doesn't need special handling for struct fields.
Disadvantage: The child column's original null-mask is lost; it is not preserved once the column is adopted into the struct.
This might be acceptable, since the child column is adopted by the struct, and is (AFAICT) not processed outside the struct's context.
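The construction-time superimposition could be sketched as ANDing the parent mask into each child, propagated recursively through the nesting (a sketch with a hypothetical structure; real cudf bitmasks are word-packed, and list children would additionally need offset-aware propagation rather than the element-wise AND shown here):

```python
# Sketch: superimpose a struct's validity onto all of its children at
# construction time, recursively, so later reads need no extra merging.
# A child is modeled as (own_mask, list_of_grandchildren); all masks here
# are assumed to have the same row count (fixed-width children only).

def superimpose(parent_mask, child):
    """Return the child with parent_mask ANDed into its mask, and the
    result propagated down to its own children."""
    own_mask, grandchildren = child
    new_mask = [p & c for p, c in zip(parent_mask, own_mask)]
    return (new_mask, [superimpose(new_mask, g) for g in grandchildren])

struct_mask = [1, 1, 0, 1]        # struct row 2 is null
name_child = ([1, 0, 1, 1], [])   # name field: row 1 is null
score_child = ([1, 1, 1, 0], [])  # score field: row 3 is null

children = [superimpose(struct_mask, c) for c in (name_child, score_child)]
# children == [([1, 0, 0, 1], []), ([1, 1, 0, 0], [])]
```

After this pass, each child's mask already reflects the parent's nulls, so downstream, struct-oblivious code can consume the children directly.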
Yes, in Arrow the struct parent and child nullmasks are independent. They're also independent in Lists + the other nested types (except Unions, but that was a recent change). @jrhemstad's observation that structs are a sort of table-within-a-table is more or less correct, format-wise. I vote not to deviate from Arrow in this way unless absolutely necessary, since deviations make the interop story more difficult.
To also add, there's a logical and semantic difference between an element in a struct being NULL versus the elements in the child columns being NULL, and Arrow captures this:

```python
import pyarrow as pa

test = pa.array(
    [
        {'a': 1, 'b': 2},
        None,
        {'a': None, 'b': 4},
        {'a': 5, 'b': None},
        {'a': None, 'b': None}
    ]
)
print(test[1])
# NULL
print(test[4])
# {'a': None, 'b': None}
```
So, to be clear, we are not deviating from the Arrow spec. With the example from @kkraus14 above, the Arrow spec states that the validity looks like the following (expanded out for simplicity):

```
struct_validity = [VALID, INVALID, VALID  , VALID  , VALID  ]
a_validity      = [VALID, ???    , INVALID, VALID  , INVALID]
b_validity      = [VALID, ???    , VALID  , INVALID, INVALID]
```
The ??? in there are for entries that the spec says can be anything. It does not matter if they are VALID or INVALID because the parent's validity for the row says it is INVALID.
The issue is that if ??? is set to VALID, which the Arrow spec says it can be, and someone tries to pull the column a out of the struct, what should we do? Do we keep the validity as-is for column a? Do we need to merge the validity of the parent with the validity of column a? For Spark there is no ambiguity: struct.a is NULL if struct is NULL. If that is true for pandas also, or pandas will never get to that point because an exception would be thrown ahead of time because None.a is illegal, then we should be able to remove the ambiguity early on, which is what @mythrocks is proposing.
If we remove the ambiguity early on, then pulling a out of struct is a zero-copy operation that returns a column_view. If we decide to just live with the ambiguity, then any group that does not want it needs to resolve it themselves. This would likely involve multiple kernels: pulling the different validity masks out of the data, turning them into boolean columns so we can AND them together, and then doing a copy_if_else to finally get the fixed-up data before doing the final computation. That would mean that doing struct.a + struct.b would likely involve 9 cudf calls involving kernels instead of 1. Deeper nesting makes it even worse.
Has anyone tried just asking on the Arrow mailing list?
Just closing the loop here with the results of yesterday's parley.
The current plan is to superimpose (or "flatten") the struct's validity mask to all applicable children. In this initial run, the original child masks will not be preserved. The Struct owns the children, and uses them to represent struct members.
If a use-case crops up where a user supplies the child-columns and expects them to be unmodified, we will revisit this decision:
Options would include:

- Flattening lazily in structs_column_view.child(index), which might need to return the flattened index alongside the column.
- Running a bitmask_and() kernel for all children, for each column batch, per access. Performance implications should be studied.

For the moment, either might be a bridge too far.