Dataframes.jl: Clarify position on iteration API

Created on 14 May 2020  路  22Comments  路  Source: JuliaData/DataFrames.jl

I watched the awesome recent AMA and realized you guys are preparing for a 1.0 release :tada:

There's probably one decision I haven't got my head around with DataFrames and whether it would lead to breaking API changes, and that is to do with the iteration API.

In some cases a DataFrame opts into some iteration API functions:

first(df)
last(df)
filter(row -> predicate(row.column), df)

while in other cases it does not:

iterate(df)
map(df)
reduce(df)

It would seem cleanest to me to pick a side of the fence, and either:

  1. Implement row-iteration of DataFrame.
  2. require filter(predicate, eachrow(df)), first(eachrow(df)), last(eachrow(df)) before releasing a version 1.0.

If doing 1. then we'd want to see if there are any methods implemented that contradict this interpretation before releasing 1.0, but that actual implemtation could come later. I think most of the column iteration functionality has been removed, though.

Sorry if this has been discussed or decided elsewhere and I'm not aware of the outcome.


For some background, I quite often use filter on DataFrames and find this interface quite useful. I do tend to enjoy using Vector{NamedTuple} and TypedTables.jl because I also have access to map, reduce and mapreduce from Base, plus all the functions in SplitApplyCombine which exist for, well, splitting, and applying, and combining, especially for tables. I would love to support DataFrame in SplitApplyCombine, for example. (And there are many other great packages out there that rely on iteration, too).

My personal bias has always been towards a table is a relation which is a collection of rows and we can use relational algebra to manipulate it. I don't see a particular advantage of having DataFrame not support iterating either columns or rows and requiring eachrow and eachcol when you could choose one by default.

decision non-breaking

Most helpful comment

In general I tend towards supporting the iteration API. There are however the following issues that require being discussed.

  1. I think iterate is just fine to add; I think then we are fine to add other methods in Base which rely on "row"-oriented interface (#2251, #2229 applied also to AbstractDataFrame, reintroducing length, )
  2. Deciding on https://github.com/JuliaData/DataFrames.jl/issues/2194 for GroupedDataFrame
  3. Dealing with https://github.com/JuliaData/DataFrames.jl/issues/2097 for GroupedDataFrame and DataFrame; similarly https://github.com/JuliaData/DataFrames.jl/issues/2088 and https://github.com/JuliaData/DataFrames.jl/issues/2048
  4. with map and reduce there is a problem what should be the output type of the result (https://github.com/JuliaData/DataFrames.jl/issues/2053, also this is the reason why we deprecated map for GroupedDataFrame)
  5. There is a tricky thing about what type to pass to functions - DataFrameRow or NamedTuple or positional arguments - but this can be worked out I think later (in general - we should have different options here as different things are good depending on the fact if the table is wide or not)

Out of those points point 4 (and in consequence point 2) are crucial to decide on and hardest so I would concentrate the discussion on them first. The general questions are:

  • if I apply map to a DataFrame and we process it row-wise what should be the type of the result
  • if I apply reduce to a DataFrame and we process it row-wise what should be the type of the result
  • if I apply map to a GroupedDataFrame what should be the type of the result
  • if I apply reduce to a GroupedDataFrame what should be the type of the result
  • if I apply broadcasting to a GroupedDataFrame what should be the type of the result
  • relatedly (does not have to be resolved now and in this thread but I note it for completeness) if I apply broadcasting to a DataFrameRow what should be the type of the result (here a link is to NamedTuple which currently does not allow broadcasting)

All 22 comments

In general I tend towards supporting the iteration API. There are however the following issues that require being discussed.

  1. I think iterate is just fine to add; I think then we are fine to add other methods in Base which rely on "row"-oriented interface (#2251, #2229 applied also to AbstractDataFrame, reintroducing length, )
  2. Deciding on https://github.com/JuliaData/DataFrames.jl/issues/2194 for GroupedDataFrame
  3. Dealing with https://github.com/JuliaData/DataFrames.jl/issues/2097 for GroupedDataFrame and DataFrame; similarly https://github.com/JuliaData/DataFrames.jl/issues/2088 and https://github.com/JuliaData/DataFrames.jl/issues/2048
  4. with map and reduce there is a problem what should be the output type of the result (https://github.com/JuliaData/DataFrames.jl/issues/2053, also this is the reason why we deprecated map for GroupedDataFrame)
  5. There is a tricky thing about what type to pass to functions - DataFrameRow or NamedTuple or positional arguments - but this can be worked out I think later (in general - we should have different options here as different things are good depending on the fact if the table is wide or not)

Out of those points point 4 (and in consequence point 2) are crucial to decide on and hardest so I would concentrate the discussion on them first. The general questions are:

  • if I apply map to a DataFrame and we process it row-wise what should be the type of the result
  • if I apply reduce to a DataFrame and we process it row-wise what should be the type of the result
  • if I apply map to a GroupedDataFrame what should be the type of the result
  • if I apply reduce to a GroupedDataFrame what should be the type of the result
  • if I apply broadcasting to a GroupedDataFrame what should be the type of the result
  • relatedly (does not have to be resolved now and in this thread but I note it for completeness) if I apply broadcasting to a DataFrameRow what should be the type of the result (here a link is to NamedTuple which currently does not allow broadcasting)

To add one point to @bkamins' list: one tricky point is that broadcast applies the function to each cell (which is very useful). So if we supported map to apply an operation on each row it would be a bit inconsistent. Not necessarily a problem, though.

Overall I don't think we have to settle this before 1.0. At https://github.com/JuliaData/DataFrames.jl/issues/1514 we've decided DataFrames are collections of rows, so function from the collection API that we implement follow that convention. But we don't have to implement all of them right now, especially since you can always use eachrow.

Concerning the point about reduce, I would expect reduce(gdf::GroupedDataFrame, :old => + => :new) to be the same as combine(gdf, :old => sum => :new) (once the groups are reduced to just one element, IMO there isn't much point in keeping the grouping).

I think reduce on a GroupedDataFrame is relevant because you get support for:

  • Transducers
  • OnlineStats

With the added bonus that these operations would multithread automatically.

So my thinking was a bit different. I thought that reduce(op, gdf::GroupedDataFrame) would apply op as a reduction to elements of gdf (which are SubDataFrames).

On the other hand reduce(op, df::AbstractDataFrame) would operate row-wise (with possibly allowing column selectors).

So what you write would be rather combine(gdf, :old => (x -> reduce(+, x)) => :new) which is already allowed (and the same as just combine(gdf, :old => sum => :new as you noted).

So what you write would be rather combine(gdf, :old => (x -> reduce(+, x)) => :new) which is already allowed (and the same as just combine(gdf, :old => sum => :new) as you noted)

In my experience with JuliaDB, the problem is that this approach is much more wasteful, because you need to actually extract the subvectors. OTOH if you know that the user is applying a reduce operations many optimizations become possible (compared to :old => (x -> reduce(+, x)) => :new, where you cannot know that the user did not need the whole vector).

I suspect the same rationale applies here, because you can simply perform the reduction when iterating on gdf.idxs with gdf::GroupedDataFrame.

Yes - this is true (and internally we do it in a different way). I am simply referring to the fact what API of reduce should be (i.e. it should reduce over an iterable, not over iterables nested within this iterable).

Yes, overall one potential issue is that iteration over DataFrame and GroupedDataFrame gives very different objects. So all functions that rely on it can appear as quite inconsistent/unintuitive to users if they want to apply them to both types. OTOH select, transform and combine have more consistent behaviors in that regard, but they don't follow iteration.

Something that just occurred to me is that we could use combine(gd, :x => Reduce(op) => :newx) for reductions (similar to ByRow). That's actually how it works internally, and the corresponding objects could almost be exported as-is. Then we wouldn't need a new function at all.

we could use combine(gd, :x => Reduce(op) => :newx)

:100:

However, this will solve the issue only partially, we still have many other functions to decide how they should be handled (or not handled - what we do now, because we are unsure what we want :smile:).

Awesome. :smile:

In general I note a lot of interface questions like "what should map on ... return? I agree all this is hard - very hard - in fact I would suggest not answering these questions at all and deferring to some higher interface!

For example, a Vector{NamedTuple} makes a relatively handy table/dataframe, straight from Base. A NamedTuple is a bit too specific (much like Tuple compared to AbstractVector) and really needs generalizing to a more versatile row type; while we have Tables.jl definition based on getproperties these day's I'm thinking more about Dictionaries.AbstractDictionary{Symbol} for rows, so as an interface for a table or dataframe I'm thinking of, basically, AbstractArray{<:AbstarctDictionary{Symbol, Any}} where each row has identical indices (i.e. keys(table[1]) === keys(table[2]), etc).

This immediate restricts you (in a good way!). Starting with df::AbstactArray{<:AbstractDictionary{Symbol, Any}}:

  • if I apply map to a DataFrame and we process it row-wise what should be the type of the result.

map(f, df) always gives gives an array; if f returns rows the the answer is also a table/dataframe.

  • if I apply reduce to a DataFrame and we process it row-wise what should be the type of the result

For mapreduce(f, op, df) it depends on f and op, but you are just doing mapreduce over an AbstractArray so the anwser is given.

  • if I apply map to a GroupedDataFrame what should be the type of the result
  • if I apply reduce to a GroupedDataFrame what should be the type of the result

If a table is an array-of-dictionaries what is your grouped data frame? I see space for two different objects, actually, which you want to be able to switch between with ease. The first is a dictionary-of-tables (i.e. dictionary-of-arrays-of-dictionaries), which is what SplitApplyCombine.group gives when the input is a table, and it's pretty easy to use e.g. map on each group before combining them together again. The second is where you may also want to turn that object into a view of a "flat" table where the outer dictionary key becomes one of the columns (so the internals are akin to what JuliaDB calls NDSparse or what SQL DBs call "partitioned tables"). With the former you can map/filter/reduce over the groups, with the latter you can map/filter/reduce over the individual rows - just depends what you want to do.

IMO using external interfaces like AbstractVector and some container type which can represent "rows" will not only make your life easier as implementers but more importantly your users won't have to guess because they will just be using the interface they know from Base. Even if you don't literally subtype AbstractVector you can still use these arguments to drive your intuition (and you can bet your users probably will, whether you want them to or not).


To add one point to @bkamins' list: one tricky point is that broadcast applies the function to each cell (which is very useful).

I feel this is general; I would frequently like to use broadcast on Vector{Vector{Float64}} say but we don't have syntax for that - I would be very tempted to find an elegant solution to the general problem and use that instead. (In short I personally feel that broadcast should generalize map and they should coincide when the keys of all the inputs are identical, which is what happens with arrays).

I agree we should decide based on comparisons with vectors of rows. The anwser is pretty clear: map on data frames should operate on rows and return a data frame, and map on GroupedDataFrame should operate on data frames and return a GroupedDataFrame. But I'm don't think I'd be willing to trade the current behavior broadcast on data frames, which is very useful, for consistency with map, which is less useful IMO.

As for broadcast I am pretty sure that what we do now (treat data frame as a matrix) is what we should keep. It is inconsistent with the rest of the design (unfortunately), but this is what all users asked for for a long time. Also something like:

df[!, several_new_cols_and_some_old_cols] .= 0

makes perfect sense currently, but it is only possible because we treat data frame like a matrix.

and map on GroupedDataFrame should operate on data frames and return a GroupedDataFrame

To be clear - I'd argue the type of map(f, gdf) would depend on what f returns. If data frames are returned then yes you'd get a GroupedDataFrame back, but if it returns rows or scalars you might want something else.

Ah yes that's an interesting option.

But this would then have to apply to map on DataFrame also - right?

I suppose I'd want a Vector most of the time when the result of f in map(f, df) is not some kind of "row"?

I agree with @andyferris. I think map(f, df) should return a DataFrame only if f returns a row-like object. Otherwise a plain Vector makes sense (e.g., map(_ -> 1.0, df) should give ones(length(df))). For this, I think we need something like

Tables.isrow(::Nameduple) = true
Tables.isrow(::AbstractDict{<:Union{Symbol,String}}) = true
Tables.isrow(::DataFrameRow) = true
Tables.isrow(_) = false

Like wise, I think map(f, gdf) should return a GroupedDataFrame only if f returns a rowtable-like object (or maybe table-like is OK). We have Tables.isrowtable so this is implementable. But what should happen in other cases? Just return a Dict{GroupKey,Whatever}? Or would it be nice to have GroupedDataRow ("dictionary-of-dictionary") when f returns a row-like object? What about GroupedDataCell ("dictionary")?


BTW, I find it nice to conceptualize map by

map(f, xs) = reduce(concatenate, (singleton(k, f(v)) for (k, v) in pairs(xs)))

where xs is either a DataFrame or GroupedDataFrame. Defining map is equivalent to find appropriate definitions of concatenate and singleton. Note that the output type of singleton would be the output type of map. In general, singleton can depend on the input type (i.e., singleton(k, f(v), typeof(xs)) instead of singleton(k, f(v))).

Importantly, this connects map and reduce and avoids inconsistency in the design.

For example, map(f, ::Vector) in Base can be implemented (conceptualized) with concatenate = vcat and singleton(_, x) = [x].

A possible definition for dataframes is something like

concatenate(a::AbstractDataFrame, b::AbstractDataFrame) = vcat(a, b)
concatenate(a::GroupedDataFrame, b::GroupedDataFrame) = mergewith(concatenate, a, b)
concatenate(a::AbstractDict, b::AbstractDict) = merge(a, b)
concatenate(a::Any, b::Any) = vcat(a, b)  # fallback

# called via `map(f, ::DataFrame)`
singleton(::Integer, x) = Tables.isrow(x) ? DataFrame([x]) : [x]

# called via `map(f, ::GroupedDataFrame)`
function singleton(k::GroupKey, v)
    if Tables.isrowtable(v)
        return GroupedDataFrame(k, DataFrame(v))  # made-up constructor
    elseif Tables.isrow(v)
        return Dict(k => DataFrameRow(v))  # made-up constructor
        # return GroupedDataRow(k, DataFrameRow(v))  # made-up type
    else
        return Dict(k => v)
        # return GroupedDataCell(k, v)  # made-up type
    end
end

Or would it be nice to have GroupedDataRow ("dictionary-of-dictionary") when f returns a row-like object?

Yes. I believe we need tables (collections of rows) that index like an array and other tables that index like a dictionary. I'd be tempted to use the latter to model a table with a primary key (though strictly speaking it's unnecessary for the dictionary key to be one of the columns, it can live "outside" like row indices) .

I believe we need tables (collections of rows) that index like an array and other tables that index like a dictionary.

Are you talking about Tables.rowaccess vs Tables.columnaccess or something else? Don't we need additional kinds of "table" here? With Haskell-ish notation, I think we need to handle

[K -> V]                 | Vector{Dict{K,V}}                     "row table"
K -> [V]                 | Dict{K,Vector{V}}                     "column table"
(K1 -> V1) -> [K2 -> V2] | Dict{Dict{K1,V1},Vector{Dict{K1,V1}}} "GroupedDataFrame"
(K1 -> V1) -> (K2 -> V2) | Dict{Dict{K1,V1},Dict{K2,V2}}         "GroupedDataRow"
(K -> V1) -> V2          | Dict{Dict{K1,V1},V2}                  "GroupedDataCell"

With the SQL-like approach I'd imagine the latter three would be flattened to a table. But I thought @nalimilan's comment above https://github.com/JuliaData/DataFrames.jl/issues/2254#issuecomment-630234715 indicates we are dealing with (K1 -> V1) -> [K2 -> V2]. How do you treat (K1 -> V1) -> [K2 -> V2] and (K1 -> V1) -> (K2 -> V2) like tables?

BTW, I find the naming GroupedDataFrame unfortunate _if_ it really is (K1 -> V1) -> [K2 -> V2]. I think it'd be nice to rename it to DataFrameGroups or something and then re-introduce GroupedDataFrame as a row table view of DataFrameGroups.

Indeed the points you rise are good. I have the following comments:

  • we need to keep DataFrames.jl as simple as possible - it is assumed to be an entry-level package so we should keep in mind that most users will not be able to digest all the details
  • we in general can do it for 2.0 release of DataFrames.jl, but probably for 1.0 release we will stick with what we have
  • making all this first requires agreement in Tables.jl about the common API (actually I think all we write should not be DataFrames.jl specific I think)

Given this I have another perspective. Maybe we should always return a Vector of what is returned in map and just later delegate to the user what the user wants to do with this Vector (in particular if one wants a DataFrame it can be easily created later using one command). This is of course not most elegant solution but maybe it will be easiest for users to digest in practice?

FWIW, I'm not sure we'd really need a Tables.isrow because we have Tables.Row; i.e. you can wrap anything that satisfies the AbstractRow interface like Tables.Row(rowlike) and it provides consistent behaviors _and_ is a dispatchable type. It should also be efficient and not even allocated in cases where it's just a temporary wrapper.

Maybe we should always return a Vector of what is returned in map and just later delegate to the user what the user wants to do with this Vector (in particular if one wants a DataFrame it can be easily created later using one command).

You can already use [f(x) for x in df] or collect(f(x) for x in df) to express that the result should be a Vector. I think they are pretty concise. So, if you want a fixed return type, I think map should return a DataFrame/GroupedDataFrame when the input is DataFrame/GroupedDataFrame.

Having realized collect(f(x) for x in df), I'm actually a bit more inclined to map(f, ::DataFrame) :: DataFrame. Sorry @andyferris :)

The map(f, _) implementation can assert the output of f is row/table-like or maybe automatically "upcast" to the correct type. It's probably safer to avoid accidentally wrap a table in a row. Something like:

# called via `map(f, ::DataFrame)`
function singleton(::Integer, x)
    if Tables.istable(v)
        error("Expected a row-like object. Please use `Iterators.flatten` when ",
              "the number of rows may change.")
    elseif Tables.isrow(v)
        return DataFrame([v])
    else
        return DataFrame(value=[v])
    end
end

# called via `map(f, ::GroupedDataFrame)`
function singleton(k::GroupKey, v)
    if Tables.istable(v)
        return GroupedDataFrame(k, DataFrame(v))
    elseif Tables.isrow(v)
        return GroupedDataFrame(k, DataFrame([v]))
    else
        return GroupedDataFrame(k, DataFrame(value=[v]))
    end
end

we have Tables.Row

Ah, good point. I think being explicit is better here.

I agree that the fact that comprehensions always return vectors make it less useful to have map do the same thing. That's why map on GroupedDataFrame didn't return vectors (before its deprecation).

Regarding singleton, actually we already have a similar logic in combine to decide how to interpret the return value (whether it's a single value, a single row, or a table). But there when a scalar is returned we automatically generate a column name to store it. This is essential there since it's not very useful to return a vector without grouping information, and as @bkamins said it keeps things simple for users. So it could make sense to do the same for map(f, ::AbstractDataFrame).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mattBrzezinski picture mattBrzezinski  路  5Comments

ahalwright picture ahalwright  路  3Comments

davidanthoff picture davidanthoff  路  4Comments

bkamins picture bkamins  路  8Comments

jangorecki picture jangorecki  路  7Comments