Dataframes.jl: Behavior of `df[1, cols]` and `DataFrameRow`

Created on 24 Sep 2018 · 28Comments · Source: JuliaData/DataFrames.jl

@nalimilan This is speculative, but I want to discuss it before we finish getindex cleanup.

I think we should keep DataFrameRow structure (there were some discussion of returning NamedTuple from eachrow but it would not allow for mutation);
Currently getindex[1, :] returns a DataFrame but maybe it should return a DataFrameRow? Also then maybe DataFrameRow could accept only a selection of columns to make getindex[1, columns] work consistently and return DataFrameRow (this would be breaking and I am not 100% convinced we want it)?
If we do not go for point 2) maybe it would be reasonable to expose some function that allows to get a row of a DataFrame as DataFrameRow directly (something like getrow(df, 1))?

Source

bkamins

Most helpful comment

I like option 1 because #1449 gets us on the road to doing

mean(df[2, :])

Returning a DataFrame loses that progress.

I really like this df[[1], cols] returning AbstractDataFrame objects, and clarifying between vector and scalar sub-setting.

pdeffebach on 24 Sep 2018

👍2

All 28 comments

That definitely makes sense, especially if we decide that iterating over a data frame gives rows (and therefore DataFrameRow objects). That would also be much more efficient than creating a new DataFrame. Overall I wouldn't expect this to break a lot of code, but who knows. Anyway we should go through a deprecation period first, and we can always go back if it turns out to be a bad idea.

nalimilan on 24 Sep 2018

Before I move on can you clarify me the following design issues (those happened before I started contributing):

why DataFrameRow is not a subtype of AbstractDataFrame?
why SubDataFrame does not allow arbitrary AbstractDataFrame as a parent?
why DataFrameRow cannot be removed and replaced by SubDataFrame (assuming that the questions above do not have a strong reason to unify them - I guess all the mechanics where DataFrameRow is used in DataFrames.jl ecosystem could be rewritten for SubDataFrame and the mechanics allowing to iterate over DataFrameRow could be removed).

Probably you see what it is getting at: remove DataFrameRow in favor of SubDataFrame. Then df[1, :] would return a DataFrame still but @view df[1,:] would get a view of a row - and the same would be returned by eachrow iterator.

We could even retain DataFrameRow internally in DataFrames.jl if it is really needed but do not expose it to the user but instead expose publicly only SubDataFrame, e.g. in eachrow.

bkamins on 24 Sep 2018

* why `DataFrameRow` is not a subtype of `AbstractDataFrame`?

AFAIK it doesn't support the full interface. For example row[1, :x] doesn't work. Also iterating over a row should yield values like for NamedTuple, but we don't want that for data frames.

why SubDataFrame does not allow arbitrary AbstractDataFrame as a parent?

I guess that's just an oversight (we currently don't have other AbstractDataFrame types which can usefully be combine with SubDataFrame).

* why `DataFrameRow` cannot be removed and replaced by `SubDataFrame` (assuming that the questions above do not have a strong reason to unify them - I guess all the mechanics where `DataFrameRow` is used in DataFrames.jl ecosystem could be rewritten for `SubDataFrame` and the mechanics allowing to iterate over `DataFrameRow` could be removed).

I think it makes sense to differentiate a row from a subset of rows, if only because of the iteration difference I noted above. Having a different type can also allow for more efficient code, just like in by I was able to write faster methods for when a NamedTuple is returned rather than a DataFrame. There's a lot of hashing and comparison code for DataFrameRow which is used in places where performance is critical. Of course we could probably rewrite this for SubDataFrames where the indices are a scalar, but that wouldn't really simplify the code.

nalimilan on 24 Sep 2018

Thanks. This is what I see in the code. That is why I have said:

We could even retain DataFrameRow internally in DataFrames.jl if it is really needed but do not expose it to the user but instead expose publicly only SubDataFrame, e.g. in eachrow.

i.e. we internally use DataFrameRow but do not expose it to the users.

What is more SubDataFrame knows about the type of row-selector so SubDataFrame{Int} would have the same performance as DataFrameRow if we allow to create a view of a single ~column~ row (in fact apart from the fact that currently they have different methods implemented they would be identical).

bkamins on 24 Sep 2018

What is more SubDataFrame knows about the type of row-selector so SubDataFrame{Int} would have the same performance as DataFrameRow if we allow to create a view of a single column row (in fact apart from the fact that currently they have different methods implemented they would be identical).

Yes, but it would be weird to have iteration behave differently depending just on a type parameter, wouldn't it?

nalimilan on 24 Sep 2018

OK - let us leave it as is (but we have to be clear, at some point, in the documentation why we differentiate between DataFrameRow and SubDataFrame and what is the benefit of it). And actually if there is a benefit we should have either:

df[1, cols] return a DataFrameRow
df[[1], cols] return a DataFrame
@view df[1, cols] a DataFrameRow (the same as above)
@view df[[1], cols] a SubDataFrame

or:

df[1, cols] return a DataFrame
df[[1], cols] return a DataFrame
@view df[1, cols] a SubDataFrame
@view df[[1], cols] a SubDataFrame
row(df, 1) return a DataFrameRow

which one would you prefer?

bkamins on 24 Sep 2018

I think I'd prefer version 1, but as I said if it turns out its inconvenient we could change our strategy.

nalimilan on 24 Sep 2018

I like option 1 because #1449 gets us on the road to doing

mean(df[2, :])

Returning a DataFrame loses that progress.

I really like this df[[1], cols] returning AbstractDataFrame objects, and clarifying between vector and scalar sub-setting.

pdeffebach on 24 Sep 2018

👍2

Another option similar to 1 would be to have df[1, :] return a NamedTuple but view(df, 1) return a DataFrameRow. That would mirror more closely what happens with Matrix, i.e. the result of getindex lives independently from its parent, and it cannot be used to mutate the parent. That would have the advantage of consistency with other methods. The drawback would be 1) that mutation requires using view, which can be harder to find (but we also provide eachrow, which is more convenient anyway), and 2) that generating a NamedTuple could be slow if there are many columns and you only use a few of them (especially since the type instability of data frame doesn't allow benefiting from specialization).

Something to also take into account is that it would be easier to start with NamedTuple and move later to DataFrameRow if needed, since the former's immutability makes it less likely that code will break than if we make the change in the opposite direction.

nalimilan on 25 Sep 2018

It would be nice for df[1,:] to return the same thing as first(eachrow(df)). Because I anticipate users using

[f(x) for x in eachrow(df)]

and

[f(df[i, :] for i in nrow(df)]

interchangeably.

pdeffebach on 25 Sep 2018

@nalimilan I do not feel NamedTuple is a good approach as it is not mutable which breaks basic property of DataFrame.

However, I do not feel that eachrow(df) and df[i, :] have to be the same - as eachrow returns a view (which can be clearly explained to the users). Actually the argument that df[i, :] should return a copy is convincing.

Here is a summary what I think we should do:

df[col] return an original vector
df[cols] return a DataFrame of original vectors
@view df[col] return an original vector
@view df[cols] return a DataFrame of original vectors
df[row, col] return a value
df[rows, col] return a copy vector
@view df[row, col] a value
@view df[rows, col] a view
df[row, cols] return a DataFrame of copy vectors (this is disputable as it could be DataFrameRow or NamedTuple - here I guess we should choose what is most convenient as all options have their pros and cons)
df[rows, cols] return a DataFrame of copy vectors
@view df[row, cols] a DataFrameRow (this is disputable as it could be a SubDataFrame to simplify the interface as I have argued above but I guess we dropped it)
@view df[rows, cols] a SubDataFrame

When we settle on this I will update #1534.

bkamins on 25 Sep 2018

👍1

I agree with all of these except (as expected) the two ones you highlighted:

* `df[row, cols]` return a `DataFrame` of copy vectors (_this is disputable as it could be `DataFrameRow` or `NamedTuple` - here I guess we should choose what is most convenient as all options have their pros and cons_)

Returning a vector doesn't sound appropriate, as we would completely lose the column names which are an essential property defining a row. Either a DataFrameRow or a NamedTuple would be better. As I said I'm slightly leaning towards NamedTuple as it's always easier to be more restrictive first and make things less restrictive later if needed. I don't really get your position about this, since you said mutating is an essential property but then that returning a copy is convincing, two properties which seem incompatible to me.

I agree that eachrow(df) and df[i, :] can diverge, in particular because the former could return a type-specialized iterator which could allow for faster code (with a function barrier, or with compiler improvements), while the latter is doomed to be slow.

* `@view df[row, cols]` a `DataFrameRow` (_this is disputable as it could be a `SubDataFrame` to simplify the interface as I have argued above but I guess we dropped it_)

AFAICT this is not possible unless we find a solution for iteration, right?

nalimilan on 26 Sep 2018

I will lay out below what I think to let us consciously decide as this is a delicate thing and I guess we all want DataFrames 1.0 release to be stable in this area.

Case of `df[row, cols]`

mutating is essential

I mean that the result of df[row, cols] should be mutable, because df is mutable

returning a copy is convincing

Here I understand that we agree, i.e. it would be better that the result of df[row, cols] when mutated would not influence df.

Actually I think we should voice a third requirement as this is what I understand you want:

the result of df[row, cols] should drop from 2-D like object (DataFrame) to 1-D like object.

Now let me summarize the options we have discussed (I add NamedArray because you mentioned an option to return a vector):

Return type | Mutable | Copying | 1D | Breaking change |
-|-|-|-|-|
DataFrame | yes | yes | no | no (current behavior) |
DataFrameRow | yes | no | yes | mildly |
NamedTuple | no | yes | yes | yes (code that currently works might break, because it is immutable) |
NamedArray | yes | yes | yes | yes (introduces a new dependency) |

and every option has some disadvantage.

Based on this I lean towards DataFrame because it is not-breaking (this is what we currently have - although there is a duplication of functionality between df[[1], :] and df[1, :]).
I am OK with DataFrameRow as a second option as it is only mildly breaking and users are aware of this type in DataFrames.jl ecosystem (although you will be able to get it via view(df, 1, cols) so we would have a duplication of functionality). Actually we could make DataFrameRow be a wrapper around a one-row DataFrame like this df[1,:] would be view(df[[1], :], 1, :) to make it not-influence df on mutation but I am not sure if it would give any advantage over simply returning a DataFrame;
Returning NamedTuple can break existing code in unexpected places and introduces a new type for a user to think about (and already user has to handle DataFrame, SubDataFrame and DataFrameRow which is a lot).
Finally NamedArray creates a significant dependency.

Case `@view df[row, cols]`

I understand you support DataFrameRow here - right? I am OK with this choice (my only objection is that we have one more type for the user to learn that is why I put it on the table).

bkamins on 26 Sep 2018

Good summary. Though note that returning a DataFrameRow is as breaking as returning a NamedTuple, since mutating the DataFrameRow will now mutate the parent, while currently it doesn't affect it. So code which mutates the result of df[1,:] would change its behavior significantly.

Case @view df[row, cols]
I understand you support DataFrameRow here - right? I am OK with this choice (my only objection is that we have one more type for the user to learn that is why I put it on the table).

Right. I'm also open to investigating merging DataFrameRow and SubDataFrame, but maybe better make this a separate discussion since it shouldn't really change semantics (just simplify code and reduce the number of exported types).

nalimilan on 26 Sep 2018

DataFrameRow - I agree this is breaking that is why I considered this view(df[[1], :], 1, :) pattern to avoid mutation of source. Anyway - given what was written what is your opinion? As said - I am leaning towards DataFrame as I have said above because it is non-breaking (and we can always change it in the future if we have more evidence what would be best - but if you feel you have a strong case against it I am open).

Regarding merging DataFrameRow and SubDataFrame - agreed to postpone this discussion for another PR.

bkamins on 26 Sep 2018

I'd return either a DataFrameRow or a NamedTuple. Returning a DataFrame sounds like a bad solution since it's inefficient and one can always get the same result with df[1:1, :] or df[[1], :]. It's also inconsistent with Matrix indexing and with df[:, 1], which both return an object of a different type than the parent. Also I prefer breaking things earlier rather than later.

nalimilan on 26 Sep 2018

OK - you have convinced me for NamedTuple because:

it is an in-built concept, so we are not requiring user to learn something new
it can play nicely when passed to functions (in particular even if it is passed as a whole we get type stability for free inside the function)
we give a way to get other objects if user wants it If the drawbacks, i.e. compilation time or immutability, are significant to the user she can always choose to create what is best.

Let us wait if we get any comments and then I will update the PR accordingly.

bkamins on 26 Sep 2018

👍1

OK - moving next step further. If df[1, cols] returns a NamedTuple then what should copy(::DataFrameRow) return?

bkamins on 27 Sep 2018

Oh man. The difference might come down to the fact that a DataFrameRow lets you know which row number it came from. Philosophically when you copy something, you are removing its connection to the underlying Dataframe, so shouldn't you not care if you lose that row number? Hence returning a NamedTuple. But having copy return a different type is not right...

pdeffebach on 27 Sep 2018

@nalimilan I assume you know why I am asking 😄 and DataFrameRow is a view (for the reference of others see #1536).
For me it is a valid question because it tests the consistency of the setup. And following the behavior in Base if we go the way we plan here we should return a NamedTuple as you have said which is a bit weird.

bkamins on 28 Sep 2018

note that you can't copy a SubDataFrame so we might not need to define this.

DataFrame(df[1,:]) == DataFrame(DataFrameRow(df, 1))?

pdeffebach on 28 Sep 2018

note that you can't copy a SubDataFrame so we might not need to define this.

See https://github.com/JuliaData/DataFrames.jl/pull/1536.

I think it's fine for copy to return a different type, that's what Base does for SubArray. What's more annoying is that NamedTuple is immutable, while one would often call copy to be able to mutate an object without affecting its parent. We could also return a DataFrameRow with a one-row parent. Anyway I'm really not sure that's a common pattern, so I also vote for leaving this undefined at least for now.

nalimilan on 28 Sep 2018

OK - so I will start updating #1534 following what was discussed here.

bkamins on 28 Sep 2018

One issue with NamedTuple is that you can't use a vector of symbols to Index like you can with a DataFrameRow.

Though I'm not sure the use case for df[1,:]. It's tough to know how much this will be a problem in practice.

julia> t = (a = 1, b = 2, c = 5)
julia> t[[:a, :b]]
julia> t[[:a, :b]]
ERROR: MethodError: no method matching getindex(::NamedTuple{(:a, :b, :c),Tuple{Int64,Int64,Int64}}, ::Array{Symbol,1})
Closest candidates are:
  getindex(::NamedTuple, ::Int64) at namedtuple.jl:101
  getindex(::NamedTuple, ::Symbol) at namedtuple.jl:102
Stacktrace:
 [1] top-level scope at none:0

pdeffebach on 28 Sep 2018

Can I request, if it is relavant here, that DataFrameRow should support "haskey", as now it doesn't?

LeoK987 on 28 Sep 2018

@LeoK987 - can you open a separate issue for this so that we do not lose track of this request and can discuss it?

bkamins on 28 Sep 2018

@pdeffebach I think it is worth a PR to Julia Base, as indexing by a vector works for Tuple so it would be consistent if it worked for NamedTuple. EDIT: https://github.com/JuliaLang/julia/issues/29417

(I will wait a bit with #1534 as it seems we still get some feedback)

bkamins on 28 Sep 2018

Closing as df[1, cols] returns a DataFrameRow at least for now.
DataFrameRow now supports haskey.

bkamins on 14 Dec 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

How do I get values from a data frame with Nullable?

gustafsson · 6Comments

split a df into parts

yakir12 · 6Comments

using nrow as named tuple when grouping

jangorecki · 7Comments

More intuitive functions

bbrunaud · 3Comments

Allow data frame and DataFrameRow to take part in broadcasting

bkamins · 8Comments

Dataframes.jl: Behavior of `df[1, cols]` and `DataFrameRow`

Most helpful comment

All 28 comments

Case of df[row, cols]

Case @view df[row, cols]

Related issues

Case of `df[row, cols]`

Case `@view df[row, cols]`