@nalimilan This is speculative, but I want to discuss it before we finish getindex cleanup.
DataFrameRow structure (there were some discussion of returning NamedTuple from eachrow but it would not allow for mutation);getindex[1, :] returns a DataFrame but maybe it should return a DataFrameRow? Also then maybe DataFrameRow could accept only a selection of columns to make getindex[1, columns] work consistently and return DataFrameRow (this would be breaking and I am not 100% convinced we want it)?DataFrame as DataFrameRow directly (something like getrow(df, 1))?That definitely makes sense, especially if we decide that iterating over a data frame gives rows (and therefore DataFrameRow objects). That would also be much more efficient than creating a new DataFrame. Overall I wouldn't expect this to break a lot of code, but who knows. Anyway we should go through a deprecation period first, and we can always go back if it turns out to be a bad idea.
Before I move on can you clarify me the following design issues (those happened before I started contributing):
DataFrameRow is not a subtype of AbstractDataFrame?SubDataFrame does not allow arbitrary AbstractDataFrame as a parent?DataFrameRow cannot be removed and replaced by SubDataFrame (assuming that the questions above do not have a strong reason to unify them - I guess all the mechanics where DataFrameRow is used in DataFrames.jl ecosystem could be rewritten for SubDataFrame and the mechanics allowing to iterate over DataFrameRow could be removed).Probably you see what it is getting at: remove DataFrameRow in favor of SubDataFrame. Then df[1, :] would return a DataFrame still but @view df[1,:] would get a view of a row - and the same would be returned by eachrow iterator.
We could even retain DataFrameRow internally in DataFrames.jl if it is really needed but do not expose it to the user but instead expose publicly only SubDataFrame, e.g. in eachrow.
* why `DataFrameRow` is not a subtype of `AbstractDataFrame`?
AFAIK it doesn't support the full interface. For example row[1, :x] doesn't work. Also iterating over a row should yield values like for NamedTuple, but we don't want that for data frames.
why
SubDataFramedoes not allow arbitraryAbstractDataFrameas a parent?
I guess that's just an oversight (we currently don't have other AbstractDataFrame types which can usefully be combine with SubDataFrame).
* why `DataFrameRow` cannot be removed and replaced by `SubDataFrame` (assuming that the questions above do not have a strong reason to unify them - I guess all the mechanics where `DataFrameRow` is used in DataFrames.jl ecosystem could be rewritten for `SubDataFrame` and the mechanics allowing to iterate over `DataFrameRow` could be removed).
I think it makes sense to differentiate a row from a subset of rows, if only because of the iteration difference I noted above. Having a different type can also allow for more efficient code, just like in by I was able to write faster methods for when a NamedTuple is returned rather than a DataFrame. There's a lot of hashing and comparison code for DataFrameRow which is used in places where performance is critical. Of course we could probably rewrite this for SubDataFrames where the indices are a scalar, but that wouldn't really simplify the code.
Thanks. This is what I see in the code. That is why I have said:
We could even retain
DataFrameRowinternally in DataFrames.jl if it is really needed but do not expose it to the user but instead expose publicly onlySubDataFrame, e.g. ineachrow.
i.e. we internally use DataFrameRow but do not expose it to the users.
What is more SubDataFrame knows about the type of row-selector so SubDataFrame{Int} would have the same performance as DataFrameRow if we allow to create a view of a single ~column~ row (in fact apart from the fact that currently they have different methods implemented they would be identical).
What is more
SubDataFrameknows about the type of row-selector soSubDataFrame{Int}would have the same performance asDataFrameRowif we allow to create a view of a single column row (in fact apart from the fact that currently they have different methods implemented they would be identical).
Yes, but it would be weird to have iteration behave differently depending just on a type parameter, wouldn't it?
OK - let us leave it as is (but we have to be clear, at some point, in the documentation why we differentiate between DataFrameRow and SubDataFrame and what is the benefit of it). And actually if there is a benefit we should have either:
df[1, cols] return a DataFrameRowdf[[1], cols] return a DataFrame@view df[1, cols] a DataFrameRow (the same as above)@view df[[1], cols] a SubDataFrameor:
df[1, cols] return a DataFramedf[[1], cols] return a DataFrame@view df[1, cols] a SubDataFrame@view df[[1], cols] a SubDataFramerow(df, 1) return a DataFrameRowwhich one would you prefer?
I think I'd prefer version 1, but as I said if it turns out its inconvenient we could change our strategy.
I like option 1 because #1449 gets us on the road to doing
mean(df[2, :])
Returning a DataFrame loses that progress.
I really like this df[[1], cols] returning AbstractDataFrame objects, and clarifying between vector and scalar sub-setting.
Another option similar to 1 would be to have df[1, :] return a NamedTuple but view(df, 1) return a DataFrameRow. That would mirror more closely what happens with Matrix, i.e. the result of getindex lives independently from its parent, and it cannot be used to mutate the parent. That would have the advantage of consistency with other methods. The drawback would be 1) that mutation requires using view, which can be harder to find (but we also provide eachrow, which is more convenient anyway), and 2) that generating a NamedTuple could be slow if there are many columns and you only use a few of them (especially since the type instability of data frame doesn't allow benefiting from specialization).
Something to also take into account is that it would be easier to start with NamedTuple and move later to DataFrameRow if needed, since the former's immutability makes it less likely that code will break than if we make the change in the opposite direction.
It would be nice for df[1,:] to return the same thing as first(eachrow(df)). Because I anticipate users using
[f(x) for x in eachrow(df)]
and
[f(df[i, :] for i in nrow(df)]
interchangeably.
@nalimilan I do not feel NamedTuple is a good approach as it is not mutable which breaks basic property of DataFrame.
However, I do not feel that eachrow(df) and df[i, :] have to be the same - as eachrow returns a view (which can be clearly explained to the users). Actually the argument that df[i, :] should return a copy is convincing.
Here is a summary what I think we should do:
df[col] return an original vectordf[cols] return a DataFrame of original vectors@view df[col] return an original vector@view df[cols] return a DataFrame of original vectors
df[row, col] return a value
df[rows, col] return a copy vector@view df[row, col] a value@view df[rows, col] a view
df[row, cols] return a DataFrame of copy vectors (this is disputable as it could be DataFrameRow or NamedTuple - here I guess we should choose what is most convenient as all options have their pros and cons)
df[rows, cols] return a DataFrame of copy vectors@view df[row, cols] a DataFrameRow (this is disputable as it could be a SubDataFrame to simplify the interface as I have argued above but I guess we dropped it)@view df[rows, cols] a SubDataFrameWhen we settle on this I will update #1534.
I agree with all of these except (as expected) the two ones you highlighted:
* `df[row, cols]` return a `DataFrame` of copy vectors (_this is disputable as it could be `DataFrameRow` or `NamedTuple` - here I guess we should choose what is most convenient as all options have their pros and cons_)
Returning a vector doesn't sound appropriate, as we would completely lose the column names which are an essential property defining a row. Either a DataFrameRow or a NamedTuple would be better. As I said I'm slightly leaning towards NamedTuple as it's always easier to be more restrictive first and make things less restrictive later if needed. I don't really get your position about this, since you said mutating is an essential property but then that returning a copy is convincing, two properties which seem incompatible to me.
I agree that eachrow(df) and df[i, :] can diverge, in particular because the former could return a type-specialized iterator which could allow for faster code (with a function barrier, or with compiler improvements), while the latter is doomed to be slow.
* `@view df[row, cols]` a `DataFrameRow` (_this is disputable as it could be a `SubDataFrame` to simplify the interface as I have argued above but I guess we dropped it_)
AFAICT this is not possible unless we find a solution for iteration, right?
I will lay out below what I think to let us consciously decide as this is a delicate thing and I guess we all want DataFrames 1.0 release to be stable in this area.
df[row, cols]mutating is essential
I mean that the result of df[row, cols] should be mutable, because df is mutable
returning a copy is convincing
Here I understand that we agree, i.e. it would be better that the result of df[row, cols] when mutated would not influence df.
Actually I think we should voice a third requirement as this is what I understand you want:
the result of df[row, cols] should drop from 2-D like object (DataFrame) to 1-D like object.
Now let me summarize the options we have discussed (I add NamedArray because you mentioned an option to return a vector):
Return type | Mutable | Copying | 1D | Breaking change |
-|-|-|-|-|
DataFrame | yes | yes | no | no (current behavior) |
DataFrameRow | yes | no | yes | mildly |
NamedTuple | no | yes | yes | yes (code that currently works might break, because it is immutable) |
NamedArray | yes | yes | yes | yes (introduces a new dependency) |
and every option has some disadvantage.
DataFrame because it is not-breaking (this is what we currently have - although there is a duplication of functionality between df[[1], :] and df[1, :]).DataFrameRow as a second option as it is only mildly breaking and users are aware of this type in DataFrames.jl ecosystem (although you will be able to get it via view(df, 1, cols) so we would have a duplication of functionality). Actually we could make DataFrameRow be a wrapper around a one-row DataFrame like this df[1,:] would be view(df[[1], :], 1, :) to make it not-influence df on mutation but I am not sure if it would give any advantage over simply returning a DataFrame;NamedTuple can break existing code in unexpected places and introduces a new type for a user to think about (and already user has to handle DataFrame, SubDataFrame and DataFrameRow which is a lot).NamedArray creates a significant dependency.@view df[row, cols]I understand you support DataFrameRow here - right? I am OK with this choice (my only objection is that we have one more type for the user to learn that is why I put it on the table).
Good summary. Though note that returning a DataFrameRow is as breaking as returning a NamedTuple, since mutating the DataFrameRow will now mutate the parent, while currently it doesn't affect it. So code which mutates the result of df[1,:] would change its behavior significantly.
Case
@view df[row, cols]
I understand you supportDataFrameRowhere - right? I am OK with this choice (my only objection is that we have one more type for the user to learn that is why I put it on the table).
Right. I'm also open to investigating merging DataFrameRow and SubDataFrame, but maybe better make this a separate discussion since it shouldn't really change semantics (just simplify code and reduce the number of exported types).
DataFrameRow - I agree this is breaking that is why I considered this view(df[[1], :], 1, :) pattern to avoid mutation of source. Anyway - given what was written what is your opinion? As said - I am leaning towards DataFrame as I have said above because it is non-breaking (and we can always change it in the future if we have more evidence what would be best - but if you feel you have a strong case against it I am open).
Regarding merging DataFrameRow and SubDataFrame - agreed to postpone this discussion for another PR.
I'd return either a DataFrameRow or a NamedTuple. Returning a DataFrame sounds like a bad solution since it's inefficient and one can always get the same result with df[1:1, :] or df[[1], :]. It's also inconsistent with Matrix indexing and with df[:, 1], which both return an object of a different type than the parent. Also I prefer breaking things earlier rather than later.
OK - you have convinced me for NamedTuple because:
Let us wait if we get any comments and then I will update the PR accordingly.
OK - moving next step further. If df[1, cols] returns a NamedTuple then what should copy(::DataFrameRow) return?
Oh man. The difference might come down to the fact that a DataFrameRow lets you know which row number it came from. Philosophically when you copy something, you are removing its connection to the underlying Dataframe, so shouldn't you not care if you lose that row number? Hence returning a NamedTuple. But having copy return a different type is not right...
@nalimilan I assume you know why I am asking 馃槃 and DataFrameRow is a view (for the reference of others see #1536).
For me it is a valid question because it tests the consistency of the setup. And following the behavior in Base if we go the way we plan here we should return a NamedTuple as you have said which is a bit weird.
note that you can't copy a SubDataFrame so we might not need to define this.
DataFrame(df[1,:]) == DataFrame(DataFrameRow(df, 1))?
note that you can't copy a
SubDataFrameso we might not need to define this.
See https://github.com/JuliaData/DataFrames.jl/pull/1536.
I think it's fine for copy to return a different type, that's what Base does for SubArray. What's more annoying is that NamedTuple is immutable, while one would often call copy to be able to mutate an object without affecting its parent. We could also return a DataFrameRow with a one-row parent. Anyway I'm really not sure that's a common pattern, so I also vote for leaving this undefined at least for now.
OK - so I will start updating #1534 following what was discussed here.
One issue with NamedTuple is that you can't use a vector of symbols to Index like you can with a DataFrameRow.
Though I'm not sure the use case for df[1,:]. It's tough to know how much this will be a problem in practice.
julia> t = (a = 1, b = 2, c = 5)
julia> t[[:a, :b]]
julia> t[[:a, :b]]
ERROR: MethodError: no method matching getindex(::NamedTuple{(:a, :b, :c),Tuple{Int64,Int64,Int64}}, ::Array{Symbol,1})
Closest candidates are:
getindex(::NamedTuple, ::Int64) at namedtuple.jl:101
getindex(::NamedTuple, ::Symbol) at namedtuple.jl:102
Stacktrace:
[1] top-level scope at none:0
Can I request, if it is relavant here, that DataFrameRow should support "haskey", as now it doesn't?
@LeoK987 - can you open a separate issue for this so that we do not lose track of this request and can discuss it?
@pdeffebach I think it is worth a PR to Julia Base, as indexing by a vector works for Tuple so it would be consistent if it worked for NamedTuple. EDIT: https://github.com/JuliaLang/julia/issues/29417
(I will wait a bit with #1534 as it seems we still get some feedback)
Closing as df[1, cols] returns a DataFrameRow at least for now.
DataFrameRow now supports haskey.
Most helpful comment
I like option 1 because #1449 gets us on the road to doing
Returning a DataFrame loses that progress.
I really like this
df[[1], cols]returning AbstractDataFrame objects, and clarifying between vector and scalar sub-setting.