I have also even a question (maybe it was discussed somewhere, but I think the decision could influence the design we are discussing in #1335 so I raise it now):
AbstractDataFrame is not a subtype of AbstractMatrix{Any}?DataFrameRow is not a subtype of AbstractDataFrame?In this way many functions that work on AbstractMatrix would work on DataFrames for free.
Additionally if some broadcast related features were implemented like in NamedArrays the result of such operations could remain a DataFrame if that were sensible.
I do not see any negative side effects (but I might be missing something).
The benefit is that now when the user knows that some columns in DataFrame are homogeneous then Array(df[rows, cols]) conversion has to be run to be able to perform the desired operations.
Of course in performance sensitive code it will still be required to do so, because conversion will infer the type of the [rows, cols] section of the DataFrame, but in many cases user wants a simple transformation and performance is not an issue (see eg. https://stackoverflow.com/questions/48037732/how-to-save-julia-for-loop-returns-in-an-array-or-dataframe).
I don't know the original motivations, but I can think of a few reasons why AbstractDataFrame shouldn't necessarily subtype AbstractMatrix:
df1'df2 doesn't really make sense.DataFrame isn't necessarily a 2-dimensional object; one could argue that each column represents another axis, which would make it an _N_-dimensional array rather than a matrix. (Indeed, the original name for JuliaDB was NDSparse.)As for why DataFrameRow isn't an AbstractDataFrame. However, should #1335 come to fruition, I think we should be able to do away with DataFrameRow entirely and represent rows as NamedTuples.
As for AbstractDataFrame <: AbstractMatrix{Any} I see those points, but e.g. transposition of data frame is a supported operation in other frameworks. Also I do not see a strong reason not to allow multiplication (R defines broadcasted addition and multiplication on data frames). If the operation does make sense it will fail anyway.
I am not very strongly for this change, but I want to put it under the discussion as this is something that e.g. R users would find convenient (of course it is not a definitive reason to allow it - we do not have to replicate R in DataFrames).
In general it boils down to what you have pointed out - do we treat DataFrame as a heterogeneous matrix or sparse N-dimensional array.
As for DataFrameRow - such a NamedTuple would have to consist of 1-element views into columns of an original DataFrame (this is needed as contract of DataFrameRow is that it is a view into a row of the original DataFrame allowing to mutate its contents) and thus would be expensive to create I guess.
And if I understand the original intention behind DataFrameRow is that it were to be maximally lightweight data structure (containing only a reference to original DataFrame and row number).
I think this has been discussed a lot before and the decision to make Julia data frames completely distinct from matrices has been a very conscious one (you should be able to find old issues about it). It's indeed surprising for R users, but in general R data frames are quite awkward to work with, and redesigns like dplyr adopt a completely different approach closer to databases. Anyway it's not a good idea to provide people with an apparently convenient syntax which would be very slow due to type instabilities: that would just be a trap.
Regarding DataFrameRow, I also think that it should be replaced with NamedTuple. Since NamedTuple is immutable, it should be completely free to create (no actual allocation), so that in practice creating a NamedTuple, modifying it and replacing the contents of a row with it should be equivalent to doing the same operation with a DataFrameRow (at least with a type-stable data frame).
Thanks for the explanation. Regarding DataFrameRow - do I understand you correctly that such code would become invalid after the change:
x = DataFrame(a=1:10, b=1:10)
for row in eachrow(x)
row[:a] += row[:b]
end
and now x[:a] is doubled?
That's a very good point. You're correct, if eachrow iterated NamedTuples, then row would be immutable and row[:a] += row[:b] would be an error. That's rather unfortunate...
Ah, right. Maybe we need eachrow to keep returning a DataFrameRow then. It's an interesting design case, and there doesn't seem to be many situations in which one may need to mutate the elements obtained by iterating over a collection. That's the case for matrices, where it makes sense to return a row or column SubArray rather than a copy (see proposals at https://github.com/JuliaLang/julia/issues/14491), but that's also for performance reasons to avoid allocations (which doesn't apply to tuples).
https://github.com/JuliaLang/julia/pull/21912 is related, since it would provide a convenient syntax to create a new NamedTuple as if you had mutated the original one. Then of course the tuple would have to be assigned back to the DataFrame, which is annoying. We would have to provide another syntax for that, such as:
byrow!(df) do row
row@a += row[:b]
row
end
But clearly that's not ideal. We could also add this byrow! function and make it use DataFrameRow objects, but change eachrow to use NamedTuple. I guess that depends on how strongly we want to push the view of data frames as collections of NamedTuple. There's also the solution of providing multiple iterator functions, e.g. eachrow and rows with different behaviors.
I am OK to change DataFrameRow to NamedTuple - just want to make sure that we understand the consequences.
The reasons I am mostly indifferent is that in general I do not think that eachrow is very useful - as of today it minimally saves typing and you have to remember that you work on a different type than AbstractDataFrame.
The difference is between:
for row in eachrow(df)
do_something_with(row[:column_name])
end
and
for row in 1:nrow(df)
do_something_with(df[row, :column_name])
end
Additionally working on DataFrameRow is slow, but it will be equally slow with NamedTuple implementation in most cases (in order to get significant performance benefits this NamedTuple would have to be passed to some internal barrier function as eltype(eachrow(::DataFrame)) will not be inferred by the compiler anyway).
Yes, it looks like we should provide functions taking an anonymous function which would be applied to each row, as that's the only way to get type stability. That's why I recently added filter (#1330). Maybe we should implement map and map!, operating row by row.
DataFrameRow could be useful if we provide a type like TypedDataFrame and advise using it for functions using explicit loops, but it's not clear yet that we want to do that, and anyway for many situations people won't write functions, and will run code in the global scope.
I understand that TypeDataFrame would render CompositeDataFrame from DataFramesMeta obsolete - yes?
I suspect it would, but I haven't looked at it in detail yet. It would be interesting to see how it's implemented.
One could also think about a type
julia
struct RowView{T} where {T<:TypedDataFrame}
df::T
row::Int
end
that overrides getproperty and setproperty. That should enable things like
julia
for r in rowviews(df)
println(r.colB)
r.colA = 34
end
Yes, that's essentially DataFrameRow but for a TypedDataFrame.
Yes, the only issue is that with current proposal in #1335 df would be normally a DataFrame not a TypedDataFrame, which means that rowviews(df) would not have type stable return value when df::DataFrame.
I am closing it as now it is more or less settled that we keep DataFrameRow and it is distinct than DataFrame.
Most helpful comment
Yes, that's essentially
DataFrameRowbut for aTypedDataFrame.