It's been noted several times that we don't have a consistent view on whether a data frame is a collection of rows or of columns. We should decide which one it is so that all exported functions and collection functions from Base we implement operate either on rows or on columns. Then functions which operate on the other dimension should mention it explicitly in their names, e.g. with the suffix rows or cols (as in IndexedTables).
See also: https://github.com/JuliaData/DataFrames.jl/issues/406, https://github.com/JuliaData/DataFrames.jl/issues/1200, https://github.com/JuliaData/DataFrames.jl/pull/1459, https://github.com/JuliaData/DataFrames.jl/issues/1513, https://github.com/JuliaData/DataFrames.jl/issues/1377.
Below is the list of all functions which are either row- or column-oriented. Those marked with * could be interpreted/useful both as row-oriented or column-oriented, and are therefore the most problematic (i.e. the ones to make consistent).
Row-oriented functions:
append!*
filter/filter!*
head & tail*
sort/sort! & sortperm & issorted
unique & nonunique
completecases
dropmissing/dropmissing!
Column-oriented functions:
delete!*
insert! (inconsistent with append! and push!)*
merge!*
haskey & get (old Associative interface)*
length*
getindex (with single argument)*
allowmissing/allowmissing!
disallowmissing/disallowmissing!
categorical!
names & rename
eltypes
describe
Both row- and column-oriented functions:
empty!
Functions with explicit name:
deleterows! (compare with delete!)
permutecols!
nrow/ncol
hcat/vcat
colwise
eachrow/eachcol
Complex:
by/groupby -> mostly row-oriented
aggregate -> mostly column-oriented (akin to colwise)
Overall, it seems it would be easier and more natural to get rid of column-oriented functions marked with *, and consider that a data frame is a collection of rows. Many of the problematic functions are either inherited from the time when DataFrame implemented the Associative interface, which isn't very useful in practice. Others (like insert!) are in conflict with similar functions which operate on rows. I suggest we rename the ones which are deemed useful to add the cols suffix. Functions with rows in their names could drop that suffix. This means:
delete! would become deletecols! (and replaced with deleterows! after deprecation period), insert! would become insertcols! (PR https://github.com/JuliaData/DataFrames.jl/pull/1560)length would be deprecated in favor of ncol/size(df, 2) (https://github.com/JuliaData/DataFrames.jl/issues/1200, see also https://github.com/JuliaData/DataFrames.jl/issues/406; PR https://github.com/JuliaData/DataFrames.jl/pull/1224)merge! would be removed, as it's not that useful and it's often confused with join (due to the name used in R and other apps) (PR https://github.com/JuliaData/DataFrames.jl/pull/1560)haskey and get would be removed (https://github.com/JuliaData/DataFrames.jl/pull/1836)empty! could be changed to call similar(df, 0) (https://github.com/JuliaData/DataFrames.jl/pull/1843 for the deprecation phase)df[col] would still be allowed since it's more convenient than df[:, col] (https://github.com/JuliaData/DataFrames.jl/issues/1200)Some column-oriented functions like name and rename! could be added the cols suffix, but I'm not sure it's worth it. More changes could be considered starting from the premise that collection functions and iteration operate on rows, but we could also continue being explicit about dimensions in other places (like map).
The main/only issue with viewing data frames as collections of rows is that despite being natural, it goes against the underlying representation as a vector of columns. But that's not necessarily a problem in practice as long as we provide convenient ways of applying operations to columns.
I like all these changes. I think of in stata how if statements are both succinct and refer to row-wise subsetting.
I also like getting rid of merge!, as it is really just an implementation of a join under certain parameters. And I have gotten confused by delete! etc. before.
Cc: @piever @shashi with the goal of using consistent APIs in JuliaDB and DataFrames where that makes sense.
In StructArrays I decided for the following (pretty much in line with IndexedTables):
StructArray <: AbstractArray{<:NamedTuple}getproperty selects columnsAnd I've been pretty happy with it: getting a row is s[i] a column is s.b, an element is either s.b[i] or s[i].b.
Is it really necessary to keep the one argument getindex select columns given that there is getproperty? I'd definitely be very happy if a DataFrame could be thought of as a AbstractArray{<:NamedTuples} (does the weak typing play against this though? is there a way to have a row based getindex be efficient?).
I don't have a clear view of the whole situation but I wanted to share the following:
_vec, i.e. @transform, @where become@transform_vec, @where_vec: maybe I can change to @transformcols and @wherecols ). At some point we thought of having hmerge and vmerge in JuliaDB (in analogy with hcat, vcat). merge and mergecols definitely makes sense (merge is for rows at the momentfilter and getindex for columns, they all correspond to select in JuliaDB. There is a bunch of helpers for operations. For example, select(iris, 3), select(iris, 1:4), select(iris, Not(:SepalLength)), select(iris, t -> contains(string(t), "Sepal")). Filtering is automatic for Regexes, for example select(iris, r"^Sepal"). See docs here.A key difference however is that the inplace version doesn't make sense for IndexedTables. I wonder if simply a version of select! that replaces the columns of the DataFrame with the outcome of select would make sense.
col in the name in JuliaDB. This is unfortunate because they can manipulate many columns as once. See renamecol(iris, :SepalLength => :SL, :SepalWidth => :SW), see here. Could we unify on the cols version?Thanks for the comments.
Is it really necessary to keep the one argument
getindexselect columns given that there isgetproperty? I'd definitely be very happy if a DataFrame could be thought of as aAbstractArray{<:NamedTuples}(does the weak typing play against this though? is there a way to have a row basedgetindexbe efficient?).
One issue is that getproperty only works to extract a single column from a literal name. Things like df[cols] and df[[:a, :b, :c]] are quite commonly needed, and people coming from other languages will expect them to work. I guess it wouldn't be the end of the world if we required df[:, cols] for that, but that wouldn't be great (also it's supposed to make a copy for consistency with other df[i, j] operations).
You're also right that AFAICT the absence of parameterization on the column types makes it difficult to iterate over rows of a data frame efficiently.
We need a unified way to decide whether something is for rows or cols (in JuliaDBMeta I use
_vec, i.e.@transform,@wherebecome@transform_vec,@where_vec: maybe I can change to@transformcolsand@wherecols). At some point we thought of havinghmergeandvmergein JuliaDB (in analogy withhcat,vcat).mergeandmergecolsdefinitely makes sense (mergeis for rows at the moment
Interesting. That's indeed an issue I've thought about several times in DataFramesMeta. AFAIK people generally want row-wise operations, except for two particular cases: 1) reductions (like :x .- mean(:x)) and 2) laggged variables. And even in these cases, it would be be convenient to think in row-wise terms so that e.g. one doesn't need to use dotted operators. SQL and dplyr don't provide column-wise functions. So I wonder whether it wouldn't be possible to have only row-wise macros, with a special function or syntax to indicate that one refers to the whole column rather than to a single element (for example :x - mean($x) or :x - mean(col(:x))). Probably better discuss this elsewhere though.
Concerning
filterandgetindexfor columns, they all correspond toselectin JuliaDB. There is a bunch of helpers for operations. For example,select(iris, 3),select(iris, 1:4),select(iris, Not(:SepalLength)),select(iris, t -> contains(string(t), "Sepal")). Filtering is automatic for Regexes, for exampleselect(iris, r"^Sepal"). See docs here.
filter only operates on rows in DataFrames. We don't have convenience functions to select columns, but they would be useful.
Don't you find it a bit problematic that select picks columns, but that the name doesn't indicate that?
A key difference however is that the inplace version doesn't make sense for IndexedTables. I wonder if simply a version of
select!that replaces the columns of theDataFramewith the outcome ofselectwould make sense.
Yeah, why not. But that's really not a big issue, as it can be added later.
The functions to explicitly manipulate columns have
colin the name in JuliaDB. This is unfortunate because they can manipulate many columns as once. Seerenamecol(iris, :SepalLength => :SL, :SepalWidth => :SW), see here. Could we unify on thecolsversion?
Indeed, that's annoying. I'm not sure whether the plural or the singular is best, but least the naming should be consistent everywhere.
You're also right that AFAICT the absence of parameterization on the column types makes it difficult to iterate over rows of a data frame efficiently.
IIUC, map and filter row-based can be fast because they can use the Tables interface to have efficient row iteration, whereas direct row-based getindex is always slow? In this case this is probably a sufficient technical difference to justify a different default from IndexedTables and one might as well tell the user to do
r = Tables.rows(df)
r[i]
to have fast row iteration.
We need a unified way to decide whether something is for rows or cols (in JuliaDBMeta I use _vec, i.e.
@transform,@wherebecome@transform_vec,@where_vec: maybe I can change to@transformcolsand@wherecols). At some point we thought of having hmerge and vmerge in JuliaDB (in analogy with hcat, vcat). merge and mergecols definitely makes sense (merge is for rows at the momentThat's indeed an issue I've thought about several times in DataFramesMeta. AFAIK people generally want row-wise operations, except for two particular cases: 1) reductions (like
:x .- mean(:x)) and 2) laggged variables.
There are also some other cases (say sorting or things like that) but esp. in JuliaDB they are tricky anyway because they don't generalize to the distributed case as easily. Viceversa sometimes when one uses the vectorised version, it could probably be done with for example OnlineStats in a way that parallelize naturally and works in distributed scenarios. I'd be definitely happy to discuss this further though, I'll open an issue in JuliaDBMeta.
Don't you find it a bit problematic that
selectpicks columns, but that the name doesn't indicate that?
Don't you find it a bit problematic that select picks columns, but that the name doesn't indicate that?
select is a bit complex in JuliaDB, but I agree it's column oriented and is one of the few column oriented operation that doesn't make it explicit in the name. Not sure what would be a good name that emphasizes this.
Indeed, that's annoying. I'm not sure whether the plural or the singular is best, but least the naming should be consistent everywhere.
I think it should to be cols because:
sum(v, dims = ...)cols works pretty much everywhere whereas col is strange at times, say mergecolIf we change this then keys and values should be cleaned-up or removed and maybe rethinking what eachrow and eachcol do/return is in place.
If we change this then
keysandvaluesshould be cleaned-up or removed and maybe rethinking whateachrowandeachcoldo/return is in place.
keys and values are deprecated already, so that's OK. OTOH I'd like to fix the asymmetry between eachrow (which returns DataFrameRow objects) and eachcol (which returns (name, vector) tuples). We could perhaps deprecate them in favor of rows and columns (or keep eachcol if that behavior is really useful).
Once the Tables PR is merged, I imagine Tables.rows and Tables.columns could be exported and would provide a lot of this functionality.
I think the following is quite nice and will work with all table types:
for v in columns(df)
...
end
for (key, val) in pairs(columns(df))
...
end
for row in rows(df)
...
end
IIUC,
mapandfilterrow-based can be fast because they can use theTablesinterface to have efficient row iteration, whereas direct row-basedgetindexis always slow? In this case this is probably a sufficient technical difference to justify a different default from IndexedTables and one might as well tell the user to dor = Tables.rows(df) r[i]to have fast row iteration.
Well, yeah, constructing a type-stable iterator allows fast iteration over rows. But that only helps if there's a function barrier to allow for specialization.
Also it's not clear whether it's reasonable to specialize on the type of all columns when there might be hundreds of them. See https://github.com/JuliaData/DataFrames.jl/issues/1335. That's why macros like DataFramesMeta/JuliaDBMeta seem promising to me: they allow specializing only on the relevant columns, which make it easier for the compiler.
I know they are deprecated, but now we could assign them meaning consistent with the interpretation that is chosen. Using rows and columns is fine with me.
Also eachcol is not really that useful, especially when we have columns (but maybe there is some legacy code that uses it so I guess it should stay).
Finally there are map methods in DataFrames.jl which confuse users. This also could be cleaned up.
Of course we do not have to do everything in one shot. The major decision is that we treat DataFrame a row-based object. The rest would be a consequence that can be fixed later.
Well, yeah, constructing a type-stable iterator allows fast iteration over rows. But that only helps if there's a function barrier to allow for specialization.
I've never worked with "very large" data, so I'm not familiar with this type of issues, but I wanted to mention that Tables is quite smart in its implementation in that the row is just a lazy wrapper of the table, so you don't actually materialize it and rows(df)[i].t is just another way of saying columns(df).t[i]. The other fields are untouched (JuliaDB instead would materialize the whole row unless specified otherwise, which is very bad in large tables). In this respect I think it works just as well as the macros in JuliaDBMetas that detect from the symbol which columns are used and only select those. But I agree that we need some care, compile times do get worrying in JuliaDB at times.
I've never worked with "very large" data, so I'm not familiar with this type of issues, but I wanted to mention that Tables is quite smart in its implementation in that the row is just a lazy wrapper of the table, so you don't actually materialize it and
rows(df)[i].tis just another way of sayingcolumns(df).t[i]. The other fields are untouched (JuliaDB instead would materialize the whole row unless specified otherwise, which is very bad in large tables). In this respect I think it works just as well as the macros in JuliaDBMetas that detect from the symbol which columns are used and only select those. But I agree that we need some care, compile times do get worrying in JuliaDB at times.
The issue isn't whether data is materialized or not, but just that AFAIK the compiler bails out beyond a certain number of columns (unless they are all of the same type, i.e. an NTuple). But I haven't investigated this deeply. Let's continue this discussion at #1335.
I know they are deprecated, but now we could assign them meaning consistent with the interpretation that is chosen. Using
rowsandcolumnsis fine with me.Also
eachcolis not really that useful, especially when we havecolumns(but maybe there is some legacy code that uses it so I guess it should stay).
+1
Finally there are
mapmethods in DataFrames.jl which confuse users. This also could be cleaned up.
Which ones? There was a very old map method for SubDataFrame which I deprecate in my grouping PR, but I don't see methods for DataFrame.
I'm generally in favor of treating tables as collections of rows at the API level.
I did not mean DataFrame but other types in DataFrames.jl. Those are map methods in DataFrames.jl as of today:
map(f::Function, sdf::SubDataFrame)
map(f::Function, gd::GroupedDataFrame)
map(f::Function, ga::GroupApplied)
map(f::Function, dfri::DataFrames.DFRowIterator)
map(f::Function, dfci::DataFrames.DFColumnIterator)
and I think a review of them for consistency with those changes would be good.
In general my thinking is to simplify DataFrames.jl API and minimize it as much as possible, as some things that were needed when this package was created probably could be removed now given the whole data wrangling ecosystem we have.
Apart from the SubDataFrame one and the GroupApplied one which I propose removing in https://github.com/JuliaData/DataFrames.jl/pull/1520, the others look OK to me since they are completely explicit. map(f, columns(df) and map(f, rows(df)) look really clean.
Let's discuss the GroupedDataFrame method at https://github.com/JuliaData/DataFrames.jl/pull/1520.
I like it — the only thing is that maybe vars/variables should be used instead of the col/columns suffix.
Seems going for collection of rows is a choice that dominates.
So the only leftover we would have is df[col] syntax that is inconsistent with this approach. A slight disturbance is that we would not be able to define eachindex and enumerate.
Finally there is a question if we want DataFrame to be broadcastable over rows.
A slight disturbance is that we would not be able to define
eachindexandenumerate.
Yes but we could have eachindex(columns(df)) and enumerate(columns(df)).
Finally there is a question if we want
DataFrameto be broadcastable over rows.
Good question. There's no hurry to support it, but that would indeed be consistent. That's actually one of the rare cases where operating over columns by default would probably be more useful, but we can provide other ways to do that like colwise or f.(columns(df)).
Actually if we go for row-based approach haskey could be renamed to hascolumn.
https://github.com/JuliaData/DataFrames.jl/pull/1560 gets rid of the most problematic methods. We can discuss what to do about others in separate PRs.
I just wanted to add my support for the way this is evolving! :heart:
I love the "relation is a collection of rows" style of interface. If the xxxcols(df) suffix is tiresome then xxx(columns(df)) is pretty clear - columns(df) is a kinda interesting and useful container in it's own right.
Actually if we go for row-based approachÂ
haskey could be renamed toÂhascolumn.
I suppose if we support df.column syntax then hasproperty is natural (you don't necessarily need to ditch column-based getindex just because you support column-based properties). Pity that didn't make it to Base for Julia 1.0 though.
I love the "relation is a collection of rows" style of interface. If the
xxxcols(df)suffix is tiresome thenxxx(columns(df))is pretty clear -columns(df)is a kinda interesting and useful container in it's own right.
Actually with https://github.com/JuliaData/DataFrames.jl/pull/1590 we're taking the stance that mapcols(f, df) is different from map(f, columns(df)), since the former returns a data frame and the latter a vector. So the xxxcols functions couldn't be replaced with xxx(columns(df)).
To my understanding we introduced mapcols solely to ensure that we give this different result.
Ok, yeah that makes sense.
Everything seems to be complete here. The general rule we follow is that generic collection functions that operate on elements (sort, push!...) operate over rows (when they are defined). Functions that operate over columns either end with the "cols" suffix, are table-specific operations (e.g. select), or operate at the vector level rather than at the element level (describe, allowmissing...).
names/rename! are partial exceptions to this rule. FWIW, JuliaDB uses renamecol, but dplyr and Pandas use rename.
FWIW, JuliaDB uses
renamecol, but dplyr and Pandas userename.
Actually JuliaDB has recently deprecated renamecol in favor of rename, so the APIs are consistent here.
Should we then define rename in DataAPI.jl?
CC @quinnj
I haven't been following the discussion for quite some time - but what is the future of iterate(::DataFrame)?
It seems that first(::DataFrame) returns the first row... and first is normally defined in terms of the iteration API.
We have filter(f, ::DataFrame) working as if it iterates rows. But there is no map or reduce that does the same.
I guess the collection of implemented functions seems partially complete to me... @nalimilan would you welcome contributions here?
We haven't really decided whether iteration (and map) should be supported. For now we wanted to stabilize the current API for 1.0. It would be consistent to define these over rows, but that may also be confusing (notably because of the difference with broadcasting), so this needs to be discussed in depth after 1.0.
Oh I see, broadcasting works over each cell. Thanks
I'd be up for supporting something like Tables.rename!(tbl, :oldcol, :newcol) in Tables.jl, and any Tables.jl source that supported column renaming could use it.
The way I think about it is that rename! is in place so probably not all source will support it. rename creates a new table so this should be OK for any table.
But do you mean then to define the bodies of rename (or rename!) or just define a function without implementing the behavior (e.g. DataFrame will most likely have a custom rename! and rename for performance reasons).
Yeah, I meant we could just have function rename! end in Tables.jl and individual packages could provide the implementation.
OK - so I am closing this issue and opened https://github.com/JuliaData/Tables.jl/pull/119.
Most helpful comment
I'm generally in favor of treating tables as collections of rows at the API level.