Every once in a while I need to access columns of a DataFrame as a vector of vectors. This is exactly what df.columns is, but of course we should not expose it. On the other hand eachcol is not very user friendly now.
What we could do:
- add a Vector(df::DataFrame) = copy(df.columns) conversion;
- change eachcol to be more usable.
Actually I prefer the first option as we already have Matrix conversion. Any thoughts?
I think we should add an eachcol operation that doesn't return a (name, col) tuple but just iterates through the columns, so it is easier to use like a vector of vectors.
So the todo would be:
- update eachcol to support the new iteration protocol
- add a keyword argument usenames to eachcol that is true by default; if it is false then do not pass names in the iteration
- make DFColumnIterator have a second parameter that is true or false and keeps information which style we want (so that we can dispatch on it)
- define map in such a way that if usenames is false a standard map is used (not the custom one that is present currently)
If we are OK with this plan I can implement it.
Would changing the output of the iterator based on a keyword argument lead to type instability? If it is true, it returns a tuple, if not, it returns a vector?
The idea is the following:
struct DFColumnIterator{U, T <: AbstractDataFrame}
    df::T
end
eachcol(df::T; usenames::Bool=true) where {T <: AbstractDataFrame} = DFColumnIterator{usenames, T}(df)
and then in any function you specify:
somefunction(itr::DFColumnIterator{true}) = ...
or
somefunction(itr::DFColumnIterator{false}) = ...
And it will be type stable AFAIK.
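To make the dispatch idea concrete, here is a minimal sketch that runs in plain Julia without DataFrames. The names ToyFrame, ColIter, toycols and describe are hypothetical stand-ins for DataFrame, DFColumnIterator, eachcol and somefunction; this is an illustration under those assumptions, not the package's implementation.

```julia
# Hypothetical stand-in for a data frame: names plus a vector of column vectors.
struct ToyFrame
    names::Vector{Symbol}
    columns::Vector{Vector{Float64}}
end

# U is a Bool type parameter that records whether names are yielded.
struct ColIter{U, T}
    df::T
end

# Hypothetical analogue of eachcol with the usenames keyword.
toycols(df::T; usenames::Bool=true) where {T} = ColIter{usenames, T}(df)

Base.length(itr::ColIter) = length(itr.df.columns)

# usenames=true: yield (name, column) tuples.
function Base.iterate(itr::ColIter{true}, i::Int=1)
    i > length(itr) && return nothing
    return ((itr.df.names[i], itr.df.columns[i]), i + 1)
end

# usenames=false: yield bare columns.
function Base.iterate(itr::ColIter{false}, i::Int=1)
    i > length(itr) && return nothing
    return (itr.df.columns[i], i + 1)
end

# Any other function can dispatch on the flag in the same way.
describe(::ColIter{true}) = "pairs"
describe(::ColIter{false}) = "columns"

tf = ToyFrame([:a, :b], [[1.0, 2.0], [3.0, 4.0]])
named = collect(toycols(tf))                  # elements are (name, column) tuples
bare = collect(toycols(tf; usenames=false))   # elements are the columns themselves
```

Each iterate method returns one kind of element, so code that receives the iterator can be specialized on it. The keyword itself is still a runtime value at the call to toycols, so whether the compiler knows the concrete iterator type at that call site likely depends on constant propagation.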
I think that is a good idea. Maybe names instead of usenames. It kind of makes sense to define another iterator, but I don't know what the name would be.
Ah - I see what you mean - we could use Val if we find that the compiler complains. But this should be a small union and those should be handled efficiently. (EDIT: this refers to your earlier question)
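For reference, a minimal sketch of what the Val fallback could look like. The function colstyle is a hypothetical name used only for illustration; nothing like it is proposed above.

```julia
# Lift the runtime Bool into the type domain once, then dispatch on Val.
colstyle(usenames::Bool) = colstyle(Val(usenames))
colstyle(::Val{true}) = :pairs      # would yield (name, column) tuples
colstyle(::Val{false}) = :columns   # would yield bare columns

style_default = colstyle(true)
style_bare = colstyle(false)
```

Callers that already know the flag at compile time can pass Val(true) or Val(false) directly and skip the Bool method.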
It might be hard to make this as fast as df[i] for i in 1:ncol(df)
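using BenchmarkTools, DataFrames, Statistics  # Statistics provides mean
df = DataFrame(rand(100,100));
# index columns directly
function newarray(df::DataFrame, f::Function)
    [f(df[i]) for i in 1:ncol(df)]
end
# go through eachcol, which currently yields (name, col) tuples
function newarrayiter(df::DataFrame, f::Function)
    [f(col) for (name, col) in eachcol(df)]
end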
julia> @btime newarray($df, mean);
  7.931 μs (104 allocations: 2.52 KiB)
julia> @btime newarrayiter($df, mean);
  30.791 μs (203 allocations: 5.59 KiB)
Perhaps this performance difference will go away if we stop returning name.
In my tests this is actually faster than [f(df[i]) for i in 1:ncol(df)]:
struct Test{T <: AbstractDataFrame}
    df::T
end
ec(df::AbstractDataFrame) = Test(df)
Base.length(itr::Test) = ncol(itr.df)
function Base.iterate(itr::Test, state::Int=0)
    state += 1
    ncol(itr.df) < state && return nothing
    (itr.df[state], state)
end
Cool. Let's add the functionality then, and if we decide we need it to be a separate iterator rather than a parametric type, we can always do that.
OK - @nalimilan - do you have any comments before the PR?
This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?
Yes - this is the issue, but fortunately it is only noticeable if the work done on a column is small; if the work is large enough, it is usually delegated to a function that acts as a function barrier.
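A minimal sketch of the function-barrier effect in plain Julia. The columns are held in a Vector{Any} as a stand-in for untyped column storage, and colsum is a hypothetical kernel name.

```julia
# Columns held with an abstract element type, so the compiler cannot
# infer the type of an individual column inside the outer loop.
cols = Any[[1.0, 2.0, 3.0], [1, 2, 3]]

# Kernel: compiled separately for each concrete column type it receives.
function colsum(col::AbstractVector)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

# Each call to colsum is dispatched at runtime, but the loop inside it
# runs on a concretely typed vector.
sums = [colsum(col) for col in cols]
```

The dynamic dispatch costs roughly the same per column regardless of column length, which is why it matters mostly when the per-column work is small.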
True. I've been benchmarking some large aggregate operations recently and I think that inference is a barrier to high performance for this. aggregate iterates, so maybe a decent amount of bottlenecks could be gotten rid of with just a typed iterator without the need for a NamedTuple implementation. But I don't know if typed iterators are a thing.
> This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?
Sorry, I recognize this is a touch off topic, but I am not sure where it's best to bring it up: it seems like a lack of type stability of columns underlies many of the performance issues for DataFrames (if I understand what's going on). Is moving to type-stable columns a subject of discussion? If so, where?
I would say #1335 and #1256 have this discussion. Then there is nl/typed which has some code for an implementation.
And #744 - an old issue that is still open.
thx
@bkamins Do you think something still needs to be done here?
It can be closed given our current implementation (there is still a deprecation period, to be finished by https://github.com/JuliaData/DataFrames.jl/pull/1613, but we will not lose track of it). The type stability issue is important, but I would discuss it in the other threads.