DataFrames.jl: Adding Vector conversion for a DataFrame

Created on 24 Jul 2018 · 19 comments · Source: JuliaData/DataFrames.jl

Every once in a while I need to access the columns of a DataFrame as a vector of vectors. This is exactly what df.columns is, but of course we should not expose it. On the other hand, eachcol is currently not very user friendly.

What we could do:

1. add something like a Vector(df::DataFrame) = copy(df.columns) conversion;
2. redefine eachcol to be more usable.

Actually I prefer the first option as we already have Matrix conversion. Any thoughts?

bkamins

All 19 comments

I think we should add an eachcol operation that doesn't return a (name, col) tuple, rather just iterating through columns, so it is easier to use like a vector of vectors.

pdeffebach on 27 Jul 2018

So the todo would be:

update current eachcol to support new iteration protocol
add kwarg usenames to eachcol that is true by default; if it is false then do not pass names in the iteration
make DFColumnIterator have a second parameter that is true or false and keeps information which style we want (so that we can dispatch on it)
define map in such a way that if usenames is false a standard map is used (not the custom one that is present currently)

If we are OK with this plan I can implement it.

bkamins on 27 Jul 2018

Would changing the output of the iterator based on a keyword argument lead to type instability? If it is true, it returns a tuple, if not, it returns a vector?

pdeffebach on 27 Jul 2018

The idea is the following:

# U is a Bool flag: true yields (name, column) tuples, false yields bare columns
struct DFColumnIterator{U, T <: AbstractDataFrame}
    df::T
end
eachcol(df::T; usenames::Bool=true) where {T <: AbstractDataFrame} = DFColumnIterator{usenames, T}(df)

and then in any function you specify:

somefunction(itr::DFColumnIterator{true}) = ...

somefunction(itr::DFColumnIterator{false}) = ...

And it will be type stable AFAIK.
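A minimal, self-contained sketch of this dispatch pattern, using only Base Julia. ToyTable, ToyColumnIterator and toyeachcol are hypothetical stand-ins invented for illustration. They are not DataFrames.jl types or functions, and the real implementation may differ.

```julia
# Hypothetical stand-in for a DataFrame: column names plus a vector of columns.
struct ToyTable
    names::Vector{Symbol}
    columns::Vector{Vector{Float64}}
end

# U is a Bool type parameter recording whether names are yielded.
struct ToyColumnIterator{U}
    table::ToyTable
end

# The keyword value is lifted into the type parameter (assumed analogue of usenames).
toyeachcol(t::ToyTable; usenames::Bool=true) = ToyColumnIterator{usenames}(t)

Base.length(itr::ToyColumnIterator) = length(itr.table.columns)

# Yield (name, column) tuples when U == true ...
function Base.iterate(itr::ToyColumnIterator{true}, i::Int=1)
    i > length(itr) && return nothing
    return ((itr.table.names[i], itr.table.columns[i]), i + 1)
end

# ... and bare columns when U == false.
function Base.iterate(itr::ToyColumnIterator{false}, i::Int=1)
    i > length(itr) && return nothing
    return (itr.table.columns[i], i + 1)
end

t = ToyTable([:a, :b], [[1.0, 2.0], [3.0, 4.0]])
sums = [sum(col) for col in toyeachcol(t; usenames=false)]
named = [name => sum(col) for (name, col) in toyeachcol(t)]
```

Each iterate method has a single return shape, so code that receives a concrete ToyColumnIterator{true} or ToyColumnIterator{false} can be specialized. The return type of toyeachcol itself still depends on the value of usenames, which is the concern raised in the question above.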

bkamins on 27 Jul 2018

I think that is a good idea. Maybe names instead of usenames. It kind of makes sense to define another iterator, but I don't know what the name would be.

pdeffebach on 27 Jul 2018

Ah - I see what you mean - we could use Val if we find that the compiler complains. But this should be a small union and those should be handled efficiently. (EDIT: this refers to your earlier question)

bkamins on 27 Jul 2018

It might be hard to make this as fast as df[i] for i in 1:ncol(df):

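using BenchmarkTools, DataFrames, Statistics  # Statistics provides mean on Julia 0.7 and later
df = DataFrame(rand(100,100));

# Index columns directly by position
function newarray(df::DataFrame, f::Function)
    [f(df[i]) for i in 1:ncol(df)]
end

# Go through the eachcol iterator, which yields (name, col) tuples
function newarrayiter(df::DataFrame, f::Function)
    [f(col) for (name, col) in eachcol(df)]
end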

julia> @btime newarray($df, mean);
  7.931 μs (104 allocations: 2.52 KiB)

julia> @btime newarrayiter($df, mean);
  30.791 μs (203 allocations: 5.59 KiB)

Perhaps this performance difference will go away if we stop returning name.

pdeffebach on 27 Jul 2018

In my tests, the following iterator is actually faster than [f(df[i]) for i in 1:ncol(df)]:

struct Test{T <: AbstractDataFrame}
    df::T
end
ec(df::AbstractDataFrame) = Test(df)
Base.length(itr::Test) = ncol(itr.df)

# state is the index of the last column returned; 0 before the first call
function Base.iterate(itr::Test, state::Int=0)
    state += 1
    ncol(itr.df) < state && return nothing
    (itr.df[state], state)
end

bkamins on 27 Jul 2018

Cool. Let's add the functionality then, and if we decide we need it to be a separate iterator rather than a parametric type we can always do that.

pdeffebach on 27 Jul 2018

OK - @nalimilan - do you have any comments before the PR?

bkamins on 27 Jul 2018

This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?

pdeffebach on 27 Jul 2018

Yes - this is the issue, but fortunately it is only noticeable if the work done on a column is small; if the work is large enough, it is usually delegated to a function that acts as a function barrier.
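A small sketch of the function-barrier pattern in plain Julia. The columns vector with an abstract element type is an assumed analogue of how a DataFrame holds its columns, and colsum is a hypothetical helper, not part of DataFrames.jl.

```julia
# Columns stored behind an abstract element type, so the concrete type of
# each column is not known to the compiler inside the outer loop.
columns = AbstractVector[[1.0, 2.0, 3.0], [1, 2, 3]]

# Kernel function: Julia compiles a specialized method instance for each
# concrete column type it is called with.
function colsum(col::AbstractVector)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

# The outer loop pays one dynamic dispatch per column; the inner loop over
# the elements runs in specialized code.
totals = [colsum(col) for col in columns]
```

The dispatch cost is paid once per column, so it matters when colsum does very little work and becomes negligible as the per-column work grows.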

bkamins on 27 Jul 2018

True. I've been benchmarking some large aggregate operations recently and I think that inference is a barrier to high performance for this. aggregate iterates, so maybe a decent amount of bottlenecks could be gotten rid of with just a typed iterator without the need for a NamedTuple implementation. But I don't know if typed iterators are a thing.

pdeffebach on 27 Jul 2018

"This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?"

Sorry, I recognize this is a touch off topic, but I'm not sure where it's best to bring it up: it seems like a lack of type stability of columns underlies many of the performance issues for DataFrames (if I understand what's going on). Is moving to type-stable columns a subject of discussion? If so, where?

nickeubank on 27 Jul 2018

I would say #1335 and #1256 have this discussion. Then there is nl/typed which has some code for an implementation.

pdeffebach on 27 Jul 2018

And #744 - an old issue that is still open

bkamins on 27 Jul 2018

thx

nickeubank on 27 Jul 2018

@bkamins Do you think something still needs to be done here?

nalimilan on 21 Jan 2019

It can be closed given our current implementation (there is still a deprecation period, to be finished by https://github.com/JuliaData/DataFrames.jl/pull/1613, but we will not lose track of it). The type stability issue is important, but I would discuss it in the other threads.

bkamins on 21 Jan 2019
