This code works in v0.20.2,
df = DataFrame(x=["a","b","c"], y=1:3)
findfirst(isequal(1:3),eachcol(df))
but errors in v0.21.0 with the error message.
ERROR: MethodError: no method matching nextind(::DataFrames.DataFrameColumns{DataFrame}, ::Symbol)
The cause seems to be that the underlying indices are now Symbols and not Ints. I didn't see this in the list of breaking changes so I figured it might be unintentional.
(A note, findall() works, but returns Symbols instead of Ints as before.)
Thanks for the new release and all the hard work!
This was intentional to make the following work consistently:
julia> keys(eachcol(df))
2-element Array{Symbol,1}:
:x
:y
julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Array{Symbol,1}) with 2 entries:
:x => ["a", "b", "c"]
:y => [1, 2, 3]
This was consider more useful than:
julia> x = rand(3)
3-element Array{Float64,1}:
0.5557510873245723
0.43710797460962514
0.42471785049513144
julia> keys(x)
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
1
2
3
julia> pairs(x)
pairs(::Array{Float64,1}) with 3 entries:
1 => 0.555751
2 => 0.437108
3 => 0.424718
but indeed it has a drawback you have highlighted.
I think the solution is to add custom implementations of the functions you mention. Would that work for you?
(of course breaking these functions was not intentional and I have forgotten to mention this in release notes as there were 102 PRs to annotate so it slipped through)
I think the solution is to add custom implementations of the functions you mention. Would that work for you?
Absolutely. I don't think it's a common use case, but nice to have once in a while.
We should also fix:
julia> df = DataFrame(a=1,b=2)
1ร2 DataFrame
โ Row โ a โ b โ
โ โ Int64 โ Int64 โ
โโโโโโโผโโโโโโโโผโโโโโโโโค
โ 1 โ 1 โ 2 โ
julia> hash(eachcol(df))
ERROR: ArgumentError: invalid index: :b of type Symbol
cf. https://github.com/JuliaLang/julia/pull/36073
CC @nalimilan
@nalimilan - I have investigated this issue. There is no way to cleanly resolve it. We have the following options:
DataFrameColumns a subtype of AbstractVector (this is my preferred choice, the drawback is that it is mildly breaking; the good thing is that eachcol in Base does not return AbstractVector); then we just should add the methods we believe are worth to keep being supported by this typekeys(::DataFrameColumns) produce integers - this would cleanly resolve everything, but I think it would not be very useful in practice; so I do not like ithash and nextind) -> this is not a very good approach as we will be never sure in which case something breaksIn summary - I would go for option 1 (stop making DataFrameColumns a subtype of AbstractVector).
Maybe we can decide on a more general solution based on how https://github.com/JuliaLang/julia/pull/36073 is resolved? If keys has to return a vector of linear or cartesian indices for all AbstractVectors, then maybe we can add another function (in DataAPI?) to get less efficient but more informative names, which would be useful for DataFramesColumns, Tables.AbstractColumns but also named arrays types like NamedArray, AxisArray and the newer packages. Or if keys can return any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.
Related issues: https://github.com/JuliaArrays/AxisArrays.jl/pull/152#issuecomment-446571461, https://github.com/JuliaArrays/AxisArrays.jl/pull/81, https://github.com/invenia/NamedDims.jl/issues/86.
Sure - we can wait.
any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.
I understand that then you would have to provide a custom type that would be returned by LinearIndices(dfc) etc. - which seems a bit problematic, but maybe this is the way to go?
@nalimilan - given the comment in https://github.com/JuliaLang/julia/pull/36073 I think we should remove AbstractVector subtype from DataFrameColumns and just add custom methods that normally work for vectors as needed. If there is no objection I will implement this.
@rasmushenningsson - I guess it should be OK with you - right?
Please have a look at https://github.com/JuliaData/DataFrames.jl/pull/2291 for the implementation. Actually maybe I even like it more as indexing with multiple columns to DataFrameColumns now produces DataFrameColumns which seems more consistent.
Most helpful comment
This was intentional to make the following work consistently:
This was consider more useful than:
but indeed it has a drawback you have highlighted.
I think the solution is to add custom implementations of the functions you mention. Would that work for you?