Dataframes.jl: findfirst/findlast/nextind/prevind not working for eachcol in v0.21.0

Created on 6 May 2020  ยท  9Comments  ยท  Source: JuliaData/DataFrames.jl

This code works in v0.20.2,

df = DataFrame(x=["a","b","c"], y=1:3)
findfirst(isequal(1:3),eachcol(df))

but errors in v0.21.0 with the error message.

ERROR: MethodError: no method matching nextind(::DataFrames.DataFrameColumns{DataFrame}, ::Symbol)

The cause seems to be that the underlying indices are now Symbols and not Ints. I didn't see this in the list of breaking changes so I figured it might be unintentional.
(A note, findall() works, but returns Symbols instead of Ints as before.)

Thanks for the new release and all the hard work!

bug feature non-breaking priority

Most helpful comment

This was intentional to make the following work consistently:

julia> keys(eachcol(df))
2-element Array{Symbol,1}:
 :x
 :y

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Array{Symbol,1}) with 2 entries:
  :x => ["a", "b", "c"]
  :y => [1, 2, 3]

This was consider more useful than:

julia> x = rand(3)
3-element Array{Float64,1}:
 0.5557510873245723
 0.43710797460962514
 0.42471785049513144

julia> keys(x)
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
 1
 2
 3

julia> pairs(x)
pairs(::Array{Float64,1}) with 3 entries:
  1 => 0.555751
  2 => 0.437108
  3 => 0.424718

but indeed it has a drawback you have highlighted.

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

All 9 comments

This was intentional to make the following work consistently:

julia> keys(eachcol(df))
2-element Array{Symbol,1}:
 :x
 :y

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Array{Symbol,1}) with 2 entries:
  :x => ["a", "b", "c"]
  :y => [1, 2, 3]

This was consider more useful than:

julia> x = rand(3)
3-element Array{Float64,1}:
 0.5557510873245723
 0.43710797460962514
 0.42471785049513144

julia> keys(x)
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
 1
 2
 3

julia> pairs(x)
pairs(::Array{Float64,1}) with 3 entries:
  1 => 0.555751
  2 => 0.437108
  3 => 0.424718

but indeed it has a drawback you have highlighted.

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

(of course breaking these functions was not intentional and I have forgotten to mention this in release notes as there were 102 PRs to annotate so it slipped through)

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

Absolutely. I don't think it's a common use case, but nice to have once in a while.

We should also fix:

julia> df = DataFrame(a=1,b=2)
1ร—2 DataFrame
โ”‚ Row โ”‚ a     โ”‚ b     โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ 2     โ”‚

julia> hash(eachcol(df))
ERROR: ArgumentError: invalid index: :b of type Symbol

cf. https://github.com/JuliaLang/julia/pull/36073

CC @nalimilan

@nalimilan - I have investigated this issue. There is no way to cleanly resolve it. We have the following options:

  1. stop making DataFrameColumns a subtype of AbstractVector (this is my preferred choice, the drawback is that it is mildly breaking; the good thing is that eachcol in Base does not return AbstractVector); then we just should add the methods we believe are worth to keep being supported by this type
  2. make keys(::DataFrameColumns) produce integers - this would cleanly resolve everything, but I think it would not be very useful in practice; so I do not like it
  3. leave things as is and just keep adding methods that "patch" the problematic functions (for now it seems to be hash and nextind) -> this is not a very good approach as we will be never sure in which case something breaks

In summary - I would go for option 1 (stop making DataFrameColumns a subtype of AbstractVector).

Maybe we can decide on a more general solution based on how https://github.com/JuliaLang/julia/pull/36073 is resolved? If keys has to return a vector of linear or cartesian indices for all AbstractVectors, then maybe we can add another function (in DataAPI?) to get less efficient but more informative names, which would be useful for DataFramesColumns, Tables.AbstractColumns but also named arrays types like NamedArray, AxisArray and the newer packages. Or if keys can return any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.

Related issues: https://github.com/JuliaArrays/AxisArrays.jl/pull/152#issuecomment-446571461, https://github.com/JuliaArrays/AxisArrays.jl/pull/81, https://github.com/invenia/NamedDims.jl/issues/86.

Sure - we can wait.

any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.

I understand that then you would have to provide a custom type that would be returned by LinearIndices(dfc) etc. - which seems a bit problematic, but maybe this is the way to go?

@nalimilan - given the comment in https://github.com/JuliaLang/julia/pull/36073 I think we should remove AbstractVector subtype from DataFrameColumns and just add custom methods that normally work for vectors as needed. If there is no objection I will implement this.

@rasmushenningsson - I guess it should be OK with you - right?

Please have a look at https://github.com/JuliaData/DataFrames.jl/pull/2291 for the implementation. Actually maybe I even like it more as indexing with multiple columns to DataFrameColumns now produces DataFrameColumns which seems more consistent.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bbrunaud picture bbrunaud  ยท  3Comments

gustafsson picture gustafsson  ยท  6Comments

cormullion picture cormullion  ยท  6Comments

ahalwright picture ahalwright  ยท  3Comments

davidanthoff picture davidanthoff  ยท  4Comments