Dataframes.jl: findfirst/findlast/nextind/prevind not working for eachcol in v0.21.0

Created on 6 May 2020 · 9Comments · Source: JuliaData/DataFrames.jl

This code works in v0.20.2,

df = DataFrame(x=["a","b","c"], y=1:3)
findfirst(isequal(1:3),eachcol(df))

but errors in v0.21.0 with the error message.

ERROR: MethodError: no method matching nextind(::DataFrames.DataFrameColumns{DataFrame}, ::Symbol)

The cause seems to be that the underlying indices are now Symbols and not Ints. I didn't see this in the list of breaking changes so I figured it might be unintentional.
(A note, findall() works, but returns Symbols instead of Ints as before.)

Thanks for the new release and all the hard work!

bug feature non-breaking priority

Source

rasmushenningsson

👍1

Most helpful comment

This was intentional to make the following work consistently:

julia> keys(eachcol(df))
2-element Array{Symbol,1}:
 :x
 :y

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Array{Symbol,1}) with 2 entries:
  :x => ["a", "b", "c"]
  :y => [1, 2, 3]

This was consider more useful than:

julia> x = rand(3)
3-element Array{Float64,1}:
 0.5557510873245723
 0.43710797460962514
 0.42471785049513144

julia> keys(x)
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
 1
 2
 3

julia> pairs(x)
pairs(::Array{Float64,1}) with 3 entries:
  1 => 0.555751
  2 => 0.437108
  3 => 0.424718

but indeed it has a drawback you have highlighted.

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

bkamins on 6 May 2020

👍2

All 9 comments

This was intentional to make the following work consistently:

julia> keys(eachcol(df))
2-element Array{Symbol,1}:
 :x
 :y

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Array{Symbol,1}) with 2 entries:
  :x => ["a", "b", "c"]
  :y => [1, 2, 3]

This was consider more useful than:

julia> x = rand(3)
3-element Array{Float64,1}:
 0.5557510873245723
 0.43710797460962514
 0.42471785049513144

julia> keys(x)
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
 1
 2
 3

julia> pairs(x)
pairs(::Array{Float64,1}) with 3 entries:
  1 => 0.555751
  2 => 0.437108
  3 => 0.424718

but indeed it has a drawback you have highlighted.

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

bkamins on 6 May 2020

👍2

(of course breaking these functions was not intentional and I have forgotten to mention this in release notes as there were 102 PRs to annotate so it slipped through)

bkamins on 6 May 2020

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

Absolutely. I don't think it's a common use case, but nice to have once in a while.

rasmushenningsson on 6 May 2020

We should also fix:

julia> df = DataFrame(a=1,b=2)
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> hash(eachcol(df))
ERROR: ArgumentError: invalid index: :b of type Symbol

cf. https://github.com/JuliaLang/julia/pull/36073

CC @nalimilan

bkamins on 29 May 2020

@nalimilan - I have investigated this issue. There is no way to cleanly resolve it. We have the following options:

stop making DataFrameColumns a subtype of AbstractVector (this is my preferred choice, the drawback is that it is mildly breaking; the good thing is that eachcol in Base does not return AbstractVector); then we just should add the methods we believe are worth to keep being supported by this type
make keys(::DataFrameColumns) produce integers - this would cleanly resolve everything, but I think it would not be very useful in practice; so I do not like it
leave things as is and just keep adding methods that "patch" the problematic functions (for now it seems to be hash and nextind) -> this is not a very good approach as we will be never sure in which case something breaks

In summary - I would go for option 1 (stop making DataFrameColumns a subtype of AbstractVector).

bkamins on 30 May 2020

Maybe we can decide on a more general solution based on how https://github.com/JuliaLang/julia/pull/36073 is resolved? If keys has to return a vector of linear or cartesian indices for all AbstractVectors, then maybe we can add another function (in DataAPI?) to get less efficient but more informative names, which would be useful for DataFramesColumns, Tables.AbstractColumns but also named arrays types like NamedArray, AxisArray and the newer packages. Or if keys can return any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.

nalimilan on 30 May 2020

Sure - we can wait.

any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.

I understand that then you would have to provide a custom type that would be returned by LinearIndices(dfc) etc. - which seems a bit problematic, but maybe this is the way to go?

bkamins on 30 May 2020

@nalimilan - given the comment in https://github.com/JuliaLang/julia/pull/36073 I think we should remove AbstractVector subtype from DataFrameColumns and just add custom methods that normally work for vectors as needed. If there is no objection I will implement this.

@rasmushenningsson - I guess it should be OK with you - right?

bkamins on 11 Jun 2020

👍1

Please have a look at https://github.com/JuliaData/DataFrames.jl/pull/2291 for the implementation. Actually maybe I even like it more as indexing with multiple columns to DataFrameColumns now produces DataFrameColumns which seems more consistent.

bkamins on 16 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings