Dataframes.jl: Problems in groupreduce_init

Created on 9 May 2020 · 8Comments · Source: JuliaData/DataFrames.jl

@nalimilan. We have two issues with Reduce design related to the eltype identified by groupreduce_init.

The first is:

julia> df = DataFrame(g=[1,1,1,2,2,2], x=Any[1,1,1,1.5,1.5,1.5])
6×2 DataFrame
│ Row │ g     │ x   │
│     │ Int64 │ Any │
├─────┼───────┼─────┤
│ 1   │ 1     │ 1   │
│ 2   │ 1     │ 1   │
│ 3   │ 1     │ 1   │
│ 4   │ 2     │ 1.5 │
│ 5   │ 2     │ 1.5 │
│ 6   │ 2     │ 1.5 │

julia> combine(groupby(df, :g), :x => sum)
ERROR: InexactError: Int64(1.5)

julia> df = DataFrame(g=[1,1,1,2,2,2], x=Any[1,1,1,1,1,missing])
6×2 DataFrame
│ Row │ g     │ x       │
│     │ Int64 │ Any     │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1       │
│ 2   │ 1     │ 1       │
│ 3   │ 1     │ 1       │
│ 4   │ 2     │ 1       │
│ 5   │ 2     │ 1       │
│ 6   │ 2     │ missing │

julia> combine(groupby(df, :g), :x => sum)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64

The second is that after https://github.com/JuliaLang/Statistics.jl/pull/25 we do not consistently identify eltype of the vector that is to store aggregated values (in DataFrames.jl if we get ints we keep them as ints, which can overflow - Base does a conversion to florat).

Proposed approach to solve it:

in groupreduce_init check if a column that is aggregated has a concrete type or a Union of a concrete type and missing; if yes do a fast path; if not - we have to go through the whole vector to check what we allow
pass adjust to groupreduce_init and if it is / then appropriately set eltype of res to take this fact into account

bug priority

Source

bkamins

All 8 comments

Good catch. Actually what's happening is that we call Base.reducedim_init, but since we don't want to allocate a vector of the same length as the input, we pass it a SubArray with a length equal to the number of groups. When the eltype is concrete that's OK, but when it's not reducedim_init may end up actually computing the reduction just to get its type. And in your example the first two values happen to be Int, so it returns Int.

There are several solutions:

like you suggest, drop the fast path for non-concrete types
copy the code from Base just to be able to set the expected length
since we work with vectors, we could also call reducedim_init with dimension 1 to get a 1-element vector, and resize it to the expected length

Your solution 1 is appealing and it may not be so slow, since reducedim_init computes the sum over all groups once just to compute the eltype. A possible drawback is that it will use a different algorithm: pairwise summation will be used, the accumulation type may differ across groups, and promotion will only happen at the end. There may be advantages to using the same algorithm in all cases (documentation would be clearer).

nalimilan on 9 May 2020

So we could do the following:

if nonmissingtype on eltype is concrete - do what we do now
use your option 2. otherwise (I think consistency is important)

As noted earlier - we should also take into account the fact that we might need to use target type when accumulating (matters for functions like mean).

bkamins on 9 May 2020

👍1

This becomes priority now for 1.5 release of Julia (as otherwise we will become inconsistent with Base with aggregations).

So I have two questions:

@nalimilan - will you push a fix to this or I should do it (I am OK to do it - I just want to avoid duplication of the effort)
do we "fix" it the way 1.5 works or we provide two implementations that have a different behaviour conditional on Julia version (I would prefer to stick to Julia 1.5, but this means that on old versions of Julia it will start to matter if someone writes :x => mean or x => x -> mean(x), because there is a bug in x -> mean(x)).

bkamins on 18 May 2020

I'm not working on it so feel free to do it! I agree adopting the Julia 1.5 behavior is OK, anyway we already differed from the 1.4 behavior since we don't use pairwise summation.

nalimilan on 18 May 2020

OK. I will make a PR probably this week.

bkamins on 18 May 2020

I am working on the PR and I was not sure if this:

julia> df = DataFrame(g=[1,1,1,2,2,2], x = [missing,missing,missing,3,4,5])
6×2 DataFrame
│ Row │ g     │ x       │
│     │ Int64 │ Int64?  │
├─────┼───────┼─────────┤
│ 1   │ 1     │ missing │
│ 2   │ 1     │ missing │
│ 3   │ 1     │ missing │
│ 4   │ 2     │ 3       │
│ 5   │ 2     │ 4       │
│ 6   │ 2     │ 5       │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (3 rows): g = 1
│ Row │ g     │ x       │
│     │ Int64 │ Int64?  │
├─────┼───────┼─────────┤
│ 1   │ 1     │ missing │
│ 2   │ 1     │ missing │
│ 3   │ 1     │ missing │
⋮
Last Group (3 rows): g = 2
│ Row │ g     │ x      │
│     │ Int64 │ Int64? │
├─────┼───────┼────────┤
│ 1   │ 2     │ 3      │
│ 2   │ 2     │ 4      │
│ 3   │ 2     │ 5      │

julia> combine(gdf, :x => minimum∘skipmissing)
ERROR: ArgumentError: some groups contain only missing values

is intended or we should return missing in this case?

EDIT: I think it is OK to error after thinking about it.

Also this:

julia> combine(gdf, :x => length∘skipmissing)
ERROR: MethodError: no method matching length(::Base.SkipMissing{SubArray{Union{Missing, Int64},1,Array{Union{Missing, Int64},1},Tuple{Array{Int64,1}},false}})

is probably unintuitive - we could handle this cleanly I think (of course now in Base it errors, but we could handle it internally).

bkamins on 19 May 2020

There is a problem with Base.reducedim_init as it is designed to handle a different use case:

julia> Base.reducedim_init(identity, Base.add_sum, [1,missing], 1)
1-element Array{Union{Missing, Int64},1}:
 0

julia> Base.reducedim_init(identity, Base.add_sum, Any[1,missing], 1)
1-element Array{Missing,1}:
 missing

so we cannot use it unfortunately in general. I think the only option is not to use fast path if the container is not of concrete type.

bkamins on 19 May 2020

👍1

I have pushed the fix. Sometimes few lines of code are more mind blowing than 1000 lines, so a careful check for corner cases would be appreciated as we are near a deep rabbit hole of Base internals in this PR :).

bkamins on 19 May 2020

Was this page helpful?

0 / 5 - 0 ratings