Dataframes.jl: Problems in groupreduce_init

Created on 9 May 2020  ยท  8Comments  ยท  Source: JuliaData/DataFrames.jl

@nalimilan. We have two issues with Reduce design related to the eltype identified by groupreduce_init.

The first is:

julia> df = DataFrame(g=[1,1,1,2,2,2], x=Any[1,1,1,1.5,1.5,1.5])
6ร—2 DataFrame
โ”‚ Row โ”‚ g     โ”‚ x   โ”‚
โ”‚     โ”‚ Int64 โ”‚ Any โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ 1   โ”‚
โ”‚ 2   โ”‚ 1     โ”‚ 1   โ”‚
โ”‚ 3   โ”‚ 1     โ”‚ 1   โ”‚
โ”‚ 4   โ”‚ 2     โ”‚ 1.5 โ”‚
โ”‚ 5   โ”‚ 2     โ”‚ 1.5 โ”‚
โ”‚ 6   โ”‚ 2     โ”‚ 1.5 โ”‚

julia> combine(groupby(df, :g), :x => sum)
ERROR: InexactError: Int64(1.5)

or

julia> df = DataFrame(g=[1,1,1,2,2,2], x=Any[1,1,1,1,1,missing])
6ร—2 DataFrame
โ”‚ Row โ”‚ g     โ”‚ x       โ”‚
โ”‚     โ”‚ Int64 โ”‚ Any     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ 1       โ”‚
โ”‚ 2   โ”‚ 1     โ”‚ 1       โ”‚
โ”‚ 3   โ”‚ 1     โ”‚ 1       โ”‚
โ”‚ 4   โ”‚ 2     โ”‚ 1       โ”‚
โ”‚ 5   โ”‚ 2     โ”‚ 1       โ”‚
โ”‚ 6   โ”‚ 2     โ”‚ missing โ”‚

julia> combine(groupby(df, :g), :x => sum)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64

The second is that after https://github.com/JuliaLang/Statistics.jl/pull/25 we do not consistently identify eltype of the vector that is to store aggregated values (in DataFrames.jl if we get ints we keep them as ints, which can overflow - Base does a conversion to florat).

Proposed approach to solve it:

  1. in groupreduce_init check if a column that is aggregated has a concrete type or a Union of a concrete type and missing; if yes do a fast path; if not - we have to go through the whole vector to check what we allow
  2. pass adjust to groupreduce_init and if it is / then appropriately set eltype of res to take this fact into account
bug priority

All 8 comments

Good catch. Actually what's happening is that we call Base.reducedim_init, but since we don't want to allocate a vector of the same length as the input, we pass it a SubArray with a length equal to the number of groups. When the eltype is concrete that's OK, but when it's not reducedim_init may end up actually computing the reduction just to get its type. And in your example the first two values happen to be Int, so it returns Int.

There are several solutions:

  1. like you suggest, drop the fast path for non-concrete types
  2. copy the code from Base just to be able to set the expected length
  3. since we work with vectors, we could also call reducedim_init with dimension 1 to get a 1-element vector, and resize it to the expected length

Your solution 1 is appealing and it may not be so slow, since reducedim_init computes the sum over all groups once just to compute the eltype. A possible drawback is that it will use a different algorithm: pairwise summation will be used, the accumulation type may differ across groups, and promotion will only happen at the end. There may be advantages to using the same algorithm in all cases (documentation would be clearer).

So we could do the following:

  1. if nonmissingtype on eltype is concrete - do what we do now
  2. use your option 2. otherwise (I think consistency is important)

As noted earlier - we should also take into account the fact that we might need to use target type when accumulating (matters for functions like mean).

This becomes priority now for 1.5 release of Julia (as otherwise we will become inconsistent with Base with aggregations).

So I have two questions:

  • @nalimilan - will you push a fix to this or I should do it (I am OK to do it - I just want to avoid duplication of the effort)
  • do we "fix" it the way 1.5 works or we provide two implementations that have a different behaviour conditional on Julia version (I would prefer to stick to Julia 1.5, but this means that on old versions of Julia it will start to matter if someone writes :x => mean or x => x -> mean(x), because there is a bug in x -> mean(x)).

I'm not working on it so feel free to do it! I agree adopting the Julia 1.5 behavior is OK, anyway we already differed from the 1.4 behavior since we don't use pairwise summation.

OK. I will make a PR probably this week.

I am working on the PR and I was not sure if this:

julia> df = DataFrame(g=[1,1,1,2,2,2], x = [missing,missing,missing,3,4,5])
6ร—2 DataFrame
โ”‚ Row โ”‚ g     โ”‚ x       โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64?  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ missing โ”‚
โ”‚ 2   โ”‚ 1     โ”‚ missing โ”‚
โ”‚ 3   โ”‚ 1     โ”‚ missing โ”‚
โ”‚ 4   โ”‚ 2     โ”‚ 3       โ”‚
โ”‚ 5   โ”‚ 2     โ”‚ 4       โ”‚
โ”‚ 6   โ”‚ 2     โ”‚ 5       โ”‚

julia> gdf = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (3 rows): g = 1
โ”‚ Row โ”‚ g     โ”‚ x       โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64?  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ missing โ”‚
โ”‚ 2   โ”‚ 1     โ”‚ missing โ”‚
โ”‚ 3   โ”‚ 1     โ”‚ missing โ”‚
โ‹ฎ
Last Group (3 rows): g = 2
โ”‚ Row โ”‚ g     โ”‚ x      โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64? โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 2     โ”‚ 3      โ”‚
โ”‚ 2   โ”‚ 2     โ”‚ 4      โ”‚
โ”‚ 3   โ”‚ 2     โ”‚ 5      โ”‚

julia> combine(gdf, :x => minimumโˆ˜skipmissing)
ERROR: ArgumentError: some groups contain only missing values

is intended or we should return missing in this case?

EDIT: I think it is OK to error after thinking about it.

Also this:

julia> combine(gdf, :x => lengthโˆ˜skipmissing)
ERROR: MethodError: no method matching length(::Base.SkipMissing{SubArray{Union{Missing, Int64},1,Array{Union{Missing, Int64},1},Tuple{Array{Int64,1}},false}})

is probably unintuitive - we could handle this cleanly I think (of course now in Base it errors, but we could handle it internally).

There is a problem with Base.reducedim_init as it is designed to handle a different use case:

julia> Base.reducedim_init(identity, Base.add_sum, [1,missing], 1)
1-element Array{Union{Missing, Int64},1}:
 0

julia> Base.reducedim_init(identity, Base.add_sum, Any[1,missing], 1)
1-element Array{Missing,1}:
 missing

so we cannot use it unfortunately in general. I think the only option is not to use fast path if the container is not of concrete type.

I have pushed the fix. Sometimes few lines of code are more mind blowing than 1000 lines, so a careful check for corner cases would be appreciated as we are near a deep rabbit hole of Base internals in this PR :).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jangorecki picture jangorecki  ยท  7Comments

CameronBieganek picture CameronBieganek  ยท  6Comments

ahalwright picture ahalwright  ยท  3Comments

bkamins picture bkamins  ยท  8Comments

tlienart picture tlienart  ยท  8Comments