@nalimilan. We have two issues with Reduce design related to the eltype identified by groupreduce_init.
The first is:
julia> df = DataFrame(g=[1,1,1,2,2,2], x=Any[1,1,1,1.5,1.5,1.5])
6ร2 DataFrame
โ Row โ g โ x โ
โ โ Int64 โ Any โ
โโโโโโโผโโโโโโโโผโโโโโโค
โ 1 โ 1 โ 1 โ
โ 2 โ 1 โ 1 โ
โ 3 โ 1 โ 1 โ
โ 4 โ 2 โ 1.5 โ
โ 5 โ 2 โ 1.5 โ
โ 6 โ 2 โ 1.5 โ
julia> combine(groupby(df, :g), :x => sum)
ERROR: InexactError: Int64(1.5)
or
julia> df = DataFrame(g=[1,1,1,2,2,2], x=Any[1,1,1,1,1,missing])
6ร2 DataFrame
โ Row โ g โ x โ
โ โ Int64 โ Any โ
โโโโโโโผโโโโโโโโผโโโโโโโโโโค
โ 1 โ 1 โ 1 โ
โ 2 โ 1 โ 1 โ
โ 3 โ 1 โ 1 โ
โ 4 โ 2 โ 1 โ
โ 5 โ 2 โ 1 โ
โ 6 โ 2 โ missing โ
julia> combine(groupby(df, :g), :x => sum)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
The second is that after https://github.com/JuliaLang/Statistics.jl/pull/25 we do not consistently identify eltype of the vector that is to store aggregated values (in DataFrames.jl if we get ints we keep them as ints, which can overflow - Base does a conversion to florat).
Proposed approach to solve it:
groupreduce_init check if a column that is aggregated has a concrete type or a Union of a concrete type and missing; if yes do a fast path; if not - we have to go through the whole vector to check what we allowadjust to groupreduce_init and if it is / then appropriately set eltype of res to take this fact into accountGood catch. Actually what's happening is that we call Base.reducedim_init, but since we don't want to allocate a vector of the same length as the input, we pass it a SubArray with a length equal to the number of groups. When the eltype is concrete that's OK, but when it's not reducedim_init may end up actually computing the reduction just to get its type. And in your example the first two values happen to be Int, so it returns Int.
There are several solutions:
reducedim_init with dimension 1 to get a 1-element vector, and resize it to the expected lengthYour solution 1 is appealing and it may not be so slow, since reducedim_init computes the sum over all groups once just to compute the eltype. A possible drawback is that it will use a different algorithm: pairwise summation will be used, the accumulation type may differ across groups, and promotion will only happen at the end. There may be advantages to using the same algorithm in all cases (documentation would be clearer).
So we could do the following:
nonmissingtype on eltype is concrete - do what we do nowAs noted earlier - we should also take into account the fact that we might need to use target type when accumulating (matters for functions like mean).
This becomes priority now for 1.5 release of Julia (as otherwise we will become inconsistent with Base with aggregations).
So I have two questions:
:x => mean or x => x -> mean(x), because there is a bug in x -> mean(x)).I'm not working on it so feel free to do it! I agree adopting the Julia 1.5 behavior is OK, anyway we already differed from the 1.4 behavior since we don't use pairwise summation.
OK. I will make a PR probably this week.
I am working on the PR and I was not sure if this:
julia> df = DataFrame(g=[1,1,1,2,2,2], x = [missing,missing,missing,3,4,5])
6ร2 DataFrame
โ Row โ g โ x โ
โ โ Int64 โ Int64? โ
โโโโโโโผโโโโโโโโผโโโโโโโโโโค
โ 1 โ 1 โ missing โ
โ 2 โ 1 โ missing โ
โ 3 โ 1 โ missing โ
โ 4 โ 2 โ 3 โ
โ 5 โ 2 โ 4 โ
โ 6 โ 2 โ 5 โ
julia> gdf = groupby(df, :g)
GroupedDataFrame with 2 groups based on key: g
First Group (3 rows): g = 1
โ Row โ g โ x โ
โ โ Int64 โ Int64? โ
โโโโโโโผโโโโโโโโผโโโโโโโโโโค
โ 1 โ 1 โ missing โ
โ 2 โ 1 โ missing โ
โ 3 โ 1 โ missing โ
โฎ
Last Group (3 rows): g = 2
โ Row โ g โ x โ
โ โ Int64 โ Int64? โ
โโโโโโโผโโโโโโโโผโโโโโโโโโค
โ 1 โ 2 โ 3 โ
โ 2 โ 2 โ 4 โ
โ 3 โ 2 โ 5 โ
julia> combine(gdf, :x => minimumโskipmissing)
ERROR: ArgumentError: some groups contain only missing values
is intended or we should return missing in this case?
EDIT: I think it is OK to error after thinking about it.
Also this:
julia> combine(gdf, :x => lengthโskipmissing)
ERROR: MethodError: no method matching length(::Base.SkipMissing{SubArray{Union{Missing, Int64},1,Array{Union{Missing, Int64},1},Tuple{Array{Int64,1}},false}})
is probably unintuitive - we could handle this cleanly I think (of course now in Base it errors, but we could handle it internally).
There is a problem with Base.reducedim_init as it is designed to handle a different use case:
julia> Base.reducedim_init(identity, Base.add_sum, [1,missing], 1)
1-element Array{Union{Missing, Int64},1}:
0
julia> Base.reducedim_init(identity, Base.add_sum, Any[1,missing], 1)
1-element Array{Missing,1}:
missing
so we cannot use it unfortunately in general. I think the only option is not to use fast path if the container is not of concrete type.
I have pushed the fix. Sometimes few lines of code are more mind blowing than 1000 lines, so a careful check for corner cases would be appreciated as we are near a deep rabbit hole of Base internals in this PR :).