Julia: Add groupby to the standard lib

Created on 15 Jun 2019  ·  34Comments  ·  Source: JuliaLang/julia

Was surprised it's missing, really useful function, something like

groupby(f, list::Array) = begin
  foldl(list; init = Dict()) do dict, v
    push!(get!(dict, f(v), []), v)
    dict
  end
end

Most helpful comment

It would be great to find a common interface, but I think it should first live in a package, and only when everybody is relatively OK with it should it be moved to Base (if at all).

All 34 comments

See https://github.com/JuliaData/SplitApplyCombine.jl. I'm not sure it adds anything to have it in Base: better have more complete packages with a series of related functions.

Yes, this is unlikely to be added to Julia itself.

The natural languages and the computer languages too comply to power law, big head - long tail (20/80 or Pareto or Exponential etc) distribution.

80% of use cases covered by just 20% of features, probably even more extreme in case of software languages.

groupby in my opinion belongs to big head or standard library. Ruby, Kotlin, Elixir, Java have it. I was really surprised that Julia - a functional language with focus on data processing doesn't have that essential function for data manipulation in its standard library. Even SQL - with its really strict and compact set of built-in functions has GROUP BY.

better have more complete packages with a series of related functions

That's why there's no need in "more complete package". Because those "series of related functions" - are the long tail that you never need. And 36 stars on that package SplitApplyCombine.jl confirm that it's not that popular.

What would you propose the Base groupby function do? From your code example (I'm a little unclear on why it's using fold instead of just a loop), it seems like you'd want it to apply a group key function to an iterable collection of items and then construct a dict mapping each group key value to the vector of values which produced that key. Does that sound correct?

A major concern is that you want to use the name groupby in data packages like DataFrames and having it in Base with a different meaning makes that particularly annoying. So unless the two meanings are compatible, we wouldn't want to do that.

why it's using fold instead of just a loop

Do you mean for loop like that? Or there's some shorter version?

groupby(f, list::Array) = begin
  groups = Dict()
  for v in list
    push!(get!(groups, f(v), []), v)
  end
  groups
end

What would you propose the Base groupby function do? it seems like you'd want it to apply a group key function to an iterable collection of items and then construct a dict mapping each group key value to the vector of values which produced that key.

Yes, that, or some similar data structure like [[(key_a, v1), (key_a, v2)], [(key_b, v3)], ...] that would make sense for representing grouped data.

A major concern is that you want to use the name groupby in data packages like DataFrames and having it in Base with a different meaning makes that particularly annoying. So unless the two meanings are compatible, we wouldn't want to do that.

How it actually looked like. I needed to group the list, and typed ?> groupby - found none. Wow, strange. Typed julia groupby in Google - it gave me DataFrame.groupby - no, I definitely don't need some advanced DataFrame library just to group simple collection. Tried searching further - found nothing and had to write the groupby myself.

There are 2 things that I don't understand with that reasoning:

2) I'm not sure why I would need DataFrame. There's a reason for DataFrame in Python because Python is slow and DataFrame allows to express data computation in a general way so it will be understood by its C-DataFrame-engine and executed fast. But Julia is fast - so why should I bother learning some special DataFrame API when I can just use plain Arrays and Collections?

3) You saying - that you don't want to add some function to standard library because it may conflict with third-party package using the same function? There are 2 problems here in my opinion: 1) I see priorities exactly the opposite - the core is more important than third-party packages. 2) There shouldn't be problem because (in my understanding, maybe I'm missing something) the whole point of multiple dispatch is to solve such cases and allow to exist same function names operating on different arguments.

Reopening for discussion. The first point about Python doesn’t make sense to me: there’s absolutely no reason Python needs data frames to implement an operation that takes a collection and returns a dictionary of arrays (or lists as Python calls them). Searching “Python groupby” gives “pandas.DataFrame.groupby” and there is no built-in group by operation. So we are in exactly the same state of affairs as Python, which suggests that it’s not that unreasonable.

That said, groupby is a pretty handy operation and slightly annoying to write out, so maybe it would be reasonable to have it.

1) I see priorities exactly the opposite - the core is more important than third-party packages.

Not at all. The core should implement what is necessary to have in the core and hard or awkward to have in packages. There must be a good reason to be in the core language. Any complex functionality that can be removed into a package should be. Being stuck in the core language is where things go to fossilize. However, the proposed groupby functionality is clear and simple enough that it might be acceptable to fossilize it.

2) There shouldn't be problem because (in my understanding, maybe I'm missing something) the whole point of multiple dispatch is to solve such cases and allow to exist same function names operating on different arguments.

That depends on if they implement compatible signatures and concepts that are truly the same. If the DataFrames.groupby operation ever wanted to have a method that took an arbitrary collection and produced DataFrames, then that would directly conflict with the proposed groupby function. So multiple dispatch doesn’t necessarily solve this problem and one does have to think—generally pretty carefully—about what a generic function really means.

Searching “Python groupby” gives “pandas.DataFrame.groupby” and there is no built-in group by operation

FWIW groupby is in the python stdlib: https://docs.python.org/3/library/itertools.html#itertools.groupby. Though not sure the python stdlib is the muse

I see your point, makes sense. I guess we have slightly different goals. I prefer unified solution ready out of the box. And you prefer something more like a constructor to assemble custom solution you want.

I think it would be OK for DataFrames to import and override Base.groupby since we only need to handle AbstractDataFrame arguments. But I'd rather not commit now to a particular API for groupby. As the Python example shows, it can make sense to return an iterator, but the OP suggested returning a dict instead, and yet a recent comment proposed returning an array. Let's wait until a clear solution emerges in packages. Though maybe the iterator version could go to Iterators.

Cc: @andyferris

IterTools.jl has it
https://juliacollections.github.io/IterTools.jl/latest/#groupby(f,-xs)-1

In general Base.Iterators contains only the bare minimum set of iterator processing that Base itself needs.
Other iterator processing functions go in IterTools.jl.
Just like Base contains minimal data structures, the rest being in DataStructures.jl

Julia's way of placing arguments follows the reading of the function's name:
group __ by ___
groupby
group by
groupby(items, fn)

Query.jl also has it.

I agree, I don't think this belongs in base. IterTools.jl and and other packages can handle this perfectly well.

FWIW I've wanted this quite often. An argument for having it in Base is that this would emphasize having a common syntax and intuition. I think exactly the argument that it is challenging to come up with a general API is a good reason for trying to do so, aligning the API across the many (IterTools, Query, DataFrames, SplitApplyCombine) that have it.
I also tend to feel that when there is a functionality that gets reimplemented again and again across the ecosystem, that's a sign that there is a need to consolidate. And it should be clear and straightforward enough to fossilize.

Yes, I think having a single obvious way of grouping data would be better than having a smattering of implementations with slightly different semantics.

But what should that shared semantic be? understand the semantics in e.g. the OP and SplitApplyCombine,group is a bit different than what might seem obvious/intuitive for a tabular structure. To that end I got sucked into a vortex of dictionaries and tables that I haven’t worked my way through yet (I’ve been distracted by other things, unfortunately).

I did think there may be a path to combine the semantics of groupby by (a) tweaking AbstractDict and (b) dealing with partitioned tables as nested structures and (c) thinking of tables with a known unique key column as behaving more like a dictionary from the key value to the row than as an “array”-like thing mapping consecutive integers to rows (typical DataFrame).

But ignoring completely tables for a minute, I think obvious and common operations on/between arrays and dictionaries totally belong in Base. Put another way - consider how well fleshed out the AbstractArray interface is - batteries included, even ignoring LinearAlgebra. I don’t see why a lesser level of sophistication should apply to AbstractDict (or various obvious transformations involving dictionaries and arrays).

The groupby operation in Query.jl returns an iterator of Grouping items, where Grouping <: AbstractArray. That works well for both the non-tabular and tabular space (in fact, I think it ends up with a much more natural story than resorting to dicts). The design is just copied from LINQ.

It would be great to find a common interface, but I think it should first live in a package, and only when everybody is relatively OK with it should it be moved to Base (if at all).

groupby in others languages (rust, python, ruby, etc.) is one of the less harmonized definition among high-order functions. From datastore to datastore, there are some mismatches too. Not so simple.

Edited: after a bit of investigation as suggested by oxinabox, a consensus may have emerged in favor of sql. good god.

IterTools.jl has it

Itertools groupby only group consecutive values eg. is rather useless.

Itertools groupby only group consecutive values eg. is rather useless.

I wouldn't say it is useless
This is the exact behavour of:

idk about ruby's.

As you said, this is one of the least harmonized defintions.
This is a reason not to have it in the stdlib in my mind.

I vote +1 and -1 to your post :)

First, let's forget itertools, not a stdlib IMHO.
We've got now those teams:

Consecutive grouping caps

  • python
  • rust

Total grouping caps

According to this sample, a consensus seems to emerge in favor of grouping caps.
(My previous post may have been told a little too quickly, i will now downvote it).

The problem is however that group by apply materialization of data even solliciting intermediary collections. There will always be someone that need another version of array.
But sticking to iterator, vector, dictionary could do the things for a base version...

The main issue at this point is harmonizing with the DataFrames definition in such a way that we don't block DataFrames from using the same name. In other words, we need a generic meaning for groupby that: (a) can be implemented in base, operating on generic iterables, returning a dict of vectors; and (b) agrees with the DataFrames definition of groupby, so that it's legitimate for DataFrames to extend base's groupby function—i.e. they fundamentally do the same thing.

It is not clear to me that the version that operates on iterables should return a dict. I think the Query/LINQ approach that returns a Grouping is actually more composable for the iterable and table case.

That's good feedback. In that case this should probably not be in Base.

Yes, DataFrames and Query both return an iterable of groups, which has the advantage that you can aggregate it without necessarily allocating a copy of all grouped rows. For simple reductions like summing, this lazy operation can improve performance a lot as you can even avoid computing the permutation that sorts entries by groups, and instead compute the group sums in one pass over the data.

I wonder to what extent that approach makes sense for single vectors, as the cost of allocating copies is lower compared to a multi-column table -- and the vector of group indices takes as much space as the original data.

I commented it in https://github.com/JuliaData/DataFrames.jl/issues/1256#issuecomment-560146746 but I find quite useful sometimes that pandas.DataFrame.groupby is an iterable of key-group pairs rather than groups. This would be equivalent to returning a Dict.


Also, I think using non-lazy group vectors is necessary for unindexable iterators. In particular, the input iterator may be stateful (e.g., eachline).

I wonder to what extent that approach makes sense for single vectors, as the cost of allocating copies is lower compared to a multi-column table -- and the vector of group indices takes as much space as the original data.

I suppose this may not be the case for vectors of big named tuples or structs?

Due to these points, I think Base needs to have at least two methods of groupby: one for generic iterable (e.g., returning Dict{_, <:Vector}) and one for AbstractVector (e.g., returning Dict{_, <:SubArray}).

FWIW I think it would be great to have a groupby in Base. Currently you can get really far just using Vector{NamedTuple} as a table type, and map and filter for bread-and-butter data processing with no dependencies. groupby is the one that pops up frequently that I need an external package for. As far as what type it should return - I get that using a Dict internally is a reasonable way to implement this, but that seems like an implementation detail and not a fundamental property of the implementation. An iterable of Pairs seems pretty reasonable for this (which currently is what Dict is, but that might not be forever).

Here's one implementation I've been playing around with. It's heavily inspired by @andyferris's SplitApplyCombine, but returns a Vector. It's designed to be maximally flexible, taking user-provided functions. It might make sense to give it a more sort-like API with keyword arguments for the functions.

It's also implemented in terms of IterTools.groupby, which only groups consecutively, so that would have to be re-implemented or copied to include this in Base. It could also be tweaked to use generators to reduce allocation, but so far that's been a net performance loss in my tests.

using IterTools: IterTools
"""
    groupby(by=identity, mapf=identity, groupf=Pair, x)
Group `x` by `by.(x)`, applying `mapf` to each element within the group, and then
calling `groupf(key, values)` on each group.

Example:

julia> data = [
    (foo=1, bar=2, baz=3),
    (foo=2, bar=1, baz=2),
    (foo=1, bar=10, baz=10)]

julia> gs = groupby(r->r.foo, r->r.baz, data)
2-element Array{Pair{Int64,Array{Int64,1}},1}:
 1 => [3, 10]
 2 => [2]
"""
@inline groupby(x) = groupby(identity, x)
@inline groupby(by, x) = groupby(by, identity, x)
@inline groupby(by, mapf, x) = groupby(by, mapf, Pair, x)
function groupby(by, mapf, groupf, x)
    groups = IterTools.groupby(by, sort(x; by=by))
    map(g->groupf(by(g[1]), map(mapf, g)), groups)
end

@ssfrr I have relied on groupby or its equivalent when working with databases (large and small), are there other large domains of use where groupby is frequently called?

I see good reason to add groupby to a standard lib, that would simplify, promulgate, and, over time, enhance the ease with which members of the community approach their tabular processing.

To put something into Base means that Base has been .. in that specific manner .. lacking or wrongly unsupportive of _ and is only best served by the inclusion of _. Is groupby more appropriate to Base than dot (which is in the LinearAlgebra standard lib).

Yeah, you're right, a stdlib is better, I didn't think that through very carefully. In my mind it's closely related to map, and filter, but is definitely much less widely-used than they are. Base.Iterators seems like a reasonable home (which should perhaps also be a stdlib, but that seems like a separate issue).

It's also implemented in terms of IterTools.groupby, which only groups consecutively

By the way, _if_ we were to add groupby in stdlib or Base, I think the first step would be to add an iterator transformation (say) partitionby in Base.Iterators with this semantics. I borrowed the name from Clojure's partition-by. We already have Iterators.partition so I think it's a reasonable name. Also, if we have Iterators.partitionby, groupby can be done with one-linear (see below).

returns a Vector

If not returning a dictionary, I suggest returning a lazy iterator and make the lifetime and ownership of each element clear in the API to leave opportunity of optimizations (i.e., we can return a view or reuse the buffer/output vector in the partitionby implementation). If we return a lazy iterator, we don't need mapf and groupf. A reference implementation will be something like:

const partitionby = IterTools.groupby
groupby(by, x) = (by(g[1]) => g for g in partitionby(by, sort(x, by = by)))

Even though this is one-linear, it can be optimized more (e.g., specialize partitionby on dense array, use sortperm if eltype(x) is big, etc.) so I think it makes sense to have it in a library function.

Alternatively, maybe it's better to use pairs(groupby(by, x)) to get pairs:

struct Groups
    by
    partitions
end

iterate(gs::Groups, ...) = ...  # forward to partitions
groupby(by, x) = Groups(by, partitionby(by, sort(x, by = by)))
pairs(gs::Groups) = (by(g[1]) => g for g in gs.partitions)

Base.Iterators seems like a reasonable home

I think Base.Iterators should only contain (mostly) non-allocating lazy iterator transformations that is almost free to construct, to keep its focus clear. Group-by in the sense discussed here typically needs sorting so I don't think it fits in this scope.

I think it would be more reasonable to start shipping an implementation in a package. Then if everybody agrees it's a good API, it could be moved to Base or the stdlib.

I think it would be more reasonable

Given that this is the topic of the entire thread, with many good arguments above, this seems quite thin. Could you phrase your point so it adresses the arguments above?

It's just that there's already one implementation in SplitApplyCombine, and that other implementations are being proposed here, so it's already not clear which API is best nor whether it will suit all needs. And even @tkf's nice proposal above seems uncertain about the exact type to return.

Anyway, if people want to use that function soon, it would better live in a package, otherwise you won't be able to use it until Julia 1.5 at best.

I agree that clojure's partition-by is worth a look, and there's no reason I see not to put that into IterTools.

It seems that a non-sorted (or otherwise indexed) collection can't be grouped with no memory overhead so should be dealt with outside of IterTools. Sorting and partitioning seems like a lower-level implementation of the operation you really want - splitting the elements of a collection into groups. We can definitely recommend that as a "pattern" for people to use from Base - but I really think we should provide a generic function with well-defined semantics that people can (a) use and (b) overload with different implementation details.

And in case this wasn't clear - everyone is welcome to treat SplitApplyCombine as a place to prototype interfaces and implementations that might be suitable for inclusion in Base or a stdlib - in a way that was what it was created for, and really summed up by @ssfrr with "you can get really far just using Vector{NamedTuple} as a table type, and map and filter for bread-and-butter data processing" and trying to fill in the remaining gaps. The functions I use most are definitely the group* family (I am surprised how often I use groupcount, actually) - and not necessarily on tabular data, they are great for arrays and other collections too.

As for the discussion above about whether returning a dictionary or an array or something simpler is best, or something that iterates groups or key-group pairs - that seems very interesting to me! I always felt we should double-down on AbstractDict so it's as easy to use and extend as AbstractArray so that choosing a dictionary output to grouping wouldn't feel like any kind of implementation burden; just an interface guarantee that lets you deal with the output.

the advantage that you can aggregate it without necessarily allocating a copy of all grouped rows

Yes, I strongly agree, it seems essential that whatever we do that there is a way of doing reductions over the groups without sorting or excess allocations, etc.

The problem there is that could be 1/ very enviable first 2/ then very badly implemented in the long run. One example : the (10 differents) filter semantics of panda vs the (one and universal) where semantic of sql.

All this stuff is a matter of relational calculus / relational algebra, not data structure.
If you want to see a good example of a how generic implementation can be provided and specialized Apache Calcite is a very good start.

Particulary class Aggregate implemented by ElasticsearchAggregate, EnumerableAggregate, GeodeAggregate, JdbcRules.JdbcAggregate, LogicalAggregate, MongoAggregate, PigAggregate

Missing to rely on those strong foundational work, and too much time (eg. years) may be and will be spent examing cosmetic and/or dysfunctional options.
And at this stage, putting this first in stdlib does not seem a good point IMHO

Was this page helpful?
0 / 5 - 0 ratings

Related issues

felixrehren picture felixrehren  ·  3Comments

m-j-w picture m-j-w  ·  3Comments

tkoolen picture tkoolen  ·  3Comments

StefanKarpinski picture StefanKarpinski  ·  3Comments

yurivish picture yurivish  ·  3Comments