Pandas: "Function does not reduce" for multiindex, but works fine for single index.

Created on 19 Jul 2013 · 17 Comments · Source: pandas-dev/pandas

When I use pd.Series.tolist as a reducer with a single column groupby, it works.
When I do the same with multiindex, it does not.

It seems the "fast" Cython groupby function, which has no quarrel with reducing into lists, throws an exception if the index is "complex", which seems to mean a MultiIndex. When that exception is caught, the groupby falls back to the "pure python" path, which throws a new exception if the reducing function returns a list.

Is this a bug or is there some logic to this which is not apparent to me?

Reproduce:

import pandas as pd
import numpy as np

s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
for i in range(10):
    s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    df2 = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
    df = pd.concat([df, df2])
df['gk'] = 'foo'
df['gk2'] = 'bar'

# This works.
df.groupby(['gk']).agg(pd.Series.tolist)

# This does not -- raises ValueError: Function does not reduce.
df.groupby(['gk', 'gk2']).agg(pd.Series.tolist)

All 17 comments

what exactly are you trying to accomplish?

the first groupby is very odd to do

as is passing a reduction function of pd.Series.tolist which is technically not a reduction function at all

not sure the first case ought to work either...

A simplified example would be this:
I use it to compare KPIs of single rows to their peers, the peers being determined by the group key. I compare the KPIs to the averages / means of the KPIs of each group. It's an efficient way of doing it since I do not want to keep the original dataset in memory. This example is only for two levels: the superset and the entry being compared. I actually do this on four levels, which makes it a whole lot messier, and the tolist helps strip the complexity down.
If I could not use tolist as an aggregator (which in my experience is quite common practice; lots of SQL variants support it) I would have to keep both the original dataset and the grouped dataset and then access the original by index.
Needless to say, that would eat my memory in no time, and my intuition tells me it would be slower; I could perform some tests and publish them if necessary.

how does using tolist save you any data? it's the same data, just in a list, and comparisons are then hard

I think you can do one of these:

  • df.groupby(keys).apply(lambda x: x._get_numeric_data().abs().sum()) or another function that effectively _hashes_ a row together
  • df.groupby(['gk','gk2']).agg(lambda x: tuple(x.tolist())) will do what you want with the multi-indexes (or single index); as a tuple it is inferred as a reduction
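
For example, applying the second suggestion to the df built in the original report (a minimal sketch; the applymap conversion back to lists afterwards is my own addition, not part of the suggestion):

# Tuples are inferred as a reduction, so the multi-key groupby no longer raises.
out = df.groupby(['gk', 'gk2']).agg(lambda x: tuple(x.tolist()))

# If list values are needed downstream, convert the tuples back afterwards.
out = out.applymap(list)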

Well, if wrapping the list in a tuple is an acceptable solution, removing the check for the list from the source should be acceptable too, since it really does not add any functionality or help.

ok....will think about it....

but still I am curious, how does tolist actually reduce? it doesn't change the amount of data you have at all, so your argument about not keeping data in memory is fallacious

here's another thought: why don't you hash the result? you want to compare it to other items, right? (to see if they are the same)

I want to compare, yes, but I need to know if the median KPIs of the group are greater or smaller than the same KPIs of each entry which makes up the group. It is not a question of equal or not.

The original dataframe contains a lot more than just the columns I keep with the tolist "reduction". I could delete the other columns, but the columns coming out of our API keep changing independently of my work, so a whitelist approach is really the only way. I can of course build a whitelist in other ways, but this is a very simple way of getting what I need.

ok...if it works for you

why can't you do the actual comparison in the agg/applied function itself?

import numpy as np

# other_data is whatever you are comparing against; keys are your group keys.
def f(x):
    if (x.median() >= other_data).all():
        return x.median()
    return np.nan

df.groupby(keys).apply(f)

That may be a nice way of doing it.
I still need the lists outputted to different files for validation though.

np...just trying to help...will keep this open in any event..thanks for the report

I will be able to proceed with a mix of the workarounds you provided, thanks!

@frinx you'd think that removing the check for list would be just fine, but there are some quirks with how groupby is handled in pandas, so it's not straightforward to just remove it. :-/ You can see #3805 for a start on this.

Just ran into this too. Rather than being forced to collect the unique values in a column into a string, I would have liked them to be collected into a list (e.g. for use later).

reduce in Python is really a fold left, and there's absolutely nothing wrong with a fold left returning a collection. e.g. the identity reduction for a list s is reduce(lambda x, y: x + [y], s, []), which fails in pd.agg.

'reducing' the quantity of data is not a requirement of reduce, functionally speaking (that argument also fails considering that str works in pd.agg).
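
To make the fold-left point concrete, a minimal sketch (the commented-out line reuses the df and group keys from the original report and is left commented because it raises):

from functools import reduce

# A fold left that returns a collection: the identity reduction for a list.
s = [1, 2, 3]
assert reduce(lambda acc, y: acc + [y], s, []) == s

# The same fold used as an aggregator raises "Function does not reduce"
# on a multi-key groupby, even though it is a perfectly valid fold left:
# df.groupby(['gk', 'gk2']).agg(lambda x: reduce(lambda acc, y: acc + [y], x.tolist(), []))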

I run into this working with shredded (EAV) data. Often, there are multiple values for a single column. Take data representing authors on articles:

recordId,attributeName,value
1,title,"Map-Reduce Extensions and Recursive Queries"
1,author,"Foto N. Afrati"
1,author,"Vinayak Borkar"
1,author,"Michael Carey"
1,author,"Neoklis Polyzotis"
1,author,"Jeffrey D. Ullman"

This data is coming from a 3rd party, and I'd like to fix it so I can work with it. I want to create a 3-level index of ['recordId', 'attributeName', 'instance'] with a single column for the values. The way to create 'instance' (sketched in full after the list) is to:

  1. eav.set_index(['recordId', 'attributeName'], inplace=True) to promote recordId and attributeName to indexes; then
  2. s=eav.groupby(level=['recordId', 'attributeName']).agg(pd.Series.tolist) to aggregate into a Series of lists of values, then
  3. use t=pd.DataFrame(s.tolist(), s.index) to split into columns with numeric labels, then
  4. use t.stack() to create the 3rd level index.
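
Put together, the intended steps look like this (a sketch only; as explained next, step 2 raises, so it does not actually run):

# 1. Promote recordId and attributeName to the index.
eav.set_index(['recordId', 'attributeName'], inplace=True)

# 2. Aggregate each group's values into lists -- this is the step that
#    raises "Function does not reduce" for the multi-key grouping.
s = eav.groupby(level=['recordId', 'attributeName']).agg(pd.Series.tolist)

# 3. Split the lists into numbered columns.
t = pd.DataFrame(s.tolist(), s.index)

# 4. Stack the numbered columns to create the third index level, 'instance'.
t.stack()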

Sadly, this doesn't work because _aggregate_series_pure_python(self, obj, func) has an explicit exclusion for the case when the aggregator returns a list for the first group.

Many other data processing platforms have recognized the utility of aggregating into lists: PostgreSQL has array_agg; Spark has collect_list(); MySQL has group_concat; etc.

The exclusion makes even less sense when you consider that methods such as Series.str.split() will happily create columns of lists, yet _aggregate_series_pure_python(self, obj, func) prevents the creation of list values when grouping.
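
A minimal sketch of that point, with hypothetical data, just to show that list-valued cells are otherwise perfectly ordinary:

import pandas as pd

s = pd.Series(["Foto N. Afrati;Vinayak Borkar", "Michael Carey"])

# Series.str.split produces cells whose values are Python lists,
# so list values are not inherently off-limits in pandas.
split = s.str.split(";")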

@nbateshaus what's your desired output there? Is it

In [12]: df['instance'] = df.groupby(['recordId', 'attributeName']).value.cumcount()

In [13]: df.set_index(['recordId', 'attributeName', 'instance'])
Out[13]:
                                                                       value
recordId attributeName instance
1        title         0         Map-Reduce Extensions and Recursive Queries
         author        0                                      Foto N. Afrati
                       1                                      Vinayak Borkar
                       2                                       Michael Carey
                       3                                   Neoklis Polyzotis
                       4                                   Jeffrey D. Ullman

I know that nested data is important, but the building-blocks aren't in place for pandas to support it well at the moment.

Yes, that's the output I want. Assuming I sort by ['e','a'] first, this is probably faster, too. It is still very convenient to be able to create list values.

Looking at the history of this restriction, it looks like it was accidentally introduced by a transcription error in f3c0a081e2cfc8e073f8461cac5c242d0e4219d0 - at the time, it was an assertion, and it went from

assert(not (isinstance(res, list) and len(res) == len(self.dummy)))

where, as far as I can tell, dummy is uninitialized, to

assert(not isinstance(res, list))

The original assertion was added without comment in 71e9046c52246535d4db1f350e82c3a84d748f88, in response to #612 "Pure python multi-key groupby can't handle non-numeric results". Which reveals another oddity: groupby().agg(pd.Series.tolist) works fine for single-key groupings; it only fails for multi-key groupings.

>>> eav.groupby(['attributeName']).agg(pd.Series.tolist)
                      recordId  \
attributeName                    
author         [1, 1, 1, 1, 1]   
title                      [1]   

                                                           value  
attributeName                                                     
author         [Foto N. Afrati, Vinayak Borkar, Michael Carey...  
title              [Map-Reduce Extensions and Recursive Queries]  
>>> eav.groupby(['recordId', 'attributeName']).agg(pd.Series.tolist)
Traceback (most recent call last):
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1863, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1868, in _aggregate_series_fast
    func = self._is_builtin_func(func)
AttributeError: 'BaseGrouper' object has no attribute '_is_builtin_func'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 3122, in aggregate
    return self._python_agg_general(arg, *args, **kwargs)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 777, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1865, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1899, in _aggregate_series_pure_python
    raise ValueError('Function does not reduce')
ValueError: Function does not reduce

The example by @frinx seems to now work fine, _and_ @bobhaffner said that #18354 "might" close this. So I'm assuming this is solved. @nbateshaus, please feel free to provide a reproducible example if this is still an issue.
