Pandas: AssertionError when grouping with max/min as aggregation functions (pandas-1.0.0)

Created on 31 Jan 2020 · 12 Comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1').min()

Problem description

Since pandas-1.0.0, an AssertionError is raised when grouping a DataFrame by a key and aggregating with max/min. It works fine if only one column (other than the grouping key) has dtype object, but it fails when more than one column has dtype object (as in the example). This worked fine on previous versions of pandas (e.g., pandas-0.25.3).
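Until this is fixed, one possible workaround (a sketch, not an official recommendation; column names taken from the example above) is to aggregate each column separately, so every object column goes through the SeriesGroupBy path on its own, and then recombine the results:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Aggregate one column at a time, then stitch the per-column results
# back together into a single DataFrame.
gb = df.groupby('key1')
result = pd.concat({col: gb[col].min() for col in df.columns.drop('key1')},
                   axis=1)
print(result)
```

This avoids the multi-column object-block fallback that triggers the assertion.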

Expected Output

| key1 | key2 | key3 | data1 | data2 |
|:-------|:-------|:-------|---------:|----------:|
| a | one | six | -0.67246 | -1.6302 |
| b | one | six | -1.72628 | -0.907298 |

Output of pd.show_versions()

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0

Labels: Groupby, Has PR, Regression

All 12 comments

Thanks for the report. That assert is from https://github.com/pandas-dev/pandas/pull/29035 (cc @jbrockmendel)

We do min on object dtype, which is NotImplemented in Cython, so we fall back to the python agg. Then in

                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)

the `assert len(result._data.blocks) == 1` fails:

(Pdb) pp result
     key2 key3
key1
a     one  six
b     one  six

and we fall through to the finally with a DataFrame.

FYI @marcevrard, we publish release candidates and nightly builds if you want to catch these before the release. You can watch pandas with "Releases only" selected if nightly builds aren't an option.

I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks. This apparently isn't true for object blocks:

(Pdb) obj._data.blocks
(ObjectBlock: slice(0, 2, 1), 2 x 5, dtype: object,)
(Pdb) result._data.blocks
(ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object)
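The splitting can be seen without touching internals: the fallback re-runs the aggregation on just the object sub-frame, and the per-group python aggregation returns a two-column object DataFrame, which pandas may store as two blocks rather than the single block the code assumed. A minimal sketch of that fallback step, using the column names from the report's example:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six']})

# The 2 x 5 ObjectBlock corresponds to the two object columns; the
# fallback groups just this sub-frame and applies the python-level min.
obj = df[['key2', 'key3']]
result = obj.groupby(df['key1']).aggregate(lambda x: min(x))
print(result)
```

The `result` here is the same two-column DataFrame seen in the pdb output above.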

> I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks.

I guess _split_and_operate would have to be called in there somehow. Easiest solution would be to raise TypeError, which should revert to the previous behavior. Longer-term we probably should be handling that case without raising.

Looking into this today.

Thank you for the quick fix; I confirm it works again in version 1.0.1.

I'm still seeing this error in pandas-1.0.1:

    df_mh_spc6 = df_mh_spc5.groupby(['bldg_id'], as_index=False, sort=False).max()
env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1378: in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1003: in _cython_agg_general
    agg_blocks, agg_items = self._cython_agg_blocks(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d88db20>, how = 'max'
alt = <function amax at 0x111901d30>, numeric_only = False, min_count = -1

    def _cython_agg_blocks(
        self, how: str, alt=None, numeric_only: bool = True, min_count: int = -1
    ) -> "Tuple[List[Block], Index]":
        # TODO: the actual managing of mgr_locs is a PITA
        # here, it should happen via BlockManager.combine

        data: BlockManager = self._get_data_to_aggregate()

        if numeric_only:
            data = data.get_numeric_data(copy=False)

        agg_blocks: List[Block] = []
        new_items: List[np.ndarray] = []
        deleted_items: List[np.ndarray] = []
        # Some object-dtype blocks might be split into List[Block[T], Block[U]]
        split_items: List[np.ndarray] = []
        split_frames: List[DataFrame] = []

        no_result = object()
        for block in data.blocks:
            # Avoid inheriting result from earlier in the loop
            result = no_result
            locs = block.mgr_locs.as_array
            try:
                result, _ = self.grouper.aggregate(
                    block.values, how, axis=1, min_count=min_count
                )
            except NotImplementedError:
                # generally if we have numeric_only=False
                # and non-applicable functions
                # try to python agg

                if alt is None:
                    # we cannot perform the operation
                    # in an alternate way, exclude the block
                    assert how == "ohlc"
                    deleted_items.append(locs)
                    continue

                # call our grouper again with only this block
                obj = self.obj[data.items[locs]]
                if obj.shape[1] == 1:
                    # Avoid call to self.values that can occur in DataFrame
                    #  reductions; see GH#28949
                    obj = obj.iloc[:, 0]

                s = get_groupby(obj, self.grouper)
                try:
                    result = s.aggregate(lambda x: alt(x, axis=self.axis))
                except TypeError:
                    # we may have an exception in trying to aggregate
                    # continue and exclude the block
                    deleted_items.append(locs)
                    continue
                else:
                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    if len(result._data.blocks) != 1:
                        # We've split an object block! Everything we've assumed
                        # about a single block input returning a single block output
                        # is a lie. To keep the code-path for the typical non-split case
                        # clean, we choose to clean up this mess later on.
                        split_items.append(locs)
                        split_frames.append(result)
                        continue

                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)

            assert not isinstance(result, DataFrame)

            if result is not no_result:
                # see if we can cast the block back to the original dtype
                result = maybe_downcast_numeric(result, block.dtype)

                if block.is_extension and isinstance(result, np.ndarray):
                    # e.g. block.values was an IntegerArray
                    # (1, N) case can occur if block.values was Categorical
                    #  and result is ndarray[object]
                    assert result.ndim == 1 or result.shape[0] == 1
                    try:
                        # Cast back if feasible
                        result = type(block.values)._from_sequence(
                            result.ravel(), dtype=block.values.dtype
                        )
                    except ValueError:
                        # reshape to be valid for non-Extension Block
                        result = result.reshape(1, -1)

                agg_block: Block = block.make_block(result)

            new_items.append(locs)
            agg_blocks.append(agg_block)

        if not (agg_blocks or split_frames):
            raise DataError("No numeric types to aggregate")

        if split_items:
            # Clean up the mess left over from split blocks.
            for locs, result in zip(split_items, split_frames):
>               assert len(locs) == result.shape[1]
E               AssertionError

env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1110: AssertionError

Any idea why this might still be happening? @TomAugspurger
It worked fine prior to pandas-1.0.0.

Thanks!

@zking1219 if you have a minimal example I'd recommend opening a new issue.

I can see how that might help, I'll work on putting one together. Thanks!

I am still getting this error message running pandas 1.0.5. I switched back to 0.25.1 and it works just fine. My dataset is a little complicated and I don't have time to put together a minimal example right now, but I thought you would want to know that this still seems to be a problem.

Seeing the same error in the pandas-1.1.0 version as well

I got the same error message...
With pdb I managed to narrow the problem down to the 1465th row :D
This command does not give an AssertionError (in pdb):
datafile.head(1464).groupby("column_name").min()
But this one does:
datafile.head(1465).groupby("column_name").min()

The 1465th row has 43 columns instead of 42, but deleting the 42nd column (i.e., the 43rd one) did not help; I still get the same error message.
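This kind of prefix bisection can be automated. A small debugging helper (hypothetical, not part of pandas; it assumes that once the operation fails on `df.head(n)`, it also fails on every longer prefix):

```python
import pandas as pd

def find_first_failing_row(df, op):
    """Binary-search for the length of the shortest prefix where op raises.

    Assumes monotonic failure: if op fails on df.head(n), it fails for
    every larger n as well.
    """
    lo, hi = 0, len(df)
    while lo < hi:
        mid = (lo + hi) // 2
        try:
            op(df.head(mid))
            lo = mid + 1      # a prefix of length mid still works
        except AssertionError:
            hi = mid          # the failure is already present at length mid
    return lo

# Demo with a stand-in failure at prefix length 7 instead of the real
# groupby call (which would be: lambda d: d.groupby("column_name").min()).
demo = pd.DataFrame({'x': range(10)})

def fails_at_seven(d):
    assert len(d) < 7

first_bad = find_first_failing_row(demo, fails_at_seven)
print(first_bad)  # → 7
```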

> Seeing the same error in the pandas-1.1.0 version as well

I did not see the error again in 1.1.0. I followed the procedure to recompile all the packages, and I also identified that it had occurred because I had some NaN in my data, which I have since fixed. Before, I was not getting the assertion error even in the presence of NaN.
