Pandas: AssertionError when grouping with max/min as aggregation functions (pandas-1.0.0)

Created on 31 Jan 2020 · 12 Comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1').min()

Problem description

Since pandas-1.0.0, an AssertionError is raised when grouping a DataFrame by a key and aggregating with max/min. It works fine if only one column (other than the grouping key) has dtype object, but it fails when more than one column has dtype object (as in the example). This worked fine on previous versions of pandas (e.g., pandas-0.25.3).
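Until this is fixed, one possible workaround (a sketch, not an official recommendation; column names taken from the example above) is to aggregate each column separately, so every object column goes through the SeriesGroupBy path on its own, and then recombine the results:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Aggregate one column at a time, then stitch the per-column results
# back together into a single DataFrame.
gb = df.groupby('key1')
result = pd.concat({col: gb[col].min() for col in df.columns.drop('key1')},
                   axis=1)
print(result)
```

This avoids the multi-column object-block fallback that triggers the assertion.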

Expected Output

| key1 | key2 | key3 | data1 | data2 |
|:-------|:-------|:-------|---------:|----------:|
| a | one | six | -0.67246 | -1.6302 |
| b | one | six | -1.72628 | -0.907298 |

Output of pd.show_versions()

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0

Labels: Groupby, Has PR, Regression

All 12 comments

Thanks for the report. That assert is from https://github.com/pandas-dev/pandas/pull/29035 (cc @jbrockmendel)

We do min on object dtype, which is NotImplemented in Cython, so we fall back to the python agg. Then in

                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)

the `assert len(result._data.blocks) == 1` fails:

(Pdb) pp result
     key2 key3
key1
a     one  six
b     one  six

and we fall through to the finally with a DataFrame.

FYI @marcevrard, we publish release candidates and nightly builds if you want to catch these before the release. You can watch pandas with "Releases only" selected if nightly builds aren't an option.

I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks. This apparently isn't true for object blocks:

(Pdb) obj._data.blocks
(ObjectBlock: slice(0, 2, 1), 2 x 5, dtype: object,)
(Pdb) result._data.blocks
(ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object)
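The splitting can be seen without touching internals: the fallback re-runs the aggregation on just the object sub-frame, and the per-group python aggregation returns a two-column object DataFrame, which pandas may store as two blocks rather than the single block the code assumed. A minimal sketch of that fallback step, using the column names from the report's example:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six']})

# The 2 x 5 ObjectBlock corresponds to the two object columns; the
# fallback groups just this sub-frame and applies the python-level min.
obj = df[['key2', 'key3']]
result = obj.groupby(df['key1']).aggregate(lambda x: min(x))
print(result)
```

The `result` here is the same two-column DataFrame seen in the pdb output above.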

> I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks.

I guess _split_and_operate would have to be called in there somehow. Easiest solution would be to raise TypeError, which should revert to the previous behavior. Longer-term we probably should be handling that case without raising.

Looking into this today.

Thank you for the quick fix; I confirm it works again in version 1.0.1.

I'm still seeing this error in pandas-1.0.1:

    df_mh_spc6 = df_mh_spc5.groupby(['bldg_id'], as_index=False, sort=False).max()
env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1378: in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1003: in _cython_agg_general
    agg_blocks, agg_items = self._cython_agg_blocks(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d88db20>, how = 'max'
alt = <function amax at 0x111901d30>, numeric_only = False, min_count = -1

    def _cython_agg_blocks(
        self, how: str, alt=None, numeric_only: bool = True, min_count: int = -1
    ) -> "Tuple[List[Block], Index]":
        # TODO: the actual managing of mgr_locs is a PITA
        # here, it should happen via BlockManager.combine

        data: BlockManager = self._get_data_to_aggregate()

        if numeric_only:
            data = data.get_numeric_data(copy=False)

        agg_blocks: List[Block] = []
        new_items: List[np.ndarray] = []
        deleted_items: List[np.ndarray] = []
        # Some object-dtype blocks might be split into List[Block[T], Block[U]]
        split_items: List[np.ndarray] = []
        split_frames: List[DataFrame] = []

        no_result = object()
        for block in data.blocks:
            # Avoid inheriting result from earlier in the loop
            result = no_result
            locs = block.mgr_locs.as_array
            try:
                result, _ = self.grouper.aggregate(
                    block.values, how, axis=1, min_count=min_count
                )
            except NotImplementedError:
                # generally if we have numeric_only=False
                # and non-applicable functions
                # try to python agg

                if alt is None:
                    # we cannot perform the operation
                    # in an alternate way, exclude the block
                    assert how == "ohlc"
                    deleted_items.append(locs)
                    continue

                # call our grouper again with only this block
                obj = self.obj[data.items[locs]]
                if obj.shape[1] == 1:
                    # Avoid call to self.values that can occur in DataFrame
                    #  reductions; see GH#28949
                    obj = obj.iloc[:, 0]

                s = get_groupby(obj, self.grouper)
                try:
                    result = s.aggregate(lambda x: alt(x, axis=self.axis))
                except TypeError:
                    # we may have an exception in trying to aggregate
                    # continue and exclude the block
                    deleted_items.append(locs)
                    continue
                else:
                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    if len(result._data.blocks) != 1:
                        # We've split an object block! Everything we've assumed
                        # about a single block input returning a single block output
                        # is a lie. To keep the code-path for the typical non-split case
                        # clean, we choose to clean up this mess later on.
                        split_items.append(locs)
                        split_frames.append(result)
                        continue

                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)

            assert not isinstance(result, DataFrame)

            if result is not no_result:
                # see if we can cast the block back to the original dtype
                result = maybe_downcast_numeric(result, block.dtype)

                if block.is_extension and isinstance(result, np.ndarray):
                    # e.g. block.values was an IntegerArray
                    # (1, N) case can occur if block.values was Categorical
                    #  and result is ndarray[object]
                    assert result.ndim == 1 or result.shape[0] == 1
                    try:
                        # Cast back if feasible
                        result = type(block.values)._from_sequence(
                            result.ravel(), dtype=block.values.dtype
                        )
                    except ValueError:
                        # reshape to be valid for non-Extension Block
                        result = result.reshape(1, -1)

                agg_block: Block = block.make_block(result)

            new_items.append(locs)
            agg_blocks.append(agg_block)

        if not (agg_blocks or split_frames):
            raise DataError("No numeric types to aggregate")

        if split_items:
            # Clean up the mess left over from split blocks.
            for locs, result in zip(split_items, split_frames):
>               assert len(locs) == result.shape[1]
E               AssertionError

env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1110: AssertionError

Any idea why this might still be happening? @TomAugspurger
It worked fine prior to pandas-1.0.0.

Thanks!

@zking1219 if you have a minimal example I'd recommend opening a new issue.

I can see how that might help, I'll work on putting one together. Thanks!

I am still getting this error message running pandas 1.0.5. I switched back to 0.25.1 and it works just fine. My dataset is a little complicated and I don't have time to put together a minimal example right now, but I thought you would want to know that this still seems to be a problem.

Seeing the same error in the pandas-1.1.0 version as well

I got the same error message...
With pdb I managed to narrow the problem down to the 1465th row :D
This command does not give an AssertionError (in pdb):
datafile.head(1464).groupby("column_name").min()
But this one does:
datafile.head(1465).groupby("column_name").min()

The 1465th row has 43 columns instead of 42, but deleting the 42nd column (i.e., the 43rd one) did not help; I still get the same error message.
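This kind of prefix bisection can be automated. A small debugging helper (hypothetical, not part of pandas; it assumes that once the operation fails on `df.head(n)`, it also fails on every longer prefix):

```python
import pandas as pd

def find_first_failing_row(df, op):
    """Binary-search for the length of the shortest prefix where op raises.

    Assumes monotonic failure: if op fails on df.head(n), it fails for
    every larger n as well.
    """
    lo, hi = 0, len(df)
    while lo < hi:
        mid = (lo + hi) // 2
        try:
            op(df.head(mid))
            lo = mid + 1      # a prefix of length mid still works
        except AssertionError:
            hi = mid          # the failure is already present at length mid
    return lo

# Demo with a stand-in failure at prefix length 7 instead of the real
# groupby call (which would be: lambda d: d.groupby("column_name").min()).
demo = pd.DataFrame({'x': range(10)})

def fails_at_seven(d):
    assert len(d) < 7

first_bad = find_first_failing_row(demo, fails_at_seven)
print(first_bad)  # → 7
```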

> Seeing the same error in the pandas-1.1.0 version as well

I did not see the error again in 1.1.0. I followed the procedure to recompile all the packages, and I also identified that it had occurred because I had some NaN in my data, which I have since fixed. Before, I was not getting the assertion error even in the presence of NaN.
