```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1').min()
```
Since pandas-1.0.0, an AssertionError is raised when grouping a DataFrame by a key and aggregating with max/min. It works fine when only one column (other than the grouping key) has dtype object, but it fails when the DataFrame contains more than one object column (as in the example above). This worked fine on previous versions of pandas (e.g., pandas-0.25.3), which produce the expected output:
| key1 | key2 | key3 | data1 | data2 |
|:-------|:-------|:-------|---------:|----------:|
| a | one | six | -0.67246 | -1.6302 |
| b | one | six | -1.72628 | -0.907298 |
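For contrast, the same aggregation succeeds when the frame has only one object-dtype column besides the grouping key (a minimal sketch of that case; the column names are just illustrative):

```python
import numpy as np
import pandas as pd

# Only one object-dtype column ('key2') besides the grouping key:
# min() aggregates without hitting the failing assert.
df_ok = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                      'key2': ['one', 'two', 'one', 'two', 'one'],
                      'data1': np.random.randn(5)})
print(df_ok.groupby('key1').min())
```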
Output of `pd.show_versions()`:

```
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0
```
Thanks for the report. That assert is from https://github.com/pandas-dev/pandas/pull/29035 (cc @jbrockmendel)
We do `min` on object dtype, which is `NotImplemented` in Cython, so we fall back to the python agg. Then in

```python
result = cast(DataFrame, result)
# unwrap DataFrame to get array
assert len(result._data.blocks) == 1
result = result._data.blocks[0].values
if isinstance(result, np.ndarray) and result.ndim == 1:
    result = result.reshape(1, -1)
```

the `assert len(result._data.blocks) == 1` fails:
```
(Pdb) pp result
     key2 key3
key1
a     one  six
b     one  six
```
and we fall through to the `finally` with a DataFrame.
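The shape of the problem can be seen without touching internals: when the python-agg fallback aggregates a two-object-column selection in one go, the result is a DataFrame rather than a single 2-D array (a minimal sketch mirroring the example from the report):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six']})

# The fallback effectively does this for the object block: aggregate
# every column of the block with a python-level reduction. The result
# is a two-column object DataFrame, which internally may be stored as
# more than one block -- breaking the "one block in, one block out"
# assumption.
result = df.groupby('key1')[['key2', 'key3']].agg(lambda x: min(x))
print(type(result))
```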
FYI @marcevrard, we publish release candidates and nightly builds if you want to catch these before the release. You can watch pandas with "Releases only" on GitHub if nightly builds aren't an option.
I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks. That apparently isn't true for object blocks:

```
(Pdb) obj._data.blocks
(ObjectBlock: slice(0, 2, 1), 2 x 5, dtype: object,)
(Pdb) result._data.blocks
(ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object)
```
> I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks.
I guess `_split_and_operate` would have to be called in there somehow. The easiest solution would be to raise TypeError, which should restore the previous behavior. Longer-term we should probably handle that case without raising.
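One way to picture the longer-term handling (a hypothetical sketch, not the actual patch): aggregate the split result column by column and reassemble, instead of assuming that one block in means one block out:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six']})

# Aggregate each object column separately, then glue the pieces back
# together -- the per-column results are free to live in separate
# blocks, so no single-block assumption is needed.
pieces = {col: df.groupby('key1')[col].min() for col in ['key2', 'key3']}
reassembled = pd.concat(pieces, axis=1)
```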
Looking into this today.
Thank you for the quick fix; I confirm it works again with version 1.0.1.
I'm still seeing this error in pandas-1.0.1:
```python
df_mh_spc6 = df_mh_spc5.groupby(['bldg_id'], as_index=False, sort=False).max()
```
```
env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1378: in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1003: in _cython_agg_general
    agg_blocks, agg_items = self._cython_agg_blocks(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d88db20>, how = 'max'
alt = <function amax at 0x111901d30>, numeric_only = False, min_count = -1

    def _cython_agg_blocks(
        self, how: str, alt=None, numeric_only: bool = True, min_count: int = -1
    ) -> "Tuple[List[Block], Index]":
        # TODO: the actual managing of mgr_locs is a PITA
        # here, it should happen via BlockManager.combine
        data: BlockManager = self._get_data_to_aggregate()
        if numeric_only:
            data = data.get_numeric_data(copy=False)
        agg_blocks: List[Block] = []
        new_items: List[np.ndarray] = []
        deleted_items: List[np.ndarray] = []
        # Some object-dtype blocks might be split into List[Block[T], Block[U]]
        split_items: List[np.ndarray] = []
        split_frames: List[DataFrame] = []
        no_result = object()
        for block in data.blocks:
            # Avoid inheriting result from earlier in the loop
            result = no_result
            locs = block.mgr_locs.as_array
            try:
                result, _ = self.grouper.aggregate(
                    block.values, how, axis=1, min_count=min_count
                )
            except NotImplementedError:
                # generally if we have numeric_only=False
                # and non-applicable functions
                # try to python agg
                if alt is None:
                    # we cannot perform the operation
                    # in an alternate way, exclude the block
                    assert how == "ohlc"
                    deleted_items.append(locs)
                    continue
                # call our grouper again with only this block
                obj = self.obj[data.items[locs]]
                if obj.shape[1] == 1:
                    # Avoid call to self.values that can occur in DataFrame
                    # reductions; see GH#28949
                    obj = obj.iloc[:, 0]
                s = get_groupby(obj, self.grouper)
                try:
                    result = s.aggregate(lambda x: alt(x, axis=self.axis))
                except TypeError:
                    # we may have an exception in trying to aggregate
                    # continue and exclude the block
                    deleted_items.append(locs)
                    continue
                else:
                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    if len(result._data.blocks) != 1:
                        # We've split an object block! Everything we've assumed
                        # about a single block input returning a single block output
                        # is a lie. To keep the code-path for the typical non-split case
                        # clean, we choose to clean up this mess later on.
                        split_items.append(locs)
                        split_frames.append(result)
                        continue
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)
            assert not isinstance(result, DataFrame)
            if result is not no_result:
                # see if we can cast the block back to the original dtype
                result = maybe_downcast_numeric(result, block.dtype)
                if block.is_extension and isinstance(result, np.ndarray):
                    # e.g. block.values was an IntegerArray
                    # (1, N) case can occur if block.values was Categorical
                    # and result is ndarray[object]
                    assert result.ndim == 1 or result.shape[0] == 1
                    try:
                        # Cast back if feasible
                        result = type(block.values)._from_sequence(
                            result.ravel(), dtype=block.values.dtype
                        )
                    except ValueError:
                        # reshape to be valid for non-Extension Block
                        result = result.reshape(1, -1)
                agg_block: Block = block.make_block(result)
            new_items.append(locs)
            agg_blocks.append(agg_block)
        if not (agg_blocks or split_frames):
            raise DataError("No numeric types to aggregate")
        if split_items:
            # Clean up the mess left over from split blocks.
            for locs, result in zip(split_items, split_frames):
>               assert len(locs) == result.shape[1]
E               AssertionError

env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1110: AssertionError
```
Any idea why this might still be happening? @TomAugspurger
It worked fine until pandas-1.0.0.
Thanks!
@zking1219 if you have a minimal example I'd recommend opening a new issue.
I can see how that might help, I'll work on putting one together. Thanks!
I am still getting this error message with pandas 1.0.5. I switched back to 0.25.1 and it works just fine. My dataset is a little complicated and I don't have time to put together a minimal example right now, but I thought you would want to know that this still seems to be a problem.
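If you're stuck on an affected version, one possible user-level workaround (a hedged sketch; the helper name is made up, and `bldg_id` is taken from the snippet above) is to aggregate column by column, which sidesteps the block-wise path where the assertion fires:

```python
import pandas as pd

def groupby_max_per_column(df, key):
    # Hypothetical workaround: run max() one column at a time and
    # reassemble the results, avoiding the block-splitting code path.
    out = {col: df.groupby(key, sort=False)[col].max()
           for col in df.columns if col != key}
    return pd.concat(out, axis=1).reset_index()
```

This trades some speed for robustness, since each column is aggregated in a separate pass.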
Seeing the same error in pandas-1.1.0 as well.
I got the same error message. With pdb I managed to find out that the problem is in the 1465th row :D

This command does not raise an AssertionError (in pdb):

```python
datafile.head(1464).groupby("column_name").min()
```

But this one does:

```python
datafile.head(1465).groupby("column_name").min()
```

The 1465th row has 43 columns instead of 42. But when I deleted the extra column (i.e., the 43rd), nothing improved; I still get the same error message.
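The `head(n)` probing above can be automated. A hedged sketch (the helper name and interface are made up for illustration) that binary-searches for the smallest failing prefix, assuming failures are monotone in `n`:

```python
import pandas as pd

def smallest_failing_prefix(df, agg, exc=AssertionError):
    """Return the smallest n such that agg(df.head(n)) raises `exc`,
    or None if no prefix fails. Assumes that once a prefix fails,
    every longer prefix also fails."""
    def fails(n):
        try:
            agg(df.head(n))
            return False
        except exc:
            return True

    if not fails(len(df)):
        return None
    lo, hi = 1, len(df)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid          # failure already present at mid rows
        else:
            lo = mid + 1      # need more rows to trigger it
    return lo
```

For this issue, `agg` would be `lambda d: d.groupby("column_name").min()`.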
> Seeing the same error in the pandas-1.1.0 version as well
I did not see the error again in 1.1.0. I recompiled all the packages, and I also identified that the error had occurred because I had some NaN values in my data, which I have since fixed. On earlier versions I was not getting the assert error even with NaN present.
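If NaN in object columns is indeed the trigger in a given dataset, one hedged way to rule it out before aggregating (column names here are illustrative, not from the reports above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': ['x', np.nan, 'y']})

# Fill missing values in object-dtype columns before aggregating,
# so min/max only ever compare strings with strings.
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].fillna('')
result = df.groupby('key').min()
```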