[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
```python
import pandas as pd
from decimal import Decimal

df = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5.0, 10.0]})
df.mean(axis=1)  # returns 5, 10 -- ignores the Decimal values in the averaging

df2 = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5, 10]})
df2.mean(axis=1)  # returns 3.25, 7 -- includes the Decimal values in the averaging
```
The behaviour of how `Decimal` values are averaged is inconsistent: it depends on whether they are averaged with an int or a float column. Is it expected that the two dataframes above return different results?
I would expect to see 3.25 and 7 as the row means in both cases.
Output of `pd.show_versions()`:

```
commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.3
numpy : 1.18.3
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.16
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : None
```
I discovered the issue when I saw that `.mean(axis=1)` was averaging the Decimal types when I selected `skipna=True` but not when doing `skipna=False`. I haven't been able to reproduce this yet, but it might help understand what's going on.
@cuchoi Thanks for the report and I'd agree this is odd. It looks like this is because `Decimal` and `int` add together whereas `Decimal` and `float` do not (probably so as not to lose precision):
```python
[ins] In [3]: Decimal(1.0) + 1
Out[3]: Decimal('2')

[ins] In [4]: Decimal(1.0) + 1.0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-70cba7894e40> in <module>
----> 1 Decimal(1.0) + 1.0

TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'float'
```
So maybe this is more an unfortunate consequence (at least in this case) of the deliberate design of `Decimal` than a bug? (Of course it's then a bit arbitrary that the average comes from the float column rather than the Decimal one.)
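The same asymmetry shows up with plain Python `sum` over a mixed list, which is roughly what happens when pandas reduces an object-dtype row (a minimal sketch, not pandas' actual code path):

```python
from decimal import Decimal

# sum() starts from int 0, so a Decimal/int mix stays in exact Decimal arithmetic:
print(sum([Decimal('1.5'), 5]))   # 6.5

# A Decimal/float mix raises, because Decimal refuses implicit float arithmetic:
try:
    sum([Decimal('1.5'), 5.0])
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'decimal.Decimal' and 'float'
```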
Thanks for pointing that out; that does make sense. It is still a bit confusing that in Python the resulting object is a `Decimal`, but in pandas it is a `float`, even though a float cannot be averaged with a Decimal.
Moreover, if I change one of the values to text, for example like this:

```python
df = pd.DataFrame({'col_1': ["text", Decimal(4.0)], 'col_2': [5, 10]})
df.mean(axis=1)  # now returns 5, 10 -- ignoring the Decimal(4.0) in the averaging
```
Now the second row's average no longer includes the Decimal. Is this expected behaviour as well?

> Now the second row's average no longer includes the Decimal. Is this expected behaviour as well?
Yes, that's expected / consistent with the above. pandas tries to do the averaging across all columns and, if that fails, falls back on only the "numeric" columns.
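That try-then-fall-back behaviour can be sketched roughly as follows (a simplification for illustration, not the real pandas internals; `row_mean_with_fallback` is a hypothetical helper):

```python
from decimal import Decimal
import pandas as pd

def row_mean_with_fallback(df):
    """Rough sketch of the default numeric_only=None behaviour for mean(axis=1)."""
    try:
        # First attempt: average each row across all columns, whatever their types.
        return df.apply(lambda row: sum(row) / len(row), axis=1)
    except TypeError:
        # On failure, silently fall back to the numeric-dtype columns only.
        return df.select_dtypes(include="number").mean(axis=1)

df = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5.0, 10.0]})
print(row_mean_with_fallback(df).tolist())   # [5.0, 10.0] -- Decimal column dropped

df2 = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5, 10]})
print(row_mean_with_fallback(df2).tolist())  # [Decimal('3.25'), Decimal('7')]
```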
Would it make sense to either make the result of the row a NaN when `skipna=False`, or to have an option to raise an error if `.mean()` is not able to use all rows?
The goal would be to be able to know when columns are being ignored.
> Would it make sense to either make the result of the row a NaN when `skipna=False`, or to have an option to raise an error if `.mean()` is not able to use all rows?
You can have it raise an error by setting `numeric_only=False` (this effectively disables the try/except behaviour you get when it's left at its default of `None`):
```python
[ins] In [1]: from decimal import Decimal

[ins] In [2]: df = pd.DataFrame({"a": [Decimal(1), Decimal(2)], "b": [1.0, 2.0]})

[ins] In [3]: df.mean(axis=1, numeric_only=False)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-4ee7aee33965> in <module>
----> 1 df.mean(axis=1, numeric_only=False)

~/pandas/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
  11286         if level is not None:
  11287             return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
> 11288         return self._reduce(
  11289             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11290         )

~/pandas/pandas/core/frame.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   8438         # After possibly _get_data and transposing, we are now in the
   8439         # simple case where we can use BlockManager._reduce
-> 8440         res = df._mgr.reduce(blk_func)
   8441         assert isinstance(res, dict)
   8442         if len(res):

~/pandas/pandas/core/internals/managers.py in reduce(self, func, *args, **kwargs)
    336         res = {}
    337         for blk in self.blocks:
--> 338             bres = func(blk.values, *args, **kwargs)
    339
    340             if np.ndim(bres) == 0:

~/pandas/pandas/core/frame.py in blk_func(values)
   8434             return values._reduce(name, skipna=skipna, **kwds)
   8435         else:
-> 8436             return op(values, axis=1, skipna=skipna, **kwds)
   8437
   8438         # After possibly _get_data and transposing, we are now in the

~/pandas/pandas/core/nanops.py in _f(*args, **kwargs)
     69         try:
     70             with np.errstate(invalid="ignore"):
---> 71                 return f(*args, **kwargs)
     72         except ValueError as e:
     73             # we want to transform an object array

~/pandas/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    127                 result = alt(values, axis=axis, skipna=skipna, **kwds)
    128         else:
--> 129             result = alt(values, axis=axis, skipna=skipna, **kwds)
    130
    131         return result

~/pandas/pandas/core/nanops.py in nanmean(values, axis, skipna, mask)
    556         dtype_count = dtype
    557     count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 558     the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    559
    560     if axis is not None and getattr(the_sum, "ndim", False):

~/opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     37          initial=_NoValue, where=True):
---> 38     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     39
     40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'float'
```
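As a practical workaround (not suggested in the thread, just a sketch): cast the Decimal columns to float explicitly before averaging, so the precision loss is a visible, deliberate step and no silent fallback can occur:

```python
from decimal import Decimal
import pandas as pd

df = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5.0, 10.0]})

# Explicit cast: every column is float64 before the reduction runs.
means = df.astype(float).mean(axis=1)
print(means.tolist())  # [3.25, 7.0]
```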
Good to know, thanks! Should I close this issue?
@cuchoi I can close. Thanks for the report nonetheless; it's an interesting edge case.