Pandas: BUG: Inconsistent behaviour when averaging Decimals, floats and ints

Created on 15 May 2020 · 8 comments · Source: pandas-dev/pandas

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.


from decimal import Decimal
import pandas as pd

df = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5.0, 10.0]})
df.mean(axis=1)  # returns 5, 10 -- ignoring the Decimal values in the averaging

df2 = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5, 10]})
df2.mean(axis=1)  # returns 3.25, 7 -- includes the Decimal values in the averaging

Problem description

There is inconsistent behaviour in how Decimal values are averaged, depending on whether they are combined with an int column or a float column. Is it expected that the two dataframes above return different results?

Expected Output

I would expect to see 3.25 and 7 as the row means in both cases.
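For what it's worth, the expected output can be obtained by casting the Decimal column to float first; a minimal workaround sketch (not part of the original report, and it assumes giving up Decimal precision is acceptable):

from decimal import Decimal
import pandas as pd

df = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5.0, 10.0]})

# Cast the Decimal (object-dtype) column to float64 so that both columns
# take part in the row-wise mean. Note: this gives up Decimal precision.
df['col_1'] = df['col_1'].astype(float)
df.mean(axis=1)  # 3.25 and 7.0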

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.3
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.16
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : None

Labels: Bug, Dtypes, Numeric

All 8 comments

I discovered the issue when I saw that .mean(axis=1) was averaging the Decimal values when I passed skipna=True but not with skipna=False. I haven't been able to reproduce this yet, but it might help understand what's going on.
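A minimal way to compare the two code paths (a sketch, not taken from the report; whether the outputs actually differ may depend on the pandas version):

from decimal import Decimal
import pandas as pd

df = pd.DataFrame({'col_1': [Decimal(1.5), Decimal(4.0)], 'col_2': [5.0, 10.0]})

# Compare the row means with and without skipna to see whether the
# Decimal column is only included in one of the two code paths.
print(df.mean(axis=1, skipna=True))
print(df.mean(axis=1, skipna=False))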

@cuchoi Thanks for the report and I'd agree this is odd. It looks like this is because Decimal and int can be added, whereas Decimal and float cannot (probably so as not to lose precision):

[ins] In [3]: Decimal(1.0) + 1
Out[3]: Decimal('2')

[ins] In [4]: Decimal(1.0) + 1.0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-70cba7894e40> in <module>
----> 1 Decimal(1.0) + 1.0

TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'float'

So maybe this is more of an unfortunate consequence (at least in this case) of the deliberate design of Decimal rather than a bug? (Of course it's then a bit arbitrary that the average ends up based on the float column rather than the Decimal one.)

Thanks for pointing that out. Indeed that makes sense. It is still a bit confusing that in Python adding a Decimal and an int yields a Decimal, while in pandas the result is a float, even though a float cannot be added to a Decimal.

Moreover, if I change one of the values to text, for example like this:

df = pd.DataFrame({'col_1': ["text", Decimal(4.0)], 'col_2': [5, 10]})
df.mean(axis=1) # now it returns returns 5, 10 -- ignoring the Decimal(4.0) in the averaging

Now the second row's average doesn't include the Decimal. Is this expected behaviour as well?

Now the second row's average doesn't include the Decimal. Is this expected behaviour as well?

Yes, that's expected / consistent with the above. It tries to do the averaging using all columns and, if that fails, falls back to only the "numeric" columns.
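One way to see (roughly) which columns would survive such a numeric-only fallback is select_dtypes; a sketch (not from the thread, and only an approximation of the internal numeric-only logic, since object columns holding Decimals are not a numeric dtype):

from decimal import Decimal
import pandas as pd

df = pd.DataFrame({'col_1': ["text", Decimal(4.0)], 'col_2': [5, 10]})

# Columns pandas sees as numeric dtypes; the object-dtype col_1 (which
# holds the Decimal) is excluded, so only col_2 would be averaged.
print(df.select_dtypes(include='number').columns)  # only col_2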

Would it make sense to either make the row's result NaN when skipna=False, or to have an option to raise an error if .mean() is not able to use all columns?

The goal would be to be able to know when columns are being ignored.

Would it make sense to either make the row's result NaN when skipna=False, or to have an option to raise an error if .mean() is not able to use all columns?

You can have it raise an error by setting numeric_only=False (this effectively disables the try / except behaviour that applies when numeric_only is left at its default of None):

[ins] In [1]: from decimal import Decimal                                                                                                                                                                    

[ins] In [2]: df = pd.DataFrame({"a": [Decimal(1), Decimal(2)], "b": [1.0, 2.0]})                                                                                                                            

[ins] In [3]: df.mean(axis=1, numeric_only=False)                                                                                                                                                            
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-4ee7aee33965> in <module>
----> 1 df.mean(axis=1, numeric_only=False)

~/pandas/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
  11286         if level is not None:
  11287             return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
> 11288         return self._reduce(
  11289             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11290         )

~/pandas/pandas/core/frame.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   8438             # After possibly _get_data and transposing, we are now in the
   8439             #  simple case where we can use BlockManager._reduce
-> 8440             res = df._mgr.reduce(blk_func)
   8441             assert isinstance(res, dict)
   8442             if len(res):

~/pandas/pandas/core/internals/managers.py in reduce(self, func, *args, **kwargs)
    336         res = {}
    337         for blk in self.blocks:
--> 338             bres = func(blk.values, *args, **kwargs)
    339 
    340             if np.ndim(bres) == 0:

~/pandas/pandas/core/frame.py in blk_func(values)
   8434                     return values._reduce(name, skipna=skipna, **kwds)
   8435                 else:
-> 8436                     return op(values, axis=1, skipna=skipna, **kwds)
   8437 
   8438             # After possibly _get_data and transposing, we are now in the

~/pandas/pandas/core/nanops.py in _f(*args, **kwargs)
     69             try:
     70                 with np.errstate(invalid="ignore"):
---> 71                     return f(*args, **kwargs)
     72             except ValueError as e:
     73                 # we want to transform an object array

~/pandas/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    127                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    128             else:
--> 129                 result = alt(values, axis=axis, skipna=skipna, **kwds)
    130 
    131             return result

~/pandas/pandas/core/nanops.py in nanmean(values, axis, skipna, mask)
    556         dtype_count = dtype
    557     count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 558     the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    559 
    560     if axis is not None and getattr(the_sum, "ndim", False):

~/opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     37          initial=_NoValue, where=True):
---> 38     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     39 
     40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'float'

Good to know, thanks! Should I close this issue?

@cuchoi I can close. Thanks for the report nonetheless, it's an interesting edge case
