Pandas: Unexpected results for the mean of a DataFrame of ufloat from the uncertainties package.

Created on 6 Sep 2016 · 14 comments · Source: pandas-dev/pandas

Related to #6898.

I find it very convenient to use a DataFrame of ufloat from the uncertainties package. Each entry consists of (value, error) and could represent the result of Monte Carlo simulations or an experiment.

At present taking sums along both axes gives the expected result, but taking the mean does not.

import pandas as pd
import numpy as np
from uncertainties import unumpy

value = np.arange(12).reshape(3,4)
err = 0.01 * np.arange(12).reshape(3,4) + 0.005

data = unumpy.uarray(value, err)

df = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3', 'c4'])

Examples:

print(df)
               c1             c2             c3             c4
r1  0.000+/-0.005  1.000+/-0.015  2.000+/-0.025  3.000+/-0.035
r2    4.00+/-0.04    5.00+/-0.06    6.00+/-0.07    7.00+/-0.08
r3    8.00+/-0.09    9.00+/-0.10   10.00+/-0.11   11.00+/-0.12

df.sum(axis=0) # This works

c1    12.00+/-0.10
c2    15.00+/-0.11
c3    18.00+/-0.13
c4    21.00+/-0.14
dtype: object

df.sum(axis=1) # This works

r1     6.00+/-0.05
r2    22.00+/-0.12
r3    38.00+/-0.20
dtype: object

df.mean(axis=0) # This does not work

Series([], dtype: float64)

Expected (`df.apply(lambda x: x.sum() / x.size)`)

c1    4.000+/-0.032
c2      5.00+/-0.04
c3      6.00+/-0.04
c4      7.00+/-0.05
dtype: object

df.mean(axis=1) # This does not work

r1   NaN
r2   NaN
r3   NaN
dtype: float64

Expected (`df.T.apply(lambda x: x.sum() / x.size)`)

r1    1.500+/-0.011
r2    5.500+/-0.031
r3      9.50+/-0.05
dtype: object
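
The expected values above can be checked without pandas. The following is a minimal sketch, assuming the entries are independent so that absolute errors add in quadrature. The `Measurement` class and the `mean` helper are hypothetical stand-ins written for this illustration; they are not how the uncertainties package is implemented (the real `ufloat` also tracks correlations between variables).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical stand-in for a ufloat: a value with an independent error."""
    value: float
    error: float

    def __add__(self, other):
        # Assumption: errors are independent, so they add in quadrature.
        return Measurement(self.value + other.value,
                           math.hypot(self.error, other.error))

    def __truediv__(self, n):
        # Dividing by an exact number scales the value and the error alike.
        return Measurement(self.value / n, self.error / abs(n))


def mean(items):
    """Mean as sum / count, which needs only + and / on the elements."""
    total = items[0]
    for item in items[1:]:
        total = total + item
    return total / len(items)


# Column c1 of the DataFrame in the report: values 0, 4, 8
# with errors 0.01 * value + 0.005.
c1 = [Measurement(v, 0.01 * v + 0.005) for v in (0, 4, 8)]
c1_mean = mean(c1)
print(f"{c1_mean.value:.3f}+/-{c1_mean.error:.3f}")  # prints 4.000+/-0.032
```

Under these assumptions the result agrees with the expected `c1` entry, which suggests that a mean over such objects needs nothing beyond addition and division by a count.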

Dtypes Enhancement Nuisance Columns Numeric Reductions

bgatessucks

👍3

Most helpful comment

Seen from the outside, it looks like in both cases pandas decrees that the result of mean() should be of type float64. In @rth's example, the NumPy array actually contains integers, which are converted to float64 (which is doable). In the case of uncertainties.UFloat numbers with uncertainty, forcing the result to float64 is mostly meaningless (it would discard the uncertainty), so mean() does not produce the expected result.

In contrast, as the original post shows, Pandas is more open on the data type of sum(), which is, correctly, object, for uncertainties.UFloat objects.

I think that it is desirable that since Pandas is able to sum(), it be able to get the mean() too (since the mean is not much more than a sum).

lebigot on 7 Sep 2016

👍6

All 14 comments

This is very much like #13446. Since pandas doesn't know that an uncertainty is numeric, it cannot deal with it, similar to Decimal.

Without a custom dtype, or special support baked into object dtypes, this is not supported.

If someone wanted to contribute this functionality, that would be great. Conceptually this is very easy, but there are lots of implementation details.

jreback on 6 Sep 2016

👍1

@jreback Do I understand correctly that there is nothing that the uncertainties module can do to solve this issue?

lebigot on 6 Sep 2016

I have no idea. If you want to dig in and see, that would be great.

jreback on 6 Sep 2016

A useful first step would be to see if you can reproduce the issue with numpy alone (not using pandas).

shoyer on 6 Sep 2016

@shoyer No issue with numpy alone:

import pandas as pd
import numpy as np
from uncertainties import unumpy

value = np.arange(12).reshape(3,4)
err = 0.01 * np.arange(12).reshape(3,4) + 0.005

data = unumpy.uarray(value, err)

df = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3', 'c4'])

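# Column means: computed through pandas, then with NumPy alone
print(df.apply(lambda x: x.sum() / x.size).values, "\n")

print(data.mean(axis=0), "\n")

# Row means: computed through pandas, then with NumPy alone
print(df.T.apply(lambda x: x.sum() / x.size).values, "\n")

print(data.mean(axis=1))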

bgatessucks on 6 Sep 2016

👍1

@bgatessucks what is the type/dtype of unumpy.uarray? Is it a numpy array with dtype=object?

shoyer on 6 Sep 2016

@shoyer

type(data) is <type 'numpy.ndarray'>.

bgatessucks on 6 Sep 2016

And data.dtype?

shoyer on 6 Sep 2016

I just wanted to be sure that you're not using subclassing or something else like that.

In any case, I think this is probably a pandas bug (but would need someone to work through/figure out). We should have a fallback implementation of mean (like NumPy's mean) that works on object arrays.

shoyer on 6 Sep 2016

👍1
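
The fallback idea mentioned above can be sketched in plain Python. This is only an illustration of the concept, not pandas or NumPy internals: `object_mean` is a hypothetical helper, and skipping `None` entries is an assumption that loosely mirrors `skipna=True`. It works for any elements that support `+` with each other and `/` by an integer, shown here with `Fraction` and `Decimal` from the standard library.

```python
from decimal import Decimal
from fractions import Fraction
from functools import reduce
import operator


def object_mean(values):
    """Hypothetical fallback: mean of objects that support + and / by an int.

    Missing entries (None) are skipped; an all-missing input gives None.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to average
    # Sum with the elements' own __add__, then divide by the count.
    return reduce(operator.add, present) / len(present)


# The result keeps the element type instead of being forced to float64.
print(object_mean([Fraction(1, 2), Fraction(3, 2), None]))
print(object_mean([Decimal("1.5"), Decimal("2.5")]))
```

The design choice here is to rely only on the elements' own arithmetic, so the result type follows from the data rather than being coerced to a float.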

@shoyer Sorry I had missed that:

data.dtype is object.

bgatessucks on 6 Sep 2016

For what it's worth, the same example as above works with a DataFrame initialized with a numpy array of dtype='object' containing integers.

import pandas as pd
import numpy as np
from IPython.display import display

data = np.arange(12).reshape(3,4).astype('object')

df = pd.DataFrame(data, index=['r1', 'r2', 'r3'],
                 columns=['c1', 'c2', 'c3', 'c4'], dtype='object')

display(df.sum(axis=0))
display(df.sum(axis=1))
display(df.mean(axis=0))
display(df.mean(axis=1))

So I guess that pandas is able to correctly infer in this case that an array of dtype="object" contains numbers (integers here), unlike with the array containing ufloat elements from the uncertainties package.

rth on 7 Sep 2016

Is there any news on this subject? Same problem here, with pandas version 1.0.1.