z-scoring (centering a variable at its mean and dividing by its standard deviation) is a common statistical operation. However, it ends up being rather hard to do if one's data is represented by a pandas Series object.
There are a few things one might think of to try. The scipy.stats module has a zscore function, but applying it to a Series raises an exception (I think because it doesn't know what to do with a column vector? I'm not sure):
s = pd.Series(np.random.rand(10))
s.apply(stats.zscore)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-966-4f828190fd49> in <module>()
1 s = pd.Series(np.random.rand(10))
----> 2 s.apply(stats.zscore)
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2167 values = lib.map_infer(values, lib.Timestamp)
2168
-> 2169 mapped = lib.map_infer(values, f, convert=convert_dtype)
2170 if len(mapped) and isinstance(mapped[0], Series):
2171 from pandas.core.frame import DataFrame
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:62578)()
/Users/mwaskom/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in zscore(a, axis, ddof)
2105 """
2106 a = np.asanyarray(a)
-> 2107 mns = a.mean(axis=axis)
2108 sstd = a.std(axis=axis, ddof=ddof)
2109 if axis and mns.ndim < a.ndim:
/Users/mwaskom/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc in _mean(a, axis, dtype, out, keepdims)
54 arr = asanyarray(a)
55
---> 56 rcount = _count_reduce_items(arr, axis)
57 # Make this warning show up first
58 if rcount == 0:
/Users/mwaskom/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc in _count_reduce_items(arr, axis)
48 items = 1
49 for ax in axis:
---> 50 items *= arr.shape[ax]
51 return items
52
IndexError: tuple index out of range
Actually writing out the formula using a lambda returns a Series of nulls:
s.apply(lambda x: (x - x.mean()) / x.std())
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
And calling stats.zscore does not preserve the pandas metadata:
stats.zscore(s)
array([-0.3957885 , -0.89938219, 1.51650231, -1.01533894, 1.73900501,
-1.08299489, -0.90929288, 0.08315865, 0.80475036, 0.15938106])
This is at least partially scipy's fault (particularly for the last example), but is it possible to change something in pandas so that this is easier? I am not exactly sure what's causing the exception in the simplest option.
Perhaps the best thing to do would be to add a zscore method to `Series` objects?
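To make the proposal concrete, here is a minimal sketch of what such a method could do — the name, signature, and ddof default are my assumptions, not an existing pandas API:

```python
import numpy as np
import pandas as pd

def zscore(s, ddof=1):
    """Hypothetical Series.zscore: center at 0, scale to unit std.

    Unlike scipy.stats.zscore, this keeps the pandas index intact
    and uses ddof=1 to match the Series.std default.
    """
    return (s - s.mean()) / s.std(ddof=ddof)

s = pd.Series(np.random.rand(10))
z = zscore(s)
assert np.isclose(z.mean(), 0) and np.isclose(z.std(), 1)
assert (z.index == s.index).all()  # pandas metadata preserved
```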
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.17.1
nose: 1.3.7
pip: None
setuptools: 18.3.1
Cython: 0.20.1
numpy: 1.10.2
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
Jinja2: None
(s-s.mean())/s.std()
Not sure why I went to the trouble of writing out a useful issue report if you're not going to actually read it.

@mwaskom well your code is very different from Jeff's. Please don't dismiss it.
I think Michael's point was that a person from a stats background might initially just translate the formula to a function.
I've been against adding a normalize method since that word is overloaded, but zscore is relatively clear, if you're familiar with stats.
I will also point out that the ddof default is different in scipy's zscore than in pandas's std, just in case you (@mwaskom) didn't know.
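The ddof difference is easy to verify with numpy alone: scipy's zscore normalizes with ddof=0 by default, while Series.std uses ddof=1, so the two z-scores disagree slightly unless you call stats.zscore(s, ddof=1). A quick check of the two conventions:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10))

# pandas defaults to the sample std (ddof=1) ...
assert np.isclose(s.std(), s.values.std(ddof=1))
# ... while numpy (and scipy.stats.zscore) default to ddof=0
assert np.isclose(s.values.std(), s.values.std(ddof=0))

# Hence (s - s.mean()) / s.std() differs from the scipy default
z_sample = (s - s.mean()) / s.std()
z_population = (s - s.mean()) / s.values.std()
assert not np.allclose(z_sample, z_population)
```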
Isn't Series.apply supposed to first try passing the full Series to the function before applying it elementwise (which is what is happening here)?
@mwaskom you want to use .pipe instead of .apply:
In [24]: s.pipe(lambda x: (x - x.mean()) / x.std())
Out[24]:
0 -0.800491
1 -1.342080
2 0.504102
3 -0.844059
4 0.238954
5 -1.009755
6 -0.007397
7 1.707074
8 0.333148
9 1.220504
dtype: float64
Somewhat confusingly, .apply (on Series and DataFrame) is only for operations that aggregate out a dimension. I don't know why it gives an all-NA Series in this case instead of raising, though.
I posted the most idiomatic and correct way to do normalization.
In [1]: s = pd.Series(np.random.rand(10))
In [4]: (s-s.mean())/s.std()
Out[4]:
0 -0.237282
1 1.256156
2 1.268999
3 -0.230126
4 -0.944522
5 -1.463070
6 -0.822994
7 -0.540303
8 1.197952
9 0.515190
dtype: float64
# This is why you get all nans
In [6]: s[0]
Out[6]: 0.43126132923383709
In [7]: v = s[0]
In [8]: (v-v.mean())/v.std()
Out[8]: nan
.apply is the most mis(ab)used function. You don't need it at all here. Further, using .pipe here is silly.
Thanks @jreback, I'd never seen the formula for a z-score before, and I had no idea that you could do basic mathematical operations on Series objects, so that is very enlightening.
Somewhat confusingly, .apply (on Series and DataFrame) is only for operations that aggregate out a dimension.
@shoyer What do you mean by that? It works as expected on a DataFrame column (so without reducing a dimension):
In [28]: s.to_frame().apply(lambda x: (x - x.mean()) / x.std())
Out[28]:
0
0 -1.047008
1 0.969972
2 0.867152
3 0.176468
4 1.324789
5 0.659774
6 -0.297906
7 -1.647645
8 0.080457
9 -1.086053
I am always confused by DataFrame.apply, which has its good uses (in contrast with Series.apply). But I find it quite confusing that Series.apply does not do the same thing that DataFrame.apply does for each column/row.
In any case, the docstring of Series.apply is also not very illuminating.
Some further thoughts: say you want to use stats.zscore; pipe is also not an ideal solution:
In [38]: s.pipe(stats.zscore, ddof=1)
Out[38]:
array([-1.04700754, 0.96997194, 0.86715166, 0.17646793, 1.324789 ,
0.65977436, -0.29790601, -1.64764543, 0.08045739, -1.0860533 ])
While DataFrame.apply does what you want:
In [36]: s.to_frame().apply(stats.zscore, ddof=1)
Out[36]:
0
0 -1.047008
1 0.969972
2 0.867152
3 0.176468
4 1.324789
5 0.659774
6 -0.297906
7 -1.647645
8 0.080457
9 -1.086053
It would be nice if there is a way to perform such a function on a Series.
The function is applied to each (N-1)-D element of the N-D container (like the built-in map). to_frame() embeds the 1-D Series in a 2-D DataFrame.
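That mapping analogy can be checked directly: DataFrame.apply hands each column (a 1-D Series) to the function, just as Series.apply hands over each scalar element (the spy helper below is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.Series(np.random.rand(10)).to_frame()

seen = []
def spy(col):
    seen.append(type(col).__name__)
    return col

# each (N-1)-D element of a 2-D DataFrame is a 1-D Series
df.apply(spy)
assert set(seen) == {'Series'}
```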
@jorisvandenbossche OK, apparently I was confused here about how apply works. I guess it's more flexible than only doing aggregations.
And just to clarify for the record, the weirdness with NAs arises because s[0] returns a numpy.float64 object, which defines .mean() and .std() methods -- so we end up dividing by 0.
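That is, the scalar reductions are degenerate but legal, so no error is raised and the division quietly produces nan:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10))
v = s[0]                      # a single element, not a Series

assert isinstance(v, np.float64)
assert v.mean() == v          # mean of one value is itself
assert v.std() == 0.0         # std of one value is 0

# inside Series.apply the lambda therefore computes 0.0 / 0.0 -> nan
with np.errstate(invalid="ignore"):
    assert np.isnan((v - v.mean()) / v.std())
```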