z-scoring (centering a variable at its mean and dividing by its standard deviation) is a common statistical operation. However, it ends up being rather hard to do if one's data is represented by a pandas Series object.
There are a few things one might think of to try. The scipy.stats module has a zscore function, but applying it to a Series raises an exception (I think because it doesn't know what to do with a column vector? I'm not sure):
s = pd.Series(np.random.rand(10))
s.apply(stats.zscore)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-966-4f828190fd49> in <module>()
1 s = pd.Series(np.random.rand(10))
----> 2 s.apply(stats.zscore)
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2167 values = lib.map_infer(values, lib.Timestamp)
2168
-> 2169 mapped = lib.map_infer(values, f, convert=convert_dtype)
2170 if len(mapped) and isinstance(mapped[0], Series):
2171 from pandas.core.frame import DataFrame
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:62578)()
/Users/mwaskom/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in zscore(a, axis, ddof)
2105 """
2106 a = np.asanyarray(a)
-> 2107 mns = a.mean(axis=axis)
2108 sstd = a.std(axis=axis, ddof=ddof)
2109 if axis and mns.ndim < a.ndim:
/Users/mwaskom/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc in _mean(a, axis, dtype, out, keepdims)
54 arr = asanyarray(a)
55
---> 56 rcount = _count_reduce_items(arr, axis)
57 # Make this warning show up first
58 if rcount == 0:
/Users/mwaskom/anaconda/lib/python2.7/site-packages/numpy/core/_methods.pyc in _count_reduce_items(arr, axis)
48 items = 1
49 for ax in axis:
---> 50 items *= arr.shape[ax]
51 return items
52
IndexError: tuple index out of range
Actually writing out the formula using a lambda returns a Series of nulls:
s.apply(lambda x: (x - x.mean()) / x.std())
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
And calling stats.zscore does not preserve the pandas metadata:
stats.zscore(s)
array([-0.3957885 , -0.89938219, 1.51650231, -1.01533894, 1.73900501,
-1.08299489, -0.90929288, 0.08315865, 0.80475036, 0.15938106])
This is at least partially scipy's fault (particularly for the last example), but is it possible to change something in pandas so that this is easier? I am not exactly sure what's causing the exception in the simplest option.
Perhaps the best thing to do would be to add a zscore method to `Series` objects?
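To make the proposal concrete, here is a minimal sketch of what such a method could do — the name, signature, and ddof default are my assumptions, not an existing pandas API:

```python
import numpy as np
import pandas as pd

def zscore(s, ddof=1):
    """Hypothetical Series.zscore: center at 0, scale to unit std.

    Unlike scipy.stats.zscore, this keeps the pandas index intact
    and uses ddof=1 to match the Series.std default.
    """
    return (s - s.mean()) / s.std(ddof=ddof)

s = pd.Series(np.random.rand(10))
z = zscore(s)
assert np.isclose(z.mean(), 0) and np.isclose(z.std(), 1)
assert (z.index == s.index).all()  # pandas metadata preserved
```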
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.17.1
nose: 1.3.7
pip: None
setuptools: 18.3.1
Cython: 0.20.1
numpy: 1.10.2
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
Jinja2: None
(s-s.mean())/s.std()
Not sure why I went to the trouble of writing out a useful issue report if you're not going to actually read it.

@mwaskom well your code is very different from Jeff's. Please don't dismiss it.
I think Michael's point was that a person from a stats background might initially just translate the formula to a function.
I've been against adding a normalize method since that word is overloaded, but zscore is relatively clear, if you're familiar with stats.
I will also point out that the ddof default is different in scipy's zscore than in pandas's std, just in case you (@mwaskom) didn't know.
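The ddof difference is easy to verify with numpy alone: scipy's zscore normalizes with ddof=0 by default, while Series.std uses ddof=1, so the two z-scores disagree slightly unless you call stats.zscore(s, ddof=1). A quick check of the two conventions:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10))

# pandas defaults to the sample std (ddof=1) ...
assert np.isclose(s.std(), s.values.std(ddof=1))
# ... while numpy (and scipy.stats.zscore) default to ddof=0
assert np.isclose(s.values.std(), s.values.std(ddof=0))

# Hence (s - s.mean()) / s.std() differs from the scipy default
z_sample = (s - s.mean()) / s.std()
z_population = (s - s.mean()) / s.values.std()
assert not np.allclose(z_sample, z_population)
```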
Isn't Series.apply supposed to first try passing the full Series to the function before applying it elementwise (which is what is happening here)?
@mwaskom you want to use .pipe instead of .apply:
In [24]: s.pipe(lambda x: (x - x.mean()) / x.std())
Out[24]:
0 -0.800491
1 -1.342080
2 0.504102
3 -0.844059
4 0.238954
5 -1.009755
6 -0.007397
7 1.707074
8 0.333148
9 1.220504
dtype: float64
Somewhat confusingly, .apply (on Series and DataFrame) is only for operations that aggregate out a dimension. I don't know why it gives an all-NA Series in this case instead of raising, though.
I posted the most idiomatic and correct way to do normalization.
In [1]: s = pd.Series(np.random.rand(10))
In [4]: (s-s.mean())/s.std()
Out[4]:
0 -0.237282
1 1.256156
2 1.268999
3 -0.230126
4 -0.944522
5 -1.463070
6 -0.822994
7 -0.540303
8 1.197952
9 0.515190
dtype: float64
# This is why you get all nans
In [6]: s[0]
Out[6]: 0.43126132923383709
In [7]: v = s[0]
In [8]: (v-v.mean())/v.std()
Out[8]: nan
.apply is the most mis(ab)used function. You don't need it at all here. Further, using .pipe here is silly.
Thanks @jreback, I'd never seen the formula for a z-score before, and I had no idea that you could do basic mathematical operations on Series objects, so that is very enlightening.
Somewhat confusingly, .apply (on Series and DataFrame) is only for operations that aggregate out a dimension.
@shoyer What do you mean by that? It works as expected on a DataFrame column (so without reducing a dimension):
In [28]: s.to_frame().apply(lambda x: (x - x.mean()) / x.std())
Out[28]:
0
0 -1.047008
1 0.969972
2 0.867152
3 0.176468
4 1.324789
5 0.659774
6 -0.297906
7 -1.647645
8 0.080457
9 -1.086053
I am always confused by DataFrame.apply, which has its good uses (in contrast with Series.apply). But I find it quite confusing that Series.apply does not do the same thing that DataFrame.apply does for each column/row.
In any case, the docstring of Series.apply is also not very illuminating.
Some further thoughts: say you want to use stats.zscore; pipe is also not an ideal solution:
In [38]: s.pipe(stats.zscore, ddof=1)
Out[38]:
array([-1.04700754, 0.96997194, 0.86715166, 0.17646793, 1.324789 ,
0.65977436, -0.29790601, -1.64764543, 0.08045739, -1.0860533 ])
While DataFrame.apply does what you want:
In [36]: s.to_frame().apply(stats.zscore, ddof=1)
Out[36]:
0
0 -1.047008
1 0.969972
2 0.867152
3 0.176468
4 1.324789
5 0.659774
6 -0.297906
7 -1.647645
8 0.080457
9 -1.086053
It would be nice if there is a way to perform such a function on a Series.
The function is applied to each (N-1)-D element of the N-D container (like the built-in map). to_frame() embeds the 1-D Series in a 2-D DataFrame.
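That mapping analogy can be checked directly: DataFrame.apply hands each column (a 1-D Series) to the function, just as Series.apply hands over each scalar element (the spy helper below is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.Series(np.random.rand(10)).to_frame()

seen = []
def spy(col):
    seen.append(type(col).__name__)
    return col

# each (N-1)-D element of a 2-D DataFrame is a 1-D Series
df.apply(spy)
assert set(seen) == {'Series'}
```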
@jorisvandenbossche OK, apparently I was confused here about how apply works. I guess it's more flexible than only doing aggregations.
And just to clarify for the record, the weirdness with NAs arises because s[0] returns a numpy.float64 object, which defines .mean() and .std() methods -- so we end up dividing by 0.
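That is, the scalar reductions are degenerate but legal, so no error is raised and the division quietly produces nan:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10))
v = s[0]                      # a single element, not a Series

assert isinstance(v, np.float64)
assert v.mean() == v          # mean of one value is itself
assert v.std() == 0.0         # std of one value is 0

# inside Series.apply the lambda therefore computes 0.0 / 0.0 -> nan
with np.errstate(invalid="ignore"):
    assert np.isnan((v - v.mean()) / v.std())
```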