unlike when I use .aggregate('mean'), I lose the groupedby columns and any nonnumeric columns when using transform
>>> d = {'a':['a', 'a', 'c', 'c'], 'b':['z','z','z','y'], 'dat':[4.2, 2.4, 3.5, 5.3], 'dat2':[1.3,2.6,3.4,4.1]}
>>> df = pd.DataFrame(d)
>>> zscore = lambda x: (x - x.mean()) / x.std()
>>> df.groupby('a').transform(zscore)
dat dat2
0 0.707107 -0.707107
1 -0.707107 0.707107
2 -0.707107 -0.707107
3 0.707107 0.707107
If I add 'a' to the index, and then transform, I still lose column b
>>> df = df.set_index('a')
>>> df.groupby(level='a').transform(zscore)
dat dat2
a
a 0.707107 -0.707107
a -0.707107 0.707107
c -0.707107 -0.707107
c 0.707107 0.707107
The only way to keep non-numeric information columns is to add them all to the index, regardless of whether I need to group on them or not
>>> df = df.set_index(['a', 'b'])
>>> df.groupby(level='a').transform(zscore)
dat dat2
a b
c z 0.993399 -0.475293
z 0.198680 -0.095059
z 0.198680 1.425880
y -1.390759 -0.855528
I would expect the columns that I grouped on to remain, like they do when using aggregate('mean') for example.
a b dat dat2
0 a z 0.707107 -0.707107
1 a z -0.707107 0.707107
2 c z -0.707107 -0.707107
3 c y 0.707107 0.707107
pd.show_versions()INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 70 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.1
nose: 1.3.7
pip: 8.0.2
setuptools: 18.5
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.7.1.None
psycopg2: None
Jinja2: None
I think this is tricky. Right now we seem to run each column in something like a
try:
function(x)
except TypeError:
pass
while other kinds of exceptions are raised. This isn't great since TypeErrors can be raised for all sorts of reasons. It'd be terribly difficult to debug a problem where a TypeError was raised, but pandas swallowed it. Meanwhile you've gotten back your column of supposedly "transformed" values.
Also I don't think we keep around nuisance columns in aggregations (did this change recently?).
In [33]: df.groupby('a').agg('mean')
Out[33]:
dat dat2
a
a 3.3 1.95
c 4.4 3.75
Thanks for the response Tom,
Yeah, dropping nuisance columns in aggregations makes sense to me, but since transform doesn't do a reduce, I'd like to keep a string column around perhaps, since it's what is categorizing my data. (column 'b' in my example)
Also, when you do a groupby().agg(), if the groupby columns are not in the index, they get added to it. Could groupby().transform() mimic this behavior?
>>> df.groupby('a').agg('mean')
dat dat2
a
a 3.3 1.95
c 4.4 3.75
>>> df.groupby('a').transform(lambda x: x + 1)
dat dat2
0 5.2 2.3
1 3.4 3.6
2 4.5 4.4
3 6.3 5.1
I'm curious to see what others say, but in this case I prefer the explicit
pd.concat([
df,
df.groupby('a').transform(f).add_suffix('_transformed')
], axis=1)
If you were just reading the code
df.groupby('a').transform(myfunction)
You would have no way of knowing which columns get the functions applied or not since it depends on whether a TypeError was raised in the function (we should document this by the way, I don't see it in the docstring). At least with the current state of things you can look at the output, and see that b is not there, so clearly it hasn't been transformed.
This would also be an API breaking change, which raises the bar for acceptance. Still curious to hear others' thoughts.
this is correct and @TomAugspurger soln is the right one. By definition a transform returns a same indexed result to the original. The fact that the nuiscance columns are dropped is because they are not transformable.
This is NOT dropping the original index at all.
an easy way around is to keep a list of de_duplicated groupby column and add it later as sequence is preserved
Most helpful comment
I'm curious to see what others say, but in this case I prefer the explicit
If you were just reading the code
You would have no way of knowing which columns get the functions applied or not since it depends on whether a TypeError was raised in the function (we should document this by the way, I don't see it in the docstring). At least with the current state of things you can look at the output, and see that
bis not there, so clearly it hasn't been transformed.This would also be an API breaking change, which raises the bar for acceptance. Still curious to hear others' thoughts.