Pandas: transform() drops groupby columns if not in index or if not numeric

Created on 29 Feb 2016 · 5Comments · Source: pandas-dev/pandas

unlike when I use .aggregate('mean'), I lose the groupedby columns and any nonnumeric columns when using transform

>>> d = {'a':['a', 'a', 'c', 'c'], 'b':['z','z','z','y'], 'dat':[4.2, 2.4, 3.5, 5.3], 'dat2':[1.3,2.6,3.4,4.1]}
>>> df = pd.DataFrame(d)
>>> zscore = lambda x: (x - x.mean()) / x.std()
>>> df.groupby('a').transform(zscore)
        dat      dat2
0  0.707107 -0.707107
1 -0.707107  0.707107
2 -0.707107 -0.707107
3  0.707107  0.707107

If I add 'a' to the index, and then transform, I still lose column b

>>> df = df.set_index('a')
>>> df.groupby(level='a').transform(zscore)
        dat      dat2 
a                     
a  0.707107 -0.707107 
a -0.707107  0.707107 
c -0.707107 -0.707107 
c  0.707107  0.707107

The only way to keep non-numeric information columns is to add them all to the index, regardless of whether I need to group on them or not

>>> df = df.set_index(['a', 'b'])
>>> df.groupby(level='a').transform(zscore)
          dat      dat2 
a b                     
c z  0.993399 -0.475293 
  z  0.198680 -0.095059 
  z  0.198680  1.425880 
  y -1.390759 -0.855528

Expected Output

I would expect the columns that I grouped on to remain, like they do when using aggregate('mean') for example.

   a    b        dat      dat2
0  a    z    0.707107 -0.707107
1  a    z    -0.707107  0.707107
2  c    z    -0.707107 -0.707107
3  c    y    0.707107  0.707107

output of `pd.show_versions()`

INSTALLED VERSIONS                                                               
------------------                                                               
commit: None                                                                     
python: 2.7.11.final.0                                                           
python-bits: 64                                                                  
OS: Windows                                                                      
OS-release: 7                                                                    
machine: AMD64                                                                   
processor: Intel64 Family 6 Model 70 Stepping 1, GenuineIntel                    
byteorder: little                                                                
LC_ALL: None                                                                     
LANG: None                                                                       

pandas: 0.17.1                                                                   
nose: 1.3.7                                                                      
pip: 8.0.2                                                                       
setuptools: 18.5                                                                 
Cython: 0.23.4                                                                   
numpy: 1.10.1                                                                    
scipy: 0.16.0                                                                    
statsmodels: 0.6.1                                                               
IPython: 4.0.1                                                                   
sphinx: 1.3.1                                                                    
patsy: 0.4.0                                                                     
dateutil: 2.4.2                                                                  
pytz: 2015.7                                                                     
blosc: None                                                                      
bottleneck: 1.0.0                                                                
tables: 3.2.2                                                                    
numexpr: 2.4.4                                                                   
matplotlib: 1.5.0                                                                
openpyxl: 2.2.6                                                                  
xlrd: 0.9.4                                                                      
xlwt: 1.0.0                                                                      
xlsxwriter: 0.7.7                                                                
lxml: 3.4.4                                                                      
bs4: 4.4.1                                                                       
html5lib: None                                                                   
httplib2: None                                                                   
apiclient: None                                                                  
sqlalchemy: 1.0.9                                                                
pymysql: 0.7.1.None                                                              
psycopg2: None                                                                   
Jinja2: None

API Design Groupby Usage Question

Source

iamdavehawkins

👍1

Most helpful comment

I'm curious to see what others say, but in this case I prefer the explicit

pd.concat([
    df,
    df.groupby('a').transform(f).add_suffix('_transformed')
], axis=1)

If you were just reading the code

df.groupby('a').transform(myfunction)

You would have no way of knowing which columns get the functions applied or not since it depends on whether a TypeError was raised in the function (we should document this by the way, I don't see it in the docstring). At least with the current state of things you can look at the output, and see that b is not there, so clearly it hasn't been transformed.

This would also be an API breaking change, which raises the bar for acceptance. Still curious to hear others' thoughts.

TomAugspurger on 2 Mar 2016

👍2

All 5 comments

I think this is tricky. Right now we seem to run each column in something like a

try:
    function(x)
except TypeError:
    pass

while other kinds of exceptions are raised. This isn't great since TypeErrors can be raised for all sorts of reasons. It'd be terribly difficult to debug a problem where a TypeError was raised, but pandas swallowed it. Meanwhile you've gotten back your column of supposedly "transformed" values.

Also I don't think we keep around nuisance columns in aggregations (did this change recently?).

In [33]: df.groupby('a').agg('mean')
Out[33]:
   dat  dat2
a
a  3.3  1.95
c  4.4  3.75

TomAugspurger on 1 Mar 2016

Thanks for the response Tom,
Yeah, dropping nuisance columns in aggregations makes sense to me, but since transform doesn't do a reduce, I'd like to keep a string column around perhaps, since it's what is categorizing my data. (column 'b' in my example)

Also, when you do a groupby().agg(), if the groupby columns are not in the index, they get added to it. Could groupby().transform() mimic this behavior?

>>> df.groupby('a').agg('mean')
   dat  dat2                   
a                              
a  3.3  1.95                   
c  4.4  3.75                   
>>> df.groupby('a').transform(lambda x: x + 1) 
   dat  dat2                                   
0  5.2   2.3                                   
1  3.4   3.6                                   
2  4.5   4.4                                   
3  6.3   5.1

iamdavehawkins on 2 Mar 2016

I'm curious to see what others say, but in this case I prefer the explicit

pd.concat([
    df,
    df.groupby('a').transform(f).add_suffix('_transformed')
], axis=1)

If you were just reading the code

df.groupby('a').transform(myfunction)

This would also be an API breaking change, which raises the bar for acceptance. Still curious to hear others' thoughts.

TomAugspurger on 2 Mar 2016

👍2

this is correct and @TomAugspurger soln is the right one. By definition a transform returns a same indexed result to the original. The fact that the nuiscance columns are dropped is because they are not transformable.

This is NOT dropping the original index at all.