Pandas: transform() drops groupby columns if not in index or if not numeric

Created on 29 Feb 2016  路  5Comments  路  Source: pandas-dev/pandas

unlike when I use .aggregate('mean'), I lose the groupedby columns and any nonnumeric columns when using transform

>>> d = {'a':['a', 'a', 'c', 'c'], 'b':['z','z','z','y'], 'dat':[4.2, 2.4, 3.5, 5.3], 'dat2':[1.3,2.6,3.4,4.1]}
>>> df = pd.DataFrame(d)
>>> zscore = lambda x: (x - x.mean()) / x.std()
>>> df.groupby('a').transform(zscore)
        dat      dat2
0  0.707107 -0.707107
1 -0.707107  0.707107
2 -0.707107 -0.707107
3  0.707107  0.707107

If I add 'a' to the index, and then transform, I still lose column b

>>> df = df.set_index('a')
>>> df.groupby(level='a').transform(zscore)
        dat      dat2 
a                     
a  0.707107 -0.707107 
a -0.707107  0.707107 
c -0.707107 -0.707107 
c  0.707107  0.707107 

The only way to keep non-numeric information columns is to add them all to the index, regardless of whether I need to group on them or not

>>> df = df.set_index(['a', 'b'])
>>> df.groupby(level='a').transform(zscore)
          dat      dat2 
a b                     
c z  0.993399 -0.475293 
  z  0.198680 -0.095059 
  z  0.198680  1.425880 
  y -1.390759 -0.855528 

Expected Output

I would expect the columns that I grouped on to remain, like they do when using aggregate('mean') for example.

   a    b        dat      dat2
0  a    z    0.707107 -0.707107
1  a    z    -0.707107  0.707107
2  c    z    -0.707107 -0.707107
3  c    y    0.707107  0.707107

output of pd.show_versions()

INSTALLED VERSIONS                                                               
------------------                                                               
commit: None                                                                     
python: 2.7.11.final.0                                                           
python-bits: 64                                                                  
OS: Windows                                                                      
OS-release: 7                                                                    
machine: AMD64                                                                   
processor: Intel64 Family 6 Model 70 Stepping 1, GenuineIntel                    
byteorder: little                                                                
LC_ALL: None                                                                     
LANG: None                                                                       

pandas: 0.17.1                                                                   
nose: 1.3.7                                                                      
pip: 8.0.2                                                                       
setuptools: 18.5                                                                 
Cython: 0.23.4                                                                   
numpy: 1.10.1                                                                    
scipy: 0.16.0                                                                    
statsmodels: 0.6.1                                                               
IPython: 4.0.1                                                                   
sphinx: 1.3.1                                                                    
patsy: 0.4.0                                                                     
dateutil: 2.4.2                                                                  
pytz: 2015.7                                                                     
blosc: None                                                                      
bottleneck: 1.0.0                                                                
tables: 3.2.2                                                                    
numexpr: 2.4.4                                                                   
matplotlib: 1.5.0                                                                
openpyxl: 2.2.6                                                                  
xlrd: 0.9.4                                                                      
xlwt: 1.0.0                                                                      
xlsxwriter: 0.7.7                                                                
lxml: 3.4.4                                                                      
bs4: 4.4.1                                                                       
html5lib: None                                                                   
httplib2: None                                                                   
apiclient: None                                                                  
sqlalchemy: 1.0.9                                                                
pymysql: 0.7.1.None                                                              
psycopg2: None                                                                   
Jinja2: None                                                                     
API Design Groupby Usage Question

Most helpful comment

I'm curious to see what others say, but in this case I prefer the explicit

pd.concat([
    df,
    df.groupby('a').transform(f).add_suffix('_transformed')
], axis=1)

If you were just reading the code

df.groupby('a').transform(myfunction)

You would have no way of knowing which columns get the functions applied or not since it depends on whether a TypeError was raised in the function (we should document this by the way, I don't see it in the docstring). At least with the current state of things you can look at the output, and see that b is not there, so clearly it hasn't been transformed.

This would also be an API breaking change, which raises the bar for acceptance. Still curious to hear others' thoughts.

All 5 comments

I think this is tricky. Right now we seem to run each column in something like a

try:
    function(x)
except TypeError:
    pass

while other kinds of exceptions are raised. This isn't great since TypeErrors can be raised for all sorts of reasons. It'd be terribly difficult to debug a problem where a TypeError was raised, but pandas swallowed it. Meanwhile you've gotten back your column of supposedly "transformed" values.

Also I don't think we keep around nuisance columns in aggregations (did this change recently?).

In [33]: df.groupby('a').agg('mean')
Out[33]:
   dat  dat2
a
a  3.3  1.95
c  4.4  3.75

Thanks for the response Tom,
Yeah, dropping nuisance columns in aggregations makes sense to me, but since transform doesn't do a reduce, I'd like to keep a string column around perhaps, since it's what is categorizing my data. (column 'b' in my example)

Also, when you do a groupby().agg(), if the groupby columns are not in the index, they get added to it. Could groupby().transform() mimic this behavior?

>>> df.groupby('a').agg('mean')
   dat  dat2                   
a                              
a  3.3  1.95                   
c  4.4  3.75                   
>>> df.groupby('a').transform(lambda x: x + 1) 
   dat  dat2                                   
0  5.2   2.3                                   
1  3.4   3.6                                   
2  4.5   4.4                                   
3  6.3   5.1                                   

I'm curious to see what others say, but in this case I prefer the explicit

pd.concat([
    df,
    df.groupby('a').transform(f).add_suffix('_transformed')
], axis=1)

If you were just reading the code

df.groupby('a').transform(myfunction)

You would have no way of knowing which columns get the functions applied or not since it depends on whether a TypeError was raised in the function (we should document this by the way, I don't see it in the docstring). At least with the current state of things you can look at the output, and see that b is not there, so clearly it hasn't been transformed.

This would also be an API breaking change, which raises the bar for acceptance. Still curious to hear others' thoughts.

this is correct and @TomAugspurger soln is the right one. By definition a transform returns a same indexed result to the original. The fact that the nuiscance columns are dropped is because they are not transformable.

This is NOT dropping the original index at all.

an easy way around is to keep a list of de_duplicated groupby column and add it later as sequence is preserved

Was this page helpful?
0 / 5 - 0 ratings