This is my first issue on GitHub, so apologies in advance if there's something wrong with the format.
My issue does not have any expected output; I just want to understand whether, and why, the Series.transform() method is not redundant. Overall, the transform() methods are very similar to the apply() methods, and as I was trying to figure out what the difference between them is (this Stack Overflow topic was helpful), I managed to pinpoint 3 primary differences:
1) When the DataFrame is grouped on several categories, apply() sends the entire sub-DataFrames within the function, while transform() sends each column of each sub-DataFrame separately. That's why columns can't access values in other columns within transform();
2) When the input passed to the function is an iterable of a certain length, apply() can still return output of any length, while transform() must return an iterable of the same length as the input;
3) When the function outputs a scalar, apply() returns that scalar, while transform() propagates that scalar to the iterable of the input length.
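As a minimal sketch of difference 3, assuming a small DataFrame shaped like the one used later in this issue (the names `per_group` and `broadcast` are only illustrative), a reducing function gives one value per group under apply() and a row-aligned result under transform():

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Texas", "Texas", "Florida", "Florida"],
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
})

grouped = df.groupby("State")["a"]

# apply() with a reducing function: one scalar per group, indexed by State.
per_group = grouped.apply(lambda s: s.sum())

# transform() with the same function: each group's scalar is repeated
# for every row of that group, so the result lines up with df's index.
broadcast = grouped.transform(lambda s: s.sum())

print(per_group)
print(broadcast)
```

Here `per_group` has one entry per state, while `broadcast` has the same length as `df`, which is what makes it convenient to assign back as a new column.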
I conducted a series of experiments that test these three differences on each applicable pandas object type: Series, DataFrame, SeriesGroupBy, and DataFrameGroupBy. I can send my ipynb with the code and the results if necessary, but it would be sufficient to just look at the conclusion for the Series type:
1 – not applicable. In both cases the function has a scalar input.
2 – not applicable. No matter what the function returns, in both cases the result is assigned to the single cell, even if it means entire DataFrames within cells of a Series.
3 – not applicable. The input length is always "1" (it's considered "1" even when it's an iterable), so there's no need to propagate.
Inapplicability of 1 is self-explanatory. But 2 was a surprise. Below is the code I tried:
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def return_df(x):
    return pd.DataFrame([[4, 5], [3, 2]])

def return_series(x):
    return pd.Series([1, 2])

df['a'].transform(return_df)
df['a'].transform(return_series)
If you try this code, you'll see that it doesn't matter what the function returns. Whatever it is, it will be put inside the single Series cell in its entirety. Is this behavior intentional? It results in the output size being predetermined by the input size, so all the size checks that Series.transform() has within itself become redundant. I can't imagine any situation where Series.transform() could behave in a different way from Series.apply(). And that raises the question I posed: why does Series.transform() exist?
Your observation 1 is wrong. Series.transform can also take a function that takes a Series. Your problem is that in your examples the return values have only two rows while your df had 4.
In [20]: df.a.transform(lambda x: (x - x.mean()) / x.std())
Out[20]:
0 0.439155
1 1.024695
2 -1.317465
3 -0.146385
Name: a, dtype: float64
And you can also do multiple transformers:
In [27]: df.a.transform([np.sqrt, np.exp])
Out[27]:
sqrt exp
0 2.000000 54.598150
1 2.236068 148.413159
2 1.000000 2.718282
3 1.732051 20.085537
Neither of those is available with apply.
@MarcoGorelli The question was about Series.transform, which does not allow aggregator broadcasting, unlike SeriesGroupBy.transform (for which aggregator broadcasting is the main use case).
In [3]: s1.transform('sum')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-6db3fc2c8d83> in <module>
----> 1 s1.transform('sum')
C:\Miniconda3\envs\bleeding\lib\site-packages\pandas\core\series.py in transform(self, func, axis, *args, **kwargs)
3715 # Validate the axis parameter
3716 self._get_axis_number(axis)
-> 3717 return super().transform(func, *args, **kwargs)
3718
3719 def apply(self, func, convert_dtype=True, args=(), **kwds):
C:\Miniconda3\envs\bleeding\lib\site-packages\pandas\core\generic.py in transform(self, func, *args, **kwargs)
10427 result = self.agg(func, *args, **kwargs)
10428 if is_scalar(result) or len(result) != len(self):
-> 10429 raise ValueError("transforms cannot produce aggregated results")
10430
10431 return result
ValueError: transforms cannot produce aggregated results
Admittedly it's unclear to me why Series.transform does not support aggregator broadcasting, since I thought the point of adding agg, transform, and apply was to mimic the groupby versions.
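To make the contrast concrete, here is a small sketch. The Series `s1` and the grouping keys below are stand-ins built for illustration (the original `s1` is not shown in this comment), and the code only checks that a ValueError is raised, since the exact message may differ between pandas versions:

```python
import pandas as pd

# Hypothetical stand-in for s1; the original definition is not shown above.
s1 = pd.Series([4, 5, 1, 3], name="a")
keys = pd.Series(["Texas", "Texas", "Florida", "Florida"], name="State")

# Series.transform rejects a reducing function such as 'sum'.
try:
    s1.transform("sum")
    series_error = None
except ValueError as exc:
    series_error = exc

# SeriesGroupBy.transform broadcasts each group's sum back to its rows.
group_sums = s1.groupby(keys).transform("sum")

print(type(series_error).__name__)
print(group_sums.tolist())
```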
@Liam3851 yes, you're right - will delete then as it's not relevant
@Liam3851 I tried the first snippet. Fascinating. So the function treats the first "x" as a scalar but simultaneously treats the second "x" as a Series? How does it know which is which? Admittedly, when I tried to see what exactly gets passed into the function, I did notice that there were two prints, and the second one was the entire Series for some reason. But it's still not clear to me how this works. It's true that this doesn't work with apply() though; I've just checked. So they are different after all.
Regarding your second example, apply() seems to work in my case.
@Liam3851 I just realized that the first "x" is a Series too, and then the function just returns the resulting Series once. So I assume that it first tries to work with "x" as a scalar and, when that fails, passes the entire Series and works with "x" as a Series?
And now it's not clear to me why we need to use transform() here if we can do the same without using any methods at all.
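One way to check this is to record what the function actually receives. The sketch below assumes nothing about pandas internals beyond what it prints; the helper `zscore` and the list `seen` are made up for illustration, and the exact sequence of calls may vary between pandas versions:

```python
import pandas as pd

s = pd.Series([4, 5, 1, 3], name="a")
seen = []  # type names of every argument the function is called with

def zscore(x):
    seen.append(type(x).__name__)
    # Only works when x has .mean() and .std(), i.e. when x is a Series.
    return (x - x.mean()) / x.std()

via_transform = s.transform(zscore)
calls_during_transform = list(seen)

# The same computation without transform(), for comparison.
direct = zscore(s)

print(calls_during_transform)
print(via_transform.equals(direct))
```

If the printed list shows a scalar type followed by Series, that would be consistent with the guess above: the element-wise attempt fails and the whole Series is passed instead. The final result is the same as calling the function on the Series directly.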
This was implemented in #14668; you can see the reasoning in that and the associated notes if you want to see the history. Simply put, transform allows multiple functions as input.
Closing as I don't know if there is anything to be done here, but if you disagree can certainly reopen. Thanks!
@WillAyd Proof that the methods can behave differently after all has been given, so the issue can be considered closed, yes. But if possible, I'd still like to know a) how transform() decides whether to pass a scalar or a Series, and b) what the purpose of passing a Series is if whatever the function does can be done directly on the Series without using transform() at all.
I appreciate your link to the relevant pull request, but I'm new to GitHub, so it's not quite clear to me what exactly I should look/click at to get answers to these two questions. I looked through all the messages that mention transform(), but it didn't really make things clear to me. I'm sorry.