Consider the following example:
import numpy as np
import pandas as pd
a = np.random.rand(100)
b = np.floor(np.random.rand(100)*100)
df = pd.DataFrame({'a' : a , 'b' : b})
grp = df.groupby(df.b)
I have grouped the values in a by b.
Now, if I want to plot the trend over the groups with mean and std, I would expect to be able to do
grp.a.agg([np.mean, lambda x : np.mean(x) + np.std(x) , lambda x : np.mean(x) - np.std(x) ]).plot()
which gives me
SpecificationError: Function names must be unique, found multiple named <lambda>
while
grp.a.agg([np.mean, lambda x : np.mean(x) + np.std(x) ]).plot()
which has just one lambda works ok.
Is this a bug?
To make this work I had to define real functions (i.e. with def) and pass those to agg.
You can specify a dictionary; this requires named columns. I suppose a list of lambdas could be made to work, and I'm not 100% sure why it was done this way. (It needs unique function names because the results are returned as a dictionary; in theory they could be returned as a list that simply creates the columns.)
In [27]: grp.a.agg({'one' : np.mean, 'two' : lambda x : np.mean(x) + np.std(x) , 'three' : lambda x : np.mean(x) - np.std(x) })
Out[27]:
three two one
b
-253 0.156897 0.156897 0.156897
-216 0.452120 0.452120 0.452120
-191 0.893074 0.893074 0.893074
-178 1.170801 1.170801 1.170801
-177 -1.324476 -1.324476 -1.324476
-162 0.835708 1.241353 1.038531
-156 -1.220583 -1.220583 -1.220583
-147 -2.301474 -2.301474 -2.301474
-136 -1.125749 -1.125749 -1.125749
-133 -0.398064 -0.398064 -0.398064
-132 0.011879 0.011879 0.011879
-129 -0.257017 -0.257017 -0.257017
-114 0.795851 0.795851 0.795851
-113 -1.697932 -1.697932 -1.697932
-111 -0.309536 -0.309536 -0.309536
-110 -0.031828 -0.031828 -0.031828
-94 -0.391354 -0.391354 -0.391354
-87 -0.010518 0.551286 0.270384
-85 -0.711772 -0.711772 -0.711772
-77 -0.147718 -0.106666 -0.127192
-73 -0.796055 0.985810 0.094878
-68 -0.249214 -0.249214 -0.249214
-65 0.897349 0.897349 0.897349
-64 -0.151405 -0.014542 -0.082973
-60 -0.305136 -0.305136 -0.305136
-52 0.084092 0.084092 0.084092
-51 -0.821255 -0.619251 -0.720253
-48 -0.542030 1.237966 0.347968
-44 0.822566 0.822566 0.822566
-43 0.165354 0.165354 0.165354
-38 1.052166 1.052166 1.052166
-33 0.649841 0.649841 0.649841
-32 -0.020592 -0.020592 -0.020592
-31 -1.340543 0.886358 -0.227093
-30 0.278267 0.278267 0.278267
-15 0.220145 0.220145 0.220145
-12 -0.247523 -0.247523 -0.247523
-9 -1.017454 -1.017454 -1.017454
-5 2.230568 2.230568 2.230568
-3 -1.258155 -1.258155 -1.258155
1 -0.310485 -0.310485 -0.310485
2 -0.265832 -0.265832 -0.265832
3 -0.008983 -0.008983 -0.008983
5 -0.320702 -0.320702 -0.320702
13 -0.634021 -0.634021 -0.634021
14 0.588749 0.588749 0.588749
16 -0.843814 -0.843814 -0.843814
18 -0.534178 -0.534178 -0.534178
19 -0.246229 -0.246229 -0.246229
20 -0.095204 -0.095204 -0.095204
21 -1.586995 0.941961 -0.322517
27 -0.054841 -0.054841 -0.054841
38 0.108338 0.108338 0.108338
39 -0.924176 -0.924176 -0.924176
57 -0.562416 -0.144378 -0.353397
60 1.074620 1.074620 1.074620
64 -1.302721 0.358431 -0.472145
71 0.033022 0.033022 0.033022
75 1.088710 1.088710 1.088710
78 -0.300983 -0.300983 -0.300983
... ... ...
@jreback
Thanks!
Going to close this; if you feel that this really should be implemented, pls reopen (and submit a PR if you can!)
The proposed workaround throws a FutureWarning in the current version of pandas. Should this bug be reopened?
That's indeed an unfortunate side effect of the deprecation.
I think the easiest solution is to use actual named functions instead of lambdas:
In [79]: def mean_plus_std(x): return np.mean(x) + np.std(x)
In [80]: def mean_minus_std(x): return np.mean(x) - np.std(x)
In [81]: grp.a.agg([np.mean, mean_plus_std, mean_minus_std])
Out[81]:
mean mean_plus_std mean_minus_std
b
0.0 0.468446 0.696463 0.240430
2.0 0.032308 0.032308 0.032308
3.0 0.704209 0.874344 0.534075
...
Something else we have been discussing is to allow kwargs to be different functions, something like:
grp.a.agg(one=np.mean, two=lambda x : np.mean(x) + np.std(x) , three=lambda x : np.mean(x) - np.std(x))
but this has not been implemented (and has some additional difficulties, such as how to deal with kwargs that are meant to be passed through to the function).
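For anyone reading this later: as far as I know, keyword-based naming along these lines is available in later pandas releases under the name "named aggregation" (0.25 and up). Below is a minimal sketch, assuming such a version is installed. The sample data, the seed, and the output names one/two/three are only illustrative, and the string "mean" stands in for np.mean.

```python
import numpy as np
import pandas as pd

# Illustrative data only: 100 random values in 'a', bucketed into 10 groups in 'b'.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100), "b": np.floor(rng.random(100) * 10)})
grp = df.groupby("b")

# Assumes pandas >= 0.25: each keyword becomes the name of an output column,
# so several lambdas can be used without clashing on the name "<lambda>".
result = grp["a"].agg(
    one="mean",
    two=lambda x: np.mean(x) + np.std(x),
    three=lambda x: np.mean(x) - np.std(x),
)

print(result.head())
```

Calling result.plot() on this should give the mean with the upper and lower bands the original post was after.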
I found a workaround.
def p(x):
    return (1,2)
# will return two values in one function
df.groupby(col).apply(lambda x:p(x))
# will convert the new column into two columns of different values
df[[newCol1,newCol2]] = df[df.columns.values[-1]].apply(pd.Series)
This has caused me huge frustration and I believe this should be updated to allow passing the same function several times and then providing the desired name of each output column. I'm working with a custom aggregation function that takes an additional argument, either via functools.partial or simply via multiple lambda functions. I was hoping to avoid 6 separate named functions, but with the current method I have to write them, even though each function is only slightly different from the others. The "workarounds" here don't save any time compared to just having separately defined functions that are all very similar.
I have found a more satisfactory workaround, specifically for the case where you want to apply multiple similar functions to the same column. You can create a function factory like so:
def ip_is(ip):
    def ipf(x):
        return (x==ip).mean()
    # give each generated function a unique name
    ipf.__name__ = 'ipf {}'.format(str(ip))
    return ipf
ip_by_day = dfp.groupby('day').agg({'ip': [ip_is('123'), ip_is('456'), ip_is('789')]})
Here I'm checking what fraction of records per day have a certain IP. Basically you can set the name of the returned function manually and so avoid the SpecificationError.
I am doing something like this and I run into a similar error:
fs = [lambda x: np.percentile(x, p) for p in ptiles] + [np.sum]
off_smry = gb_off['delivery_time'].agg(fs)
Here is the error I get
SpecificationError: Function names must be unique, found multiple named <lambda>
I think it should be allowed to do something like this. In practical scenarios people could be generating multiple lambda functions to apply.
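One way around this, following the function-factory idea from the earlier comment, is to generate the percentile functions with distinct names. This is only a sketch: the helper name percentile, the column names, and the sample data are made up for illustration. A side benefit is that it avoids a separate pitfall in the list comprehension above, where every lambda looks up p when it is called and so all of them would use the last value in ptiles.

```python
import numpy as np
import pandas as pd

def percentile(p):
    """Hypothetical helper: build an aggregation function for the p-th percentile."""
    def f(x):
        # p is bound per call to percentile(), so each function keeps its own value
        return np.percentile(x, p)
    f.__name__ = "p{}".format(p)  # unique name, so agg does not see duplicates
    return f

# Illustrative data standing in for gb_off['delivery_time']
df = pd.DataFrame({
    "office": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "delivery_time": [1, 2, 3, 4, 5, 10, 20, 30],
})
ptiles = [25, 50, 75]

# The string "sum" is used here in place of np.sum
fs = [percentile(p) for p in ptiles] + ["sum"]
off_smry = df.groupby("office")["delivery_time"].agg(fs)

print(off_smry)
```

The output columns take their names from the __name__ of each generated function, so they come out as p25, p50, p75 and sum.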
I'm experiencing the same problem
Same problem here. Push.