The get_group() method supports getting one group from a grouped object by
# Current syntax
grouped.get_group('name1')
but you can't get multiple groups simply by
# Desired syntax
grouped.get_group(['name1', 'name2'])
This raises "ValueError: must supply a tuple to get_group with multiple grouping keys".
My workaround for now is using concat and a list comprehension:
# Workaround
pd.concat([group for (name, group) in grouped if name in ['name1', 'name2']])
but this is a bit cumbersome and not Pythonic...
Could we modify get_group() to support the desired syntax above? Or maybe implement a separate get_groups() method?
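For reference, the workaround can be wrapped in a small helper; get_groups below is a hypothetical name illustrating the requested behavior, not an existing pandas method:

```python
import pandas as pd

# Hypothetical helper (not part of pandas) wrapping the concat workaround:
# concatenate the rows of several groups back into one DataFrame.
def get_groups(grouped, names):
    return pd.concat([group for name, group in grouped if name in names])

df = pd.DataFrame({"key": ["name1", "name1", "name2", "name3"],
                   "value": [1, 2, 3, 4]})
grouped = df.groupby("key")
result = get_groups(grouped, ["name1", "name2"])
# result holds the three rows whose key is 'name1' or 'name2'
```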
pd.show_versions()
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.12.3
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Not sure about this. I can see what you are after when dealing with one grouping, but if you have multiple items that you have grouped upon a list / tuple would refer to values across all of the groupings and not multiple values within one grouping.
To illustrate, the following would be completely ambiguous as to what you are asking for:
>>> df = pd.DataFrame({"col1": list("aabb"), "col2": list("bbcc")})
>>> df
col1 col2
0 a b
1 a b
2 b c
3 b c
>>> grp = df.groupby(["col1", "col2"])
# get_group with multiple values looks across both columns today,
# but a list would be ambiguous under the requested behavior
>>> grp.get_group(("a", "b"))
col1 col2
0 a b
1 a b
I wouldn't want to support doing this implicitly. A get_groups method would be better, but is it that much better than writing your own
pd.concat([grouped.get_group(name) for name in groups])
?
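For what it's worth, that one-liner runs as written; a minimal demonstration on a small frame with three groups:

```python
import pandas as pd

# Three groups under a two-column grouping; pull two of them back out.
df = pd.DataFrame({"col1": list("aabc"), "col2": list("bbcd"),
                   "val": [1, 2, 3, 4]})
grouped = df.groupby(["col1", "col2"])
groups = [("a", "b"), ("b", "c")]  # each name is a tuple across both keys
out = pd.concat([grouped.get_group(name) for name in groups])
# out contains rows 0 and 1 (group ('a', 'b')) and row 2 (group ('b', 'c'))
```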
Thanks @WillAyd @TomAugspurger for the comments. My understanding is that groupby() and get_group() are reciprocal operations:
- df.groupby(): from dataframe to grouping
- grp.get_group(): from grouping to dataframe

Since it's common to call groupby() once and get multiple groupings out of a single dataframe (operation "one-df-to-many-grp"), there should be a method to call once and get multiple groupings back into a single dataframe (operation "many-grp-to-one-df").
And if get_group() isn't the right method to do "many-grp-to-one-df", we need either a more advanced get_groups(), or a method with a different name, to satisfy this need.
Yes we can do pd.concat([grouped.get_group(name) for name in groups]), but we can also do something more elegant and powerful. For example, if we implement two new methods with names different from df.groupby() and grp.get_group() (so as not to break the backward compatibility):
- df.group(by=label or tuple of labels, ...)
- grp.to_df(name=name or list of names, ...), where a name can be a label or a tuple of labels

Then full reciprocity is achieved, and we can do method chaining like this (using @WillAyd 's example):
df.group(by=('col1', 'col2')).do_something_group_specific().to_df(name=[('a', 'b'), ('b', 'c')])
Here do_something_group_specific() means you can apply different operations to the groups named ('a', 'b') and ('b', 'c'). The end result? You get the full df back, but it has been manipulated group-wise!
And if we implement some more 5-letter-named methods, chained dataframe operations can be as nice as this:
df.order(by='value', ascending=False)\
.slice(rows=..., cols=...)\
.group(by=...)\
.sieve(func=lambda x: len(x) >= THRESHOLD...)\
.to_df(name=...)\
.print(to='new_window')
Thoughts?
-1 on adding anything
working with just a small handful of groups is way less common than simply using an aggregation function
you can already use a list comprehension to iterate over groups if needed
You have to pass a tuple as the argument to get_group:
grouped.get_group((name_1, name_2))
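To spell out why the tuple is required: with multiple grouping keys, a single group's name is itself a tuple, so the tuple selects one group rather than two. A small sketch:

```python
import pandas as pd

# With two grouping keys, each group name is a (col1, col2) tuple.
df = pd.DataFrame({"col1": list("aabb"), "col2": list("bbcc")})
grouped = df.groupby(["col1", "col2"])
one_group = grouped.get_group(("a", "b"))  # the single group named ('a', 'b')
# one_group is rows 0 and 1, where col1 == 'a' and col2 == 'b'
```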
my approach:
grpidx = (name_1, name_2)
dfidx = np.sort(np.concatenate([gb.indices[x] for x in grpidx]))
df.loc[dfidx]
Use df.iloc[dfidx, :] instead when the dataframe has a multi-index, since gb.indices returns positional row numbers.
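A runnable version of the gb.indices approach, assuming a single grouping key (the same idea works with tuple names for multiple keys):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y", "z"], "value": [1, 2, 3, 4]})
gb = df.groupby("key")

grpidx = ("x", "y")  # the groups to pull back into one frame
# gb.indices maps each group name to the positional row numbers of its members
dfidx = np.sort(np.concatenate([gb.indices[g] for g in grpidx]))
subset = df.iloc[dfidx]  # iloc works regardless of the index type
```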