Pandas: Modify get_group() method to allow getting multiple groups

Created on 5 Sep 2019 · 7 comments · Source: pandas-dev/pandas

Problem description

The get_group() method supports getting one group from a grouped object by

# Current syntax
grouped.get_group('name1')

but you can't get multiple groups simply by

# Desired syntax
grouped.get_group(['name1', 'name2'])

This raises "ValueError: must supply a tuple to get_group with multiple grouping keys".

My workaround for now is using concat and list comprehension

# Workaround
pd.concat([group for (name, group) in grouped if name in ['name1', 'name2']])

but this is a bit cumbersome and not Pythonic...
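
For concreteness, a minimal self-contained version of the above (the DataFrame and the 'name'/'value' columns are made up purely for illustration):

# Toy data, just to make the snippets above runnable
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name1', 'name2', 'name3'],
                   'value': [1, 2, 3, 4]})
grouped = df.groupby('name')

grouped.get_group('name1')                     # one group works fine
pd.concat([group for (name, group) in grouped
           if name in ['name1', 'name2']])     # multiple groups need the concat workaround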

Expected Output

Could we modify get_group() to support the syntax shown in the # Desired syntax snippet above? Or maybe implement a separate get_groups() method?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.12.3
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Groupby

Most helpful comment

You have to pass a tuple as the argument to get_group():

grouped.get_group((name_1, name_2))

All 7 comments

Not sure about this. I can see what you are after when dealing with one grouping, but if you have multiple items that you have grouped upon, a list / tuple would refer to values across all of the groupings and not multiple values within one grouping.

To illustrate, the request below would be totally ambiguous as to what you are looking for:

>>> df = pd.DataFrame({"col1": list("aabb"), "col2": list("bbcc")})
>>> df
  col1 col2
0    a    b
1    a    b
2    b    c
3    b    c

>>> grp = df.groupby(["col1", "col2"])
# get_group with multiple values looks across both columns today,
# but would be ambiguous with this request
>>> grp.get_group(("a", "b"))
  col1 col2
0    a    b
1    a    b

I wouldn't want to support doing this implicitly. A get_groups is better, but is it that much better than writing your own

pd.concat([grouped.get_group(name) for name in groups])

?
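
Spelled out against the example frame above, that list comprehension would look like this (just a sketch of the workaround, not an existing pandas method):

# Concatenate the selected groups; keys are tuples because of the two grouping columns
>>> pd.concat([grp.get_group(name) for name in [("a", "b"), ("b", "c")]])
  col1 col2
0    a    b
1    a    b
2    b    c
3    b    c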

Thanks @WillAyd @TomAugspurger for the comments. My understanding is that groupby() and get_group() are reciprocal operations:

  • df.groupby(): from dataframe to grouping
  • grp.get_group(): from grouping to dataframe

Since it's common to call groupby() once and get multiple groups out of a single dataframe (the "one-df-to-many-grp" operation), there should be a method you can call once to get multiple groups back into a single dataframe (the "many-grp-to-one-df" operation).

And if get_group() isn't the right method to do "many-grp-to-one-df", we need either a more advanced get_groups(), or a method with a different name, to satisfy this need.
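
Such a get_groups() could start out as little more than a thin wrapper around concat (a sketch only; get_groups is hypothetical and not part of pandas):

# Hypothetical helper, not an existing pandas method
def get_groups(grouped, names):
    # Concatenate the listed groups back into one DataFrame,
    # in the order the names are given.
    return pd.concat([grouped.get_group(name) for name in names])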

Yes, we can do pd.concat([grouped.get_group(name) for name in groups]), but we can also do something more elegant and powerful. For example, if we implement two new methods with names different from df.groupby() and grp.get_group() (so as not to break backward compatibility):

  • df.group(by=label or tuple of labels, ...)
  • grp.to_df(name=name or list of names, ...), where a name can be a label or a tuple of labels

Then full reciprocity is achieved, and we can do method chaining like this (using @WillAyd's example):

df.group(by=('col1', 'col2')).do_something_group_specific().to_df(name=[('a', 'b'), ('b', 'c')])

Here do_something_group_specific() means you can do different operations for the groups named ('a', 'b') and ('b', 'c'). The end result? You get the full df back, but it has been manipulated group-wise!

And if we implement some more 5-letter-named methods, chained dataframe operations can be as nice as this:

df.order(by='value', ascending=False)\
  .slice(rows=..., cols=...)\
  .group(by=...)\
  .sieve(func=lambda x: len(x) >= THRESHOLD...)\
  .to_df(name=...)\
  .print(to='new_window')
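
To make the intent of each proposed method concrete, here is roughly how the same steps look with today's API (the toy data, the 'value' column, and THRESHOLD below are placeholders; sort_values/groupby/filter stand in for the proposed order/group/sieve):

# Rough present-day equivalent of order / group / sieve
import pandas as pd

df = pd.DataFrame({'key': list('aabbcc'), 'value': [5, 3, 9, 1, 7, 2]})
THRESHOLD = 2

out = (df.sort_values(by='value', ascending=False)     # .order(...)
         .groupby('key')                               # .group(...)
         .filter(lambda g: len(g) >= THRESHOLD))       # .sieve(...)
print(out)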

Thoughts?

-1 on adding anything

working with just a small handful of groups is way less common than simply using an aggregation function

you can already use a list comprehension to iterate over groups if needed

You have to pass a tuple as the argument to get_group():

grouped.get_group((name_1, name_2))

My approach:

import numpy as np

grpidx = (name_1, name_2)  # the group keys to select
dfidx = np.sort(np.concatenate([gb.indices[x] for x in grpidx]))  # integer row positions of each group
df.loc[dfidx]  # assumes the default RangeIndex

Use df.iloc[dfidx, :] when the data frame has a multi-index (the positions in gb.indices are integer locations, so .loc only works with the default RangeIndex).
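
Put together with the example frame from earlier in the thread, the indices-based selection would look like this (same toy data as above, repeated only to make the snippet self-contained):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": list("aabb"), "col2": list("bbcc")})
gb = df.groupby(["col1", "col2"])

grpidx = (("a", "b"), ("b", "c"))                                   # group keys to pull out
dfidx = np.sort(np.concatenate([gb.indices[x] for x in grpidx]))    # integer row positions
df.iloc[dfidx, :]                                                   # position-based, so any index works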
