Pandas: Modify get_group() method to allow getting multiple groups

Created on 5 Sep 2019 · 7 comments · Source: pandas-dev/pandas

Problem description

The get_group() method supports getting one group from a grouped object by

# Current syntax
grouped.get_group('name1')

but you can't get multiple groups simply by

# Desired syntax
grouped.get_group(['name1', 'name2'])

This raises "ValueError: must supply a tuple to get_group with multiple grouping keys".

My workaround for now is using concat and list comprehension

# Workaround
pd.concat([group for (name, group) in grouped if name in ['name1', 'name2']])

but this is a bit cumbersome and not Pythonic...
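
For concreteness, a minimal self-contained version of the above (the DataFrame and the 'name'/'value' columns are made up purely for illustration):

# Toy data, just to make the snippets above runnable
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name1', 'name2', 'name3'],
                   'value': [1, 2, 3, 4]})
grouped = df.groupby('name')

grouped.get_group('name1')                     # one group works fine
pd.concat([group for (name, group) in grouped
           if name in ['name1', 'name2']])     # multiple groups need the concat workaround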

Expected Output

Could we modify get_group() to support the syntax shown in the # Desired syntax snippet above? Or maybe implement a separate get_groups() method?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.12.3
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Groupby

Most helpful comment

You have to pass a tuple as the argument to get_group():

grouped.get_group((name_1, name_2))

All 7 comments

Not sure about this. I can see what you are after when dealing with one grouping, but if you have multiple items that you have grouped upon, a list / tuple would refer to values across all of the groupings and not multiple values within one grouping.

To illustrate, the request below would be totally ambiguous as to what you are looking for:

>>> df = pd.DataFrame({"col1": list("aabb"), "col2": list("bbcc")})
>>> df
  col1 col2
0    a    b
1    a    b
2    b    c
3    b    c

>>> grp = df.groupby(["col1", "col2"])
# get_group with multiple values looks across both columns today,
# but would be ambiguous with this request
>>> grp.get_group(("a", "b"))
  col1 col2
0    a    b
1    a    b

I wouldn't want to support doing this implicitly. A get_groups is better, but is it that much better than writing your own

pd.concat([grouped.get_group(name) for name in groups])

?
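
Spelled out against the example frame above, that list comprehension would look like this (just a sketch of the workaround, not an existing pandas method):

# Concatenate the selected groups; keys are tuples because of the two grouping columns
>>> pd.concat([grp.get_group(name) for name in [("a", "b"), ("b", "c")]])
  col1 col2
0    a    b
1    a    b
2    b    c
3    b    c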

Thanks @WillAyd @TomAugspurger for the comments. My understanding is that groupby() and get_group() are reciprocal operations:

  • df.groupby(): from dataframe to grouping
  • grp.get_group(): from grouping to dataframe

Since it's common to call groupby() once and get multiple groups out of a single dataframe (the "one-df-to-many-grp" operation), there should be a method you can call once to get multiple groups back into a single dataframe (the "many-grp-to-one-df" operation).

And if get_group() isn't the right method to do "many-grp-to-one-df", we need either a more advanced get_groups(), or a method with a different name, to satisfy this need.
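
Such a get_groups() could start out as little more than a thin wrapper around concat (a sketch only; get_groups is hypothetical and not part of pandas):

# Hypothetical helper, not an existing pandas method
def get_groups(grouped, names):
    # Concatenate the listed groups back into one DataFrame,
    # in the order the names are given.
    return pd.concat([grouped.get_group(name) for name in names])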

Yes, we can do pd.concat([grouped.get_group(name) for name in groups]), but we can also do something more elegant and powerful. For example, if we implement two new methods with names different from df.groupby() and grp.get_group() (so as not to break backward compatibility):

  • df.group(by=label or tuple of labels, ...)
  • grp.to_df(name=name or list of names, ...), where a name can be a label or a tuple of labels

Then full reciprocity is achieved, and we can do method chaining like this (using @WillAyd's example):

df.group(by=('col1', 'col2')).do_something_group_specific().to_df(name=[('a', 'b'), ('b', 'c')])

Here do_something_group_specific() means you can do different operations for the groups named ('a', 'b') and ('b', 'c'). The end result? You get the full df back, but it has been manipulated group-wise!

And if we implement some more 5-letter-named methods, chained dataframe operations can be as nice as this:

df.order(by='value', ascending=False)\
  .slice(rows=..., cols=...)\
  .group(by=...)\
  .sieve(func=lambda x: len(x) >= THRESHOLD...)\
  .to_df(name=...)\
  .print(to='new_window')
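
To make the intent of each proposed method concrete, here is roughly how the same steps look with today's API (the toy data, the 'value' column, and THRESHOLD below are placeholders; sort_values/groupby/filter stand in for the proposed order/group/sieve):

# Rough present-day equivalent of order / group / sieve
import pandas as pd

df = pd.DataFrame({'key': list('aabbcc'), 'value': [5, 3, 9, 1, 7, 2]})
THRESHOLD = 2

out = (df.sort_values(by='value', ascending=False)     # .order(...)
         .groupby('key')                               # .group(...)
         .filter(lambda g: len(g) >= THRESHOLD))       # .sieve(...)
print(out)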

Thoughts?

-1 on adding anything

working with just a small handful of groups is way less common than simply using an aggregation function

you can already use a list comprehension to iterate over groups if needed

You have to pass a tuple as the argument to get_group():

grouped.get_group((name_1, name_2))

My approach:

import numpy as np

grpidx = (name_1, name_2)  # the group keys to select
dfidx = np.sort(np.concatenate([gb.indices[x] for x in grpidx]))  # integer row positions of each group
df.loc[dfidx]  # assumes the default RangeIndex

Use df.iloc[dfidx, :] when the data frame has a multi-index (the positions in gb.indices are integer locations, so .loc only works with the default RangeIndex).
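
Put together with the example frame from earlier in the thread, the indices-based selection would look like this (same toy data as above, repeated only to make the snippet self-contained):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": list("aabb"), "col2": list("bbcc")})
gb = df.groupby(["col1", "col2"])

grpidx = (("a", "b"), ("b", "c"))                                   # group keys to pull out
dfidx = np.sort(np.concatenate([gb.indices[x] for x in grpidx]))    # integer row positions
df.iloc[dfidx, :]                                                   # position-based, so any index works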
