Pandas: BUG: groupby().ffill() adds group labels as extra column

Created on 18 Jun 2018 · 22 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

Input:

pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()

Output:

   NaN  3  4
1    5  0  0
2    6  0  0

Problem description

groupby().ffill() adds an additional column to the dataframe, containing a copy of the group labels. This is a regression in pandas v0.23.0 (#19673).

Expected Output

   3  4
1  0  0
2  0  0
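
The report above can be turned into a self-contained check. This is a minimal sketch that assumes only that pandas is importable; the helper name `unexpected_columns` is hypothetical and not part of pandas. It lists any columns in the filled result that were not in the original frame. On an affected version this list would hold the unnamed label column, and on an unaffected version it would be empty.

```python
import pandas as pd


def unexpected_columns(original, result):
    """Return the result's column labels that are absent from the original frame."""
    return [col for col in result.columns if col not in original.columns]


# Same frame as the report: two rows (index 1, 2), two columns (3, 4), all zeros.
df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])

# [5, 6] has one entry per row and names no column, so it acts as group labels.
filled = df.groupby([5, 6]).ffill()

extra = unexpected_columns(df, filled)
print(pd.__version__, "extra columns:", extra)
```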

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.0
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

Algos Groupby Regression

adbull

👍1

All 22 comments

@adbull I think you should try building from master. The problem no longer occurs in the current development version of pandas.

>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   3  4
1  0  0
2  0  0

uds5501 on 18 Jun 2018

@uds5501 : Thanks for looking into this! Would you like to add a test to close?

gfyoung on 19 Jun 2018

~~Marking for 0.23.2, as this is a relatively trivial thing that we should be able to get in (I can bring this to the finish line if others are unable to do so).~~

gfyoung on 19 Jun 2018

@gfyoung Actually, no. I just fetched the most recent version of master and can now reproduce this error. I am sorry for the confusion.

>>> import pandas as pd
>>> pd.__version__
'0.24.0.dev0+122.g6131a59'
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
   NaN  3  4
1    5  0  0
2    6  0  0

I honestly have no idea how I got the output I reported before, but the error is reproducible.

uds5501 on 19 Jun 2018

@uds5501 : Sorry, I wasn't able to check until now, and yes, I can reproduce it too. This could potentially be a regression, though.

@adbull : Investigation and PR are welcome!

gfyoung on 19 Jun 2018

Hey, can I work on this issue? This is my first time contributing to an open source project. I have some experience using pandas DataFrames.

aggarwalvinayak on 19 Jun 2018

👍1

@aggarwalvinayak : By all means! Go for it!

gfyoung on 19 Jun 2018

👍1

Can I get some help getting started with this? For example, where are the implementations of groupby and ffill located, and how should I proceed in making this work?
I did find that this was not a bug in my previous version of pandas, 0.20.1, but it is a problem in current master.

aggarwalvinayak on 21 Jun 2018

@aggarwalvinayak : Oh, that is very useful information! You can proceed as follows:

1) Can you figure out the most recent version / release of pandas where this example works?
2) Are you familiar with git bisect? If not, have a look here: https://git-scm.com/docs/git-bisect

The goal is to figure out the commit where this example starts to fail. Then we can debug from there.

gfyoung on 21 Jun 2018
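
The bisect step gfyoung suggests can be automated with `git bisect run`, which classifies each commit from a script's exit code: 0 for good, 1 for bad, 125 for skip. The following is a minimal sketch, not a tested workflow for the pandas repository; the function name `bisect_exit_code` is hypothetical, and the sketch assumes the checked-out pandas is the one being imported.

```python
def bisect_exit_code():
    """Map the outcome of the reproducer to an exit code for `git bisect run`.

    0 marks the commit good, 1 marks it bad, and 125 asks git to skip it.
    """
    try:
        import pandas as pd
    except Exception:
        # The checkout could not be imported (e.g. extensions not rebuilt): skip.
        return 125

    df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])
    try:
        result = df.groupby([5, 6]).ffill()
    except Exception:
        # Unrelated breakage on this commit: skip rather than misclassify.
        return 125

    # Good commits return exactly the original columns; bad ones add a column.
    return 0 if list(result.columns) == [3, 4] else 1


exit_code = bisect_exit_code()
print("exit code:", exit_code)
```

Saved as a hypothetical `check_ffill.py` whose last line is `raise SystemExit(exit_code)`, this could be driven by `git bisect start`, `git bisect bad v0.23.0`, `git bisect good v0.22.0`, and then `git bisect run python check_ffill.py`. pandas' compiled extensions would typically need rebuilding at each step, which this sketch leaves out.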

As in the original report, this is a regression between 0.22.0 and 0.23.0, presumably due to the reimplementation of groupby().ffill() in cython (#19673).

adbull on 21 Jun 2018

@gfyoung d87ca1c723154b09a005f865a06a38d4bb82917c was the first commit in which this problem existed. Its commit message was "Cythonized GroupBy Fill", and it was aimed at improving the performance of GroupBy.ffill and GroupBy.bfill for issues #11296 and #19673.
@WillAyd Could you help me with this, as you made these changes? It would help me a lot in looking into this problem further, as it is my first issue and I don't have much experience.

aggarwalvinayak on 21 Jun 2018

@aggarwalvinayak Sure. As you look at it, feel free to ask if you have questions.

WillAyd on 21 Jun 2018

Just as a heads up - this "good first issue" label was added when we thought this was just going to be a test case. I've removed it as it appears to be a little more complex than that. You are absolutely welcome to diagnose and debug, but I just want to be clear that it may not be as simple as originally thought.

WillAyd on 21 Jun 2018

👍1

The error occurs when the number of keys passed to groupby equals the number of rows in the DataFrame and one of the keys is not a column of the DataFrame. If the number of keys does not equal the number of rows, a KeyError is raised. Why is it desired for no error to be raised when the number of keys equals the number of rows (and one of the keys is not a column of the DataFrame), as opposed to a KeyError? I couldn't find the reason in the groupby docs.

HarryVolek on 21 Jun 2018
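
The two behaviours HarryVolek describes can be compared side by side. This is a small sketch, assuming a pandas version that interprets list keys as discussed in this thread: a list with one entry per row that names no column is taken as group labels, while a list of any other length is taken as column keys.

```python
import pandas as pd

df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])

# Two entries for two rows, neither a column name: treated as group labels.
labels_result = df.groupby([5, 6]).ffill()

# Three entries for two rows: each entry must be a column key, and 5 is not.
try:
    df.groupby([5, 6, 7]).ffill()
    raised_key_error = False
except KeyError:
    raised_key_error = True

print("length matches, result shape:", labels_result.shape)
print("length differs, KeyError raised:", raised_key_error)
```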

Marking this for 0.23.2, as the regression was introduced in 0.23.0.

gfyoung on 21 Jun 2018

IMO this is not a regression. If you do:

pd.DataFrame(0, [1,2], [3,4]).groupby([3, 4]).ffill()

Then you do not get the additional column. 5 and 6 are not valid labels, so if anything this should be raising a KeyError.

WillAyd on 21 Jun 2018

@WillAyd: Feel free to relabel if that is the case. @jreback: Thoughts?

gfyoung on 21 Jun 2018

This is a regression. groupby arguments do not have to be column keys, they can also be group labels, e.g. the following is a valid groupby call:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby([1,1,2,2]).ffill()
   NaN    0
0    1  1.0
1    1  1.0
2    2  NaN
3    2  NaN

In 0.22.0, the correct result was returned, with no extra column; in 0.23.0, the extra column gets added.

@HarryVolek The special logic for when the argument is the same length as the df and contains an entry that is not a column key is there to support this use case. Possibly this could be better documented?

adbull on 21 Jun 2018

In your original example 5 and 6 are not labels, but rather arbitrary values. Looking at the documentation I see the case for ambiguity with the below sentence:

"If an ndarray is passed, the values are used as-is determine the groups."

In spite of the fact that you are using a list and not an ndarray, I'm inferring you are looking to use arbitrary values to determine groups as-is, not necessarily the labels. I haven't quite used this in that fashion before so I'll let others chime in with thoughts, but I feel like using it in that manner is fraught with peril, similar to the .ix conversation.

WillAyd on 21 Jun 2018

So I agree this API is not ideal -- there's clearly a dual meaning between 'list of column keys' and 'list of group labels'. However, that was the API as of pandas 0.22.0, so any change to it should really be made gradually, with a DeprecationWarning, etc.

Furthermore, the regression also affects cases where the meaning is unambiguous:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby(np.array([1,1,2,2])).ffill()
   NaN    0
0    1  1.0
1    1  1.0
2    2  NaN
3    2  NaN
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby(pd.Series([1,1,2,2])).ffill()
   NaN    0
0    1  1.0
1    1  1.0
2    2  NaN
3    2  NaN

adbull on 21 Jun 2018
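
The three ways of passing group labels shown in this thread (list, ndarray, Series) can be checked in one loop. This is a minimal sketch under the assumption that all three should yield the same columns; the names `keys` and `columns_by_kind` are illustrative only.

```python
import numpy as np
import pandas as pd

# One column (0) and four rows, as in the examples above.
df = pd.DataFrame([[1, np.nan, np.nan, np.nan]]).T

labels = [1, 1, 2, 2]
keys = {
    "list": labels,
    "ndarray": np.array(labels),
    "Series": pd.Series(labels),
}

columns_by_kind = {}
for kind, key in keys.items():
    result = df.groupby(key).ffill()
    # An unaffected version returns only the original column, [0].
    columns_by_kind[kind] = list(result.columns)
    print(kind, columns_by_kind[kind])
```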

OK fair enough. Looking at it the problem is going to stem from the below:

https://github.com/pandas-dev/pandas/blob/f1ffc5fae06a7294dc831887b0d76177aec9b708/pandas/core/groupby/groupby.py#L4839

Some kind of differentiation there between labels and arbitrary values could help determine whether the grouping should be included in the output. PRs welcome!

WillAyd on 21 Jun 2018

Looks like the issue also occurs when grouping by an index level:

>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]],[0],[1,1,2,2]).T.groupby(level=0).ffill()
   NaN    0
1    1  1.0
1    1  1.0
2    2  NaN
2    2  NaN

adbull on 3 Jul 2018

👍1
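
The index-level case from the last comment can be checked the same way. This is a sketch that assumes the fill should leave the column set untouched, as in the expected output of the original report.

```python
import numpy as np
import pandas as pd

# One column (0) with a duplicated index [1, 1, 2, 2], as in the comment above.
df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan]], index=[0], columns=[1, 1, 2, 2]
).T

filled = df.groupby(level=0).ffill()

# The affected versions in this thread prepend an unnamed column holding the
# index-level values, so the two column lists would differ there.
columns_unchanged = list(filled.columns) == list(df.columns)
print("columns unchanged:", columns_unchanged)
```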
