Input:
pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
Output:
NaN 3 4
1 5 0 0
2 6 0 0
groupby().ffill() adds an additional column to the DataFrame, containing a copy of the group labels. This is a regression in pandas v0.23.0 (#19673).
Expected output:
3 4
1 0 0
2 0 0
pd.show_versions()
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.14-200.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None
pandas: 0.23.1
pytest: 3.6.0
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
@adbull I think you should try building from master. The problem no longer occurs in the current version of pandas:
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
3 4
1 0 0
2 0 0
@uds5501 : Thanks for looking into this! Would you like to add a test to close?
Marking for 0.23.2, as this is a relatively trivial thing that we should be able to get in (I can bring this to the finish line if others are unable to do so).
@gfyoung Actually, no. I just fetched the latest master and can in fact reproduce this error. Sorry for the confusion.
>>> import pandas as pd
>>> pd.__version__
'0.24.0.dev0+122.g6131a59'
>>> pd.DataFrame(0, [1,2], [3,4]).groupby([5, 6]).ffill()
NaN 3 4
1 5 0 0
2 6 0 0
I honestly have no idea how I got the output I reported before, but the error is reproducible.
@uds5501 : Sorry, I wasn't able to check until now, and yes, I can reproduce it too. This could potentially be a regression, though.
@adbull : Investigation and PR are welcome!
Hey, can I work on this issue? This is my first time contributing to an open source project. I have some experience using pandas DataFrames.
@aggarwalvinayak : By all means! Go for it!
Can I get some help getting started with this? For example, where are the implementations of groupby and ffill located, and how should I proceed in making this work?
I did find that this wasn't a bug in my previous version of pandas, 0.20.1, but it is a problem in current master.
@aggarwalvinayak : Oh, that is very useful information! You can proceed as follows:
1) Can you figure out the most recent version / release of pandas where this example works?
2) Are you familiar with git bisect ? If not, have a look here: https://git-scm.com/docs/git-bisect
The goal is to figure out the commit where this example starts to fail. Then we can debug from there.
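To make the bisect repeatable, the example from the report can be wrapped in a small check. This is only a sketch: the function names `ffill_adds_extra_column` and `bisect_exit_code` are made up for illustration, and it assumes pandas is rebuilt at every bisect step so that the imported version matches the checked-out commit.

```python
import pandas as pd


def ffill_adds_extra_column() -> bool:
    """Return True if groupby().ffill() returns columns the input did not have."""
    df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])
    result = df.groupby([5, 6]).ffill()
    # On an affected build the result carries an extra leading column
    # holding the group labels, so the column lists differ.
    return list(result.columns) != list(df.columns)


def bisect_exit_code() -> int:
    """Map the check to the exit codes that `git bisect run` understands."""
    # 0 marks the commit as good; 1 marks it as bad.
    return 1 if ffill_adds_extra_column() else 0
```

A script that ends with `sys.exit(bisect_exit_code())` could then be passed to `git bisect run`, after marking 0.22.0 as good and 0.23.0 as bad.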
As in the original report, this is a regression between 0.22.0 and 0.23.0, presumably due to the reimplementation of groupby().ffill() in cython (#19673).
@gfyoung d87ca1c723154b09a005f865a06a38d4bb82917c was the first commit in which this problem existed. Its commit message was "Cythonized GroupBy Fill", and it was aimed at improving the performance of GroupBy.ffill and GroupBy.bfill (issues #11296 and #19673).
@WillAyd Could you help me with this, since you made these changes? It would help me a lot in looking into this problem further, as it is my first issue and I don't have much experience.
@aggarwalvinayak Sure. As you look at it, feel free to ask any questions you have.
Just as a heads up: this "good first issue" label was added when we thought this was just going to be a test case. I've removed it, as it appears to be a little more complex than that. You are absolutely welcome to diagnose and debug, but I just want to be clear that it may not be as simple as originally thought.
The error occurs when the number of columns specified in the groupby equals the number of rows in the DataFrame while one of the columns is not contained in the DataFrame. If the number of columns in the groupby is not equal to the number of rows in the DataFrame, a KeyError is raised. Why is it desired for nothing to happen if the number of columns specified equals the number of rows in the DataFrame (and one of the columns specified is not contained in the DataFrame), as opposed to raising a KeyError? I couldn't find the reason in the groupby docs.
Marking this for 0.23.2, as the regression was introduced in 0.23.0.
IMO this is not a regression. If you do:
pd.DataFrame(0, [1,2], [3,4]).groupby([3, 4]).ffill()
Then you do not get the additional column. 5 and 6 are not valid labels, so if anything this should be raising a KeyError.
@WillAyd: Feel free to relabel if that is the case. @jreback: Thoughts?
This is a regression. groupby arguments do not have to be column keys, they can also be group labels, e.g. the following is a valid groupby call:
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby([1,1,2,2]).ffill()
NaN 0
0 1 1.0
1 1 1.0
2 2 NaN
3 2 NaN
In 0.22.0, the correct result was returned, with no extra column; in 0.23.0, the extra column gets added.
@HarryVolek The special logic when the argument is the same length as the df, and contains an entry that is not a column key, is to support this use case. Possibly this could be better documented?
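For reference, the two readings of a list argument can be seen side by side. This is a minimal sketch that assumes a pandas version where the behaviour described in this thread holds; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(0, index=[1, 2], columns=[3, 4])

# Every entry is a column key, so the list means "group by columns 3 and 4".
by_columns = df.groupby([3, 4])

# Same length as the frame, but 5 and 6 are not column keys,
# so the list is read as one group label per row.
by_labels = df.groupby([5, 6])

print(by_columns.ngroups)  # both rows share the key (0, 0)
print(by_labels.ngroups)   # row 1 is in group 5, row 2 is in group 6

# A list whose length differs from the number of rows cannot be row labels,
# so its entries are looked up as keys and a missing key raises KeyError.
try:
    df.groupby([5, 6, 7])
    raised_key_error = False
except KeyError:
    raised_key_error = True
```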
In your original example 5 and 6 are not labels, but rather arbitrary values. Looking at the documentation I see the case for ambiguity with the below sentence:
"If an ndarray is passed, the values are used as-is determine the groups."
In spite of the fact that you are using a list and not an ndarray, I'm inferring you are looking to use arbitrary values to determine groups as is, not necessarily the labels. I haven't quite used this in that fashion before, so I'll let others chime in with thoughts, but I feel like using it in that manner is fraught with peril, similar to the .ix conversation.
So I agree this API is not ideal -- there's clearly a dual meaning between 'list of column keys' and 'list of group labels'. However, that was the API as of pandas 0.22.0, so any change to it should really be made gradually, with a DeprecationWarning, etc.
Furthermore, the regression also affects cases where the meaning is unambiguous:
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby(np.array([1,1,2,2])).ffill()
NaN 0
0 1 1.0
1 1 1.0
2 2 NaN
3 2 NaN
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]]).T.groupby(pd.Series([1,1,2,2])).ffill()
NaN 0
0 1 1.0
1 1 1.0
2 2 NaN
3 2 NaN
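A regression test covering these three forms of group labels might look roughly like the following. It is a sketch, not the test that belongs in the pandas suite: the helper name `check_ffill_keeps_columns` is hypothetical, `tm` is simply an alias for the public `pandas.testing` module, and the frame is built directly instead of via `.T`.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def check_ffill_keeps_columns(labels):
    """Fail if groupby(labels).ffill() changes the columns of the frame."""
    df = pd.DataFrame({0: [1.0, np.nan, np.nan, np.nan]})
    result = df.groupby(labels).ffill()
    # Row 1 is filled from row 0 (both are in group 1);
    # group 2 has no valid value to fill from, so it stays NaN.
    expected = pd.DataFrame({0: [1.0, 1.0, np.nan, np.nan]})
    tm.assert_frame_equal(result, expected)


# The same labels as a list, an ndarray and a Series.
for labels in ([1, 1, 2, 2], np.array([1, 1, 2, 2]), pd.Series([1, 1, 2, 2])):
    check_ffill_keeps_columns(labels)
```

On an affected build the comparison should fail because the result has an extra column; with the 0.22.0 behaviour it should pass.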
OK fair enough. Looking at it the problem is going to stem from the below:
Some kind of differentiation there between labels and arbitrary values could help determine whether or not the grouping should be included in the output or not. PRs welcome!
Looks like the issue also occurs when grouping by an index level:
>>> pd.DataFrame([[1,np.nan,np.nan,np.nan]],[0],[1,1,2,2]).T.groupby(level=0).ffill()
NaN 0
1 1 1.0
1 1 1.0
2 2 NaN
2 2 NaN
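The index-level case can be checked without hard-coding the full expected frame, by looking only for columns that the input did not have. Again a sketch, with the frame built directly and the variable names chosen for illustration:

```python
import numpy as np
import pandas as pd

# Same shape as the example above: one column, index levels used as groups.
df = pd.DataFrame({0: [1.0, np.nan, np.nan, np.nan]}, index=[1, 1, 2, 2])

result = df.groupby(level=0).ffill()

# Any column in the result that was not in the input is the unwanted label column.
extra_columns = [col for col in result.columns if col not in df.columns]
print(extra_columns)
```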