Cudf: [BUG] Column Renaming Bug

Created on 31 Jan 2020 · 6 Comments · Source: rapidsai/cudf

Line 35 gives an error: KeyError: 'fact_sales_tax_amt'

But I would expect it to be called that. Looks like there's a bug with how the columns are named in my example.

Please confirm -- I'm running 0.11.

Thanks.

bug cuDF (Python)

RedTailedHawk

All 6 comments

@RedTailedHawk I believe this is a known issue related to handling MultiColumns. We're working on fixes in 0.13 for this.

kkraus14 on 31 Jan 2020

I have recreated this and have been working with rename today so I thought I'd try to knock it out.

millerhooks on 18 Jun 2020

Why do we support this syntax for aggregate?

>>> a.groupby('a').agg({'a': 'max', 'b': ['min', 'mean']})
    a   b
  max min mean
a
1   1   1  1.5
2   2   3  3.0

Is there an important compatibility reason for not putting a single aggregate item in a list? I personally do not think it's cleaner. It's always going to be a set of finite values, even if a set of one.
The following works as expected. We could default to expecting list-likes for agg. This is where the keys are getting lost, regardless of how we want to handle it.

import cudf as gd  # alias assumed from the usage below
import numpy as np

gdf = gd.DataFrame(
    {
        "id": list(range(10)),
        "000": np.random.randint(0, 10, 10),
        "111": np.random.randint(0, 10, 10),
        "222": np.random.randint(0, 10, 10),
        "a": np.random.randint(0, 10, 10),
        "b": np.random.randint(0, 10, 10),
        "c": np.random.randint(0, 10, 10),
        "d": np.random.randint(0, 10, 10),
    }
)

# Every aggregation is given as a list, including the single "sum" for "222"
gdf = (
    gdf
    .groupby(["id"])
    .agg({"000": ["sum", "min", "max"],
          "111": ["sum", "min", "max"],
          "222": ["sum"]})
    .reset_index()
    .rename(columns={"id": "new__id",
                     ("000", "sum"): "000_amt",
                     ("111", "sum"): "__int_agg1",
                     ("111", "min"): "__int_agg2",
                     ("111", "max"): "__int_agg3",
                     ("222", "sum"): "___Problem_INT_AGG99"})
)

# A column renamed from a tuple key copies fine
gdf["__int_agg_2xx"] = gdf["000_amt"]

# No KeyError
gdf["__int_agg_3xx"] = gdf["___Problem_INT_AGG99"]

I am going to target some more high priority items but if anyone votes one way or the other, I'll get a PR in. @kkraus14
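
For anyone who needs flat names in the meantime, one possible workaround is to collapse the two-level column labels into plain strings before renaming. This is only a sketch in pandas, not cuDF, so cuDF may behave differently; the sample data and the joined names such as `000_sum` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only; pandas stands in for cuDF here.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": [0, 0, 1, 1, 2],
    "000": rng.integers(0, 10, 5),
    "222": rng.integers(0, 10, 5),
})

out = (
    df.groupby("id")
    .agg({"000": ["sum", "min"], "222": ["sum"]})
    .reset_index()
)

# Join each (column, aggregation) pair into one string. After reset_index()
# the "id" column has an empty second level, so empty parts are skipped.
out.columns = ["_".join(part for part in col if part) for col in out.columns]

# With flat string labels, a plain rename mapping is enough.
out = out.rename(columns={"id": "new__id", "000_sum": "000_amt"})
print(list(out.columns))
```

Once the labels are plain strings, the rename no longer depends on how tuple keys are matched against MultiIndex columns.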

millerhooks on 19 Jun 2020

Is there an important compatibility reason for not putting a single aggregate item in a list?

Unfortunately yes: when an aggregation isn't in a list, it doesn't create a MultiIndex, IIRC, in addition to other esoteric behaviors. @shwina is likely the most knowledgeable here.

kkraus14 on 19 Jun 2020

Ok. Thanks Keith. If there's a reason I'm happy to take it in that direction.

millerhooks on 19 Jun 2020

Yeah, a list-like for agg vs. just a string behave differently:

In [1]: import pandas as pd                 

In [2]: a = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 3]})       

In [3]: a.groupby('a').agg('sum')           
Out[3]: 
   b
a   
1  3
2  3

In [4]: a.groupby('a').agg(['sum'])         
Out[4]: 
    b
  sum
a    
1   3
2   3
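
The same difference shows up in the type of the resulting columns. A minimal sketch in pandas only (not cuDF); the names `flat` and `nested` are just for illustration:

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 3]})

# A bare string keeps a single-level column index.
flat = a.groupby("a").agg("sum")

# A list, even with one item, produces two-level (MultiIndex) columns.
nested = a.groupby("a").agg(["sum"])

print(type(flat.columns).__name__, flat.columns.nlevels, list(flat.columns))
print(type(nested.columns).__name__, nested.columns.nlevels, list(nested.columns))
```

So a column from the list form is addressed by a tuple such as `("b", "sum")`, while the string form keeps the plain label `"b"`.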