Cudf: [BUG] Column Renaming Bug

Created on 31 Jan 2020 · 6 Comments · Source: rapidsai/cudf

[Image: screenshot of the reporter's code, not preserved]

Line 35 gives an error: KeyError: 'fact_sales_tax_amt'

But I would expect it to be called that. Looks like there's a bug with how the columns are named in my example.

Please confirm -- I'm running 0.11.

Thanks.

bug cuDF (Python)

All 6 comments

@RedTailedHawk I believe this is a known issue related to handling MultiColumns. We're working on fixes in 0.13 for this.

I have recreated this and have been working with rename today so I thought I'd try to knock it out.

Why do we support this syntax for aggregate?

>>> a.groupby('a').agg({'a': 'max', 'b': ['min', 'mean']})
    a   b
  max min mean
a
1   1   1  1.5
2   2   3  3.0

Is there an important compatibility reason for not putting a single aggregate item in a list? Personally, I don't think the bare-string form is cleaner: the value is always a finite set of aggregations, even if it is a set of one.
The following works as expected. We could default to expecting list-likes for agg. Either way, this is where the keys are getting lost, regardless of how we want to handle it.

import cudf as gd
import numpy as np

gdf = gd.DataFrame(
    {
        "id": list(range(10)),
        "000": np.random.randint(0, 10, 10),
        "111": np.random.randint(0, 10, 10),
        "222": np.random.randint(0, 10, 10),
        "a": np.random.randint(0, 10, 10),
        "b": np.random.randint(0, 10, 10),
        "c": np.random.randint(0, 10, 10),
        "d": np.random.randint(0, 10, 10),
    }
)

# Every aggregation is given as a list, including the single "sum" for "222".
gdf = (
    gdf
    .groupby(["id"])
    .agg({"000": ["sum", "min", "max"],
          "111": ["sum", "min", "max"],
          "222": ["sum"]})
    .reset_index()
    .rename(columns={"id": "new__id",
                     ("000", "sum"): "000_amt",
                     ("111", "sum"): "__int_agg1",
                     ("111", "min"): "__int_agg2",
                     ("111", "max"): "__int_agg3",
                     ("222", "sum"): "___Problem_INT_AGG99"})
)

# Tuple agg copies fine
gdf["__int_agg_2xx"] = gdf["000_amt"]

# No KeyError
gdf["__int_agg_3xx"] = gdf["___Problem_INT_AGG99"]
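
A possible way to avoid tuple keys in rename altogether is to flatten the two-level column labels into plain strings first. The sketch below uses pandas as a stand-in for cuDF (an assumption; it has not been checked against cuDF 0.11), and the data and the "_" separator are made up for illustration:

```python
import pandas as pd

# pandas stands in for cuDF here; small made-up data.
df = pd.DataFrame(
    {
        "id": [0, 0, 1, 1],
        "000": [1, 2, 3, 4],
        "222": [5, 6, 7, 8],
    }
)

agg = (
    df.groupby("id")
    .agg({"000": ["sum", "min"], "222": ["sum"]})
    .reset_index()
)

# Columns are now tuples such as ("id", ""), ("000", "sum"), ("222", "sum").
# Join the non-empty levels into one string label per column.
agg.columns = ["_".join(part for part in col if part) for col in agg.columns]

# With flat string labels, rename only needs string keys.
agg = agg.rename(columns={"id": "new__id", "000_sum": "000_amt"})
print(list(agg.columns))
```

After flattening, a later lookup such as agg["000_amt"] depends only on string labels, so it does not rely on how tuple keys are handled by rename.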

I am going to target some higher-priority items first, but if anyone votes one way or the other, I'll get a PR in. @kkraus14

> Is there an important compatibility reason for not putting a single aggregate item in a list?

Unfortunately yes: if every aggregation isn't in a list, it doesn't create a MultiIndex, IIRC, in addition to other esoteric behaviors. @shwina is likely the most knowledgeable here.
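
The MultiIndex point can be checked in pandas. This is a small sketch with made-up data (the column names k, x and y are illustrative only), comparing a dict of bare strings against a dict that contains a list:

```python
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2], "x": [1, 2, 3], "y": [4, 5, 6]})

# Every aggregation given as a plain string: flat column labels.
flat = df.groupby("k").agg({"x": "max", "y": "min"})

# One aggregation given as a list: two-level column labels.
nested = df.groupby("k").agg({"x": "max", "y": ["min", "mean"]})

print(isinstance(flat.columns, pd.MultiIndex))    # False
print(isinstance(nested.columns, pd.MultiIndex))  # True
print(list(nested.columns))
```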

Ok. Thanks Keith. If there's a reason I'm happy to take it in that direction.

Yeah, passing a list-like to agg versus just a string gives different behaviour:

In [1]: import pandas as pd                 

In [2]: a = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 3]})       

In [3]: a.groupby('a').agg('sum')           
Out[3]: 
   b
a   
1  3
2  3

In [4]: a.groupby('a').agg(['sum'])         
Out[4]: 
    b
  sum
a    
1   3
2   3
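
For illustration only, the "default to list-likes" idea raised earlier in the thread could look roughly like the helper below. The name normalize_agg_spec is hypothetical and is not part of cuDF or pandas:

```python
def normalize_agg_spec(spec):
    """Hypothetical helper: wrap bare aggregation names in one-element lists."""
    return {
        column: [funcs] if isinstance(funcs, str) else list(funcs)
        for column, funcs in spec.items()
    }


spec = {"a": "max", "b": ["min", "mean"]}
print(normalize_agg_spec(spec))
```

Given the difference between agg('sum') and agg(['sum']) in the session above, applying such a normalization unconditionally would change the shape of the result, which is presumably why it cannot simply be made the default.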