Pandas: Feature: Qcut when passed labels and duplicates='drop' should drop corresponding labels

Created on 11 Sep 2018 · 11Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
    """
    Returns the given dataframe with dummy columns for quantiles of a given column. Quantiles can be a int to 
    specify equal spaced quantiles or an array of quantiles 
    :param data: DataFrame :type data: DataFrame 
    :param column: column to which add quantiles :type column: string 
    :param quantiles: number of quantiles to generate or list of quantiles :type quantiles: Union[int, list of float] 
    :return: DataFrame 
    """
    if isinstance(quantiles, int):
        labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
    if isinstance(quantiles, list):
        labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
        del labels[0]  # Bin labels must be one fewer than the number of bin edges
    data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
                                                   q=quantiles,
                                                   labels=labels, duplicates='drop'))], axis=1)
    return data

zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#numbers
#0      0.0
#1      0.0
#2      0.0
#3     33.0
#4     81.0
#5     13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
  File "pandas_qcut.py", line 29, in <module>
    print(add_quantiles(df, 'numbers'))
  File "pandas_qcut.py", line 20, in add_quantiles
    labels=labels, duplicates='drop'))], axis=1)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges

Problem description

When using this function with quantiles that return repeated bins, the function raises "ValueError: Bin labels must be one fewer than the number of bin edges". When using the optional parameter "duplicates" the only way to pass a valid "labels" parameters is checking for duplicate bins beforehand, repeating code in order to calculate the bins.

Expected Output

Pd.qcut should return the quantilizated column with the labels corresponding to the indices of the unique bins.
E.g output of add_quantiles.:

   numbers  numbers_50q  numbers_75q  numbers_100q
0      0.0            1            0             0
1      0.0            1            0             0
2      0.0            1            0             0
3     33.0            0            0             1
4     81.0            0            0             1
5     13.0            0            1             0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug cut

Source

olveirap

👍3

Most helpful comment

Here is an even simpler example. It works with duplicates='drop' alone:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True)
ValueError: Bin edges must be unique: array([ 97.,  97.,  97.,  99.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop')
([(96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0]]
Categories (1, interval[float64]): [(96.999, 99.0]], array([ 97.,  99.]))

But if you try to apply labels, then it fails:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop', labels=[1, 2, 3])
ValueError: Bin labels must be one fewer than the number of bin edges

There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it's pretty much impossible to use duplicates='drop' and labels together reliably.

qris on 4 Oct 2018

👍4

All 11 comments

@olveirap : Thanks for reporting this? Could you update your issue to show what output you're getting currently and what output you would expect? That would be very helpful.

cc @jreback

gfyoung on 11 Sep 2018

Sorry for not being clear enough, I've edited the issue with more clear expected output and current behavior

olveirap on 12 Sep 2018

👍1

Can you make your sample a minimally reproducible one? There are a lot of extraneous elements that make it more difficult to decipher.

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

WillAyd on 12 Sep 2018

Sorry again, I provide here a more simple use of pd.qcut which represent the issue I'm talking about:

import pandas as pd
import numpy as np
zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
df

	numbers
0	0.0
1	0.0
2	0.0
3	84.0
4	95.0
5	77.0

labels = ['numbers_25q', 'numbers_50q', 'numbers_75q', 'numbers_100q']
pd.qcut(x=df['numbers'],
q=4,
labels=labels, duplicates='drop')

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-14-1afe71bbc979> in <module>()
      1 pd.qcut(x=df['numbers'],
      2 q=4,
----> 3 labels=labels, duplicates='drop')


~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py in qcut(x, q, labels, retbins, precision, duplicates)
    204     fac, bins = _bins_to_cuts(x, bins, labels=labels,
    205                               precision=precision, include_lowest=True,
--> 206                               dtype=dtype, duplicates=duplicates)
    207 
    208     return _postprocess_for_cut(fac, bins, retbins, x_is_series,


~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
    250         else:
    251             if len(labels) != len(bins) - 1:
--> 252                 raise ValueError('Bin labels must be one fewer than '
    253                                  'the number of bin edges')
    254         if not is_categorical_dtype(labels):


ValueError: Bin labels must be one fewer than the number of bin edges

labels = ['numbers_50q', 'numbers_75q', 'numbers_100q']
pd.qcut(x=df['numbers'],
q=4,
labels=labels, duplicates='drop')

0     numbers_50q
1     numbers_50q
2     numbers_50q
3    numbers_100q
4    numbers_100q
5     numbers_75q
Name: numbers, dtype: category
Categories (3, object): [numbers_50q < numbers_75q < numbers_100q]

olveirap on 12 Sep 2018

Not sure that your desired output is plausible here as it's rather ambiguous what the bins should be when you are asking for 4 of them but only 3 ranges are really possible.

So in that case the ValueError makes sense, though perhaps a better error message can be thrown

WillAyd on 12 Sep 2018

Do you think dropping the label on the same index of duplicate bin is a bad solution?

I guess a warning while doing so would be the best of both worlds, informing the user of the possible ambiguity while returning something useful for some cases (such as mine)

olveirap on 12 Sep 2018

Do you think dropping the label on the same index of duplicate bin is a bad solution?

Only because I don't think it's generalizable. With your example what is the differentiator that makes you want to drop "_25" but keep "_100" instead of the other way around? What would happen if you ended up with say only 1 bin but 4 labels?

I don't think there is a clear cut answer to situations like the above so I'd be -1 here. Of course let's see what others think

WillAyd on 12 Sep 2018

Because that would alter the order of the labels in a way that they are no longer assigned to the intended quantile. In the order of the labels it's implicitly passed the assignment to the bin, if you drop the second or the first and you drop the last label then you are arbitrarly changing the assigned labels.

First label was to the first quantile, since first and second quantile are repeated upon calculating the bins, the correct label for the unique bins calculated is to be the one of the second quantile, since it's equivalent to using the following code:

labels = ['numbers_50q', 'numbers_75q', 'numbers_100q']
quantiles = [.0,.5,.75,1.]
pd.qcut(x=df['numbers'],
q=quantiles,
labels=labels)

0     numbers_50q
1     numbers_50q
2     numbers_50q
3    numbers_100q
4    numbers_100q
5     numbers_75q
Name: numbers, dtype: category
Categories (3, object): [numbers_50q < numbers_75q < numbers_100q]

olveirap on 12 Sep 2018

Here is an even simpler example. It works with duplicates='drop' alone:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True)
ValueError: Bin edges must be unique: array([ 97.,  97.,  97.,  99.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop')
([(96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0]]
Categories (1, interval[float64]): [(96.999, 99.0]], array([ 97.,  99.]))

But if you try to apply labels, then it fails:

>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop', labels=[1, 2, 3])
ValueError: Bin labels must be one fewer than the number of bin edges

qris on 4 Oct 2018

👍4

Sorry, I found your example a bit convoluted, so this:
[ord(x) for x in list('aaaaaabc')
Returns this:
[97, 97, 97, 97, 97, 97, 98, 99]
What the code should try to do with q=3 is separate the numbers between the 0-percentile and 33-percentile in a bin, the same for 33-percentile and 66-percentile and lastly 66-percentile and 100-percenile. In the array above the value 97 is inside every bin, so what you get is a bin that goes from the 0-percentile to 100-percentile.

There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it's pretty much impossible to use duplicates='drop' and labels together reliably.

Why? as far as I'm understanding the code, from this line: https://github.com/pandas-dev/pandas/blob/0409521665bd436a10aea7e06336066bf07ff057/pandas/core/reshape/tile.py#L327
I think calculating the difference between bins and unique_bins is pretty straightforward. I'm aware that what labels to keep when dropping bins isn't as evident and it depends of the use of labels.

In my example, my labels referred to the upper limit of the bin and that's why I was pushing for dropping the ones before the first nonduplicate, but this could be configured with an optional parameter to behave in the opposite way, keeping the label from first duplicate bin and absorving the ones that come after.

I'm aware that at this point I'm probably nitpicking about a functionality probably noone uses like me, I will try to do a fork with this functionality for myself if you don't find it would be proper to have it here

olveirap on 4 Oct 2018

👍1

Moving test below to follow the line "bins = algos.quantile(x, quantiles)" fixed the problem for me.

    #else:
    #    if len(labels) != len(bins) - 1:
    #        raise ValueError('Bin labels must be one fewer than '
    #                         'the number of bin edges')