Pandas: BUG: Bins are unexpected for qcut when the edges are duplicated

Created on 11 May 2017 · 6 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
values = np.empty(shape=10)
values[:3] = 0
values[3:5] = 1
values[5:7] = 2
values[7:9] = 3
values[9:] = 4
pd.qcut(values,5,duplicates='drop')

Problem description

The first bin contains both 0 and 1. Since I'm looking to put 20% of the data in each bin, I would expect the first bin to contain only 0's (30% of the data) rather than 0's and 1's (50% of the data).
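The surprising grouping can be made visible by counting bin membership directly (a small sketch run against a recent pandas; the `value_counts` step is just for illustration):

```python
import numpy as np
import pandas as pd

values = np.empty(shape=10)
values[:3] = 0   # 30% of the data
values[3:5] = 1
values[5:7] = 2
values[7:9] = 3
values[9:] = 4

cats = pd.qcut(values, 5, duplicates='drop')
# The 20% quantile edge equals the minimum (both 0.0), so one edge is
# dropped: only 4 bins survive and the first holds 5 of 10 values (50%).
counts = pd.Series(cats).value_counts().sort_index()
print(counts)
```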

Expected Output

Output of pd.show_versions()



INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.20.1
pytest: 2.8.5
pip: 8.1.1
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

Labels: Bug, cut


All 6 comments

You are effectively doing this.

In [11]: pd.qcut([0, 1, 2, 3, 4, 5], 5)

Out[11]: 
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0]]
Categories (5, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]

You have to have one bin that contains two distinct values. drop does exactly what it sounds like: it keeps only the unique bin edges.

So this looks like the right answer. If you don't want this, I would specify the bins yourself.
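Specifying the bins yourself means switching to pd.cut with explicit edges; the edges below are hand-picked for this particular data and are purely illustrative:

```python
import numpy as np
import pandas as pd

values = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 4], dtype=float)
# Hand-picked edges (hypothetical; choosing them requires knowing the
# distribution in advance, which is the difficulty raised in a later comment).
edges = [-0.001, 0.5, 1.5, 2.5, 3.5, 4.0]
cats = pd.cut(values, bins=edges)
counts = pd.Series(cats).value_counts().sort_index()
print(counts)
```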

@TomAugspurger

@jreback
I agree with you that the function does what is advertised, i.e. it drops duplicated bins. I still think this gives a surprising result in my example.

I think there is a slight difference between our examples. In my example I'd like each bin to hold 20% of the data, but qcut turns the bin that would hold 30% (only 0's) into one holding 50% ({0, 1}), because the 0's don't fit entirely within 20%.
In your example each distinct value accounts for less than 20% of the data, so two values have to share a bin since the 0's don't completely fill the first one.

Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.

> Your suggestion of specifying the bins myself is not feasible since I don't know the distribution of the values beforehand.

pandas has the same problem :) Doing qcut(x, 5) is just qcut(x, [0, .2, .4, .6, .8, 1.]), which can't give you your desired outcome since the 0th and 20th percentiles are the same.
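The equivalence can be checked directly (a small sketch; both calls should produce identical categoricals):

```python
import pandas as pd

x = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]
a = pd.qcut(x, 5, duplicates='drop')
b = pd.qcut(x, [0, .2, .4, .6, .8, 1.], duplicates='drop')
# An integer bin count is expanded to evenly spaced probabilities,
# so both calls hit the same duplicated lowest edge.
assert (a == b).all()
```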

I did a brief skim of other packages, and it seems like they get around this by iteratively adjusting the quantiles until things work. @artturib would you mind writing up a function that does what you want, and we can see if we can integrate it into pandas?
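As a possible starting point, here is a rough sketch of the "iteratively adjust the quantiles" idea. Everything here is an assumption, not a pandas API: the helper name `qcut_adjust`, the fixed step size, and the give-up behaviour are all illustrative choices.

```python
import numpy as np
import pandas as pd

def qcut_adjust(values, q, step=0.01, max_iter=100):
    """Nudge colliding quantile probabilities to the right until all
    bin edges are distinct, then cut on the adjusted edges."""
    s = pd.Series(values)
    probs = np.linspace(0, 1, q + 1)
    for _ in range(max_iter):
        edges = s.quantile(probs).to_numpy()
        dup = np.flatnonzero(np.diff(edges) == 0)
        if len(dup) == 0:
            return pd.cut(values, bins=edges, include_lowest=True)
        # Move the right-hand member of each colliding pair upward.
        probs[dup + 1] = np.minimum(probs[dup + 1] + step, 1.0)
    raise ValueError("could not find distinct bin edges")

cats = qcut_adjust([0, 0, 0, 1, 1, 2, 2, 3, 3, 4], 5)
counts = pd.Series(cats).value_counts().sort_index()
print(counts)
```

On the data from this issue this yields five bins with the 0's isolated in the first bin (counts 3, 2, 2, 2, 1), which is the grouping the reporter asked for.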

I ran into this today. Consider the case:

In [3]: pd.qcut([1,1,1,1,2,3,4], 3, duplicates='drop')
Out[3]: 
[(0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (0.999, 2.0], (2.0, 4.0], (2.0, 4.0]]
Categories (2, interval[float64]): [(0.999, 2.0] < (2.0, 4.0]]

In [9]: pd.Series([1,1,1,1,2,3,4]).quantile([ 0.        ,  0.33333333,  0.66666667,  1.        ])
Out[9]: 
0.000000    1.0
0.333333    1.0
0.666667    2.0
1.000000    4.0

Given this data with these quantile values, I would expect the bins to be [(0.999, 1.0] < (1.0, 4.0]], however they are [(0.999, 2.0] < (2.0, 4.0]]

I think this is a bug in the qcut logic with duplicates.

Specifically, qcut computes the quantile probabilities with linspace when they aren't given explicitly: np.linspace(0, 1, num_quantiles + 1). The bucket ranges are then constructed by taking consecutive pairs of the quantile values.
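That edge construction can be reproduced in a few lines (assuming pandas' default linear interpolation for Series.quantile):

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 1, 1, 1, 2, 3, 4])
probs = np.linspace(0, 1, 3 + 1)       # [0, 1/3, 2/3, 1]
edges = x.quantile(probs).to_numpy()   # first two edges collide at 1.0
print(edges)
```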

The problem is that if the minimum and the first quantile value are duplicates, then we drop one, and the first quantile is subsequently treated as the minimum of the first bucket.

I think the fix is: if the 0th and 1st bin edges are equal, update the 0th edge by subtracting a small epsilon instead of filtering it out.

@wyegelwel thanks for the report! A PR to fix this would be welcome!
