This code fails for any K:
import pandas as pd
K = 100
pd.qcut([0] * K + [1] * (K + 1), 2)
With pandas 0.19.2, I have:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-782385490865> in <module>()
----> 1 pd.qcut([0] * K + [1] * (K + 1), 2)
pandas/tools/tile.py in qcut(x, q, labels, retbins, precision)
173 bins = algos.quantile(x, quantiles)
174 return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
--> 175 precision=precision, include_lowest=True)
176
177
pandas/tools/tile.py in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
192
193 if len(algos.unique(bins)) < len(bins):
--> 194 raise ValueError('Bin edges must be unique: %s' % repr(bins))
195
196 if include_lowest:
ValueError: Bin edges must be unique: array([0, 1, 1])
We need some kind of option to decide how to assign values to a quantile bucket in the event that two quantiles have the same value prior to the searchsorted call. In this case, the appropriate behavior may be to assign all 1 values to the 50% quantile bucket.
There has recently been some improvement on this. With master:
In [101]: pd.qcut([0] * K + [1] * (K + 1), 2)
...
ValueError: Bin edges must be unique: array([0, 1, 1]).
You can drop duplicate edges by setting the 'duplicates' kwarg
In [102]: pd.qcut([0] * K + [1] * (K + 1), 2, duplicates='drop')
Out[102]:
[[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], ..., [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
Length: 201
Categories (1, object): [[0, 1]]
So there is now an option to deal with duplicate edges, but the option chosen here is to use fewer bins instead of assigning all values to one of the bins.
See the issue https://github.com/pandas-dev/pandas/issues/7751 and the final PR, merged a couple of days ago: https://github.com/pandas-dev/pandas/pull/15000
Effectively, duplicates='drop' will assign all of the clumped values to a single bin. In my experience with data, this happens most often when most observations are zero and there is a tail of nonzero values. For example, take 'snowfall_in_inches'. For most days, this will be zero. If we want to split into quantiles, we'll need to group all of the zero values into one bucket. duplicates='drop' should do this. Happy to improve it if there's a better way, though.
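To make the snowfall case concrete, here is a minimal sketch. The numbers are made up for illustration (8 dry days, 4 snowy days), and it assumes a pandas version that includes the duplicates keyword on qcut.

```python
import pandas as pd

# Hypothetical data: 8 days with no snow and 4 days with some snow.
snowfall_in_inches = pd.Series([0.0] * 8 + [0.5, 1.0, 2.0, 4.0])

# The quartile edges come out as [0, 0, 0, 0.625, 4], so the 0 edge repeats
# and the default duplicates='raise' rejects them.
try:
    pd.qcut(snowfall_in_inches, 4)
except ValueError as exc:
    print(exc)

# duplicates='drop' collapses the repeated edges, leaving fewer bins than
# requested; every zero-snow day ends up in the same (first) bin.
binned = pd.qcut(snowfall_in_inches, 4, duplicates="drop")
print(binned.value_counts())
```

Note that four quantiles were requested but only two bins come back, which is the trade-off discussed above.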
I looked at the behavior after #15000 -- I'm going to leave this issue open for now. We should look into the quantile algorithms in other statistical packages. For example, in R we have:
> quantile(data, c(0, 0.5, 1), type=1)
0% 50% 100%
0 1 1
> quantile(data, c(0, 0.5, 1), type=2)
0% 50% 100%
0 1 1
> quantile(data, c(0, 0.5, 1), type=3)
0% 50% 100%
0 0 1
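For readers without R at hand, the three discontinuous definitions can be sketched in plain Python. This follows the type 1 to 3 formulas described in R's quantile documentation; the helper name r_quantile is hypothetical, and the sketch assumes data is the K = 100 example from the top of the thread (the R session does not show how data was built). It also skips the floating-point fuzz that R applies.

```python
import math

def r_quantile(sorted_x, p, kind):
    """Sample quantile of pre-sorted data for R-style types 1, 2 and 3."""
    n = len(sorted_x)
    m = -0.5 if kind == 3 else 0.0
    j = math.floor(n * p + m)
    g = n * p + m - j
    if kind == 1:      # inverse of the empirical CDF
        gamma = 1.0 if g > 0 else 0.0
    elif kind == 2:    # like type 1, but averages at discontinuities
        gamma = 1.0 if g > 0 else 0.5
    elif kind == 3:    # nearest even order statistic
        gamma = 0.0 if (g == 0 and j % 2 == 0) else 1.0
    else:
        raise ValueError("only types 1, 2 and 3 are sketched here")
    # x_j and x_{j+1} are 1-indexed order statistics, clamped to [1, n].
    lo = sorted_x[min(max(j, 1), n) - 1]
    hi = sorted_x[min(max(j + 1, 1), n) - 1]
    return (1 - gamma) * lo + gamma * hi

K = 100
data = sorted([0] * K + [1] * (K + 1))  # assumed to match the R session
results = {
    kind: [r_quantile(data, p, kind) for p in (0, 0.5, 1)]
    for kind in (1, 2, 3)
}
print(results)
```

Under that assumption, types 1 and 2 put the median at 1 and type 3 puts it at 0, so whether the edges collide depends on the quantile definition.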
I think having duplicate bin edges is fine as long as we have a convention about which bin to assign the data to. I would argue in this case, the correct sample quantiles are [0, 1), [1, 1], and so we have two well-defined bins to assign the data to. In the degenerate case where multiple bins have the same start and end point, we would want to assign values to the leftmost (or rightmost) bin.
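A rough sketch of such a convention, using numpy's searchsorted directly. This is not what qcut does; assign_bins and its prefer argument are hypothetical names, bins are treated as left-closed with a closed last bin, and values are assumed to lie within the outer edges.

```python
import numpy as np

def assign_bins(values, edges, prefer="left"):
    """Map values to bin codes, tolerating repeated edges (illustrative only)."""
    values = np.asarray(values)
    edges = np.asarray(edges)
    n_bins = len(edges) - 1
    # Rightmost bin whose left edge is <= value.
    rightmost = np.searchsorted(edges, values, side="right") - 1
    if prefer == "right":
        codes = rightmost
    else:
        # If the value sits exactly on a (possibly repeated) edge, take the
        # first bin that starts there; otherwise keep the containing bin.
        first_match = np.searchsorted(edges, values, side="left")
        on_edge = first_match <= rightmost
        codes = np.where(on_edge, first_match, rightmost)
    # The last bin is closed on the right, so clip codes into range.
    return np.clip(codes, 0, n_bins - 1)

K = 100
data = [0] * K + [1] * (K + 1)

# Edges [0, 1, 1] give the bins [0, 1) and [1, 1].
codes = assign_bins(data, [0, 1, 1])
counts = np.bincount(codes, minlength=2)
print(counts)

# Degenerate case: edges [0, 1, 1, 1] have two bins that start and end at 1.
left_codes = assign_bins([0, 1], [0, 1, 1, 1], prefer="left")
right_codes = assign_bins([0, 1], [0, 1, 1, 1], prefer="right")
print(left_codes, right_codes)
```

With edges [0, 1, 1] both conventions agree: the zeros go to the first bin and the ones to the second. They only differ in the degenerate case, where the value 1 lands in bin 1 or bin 2 depending on the preference.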
Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for the same? Also, I need to know what to do with the duplicates parameter. Thank you!
Yes please to writing a test - this is often a good first step @puneet29