Pandas: qcut can fail for highly discontinuous data distributions

Created on 5 Jan 2017 · 6 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

This code fails for any K:

import pandas as pd

K = 100

# Raises ValueError: Bin edges must be unique
pd.qcut([0] * K + [1] * (K + 1), 2)

Problem description

With pandas 0.19.2, I have:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-782385490865> in <module>()
----> 1 pd.qcut([0] * K + [1] * (K + 1), 2)

pandas/tools/tile.py in qcut(x, q, labels, retbins, precision)
    173     bins = algos.quantile(x, quantiles)
    174     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
--> 175                          precision=precision, include_lowest=True)
    176 
    177 

pandas/tools/tile.py in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    192 
    193     if len(algos.unique(bins)) < len(bins):
--> 194         raise ValueError('Bin edges must be unique: %s' % repr(bins))
    195 
    196     if include_lowest:

ValueError: Bin edges must be unique: array([0, 1, 1])

Expected Output

We need some kind of option to decide how to assign values to a quantile bucket in the event that two quantiles have the same value prior to the searchsorted call. In this case, the appropriate behavior may be to assign all 1 values to the 50% quantile bucket.
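To see where the duplicate edge comes from, the quantiles can be computed directly. This is a minimal sketch using NumPy's default linear interpolation, which is an assumption here; it is not the exact code path of `qcut`, but it yields the same edges as in the traceback above.

```python
import numpy as np

K = 100
data = [0] * K + [1] * (K + 1)

# Edges for 2 quantile buckets: the 0%, 50% and 100% quantiles.
edges = np.quantile(data, [0.0, 0.5, 1.0])
print(edges)

# The median already equals the maximum, so the last two edges coincide.
has_duplicate_edges = len(np.unique(edges)) < len(edges)
print(has_duplicate_edges)
```

Because there is one more 1 than 0, the median of the 201 values is 1, which is also the maximum.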

Bug cut

All 6 comments

There has recently been some improvement on this. With master:

In [101]: pd.qcut([0] * K + [1] * (K + 1), 2)
...
ValueError: Bin edges must be unique: array([0, 1, 1]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [102]: pd.qcut([0] * K + [1] * (K + 1), 2, duplicates='drop')
Out[102]: 
[[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], ..., [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
Length: 201
Categories (1, object): [[0, 1]]

So there is an option to deal with duplicate edges, but the approach chosen here is to use fewer bins instead of assigning all tied values to one of the bins.
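One way to check how many bins actually survive is to ask for the edges with `retbins=True`. This is a small sketch; the exact interval labels differ between pandas versions, so only the counts are inspected.

```python
import pandas as pd

K = 100
data = [0] * K + [1] * (K + 1)

# duplicates="drop" removes the repeated edge, so 2 requested bins become 1.
binned, edges = pd.qcut(data, 2, duplicates="drop", retbins=True)

print(len(edges))              # number of unique edges kept
print(len(binned.categories))  # number of bins actually produced
```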

Effectively, duplicates='drop' will assign all of the clumped values to a single bin. In my experience with data, this happens most when there's a zero value for most observations and a tail of nonzero values. For example, take 'snowfall_in_inches'. For most days, this will be zero. If we want to split into quantiles, we'll need to group all of the zero values into one bucket. duplicates='drop' should do this. Happy to improve if there's a better way though.
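The snowfall case can be sketched with made-up numbers. The values below are hypothetical and chosen only so that several quartile edges collapse onto zero.

```python
import pandas as pd

# Hypothetical data: mostly snow-free days plus a short tail.
snowfall_in_inches = pd.Series([0.0] * 6 + [0.5, 1.0, 2.0, 4.0])

# Without the option, the repeated zero edges make qcut raise.
try:
    pd.qcut(snowfall_in_inches, 4)
    raised = False
except ValueError:
    raised = True

# With duplicates="drop", the repeated edges are merged and fewer bins remain.
binned = pd.qcut(snowfall_in_inches, 4, duplicates="drop")
codes = binned.cat.codes.tolist()
print(raised)
print(binned.value_counts(sort=False))
```

Four quartiles were requested, but only two bins come back: one holding the zeros (plus the smallest nonzero value) and one holding the rest of the tail.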

I looked at the behavior after #15000 -- I'm going to leave this issue open for now. We should look into the quantile algorithms in other statistical packages. For example, in R we have:

> quantile(data, c(0, 0.5, 1), type=1)
  0%  50% 100% 
   0    1    1 
> quantile(data, c(0, 0.5, 1), type=2)
  0%  50% 100% 
   0    1    1 
> quantile(data, c(0, 0.5, 1), type=3)
  0%  50% 100% 
   0    0    1 

I think having duplicate bin edges is fine as long as we have a convention about which bin to assign the data to. I would argue that in this case the correct quantile bins are [0, 1) and [1, 1], so we have two well-defined bins to assign the data to. In the degenerate case where multiple bins have the same start and end point, we would want to assign values to the leftmost (or rightmost) bin.
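A convention like this can be sketched with `np.searchsorted`. The helper `assign_bins` below is hypothetical and not part of pandas; it assumes left-closed bins with a closed last bin, and it sends values that sit on repeated edges to the rightmost bin.

```python
import numpy as np

def assign_bins(values, edges):
    """Hypothetical helper: map each value to a bin index, allowing duplicate edges.

    Bins are [edges[i], edges[i+1]), except the last one, which is closed.
    A value equal to a repeated edge lands in the rightmost matching bin.
    """
    values = np.asarray(values)
    edges = np.asarray(edges)
    n_bins = len(edges) - 1
    # side="right" counts the edges <= value; subtracting 1 gives the bin index.
    idx = np.searchsorted(edges, values, side="right") - 1
    # Values equal to the top edge would overshoot, so clip into the last bin.
    return np.clip(idx, 0, n_bins - 1)

K = 100
data = [0] * K + [1] * (K + 1)
edges = np.quantile(data, [0.0, 0.5, 1.0])  # duplicate edge at 1

codes = assign_bins(data, edges)
print(np.bincount(codes))
```

With this rule the zeros fall in [0, 1) and the ones fall in [1, 1], so both requested buckets are populated. Choosing the leftmost bin instead would need a different tie rule.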

Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for the same? Also, I need to know what to do with the duplicates parameter. Thank you!

Yes please to writing a test - this is often a good first step @puneet29
