import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
"""
Returns the given dataframe with dummy columns for quantiles of a given column. Quantiles can be a int to
specify equal spaced quantiles or an array of quantiles
:param data: DataFrame :type data: DataFrame
:param column: column to which add quantiles :type column: string
:param quantiles: number of quantiles to generate or list of quantiles :type quantiles: Union[int, list of float]
:return: DataFrame
"""
if isinstance(quantiles, int):
labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
if isinstance(quantiles, list):
labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
del labels[0] # Bin labels must be one fewer than the number of bin edges
data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
q=quantiles,
labels=labels, duplicates='drop'))], axis=1)
return data
zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#numbers
#0 0.0
#1 0.0
#2 0.0
#3 33.0
#4 81.0
#5 13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
File "pandas_qcut.py", line 29, in <module>
print(add_quantiles(df, 'numbers'))
File "pandas_qcut.py", line 20, in add_quantiles
labels=labels, duplicates='drop'))], axis=1)
File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
dtype=dtype, duplicates=duplicates)
File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges
When using this function with quantiles that return repeated bins, the function raises "ValueError: Bin labels must be one fewer than the number of bin edges". When using the optional parameter "duplicates" the only way to pass a valid "labels" parameters is checking for duplicate bins beforehand, repeating code in order to calculate the bins.
Pd.qcut should return the quantilizated column with the labels corresponding to the indices of the unique bins.
E.g output of add_quantiles.:
numbers numbers_50q numbers_75q numbers_100q
0 0.0 1 0 0
1 0.0 1 0 0
2 0.0 1 0 0
3 33.0 0 0 1
4 81.0 0 0 1
5 13.0 0 1 0
pd.show_versions()
commit: None
pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@olveirap : Thanks for reporting this? Could you update your issue to show what output you're getting currently and what output you would expect? That would be very helpful.
cc @jreback
Sorry for not being clear enough, I've edited the issue with more clear expected output and current behavior
Can you make your sample a minimally reproducible one? There are a lot of extraneous elements that make it more difficult to decipher.
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Sorry again, I provide here a more simple use of pd.qcut which represent the issue I'm talking about:
import pandas as pd
import numpy as np
zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr=np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
df
numbers | |
---|---|
0 | 0.0 |
1 | 0.0 |
2 | 0.0 |
3 | 84.0 |
4 | 95.0 |
5 | 77.0 |
labels = ['numbers_25q', 'numbers_50q', 'numbers_75q', 'numbers_100q']
pd.qcut(x=df['numbers'],
q=4,
labels=labels, duplicates='drop')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-1afe71bbc979> in <module>()
1 pd.qcut(x=df['numbers'],
2 q=4,
----> 3 labels=labels, duplicates='drop')
~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py in qcut(x, q, labels, retbins, precision, duplicates)
204 fac, bins = _bins_to_cuts(x, bins, labels=labels,
205 precision=precision, include_lowest=True,
--> 206 dtype=dtype, duplicates=duplicates)
207
208 return _postprocess_for_cut(fac, bins, retbins, x_is_series,
~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
250 else:
251 if len(labels) != len(bins) - 1:
--> 252 raise ValueError('Bin labels must be one fewer than '
253 'the number of bin edges')
254 if not is_categorical_dtype(labels):
ValueError: Bin labels must be one fewer than the number of bin edges
labels = ['numbers_50q', 'numbers_75q', 'numbers_100q']
pd.qcut(x=df['numbers'],
q=4,
labels=labels, duplicates='drop')
0 numbers_50q
1 numbers_50q
2 numbers_50q
3 numbers_100q
4 numbers_100q
5 numbers_75q
Name: numbers, dtype: category
Categories (3, object): [numbers_50q < numbers_75q < numbers_100q]
Not sure that your desired output is plausible here as it's rather ambiguous what the bins should be when you are asking for 4 of them but only 3 ranges are really possible.
So in that case the ValueError
makes sense, though perhaps a better error message can be thrown
Do you think dropping the label on the same index of duplicate bin is a bad solution?
I guess a warning while doing so would be the best of both worlds, informing the user of the possible ambiguity while returning something useful for some cases (such as mine)
Do you think dropping the label on the same index of duplicate bin is a bad solution?
Only because I don't think it's generalizable. With your example what is the differentiator that makes you want to drop "_25" but keep "_100" instead of the other way around? What would happen if you ended up with say only 1 bin but 4 labels?
I don't think there is a clear cut answer to situations like the above so I'd be -1 here. Of course let's see what others think
Because that would alter the order of the labels in a way that they are no longer assigned to the intended quantile. In the order of the labels it's implicitly passed the assignment to the bin, if you drop the second or the first and you drop the last label then you are arbitrarly changing the assigned labels.
First label was to the first quantile, since first and second quantile are repeated upon calculating the bins, the correct label for the unique bins calculated is to be the one of the second quantile, since it's equivalent to using the following code:
labels = ['numbers_50q', 'numbers_75q', 'numbers_100q']
quantiles = [.0,.5,.75,1.]
pd.qcut(x=df['numbers'],
q=quantiles,
labels=labels)
0 numbers_50q
1 numbers_50q
2 numbers_50q
3 numbers_100q
4 numbers_100q
5 numbers_75q
Name: numbers, dtype: category
Categories (3, object): [numbers_50q < numbers_75q < numbers_100q]
Here is an even simpler example. It works with duplicates='drop' alone:
>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True)
ValueError: Bin edges must be unique: array([ 97., 97., 97., 99.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop')
([(96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0], (96.999, 99.0]]
Categories (1, interval[float64]): [(96.999, 99.0]], array([ 97., 99.]))
But if you try to apply labels, then it fails:
>>> pandas.qcut([ord(x) for x in list('aaaaaabc')], q=3, retbins=True, duplicates='drop', labels=[1, 2, 3])
ValueError: Bin labels must be one fewer than the number of bin edges
There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it's pretty much impossible to use duplicates='drop'
and labels
together reliably.
Sorry, I found your example a bit convoluted, so this:
[ord(x) for x in list('aaaaaabc')
Returns this:
[97, 97, 97, 97, 97, 97, 98, 99]
What the code should try to do with q=3 is separate the numbers between the 0-percentile and 33-percentile in a bin, the same for 33-percentile and 66-percentile and lastly 66-percentile and 100-percenile. In the array above the value 97 is inside every bin, so what you get is a bin that goes from the 0-percentile to 100-percentile.
There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it's pretty much impossible to use duplicates='drop' and labels together reliably.
Why? as far as I'm understanding the code, from this line: https://github.com/pandas-dev/pandas/blob/0409521665bd436a10aea7e06336066bf07ff057/pandas/core/reshape/tile.py#L327
I think calculating the difference between bins and unique_bins is pretty straightforward. I'm aware that what labels to keep when dropping bins isn't as evident and it depends of the use of labels.
In my example, my labels referred to the upper limit of the bin and that's why I was pushing for dropping the ones before the first nonduplicate, but this could be configured with an optional parameter to behave in the opposite way, keeping the label from first duplicate bin and absorving the ones that come after.
I'm aware that at this point I'm probably nitpicking about a functionality probably noone uses like me, I will try to do a fork with this functionality for myself if you don't find it would be proper to have it here
Moving test below to follow the line "bins = algos.quantile(x, quantiles)" fixed the problem for me.
#else:
# if len(labels) != len(bins) - 1:
# raise ValueError('Bin labels must be one fewer than '
# 'the number of bin edges')
Most helpful comment
Here is an even simpler example. It works with duplicates='drop' alone:
But if you try to apply labels, then it fails:
There is no way to know in advance how many bin edges Pandas is going to drop, or even which ones it has dropped after the fact, so it's pretty much impossible to use
duplicates='drop'
andlabels
together reliably.