Pandas: pd.cut() with user-specified labels and bins arguments does not use user-provided labels

Created on 29 May 2018 · 5 Comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd

bins = pd.IntervalIndex.from_tuples([(0, 2), (2, 2.5), (2.5, 5)])
pd.cut([1, 1.5, 2, 2.5, 3, 3.5], bins=bins, labels=['a', 'b', 'c']).tolist()

# output of pd.cut(...) is:
# [Interval(0.0, 2.0, closed='right'), Interval(0.0, 2.0, closed='right'), Interval(0.0, 2.0, closed='right'), Interval(2.0, 2.5, closed='right'), Interval(2.5, 5.0, closed='right'), Interval(2.5, 5.0, closed='right')]

Problem description

When labels are provided as an argument to pd.cut() with user-specified bins, then the output does not use the labels argument.

Expected Output

['a', 'a', 'a', 'b', 'c', 'c']

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.0.5
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.25.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.9.5
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug, Interval, cut

All 5 comments

Did you see the docstring for pd.cut?

labels : array or bool, optional
    Specifies the labels for the returned bins. Must be the same length as
    the resulting bins. If False, returns only integer indicators of the
    bins. This affects the type of the output container (see below).
    This argument is ignored when `bins` is an IntervalIndex.

I'm not sure exactly why, but labels is ignored when bins is an IntervalIndex.
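For comparison, here is a minimal sketch of the non-IntervalIndex case, assuming the same edges are passed as a plain list. The variable names are illustrative and not taken from the report. With list bins the labels do appear to be applied:

```python
import pandas as pd

values = [1, 1.5, 2, 2.5, 3, 3.5]

# Same edges as the IntervalIndex in the report, passed as a plain list.
# With the default right=True the bins are (0, 2], (2, 2.5], (2.5, 5].
edges = [0, 2, 2.5, 5]

labelled = pd.cut(values, bins=edges, labels=['a', 'b', 'c']).tolist()
print(labelled)  # ['a', 'a', 'a', 'b', 'c', 'c']
```

Note that 2 falls in the right-closed bin (0, 2], so it is labelled 'a'.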

Yeah, it's not quite clear why labels are ignored when the bins are given as an IntervalIndex.

cc @jschendel Do you know why labels is ignored when bins is an IntervalIndex?

I don't know why labels is ignored, nor do I immediately see a reason why it couldn't be supported. It looks like this was done during the initial IntervalIndex implementation, which was before my time, so I'm not entirely sure what the historical context is. If I had to venture a guess, I'd say it's not so much that this can't be supported, but rather that it would take a bit of work to do properly.

The _bins_to_cuts function is what's responsible for applying the labels, and it currently has an IntervalIndex fastpath that ignores labels:

https://github.com/pandas-dev/pandas/blob/453fa85a8b88ca22c7b878a3fcf97e068f11b6c4/pandas/core/reshape/tile.py#L327-L332

I'm not sure what the optimal way of implementing labels with an IntervalIndex is, but I see two general approaches. The first would be to just add labels-specific code to the IntervalIndex fastpath. This could be viable if the implementation is either short or sufficiently different from the existing code, but my worry is that it could lead to duplicate code that would become a burden to maintain.
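As a rough illustration of the first approach, the sketch below applies labels on top of an indexer lookup using only public pandas calls. The helper name `cut_with_interval_labels` is hypothetical, and this is not the code in `_bins_to_cuts`. It only shows the general shape such a labels-aware path might take, assuming non-overlapping bins:

```python
import pandas as pd


def cut_with_interval_labels(values, bins, labels):
    """Hypothetical helper: bin `values` by an IntervalIndex and apply `labels`."""
    if len(labels) != len(bins):
        raise ValueError("labels must have the same length as bins")
    # Position of the interval containing each value, or -1 if none does.
    # This lookup assumes the intervals in `bins` do not overlap.
    codes = bins.get_indexer(pd.Index(values))
    # A code of -1 becomes a missing value in the resulting Categorical.
    return pd.Categorical.from_codes(codes, categories=labels, ordered=True)


bins = pd.IntervalIndex.from_tuples([(0, 2), (2, 2.5), (2.5, 5)])
result = cut_with_interval_labels([1, 1.5, 2, 2.5, 3, 3.5], bins, ['a', 'b', 'c'])
print(result.tolist())  # ['a', 'a', 'a', 'b', 'c', 'c']
```

Values that fall outside every interval come back as missing rather than raising, which mirrors how pd.cut treats out-of-range input.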

The other approach would be to refactor _bins_to_cuts in a more generic way to handle both without impacting performance. There could be some minor annoyances here to reconcile, e.g. the first thing that comes to mind is that for IntervalIndex you want labels to be the same length as bins, but when bins is an array you want bins to have one extra element (n + 1 endpoints --> n intervals), and I suspect there'd be other similar things.
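The length mismatch mentioned above can be seen in a small sketch. The variable names are illustrative, and `IntervalIndex.from_breaks` is used here only to turn the edges into intervals:

```python
import pandas as pd

edges = [0, 2, 2.5, 5]                           # n + 1 = 4 endpoints
intervals = pd.IntervalIndex.from_breaks(edges)  # n = 3 intervals
labels = ['a', 'b', 'c']

# Array bins: labels are one shorter than the list of edges.
print(len(edges), len(labels))      # 4 3
# IntervalIndex bins: labels would match the bins one-to-one.
print(len(intervals), len(labels))  # 3 3
```

A generic validation step would therefore have to compare len(labels) against len(bins) - 1 in one case and len(bins) in the other.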

I don't think either approach would be particularly difficult, but both would require a bit of work to make sure that existing functionality is unchanged and that the result is maintainable.

Just in case someone is looking for a workaround, you can create a helper dictionary and look up the values.
For the list above:

# Map each interval in bins to its label.
d = dict(zip(bins, ['a', 'b', 'c']))
# labels= is omitted because pd.cut ignores it for IntervalIndex bins.
[d.get(i) for i in pd.cut([1, 1.5, 2, 2.5, 3, 3.5], bins=bins).tolist()]

For a pandas Series, you can use Series.map:

d = dict(zip(bins, ['a', 'b', 'c']))
pd.cut(series, bins).map(d)
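
A possible alternative to the dictionary lookup, not taken from the thread, is to rename the categories of the result positionally with `Series.cat.rename_categories`. This sketch assumes the categories returned by pd.cut are in the same order as bins, and the example Series is made up for illustration:

```python
import pandas as pd

bins = pd.IntervalIndex.from_tuples([(0, 2), (2, 2.5), (2.5, 5)])
series = pd.Series([1, 1.5, 2, 2.5, 3, 3.5])  # example data, not from the thread

binned = pd.cut(series, bins)
# Replace the interval categories with labels, position by position.
labelled = binned.cat.rename_categories(['a', 'b', 'c'])
print(labelled.tolist())  # ['a', 'a', 'a', 'b', 'c', 'c']
```

Unlike the dictionary approach, this keeps the result as an ordered categorical Series.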