Pandas: API: preferred way to check if column/Series has Categorical dtype

Created on 14 Nov 2014 · 20 comments · Source: pandas-dev/pandas

From http://stackoverflow.com/questions/26924904/check-if-dataframe-column-is-categorical/26925340#26925340

What is the preferred way to check for categorical dtype?

I now answered:

In [42]: isinstance(df.cat_column.dtype, pd.core.common.CategoricalDtype)
Out[42]: True

In [43]: pd.core.common.is_categorical_dtype(df.cat_column)
Out[43]: True

But:

  • this seems somewhat buried in pandas. Should there be a more top-level function to do this?
  • we should add the preferred way to the categorical docs.
Labels: API Design, Categorical, Docs

All 20 comments

That was me asking the question. I originally started writing it up because I was working on a PR for pandas, and while writing I discovered is_categorical_dtype(). Since I was working on internal pandas code anyway, that works for my current usage.

Having something that's not so deeply buried would be good though. I tried df.col.dtype == 'category' because I thought that was a pretty standard way of doing a quick type check. Unless there are strong reasons not to use this method, it should probably work the same for categoricals as it does for other types (e.g. df.col.dtype == 'float64')

df.col.dtype == 'category' _does_ appear to work for me on pandas 0.15.1. As @onesandzeroes says, I think it should be the preferred way to check for categorical types.

It does look like Categorical.dtype.__ne__ needs to be defined, though -- currently it's not, so Python defaults to something arbitrary.

@shoyer as you can see in my answer on SO, it does indeed work, but the problem is that it raises for other dtypes instead of returning False, which is not very handy (and that is a numpy thing).
With the example from SO:

In [86]: df
Out[86]:
  cat_column   x   y
0          c   0   0
1          d  10   4
2          f  20   8
3          a  30  12
4          b  40  16
5          e  50  20

In [87]: df.cat_column.dtype == 'category'
Out[87]: True

In [88]: df.x.dtype
Out[88]: dtype('float64')

In [89]: df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-89-aff611b16544> in <module>()
----> 1 df.x.dtype == 'category'

TypeError: data type "category" not understood

So exactly because that is not working (as you would expect: returning False), I think we should provide a common way to do this (or at least document this in the categorical docs what is the best way to do this)

Ah, I see. A reasonable solution might be to wrap the dtype in str, e.g., str(df.x.dtype) == 'category'.
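A minimal sketch of that str() workaround, reusing the column names from the example above (assuming a recent pandas/numpy):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat_column": pd.Series(list("cdfabe"), dtype="category"),
    "x": np.arange(0, 60, 10, dtype="float64"),
})

# str() turns both numpy dtypes and the categorical dtype into plain
# strings, so the comparison never reaches numpy's dtype parser and
# never raises, whatever the column's dtype is.
print(str(df["cat_column"].dtype) == "category")  # True
print(str(df["x"].dtype) == "category")           # False
```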

cc @JanSchulz

In [10]: df = DataFrame({'A' : np.random.randn(5), 'B' : Series(list('aabbc')).astype('category')})

In [11]: df
Out[11]: 
          A  B
0 -0.064981  a
1  0.852717  a
2  0.693611  b
3  0.411486  b
4 -1.425537  c

In [12]: df.dtypes
Out[12]: 
A     float64
B    category
dtype: object

In [13]: df.dtypes == 'category'
Out[13]: 
A    False
B     True
dtype: bool

In [14]: df.select_dtypes(include=['category'])
Out[14]: 
   B
0  a
1  a
2  b
3  b
4  c

In [16]: pd.core.common.is_categorical_dtype(df.A.dtype)
Out[16]: False

So the preferred method of checking dtypes is simply to use select_dtypes, or [13] works as well. Checking np.dtype('float64') == 'category' blows up - I think we should maybe create a bug report to have this fixed upstream; not much we can do about it here. Of course, as @jtratner pointed out, `np.dtype('float64').name == 'category'` will work correctly with numpy dtypes.

So I don't think it is necessary to have the user actually use anything internal.

If pressed, com.is_categorical_dtype(...) would be ok

would not suggest any mention/use of com.CategoricalDtype (as an instance check) though - this is TOO internal.

To be honest this should rarely if ever come up. If the OP is trying to check individual dtypes for category then this is the wrong approach (and most certainly .select_dtypes() is the correct method).

So if someone wants to add a small doc section, ok.
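A small sketch of the two recommended patterns from this comment (column names follow the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.random.randn(5),
    "B": pd.Series(list("aabbc"), dtype="category"),
})

# Boolean mask over all columns: df.dtypes is an object Series, so
# the elementwise comparison returns False for non-matching dtypes.
mask = df.dtypes == "category"
print(mask["B"])  # True

# Or select the categorical columns directly.
cat_cols = df.select_dtypes(include=["category"])
print(list(cat_cols.columns))  # ['B']
```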

Shouldn't this work?

if s.cat: 
   print("It's a categorical!")

@JanSchulz

that will raise for non-cat types (as does .dt), which prevents user error. I think this is correct (these _should_ raise)

@JanSchulz no, because it gives a TypeError instead of False if it is not a categorical:

In [143]: df
Out[143]:
          A  B
0  0.299586  a
1  0.335853  a
2 -0.135405  b
3  1.247738  b
4 -0.232270  c

In [144]: bool(df.B.cat)
Out[144]: True

In [145]: bool(df.A.cat)
....
TypeError: Can only use .cat accessor with a 'category' dtype

I think this is an issue that should be raised to numpy:

In [9]: np.dtype('i8') == 'foo'
TypeError: data type "foo" not understood
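As mentioned earlier in the thread, comparing the dtype's .name string avoids this numpy behavior entirely (a sketch, assuming a recent pandas):

```python
import pandas as pd

s_num = pd.Series([1, 2, 3], dtype="int64")
s_cat = pd.Series(list("abc"), dtype="category")

# .name is a plain string on both numpy and categorical dtypes, so
# the comparison cannot raise, whatever the dtype is.
print(s_num.dtype.name == "category")  # False
print(s_cat.dtype.name == "category")  # True
```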

closing this, as the behavior on the pandas side is sane

well, for all who care, I tried to push upstream. The user is now subject to random numpyisms that are really hard to fix downstream (impossible in this case).

The string coercion for dtype equality _is_ an ugly API, and @njsmith is right that we are probably misusing the API here with dtype == 'category'. The category dtype should probably include all the metadata for the type of category (i.e., the categories and sortedness). This is how it currently works in dynd, for example.

So I think it would indeed be better to do this differently. Perhaps dtype.kind == 'C'? Or we could even make pd.is_categorical part of the API. Both options are compatible with numpy and not too terrible, IMO (the first has even fewer characters).

Either way, I think this should probably evaluate to False (because the categories are different):

In [9]: pd.Categorical([1]).dtype == pd.Categorical([2]).dtype
Out[9]: True

If the last example should work (I think that was discussed during the design of Categoricals), then we have to put the categories into the dtype.

@shoyer I disagree.

DyND does support categorical as a full-fledged datashape (see here), but using that implementation is probably a ways away.

CategoricalDtype is basically a super-type for categories. You _could_ implement a concrete sub-class that allows a categories comparison, and maybe we should do that; it is nicer from a theoretical point of view.

But to be honest it's a fair amount of complexity, and I am not sure how much we would gain from it.

I am not sure anything is actually gained from explicit type checking with a pd.is_categorical().
pandas is meant to be practical, and I think s.dtype == 'category' is useful and in the spirit of all other numpy dtype comparisons.

@jreback It's one thing for pandas to take a pragmatic approach instead of waiting for a full solution, but designing an API that is incompatible with that full solution seems like a bad idea. s.dtype == 'category' is quite practical but it probably will/should break when we switch to dynd. The dynd API is certainly more flexible, but IMO Nathaniel raised some good points that will likely apply there as well.

In any case, perhaps it was premature to close this issue? s.dtype != 'category' does _not_ currently work -- do we have a preferred alternative? I do understand that you are frustrated with the response from upstream, but even if numpy changed things tomorrow this would still be an issue.

(I do agree it's probably not worth refining CategoricalDtype given that it's pretty well hidden from the public API.)

@shoyer I'll buy that s.dtype != 'category' should work. Pls create a separate issue for that.

This is closed because pandas has done all it can to facilitate s.dtype == 'category' and provides many solutions which don't need it.

Changing to use the DyND type system will likely cause a bit of pain all around (good pain, though), and will have to be revisited when DyND is more of a fixture.

If you have a better API idea which doesn't break anything, all ears.

This is closed because pandas has done all it can to facilitate s.dtype == 'category' and provides many solutions which don't need it.

The reason I think this should not be closed already is the reason I initially opened this issue: just to document this in the categorical.rst docs.
That isn't done yet, and we all know s.dtype == 'category' has its problems (whether these are limitations on the numpy side or not is another discussion, but it does not really matter for the _current situation and the users who have this problem_).

So I can do a quick PR to include this in the docs; just make a quick choice of what I should put in there:

  • pd.core.common.is_categorical_dtype(df['cat'])
  • df['cat'].dtype.name == 'category'
  • ..

Or provide this is_categorical_dtype (or is_categorical) as a top-level function.

well, neither of those is preferred at all

df.dtypes == 'category'
df.select_dtypes(include=['category'])

are the most correct ways to do this.
If you want to mention in a very small note that df['cat'].dtype.name == 'category' also works, then I'm ok with that.
Using com.is_categorical_dtype(...) is actually ok too, but that is so far from what the average user does that it shouldn't be advertised.

of course s.dtype == 'category' WILL work if its actually a categorical type.....

amazing that this works!

In [2]: Series([1,2,3],dtype='int32').dtype=='i123'
Out[2]: True

going to bump this

Another option (see #9629) is that the preferred way to check if a series is categorical should be hasattr(s, 'cat'). This will work with pandas 0.16 or newer and sidesteps the numpy comparison issues...
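A sketch of that hasattr approach (pandas >= 0.16, where the .cat accessor raises AttributeError on non-categorical Series, so hasattr returns False):

```python
import pandas as pd

def is_categorical(s):
    # The .cat accessor only exists on categorical Series; on other
    # dtypes attribute access raises AttributeError, so hasattr
    # doubles as a type check without touching numpy dtype parsing.
    return hasattr(s, "cat")

print(is_categorical(pd.Series(list("ab"), dtype="category")))  # True
print(is_categorical(pd.Series([1.0, 2.0])))                    # False
```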

