Pandas: Stata read categoricals gives ValueError: Categorical categories must be unique

Created on 6 Aug 2016 · 13 Comments · Source: pandas-dev/pandas

I am trying to open some Stata files generated in IPUMS International, but I am getting a ValueError: Categorical categories must be unique. I opened in Stata and could not find a repeated category for the column I am trying to import. I had similar issues with other datasets from the same source, which seemed to be generated by missing values, but that does not seem to be the case here. Here's the link to the file I am trying to read.

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.read_stata('ipumsi_00014.dta', columns=['ethnicsn'])

Expected Output

df.shape == (1694761, 1)

output of pd.show_versions()

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 18.2
Cython: 0.22.1
numpy: 1.11.1
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 3.2.1
sphinx: None
patsy: 0.2.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 0.8.0
tables: None
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 2.1.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.4
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.34.0
pandas_datareader: None

Labels: Error Reporting, IO Stata

All 13 comments

Here's the link to a file with a column with only categoricals (non-repeated) and the same issue arises.

cc @bashtage
cc @kshedden

The value labels for the data posted are:

{101: 'bainouk',
 102: 'badiaranke',
 ...
 111: 'diola',
 112: 'fulani',
 113: 'wolof',
 114: 'laobe',
 115: 'lebou',
 ...
 128: 'tandanke',
 129: 'toucouleur',
 130: 'wolof',
 131: 'khassonke',
 ...
 398: 'other countries',
 999: 'unknown'}

Note that 113 and 130 are both labeled wolof. This is the problem. I'm not sure what the correct behavior is here, since two numeric values map to the same name. Categoricals can't represent this, since there must be a 1-to-1 mapping between the underlying numeric store and the labels.
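
A minimal sketch of the failure, independent of the Stata file: building a categorical whose category list repeats a label raises the same error. The small `labels` dict is an illustrative subset of the value labels above, not the full set from the file.

```python
import pandas as pd

# Illustrative subset of the value labels: codes 113 and 130 share a label.
labels = {112: "fulani", 113: "wolof", 130: "wolof"}

try:
    # The category list contains "wolof" twice, which pandas rejects.
    pd.Categorical(["fulani", "wolof"], categories=list(labels.values()))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```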

I suppose the error could be trapped and a more meaningful error raised, possibly with a report of the repeated labels.
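
As a rough sketch of what such a report could look like, the helper below lists every label that is attached to more than one code. `find_duplicate_labels` is a hypothetical name used for illustration and is not part of pandas; the input is a plain dict shaped like the value labels shown above.

```python
from collections import defaultdict


def find_duplicate_labels(value_labels):
    """Return {label: [codes]} for every label attached to more than one code."""
    codes_by_label = defaultdict(list)
    for code, label in value_labels.items():
        codes_by_label[label].append(code)
    return {
        label: sorted(codes)
        for label, codes in codes_by_label.items()
        if len(codes) > 1
    }


# Illustrative subset of the value labels from the thread.
ethnicity_labels = {112: "fulani", 113: "wolof", 114: "laobe", 130: "wolof", 999: "unknown"}
duplicates = find_duplicate_labels(ethnicity_labels)
print(duplicates)  # {'wolof': [113, 130]}
```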

I thought it was having trouble due to possibly repeated numbering of the categories. Not sure why it should care if the label values (strings) are repeated, since I imagine the categories should work regardless of the value of the label value, no? Or am I missing something? I agree that a better error message and even a print out of the repeated categories would be very useful.

Thanks for the help!

Another possibility would be to allow Pandas to read the categories as strings.

Stata stores a value-labeled variable as numbers plus a separate mapping from numbers to labels. Your data has 2 numeric values that correspond to the same label. In pandas, a category is identified by its label, and the underlying integer codes only index into the list of categories, so there is no way for a categorical to have 2 categories with the same label. Stata value labels are not equivalent to pandas categoricals, only close. This is a case where the difference matters.

BTW, you can use convert_categoricals=False to read the data and return the integer values. You can also use StataReader to access the value labels. From these you can do anything you want with the data.

so this looks like buggy Stata behavior? @bashtage

yeah I would prob raise here (if convert_categoricals is set), let the user figure it out. (I suppose you could just turn categorical conversion off and show a warning).

Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels?

It is not buggy Stata behavior so much as just different. In Stata, value labels are just labels and are not actually values. Pandas doesn't have the concept of a labeled Series, and so there is no way to map between the two perfectly.

Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels?

Using your short file:

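import pandas as pd
# Read the raw integer codes instead of converting to categoricals
df = pd.read_stata('ipumsi_00014_ethn.dta', convert_categoricals=False)
# Read the value labels separately through StataReader
sr = pd.io.stata.StataReader('ipumsi_00014_ethn.dta')
vl = sr.value_labels()
sr.close()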

Then you can do whatever you need to with vl and df.
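
For example, one possible way to combine the two is sketched below with stand-in data, so it runs without the Stata file. It assumes `vl` is a dict of dicts keyed by value-label name; the names `labels` and `codes` are placeholders for one inner dict of `vl` and one integer column of `df`. The first option gives plain strings, as suggested earlier in the thread. The second makes the repeated label unique by appending its code, which is only one of several reasonable choices.

```python
import pandas as pd

# Stand-ins: `labels` plays the role of one inner dict of `vl`,
# `codes` plays the role of the integer column read from the file.
labels = {112: "fulani", 113: "wolof", 130: "wolof", 999: "unknown"}
codes = pd.Series([113, 130, 112, 999], name="ethnicsn")

# Option 1: replace codes with their labels as plain strings.
as_strings = codes.map(labels)

# Option 2: build a categorical by making repeated labels unique first.
label_counts = pd.Series(labels).value_counts()
unique_labels = {
    code: "{} ({})".format(label, code) if label_counts[label] > 1 else label
    for code, label in labels.items()
}
as_categorical = codes.map(unique_labels).astype("category")

print(as_strings.tolist())
print(as_categorical.tolist())
```

With option 1, codes 113 and 130 become indistinguishable after the mapping, so keep the integer column around if the distinction matters.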

@bashtage Cool! Thanks!
