Pandas: Stata read categoricals gives ValueError: Categorical categories must be unique

Created on 6 Aug 2016 · 13 Comments · Source: pandas-dev/pandas

I am trying to open some Stata files generated in IPUMS International, but I am getting a ValueError: Categorical categories must be unique. I opened the file in Stata and could not find a repeated category for the column I am trying to import. I had similar issues with other datasets from the same source, which seemed to be caused by missing values, but that does not seem to be the case here. Here's the link to the file I am trying to read.

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.read_stata('ipumsi_00014.dta', columns=['ethnicsn'])

Expected Output

df.shape == (1694761, 1)

output of `pd.show_versions()`

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 18.2
Cython: 0.22.1
numpy: 1.11.1
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 3.2.1
sphinx: None
patsy: 0.2.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 0.8.0
tables: None
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 2.1.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.4
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.34.0
pandas_datareader: None

Error Reporting IO Stata

Source

ozak

All 13 comments

Here's the link to a file with a column with only categoricals (non-repeated) and the same issue arises.

ozak on 6 Aug 2016

cc @bashtage
cc @kshedden

jreback on 7 Aug 2016

The value labels for the data posted are:

{101: 'bainouk',
 102: 'badiaranke',
 ...
 111: 'diola',
 112: 'fulani',
 113: 'wolof',
 114: 'laobe',
 115: 'lebou',
 ...
 128: 'tandanke',
 129: 'toucouleur',
 130: 'wolof',
 131: 'khassonke',
 ...
 398: 'other countries',
 999: 'unknown'}

Note 130 and 113: wolof. This is the problem. I'm not sure what the correct behavior is here, since there are two numeric values that map to the same name. Categoricals don't understand this, since there must be a 1-to-1 mapping between the underlying numeric store and the labels.
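The constraint can be reproduced without the Stata file. A minimal sketch, assuming only a handful of the labels from the excerpt above (the `labels` dict and the codes passed in are illustrative, not the full label set):

```python
import pandas as pd

# A few of the value labels quoted above; codes 113 and 130 share 'wolof'.
labels = {112: "fulani", 113: "wolof", 129: "toucouleur", 130: "wolof"}

try:
    # The categories list contains 'wolof' twice, so pandas refuses to build it.
    pd.Categorical.from_codes([0, 1, 3], categories=list(labels.values()))
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

The failure comes from the category list itself, not from the data values, which is why no repeated category is visible when browsing the column in Stata.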

bashtage on 8 Aug 2016

I suppose the error could be trapped and a more meaningful error raised instead, possibly with a report.
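As a rough idea of what such a report could contain, here is a sketch of a helper that lists the labels attached to more than one code. `find_repeated_labels` is a hypothetical name, not part of pandas, and the input is assumed to be a plain `{code: label}` dict like the one shown earlier:

```python
from collections import defaultdict


def find_repeated_labels(value_labels):
    """Return {label: sorted codes} for labels attached to more than one code.

    Hypothetical helper for illustration; not a pandas function.
    """
    codes_by_label = defaultdict(list)
    for code, label in value_labels.items():
        codes_by_label[label].append(code)
    return {
        label: sorted(codes)
        for label, codes in codes_by_label.items()
        if len(codes) > 1
    }


labels = {112: "fulani", 113: "wolof", 129: "toucouleur", 130: "wolof"}
repeated = find_repeated_labels(labels)
print(repeated)
```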

bashtage on 8 Aug 2016

I thought it was having trouble due to possibly repeated numbering of the categories. Not sure why it should care if the label values (strings) are repeated, since I imagine the categories should work regardless of the value of the label value, no? Or am I missing something? I agree that a better error message and even a print out of the repeated categories would be very useful.

Thanks for the help!

ozak on 8 Aug 2016

Another possibility would be to allow Pandas to read the categories as strings.
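One way to get string values today, sketched under the assumption that the integer codes and a `{code: label}` dict are already in hand (both are made up here for illustration), is to map the codes through the dict:

```python
import pandas as pd

# Illustrative stand-ins for the real column and its value labels.
labels = {112: "fulani", 113: "wolof", 130: "wolof"}
codes = pd.Series([113, 130, 112, 999], name="ethnicsn")

# Series.map looks each code up in the dict; codes without a label become NaN.
as_strings = codes.map(labels)
print(as_strings.tolist())
```

Note that codes 113 and 130 both become 'wolof' after this step, so the distinction between them is lost unless the original codes are kept alongside.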

ozak on 8 Aug 2016

Stata stores value-labeled variables as labels and some number. Your data has 2 numbers that correspond to the same label. In pandas, a label is as good as its underlying integer data type, and so there is no way for a categorical to have 2 values with the same label. Stata value labels are not equivalent to pandas categoricals, only close. This is a case where the difference matters.

bashtage on 9 Aug 2016

BTW, you can use convert_categoricals=False to read the data and return the integer values. You can also use StataReader to access the value labels. From these you can do anything you want with the data.

bashtage on 9 Aug 2016

so this looks like buggy Stata behavior? @bashtage

yeah I would prob raise here (if convert_categoricals is set), let the user figure it out. (I suppose you could just turn categorical conversion off and show a warning).
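The "turn conversion off and show a warning" option can also be approximated on the user side. A minimal sketch, assuming a hypothetical wrapper named `read_stata_lenient` (not a pandas function); the `reader` argument exists only so the behavior can be exercised without a real .dta file:

```python
import warnings

import pandas as pd


def read_stata_lenient(path, reader=pd.read_stata, **kwargs):
    """Try categorical conversion first; on ValueError, warn and return raw codes.

    Hypothetical wrapper for illustration. It catches any ValueError from the
    reader, not only the duplicate-label case.
    """
    kwargs.pop("convert_categoricals", None)  # this wrapper controls the flag
    try:
        return reader(path, convert_categoricals=True, **kwargs)
    except ValueError as exc:
        warnings.warn(
            f"Categorical conversion failed ({exc}); returning raw codes instead.",
            UserWarning,
            stacklevel=2,
        )
        return reader(path, convert_categoricals=False, **kwargs)
```

Usage would look like `df = read_stata_lenient('ipumsi_00014.dta', columns=['ethnicsn'])`, with the warning signalling that the column holds integer codes rather than labels.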

jreback on 9 Aug 2016

Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels?

ozak on 9 Aug 2016

It is not buggy Stata behavior so much as just different. In Stata, value labels are just labels and are not actually values. Pandas doesn't have the concept of a labeled Series, and so there is no way to map between the two perfectly.

bashtage on 9 Aug 2016

Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels?

Using your short file:

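import pandas as pd

# Read the raw integer codes instead of converting to categoricals
df = pd.read_stata('ipumsi_00014_ethn.dta', convert_categoricals=False)
# Read the value labels separately through the lower-level reader
sr = pd.io.stata.StataReader('ipumsi_00014_ethn.dta')
vl = sr.value_labels()
sr.close()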

Then you can do whatever you need to with vl and df.
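For example, one option is to make the repeated labels unique by appending the code, then build the categorical by hand. This is only a sketch: the small `df` and `vl` below are in-memory stand-ins for the objects read above, and it assumes `vl` is a nested dict keyed by a label name that matches the column name, which need not hold for every file.

```python
from collections import Counter

import pandas as pd

# Stand-ins for the df and vl obtained from read_stata / StataReader.
df = pd.DataFrame({"ethnicsn": [113, 130, 112, 113]})
vl = {"ethnicsn": {112: "fulani", 113: "wolof", 130: "wolof"}}

labels = vl["ethnicsn"]
counts = Counter(labels.values())

# Append the code to any label that is used by more than one code.
unique_labels = {
    code: f"{label} ({code})" if counts[label] > 1 else label
    for code, label in labels.items()
}

# Order the categories by their Stata code so the ordering is reproducible.
categories = [unique_labels[code] for code in sorted(unique_labels)]
df["ethnicsn_cat"] = pd.Categorical(
    df["ethnicsn"].map(unique_labels), categories=categories
)
print(df)
```

Keeping the code in the label preserves the distinction between 113 and 130 that a plain string mapping would lose.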

bashtage on 9 Aug 2016

@bashtage Cool! Thanks!

ozak on 9 Aug 2016
