Pandas: OverflowError: Python int too large to convert to C long

Created on 4 Apr 2018 · 25 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas

content = open('failing_pandas.json').readline()
df = pandas.read_json(content, lines=True)
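For reference, here is a self-contained variant of the repro (no external file), built from the failing content posted later in the thread:

```python
from io import StringIO

import pandas as pd

# The failing content from later in the thread: a 22-digit number stored
# as a JSON string, followed by an empty string on a second line.
content = '{"a":"7868170657351128032018"}\n{"a":""}\n'

# On the affected versions this raised OverflowError during the int64
# conversion attempt; on fixed versions the column is left unconverted.
df = pd.read_json(StringIO(content), lines=True)
print(df.shape)  # (2, 1)
```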

Problem description

This issue happens on 0.21.1+ and doesn't happen on 0.21.0, for instance. I also tried the latest master branch (0.23.0) and got the same issue:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 366, in read_json
    return json_reader.read()
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 464, in read
    self._combine_lines(data.split('\n'))
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 484, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 582, in parse
    self._try_convert_types()
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 838, in _try_convert_types
    lambda col, c: self._try_convert_data(col, c, convert_dates=False))
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 818, in _process_converter
    new_data, result = f(col, c)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 838, in <lambda>
    lambda col, c: self._try_convert_data(col, c, convert_dates=False))
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 652, in _try_convert_data
    new_data = data.astype('int64')
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/generic.py", line 4004, in astype
    **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 3462, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 544, in astype
    **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 625, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/dtypes/cast.py", line 692, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 854, in pandas._libs.lib.astype_intsafe
  File "pandas/_libs/src/util.pxd", line 91, in util.set_value_at_unsafe
OverflowError: Python int too large to convert to C long

Expected Output

It should not crash.
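A reasonable expectation (and the coercion the thread converges on later) is that integers which do not fit in int64 are kept as Python objects rather than crashing, as plain Series construction already does:

```python
import pandas as pd

big = 7868170657351128032018  # larger than int64 (and uint64) max
s = pd.Series([big])

# pandas falls back to object dtype instead of raising
print(s.dtype)  # object
print(s.iloc[0] == big)  # True
```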

Output of pd.show_versions()

Here is the working one:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

And the failing one:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.1
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug, IO JSON

All 25 comments

Interested in trying to bisect where things broke between 0.21.0 and 0.21.1?

We'll also need a reproducible example. read_json can take a json-string, so that should be easiest.

@TomAugspurger yes, I'm interested in bisecting it. However, I get a weird import issue when installing it in a local environment:

$ virtualenv env
New python executable in /Users/cscetbon/src/git/pandas/env/bin/python2.7
Also creating executable in /Users/cscetbon/src/git/pandas/env/bin/python
Installing setuptools, pip, wheel...done.
$ . env/bin/activate
$ python setup.py build_ext --inplace
$ python -m pip install -e .
Obtaining file:///Users/cscetbon/src/git/pandas
Collecting python-dateutil (from pandas==0.21.0)
  Using cached python_dateutil-2.7.2-py2.py3-none-any.whl
Collecting pytz>=2011k (from pandas==0.21.0)
  Using cached pytz-2018.3-py2.py3-none-any.whl
Collecting numpy>=1.9.0 (from pandas==0.21.0)
  Using cached numpy-1.14.2-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting six>=1.5 (from python-dateutil->pandas==0.21.0)
  Using cached six-1.11.0-py2.py3-none-any.whl
Installing collected packages: six, python-dateutil, pytz, numpy, pandas
  Found existing installation: pandas 0.21.0
    Not uninstalling pandas at /Users/cscetbon/src/git/pandas, outside environment /Users/cscetbon/src/git/pandas/env
  Running setup.py develop for pandas
Successfully installed numpy-1.14.2 pandas python-dateutil-2.7.2 pytz-2018.3 six-1.11.0
$ pip freeze|grep -I panda
-e git+https://github.com/pandas-dev/pandas.git@81372093f1fdc0c07e4b45ba0f47b0360fabd405#egg=pandas
$ python -c 'import pandas; print pandas.__version__'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/Users/cscetbon/src/git/pandas/pandas/core/groupby/__init__.py", line 2, in <module>
  File "/Users/cscetbon/src/git/pandas/pandas/core/groupby/groupby.py", line 47, in <module>
  File "/Users/cscetbon/src/git/pandas/pandas/core/arrays/__init__.py", line 1, in <module>
  File "/Users/cscetbon/src/git/pandas/pandas/core/arrays/base.py", line 4, in <module>
ImportError: cannot import name AbstractMethodError

Any idea?

I'm not sure about these lines

  Found existing installation: pandas 0.21.0
    Not uninstalling pandas at /Users/cscetbon/src/git/pandas, outside environment /Users/cscetbon/src/git/pandas/env
  Running setup.py develop for pandas

I was able to find and solve the issue. I had to apply the following patch on v0.21.0:

pandas.txt

This issue wasn't fixed by cf9f51336b0ca99

So your original issue is not fixed on master? Can you submit a PR fixing it, along with tests and a release note? Thanks.

Yes, it's not fixed on the master branch. It'll have to wait a bit for me to find some time. Don't you think the OverflowError exception should be caught everywhere, though? I don't really have the answer, but it seems it could happen with other types, float64 for instance.
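A quick check of the float64 concern raised above: an int far beyond float64 range does not silently coerce either; the cast to float itself overflows, so a float64 fallback would hit the same class of error:

```python
# 2**100000 is far beyond the largest float64 (~1.8e308)
big = 2 ** 100000
try:
    float(big)
    raised = False
except OverflowError:
    raised = True
print(raised)  # True
```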

Sorry, relabeling this because I don't know yet how to reproduce.

As of the 0.23.0 release it still raises, but as a ValueError:

>>> import json
>>> import pandas as pd
>>> foo = 2**100000
>>> bar = {"foo": foo}
>>> baz = json.dumps(bar)
>>> df = pd.read_json(baz)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 422, in read_json
    result = json_reader.read()
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 529, in read
    obj = self._get_object_parser(self.data)
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 546, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 638, in parse
    self._parse_no_numpy()
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 853, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Value is too big

$ python -c "import pandas as pd; pd.show_versions()" | grep pandas
pandas: 0.23.0

@ssikdar1 : Was this example working on a previous version?

Sorry guys, I really didn't have time. If someone can start working from the patch I sent, that'd be great.

For v22 i get the same error:

    self._parse_no_numpy()
  File "/Users/ssikdar/workspace27/acquire-expand/workspace3/lib/python3.6/site-packages/pandas/io/json/json.py", line 793, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Value is too big
>>> 

$ python -c "import pandas as pd; pd.show_versions()" | grep pandas
pandas: 0.22.0
pandas_gbq: None
pandas_datareader: None

Sorry guys, I really didn't have time. If someone can start working from the patch I sent, that'd be great.

@cscetbon : Thanks for letting us know! We can continue on from here.

@ssikdar1 : Does your code happen to work for 0.21.0 by any chance? BTW, you're going to have to provide an index for this to work (try your example with a smaller value for foo).

@cscetbon : Do you have an example that we can use to test your patch? That would be helpful actually.

@gfyoung same error on 0.21, unfortunately

Digging deeper on 0.23:

  File "/home/shan/test23/lib/python3.6/site-packages/pandas/io/json/json.py", line 853, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
>>> import pandas._libs.json as json
>>> json.loads(json.dumps({'f':2**10000, 'b': 'sh'}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert
>>> 
>>> import json
>>> json.loads(json.dumps({'f':2**10000, 'b': 'sh'}))
{'f': 19950631168807583848837421626835850838234968318861924548520089498529438830221946631919961684036194597899331129423209...709376, 'b': 'sh'}

(the full 3011-digit value of 2**10000 is elided here)
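For contrast, the standard-library json module round-trips arbitrary-precision integers exactly, which is why the stdlib call above succeeds while pandas' bundled ujson overflows:

```python
import json

big = 2 ** 10000
roundtripped = json.loads(json.dumps({"f": big, "b": "sh"}))

print(roundtripped["f"] == big)  # True: exact, arbitrary precision
print(len(str(big)))             # 3011 decimal digits
```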

@ssikdar1 : That definitely looks like a _libs/src/ujson investigation. That being said, your example from above still doesn't work even if I pass in a smaller value.

Hey @gfyoung, you can use the following content:

{"a":"7868170657351128032018"},{"a":""}

If I change it to

{"a":"7868170657351128032018"},{"a":"10"}

It works. The patch I provided restores the same behavior as before the change. However, at that time I didn't know that the second content would work, which now makes me think there might be a bug somewhere.
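To see why the first content trips the overflow, note that the value is beyond not only the int64 range but even the uint64 range, so no fixed-width integer cast can succeed:

```python
import numpy as np

val = 7868170657351128032018

print(np.iinfo("int64").max)   # 9223372036854775807
print(val > np.iinfo("int64").max)   # True
print(val > np.iinfo("uint64").max)  # True: too big even for uint64
```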

@cscetbon : Thanks for this! That is indeed strange.

You should not need to touch the ujson code at all here; it cannot work with values larger than uint64.
The error above comes from trying to convert to a proper int64; you need to catch the overflow and coerce to object dtype.
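A minimal sketch of the coercion described here (the helper name try_convert_int64 is hypothetical, not pandas API): attempt the int64 cast and keep the object-dtype data when it overflows:

```python
import pandas as pd

def try_convert_int64(data):
    # Hypothetical helper: mirror pandas' conversion attempt, but catch
    # the overflow and return the original object-dtype data instead.
    try:
        return data.astype("int64")
    except (OverflowError, ValueError):
        return data

small = pd.Series(["1", "2"], dtype=object)
huge = pd.Series([7868170657351128032018, 0], dtype=object)

print(try_convert_int64(small).dtype)  # int64
print(try_convert_int64(huge).dtype)   # object
```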

@jreback : That makes sense. That was also what @cscetbon proposed above.

That being said, patching is a little tricky since the issue emerges from argument validation in the json.loads call, which is all C. Thus, instead of aliasing loads to json.loads, we could define loads to wrap json.loads as follows:

```python
def loads(*args, **kwargs):
    try:
        return json.loads(*args, **kwargs)
    except OverflowError:
        # type coercion, etc.
        ...
```

No, patching like this will not be accepted.

There are 2 issues:

  • coercion to datetimes (can simply catch the overflow error)
  • ujson parsing into an overflow: this is pretty tricky
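For the first issue, a sketch of catching the overflow during date coercion (the helper name try_convert_dates is hypothetical, and the caught exception set is an assumption about the possible failure modes):

```python
import pandas as pd

def try_convert_dates(data):
    # Hypothetical helper: attempt datetime conversion, and return the
    # data unchanged when the integers are out of datetime range.
    try:
        return pd.to_datetime(data, unit="s")
    except (OverflowError, ValueError, pd.errors.OutOfBoundsDatetime):
        return data

s = pd.Series([2**70], dtype=object)
out = try_convert_dates(s)
print(out.iloc[0] == 2**70)  # True: conversion failed, data kept as-is
```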

Still got this error today:

OverflowError: Python int too large to convert to C ssize_t


Code to reproduce

import pandas as pd

e = 4
rng_srt = 9 * 10**300   # range start
rng_end = 11 * 10**300  # range end

p = pd.DataFrame(dtype=object)  # powers p

p['b'] = pd.Series(range(rng_srt, rng_end + 1))  # base b
p['e'] = e  # exponent e
p['v'] = [value**e for value in p['b']]  # value v

p.tail()

@mondaysunrise you are commenting on an issue about JSON parsing.

You cannot hold these large ints directly; you must use object dtype for the Series you are constructing.

Okay, sorry, I was missing that. Thank you for telling me.

take

I'd like to fix this in the ujson implementation similarly to #34473
