Pandas: OverflowError: Python int too large to convert to C long

Created on 4 Apr 2018 · 25 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas

content = open('failing_pandas.json').readline()
df = pandas.read_json(content, lines=True)
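For reference, here is a self-contained variant of the repro (no external file), built from the failing content posted later in the thread:

```python
from io import StringIO

import pandas as pd

# The failing content from later in the thread: a 22-digit number stored
# as a JSON string, followed by an empty string on a second line.
content = '{"a":"7868170657351128032018"}\n{"a":""}\n'

# On the affected versions this raised OverflowError during the int64
# conversion attempt; on fixed versions the column is left unconverted.
df = pd.read_json(StringIO(content), lines=True)
print(df.shape)  # (2, 1)
```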

Problem description

This issue happens on 0.21.1+ and doesn't happen on 0.21.0, for instance. I also tried the latest master branch (0.23.0) and got the same issue:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 366, in read_json
    return json_reader.read()
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 464, in read
    self._combine_lines(data.split('\n'))
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 484, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 582, in parse
    self._try_convert_types()
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 838, in _try_convert_types
    lambda col, c: self._try_convert_data(col, c, convert_dates=False))
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 818, in _process_converter
    new_data, result = f(col, c)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 838, in <lambda>
    lambda col, c: self._try_convert_data(col, c, convert_dates=False))
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/io/json/json.py", line 652, in _try_convert_data
    new_data = data.astype('int64')
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/generic.py", line 4004, in astype
    **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 3462, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 544, in astype
    **kwargs)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/internals.py", line 625, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "/Users/cscetbon/.virtualenvs/pandas1/lib/python2.7/site-packages/pandas/core/dtypes/cast.py", line 692, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas/_libs/lib.pyx", line 854, in pandas._libs.lib.astype_intsafe
  File "pandas/_libs/src/util.pxd", line 91, in util.set_value_at_unsafe
OverflowError: Python int too large to convert to C long

Expected Output

It should not crash.
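A reasonable expectation (and the coercion the thread converges on later) is that integers which do not fit in int64 are kept as Python objects rather than crashing, as plain Series construction already does:

```python
import pandas as pd

big = 7868170657351128032018  # larger than int64 (and uint64) max
s = pd.Series([big])

# pandas falls back to object dtype instead of raising
print(s.dtype)  # object
print(s.iloc[0] == big)  # True
```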

Output of pd.show_versions()

Here is the working one:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

And the failing one:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.21.1
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug, IO JSON

All 25 comments

Interested in trying to bisect where things broke between 0.21.0 and 0.21.1?

We'll also need a reproducible example. read_json can take a json-string, so that should be easiest.

@TomAugspurger yes, I'm interested in bisecting it. However, I get a weird import issue when installing it in a local environment:

$ virtualenv env
New python executable in /Users/cscetbon/src/git/pandas/env/bin/python2.7
Also creating executable in /Users/cscetbon/src/git/pandas/env/bin/python
Installing setuptools, pip, wheel...done.
$ . env/bin/activate
$ python setup.py build_ext --inplace
$ python -m pip install -e .
Obtaining file:///Users/cscetbon/src/git/pandas
Collecting python-dateutil (from pandas==0.21.0)
  Using cached python_dateutil-2.7.2-py2.py3-none-any.whl
Collecting pytz>=2011k (from pandas==0.21.0)
  Using cached pytz-2018.3-py2.py3-none-any.whl
Collecting numpy>=1.9.0 (from pandas==0.21.0)
  Using cached numpy-1.14.2-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting six>=1.5 (from python-dateutil->pandas==0.21.0)
  Using cached six-1.11.0-py2.py3-none-any.whl
Installing collected packages: six, python-dateutil, pytz, numpy, pandas
  Found existing installation: pandas 0.21.0
    Not uninstalling pandas at /Users/cscetbon/src/git/pandas, outside environment /Users/cscetbon/src/git/pandas/env
  Running setup.py develop for pandas
Successfully installed numpy-1.14.2 pandas python-dateutil-2.7.2 pytz-2018.3 six-1.11.0
$ pip freeze|grep -I panda
-e git+https://github.com/pandas-dev/pandas.git@81372093f1fdc0c07e4b45ba0f47b0360fabd405#egg=pandas
$ python -c 'import pandas; print pandas.__version__'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/Users/cscetbon/src/git/pandas/pandas/core/groupby/__init__.py", line 2, in <module>
  File "/Users/cscetbon/src/git/pandas/pandas/core/groupby/groupby.py", line 47, in <module>
  File "/Users/cscetbon/src/git/pandas/pandas/core/arrays/__init__.py", line 1, in <module>
  File "/Users/cscetbon/src/git/pandas/pandas/core/arrays/base.py", line 4, in <module>
ImportError: cannot import name AbstractMethodError

Any idea?

I'm not sure about these lines

  Found existing installation: pandas 0.21.0
    Not uninstalling pandas at /Users/cscetbon/src/git/pandas, outside environment /Users/cscetbon/src/git/pandas/env
  Running setup.py develop for pandas

I was able to find and solve the issue. I had to apply the following patch on v0.21.0:

pandas.txt

This issue wasn't fixed by cf9f51336b0ca99

So your original issue is not fixed on master? Can you submit a PR fixing it, along with tests and a release note? Thanks.

Yes, it's not fixed on the master branch. It'll have to wait a bit for me to find some time. Don't you think the OverflowError exception should be caught everywhere, though? I don't really have the answer, but it seems it could happen with other types, float64 for instance.
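A quick check of the float64 concern raised above: an int far beyond float64 range does not silently coerce either; the cast to float itself overflows, so a float64 fallback would hit the same class of error:

```python
# 2**100000 is far beyond the largest float64 (~1.8e308)
big = 2 ** 100000
try:
    float(big)
    raised = False
except OverflowError:
    raised = True
print(raised)  # True
```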

Sorry, relabeling this because I don't know yet how to reproduce.

As of the 0.23.0 release it still raises, but as a ValueError:

>>> import json
>>> import pandas as pd
>>> foo = 2**100000
>>> bar = {"foo": foo}
>>> baz = json.dumps(bar)
>>> df = pd.read_json(baz)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 422, in read_json
    result = json_reader.read()
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 529, in read
    obj = self._get_object_parser(self.data)
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 546, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 638, in parse
    self._parse_no_numpy()
  File "/home/shan/test/lib/python3.6/site-packages/pandas/io/json/json.py", line 853, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Value is too big

$ python -c "import pandas as pd; pd.show_versions()" | grep pandas
pandas: 0.23.0

@ssikdar1 : Was this example working on a previous version?

Sorry guys, I really didn't have time. If someone can start working from the patch I sent, that'd be great.

For v22 i get the same error:

    self._parse_no_numpy()
  File "/Users/ssikdar/workspace27/acquire-expand/workspace3/lib/python3.6/site-packages/pandas/io/json/json.py", line 793, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Value is too big
>>> 

$ python -c "import pandas as pd; pd.show_versions()" | grep pandas
pandas: 0.22.0
pandas_gbq: None
pandas_datareader: None

Sorry guys, I really didn't have time. If someone can start working from the patch I sent, that'd be great.

@cscetbon : Thanks for letting us know! We can continue on from here.

@ssikdar1 : Does your code happen to work for 0.21.0 by any chance? BTW, you're going to have to provide an index for this to work (try your example with a smaller value for foo).

@cscetbon : Do you have an example that we can use to test your patch? That would be helpful actually.

@gfyoung same error on 0.21, unfortunately

Digging deeper on 0.23:

  File "/home/shan/test23/lib/python3.6/site-packages/pandas/io/json/json.py", line 853, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
>>> import pandas._libs.json as json
>>> json.loads(json.dumps({'f':2**10000, 'b': 'sh'}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert
>>> 
>>> import json
>>> json.loads(json.dumps({'f':2**10000, 'b': 'sh'}))
{'f': 19950631168807583848837421626835850838234968318861924548520089498529438830221946631919961684036194597899331129423209...709376, 'b': 'sh'}

(the full 3011-digit value of 2**10000 is elided here)
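For contrast, the standard-library json module round-trips arbitrary-precision integers exactly, which is why the stdlib call above succeeds while pandas' bundled ujson overflows:

```python
import json

big = 2 ** 10000
roundtripped = json.loads(json.dumps({"f": big, "b": "sh"}))

print(roundtripped["f"] == big)  # True: exact, arbitrary precision
print(len(str(big)))             # 3011 decimal digits
```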

@ssikdar1 : That definitely looks like a _libs/src/ujson investigation. That being said, your example from above still doesn't work even if I pass in a smaller value.

Hey @gfyoung, you can use the following content:

{"a":"7868170657351128032018"},{"a":""}

If I change it to

{"a":"7868170657351128032018"},{"a":"10"}

It works. The patch I provided restores the same behavior as before the change. However, at that time I didn't know that the second content would work, which now makes me think there might be a bug somewhere.
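To see why the first content trips the overflow, note that the value is beyond not only the int64 range but even the uint64 range, so no fixed-width integer cast can succeed:

```python
import numpy as np

val = 7868170657351128032018

print(np.iinfo("int64").max)   # 9223372036854775807
print(val > np.iinfo("int64").max)   # True
print(val > np.iinfo("uint64").max)  # True: too big even for uint64
```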

@cscetbon : Thanks for this! That is indeed strange.

You should not need to touch the ujson code at all here; it cannot work with values larger than uint64.
The error above comes from trying to convert to a proper int64; you need to catch the overflow and coerce to object dtype.
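A minimal sketch of the coercion described here (the helper name try_convert_int64 is hypothetical, not pandas API): attempt the int64 cast and keep the object-dtype data when it overflows:

```python
import pandas as pd

def try_convert_int64(data):
    # Hypothetical helper: mirror pandas' conversion attempt, but catch
    # the overflow and return the original object-dtype data instead.
    try:
        return data.astype("int64")
    except (OverflowError, ValueError):
        return data

small = pd.Series(["1", "2"], dtype=object)
huge = pd.Series([7868170657351128032018, 0], dtype=object)

print(try_convert_int64(small).dtype)  # int64
print(try_convert_int64(huge).dtype)   # object
```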

@jreback : That makes sense. That was also what @cscetbon proposed above.

That being said, patching is a little tricky since the issue emerges from argument validation in the json.loads call, which is all C. Thus, instead of aliasing loads to json.loads, we could define loads to wrap json.loads as follows:

```python
def loads(*args, **kwargs):
    try:
        return json.loads(*args, **kwargs)
    except OverflowError:
        # type coercion, etc.
        ...
```

No, patching like this will not be accepted.

There are 2 issues:

  • coercion to datetimes (can simply catch the overflow error)
  • ujson parsing into an overflow: this is pretty tricky
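For the first issue, a sketch of catching the overflow during date coercion (the helper name try_convert_dates is hypothetical, and the caught exception set is an assumption about the possible failure modes):

```python
import pandas as pd

def try_convert_dates(data):
    # Hypothetical helper: attempt datetime conversion, and return the
    # data unchanged when the integers are out of datetime range.
    try:
        return pd.to_datetime(data, unit="s")
    except (OverflowError, ValueError, pd.errors.OutOfBoundsDatetime):
        return data

s = pd.Series([2**70], dtype=object)
out = try_convert_dates(s)
print(out.iloc[0] == 2**70)  # True: conversion failed, data kept as-is
```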

Still got this error today:

OverflowError: Python int too large to convert to C ssize_t


Code to reproduce

import pandas as pd

e = 4
rng_srt = 9 * 10**300   # range start
rng_end = 11 * 10**300  # range end

p = pd.DataFrame(dtype=object)  # powers p

p['b'] = pd.Series(range(rng_srt, rng_end + 1))  # base b
p['e'] = e  # exponent e
p['v'] = [value**e for value in p['b']]  # value v

p.tail()

@mondaysunrise you are commenting on an issue about JSON parsing.

You cannot hold these large ints directly; you must use object dtype for the Series you are constructing.

Okay, sorry, I was missing that. Thank you for telling me.

take

I'd like to fix this in the ujson implementation similarly to #34473
