RecursionError in DataFrame.replace
import pandas as pd
import datetime
df = pd.DataFrame({
    "dt": [datetime.datetime(3017, 12, 20)],
    "str": ["blah"]
})
df.replace("blah", "cats")
This leads to many repetitions of:
OutOfBoundsDatetime Traceback (most recent call last)
~/sandbox/pandas/pandas/core/internals.py in replace(self, to_replace, value, inplace, filter, regex, convert, mgr)
804 blocks = [b.convert(by_item=True, numeric=False,
--> 805 copy=not inplace) for b in blocks]
806 return blocks
~/sandbox/pandas/pandas/core/internals.py in <listcomp>(.0)
804 blocks = [b.convert(by_item=True, numeric=False,
--> 805 copy=not inplace) for b in blocks]
806 return blocks
~/sandbox/pandas/pandas/core/internals.py in convert(self, *args, **kwargs)
2355 if by_item and not self._is_single_block:
-> 2356 blocks = self.split_and_operate(None, f, False)
2357 else:
~/sandbox/pandas/pandas/core/internals.py in split_and_operate(self, mask, f, inplace)
508 if m.any():
--> 509 nv = f(m, v, i)
510 else:
~/sandbox/pandas/pandas/core/internals.py in f(m, v, i)
2345 shape = v.shape
-> 2346 values = fn(v.ravel(), **fn_kwargs)
2347 try:
~/sandbox/pandas/pandas/core/dtypes/cast.py in soft_convert_objects(values, datetime, numeric, timedelta, coerce, copy)
834 if datetime:
--> 835 values = lib.maybe_convert_objects(values, convert_datetime=datetime)
836
~/sandbox/pandas/pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_objects()
1317 seen.datetime_ = 1
-> 1318 idatetimes[i] = convert_to_tsobject(
1319 val, None, None, 0, 0).value
~/sandbox/pandas/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()
299 elif PyDateTime_Check(ts):
--> 300 return convert_datetime_to_tsobject(ts, tz, nanos)
301 elif PyDate_Check(ts):
~/sandbox/pandas/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject()
379
--> 380 check_dts_bounds(&obj.dts)
381 check_overflows(obj)
~/sandbox/pandas/pandas/_libs/tslibs/np_datetime.pyx in pandas._libs.tslibs.np_datetime.check_dts_bounds()
120 dts.min, dts.sec)
--> 121 raise OutOfBoundsDatetime(
122 'Out of bounds nanosecond timestamp: {fmt}'.format(fmt=fmt))
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3017-12-20 00:00:00
(thanks to @ChrisMuir for the example)
Can you make your example reproducible? http://stackoverflow.com/help/mcve
Also here is the output of pd.show_versions()
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Are you able to do some digging to see what's going on? I suspect none of
the maintainers will have time.
If not, perhaps post a traceback and someone else can take a look if they
hit the same issue.
Just to chime in on this with more info, I'm getting a RecursionError exception when running df = df.replace(np.nan, 'NA') on Jack's data frame. Here's code to reproduce the issue:
import pandas as pd
from io import BytesIO
import zipfile
import requests
# Read the zipped file from this issue thread.
res = requests.get("https://github.com/pandas-dev/pandas/files/1820497/example_df.zip")
# Read the pkl file via pandas (pkl contains a data frame).
with zipfile.ZipFile(BytesIO(res.content)) as z:
    pkl_file = z.namelist()[0]
    with z.open(pkl_file) as pk:
        df = pd.read_pickle(pk)
# Simple call to replace.
df = df.replace("blah", "NA")
Here's the traceback:
File "<ipython-input-6-0fda1114890b>", line 1, in <module>
df = df.replace(np.nan, "NA")
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py", line 4612, in replace
regex=regex)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3468, in replace
return self.apply('replace', **kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3329, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
convert=convert, mgr=mgr)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
filter=filter, regex=regex, convert=convert)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
convert=convert, mgr=mgr)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
filter=filter, regex=regex, convert=convert)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
convert=convert, mgr=mgr)
....
<snip a ton of references to line 771 and line 2216>
....
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in replace
copy=not inplace) for b in blocks]
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in <listcomp>
copy=not inplace) for b in blocks]
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2135, in convert
blocks = self.split_and_operate(None, f, False)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 482, in split_and_operate
block = make_a_block(nv, [ref_loc])
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 458, in make_a_block
placement=ref_loc, fastpath=True)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 210, in make_block
return make_block(values, placement=placement, ndim=ndim, **kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2957, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2082, in __init__
placement=placement, **kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 114, in __init__
self.mgr_locs = placement
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 229, in mgr_locs
new_mgr_locs = BlockPlacement(new_mgr_locs)
File "pandas/_libs/lib.pyx", line 1696, in pandas._libs.lib.BlockPlacement.__init__
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 729, in require
requirements = set(possible_flags[x.upper()] for x in requirements)
RecursionError: maximum recursion depth exceeded
I tried isolating the issue to a specific column:
try:
    df = df.replace("blah", "NA")
except RecursionError:
    for curr_col in df.columns:
        try:
            df[curr_col] = df.loc[:, curr_col].replace("blah", "NA")
            print("success for col %s" % curr_col)
        except RecursionError:
            print("fail for col %s" % curr_col)
This outputs:
success for col adulterant
success for col announcement_date
success for col data_source_detailed
success for col data_source_general
success for col failing_results
success for col filename
success for col food_name
success for col inspection_results
success for col legal_limit
success for col manufacturer_address
success for col manufacturer_name
success for col notice_number
success for col product_classification
fail for col production_date
success for col sampled_location_address
success for col sampled_location_name
success for col sampled_location_province
success for col sheetname
success for col specifications_model
success for col task_source_or_project_name
success for col test_outcome
success for col testing_agency
So the issue seems to be with col production_date. The data in this col is made up of str and datetime.datetime.
types = set()
for n in df["production_date"]:
    if type(n) not in types:
        types.add(type(n))
Printing types gives:
{datetime.datetime, str}
This is as far as I've gotten. Not sure what to do with all this info, but figured I'd share.
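For a mixed str/datetime column like this one, a quick way to see whether any of the datetime.datetime values fall outside the range a nanosecond-resolution timestamp can hold is sketched below. It uses only the standard library. The helper name find_out_of_bounds and the hard-coded limits (the datetime64[ns] range, rounded inward to whole microseconds) are assumptions made for illustration, not part of pandas.

```python
import datetime

# Limits of a nanosecond-resolution timestamp, rounded inward to whole
# microseconds (hard-coded here as an assumption rather than read from pandas).
NS_MIN = datetime.datetime(1677, 9, 21, 0, 12, 43, 145225)
NS_MAX = datetime.datetime(2262, 4, 11, 23, 47, 16, 854775)


def find_out_of_bounds(values):
    """Return (position, value) pairs for naive datetimes outside the ns range."""
    hits = []
    for pos, val in enumerate(values):
        # Strings and other non-datetime entries in the mixed column are skipped.
        if isinstance(val, datetime.datetime) and not (NS_MIN <= val <= NS_MAX):
            hits.append((pos, val))
    return hits


column = ["2017-12-20", datetime.datetime(2017, 12, 20), datetime.datetime(3017, 12, 20)]
print(find_out_of_bounds(column))
```

Something like find_out_of_bounds(df["production_date"]) should work the same way on the column above, assuming its datetimes are timezone-naive.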
Oh and here's my output for pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.22.0
pytest: 3.0.7
pip: 9.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@TomAugspurger I believe I've narrowed this down: replace() tries to raise an OutOfBoundsDatetime exception when it hits datetime.datetime values that are out of range, but instead gets stuck in infinite recursion. Check out the minimal example below.
This works without errors (year 2017 in the datetime object):
import pandas as pd
import datetime
df = pd.DataFrame({
    "dt": [datetime.datetime(2017, 12, 20)],
    "str": ["blah"]
})
df.replace("blah", "cats")
However this throws a RecursionError (year 3017 in the datetime object):
import pandas as pd
import datetime
df = pd.DataFrame({
    "dt": [datetime.datetime(3017, 12, 20)],
    "str": ["blah"]
})
df.replace("blah", "cats")
Saving the failing code to a file and running it from the command prompt, I'm able to see the entire traceback. Near the top, this block appears a bunch of times:
Traceback (most recent call last):
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in replace
copy=not inplace) for b in blocks]
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in <listcomp>
copy=not inplace) for b in blocks]
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2135, in convert
blocks = self.split_and_operate(None, f, False)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 478, in split_and_operate
nv = f(m, v, i)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2125, in f
values = fn(v.ravel(), **fn_kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 807, in soft_convert_objects
values = lib.maybe_convert_objects(values, convert_datetime=datetime)
File "pandas/_libs/src/inference.pyx", line 1290, in pandas._libs.lib.maybe_convert_objects
File "pandas/_libs/tslib.pyx", line 1575, in pandas._libs.tslib.convert_to_tsobject
File "pandas/_libs/tslib.pyx", line 1669, in pandas._libs.tslib.convert_datetime_to_tsobject
File "pandas/_libs/tslib.pyx", line 1848, in pandas._libs.tslib._check_dts_bounds
pandas._libs.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3017-12-20 00:00:00
Followed by a bunch of calls referencing lines 771 and 2216, and ending with this:
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
filter=filter, regex=regex, convert=convert)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
convert=convert, mgr=mgr)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
filter=filter, regex=regex, convert=convert)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
convert=convert, mgr=mgr)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 760, in replace
blocks = self.putmask(mask, value, inplace=inplace)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 1021, in putmask
return [self.make_block(new_values, fastpath=True)]
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 210, in make_block
return make_block(values, placement=placement, ndim=ndim, **kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2957, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2082, in __init__
placement=placement, **kwargs)
File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 114, in __init__
self.mgr_locs = placement
RecursionError: maximum recursion depth exceeded
Looking at the function check_dts_bounds in pandas/_libs/tslibs/np_datetime.pyx, it looks like the call to replace is supposed to raise an OutOfBoundsDatetime exception.
The recursion occurs because OutOfBoundsDatetime subclasses ValueError, which is caught here at line 807 in core/internals.py:
except (TypeError, ValueError):
    # try again with a compatible block
    block = self.astype(object)
    return block.replace(
        to_replace=original_to_replace, value=value, inplace=inplace,
        filter=filter, regex=regex, convert=convert)
There is little sense in trying again if the datetime is out of bounds: the retry hits the same error and recurses again. This issue can be fixed by adding an except block for OutOfBoundsDatetime before this block. A relevant question: should the out-of-bounds datetime have been caught when the DataFrame was created?
I'm inclined to think 'yes'. It's better to fail sooner rather than later.
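The retry loop described above can be reproduced in miniature without pandas. The sketch below is a toy model only: OutOfBounds, convert, replace_buggy and replace_fixed are hypothetical names standing in for the pandas internals, and the "fixed" variant shows just one of the possible behaviours (skipping the conversion), not what pandas actually does.

```python
class OutOfBounds(ValueError):
    """Hypothetical stand-in for an out-of-bounds error that subclasses ValueError."""


def convert(values):
    # Stand-in for the conversion step that rejects the year-3017 datetime.
    raise OutOfBounds("Out of bounds nanosecond timestamp")


def replace_buggy(values):
    try:
        return convert(values)
    except (TypeError, ValueError):
        # "Try again": the retry fails the same way, so this never terminates
        # until the interpreter's recursion limit is reached.
        return replace_buggy(values)


def replace_fixed(values):
    try:
        return convert(values)
    except OutOfBounds:
        # Retrying cannot help here, so leave the values unconverted.
        return values
    except (TypeError, ValueError):
        return replace_fixed(values)


try:
    replace_buggy(["blah"])
except RecursionError as exc:
    print("buggy version:", type(exc).__name__)

print("fixed version:", replace_fixed(["blah"]))
```

The more specific except clause has to come first; placed after except (TypeError, ValueError) it would never be reached, because the broader clause already matches the subclass.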
Thanks for digging. In @ChrisMuir's small example, the column is object dtype, so I think that's OK.
Passing convert=False also works in this case, though that probably breaks other things.
Updated the original post with the nice example.
Raised a PR trying to fix this. 😃 convert=False breaks other cases, by the look of it. At the moment the PR uses a try/except in the cast module; new ideas welcome.
@JackKZ @ChrisMuir would you like an error to be raised (for the out-of-bounds datetime) when encountering the above example, or should it be ignored silently and the replace carried out anyway?
Well, personally I'd prefer that replace() not throw an error when it hits an out-of-bounds datetime, but that's coming from a purely selfish standpoint. I don't use Pandas a ton, and my use cases probably only represent a fraction of total common use cases (I'm always working with data generated by 3rd parties from which I'm completely disconnected). Also, I don't know enough about the Pandas internals and philosophy to weigh in on which option makes the most sense, with all users in mind.
In the example given, and perhaps in most use cases, the OOB datetime is irrelevant. Of course that datetime could lead to other problems, but as far as replace is concerned it shouldn't matter. In my opinion, it would only make sense to raise the exception when manipulating the datetime.
Looks like this was closed by #22108