I am trying to read an Excel file which someone else created and the wrongly formatted a column as "date" when it is not. It has a large integer in it, which triggers an error
OverflowError: normalized days too large to fit in a C int
But I have "parse_dates=False" so I thought pandas.ExcelFile would not try to parse the dates and return a string instead. Is this a bug?
you would have to show an example and pd.show_versions()
In Excel (2013 Windows 7), I created a new Workbook. In Sheet1, I entered A in A1 and 10000000 in B1. I then formatted B1 to be a Short Date which displays the cell as #################. I saved the files as 'test.xlsx'.
I then ran the following python code
import pandas as pd
pd.show_versions()
xl_file = pd.ExcelFile('test.xlsx')
sheet = xl_file.parse('Sheet1',parse_dates=False)
which gives me the following output
INSTALLED VERSIONS
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8pandas: 0.16.0
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.2
lxml: 3.4.3
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.1
pymysql: None
psycopg2: None
Traceback (most recent call last):
File "test.py", line 5, in
sheet = xl_file.parse('Sheet1',parse_dates=False)
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 287, in parse
**kwds)
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 418, in _parse_excel
row.append(_parse_cell(value,typ))
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 342, in _parse_cell
epoch1904)
File "/Users/myname/anaconda/lib/python2.7/site-packages/xlrd/xldate.py", line 130, in xldate_as_datetime
return epoch + datetime.timedelta(days, seconds, 0, milliseconds)
OverflowError: date value out of range
parse_dates
is not implemented. But I think could be in the _should_parse
function. pull-requests are welcome.
Hi, I am adding this patch here, in case it's useful for those who do not want to parse dates from excel file by setting parse_dates=False. Please review. I had trouble parsing the following excel file from Crunchbase Excel Export which had really old dates which gave OverflowError.
From aae19c65e2a4b3a965f91bdffa5bd4595b0b7d7b Mon Sep 17 00:00:00 2001
From: Kwan Lee <[email protected]>
Date: Tue, 29 Mar 2016 14:52:03 +0300
Subject: [PATCH] parse_dates=False in read_excel should prevent from parsing
dates. It was throwing overflowerror when the dates were too old. update
#10001
---
pandas/io/excel.py | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/pandas/io/excel.py b/pandas/io/excel.py
index 5656c36..b4b8996 100644
--- a/pandas/io/excel.py
+++ b/pandas/io/excel.py
@@ -321,11 +321,11 @@ class ExcelFile(object):
epoch1904 = self.book.datemode
- def _parse_cell(cell_contents, cell_typ):
+ def _parse_cell(cell_contents, cell_typ, parse_dates=True):
"""converts the contents of the cell into a pandas
appropriate object"""
- if cell_typ == XL_CELL_DATE:
+ if cell_typ == XL_CELL_DATE and parse_dates:
if xlrd_0_9_3:
# Use the newer xlrd datetime handling.
cell_contents = xldate.xldate_as_datetime(cell_contents,
@@ -405,7 +405,7 @@ class ExcelFile(object):
should_parse[j] = self._should_parse(j, parse_cols)
if parse_cols is None or should_parse[j]:
- row.append(_parse_cell(value, typ))
+ row.append(_parse_cell(value, typ, parse_dates))
data.append(row)
if sheet.nrows == 0:
--
2.7.2
@kwantopia It'll be easier to review that if you put it up as a pull request. Then we can comment inline.
What's the desired behavior here? read_csv
seems to silently "fail" to parse the columns that can't be represented as datetime64s.
In [5]: !cat foo.csv
date,val
1500-01-01,1
1600-01-02,2
1700-01-01,3
1800-01-01,4
1900-01-01,5
2000-01-01,6
In [1]: pd.read_csv('foo.csv', parse_dates='date')
Out[1]:
date val
0 1500-01-01 1
1 1600-01-02 2
2 1700-01-01 3
3 1800-01-01 4
4 1900-01-01 5
5 2000-01-01 6
actually maybe @jorisvandenbossche can comment here. IIRC the parse_dates
kw only matters if in excel its NOT a date already (and in your case these are out-of-range, so they are strings and will be parsed to object
even in pandas).
this is related to the issue in #11544 and looks to be a dupe of these (there is a somewhat convoluted chain as to what the original issues actually though). maybe someone can figure this chain out and we can create a master issue so its more clear.
@TomAugspurger it's a problem in read_excel, but I guess I was also misunderstanding parse_dates field. I was assuming that parse_dates=True means parse the dates and parse_dates=False means do not parse the dates for pandas.read_excel
Hello, sorry for writing here but it seems a quite common wish the ability of disabling date parsing from the function pd.read_excel()
. Yet another stackoverflow question without a proper answer.
As @kwantopia underlined, I also would expect that setting the flag parse_dates=False
could help me to achieve the desired result but, instead, seems to be ignored or to work in a different (and hence obscure) way.
Can anyone tell me which is the situation of this issue and if actually it's an issue? And, in case everything it's regular, can I have a definitive answer that can allow me to read an .xls
without having the dates parsed?
If there's need for changing or fixing some code, I'm very willing and ready to help, I'd just like to be pointed in the right direction.
Thank you in advance!
Hi guys, I've also run into this issue. One workaround for this (given that the name of columns that you don't want to convert or their column positions are fixed) is that you can provide dictionary of names/column numbers as keys and desired type as value. So, if you set "str" as a value, columns in dataframe will have "object" type and won't get parsed.
pd.read_excel(**you_kwargs, dtype={"col_name1": "str", "col_name2": "str"})
btw, you may provide desired column types only for some columns, it's not required for all of them
@a59 @ffranchina
Most helpful comment
Hello, sorry for writing here but it seems a quite common wish the ability of disabling date parsing from the function
pd.read_excel()
. Yet another stackoverflow question without a proper answer.As @kwantopia underlined, I also would expect that setting the flag
parse_dates=False
could help me to achieve the desired result but, instead, seems to be ignored or to work in a different (and hence obscure) way.Can anyone tell me which is the situation of this issue and if actually it's an issue? And, in case everything it's regular, can I have a definitive answer that can allow me to read an
.xls
without having the dates parsed?If there's need for changing or fixing some code, I'm very willing and ready to help, I'd just like to be pointed in the right direction.
Thank you in advance!