pandas.ExcelFile ignore parse_dates=False

Created on 27 Apr 2015 · 11Comments · Source: pandas-dev/pandas

I am trying to read an Excel file which someone else created and the wrongly formatted a column as "date" when it is not. It has a large integer in it, which triggers an error

OverflowError: normalized days too large to fit in a C int

But I have "parse_dates=False" so I thought pandas.ExcelFile would not try to parse the dates and return a string instead. Is this a bug?

Bug IO Excel

Source

a59

👍1

Most helpful comment

Hello, sorry for writing here but it seems a quite common wish the ability of disabling date parsing from the function pd.read_excel(). Yet another stackoverflow question without a proper answer.
As @kwantopia underlined, I also would expect that setting the flag parse_dates=False could help me to achieve the desired result but, instead, seems to be ignored or to work in a different (and hence obscure) way.
Can anyone tell me which is the situation of this issue and if actually it's an issue? And, in case everything it's regular, can I have a definitive answer that can allow me to read an .xls without having the dates parsed?
If there's need for changing or fixing some code, I'm very willing and ready to help, I'd just like to be pointed in the right direction.
Thank you in advance!

ffranchina on 23 Feb 2019

👍7

All 11 comments

you would have to show an example and pd.show_versions()

jreback on 30 Apr 2015

In Excel (2013 Windows 7), I created a new Workbook. In Sheet1, I entered A in A1 and 10000000 in B1. I then formatted B1 to be a Short Date which displays the cell as #################. I saved the files as 'test.xlsx'.

I then ran the following python code

import pandas as pd
pd.show_versions()
xl_file = pd.ExcelFile('test.xlsx')
sheet = xl_file.parse('Sheet1',parse_dates=False)

which gives me the following output

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.2
lxml: 3.4.3
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.1
pymysql: None
psycopg2: None
Traceback (most recent call last):
File "test.py", line 5, in
sheet = xl_file.parse('Sheet1',parse_dates=False)
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 287, in parse
**kwds)
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 418, in _parse_excel
row.append(_parse_cell(value,typ))
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 342, in _parse_cell
epoch1904)
File "/Users/myname/anaconda/lib/python2.7/site-packages/xlrd/xldate.py", line 130, in xldate_as_datetime
return epoch + datetime.timedelta(days, seconds, 0, milliseconds)
OverflowError: date value out of range

a59 on 4 May 2015

parse_dates is not implemented. But I think could be in the _should_parse function. pull-requests are welcome.

jreback on 4 May 2015

Hi, I am adding this patch here, in case it's useful for those who do not want to parse dates from excel file by setting parse_dates=False. Please review. I had trouble parsing the following excel file from Crunchbase Excel Export which had really old dates which gave OverflowError.

From aae19c65e2a4b3a965f91bdffa5bd4595b0b7d7b Mon Sep 17 00:00:00 2001
From: Kwan Lee <[email protected]>
Date: Tue, 29 Mar 2016 14:52:03 +0300
Subject: [PATCH] parse_dates=False in read_excel should prevent from parsing
 dates.  It was throwing overflowerror when the dates were too old. update
 #10001

---
 pandas/io/excel.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pandas/io/excel.py b/pandas/io/excel.py
index 5656c36..b4b8996 100644
--- a/pandas/io/excel.py
+++ b/pandas/io/excel.py
@@ -321,11 +321,11 @@ class ExcelFile(object):

         epoch1904 = self.book.datemode

-        def _parse_cell(cell_contents, cell_typ):
+        def _parse_cell(cell_contents, cell_typ, parse_dates=True):
             """converts the contents of the cell into a pandas
                appropriate object"""

-            if cell_typ == XL_CELL_DATE:
+            if cell_typ == XL_CELL_DATE and parse_dates:
                 if xlrd_0_9_3:
                     # Use the newer xlrd datetime handling.
                     cell_contents = xldate.xldate_as_datetime(cell_contents,
@@ -405,7 +405,7 @@ class ExcelFile(object):
                         should_parse[j] = self._should_parse(j, parse_cols)

                     if parse_cols is None or should_parse[j]:
-                        row.append(_parse_cell(value, typ))
+                        row.append(_parse_cell(value, typ, parse_dates))
                 data.append(row)

             if sheet.nrows == 0:
-- 
2.7.2

kwantopia on 29 Mar 2016

@kwantopia It'll be easier to review that if you put it up as a pull request. Then we can comment inline.

What's the desired behavior here? read_csv seems to silently "fail" to parse the columns that can't be represented as datetime64s.

In [5]: !cat foo.csv
date,val
1500-01-01,1
1600-01-02,2
1700-01-01,3
1800-01-01,4
1900-01-01,5
2000-01-01,6

In [1]: pd.read_csv('foo.csv', parse_dates='date')
Out[1]:
         date  val
0  1500-01-01    1
1  1600-01-02    2
2  1700-01-01    3
3  1800-01-01    4
4  1900-01-01    5
5  2000-01-01    6

TomAugspurger on 29 Mar 2016

actually maybe @jorisvandenbossche can comment here. IIRC the parse_dates kw only matters if in excel its NOT a date already (and in your case these are out-of-range, so they are strings and will be parsed to object even in pandas).

jreback on 29 Mar 2016

this is related to the issue in #11544 and looks to be a dupe of these (there is a somewhat convoluted chain as to what the original issues actually though). maybe someone can figure this chain out and we can create a master issue so its more clear.

jreback on 29 Mar 2016

@TomAugspurger it's a problem in read_excel, but I guess I was also misunderstanding parse_dates field. I was assuming that parse_dates=True means parse the dates and parse_dates=False means do not parse the dates for pandas.read_excel

kwantopia on 29 Mar 2016

ffranchina on 23 Feb 2019

👍7

Hi guys, I've also run into this issue. One workaround for this (given that the name of columns that you don't want to convert or their column positions are fixed) is that you can provide dictionary of names/column numbers as keys and desired type as value. So, if you set "str" as a value, columns in dataframe will have "object" type and won't get parsed.
pd.read_excel(**you_kwargs, dtype={"col_name1": "str", "col_name2": "str"})
btw, you may provide desired column types only for some columns, it's not required for all of them