Pandas: Pandas' read_excel, ExcelFile, failing to open some .xls files.

Created on 2 Nov 2015 · 22Comments · Source: pandas-dev/pandas

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production -- specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available, 2013, coalpublic2013.xls file, works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the next decade of .xls files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.

The error that I get from pandas is:

`````` ---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
in ()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '

And I have tried various other things:

```df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
``````

to no avail. My pandas version: 0.17.0

IO Excel

Source

jbwhit

Most helpful comment

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

vermastuti on 21 Jun 2018

👍3 ❤1

All 22 comments

This rather seems an issue with xlrd (what pandas uses to read the excel files). Can you try something like

    import xlrd
    book = xlrd.open_workbook("myfile.xls")
    print "The number of worksheets is", book.nsheets
    print "Worksheet name(s):", book.sheet_names()
    sh = book.sheet_by_index(0)

And if that fails, maybe report there https://github.com/python-excel/xlrd
Further, what version of xlrd are you using?

jorisvandenbossche on 2 Nov 2015

it looks like that file is actually in an old XML format (note, not the current .xlsx XML), despite the extension. Probably easiest to just re-save into a supported format.

http://stackoverflow.com/questions/10293143/python-excel-processing-error

chris-b1 on 2 Nov 2015

Thanks for your feedback, @jorisvandenbossche.

My version of xlrd is 0.9.4 (which should be the most up to date). The code snippet that you asked me to try failed in with the open_workbook command, with the same error that I posted above. I will crosspost this with the xlrd project.

jbwhit on 2 Nov 2015

@chris-b1 I suspect that you're right that it would be easiest to re-save into a supported format, but there are several problems that remain. The problem for me comes from the fact that I'm using this to teach a course that students are supposed to get this data themselves and to learn to use the read_excel function to read in Excel files found in the wild. Requiring them to own a copy of Excel and to re-save these files into a different Excel format is uninspiring.

Further, I'm not sure why @jreback closed this issue, has it been resolved? Or is it out of scope in some way? As I see it, the decision should be driven by the following questions:

Is the file a valid Excel file?

a. If not, it's a file corruption problem and not a problem with pandas (this issue should be closed).

b. If so (Excel seems to think so, as it opens w/o error), and it is unreadable by pandas, then this issue should remain unresolved until it can be read by the pandas function read_excel.

c. A final option is that the library that pandas relies on is the culprit, and so while the bug appears in a pandas call, it comes from an underlying source (and the issue should be closed in the pandas and opened in the appropriate library).

Am I missing some consideration? And if not, why is the issue currently closed?

jbwhit on 2 Nov 2015

@jbwhit this is c). we use the xlrd library to read these files. So this is an upstream to pandas issue. You can file a report there.

jreback on 2 Nov 2015

Great, thanks for the prompt reply @jreback
Done: https://github.com/python-excel/xlrd/issues/156

jbwhit on 2 Nov 2015

Hey, this code is working correctly.

Read and write to excel

dataFileUrl = R"D:\real_names.xlsx"
data = pd.read_excel(dataFileUrl)

vermastuti on 28 Mar 2018

@vermastuti you seem to misunderstand my report. The key part is starting with my example Excel spreadsheet, namely, go to this link: http://www.eia.gov/coal/data.cfm#production then click the year 2012, and download the file named coalpublic2012.xls.

Pandas cannot read it.

Your code does not read that file. Here's the relevant error when using your code (today):

import pandas as pd
dataFileUrl = "coalpublic2012.xls"
data = pd.read_excel(dataFileUrl)

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-6-e8da944daef2> in <module>()
      1 import pandas as pd
      2 dataFileUrl = "/Users/jonathan/junk/coalpublic2012.xls"
----> 3 data = pd.read_excel(dataFileUrl)

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    160         formatting_info=formatting_info,
    161         on_demand=on_demand,
--> 162         ragged_rows=ragged_rows,
    163         )
    164     return bk

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in getbof(self, rqd_stream)
   1269             bof_error('Expected BOF record; met end of file')
   1270         if opcode not in bofcodes:
-> 1271             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1272         length = self.get2bytes()
   1273         if length == MY_EOF:

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in bof_error(msg)
   1263         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1264         def bof_error(msg):
-> 1265             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1266         savpos = self._position
   1267         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'

jbwhit on 29 Mar 2018

😕1

Is there any other way to open old xml as xls files?
Has this been resolved?

darshanlol on 20 Apr 2018

@darshanlol If you follow the various threads, you'll find that there are valid Excel files that cannot be read by Pandas, and that no one thinks this is a bug.

Pandas support will say that it's an xlrd problem, not a pandas problem, and will close (this) thread
xlrd here will say, "the file has been saved as "XML Spreadsheet (*.xml)" i.e. NOT in XLS or XLSX format, not supported by xlrd", and will close the thread
And if you happen to email eia.gov and ask them to provide their data in a different Excel format (or csv), they will simply ignore your request

As far as I can tell it's a "won't fix" situation by everyone involved.

jbwhit on 20 Apr 2018

😕2

@jbwhit I followed the thread but assumed this might have been addressed since the issue was in 2015

darshanlol on 20 Apr 2018

Pandas support will say that it's an xlrd problem, not a pandas problem, and will close (this) thread

To clarify, there is no code in pandas itself for actually reading the excel file, we only rely on third-party libraries, and I don't think we are going the change that. But it's not that we don't want to be able to read such format. If somebody writes a library (with similar api as xlrd) that can read it, we might consider relying upon it.

jorisvandenbossche on 20 Apr 2018

👍2

Although a non-Python solution is not very satisfying, this is an issue your students will face "in the wild". If one has access to a Windows 10 OS, PowerShell can be used to iterate over each of the 1997-2003 excel XML formatted files and convert them to the current .xlsx format. This is a quick and dirty solution I resort to in my day-to-day work. An example is attached
convertXls2Xlsx.txt

dgn2 on 21 Jun 2018

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

vermastuti on 21 Jun 2018

👍3 ❤1

has this been solved because i am experiencing similar issue now

AlugoIdris on 17 Jul 2019

So far the "fix" is to open in Excel/Windows, re-save it, then open with pandas/xlrd. Not sure of other options at this point.

oolongtea on 31 Jul 2019

👍2

import pandas as pd
from xml.sax import ContentHandler, parse

Reference https://goo.gl/KaOBG3

class ExcelHandler(ContentHandler):
def __init__(self):
self.chars = [ ]
self.cells = [ ]
self.rows = [ ]
self.tables = [ ]
def characters(self, content):
self.chars.append(content)
def startElement(self, name, atts):
if name=="Cell":
self.chars = [ ]
elif name=="Row":
self.cells=[ ]
elif name=="Table":
self.rows = [ ]
def endElement(self, name):
if name=="Cell":
self.cells.append(''.join(self.chars))
elif name=="Row":
self.rows.append(self.cells)
elif name=="Table":
self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])

Saurabhtiwari2890 on 17 Dec 2019

I faced the same issue by Iran's stock market historical data files in xls format and used @vermastuti 's solution and it worked.
But I don't know how to process table data. I want to read data from about 2000 same excell files and insert them to database.

jnoruzi on 27 Apr 2020

@jbwhit I have run the following code:
import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)
This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

Thank you man. You are my hero.

ryuncha on 29 May 2020

👍2

Hi, I had faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter.

import pandas as pd
from bs4 import BeautifulSoup

def convert_to_xlsx():
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)

        writer.save()