Pandas: Pandas' read_excel, ExcelFile, failing to open some .xls files.

Created on 2 Nov 2015  路  22Comments  路  Source: pandas-dev/pandas

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production -- specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available, 2013, coalpublic2013.xls file, works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the next decade of .xls files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.

The error that I get from pandas is:

`````` ---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
in ()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '

And I have tried various other things:

```df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
``````

to no avail. My pandas version: 0.17.0

IO Excel

Most helpful comment

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

All 22 comments

This rather seems an issue with xlrd (what pandas uses to read the excel files). Can you try something like

    import xlrd
    book = xlrd.open_workbook("myfile.xls")
    print "The number of worksheets is", book.nsheets
    print "Worksheet name(s):", book.sheet_names()
    sh = book.sheet_by_index(0)

And if that fails, maybe report there https://github.com/python-excel/xlrd
Further, what version of xlrd are you using?

it looks like that file is actually in an old XML format (note, not the current .xlsx XML), despite the extension. Probably easiest to just re-save into a supported format.

http://stackoverflow.com/questions/10293143/python-excel-processing-error

Thanks for your feedback, @jorisvandenbossche.

My version of xlrd is 0.9.4 (which should be the most up to date). The code snippet that you asked me to try failed in with the open_workbook command, with the same error that I posted above. I will crosspost this with the xlrd project.

@chris-b1 I suspect that you're right that it would be easiest to re-save into a supported format, but there are several problems that remain. The problem for me comes from the fact that I'm using this to teach a course that students are supposed to get this data themselves and to learn to use the read_excel function to read in Excel files found in the wild. Requiring them to own a copy of Excel and to re-save these files into a different Excel format is uninspiring.

Further, I'm not sure why @jreback closed this issue, has it been resolved? Or is it out of scope in some way? As I see it, the decision should be driven by the following questions:

  1. Is the file a valid Excel file?

a. If not, it's a file corruption problem and not a problem with pandas (this issue should be closed).

b. If so (Excel seems to think so, as it opens w/o error), and it is unreadable by pandas, then this issue should remain unresolved until it can be read by the pandas function read_excel.

c. A final option is that the library that pandas relies on is the culprit, and so while the bug appears in a pandas call, it comes from an underlying source (and the issue should be closed in the pandas and opened in the appropriate library).

Am I missing some consideration? And if not, why is the issue currently closed?

@jbwhit this is c). we use the xlrd library to read these files. So this is an upstream to pandas issue. You can file a report there.

Great, thanks for the prompt reply @jreback
Done: https://github.com/python-excel/xlrd/issues/156

Hey, this code is working correctly.

Read and write to excel

dataFileUrl = R"D:\real_names.xlsx"
data = pd.read_excel(dataFileUrl)

@vermastuti you seem to misunderstand my report. The key part is starting with my example Excel spreadsheet, namely, go to this link: http://www.eia.gov/coal/data.cfm#production then click the year 2012, and download the file named coalpublic2012.xls.

Pandas cannot read it.

Your code does not read that file. Here's the relevant error when using your code (today):

import pandas as pd
dataFileUrl = "coalpublic2012.xls"
data = pd.read_excel(dataFileUrl)

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-6-e8da944daef2> in <module>()
      1 import pandas as pd
      2 dataFileUrl = "/Users/jonathan/junk/coalpublic2012.xls"
----> 3 data = pd.read_excel(dataFileUrl)

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    160         formatting_info=formatting_info,
    161         on_demand=on_demand,
--> 162         ragged_rows=ragged_rows,
    163         )
    164     return bk

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in getbof(self, rqd_stream)
   1269             bof_error('Expected BOF record; met end of file')
   1270         if opcode not in bofcodes:
-> 1271             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1272         length = self.get2bytes()
   1273         if length == MY_EOF:

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in bof_error(msg)
   1263         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1264         def bof_error(msg):
-> 1265             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1266         savpos = self._position
   1267         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'

Is there any other way to open old xml as xls files?
Has this been resolved?

@darshanlol If you follow the various threads, you'll find that there are valid Excel files that cannot be read by Pandas, and that no one thinks this is a bug.

  • Pandas support will say that it's an xlrd problem, not a pandas problem, and will close (this) thread
  • xlrd here will say, "the file has been saved as "XML Spreadsheet (*.xml)" i.e. NOT in XLS or XLSX format, not supported by xlrd", and will close the thread
  • And if you happen to email eia.gov and ask them to provide their data in a different Excel format (or csv), they will simply ignore your request

As far as I can tell it's a "won't fix" situation by everyone involved.

@jbwhit I followed the thread but assumed this might have been addressed since the issue was in 2015

Pandas support will say that it's an xlrd problem, not a pandas problem, and will close (this) thread

To clarify, there is no code in pandas itself for actually reading the excel file, we only rely on third-party libraries, and I don't think we are going the change that. But it's not that we don't want to be able to read such format. If somebody writes a library (with similar api as xlrd) that can read it, we might consider relying upon it.

Although a non-Python solution is not very satisfying, this is an issue your students will face "in the wild". If one has access to a Windows 10 OS, PowerShell can be used to iterate over each of the 1997-2003 excel XML formatted files and convert them to the current .xlsx format. This is a quick and dirty solution I resort to in my day-to-day work. An example is attached
convertXls2Xlsx.txt

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

has this been solved because i am experiencing similar issue now

So far the "fix" is to open in Excel/Windows, re-save it, then open with pandas/xlrd. Not sure of other options at this point.

import pandas as pd
from xml.sax import ContentHandler, parse

Reference https://goo.gl/KaOBG3

class ExcelHandler(ContentHandler):
def __init__(self):
self.chars = [ ]
self.cells = [ ]
self.rows = [ ]
self.tables = [ ]
def characters(self, content):
self.chars.append(content)
def startElement(self, name, atts):
if name=="Cell":
self.chars = [ ]
elif name=="Row":
self.cells=[ ]
elif name=="Table":
self.rows = [ ]
def endElement(self, name):
if name=="Cell":
self.cells.append(''.join(self.chars))
elif name=="Row":
self.rows.append(self.cells)
elif name=="Table":
self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])

I faced the same issue by Iran's stock market historical data files in xls format and used @vermastuti 's solution and it worked.
But I don't know how to process table data. I want to read data from about 2000 same excell files and insert them to database.

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

Thank you man. You are my hero.

Hi, I had faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter.

import pandas as pd
from bs4 import BeautifulSoup

def convert_to_xlsx():
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)

        writer.save()

I was receiving "XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'

The function _pandas.read_html_ function worked for me, it returns a list of dataframes:

(df, ) = pd.read_html('file.xls')
Was this page helpful?
0 / 5 - 0 ratings

Related issues

jreback picture jreback  路  61Comments

ShaharNaveh picture ShaharNaveh  路  51Comments

bgrayburn picture bgrayburn  路  46Comments

MarkiesFredje picture MarkiesFredje  路  42Comments

quicknir picture quicknir  路  58Comments