I am trying to read in .xls
files from http://www.eia.gov/coal/data.cfm#production -- specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls
file that's freely available via the dropdown. Pandas cannot read it.
In contrast, the file for the most recent year available, 2013, coalpublic2013.xls
file, works without a problem:
import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")
but the next decade of .xls
files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.
The error that I get from pandas is:
`````` ---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
----> 1 df = pd.read_excel("coalpublic2012.xlsx")
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '
And I have tried various other things:
```df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
``````
to no avail. My pandas version: 0.17.0
This rather seems an issue with xlrd
(what pandas uses to read the excel files). Can you try something like
import xlrd
book = xlrd.open_workbook("myfile.xls")
print "The number of worksheets is", book.nsheets
print "Worksheet name(s):", book.sheet_names()
sh = book.sheet_by_index(0)
And if that fails, maybe report there https://github.com/python-excel/xlrd
Further, what version of xlrd
are you using?
it looks like that file is actually in an old XML format (note, not the current .xlsx XML), despite the extension. Probably easiest to just re-save into a supported format.
http://stackoverflow.com/questions/10293143/python-excel-processing-error
Thanks for your feedback, @jorisvandenbossche.
My version of xlrd
is 0.9.4 (which should be the most up to date). The code snippet that you asked me to try failed in with the open_workbook
command, with the same error that I posted above. I will crosspost this with the xlrd
project.
@chris-b1 I suspect that you're right that it would be easiest to re-save into a supported format, but there are several problems that remain. The problem for me comes from the fact that I'm using this to teach a course that students are supposed to get this data themselves and to learn to use the read_excel
function to read in Excel files found in the wild. Requiring them to own a copy of Excel and to re-save these files into a different Excel format is uninspiring.
Further, I'm not sure why @jreback closed this issue, has it been resolved? Or is it out of scope in some way? As I see it, the decision should be driven by the following questions:
a. If not, it's a file corruption problem and not a problem with pandas
(this issue should be closed).
b. If so (Excel seems to think so, as it opens w/o error), and it is unreadable by pandas
, then this issue should remain unresolved until it can be read by the pandas
function read_excel
.
c. A final option is that the library that pandas
relies on is the culprit, and so while the bug appears in a pandas
call, it comes from an underlying source (and the issue should be closed in the pandas
and opened in the appropriate library).
Am I missing some consideration? And if not, why is the issue currently closed?
@jbwhit this is c). we use the xlrd
library to read these files. So this is an upstream to pandas issue. You can file a report there.
Great, thanks for the prompt reply @jreback
Done: https://github.com/python-excel/xlrd/issues/156
Hey, this code is working correctly.
dataFileUrl = R"D:\real_names.xlsx"
data = pd.read_excel(dataFileUrl)
@vermastuti you seem to misunderstand my report. The key part is starting with my example Excel spreadsheet, namely, go to this link: http://www.eia.gov/coal/data.cfm#production then click the year 2012, and download the file named coalpublic2012.xls
.
Pandas cannot read it.
Your code does not read that file. Here's the relevant error when using your code (today):
import pandas as pd
dataFileUrl = "coalpublic2012.xls"
data = pd.read_excel(dataFileUrl)
---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
<ipython-input-6-e8da944daef2> in <module>()
1 import pandas as pd
2 dataFileUrl = "/Users/jonathan/junk/coalpublic2012.xls"
----> 3 data = pd.read_excel(dataFileUrl)
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
116 else:
117 kwargs[new_arg_name] = new_arg_value
--> 118 return func(*args, **kwargs)
119 return wrapper
120 return _deprecate_kwarg
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
228
229 if not isinstance(io, ExcelFile):
--> 230 io = ExcelFile(io, engine=engine)
231
232 return io._parse_excel(
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
292 self.book = xlrd.open_workbook(file_contents=data)
293 elif isinstance(self._io, compat.string_types):
--> 294 self.book = xlrd.open_workbook(self._io)
295 else:
296 raise ValueError('Must explicitly set engine if not passing in'
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
160 formatting_info=formatting_info,
161 on_demand=on_demand,
--> 162 ragged_rows=ragged_rows,
163 )
164 return bk
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in getbof(self, rqd_stream)
1269 bof_error('Expected BOF record; met end of file')
1270 if opcode not in bofcodes:
-> 1271 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1272 length = self.get2bytes()
1273 if length == MY_EOF:
~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in bof_error(msg)
1263 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1264 def bof_error(msg):
-> 1265 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1266 savpos = self._position
1267 opcode = self.get2bytes()
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'
Is there any other way to open old xml as xls files?
Has this been resolved?
@darshanlol If you follow the various threads, you'll find that there are valid Excel files that cannot be read by Pandas, and that no one thinks this is a bug.
xlrd
problem, not a pandas problem, and will close (this) threadxlrd
here will say, "the file has been saved as "XML Spreadsheet (*.xml)" i.e. NOT in XLS or XLSX format, not supported by xlrd", and will close the threadAs far as I can tell it's a "won't fix" situation by everyone involved.
@jbwhit I followed the thread but assumed this might have been addressed since the issue was in 2015
Pandas support will say that it's an xlrd problem, not a pandas problem, and will close (this) thread
To clarify, there is no code in pandas itself for actually reading the excel file, we only rely on third-party libraries, and I don't think we are going the change that. But it's not that we don't want to be able to read such format. If somebody writes a library (with similar api as xlrd) that can read it, we might consider relying upon it.
Although a non-Python solution is not very satisfying, this is an issue your students will face "in the wild". If one has access to a Windows 10 OS, PowerShell can be used to iterate over each of the 1997-2003 excel XML formatted files and convert them to the current .xlsx format. This is a quick and dirty solution I resort to in my day-to-day work. An example is attached
convertXls2Xlsx.txt
@jbwhit I have run the following code:
import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)
This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.
has this been solved because i am experiencing similar issue now
So far the "fix" is to open in Excel/Windows, re-save it, then open with pandas/xlrd. Not sure of other options at this point.
import pandas as pd
from xml.sax import ContentHandler, parse
class ExcelHandler(ContentHandler):
def __init__(self):
self.chars = [ ]
self.cells = [ ]
self.rows = [ ]
self.tables = [ ]
def characters(self, content):
self.chars.append(content)
def startElement(self, name, atts):
if name=="Cell":
self.chars = [ ]
elif name=="Row":
self.cells=[ ]
elif name=="Table":
self.rows = [ ]
def endElement(self, name):
if name=="Cell":
self.cells.append(''.join(self.chars))
elif name=="Row":
self.rows.append(self.cells)
elif name=="Table":
self.tables.append(self.rows)
excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
I faced the same issue by Iran's stock market historical data files in xls format and used @vermastuti 's solution and it worked.
But I don't know how to process table data. I want to read data from about 2000 same excell files and insert them to database.
@jbwhit I have run the following code:
import pandas as pd #Read and write to excel dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls" data = pd.read_table(dataFileUrl)
This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.
Thank you man. You are my hero.
Hi, I had faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record;
error and solved it by writing an XML to XLSX converter.
import pandas as pd
from bs4 import BeautifulSoup
def convert_to_xlsx():
with open('sample.xls') as xml_file:
soup = BeautifulSoup(xml_file.read(), 'xml')
writer = pd.ExcelWriter('sample.xlsx')
for sheet in soup.findAll('Worksheet'):
sheet_as_list = []
for row in sheet.findAll('Row'):
sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)
writer.save()
I was receiving "XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'
The function _pandas.read_html_ function worked for me, it returns a list of dataframes:
(df, ) = pd.read_html('file.xls')
I found a solution here https://stackoverflow.com/a/24476655/12692651
Most helpful comment
@jbwhit I have run the following code:
This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.