We came across a situation where we had a file object representing Excel data (came from HTTP POST but I'm thinking it could also come from MongoDB for example), and would've liked to pass it directly to Pandas to parse (vs saving it to disk and passing path to Pandas).
Could this be possible?
I saw that xlrd had file_contents as a possible argument of open_workbook:
https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385
Maybe ExcelFile in Pandas could take path_or_buffer as argument, and pass along the correct one to xlrd.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1133
Don't know if that could work for openpyxl also.
Thoughts?
Thanks!
Nicolas
Have you tried passing a file-like object (e.g. StringIO)?
Thanks for the quick reply!
It seems that open_workbook calls open_workbook_xls which in turn calls biff2_8_load (https://github.com/python-excel/xlrd/blob/master/xlrd/book.py#L549)
They all have the same logic which basically is:
if file_contents:
filestr = file_contents
else:
f = open(filename, 'rb')
filestr = f.read()
f.close()
So they don't check a single variable to see what type it is (string, ie a path, or StringIO). I guess Pandas would have to do it for them.
A fair point. the ExcelFile class doesn't accept file-like objects on closer inspection. I'll have a look (or you can make a pull request if you get energetic)
Looked at openpyxl, and they do the check "file object vs path" themselves (https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43). So would mean only doing it for xlrd.
Tell you what, I feel energetic, so I'll try and look into it :)
Beat you to it. xlrd can't accept file-like objects so as a hack I write the data out to a temporary file
Haha well done! I didn't know we were racing ;)
Yeah didn't get a chance to work on it as much as I wanted. But I did spot:
file object (not file-like, has to be an instance of file, which is kind of limitting..)I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd.
I didn't know about tempfile (nice trick!). I was going to just feed xlrd the bytes from file.read(), but that wouldn't work for openpyxl. I also had put together a function to check what type of Excel file it is, inspired with how the master branch of xlrd does it:
def _excel_type(filepath_or_buffer):
# Thanks to xlrd for this
peeksz = 4
if isinstance(filepath_or_buffer, str):
f = open(filepath_or_buffer, "rb")
peek = f.read(peeksz)
f.close()
elif hasattr(filepath_or_buffer, 'read') \
and hasattr(filepath_or_buffer, 'seek'):
f = filepath_or_buffer
peek = f.read(peeksz)
f.seek(0)
else:
raise TypeError("You must provide the path to a file "
"or a file-like object")
# Check if ZIP file
if peek == "PK\x03\x04" \
or peek == "PK\x03\x04".encode('latin1'): # Python 3
return 'xlsx'
else:
return 'xls'
Then I would've checked the type (and I like your way of checking for file-like, ie needs a read method, possibly a seek too, this way it works with "file objects" coming from Flask HTTP uploads, or MongoDB GridFS..), and done:
wb = xlrd.open_workbook(filename=filename)
# or
f = open(filename, 'rb')
bytes = f.read()
f.close()
wb = xlrd.open_workbook(file_contents=bytes)
I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just _works_, and also is compatible with openpyxl.
Thanks for taking the time!
Ah, well you put in more effort on figuring it out than I did. We should return to this at some point and improve the solution, maybe once xlrd/openpyxl have settled down a bit
Agreed!
OK I refactored to not use a tempfile, a StringIO for openpyxl and passing file_contents to xlrd
https://github.com/pydata/pandas/commit/acb5f03d4bd89ede5d54849d88453299a2c3819a
Great! I think that's better :)
Just a couple small things:
BytesIO with openpyxl? Does the ZipFile constructor, which they use, accept a buffer?Thanks
sweet
Most helpful comment
Haha well done! I didn't know we were racing ;)
Yeah didn't get a chance to work on it as much as I wanted. But I did spot:
https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385
fileobject (not file-like, has to be an instance offile, which is kind of limitting..)https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43
I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd.
I didn't know about
tempfile(nice trick!). I was going to just feed xlrd the bytes from file.read(), but that wouldn't work for openpyxl. I also had put together a function to check what type of Excel file it is, inspired with how the master branch of xlrd does it:Then I would've checked the type (and I like your way of checking for file-like, ie needs a
readmethod, possibly aseektoo, this way it works with "file objects" coming from Flask HTTP uploads, or MongoDB GridFS..), and done:I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just _works_, and also is compatible with openpyxl.
Thanks for taking the time!