Pandas: Parse Excel from in-memory file object

Created on 26 Jun 2012  路  11Comments  路  Source: pandas-dev/pandas

We came across a situation where we had a file object representing Excel data (came from HTTP POST but I'm thinking it could also come from MongoDB for example), and would've liked to pass it directly to Pandas to parse (vs saving it to disk and passing path to Pandas).

Could this be possible?

I saw that xlrd had file_contents as a possible argument of open_workbook:
https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385

Maybe ExcelFile in Pandas could take path_or_buffer as argument, and pass along the correct one to xlrd.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1133

Don't know if that could work for openpyxl also.

Thoughts?

Thanks!
Nicolas

Enhancement

Most helpful comment

Haha well done! I didn't know we were racing ;)

Yeah didn't get a chance to work on it as much as I wanted. But I did spot:

I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd.

I didn't know about tempfile (nice trick!). I was going to just feed xlrd the bytes from file.read(), but that wouldn't work for openpyxl. I also had put together a function to check what type of Excel file it is, inspired with how the master branch of xlrd does it:

def _excel_type(filepath_or_buffer):
    # Thanks to xlrd for this
    peeksz = 4
    if isinstance(filepath_or_buffer, str):
        f = open(filepath_or_buffer, "rb")
        peek = f.read(peeksz)
        f.close()
    elif hasattr(filepath_or_buffer, 'read') \
            and hasattr(filepath_or_buffer, 'seek'):
        f = filepath_or_buffer
        peek = f.read(peeksz)
        f.seek(0)
    else:
        raise TypeError("You must provide the path to a file "
                        "or a file-like object")
    # Check if ZIP file
    if peek == "PK\x03\x04" \
        or peek == "PK\x03\x04".encode('latin1'): # Python 3
        return 'xlsx'
    else:
        return 'xls'

Then I would've checked the type (and I like your way of checking for file-like, ie needs a read method, possibly a seek too, this way it works with "file objects" coming from Flask HTTP uploads, or MongoDB GridFS..), and done:

wb = xlrd.open_workbook(filename=filename)
# or
f = open(filename, 'rb')
bytes = f.read()
f.close()
wb = xlrd.open_workbook(file_contents=bytes)

I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just _works_, and also is compatible with openpyxl.

Thanks for taking the time!

All 11 comments

Have you tried passing a file-like object (e.g. StringIO)?

Thanks for the quick reply!

It seems that open_workbook calls open_workbook_xls which in turn calls biff2_8_load (https://github.com/python-excel/xlrd/blob/master/xlrd/book.py#L549)

They all have the same logic which basically is:

if file_contents:
   filestr = file_contents
else:
   f = open(filename, 'rb')
   filestr = f.read()
   f.close()

So they don't check a single variable to see what type it is (string, ie a path, or StringIO). I guess Pandas would have to do it for them.

A fair point. the ExcelFile class doesn't accept file-like objects on closer inspection. I'll have a look (or you can make a pull request if you get energetic)

Looked at openpyxl, and they do the check "file object vs path" themselves (https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43). So would mean only doing it for xlrd.

Tell you what, I feel energetic, so I'll try and look into it :)

Beat you to it. xlrd can't accept file-like objects so as a hack I write the data out to a temporary file

Haha well done! I didn't know we were racing ;)

Yeah didn't get a chance to work on it as much as I wanted. But I did spot:

I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd.

I didn't know about tempfile (nice trick!). I was going to just feed xlrd the bytes from file.read(), but that wouldn't work for openpyxl. I also had put together a function to check what type of Excel file it is, inspired with how the master branch of xlrd does it:

def _excel_type(filepath_or_buffer):
    # Thanks to xlrd for this
    peeksz = 4
    if isinstance(filepath_or_buffer, str):
        f = open(filepath_or_buffer, "rb")
        peek = f.read(peeksz)
        f.close()
    elif hasattr(filepath_or_buffer, 'read') \
            and hasattr(filepath_or_buffer, 'seek'):
        f = filepath_or_buffer
        peek = f.read(peeksz)
        f.seek(0)
    else:
        raise TypeError("You must provide the path to a file "
                        "or a file-like object")
    # Check if ZIP file
    if peek == "PK\x03\x04" \
        or peek == "PK\x03\x04".encode('latin1'): # Python 3
        return 'xlsx'
    else:
        return 'xls'

Then I would've checked the type (and I like your way of checking for file-like, ie needs a read method, possibly a seek too, this way it works with "file objects" coming from Flask HTTP uploads, or MongoDB GridFS..), and done:

wb = xlrd.open_workbook(filename=filename)
# or
f = open(filename, 'rb')
bytes = f.read()
f.close()
wb = xlrd.open_workbook(file_contents=bytes)

I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just _works_, and also is compatible with openpyxl.

Thanks for taking the time!

Ah, well you put in more effort on figuring it out than I did. We should return to this at some point and improve the solution, maybe once xlrd/openpyxl have settled down a bit

Agreed!

OK I refactored to not use a tempfile, a StringIO for openpyxl and passing file_contents to xlrd

https://github.com/pydata/pandas/commit/acb5f03d4bd89ede5d54849d88453299a2c3819a
Great! I think that's better :)

Just a couple small things:

  • you might have 'kind' as a parameter still in your doctring, but you didn't end up using it, might want to take it out to avoid confusion?
  • I'm guessing the answer is yes, but did you test the BytesIO with openpyxl? Does the ZipFile constructor, which they use, accept a buffer?

Thanks

sweet

Was this page helpful?
0 / 5 - 0 ratings

Related issues

amelio-vazquez-reina picture amelio-vazquez-reina  路  3Comments

tade0726 picture tade0726  路  3Comments

Ashutosh-Srivastav picture Ashutosh-Srivastav  路  3Comments

ebran picture ebran  路  3Comments

ericdf picture ericdf  路  3Comments