Pandas: Deprecate using `xlrd` engine in favor of openpyxl

Created on 20 Sep 2019 · 23 Comments · Source: pandas-dev/pandas

xlrd is unmaintained and the previous maintainer has asked us to move towards openpyxl. xlrd works now, but might have some issues when Python 3.9 or later gets released and changes some elements of the XML parser, as default usage right now throws a PendingDeprecationWarning.

Considering that, I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former, and it does offer some functionality the latter doesn't (namely reading .xls files), but we should at the very least start moving towards the latter.

Labels: Deprecate, IO Excel, good first issue

All 23 comments

@WillAyd Should we start with adding openpyxl as an engine in pandas.read_excel?

Happy to contribute a PR for it.

It is already available; we just need to make it the default over time. So when the user doesn't explicitly provide an engine, we want to raise a FutureWarning saying that the default will change to openpyxl in a future release.
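The warning behaviour proposed here could look roughly like the sketch below. `resolve_engine` is a hypothetical helper invented for illustration, not pandas' actual implementation, and the message wording is an assumption.

```python
import warnings


def resolve_engine(engine=None):
    """Return the engine to use, warning when the caller relied on the default.

    Hypothetical helper for illustration only; not part of pandas.
    """
    if engine is None:
        # The caller did not choose an engine, so announce the upcoming change
        # while still returning today's default.
        warnings.warn(
            "The default Excel engine will change from 'xlrd' to 'openpyxl' "
            "in a future release. Pass engine='xlrd' or engine='openpyxl' "
            "explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        return "xlrd"
    # An explicit choice is respected and never triggers the warning.
    return engine
```

Users who already pass `engine=...` would see no change, which addresses part of the concern about noisy warnings.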

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

The docstrings already require some updating as they currently indicate 'xlrd' is the only option for 'engine'.

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

It would require installing a different optional package, so a version with deprecation messages/future warnings would be useful.

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format. From experience, even ".xlsx" files aren't as standardized as you'd hope. The openpyxl reader is pretty new, so I guess we will see through a proper deprecation cycle what differences, if any, arise.

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format.

And to be clear, for those the default would stay xlrd, and this would not be deprecated, right?

So the question about "switching the default" was only for xlsx files.
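The split being asked about (keep xlrd for .xls, switch only .xlsx) can be sketched as an extension-based default. `DEFAULT_ENGINES` and `default_engine_for` are hypothetical names used only for illustration, and the mapping reflects this proposal rather than any decision reached in the thread.

```python
from pathlib import Path

# Assumed mapping for illustration: legacy .xls stays on xlrd,
# modern .xlsx moves to openpyxl.
DEFAULT_ENGINES = {".xls": "xlrd", ".xlsx": "openpyxl"}


def default_engine_for(path):
    """Pick a default engine name from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return DEFAULT_ENGINES[suffix]
    except KeyError:
        raise ValueError(f"no default engine for extension {suffix!r}") from None
```

A caller could then pass the result as the `engine` argument of `pandas.read_excel`, for example `pd.read_excel(path, engine=default_engine_for(path))`.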

Up for debate, but for that I think we want to push people towards reading .xlsx files. xlrd is not maintained any more and might break with Python 3.9, so we would want to get ahead of that as much as possible.

So what is the decision?

Dump xlrd, disregarding .xls support, and replace it with openpyxl?

We need to deprecate using xlrd by default. I think it's fine to do for all extensions, including .xls - interested in a PR?

Hi @WillAyd I'm interested in working on this. Is the decision to just raise a FutureWarning for all extensions when no engine is explicitly provided by the user?

@GallowayJ great - that would be much appreciated! Yes I think let's start with that and see how it looks

Okay thanks! I'll get cracking!

@GallowayJ hey, are you working on it?

@Kathakali123 Yeah

take

@TomAugspurger @jreback FYI removing from 1.1 milestone. linked PR isn't milestoned.

Given that people using pandas are often not in control of the data they receive, would it be possible for pandas-dev to patch xlrd's broken use of getiterator?

we don't maintain xlrd at all

I suppose a monkey patch from the community that makes it work would be ok

we don't maintain xlrd at all

Yeah, nobody does :( , so even though this bug is absolutely trivial to fix there's nowhere to submit the patch -> https://github.com/python-excel/xlrd/compare/python-excel:f8371f0...fiendish:e995456

(TBH, ElementTree.iter was introduced in python 2.7, which is the oldest version that xlrd claims to support anyway. I'm not even sure why it bothers looking at getiterator at all)

you can try to monkey patch - if it works would be willing to consider a patch
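One possible shape for such a monkey patch is sketched below. It assumes the failing call is `getiterator()` on a standard-library `ElementTree` object and simply restores that method as a wrapper around `iter()`. `restore_getiterator` is a hypothetical name, and this is neither the patch linked above nor anything pandas ships.

```python
import io
import xml.etree.ElementTree as ET


def restore_getiterator():
    """Add ElementTree.getiterator back if the running Python lacks it.

    Returns True if the method was patched in, False if it already existed.
    """
    if hasattr(ET.ElementTree, "getiterator"):
        return False

    def getiterator(self, tag=None):
        # Same contract as the removed method: a list of matching elements.
        return list(self.iter(tag))

    ET.ElementTree.getiterator = getiterator
    return True


# Apply the patch before any library code calls getiterator().
restore_getiterator()
tree = ET.parse(io.BytesIO(b"<a><b/><c/></a>"))
tags = [elem.tag for elem in tree.getiterator()]
```

The patch only touches the pure-Python `ElementTree` class. It does not attempt to patch `Element`, which is implemented in C in CPython and does not accept new attributes.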

A year down the line, it's time to see this change made. It's disappointing to see it dropped from milestones when it needlessly results in pain for people trying to read modern Excel files.

For .xlsx, xlrd absolutely positively should not be used, and I say that as the main maintainer of xlrd over the last decade plus.

What proportion of users are still reading data from .xls files (as opposed to .xlsx)? While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd. Never mind the dangers of using the .xls pseudo-standard that have caused some quite high-profile problems of late.

What proportion of users are still reading data from .xls files (as opposed to .xlsx)?

Sadly some of us don't get to pick and choose what data files we work with. That's definitely not your problem to solve, but it is mine so I have to try to defend keeping xlrd alive here. Even the latest version of Excel for mac calls xls a "Common Format".

While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd.

If you want to transfer ownership of the repository to me so that I can make a two line change via s/getiterator/iter (or the more nostalgic patch I linked earlier), I'm happy to make that change and no other changes just to stop the only available option for reading xls files from getting forced into the bin for a terrible reason (I mean the deprecation of getiterator, not your choice to stop maintaining). It seems reasonable to do on the premise that ElementTree.iter has existed since Python 2.7.

But I don't need anyone to "step up and properly maintain" it. I just need it to not stop being an option entirely. If push comes to shove, if someone (including me) can't monkey patch around xlrd's forced obsolescence inside pandas, I can at least keep using my own patched version of xlrd as long as Pandas doesn't work towards dropping xlrd entirely.
