Pandas: Deprecate using `xlrd` engine in favor of openpyxl

Created on 20 Sep 2019 · 23 Comments · Source: pandas-dev/pandas

xlrd is unmaintained, and the previous maintainer has asked us to move towards openpyxl. xlrd works now, but might have some issues when Python 3.9 or later is released and changes some elements of the XML parser; default usage right now throws a PendingDeprecationWarning.

Considering that, I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former, and it does offer some functionality the latter doesn't (namely reading .xls files), but we should at the very least start moving towards the latter.

Labels: Deprecate, IO Excel, good first issue

WillAyd

All 23 comments

@WillAyd Should we start by adding openpyxl as an engine in pandas.read_excel?

Happy to contribute a PR for it.

arpit1997 on 20 Sep 2019

It is already available; we just need to make it the default over time. So we want to raise a FutureWarning when the user doesn't explicitly provide an engine, saying that the default will change to openpyxl in a future release.

WillAyd on 20 Sep 2019

👍1
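The behaviour proposed in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: `resolve_engine` is a hypothetical helper, not a pandas function, and the warning text is made up.

```python
import warnings


def resolve_engine(engine=None):
    """Return the Excel engine to use, warning if the caller did not pick one.

    Hypothetical helper for illustration; not pandas' actual implementation.
    """
    if engine is None:
        warnings.warn(
            "The default Excel reader engine will change from 'xlrd' to "
            "'openpyxl' in a future version; pass engine='openpyxl' or "
            "engine='xlrd' explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return "xlrd"  # the existing default stays during the deprecation
    return engine
```

With this shape, callers who already pass an engine explicitly see no warning, which speaks to the concern about noisy warnings raised in the next comment only in part: everyone relying on the default would still be warned.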

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

jorisvandenbossche on 21 Sep 2019

👍1

The docstrings already require some updating as they currently indicate 'xlrd' is the only option for 'engine'.

153957 on 23 Sep 2019

👍1

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

It would require installing a different optional package, so a version with deprecation messages/future warnings would be useful.

153957 on 23 Sep 2019

Would it be an option to simply switch the default, without first raising a warning? Or do the two engines give different results in quite some cases?

Reason I am asking is because if for 99% of the use cases both give exactly the same, raising a warning for all those users feels a bit annoying.

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format. From experience, even ".xlsx" formats aren't as standardized as you'd hope. The openpyxl reader is pretty new, so I guess we will see through a proper deprecation cycle what differences, if any, arise.

WillAyd on 23 Sep 2019

Just to add - xlrd is AFAIK the only library that can read the legacy .xls format.

And to be clear, for those the default would stay xlrd, and this would not be deprecated, right?

So the question about "switching the default" was only for xlsx files.

jorisvandenbossche on 23 Sep 2019
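One way to picture the "switch the default only for .xlsx" option discussed here is a per-extension default. This is a minimal sketch, assuming a hypothetical mapping and helper name (`DEFAULT_ENGINE_BY_EXTENSION`, `default_engine_for`); the thread had not settled on this design.

```python
import os

# Hypothetical mapping, for illustration only.
DEFAULT_ENGINE_BY_EXTENSION = {
    ".xls": "xlrd",       # legacy binary format, which openpyxl does not read
    ".xlsx": "openpyxl",  # modern XML-based format
}


def default_engine_for(path):
    """Pick a default engine from the file extension (case-insensitive)."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return DEFAULT_ENGINE_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"no default Excel engine for extension {ext!r}") from None
```

Under this sketch, .xls users would keep xlrd without any change, and only .xlsx users would be moved to openpyxl.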

Up for debate, but for that I think we want to push people towards reading .xlsx files. xlrd is not maintained any more and might break with Python 3.9, so we would want to get ahead of that as much as possible.

WillAyd on 23 Sep 2019

So what is the decision?

Dump xlrd, disregarding .xls support, and replace it with openpyxl?

Hiyorimi on 9 Oct 2019

We need to deprecate using xlrd by default. I think it's fine to do for all extensions, including .xls - interested in a PR?

WillAyd on 9 Oct 2019

Hi @WillAyd I'm interested in working on this. Is the decision to just raise a FutureWarning for all extensions when no engine is explicitly provided by the user?

GallowayJ on 17 Oct 2019

@GallowayJ great - that would be much appreciated! Yes I think let's start with that and see how it looks

WillAyd on 17 Oct 2019

Okay thanks! I'll get cracking!

GallowayJ on 17 Oct 2019

@GallowayJ hey, are you working on it?

Kathakali123 on 17 Oct 2019

@Kathakali123 Yeah

GallowayJ on 17 Oct 2019

take

roberthdevries on 27 Jun 2020

@TomAugspurger @jreback FYI removing from 1.1 milestone. linked PR isn't milestoned.

simonjayhawkins on 22 Jul 2020

👎1

Given that people using pandas are often not in control of the data they receive, would it be possible for pandas-dev to patch xlrd's broken use of getiterator?

fiendish on 18 Nov 2020

👍2

we don’t maintain xlrd at all

i suppose a monkey patch from the community that makes it work would be ok

jreback on 18 Nov 2020

we don’t maintain xlrd at all

Yeah, nobody does :( , so even though this bug is absolutely trivial to fix there's nowhere to submit the patch -> https://github.com/python-excel/xlrd/compare/python-excel:f8371f0...fiendish:e995456

(TBH, ElementTree.iter was introduced in python 2.7, which is the oldest version that xlrd claims to support anyway. I'm not even sure why it bothers looking at getiterator at all)

fiendish on 18 Nov 2020
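For readers unfamiliar with the `getiterator` issue: `iter()` is the replacement for the deprecated `getiterator()` method in the standard library's ElementTree, and `getiterator()` was removed in Python 3.9. The sketch below shows the kind of compatibility shim being discussed, using only the standard library. `iter_elements` is a hypothetical name and this is not xlrd's actual code.

```python
import xml.etree.ElementTree as ET


def iter_elements(node):
    """Iterate over node and all its descendants in document order.

    Prefers the modern iter(); falls back to getiterator() only on very old
    ElementTree versions that lack iter().
    """
    if hasattr(node, "iter"):
        return node.iter()
    return node.getiterator()


root = ET.fromstring("<sheet><row><c>1</c><c>2</c></row></sheet>")
tags = [elem.tag for elem in iter_elements(root)]
print(tags)  # ['sheet', 'row', 'c', 'c']
```

Since `iter` exists on every Python version the thread mentions, the fallback branch is effectively dead code, which is the point made in the comment above.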

you can try to monkey patch - if it works would be willing to consider a patch

jreback on 19 Nov 2020

A year down the line, it's time to see this change made. It's disappointing to see it dropped from milestones when it needlessly results in pain for people trying to read modern Excel files.

For .xlsx, xlrd absolutely positively should not be used, and I say that as the main maintainer of xlrd over the last decade plus.

What proportion of users are still reading data from .xls files (as opposed to .xlsx)? While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd. Never mind the dangers of using the .xls pseudo-standard that have caused some quite high-profile problems of late.

cjw296 on 29 Nov 2020

What proportion of users are still reading data from .xls files (as opposed to .xlsx)?

Sadly some of us don't get to pick and choose what data files we work with. That's definitely not your problem to solve, but it is mine so I have to try to defend keeping xlrd alive here. Even the latest version of Excel for mac calls xls a "Common Format".

While I feel for these users, they either need to stick on an old version of Python or Pandas, or someone needs to step up and properly maintain xlrd.

If you want to transfer ownership of the repository to me so that I can make a two line change via s/getiterator/iter (or the more nostalgic patch I linked earlier), I'm happy to make that change and no other changes just to stop the only available option for reading xls files from getting forced into the bin for a terrible reason (I mean the deprecation of getiterator, not your choice to stop maintaining). It seems reasonable to do on the premise that ElementTree.iter has existed since Python 2.7.

But I don't need anyone to "step up and properly maintain" it. I just need it to not stop being an option entirely. If push comes to shove, if someone (including me) can't monkey patch around xlrd's forced obsolescence inside pandas, I can at least keep using my own patched version of xlrd as long as Pandas doesn't work towards dropping xlrd entirely.

fiendish on 29 Nov 2020

👍1
