Pandas: read_excel return empty dataframe when using usecols

Created on 14 Nov 2017 · 32Comments · Source: pandas-dev/pandas

In [3]: data = pd.read_excel("A.xlsx")

In [4]: data
Out[4]:
   A  B
0  1  2
1  3  4

In [5]: data1 = pd.read_excel("A.xlsx",usecols=['B'])

In [6]: data1
Out[6]:
Empty DataFrame
Columns: []
Index: []

In [7]: pd.__version__
Out[7]: '0.21.0'

Problem description

Having a excel file name A.xlsx(or A.xls) with column A,B
read_excel return empty dataframe if usecols used

Bug IO Excel Regression good first issue

Source

SenWang

👍11 😕1

Most helpful comment

Copying my comment from #20480, welcome any feedback on how this api should work

18273 is actually two somewhat separate problems and we have a bit of an API tangle here

passing a list of Excel column _letters_ doesn't work (this fixes that)
passing a list column _names_ (e.g. column title 'foo' in the spreadsheet) to usecols doesn't work anymore

Number 2 wasn't part of the read_excel documented args, but once worked, because this param used to be called parse_cols (#17774), and if you passed something to usecols it would get passed down to the TextParser logic and do the right thing
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L529

Not sure what the solution is, there could be ambiguous cases (column titled'A' in spreadsheet column B), that would be easiest to handle with separate kwargs. Or could just have usecols do everything and warn in cases of ambiguity.

chris-b1 on 26 Mar 2018

👍2

All 32 comments

Guessing this is due to a conflict in the two different kind of specs we accept for usecols - columns labels, or Excel column letters (A, B, C, ...). A workaround is to select the Excel column completely ('B:B') but this should work.

df = pd.DataFrame({'A': [1, 3], 'B': [3, 4]})
df.to_excel('tmp.xlsx', index=False)

pd.read_excel('tmp.xlsx', usecols=['B'])
Out[86]: 
Empty DataFrame
Columns: []
Index: []


pd.read_excel('tmp.xlsx', usecols='B:B')
Out[88]: 
   B
0  3
1  4

chris-b1 on 14 Nov 2017

I'm interested in contributing and thought this looked like a good place to take a first step.

Reviewing pandas/io/excel.py, it looks like the change needs to be made in the _should_parse function of the ExcelFile class. Specifically, here: https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L355-L360

It looks like the current implementation checks for an integer first (i.e. a max number of columns to use), a string second (i.e. assuming a comma separated list of column names in a single string), and assumes a list (technically any container that implements the "in" operator) otherwise. When a list is assumed for usecols, the check for the column index (i) assumes that it is a list of integers.

The simplest way to implement the requested functionality would be to add a new conditional to check whether the first element is a string and, if so, concatenate the list into a single string like case 2 and re-use the _range2cols function to convert to numeric values before returning the comparison:

if isinstance(usecols, list) and isinstance(usecols[0], compat.string_types):
    return i in _range2cols(', '.join(usecols))

Additionally, there would probably need to be another check (if there isn't already?) to handle the behavior for an empty list. Should this be assumed to mean the same thing as None? It doesn't make sense to read a sheet and not return any data.

If we take this a little further, we could add support for mixed lists of integers and strings (if desired) by doing something like:

def _list2cols(area_list):
    parsed_list = list()
    for e in area_list:
        if isinstance(e, int):
            parsed_list.append(e)
        elif isinstance(e, compat.string_types):
            parsed_list.extend(_range2cols(e))
        else:
            pass # Assuming other types should not be considered

    return parsed_list

if isinstance(usecols, list):
    return i in _list2cols(usecols)

Interested in feedback on which direction should be taken.

jhall6226 on 20 Nov 2017

What I think would be easiest here would be to only have _should_parse handle the case when usecols is an Excel column specification (e.g. 'A,B,D:E'), and in all other cases pass through usecols to TextParser here.
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L529

It already has logic to handle column names/locations, and will raise in the mixed case.

from pandas.io.parsers import TextParser
TextParser([['a', 'b', 'c'],
            [1, 2, 3]], usecols=['b']).read()

Out[81]: 
   b
0  2

chris-b1 on 20 Nov 2017

Ok. So in that case, you would remove the preprocessing steps here.
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L478

Then, you would call a function like the current _should_parse to convert the usecols value (which would only have to be done once vice the rows x columns amount of times that it is currently done) into something readable by TextParser and pass that new value where you referenced.

Is that the correct interpretation?

jhall6226 on 20 Nov 2017

Yes, I'm thinking that will most straightforward, rather than
re-implementing what TextParser does. Feel free to put up a
work-in-progress pull request and ping me, it is often easier to answer
questions looking at the actual changes.

On Mon, Nov 20, 2017 at 12:53 PM, Jordan Hall notifications@github.com
wrote:

Ok. So in that case, you would remove the preprocessing steps here.
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff0
0efa392c82/pandas/io/excel.py#L478 http://url

Then, you would call a function like the current _should_parse to convert
the usecols value (which would only have to be done once vice the rows x
columns amount of times that it is currently done) into something readable
by TextParser and pass that new value where you referenced.

Is that the correct interpretation?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-345791675,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB1b_K69yoT3EOyfp3ALntzCjc8BZ9dHks5s4cq3gaJpZM4Qcs_f
.

chris-b1 on 20 Nov 2017

Any progress on this? I couldn't find the mentioned pull request.

JeroenDelcour on 30 Nov 2017

Wow, after 2 years of using this, I stumbled upon this after I updated my pandas.

This is scary since most of my scripts use the usecol parameter heavily. I guess I have to downgrade for now.

Vonatzki on 9 Jan 2018

Yeah, the same issue with 0.21.0 and 0.22.0. No issues with 0.20.3.

All columns' names are cyrillic complex (not A, B, etc.)

kuraga on 14 Jan 2018

I have a set of excel files, all of which have a few same columns I need to read, but other columns are different. The old usecols with specified column names works nicely.

Do I have to downgrade Pandas? Currently I have 0.22.0

Linlinzhao on 15 Jan 2018

Same issue here. usecols is returning an empty dataframe after upgrading.

ldacey on 24 Jan 2018

Same issue here.usecols is returning an empty dataframe with 0.22.0

weifei0228 on 1 Feb 2018

Same issue in empty dataframe returning with 0.22.0

Bravico on 19 Mar 2018

For me
Instead of typing usecols=['B'], try usecols='B'

LISHITING on 23 Mar 2018

👎2 👍1

I'm going to work on this issue to solve it.

If anyone is already working on it, please, tell me.

jacksonjos on 25 Mar 2018

Same issue here. I have a file with 100 columns and I want to select 20 by their colum label, but I cannot do that. I get an empty dataframe. My workaround is to save the excel as csv... and then read it.

Cristianasp on 26 Mar 2018

Copying my comment from #20480, welcome any feedback on how this api should work

18273 is actually two somewhat separate problems and we have a bit of an API tangle here

passing a list of Excel column _letters_ doesn't work (this fixes that)
passing a list column _names_ (e.g. column title 'foo' in the spreadsheet) to usecols doesn't work anymore

chris-b1 on 26 Mar 2018

👍2

Also for people needing a work-around, it works, and is essentially as efficient to do pd.read_excel(...)[usecols]. The entire Excel file has to parsed no matter what because of the row oriented xml layout.

chris-b1 on 26 Mar 2018

👍1

What do people think about a new keyword vs. guessing (well, warning if there's ambiguity)? I suppose the third option is deprecate the current behavior of usecols meaning the Excel ranges in favor of a new keyword, but I suppose we shouldn't break the one thing that works consistently across pandas versions.

It's unfortunate, but I think best to introduce a new keyword here for named columns. Or just recommend subsetting after reading, if we don't want to add a new keyword.

TomAugspurger on 27 Mar 2018

I prefer a new keyword too because is less confusing and error prone than making the same keyword having multiple behaviors.

When a decisions is made I can implement it.

jacksonjos on 27 Mar 2018

Hi,

Has a decision been made on what the new argument will be yet? This is a very useful part of the pandas to no longer work.

mrfaptastic on 15 Apr 2018

Hi,

I was already working on this issue and finally the reviewers have decided
what will be the named arguments, so I'll implement this today.

Best regards,

2018-04-15 10:11 GMT-03:00 mrfaptastic notifications@github.com:

Hi,

Has a decision been made on what the new argument will be yet? This is a
pretty key part of the pandas to be kept broken.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-381405460,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AANuQqMyO8ArxXexm0fEGWA8XL3uzoXEks5to0b8gaJpZM4Qcs_f
.

jacksonjos on 15 Apr 2018

Great news! Thank you!

Em dom, 15 de abr de 2018 10:21, Jackson Souza notifications@github.com
escreveu:

Hi,

I was already working on this issue and finally the reviewers have decided
what will be the named arguments, so I'll implement this today.

Best regards,

2018-04-15 10:11 GMT-03:00 mrfaptastic notifications@github.com:

Hi,

Has a decision been made on what the new argument will be yet? This is a
pretty key part of the pandas to be kept broken.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-381405460>,
or mute the thread
<
https://github.com/notifications/unsubscribe-auth/AANuQqMyO8ArxXexm0fEGWA8XL3uzoXEks5to0b8gaJpZM4Qcs_f

.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-381406141,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AcYZgpsXKscM85210AtjA5wEH_MVsZSaks5to0lzgaJpZM4Qcs_f
.

Cristianasp on 15 Apr 2018

I was having the same issue and had to use pd.read_csv() instead.

Good to know that there is a solution coming up.

franktoffel on 18 Apr 2018

Hi everyone, I'm having trouble to have this fix added to Pandas.

I'm updating the commit for the pull request for almost 2 months as you can see in the conversation at https://github.com/pandas-dev/pandas/pull/20480/ and after every update one of the reviewers point changes that were already in the commit and should be pointed out before. What I mean is that I'm trying hard to have this issue solved but because of the lack of standard to review the changes every time there is something else to change in the commit making contributing to Pandas really awful and demotivating.

Does someone have a suggestion of what should I do about it? May anyone from Pandas core team help me to find out what is the problem with the contribution I'm making?

Also, the behavior of usecols was redefined and it was created another keyword argument called usecols_excel to handle exclusively excel columns indexes as a string. So, the request to handle excel columns as a list of strings was revoked as you can see at the documentation of usecols and usecols_excel at https://github.com/pandas-dev/pandas/pull/20480/files#diff-0153a977179cfc73a2645cf0b7cd1f7f

jacksonjos on 13 May 2018

Hi @jacksonjos - I'm getting back to your change now - thank you for the patience, my apologies that I haven't been very responsive, I haven't been putting much time towards pandas lately.

Our review model is pretty informal and iterative, I get that it can be frustrating to address all the comments and not have a PR approved, but that's largely how it works, so I've found it helpful to have that as an expectation going in.

chris-b1 on 14 May 2018

@chris-b1, may you take a look at issue https://github.com/pandas-dev/pandas/pull/20480, please?

@jreback requested your review there again and without you it can't be closed.
Thank you for your attention.

jacksonjos on 6 Jun 2018

👍1

Thanks for resolving the read_excel return empty dataframe.

Teckloon on 3 Jul 2018

Throwing my two cents in:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Supporting lists of strings is not technically addressed in the documentation, so I'm a little hesitant to call this a bug as of the current version of pandas (0.23.4).

That being said, this issue does bring up a lot of questions re: how to handle usecols for read_excel, in particular, why its handling is so different from usecols for CSV:

Column ranges (e.g. A:C) - that's totally fine with me. That's special to Excel.
"If int then indicates last column to be parsed" - We don't support this for CSV. I don't see why we support this for Excel?
List-like support for Excel is pretty bad. We don't support values like usecols=[0, 1, 2] or usecols=['A', 'B', 'C'], which I think we should.
We don't support callables for usecols, though I don't see why we shouldn't.

What do you guys think?

gfyoung on 29 Oct 2018

👍1

My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).

It would require some deprecations from the current state but I think that logical separation would clarify any ambiguity between range "A:A" in Excel and a column named "A".

WillAyd on 29 Oct 2018

My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).

@WillAyd : Not sure I fully agree with adding a new parameter, but at least we have consensus that usecols should be widely consistent across the board. 👍

@jreback @chris-b1 : Thoughts?

gfyoung on 29 Oct 2018

Problem still exists in pandas==0.23.4.
I just want to read specific columns of excel file with column names, i don' t know what column number display in excel.