In [3]: data = pd.read_excel("A.xlsx")
In [4]: data
Out[4]:
A B
0 1 2
1 3 4
In [5]: data1 = pd.read_excel("A.xlsx",usecols=['B'])
In [6]: data1
Out[6]:
Empty DataFrame
Columns: []
Index: []
In [7]: pd.__version__
Out[7]: '0.21.0'
Having a excel file name A.xlsx(or A.xls) with column A,B
read_excel return empty dataframe if usecols used
Guessing this is due to a conflict in the two different kind of specs we accept for usecols - columns labels, or Excel column letters (A, B, C, ...). A workaround is to select the Excel column completely ('B:B'
) but this should work.
df = pd.DataFrame({'A': [1, 3], 'B': [3, 4]})
df.to_excel('tmp.xlsx', index=False)
pd.read_excel('tmp.xlsx', usecols=['B'])
Out[86]:
Empty DataFrame
Columns: []
Index: []
pd.read_excel('tmp.xlsx', usecols='B:B')
Out[88]:
B
0 3
1 4
I'm interested in contributing and thought this looked like a good place to take a first step.
Reviewing pandas/io/excel.py, it looks like the change needs to be made in the _should_parse function of the ExcelFile class. Specifically, here: https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L355-L360
It looks like the current implementation checks for an integer first (i.e. a max number of columns to use), a string second (i.e. assuming a comma separated list of column names in a single string), and assumes a list (technically any container that implements the "in" operator) otherwise. When a list is assumed for usecols, the check for the column index (i) assumes that it is a list of integers.
The simplest way to implement the requested functionality would be to add a new conditional to check whether the first element is a string and, if so, concatenate the list into a single string like case 2 and re-use the _range2cols function to convert to numeric values before returning the comparison:
if isinstance(usecols, list) and isinstance(usecols[0], compat.string_types):
return i in _range2cols(', '.join(usecols))
Additionally, there would probably need to be another check (if there isn't already?) to handle the behavior for an empty list. Should this be assumed to mean the same thing as None? It doesn't make sense to read a sheet and not return any data.
If we take this a little further, we could add support for mixed lists of integers and strings (if desired) by doing something like:
def _list2cols(area_list):
parsed_list = list()
for e in area_list:
if isinstance(e, int):
parsed_list.append(e)
elif isinstance(e, compat.string_types):
parsed_list.extend(_range2cols(e))
else:
pass # Assuming other types should not be considered
return parsed_list
if isinstance(usecols, list):
return i in _list2cols(usecols)
Interested in feedback on which direction should be taken.
What I think would be easiest here would be to only have _should_parse
handle the case when usecols
is an Excel column specification (e.g. 'A,B,D:E'
), and in all other cases pass through usecols
to TextParser
here.
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L529
It already has logic to handle column names/locations, and will raise in the mixed case.
from pandas.io.parsers import TextParser
TextParser([['a', 'b', 'c'],
[1, 2, 3]], usecols=['b']).read()
Out[81]:
b
0 2
Ok. So in that case, you would remove the preprocessing steps here.
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L478
Then, you would call a function like the current _should_parse to convert the usecols value (which would only have to be done once vice the rows x columns amount of times that it is currently done) into something readable by TextParser and pass that new value where you referenced.
Is that the correct interpretation?
Yes, I'm thinking that will most straightforward, rather than
re-implementing what TextParser does. Feel free to put up a
work-in-progress pull request and ping me, it is often easier to answer
questions looking at the actual changes.
On Mon, Nov 20, 2017 at 12:53 PM, Jordan Hall notifications@github.com
wrote:
Ok. So in that case, you would remove the preprocessing steps here.
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff0
0efa392c82/pandas/io/excel.py#L478 http://urlThen, you would call a function like the current _should_parse to convert
the usecols value (which would only have to be done once vice the rows x
columns amount of times that it is currently done) into something readable
by TextParser and pass that new value where you referenced.Is that the correct interpretation?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-345791675,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB1b_K69yoT3EOyfp3ALntzCjc8BZ9dHks5s4cq3gaJpZM4Qcs_f
.
Any progress on this? I couldn't find the mentioned pull request.
Wow, after 2 years of using this, I stumbled upon this after I updated my pandas.
This is scary since most of my scripts use the usecol parameter heavily. I guess I have to downgrade for now.
Yeah, the same issue with 0.21.0 and 0.22.0. No issues with 0.20.3.
All columns' names are cyrillic complex (not A
, B
, etc.)
I have a set of excel files, all of which have a few same columns I need to read, but other columns are different. The old usecols with specified column names works nicely.
Do I have to downgrade Pandas? Currently I have 0.22.0
Same issue here. usecols is returning an empty dataframe after upgrading.
Same issue here.usecols is returning an empty dataframe with 0.22.0
Same issue in empty dataframe returning with 0.22.0
For me
Instead of typing usecols=['B']
, try usecols='B'
I'm going to work on this issue to solve it.
If anyone is already working on it, please, tell me.
Same issue here. I have a file with 100 columns and I want to select 20 by their colum label, but I cannot do that. I get an empty dataframe. My workaround is to save the excel as csv... and then read it.
Copying my comment from #20480, welcome any feedback on how this api should work
'foo'
in the spreadsheet) to usecols
doesn't work anymoreNumber 2 wasn't part of the read_excel
documented args, but once worked, because this param used to be called parse_cols
(#17774), and if you passed something to usecols
it would get passed down to the TextParser logic and do the right thing
https://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L529
Not sure what the solution is, there could be ambiguous cases (column titled'A'
in spreadsheet column B
), that would be easiest to handle with separate kwargs. Or could just have usecols do everything and warn in cases of ambiguity.
Also for people needing a work-around, it works, and is essentially as efficient to do pd.read_excel(...)[usecols].
The entire Excel file has to parsed no matter what because of the row oriented xml layout.
What do people think about a new keyword vs. guessing (well, warning if there's ambiguity)? I suppose the third option is deprecate the current behavior of usecols
meaning the Excel ranges in favor of a new keyword, but I suppose we shouldn't break the one thing that works consistently across pandas versions.
It's unfortunate, but I think best to introduce a new keyword here for named columns. Or just recommend subsetting after reading, if we don't want to add a new keyword.
I prefer a new keyword too because is less confusing and error prone than making the same keyword having multiple behaviors.
When a decisions is made I can implement it.
Hi,
Has a decision been made on what the new argument will be yet? This is a very useful part of the pandas to no longer work.
Hi,
I was already working on this issue and finally the reviewers have decided
what will be the named arguments, so I'll implement this today.
Best regards,
2018-04-15 10:11 GMT-03:00 mrfaptastic notifications@github.com:
Hi,
Has a decision been made on what the new argument will be yet? This is a
pretty key part of the pandas to be kept broken.—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-381405460,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AANuQqMyO8ArxXexm0fEGWA8XL3uzoXEks5to0b8gaJpZM4Qcs_f
.
Great news! Thank you!
Em dom, 15 de abr de 2018 10:21, Jackson Souza notifications@github.com
escreveu:
Hi,
I was already working on this issue and finally the reviewers have decided
what will be the named arguments, so I'll implement this today.Best regards,
2018-04-15 10:11 GMT-03:00 mrfaptastic notifications@github.com:
Hi,
Has a decision been made on what the new argument will be yet? This is a
pretty key part of the pandas to be kept broken.—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-381405460>,
or mute the thread
<
https://github.com/notifications/unsubscribe-auth/AANuQqMyO8ArxXexm0fEGWA8XL3uzoXEks5to0b8gaJpZM4Qcs_f.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/18273#issuecomment-381406141,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AcYZgpsXKscM85210AtjA5wEH_MVsZSaks5to0lzgaJpZM4Qcs_f
.
I was having the same issue and had to use pd.read_csv() instead.
Good to know that there is a solution coming up.
Hi everyone, I'm having trouble to have this fix added to Pandas.
I'm updating the commit for the pull request for almost 2 months as you can see in the conversation at https://github.com/pandas-dev/pandas/pull/20480/ and after every update one of the reviewers point changes that were already in the commit and should be pointed out before. What I mean is that I'm trying hard to have this issue solved but because of the lack of standard to review the changes every time there is something else to change in the commit making contributing to Pandas really awful and demotivating.
Does someone have a suggestion of what should I do about it? May anyone from Pandas core team help me to find out what is the problem with the contribution I'm making?
Also, the behavior of usecols
was redefined and it was created another keyword argument called usecols_excel
to handle exclusively excel columns indexes as a string. So, the request to handle excel columns as a list of strings was revoked as you can see at the documentation of usecols
and usecols_excel
at https://github.com/pandas-dev/pandas/pull/20480/files#diff-0153a977179cfc73a2645cf0b7cd1f7f
Hi @jacksonjos - I'm getting back to your change now - thank you for the patience, my apologies that I haven't been very responsive, I haven't been putting much time towards pandas lately.
Our review model is pretty informal and iterative, I get that it can be frustrating to address all the comments and not have a PR approved, but that's largely how it works, so I've found it helpful to have that as an expectation going in.
@chris-b1, may you take a look at issue https://github.com/pandas-dev/pandas/pull/20480, please?
@jreback requested your review there again and without you it can't be closed.
Thank you for your attention.
Thanks for resolving the read_excel return empty dataframe.
Throwing my two cents in:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Supporting lists of strings is not technically addressed in the documentation, so I'm a little hesitant to call this a bug as of the current version of pandas
(0.23.4
).
That being said, this issue does bring up a lot of questions re: how to handle usecols
for read_excel
, in particular, why its handling is so different from usecols
for CSV:
A:C
) - that's totally fine with me. That's special to Excel.usecols=[0, 1, 2]
or usecols=['A', 'B', 'C']
, which I think we should.usecols
, though I don't see why we shouldn't.What do you guys think?
My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range
).
It would require some deprecations from the current state but I think that logical separation would clarify any ambiguity between range "A:A" in Excel and a column named "A".
My personal belief is that usecols should operate the same way that it does in read_csv and excel-specific behavior should be handled in a separate parameter (something to the effect of use_range).
@WillAyd : Not sure I fully agree with adding a new parameter, but at least we have consensus that usecols
should be widely consistent across the board. 👍
@jreback @chris-b1 : Thoughts?
Problem still exists in pandas==0.23.4.
I just want to read specific columns of excel file with column names, i don' t know what column number display in excel.
this was closed in 0.24.0
Most helpful comment
Copying my comment from #20480, welcome any feedback on how this api should work
18273 is actually two somewhat separate problems and we have a bit of an API tangle here
'foo'
in the spreadsheet) tousecols
doesn't work anymoreNumber 2 wasn't part of the
read_excel
documented args, but once worked, because this param used to be calledparse_cols
(#17774), and if you passed something tousecols
it would get passed down to the TextParser logic and do the right thinghttps://github.com/pandas-dev/pandas/blob/1915ffc53ea60494f24d83844bbff00efa392c82/pandas/io/excel.py#L529
Not sure what the solution is, there could be ambiguous cases (column titled
'A'
in spreadsheet columnB
), that would be easiest to handle with separate kwargs. Or could just have usecols do everything and warn in cases of ambiguity.