Pandas: DataFrame.loc[] returns inconsistent types depending on row count

Created on 2 Oct 2015 · 9 Comments · Source: pandas-dev/pandas

If a DataFrame has a single row for a given index label, .loc[] returns a Series. If it
has two rows for that label, .loc[] returns a DataFrame. I believe that it should return a
DataFrame in either case for consistency.

An image is attached. I wanted to attach a small dataframe and a notebook exhibiting the
problem as well, but I can't attach either, even suffixing them with .txt (github barfs).
So I'm pasting the text fragment after the image.

[Image: pandasinconsistenttypes]

import pandas as pd

txt = '''\
Locus,Decision,Group,Var,Region,Gene,Rows,Mutation,Profile
chr01:0018961727,Homopolymer,VS,CA,exonic,PAX7,1.0,synonymous SNV,000000010010001000000000001101
chr01:0027057772,Bad,IR-PM-VS,CA,exonic,ARID1A,1.0,nonsynonymous SNV,000000000000000000001000100000
chr01:0027057772,Bad,IR-PM-VS,CA,exonic,ARID1A,1.0,nonsynonymous SNV,000000000000000000011001110100
chr01:0027057772,Bad,IR-PM-VS,CA,exonic,ARID1A,1.0,nonsynonymous SNV,100000000001010000010001110110\
'''
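# Split the pasted text into rows of fields (the header line becomes the first record)
records = [line.split(',') for line in txt.split('\n')]
# Use the first field (the locus) as the index; the loci are not unique
tdf = pd.DataFrame.from_records(records, index=(0,))
tdf
# Compare the return types for a label that occurs once and a label that occurs three times
type(tdf.loc['chr01:0018961727']), type(tdf.loc['chr01:0027057772'])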
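
For readers who cannot see the image, here is a minimal sketch of the same behavior. It assumes a recent pandas and uses a cut-down frame built from two columns of the data above (the variable names `frame`, `single` and `repeated` are illustrative, not from the original report).

```python
import pandas as pd

# Cut-down version of the reporter's data: one locus occurs once, the other twice.
frame = pd.DataFrame(
    {
        "Decision": ["Homopolymer", "Bad", "Bad"],
        "Gene": ["PAX7", "ARID1A", "ARID1A"],
    },
    index=["chr01:0018961727", "chr01:0027057772", "chr01:0027057772"],
)

single = frame.loc["chr01:0018961727"]    # label matches exactly one row
repeated = frame.loc["chr01:0027057772"]  # label matches two rows

print(type(single).__name__)
print(type(repeated).__name__)
```

The first lookup comes back as a Series (the row, indexed by column name), while the second comes back as a DataFrame with two rows.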

Indexing Usage Question

All 9 comments

Sorry, the [0:1] in the image might mislead (I was trying to return exactly the first matching line of the group), but note that the call in the text is simply the .loc[] lookup.

A second thing I forgot to mention was tying this to issue #5839, because I think they may be related.

I say that because I first encountered the problem above in the context of .groupby() where a group of one row is a Series, and a group of two or more is a DataFrame.

Oh, and because I couldn't add the notebook, I forgot to mention this is pandas 0.16.2 with Python 2.7.
[Image: pandasinconsistenttypesversions]

I agree this might be a good idea, but it would certainly be a major break in the API. So I think it's unlikely to be feasible for pandas.

Agreed that this would be too disruptive a change.

It certainly warrants a modification to
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc
describing the differential return values.
Right now that page does not even discuss return values.

@jerryatmda

this is solely due to the fact that you have duplicates in the index.

In a unique index, you will _always_ get the same type of data.

so a doc-note is fine, but this is actually a very rare case.

It is much simpler to use a guaranteed syntax, e.g.

df.loc[[....]], which will _always_ return a frame.

If you would like to add a note about using duplicates and selection (and how to use the guaranteed syntax) that would be fine.
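
As a rough illustration of the point about duplicates, the sketch below checks whether an index is unique before relying on the return type of a scalar lookup. The frames and variable names are made up for the example; `Index.is_unique` and `Index.duplicated()` are existing pandas attributes.

```python
import pandas as pd

# Hypothetical frames: one with a unique index, one with a repeated label.
unique_frame = pd.DataFrame({"Gene": ["PAX7", "ARID1A"]}, index=["a", "b"])
dup_frame = pd.DataFrame({"Gene": ["PAX7", "ARID1A", "ARID1A"]}, index=["a", "b", "b"])

print(unique_frame.index.is_unique)
print(dup_frame.index.is_unique)

# Labels that occur more than once in the index.
repeated_labels = list(dup_frame.index[dup_frame.index.duplicated()].unique())
print(repeated_labels)

# With a unique index, every scalar label lookup yields a Series.
all_series = all(
    isinstance(unique_frame.loc[label], pd.Series) for label in unique_frame.index
)
print(all_series)
```

If `is_unique` is False, a scalar lookup may return either type depending on how many rows the label matches.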

OK, that works, but I guess I don't understand "guaranteed syntax."
I just searched it in the docs, and came up with a single reference to the word "syntax."
This is pretty clearly a pandas term of art that has somehow escaped documentation in the manual thus far.
Since I don't know what it means, I am not the person to write that, sorry.

Jeff meant that passing in a list as an indexer will always return a DataFrame. So in your case it's tdf.loc[['chr01:0018961727']] (notice the two sets of square brackets).
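
A small sketch of the list-indexer form, assuming a recent pandas and a one-column stand-in for the reporter's `tdf` (the helper name `rows_for` is hypothetical, introduced only for this example):

```python
import pandas as pd

# One-column stand-in for the frame in the report, with a duplicated locus.
tdf = pd.DataFrame(
    {"Gene": ["PAX7", "ARID1A", "ARID1A"]},
    index=["chr01:0018961727", "chr01:0027057772", "chr01:0027057772"],
)


def rows_for(frame, label):
    """Return all rows for `label` as a DataFrame, even if only one row matches."""
    # Wrapping the label in a list selects "a list of labels", which yields a frame.
    return frame.loc[[label]]


one = rows_for(tdf, "chr01:0018961727")
many = rows_for(tdf, "chr01:0027057772")

print(type(one).__name__, one.shape)
print(type(many).__name__, many.shape)
```

Both results are DataFrames, so downstream code can treat the single-row and multi-row cases the same way.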

Oh, and yes, I agree, it's due to the duplicates in the index, and sure, the wise DBA normalizes his data to 4th normal form -- unless he wants to use it.

Thanks to all for your help. I hope to propose a note for the docs, but I really don't know where to start.

Thanks again,
Jerry
