Pandas: pd.Series.loc.getitem promotes to float64 instead of raising KeyError

Created on 30 Mar 2019 · 16 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

For:


result = a2b.loc[vals]       # pd.Series()[ np.array ]

If a2b is a Series that maps {int64:int64} and vals is an int64 array, the result should be a Series that maps {int64:int64}, or a KeyError should be raised.

Pasteable repro:

       import pandas as pd
       import numpy as np
       a2b = pd.Series(
           index = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
                     9725301000001109,  9725601000001103,  9725801000001104,
                     9730701000001104, 10049011000001109, 10328511000001105]),
           data = np.array([999000011000001104, 999000011000001104, 999000011000001104,
                        999000011000001104, 999000011000001104, 999000011000001104,
                        999000011000001104, 999000011000001104, 999000011000001104])
       )
       assert a2b.dtype==np.int64
       assert a2b.index.dtype==np.int64
       key = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
                             9725301000001109,  9725601000001103,  9725801000001104,
                             9730701000001104,
                             10047311000001102, # Missing in a2b.index
                             10049011000001109,
                             10328511000001105])
       result = a2b.loc[key]
       result
       assert result.dtype==np.int64
       assert result.index.dtype==np.int64

What happens:

In [2]:         import pandas as pd
   ...:         import numpy as np
   ...:         a2b = pd.Series(
   ...:             index = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
   ...:                       9725301000001109,  9725601000001103,  9725801000001104,
   ...:                       9730701000001104, 10049011000001109, 10328511000001105]),
   ...:             data = np.array([999000011000001104, 999000011000001104, 999000011000001104,
   ...:                          999000011000001104, 999000011000001104, 999000011000001104,
   ...:                          999000011000001104, 999000011000001104, 999000011000001104])
   ...:         )
   ...:         assert a2b.dtype==np.int64
   ...:         assert a2b.index.dtype==np.int64
   ...:         key = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
   ...:                               9725301000001109,  9725601000001103,  9725801000001104,
   ...:                               9730701000001104,
   ...:                               10047311000001102, # Missing in a2b.index
   ...:                               10049011000001109,
   ...:                               10328511000001105])
   ...:         result = a2b.loc[key]
   ...:         result
   ...: 
Out[2]: 
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
NaN             NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
dtype: float64

In [3]:         assert result.dtype==np.int64
   ...:         assert result.index.dtype==np.int64
   ...: 
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-be86ec17a393> in <module>()
----> 1 assert result.dtype==np.int64
      2 assert result.index.dtype==np.int64

AssertionError:

Problem description

I don't like this behavior because:

1) I have silently lost all my data due to the cast to float64.
2) In other calls to __getitem__, a KeyError is raised if a value is not found in the index.

Expected Output

Asserts should not fail.

Output of `pd.show_versions()`

In [4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.15.candidate.1
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 18.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.5.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug Indexing

Source

stuz5000

All 16 comments

Same in: 0.23.4, 0.24.2

stuz5000 on 30 Mar 2019

use .iloc as that is what is designed for selecting by position as the docs indicate: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position

getitem is falling back here, as described at http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#miscellaneous-indexing-faq

this is as expected behavior and is not likely to change

jreback on 30 Mar 2019


My series has integer labels, not billions of rows.
.loc would be correct (not .iloc).

It fails for both .loc and .__getitem__.

This is a bug. Please reopen.

stuz5000 on 30 Mar 2019

I have updated the title and sample to make it clear.

.loc is the correct indexer (integer label, and integer index).

Please reopen.

stuz5000 on 30 Mar 2019

this is on master

In [7]: pd.__version__
Out[7]: '0.25.0.dev0+337.g1d4c89f4d'

In [6]: a2b.loc[key]
/Users/jreback/miniconda3/envs/pandas/bin/ipython:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  #!/Users/jreback/miniconda3/envs/pandas/bin/python
Out[6]: 
9724501000001103     9.990000e+17
9724701000001109     9.990000e+17
9725101000001107     9.990000e+17
9725301000001109     9.990000e+17
9725601000001103     9.990000e+17
9725801000001104     9.990000e+17
9730701000001104     9.990000e+17
10047311000001102             NaN
10049011000001109    9.990000e+17
10328511000001105    9.990000e+17
dtype: float64

In [14]: a2b.index.isin(key)
Out[14]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

prob a bug somewhere, would need investigation by the community
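
For anyone retracing this check: `a2b.index.isin(key)` asks whether each index label appears in key, which is all True here even though one label in key is absent from the index. A minimal sketch of the reverse check, using a shortened version of the sample data (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

# Shortened version of the sample from the issue (assumption: 3 rows suffice).
a2b = pd.Series(
    data=np.array([999000011000001104] * 3, dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109, 10049011000001109],
                   dtype=np.int64),
)
key = np.array([9724501000001103, 9724701000001109,
                10047311000001102,  # missing in a2b.index
                10049011000001109], dtype=np.int64)

# Is every index label present in key? All True, as in Out[14].
index_in_key = a2b.index.isin(key)

# Is every key label present in the index? This exposes the missing label.
key_in_index = np.isin(key, a2b.index)
missing = key[~key_in_index]
print(missing.tolist())  # → [10047311000001102]
```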

jreback on 30 Mar 2019

This doesn't really look like a bug. Once a KeyError is thrown, you'll never get this far anyway. But suppose you do, and the expected behavior is a NaN for the extra index label. The Series is dtype int64 and NaN is not compatible with that dtype, so the values get cast to float64. If you want to avoid the cast, use the nullable Int64 dtype instead.
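
A minimal sketch of that difference, assuming a pandas version that ships the nullable Int64 dtype (0.24 or later). It uses .reindex rather than .loc so that it runs the same way whether or not .loc raises on missing labels; the data is a shortened version of the sample above.

```python
import numpy as np
import pandas as pd

original = 999000011000001104  # too large to be represented exactly as float64

a2b = pd.Series(
    data=np.array([original, original], dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)
key = np.array([9724501000001103,
                10047311000001102,  # missing in a2b.index
                9724701000001109], dtype=np.int64)

# Plain int64: the missing label needs NaN, so the values become float64
# and the large integers no longer round-trip exactly.
as_float = a2b.reindex(key)
roundtrip = int(as_float.iloc[0])
print(as_float.dtype, roundtrip == original)

# Nullable Int64: the missing label becomes a missing value, and the
# other values stay exact integers.
nullable = a2b.astype("Int64").reindex(key)
print(nullable.dtype, int(nullable.iloc[0]) == original)
```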

cbertinato on 3 Apr 2019

no, this is an issue I think; see how the value is in the index; but there is a disconnect somewhere after the get_indexer call (way before the erroneous KeyError)

jreback on 3 Apr 2019

Was there an erroneous KeyError? The key 10047311000001102 isn't in a2b, so the warning given for a2b.loc[key] seems appropriate.

cbertinato on 3 Apr 2019

There was no KeyError.
For other examples of missing input a KeyError is thrown.


stuz5000 on 3 Apr 2019

I see the KeyError with __getitem__, but the message given when trying a2b.loc[key] indicates that, while no error is thrown now, it will be in the future. It seems to me that, while the current behavior is not ideal, it is expected.

But I'm still trying to sleuth out whether there's another issue here. @jreback do you mean that because one of the labels in key is not in a2b, then that label shouldn't show up in the index of a2b.loc[key]?

cbertinato on 3 Apr 2019


@cbertinato My apologies - I missed that a warning was being printed, and I had to do some digging to find the case where KeyError was thrown:

# Scalar key
for k in key:
    print(a2b.loc[k])   # KeyError: 10047311000001102

# List-like key
for k in key:
    print(a2b.loc[[k]])  # KeyError: u'None of [[10047311000001102]] are in the [index]'

k = 10047311000001102  # Not in a2b.index
print(a2b.loc[np.array([k])])  # KeyError: u'None of [[10047311000001102]] are in the [index]'
print(a2b.loc[np.array([k, k])])  # KeyError: u'None of [[10047311000001102]] are in the [index]'

print(a2b.loc[np.array([k, k, 9724501000001103])])  # result.values is promoted to float
print(a2b.loc[np.array([9724501000001103, k, k])])  # result.values is promoted to float
print(a2b.loc[key])  # result.values is promoted to float

So it seems the behavior is: a KeyError is thrown only if ALL of the query labels are missing from the series's index. Otherwise, the values are promoted to float. Is this expected? Not sure. But it sure is strange behavior:

1) It leaves the caller unsure whether to expect: a KeyError, or result.values to be a float column.

2) Automatic promotion to float is a special kind of footgun when the promoted ints are category labels. Such labels are rarely compatible with floats (particularly for ints beyond the int32 range), and the conversion should strictly be treated as data corruption. In my example they are medical codes: suddenly I found fewer people with heart disease than expected, while the disease category codes in the int32 range worked fine.

Automatic value-type promotion in select-like operations is also inconsistent with a Series being a 'data column' that is 'kinda' like an SQL column + index. SQL doesn't do it (precisely because it's not safe and leads to data corruption).
The result could return the non-NULL rows (as SQL), though this may silently break things for people.
KeyError seems preferable since there's no good way to know what to place in the result's values for missing labels (.reindex allows a default parameter, which is why it's preferred).
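
The two alternatives mentioned above can be sketched as follows. This is only an illustration on shortened sample data: the sentinel -1 is an arbitrary assumption, and the SQL-like variant returns rows in the order of a2b's index rather than in the order of key.

```python
import numpy as np
import pandas as pd

a2b = pd.Series(
    data=np.array([999000011000001104] * 2, dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)
key = np.array([9724501000001103,
                10047311000001102,  # missing in a2b.index
                9724701000001109], dtype=np.int64)

# .reindex with an explicit integer fill value: no NaN is needed,
# so the values stay int64. (-1 is an arbitrary sentinel chosen here.)
filled = a2b.reindex(key, fill_value=-1)
print(filled.dtype)  # → int64

# SQL-like selection: keep only the rows whose label appears in key.
present_only = a2b[a2b.index.isin(key)]
print(present_only.dtype, len(present_only))  # → int64 2
```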

Is there any way to force a KeyError on missing labels for .loc? (since warnings easily slip through CI testing).
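
One possible workaround, sketched here as a hypothetical helper (strict_loc is not a pandas function): check the labels up front and raise the KeyError explicitly, instead of relying on .loc.

```python
import numpy as np
import pandas as pd


def strict_loc(series, labels):
    """Hypothetical helper: like series.loc[labels], but always raises
    KeyError if any requested label is missing from the index."""
    labels = np.asarray(labels)
    missing = labels[~np.isin(labels, series.index)]
    if missing.size:
        raise KeyError("labels not in index: {}".format(missing.tolist()))
    return series.loc[labels]


a2b = pd.Series(
    data=np.array([999000011000001104] * 2, dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)
key = np.array([9724501000001103,
                10047311000001102,  # missing in a2b.index
                9724701000001109], dtype=np.int64)

try:
    strict_loc(a2b, key)
except KeyError as exc:
    print("KeyError:", exc)

# With only valid labels the lookup goes through and the dtype is preserved.
ok = strict_loc(a2b, a2b.index.values)
print(ok.dtype)  # → int64
```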

stuz5000 on 3 Apr 2019

Is there any way to force a KeyError on missing labels for .loc? (since warnings easily slip through CI testing).

For the cases where the result.values is promoted to float, then I think the answer is "yes", eventually a KeyError will be thrown. It's just not implemented, but as the warning states, it will be in the future.

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated

Not sure when exactly this will turn from a warning into throwing a KeyError.
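
In the meantime, a sketch of one way to make the warning fatal, using Python's standard warnings filter. loc_or_raise is a hypothetical name; on pandas versions where .loc already raises, the KeyError simply propagates instead.

```python
import warnings

import numpy as np
import pandas as pd


def loc_or_raise(series, labels):
    """Hypothetical helper: run series.loc[labels] with FutureWarning
    escalated to an exception, so missing labels cannot pass silently."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        return series.loc[labels]


a2b = pd.Series(
    data=np.array([999000011000001104] * 2, dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)
key = np.array([9724501000001103,
                10047311000001102,  # missing in a2b.index
                9724701000001109], dtype=np.int64)

try:
    loc_or_raise(a2b, key)
    failed = False
except (KeyError, FutureWarning) as exc:
    # FutureWarning on versions that only warn, KeyError on versions that raise.
    failed = True
    print(type(exc).__name__)
```

For a whole test suite, the same effect should be available through the interpreter or pytest option `-W error::FutureWarning`, at the cost of also failing on unrelated FutureWarnings.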

cbertinato on 3 Apr 2019

Related? (seems similar but different https://github.com/pandas-dev/pandas/issues/22252 )

stuz5000 on 3 Apr 2019

It is similar. In both cases, the promotion to float is expected, but the message in 1 of that issue does seem misplaced.

cbertinato on 5 Apr 2019

A KeyError is thrown now. I think we can close this issue?