For:
result = a2b.loc[vals] # pd.Series()[ np.array ]
If a2b is a series that maps {int64:int64} and vals is an int64 array, the result should be a series that maps {int64:int64}, or a KeyError should be thrown
Pasteable repro:
import pandas as pd
import numpy as np
a2b = pd.Series(
index = np.array([ 9724501000001103, 9724701000001109, 9725101000001107,
9725301000001109, 9725601000001103, 9725801000001104,
9730701000001104, 10049011000001109, 10328511000001105]),
data = np.array([999000011000001104, 999000011000001104, 999000011000001104,
999000011000001104, 999000011000001104, 999000011000001104,
999000011000001104, 999000011000001104, 999000011000001104])
)
assert a2b.dtype==np.int64
assert a2b.index.dtype==np.int64
key = np.array([ 9724501000001103, 9724701000001109, 9725101000001107,
9725301000001109, 9725601000001103, 9725801000001104,
9730701000001104,
10047311000001102, # Missing in a2b.index
10049011000001109,
10328511000001105])
result = a2b.loc[key]
result
assert result.dtype==np.int64
assert result.index.dtype==np.int64
What happens:
In [2]: import pandas as pd
...: import numpy as np
...: a2b = pd.Series(
...: index = np.array([ 9724501000001103, 9724701000001109, 9725101000001107,
...: 9725301000001109, 9725601000001103, 9725801000001104,
...: 9730701000001104, 10049011000001109, 10328511000001105]),
...: data = np.array([999000011000001104, 999000011000001104, 999000011000001104,
...: 999000011000001104, 999000011000001104, 999000011000001104,
...: 999000011000001104, 999000011000001104, 999000011000001104])
...: )
...: assert a2b.dtype==np.int64
...: assert a2b.index.dtype==np.int64
...: key = np.array([ 9724501000001103, 9724701000001109, 9725101000001107,
...: 9725301000001109, 9725601000001103, 9725801000001104,
...: 9730701000001104,
...: 10047311000001102, # Missing in a2b.index
...: 10049011000001109,
...: 10328511000001105])
...: result = a2b.loc[key]
...: result
...:
Out[2]:
9.990000e+17 NaN
9.990000e+17 NaN
9.990000e+17 NaN
9.990000e+17 NaN
9.990000e+17 NaN
9.990000e+17 NaN
9.990000e+17 NaN
NaN NaN
9.990000e+17 NaN
9.990000e+17 NaN
dtype: float64
In [3]: assert result.dtype==np.int64
...: assert result.index.dtype==np.int64
...:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-3-be86ec17a393> in <module>()
----> 1 assert result.dtype==np.int64
2 assert result.index.dtype==np.int64
AssertionError:
I don't like this behavior because:
Asserts should not fail.
Output of pd.show_versions():
In [4]: pd.show_versions()
commit: None
python: 2.7.15.candidate.1
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: None
pip: 18.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.5.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Same in: 0.23.4, 0.24.2
use .iloc as that is what is designed for selecting by position as the docs indicate: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position
__getitem__ is falling back here, as described at http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#miscellaneous-indexing-faq
this is expected behavior and is not likely to change
My series has integer labels, not billions of rows.
.loc would be correct (not .iloc).
It fails for both .loc and .__getitem__.
This is a bug. Please reopen.
I have updated the title and sample to make it clear.
.loc is the correct indexer (integer label, and integer index).
Please reopen.
this is on master
In [7]: pd.__version__
Out[7]: '0.25.0.dev0+337.g1d4c89f4d'
In [6]: a2b.loc[key]
/Users/jreback/miniconda3/envs/pandas/bin/ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
#!/Users/jreback/miniconda3/envs/pandas/bin/python
Out[6]:
9724501000001103 9.990000e+17
9724701000001109 9.990000e+17
9725101000001107 9.990000e+17
9725301000001109 9.990000e+17
9725601000001103 9.990000e+17
9725801000001104 9.990000e+17
9730701000001104 9.990000e+17
10047311000001102 NaN
10049011000001109 9.990000e+17
10328511000001105 9.990000e+17
dtype: float64
In [14]: a2b.index.isin(key)
Out[14]: array([ True, True, True, True, True, True, True, True, True])
prob a bug somewhere, would need investigation by the community
This doesn't really look like a bug. Once a KeyError is thrown, you'll never get this far anyway. But suppose you do, and the expected behavior is a NaN for the extra index. The Series is dtype int64 and NaN is not compatible with that dtype, so the values get cast to float64. If you want to avoid the cast, use the nullable Int64 dtype instead.
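A minimal sketch of the Int64 suggestion, assuming pandas 0.24 or later (where the nullable Int64 extension dtype is available). The two-element Series is a cut-down stand-in for a2b, and .reindex is used rather than .loc so that the missing label does not raise on newer versions:

```python
import pandas as pd

# Cut-down stand-in for a2b, stored with the nullable Int64 dtype.
s = pd.Series(
    [999000011000001104, 999000011000001104],
    index=[9724501000001103, 9724701000001109],
    dtype="Int64",
)

# 10047311000001102 is not in the index; its slot becomes a missing value
# instead of forcing the whole column to float64.
out = s.reindex([9724501000001103, 10047311000001102])

print(out.dtype)            # Int64
print(out.isna().tolist())  # [False, True]
```

The present label keeps its exact integer value because no float conversion takes place.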
no, this is an issue I think; see how the value is in the index, but there is a disconnect somewhere after the get_indexer call (way before the erroneous KeyError)
Was there an erroneous KeyError? The key 10047311000001102 isn't in a2b, so the warning given for a2b.loc[key] seems appropriate.
There was no KeyError.
For other examples of missing input a KeyError is thrown.
I see the KeyError with __getitem__, but the message given when trying a2b.loc[key] indicates that, while no error is thrown now, it will be in the future. It seems to me that, while the current behavior is not ideal, it is expected.
But I'm still trying to sleuth out whether there's another issue here. @jreback do you mean that because one of the labels in key is not in a2b, then that label shouldn't show up in the index of a2b.loc[key]?
@cbertinato My apologies - I missed that a warning was being printed, and I had to do some digging to find the case where KeyError was thrown:
# Scalar key
for k in key:
    print(a2b.loc[k])  # KeyError: 10047311000001102
# List-like key
for k in key:
    print(a2b.loc[[k]])  # KeyError: u'None of [[10047311000001102]] are in the [index]'
k = 10047311000001102  # Not in a2b.index
print(a2b.loc[np.array([k])])  # KeyError: u'None of [[10047311000001102]] are in the [index]'
print(a2b.loc[np.array([k, k])])  # KeyError: u'None of [[10047311000001102]] are in the [index]'
print(a2b.loc[np.array([k, k, 9724501000001103])])  # result.values is promoted to float
print(a2b.loc[np.array([9724501000001103, k, k])])  # result.values is promoted to float
print(a2b.loc[key])  # result.values is promoted to float
So it seems the behavior is: a KeyError is thrown only if NONE of the query labels are in the series's index. Otherwise, the values are promoted to float. Is this expected? Not sure. But it sure is strange behavior:
1) It leaves the caller unsure whether to expect: a KeyError, or result.values to be a float column.
2) Automatic promotion to float is a special kind of footgun when the promoted ints were category labels. They are rarely compatible with floats (particularly for ints beyond the int32 range), and the conversion should strictly be treated as data corruption. In my example they are medical codes -- suddenly I found fewer people with heart disease than expected -- while the disease category codes in the int32 range worked fine.
Automatic value type promotion in select-like operations is also inconsistent with a Series being a 'data column' that is 'kinda' like an SQL column + index. SQL doesn't do it (precisely because it's not safe and leads to data corruption).
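A small sketch of why the promotion is lossy for codes of this size. It uses only NumPy and one value from the sample; the variable names are illustrative. float64 has a 53-bit significand, so integers above 2**53 are not all representable:

```python
import numpy as np

code = np.int64(999000011000001104)   # a value from a2b
small = np.int64(123456)              # a code well inside the int32 range

# Round-trip each code through float64, as the implicit promotion does.
code_roundtrip = np.int64(np.float64(code))
small_roundtrip = np.int64(np.float64(small))

print(code_roundtrip == code)    # False: the large code was silently altered
print(small_roundtrip == small)  # True: small codes survive the promotion
```

This matches the symptom described above: lookups on small codes keep working, while large codes stop matching after the cast.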
The result could return the non-NULL rows (as SQL), though this may silently break things for people.
KeyError seems preferable since there's no good way to know what to place in the result's values for missing labels (.reindex allows a default via its fill_value parameter, which is why it's preferred).
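For comparison, a minimal sketch of the .reindex route with an explicit default. The sentinel -1 and the three-row Series are assumptions for illustration only; any integer that cannot be a real code would do:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.array([999000011000001104] * 3, dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109, 10049011000001109],
                   dtype=np.int64),
)
labels = np.array([9724501000001103, 10047311000001102], dtype=np.int64)

# An integer fill_value means no NaN is introduced, so the dtype stays int64.
filled = s.reindex(labels, fill_value=-1)

print(filled.dtype)     # int64
print(filled.tolist())  # [999000011000001104, -1]
```

The caller still has to check for the sentinel afterwards, which is the trade-off against raising a KeyError.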
Is there any way to force a KeyError on missing labels for .loc? (since warnings easily slip through CI testing).
> Is there any way to force a KeyError on missing labels for .loc? (since warnings easily slip through CI testing).
For the cases where result.values is promoted to float, I think the answer is "yes": eventually a KeyError will be thrown. It's just not implemented yet but, as the warning states, it will be in the future.
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated
Not sure when exactly this will turn from a warning into throwing a KeyError.
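Until that change lands, one way to get the strict behavior is to check membership before indexing. This is only a sketch: strict_loc is a hypothetical helper name, not part of pandas, and it assumes the labels can be converted to a NumPy array:

```python
import numpy as np
import pandas as pd

def strict_loc(series, labels):
    """Hypothetical helper: .loc lookup that raises KeyError if ANY label is missing."""
    labels = np.asarray(labels)
    missing = labels[~np.isin(labels, series.index.values)]
    if missing.size:
        raise KeyError("labels not in index: {}".format(missing.tolist()))
    # Every label is present, so no NaN is introduced and the dtype is kept.
    return series.loc[labels]

s = pd.Series(
    np.array([999000011000001104, 999000011000001104], dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)

ok = strict_loc(s, [9724701000001109, 9724501000001103])

try:
    strict_loc(s, [9724501000001103, 10047311000001102])
    raised = False
except KeyError:
    raised = True
```

On versions that emit the FutureWarning, another option is to escalate it with warnings.simplefilter("error", FutureWarning), at the cost of also escalating unrelated FutureWarnings.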
Related? (seems similar but different https://github.com/pandas-dev/pandas/issues/22252 )
It is similar. In both cases, the promotion to float is expected, but the message in 1 of that issue does seem misplaced.
A KeyError is thrown now. I think we can close this issue?
can u see if we have a test for this; ok to add this one in a similar place
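A sketch of what such a regression test could look like, assuming a pandas version in which list-likes with missing labels already raise (1.0 or later). The test name is made up here, and it uses plain try/except instead of any particular test framework:

```python
import numpy as np
import pandas as pd

def test_loc_listlike_missing_label_raises():
    # Hypothetical test name; the data is cut down from the original report.
    ser = pd.Series(
        np.array([999000011000001104] * 3, dtype=np.int64),
        index=np.array([9724501000001103, 9724701000001109, 10049011000001109],
                       dtype=np.int64),
    )
    present = np.array([9724501000001103, 10049011000001109], dtype=np.int64)
    partly_missing = np.array([9724501000001103, 10047311000001102], dtype=np.int64)

    # All labels present: both values and index stay int64.
    result = ser.loc[present]
    assert result.dtype == np.int64
    assert result.index.dtype == np.int64

    # One label missing: expect KeyError rather than promotion to float.
    try:
        ser.loc[partly_missing]
    except KeyError:
        raised = True
    else:
        raised = False
    assert raised

test_loc_listlike_missing_label_raises()
```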