Pandas: Unexpected results when filtering with .isin (some fields contain python datastructures)

Created on 30 Apr 2018 · 7 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd

data = [
    {'id': 1, 'content': [{'values': 3}]},
    {'id': 2, 'content': u'whats going on'},
    {'id': 3, 'content': u'whaaaaaaaaat'},
    {'id': 4, 'content': [{'values': 4}]}
]

if __name__ == '__main__':
    df = pd.DataFrame.from_dict(data)
    v = [u'whats going on', u'whaaaaaat']
    print df[df.content.isin(v)]
    v = [u'whats going on', u'what']
    print df[df.content.isin(v)]

Problem description

The first print statement executes successfully and filters down to the single row 'id': 2, 'content': u'whats going on'. The second filter, however, raises an error, even though the only difference is the length of one of the elements in the list v.

Output for the code snippet above:

          content  id
1  whats going on   2
/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/indexes/range.py:473: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  return max(0, -(-(self._stop - self._start) // self._step))
Traceback (most recent call last):
  File "test_pandas.py", line 15, in <module>
    print df[df.content.isin(v)]
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/series.py", line 2804, in isin
    return self._constructor(result, index=self.index).__finalize__(self)
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/series.py", line 264, in __init__
    raise_cast_failure=True)
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/series.py", line 3269, in _sanitize_array
    if len(subarr) != len(index) and len(subarr) == 1:
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/indexes/range.py", line 473, in __len__
    return max(0, -(-(self._stop - self._start) // self._step))
TypeError: unhashable type: 'list'

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 2.9.2
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.14.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug Nested Data isin

Source

atc0m

All 7 comments

I have a different output:

In [7]: df.content.isin(v)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: unhashable type: 'list'

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-7-5a60788e7bc7> in <module>()
----> 1 df.content.isin(v)

~/sandbox/pandas/pandas/core/series.py in isin(self, values)
   3576         Name: animal, dtype: bool
   3577         """
-> 3578         result = algorithms.isin(self, values)
   3579         return self._constructor(result, index=self.index).__finalize__(self)
   3580

~/sandbox/pandas/pandas/core/algorithms.py in isin(comps, values)
    444             comps = comps.astype(object)
    445
--> 446     return f(comps, values)
    447
    448

~/sandbox/pandas/pandas/core/algorithms.py in <lambda>(x, y)
    419
    420     # faster for larger cases to use np.in1d
--> 421     f = lambda x, y: htable.ismember_object(x, values)
    422
    423     # GH16012

~/sandbox/pandas/pandas/_libs/hashtable_func_helper.pxi in pandas._libs.hashtable.ismember_object()
    470
    471     kh_destroy_pymap(table)
--> 472     return result.view(np.bool_)
    473
    474

SystemError: <built-in method view of numpy.ndarray object at 0x1078a93f0> returned a result with an error set

In general, nested data like this aren't well supported at the moment. The upcoming 0.23 release is laying some groundwork to better-support this, but it'll take some time.

TomAugspurger on 30 Apr 2018

👍1
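A possible workaround for the nested-data limitation described above, sketched under the assumption that only string cells should ever match: test membership cell by cell and skip anything that is not a string, so no list or dict is ever hashed. The names wanted, mask and result are illustrative and not part of the original report, and the code targets Python 3.

```python
import pandas as pd

data = [
    {'id': 1, 'content': [{'values': 3}]},
    {'id': 2, 'content': 'whats going on'},
    {'id': 3, 'content': 'whaaaaaaaaat'},
    {'id': 4, 'content': [{'values': 4}]},
]
df = pd.DataFrame(data)

# Values to look for (a set gives fast membership tests on strings).
wanted = {'whats going on', 'what'}

# Check membership only for string cells; lists and dicts count as
# non-matches, so nothing unhashable is ever looked up.
mask = df['content'].map(lambda cell: isinstance(cell, str) and cell in wanted)
mask = mask.astype(bool)

result = df[mask]
print(result)
```

This runs a Python-level function per row, so it is likely slower than .isin on large frames, but it does not depend on how pandas handles unhashable values internally.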

Similar issue: with a single value in sdf.id.values the following error occurs; with two or more values there is no error.

(Pdb) df.isin(sdf.id.values)
* SystemError:

apiszcz on 24 Apr 2019

Still not working in pandas version '0.24.2'. I am getting the same error as @TomAugspurger using Python 3.7.3. It worked perfectly in Python 2.7.15.

Any ideas on how to sort this out?

JavierClearImageAI on 10 Jun 2019

I don't think anyone has investigated deeply. Could you @javi-clear-image-ai?

TomAugspurger on 10 Jun 2019

> I don't think anyone has investigated deeply. Could you @javi-clear-image-ai?

I did (a bit), but without much luck. I ended up moving from pandas to numpy (df.values) and working with the numpy array directly. That worked for me, so it is the workaround I would suggest for the moment.

JavierClearImageAI on 27 Jun 2019
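The NumPy route mentioned above might look roughly like the following. The code actually used was not posted, so this is only a guess at the idea: pull the column out as an object array and compare elements with ==, which needs no hashing. The names values, wanted and filtered are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

data = [
    {'id': 1, 'content': [{'values': 3}]},
    {'id': 2, 'content': 'whats going on'},
    {'id': 3, 'content': 'whaaaaaaaaat'},
    {'id': 4, 'content': [{'values': 4}]},
]
df = pd.DataFrame(data)
wanted = ['whats going on', 'what']

# Object-dtype ndarray holding the raw Python objects of the column.
values = df['content'].to_numpy()

# Compare each cell against each wanted value with ==; only string
# cells are considered, so nested lists never reach a hash table.
mask = np.array(
    [any(isinstance(cell, str) and cell == w for w in wanted) for cell in values],
    dtype=bool,
)

filtered = df[mask]
print(filtered)
```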

Simpler test case:

pd.Series([0, [1, 2]]).isin(['a', 'b'])

(so unrelated to indexing, or DataFrame).

toobaz on 29 Jun 2019

👎2 👍1
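The reduced case fits the tracebacks earlier in the thread: the object path goes through a hash table (htable.ismember_object), and a Python list cannot be hashed. A small sketch for locating such cells before calling .isin follows; the helper is_hashable is written here for illustration and is not taken from the thread.

```python
import pandas as pd


def is_hashable(value):
    """Return True if hash(value) succeeds, False if it raises TypeError."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


s = pd.Series([0, [1, 2]])

# Boolean Series marking the cells that can safely go into a hash table.
hashable = s.map(is_hashable).astype(bool)

# Positions of the offending cells (here, the nested list).
unhashable_positions = s[~hashable].index.tolist()
print(unhashable_positions)
```

Calling hash() directly, rather than checking isinstance(value, collections.abc.Hashable), also catches values such as a tuple that contains a list.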

In my case, the problem was that my column held generic pandas objects rather than str values, i.e. my data was an array and I was comparing it directly against a list.

I just pass both values through str() before comparing:

# Iterate over the words in a column
for textract_value in textract_keywords:
    textract_value = str(textract_value).lower()
    for handwerkskammer in handwerkskammer_name:
        handwerkskammer = str(handwerkskammer).lower()

        if textract_value == handwerkskammer:
            print(f'Contains: {handwerkskammer}')

This solved my problem.
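The same idea (normalise everything to lowercase text, then compare) can also be written without explicit loops. This is a sketch under the assumption that matching on the string form of each cell is acceptable; the names keywords, as_text and matches are hypothetical.

```python
import pandas as pd

data = [
    {'id': 1, 'content': [{'values': 3}]},
    {'id': 2, 'content': 'whats going on'},
    {'id': 3, 'content': 'whaaaaaaaaat'},
    {'id': 4, 'content': [{'values': 4}]},
]
df = pd.DataFrame(data)
keywords = ['Whats Going On', 'what']

# Convert every cell to its string form, so lists and dicts become
# plain (hashable) text, then lowercase for a case-insensitive match.
as_text = df['content'].astype(str).str.lower()

mask = as_text.isin([k.lower() for k in keywords])
matches = df[mask]
print(matches)
```

One caveat: a nested cell is compared by its printed representation, so a keyword that happens to equal that representation would match it.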