Pandas: HDF5.remove with where removing all rows

Created on 18 Sep 2017  Â·  4Comments  Â·  Source: pandas-dev/pandas

when I use 'in' condition in HDFStore.remove, it removes all data in the table instead of selected rows

In[0]:
store = pd.HDFStore("d:/org.h5")
a = store.get(table["table_name"])
a.index
​
Out[0]:
Index(['{7D23701A-991B-11E7-977E-448A5B7647CF}',
       '{7D23701E-991B-11E7-977E-448A5B7647CF}',
       '{7D237020-991B-11E7-977E-448A5B7647CF}',
       ...
       '{9E1257B7-99E4-11E7-B36E-448A5B7647CF}',
       '{9E1257B9-99E4-11E7-B36E-448A5B7647CF}',
       '{9E1257CB-99E4-11E7-B36E-448A5B7647CF}'],
      dtype='object', name='object_id', length=6710)

In [1]:
a[a['trade_dt'] == '2017-09-14'].index

Out[1]:
Index(['{7D23701A-991B-11E7-977E-448A5B7647CF}',
       '{7D23701E-991B-11E7-977E-448A5B7647CF}',
       '{7D237020-991B-11E7-977E-448A5B7647CF}',
       ...
       '{7D236E8C-991B-11E7-977E-448A5B7647CF}',
       '{7D236E9A-991B-11E7-977E-448A5B7647CF}',
       '{7D236EA0-991B-11E7-977E-448A5B7647CF}'],
      dtype='object', name='object_id', length=3353)

In [2]:
store.select(table["table_name"], where="index in %s" % list(a[a['trade_dt'] == '2017-09-14'].index))

Out[2]:
(3353, 22)

In [3]:
store.remove(table["table_name"], where="index in %s" % list(a[a['trade_dt'] == '2017-09-14'].index))

Out[3]:
6710

Problem description

The “where" condition in HDFStore.select returns selected 3353 rows. I supposed using same condition with "where" in HDFStore.remove should do the same which is remove this 3353 rows. But it remove 6710 rows which is all lines in this table.

I tested with only one single row in the "index in [list]" where condition. It removes one row which is I expected.


In [4]:
[a[a['trade_dt'] == '2017-09-14'].index[0]]

Out[4]:
['{7D23701A-991B-11E7-977E-448A5B7647CF}']

In [5]:
store.select(table["table_name"], where="index in %s" % [a[a['trade_dt'] == '2017-09-14'].index[0]]).shape

Out[5]:
(1, 22)

In [6]:
store.remove(table["table_name"], where="index in %s" % [a[a['trade_dt'] == '2017-09-14'].index[0]])

Out[6]:
1

I tested with equivalent condition ops such as "=" returns the same

Output of pd.show_versions()

'0.20.3'

Bug IO HDF5

Most helpful comment

Tangentially related to #15937. The remove method uses numexpr so selecting as shown above would work as long as <= 31 elements. A warning pointing to http://pandas.pydata.org/pandas-docs/stable/io.html#selecting-using-a-where-mask seems reasonable.

Swapping the commented lines here will return an empty dataframe.
```python
In [2]:
df = pd.DataFrame({'A': np.arange(1000),
'B': pd.date_range('20130101', periods=1000)}
).set_index('B')

select = list(df.reset_index().B.sample(32))
store = pd.HDFStore('temp.h5')
store.put('df', df, format='table')
where = df.index.isin(select)

store.select('df', where=where)

store.remove('df', where='index in %s' % select)

store.remove('df', where=where)

Out[2]:
32

In [3]:
len(store.select('df'))

Out[3]:
968

All 4 comments

So you would have to show a reproducible example; the following works.

In [1]: df = pd.DataFrame({'A': np.arange(10), 'B': pd.date_range('20130101',periods=10)}).set_index('B')
   ...: store = pd.HDFStore('test.h5')
   ...: store.append('df', df)
   ...: assert len(store.select('df')) == 10
   ...: store.remove('df', where='index in ["20130102", "20130103"]')
   ...: assert len(store.select('df')) == 8
   ...: 

In [2]: store.select('df')
Out[2]: 
            A
B            
2013-01-01  0
2013-01-04  3
2013-01-05  4
2013-01-06  5
2013-01-07  6
2013-01-08  7
2013-01-09  8
2013-01-10  9

@jreback

I did a slight change to your test code then it reproduces now.

In [1]:
df = pd.DataFrame({'A': np.arange(1000), 'B': pd.date_range('20130101',periods=1000)}).set_index('B')
select = list(df.reset_index().B.sample(100))
store = pd.HDFStore('test.h5')
store.put('df', df, format='table')
assert len(store.select('df')) == 1000
print('select len %d' % len(store.select('df', where='index in %s' % select)))
store.remove('df', where='index in %s' % select)
print('after remove %d' % len(store.select('df')))
assert len(store.select('df')) == 900

select len 100
after remove 0
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-136-cf9062a4d575> in <module>()
      7 store.remove('df', where='index in %s' % select)
      8 print('after remove %d' % len(store.select('df')))
----> 9 assert len(store.select('df')) == 900


AssertionError: 

In [2]:
store.select('df')

Out[2]:
A
B   

ok, would appreciate if you can debug. there are not very many test cases for this, so certainly can add this one and fix.

Tangentially related to #15937. The remove method uses numexpr so selecting as shown above would work as long as <= 31 elements. A warning pointing to http://pandas.pydata.org/pandas-docs/stable/io.html#selecting-using-a-where-mask seems reasonable.

Swapping the commented lines here will return an empty dataframe.
```python
In [2]:
df = pd.DataFrame({'A': np.arange(1000),
'B': pd.date_range('20130101', periods=1000)}
).set_index('B')

select = list(df.reset_index().B.sample(32))
store = pd.HDFStore('temp.h5')
store.put('df', df, format='table')
where = df.index.isin(select)

store.select('df', where=where)

store.remove('df', where='index in %s' % select)

store.remove('df', where=where)

Out[2]:
32

In [3]:
len(store.select('df'))

Out[3]:
968

Was this page helpful?
0 / 5 - 0 ratings