Pandas: HDFStore.select_column gives KeyError

Created on 24 May 2018  Â·  5Comments  Â·  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

hdf = pd.HDFStore('db/trips.h5')
print(hdf.info()) # works
print(hdf.select('df',columns=['trip_id'])) # works
print(hdf.select_column('df','index') ) # works
print(hdf.select_column('df','trip_id') ) # throws KeyError

Output

print(hdf.info()):

<class 'pandas.io.pytables.HDFStore'>
File path: db/trips.h5
/df            frame_table  (typ->appendable,nrows->216357,ncols->10,indexers->[index])

print(hdf.select('df',columns=['trip_id'])):

0       88921931
1       88921934
2       88921937
...
[216357 rows x 1 columns]

print(hdf.select_column('df','index') ):

0              0
1              1
2              2
...
Name: index, Length: 216357, dtype: int64

print(hdf.select_column('df','trip_id') ):

KeyError                                  Traceback (most recent call last)
<ipython-input-42-67a3ca7d6da9> in <module>()
     20     '''
     21     return returnList
---> 22 readColumnDB('trips','trip_id')

<ipython-input-42-67a3ca7d6da9> in readColumnDB(tablename, column)
     11 
     12     print(hdf.select_column('df','index') )
---> 13     print(hdf.select_column('df','trip_id') )
     14     returnList = []
     15     '''

~/.local/lib/python3.5/site-packages/pandas/io/pytables.py in select_column(self, key, column, **kwargs)
    775 
    776         """
--> 777         return self.get_storer(key).read_column(column=column, **kwargs)
    778 
    779     def select_as_multiple(self, keys, where=None, selector=None, columns=None,

~/.local/lib/python3.5/site-packages/pandas/io/pytables.py in read_column(self, column, where, start, stop, **kwargs)
   3779                                       a.tz, True), name=column)
   3780 
-> 3781         raise KeyError("column [%s] not found in the table" % column)
   3782 
   3783 

KeyError: 'column [trip_id] not found in the table'

After this, tried :

hdf.create_table_index('df')
print(hdf.select_column('df','trip_id') )

Still got same KeyError.

Problem description

select_column() function is not working for any column other than 'index'. Error message says that column was not found. But hdf.select('df',columns=['trip_id']) works fine.

Inspecting the code,
https://github.com/pandas-dev/pandas/blob/bc37ea2e05019a89adaa48159b220483598d1898/pandas/io/pytables.py#L759
I don't get what they mean by:

return a single column from the table. This is generally only useful to select an indexable

select_column is also advised at other places, as an alternative for hdf.select('df',columns=['trip_id']).

... and there is no mention about having to index the column to make it work. And I don't want to necessarily index the column that I want to extract: what if my table already has an index? What if I just want all the values from a values column?

Expected Output

print(hdf.select_column('df','trip_id') ) should print the contents of trip_id column as a one-column dataframe, as shown in this documentation example:

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 32
OS: Linux
OS-release: 4.13.0-43-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 20.7.0
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 3.5.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Most helpful comment

@jreback I got it! Dug into the code and came upon this data indexable thing again..

def read_column(self, column, where=None, start=None, stop=None, **kwargs):
...
        # find the axes
        for a in self.axes:
            if column == a.name:
                if not a.is_data_indexable:
                    raise ValueError(
                        "column [%s] can not be extracted individually; it is "
                        "not data indexable" % column)

if I ran this: [a.name for a in hdf.get_storer('df').axes]
then I'd get this: ['index', 'values_block_0', 'values_block_1']
And it was clear that my column would never show up.

Then I looked at the to_hdf() function doc again: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html

data_columns : list of columns or True, optional
List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

So now I created the .h5 file again, this time with one more argument: data_columns=True. And now all the columns got listed in the axes.

>>> df.to_hdf('trips.h5', 'df', format='table', mode='w', complevel=1, data_columns=True)
>>> hdf = pd.HDFStore('trips.h5')
>>> [a.name for a in hdf.get_storer('df').axes]
['index', 'route_id', 'service_id', 'trip_id', 'trip_headsign', 'trip_short_name', 'direction_id', 'block_id', 'shape_id', 'wheelchair_accessible', 'bikes_allowed']
>>> print(hdf.select_column('df','trip_id') ) # NOW IT WORKS!

Note: the time taken by the compiler to create the .h5 file seems to increase when all data_columns=True. So, it's better to know which columns we'll be likely to use and data_columns=['trip_id'] instead.

I think we ought to change this:
raise KeyError("column [%s] not found in the table" % column)
... to convey that in addition to the column perhaps not being present in the table, that it also may not be data indexable and data_columns option has to be used when storing the table in the .h5 file (pd.to_hdf(), or HDFStore.put() or HDFStore.append()).

All 5 comments

you would need to show the initial construction via code

@jreback thanks for looking into this! By initial construction do you mean how the 'db/trips.h5' file was created? It's from a CSV called trips.txt from here: http://transitfeeds.com/p/verkehrsverbund-berlin-brandenburg/213/latest .

I just found I can directly load it from the net (so nice to work on open data), so will re-do the sample code so you can copy-paste this whole thing at your end:

Code Sample, a copy-pastable example if possible

import pandas as pd
GTFS_dtypes = { 'trip_id' :'str', 'block_id':'str', 'trip_headsign':'str', 'direction_id':'int64', 'route_id':'str' }
url = 'https://transitfeeds-data.s3-us-west-1.amazonaws.com/public/feeds/verkehrsverbund-berlin-brandenburg/213/20180426/original/trips.txt'

df = pd.read_csv(url , na_filter=False, dtype=GTFS_dtypes)
df.to_hdf('trips.h5', 'df', format='table', mode='w', complevel=1)

hdf = pd.HDFStore('trips.h5')
print(hdf.info()) # works
print(hdf.select('df',columns=['trip_id'])) # works
print(hdf.select_column('df','index') ) # works
print(hdf.select_column('df','trip_id') ) # throws KeyError

On running this all over again I'm getting the same error I posted.

  • I just tried this with a completely different .csv file. Same error. hdf.select is working fine but hdf.select_column only works if the column is index.
  • Recreated .h5 file with complevel=0. No difference, Same error.

@jreback I got it! Dug into the code and came upon this data indexable thing again..

def read_column(self, column, where=None, start=None, stop=None, **kwargs):
...
        # find the axes
        for a in self.axes:
            if column == a.name:
                if not a.is_data_indexable:
                    raise ValueError(
                        "column [%s] can not be extracted individually; it is "
                        "not data indexable" % column)

if I ran this: [a.name for a in hdf.get_storer('df').axes]
then I'd get this: ['index', 'values_block_0', 'values_block_1']
And it was clear that my column would never show up.

Then I looked at the to_hdf() function doc again: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html

data_columns : list of columns or True, optional
List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

So now I created the .h5 file again, this time with one more argument: data_columns=True. And now all the columns got listed in the axes.

>>> df.to_hdf('trips.h5', 'df', format='table', mode='w', complevel=1, data_columns=True)
>>> hdf = pd.HDFStore('trips.h5')
>>> [a.name for a in hdf.get_storer('df').axes]
['index', 'route_id', 'service_id', 'trip_id', 'trip_headsign', 'trip_short_name', 'direction_id', 'block_id', 'shape_id', 'wheelchair_accessible', 'bikes_allowed']
>>> print(hdf.select_column('df','trip_id') ) # NOW IT WORKS!

Note: the time taken by the compiler to create the .h5 file seems to increase when all data_columns=True. So, it's better to know which columns we'll be likely to use and data_columns=['trip_id'] instead.

I think we ought to change this:
raise KeyError("column [%s] not found in the table" % column)
... to convey that in addition to the column perhaps not being present in the table, that it also may not be data indexable and data_columns option has to be used when storing the table in the .h5 file (pd.to_hdf(), or HDFStore.put() or HDFStore.append()).

you can certainly put up a PR to make this more informative

Was this page helpful?
0 / 5 - 0 ratings