hdf = pd.HDFStore('db/trips.h5')
print(hdf.info()) # works
print(hdf.select('df',columns=['trip_id'])) # works
print(hdf.select_column('df','index') ) # works
print(hdf.select_column('df','trip_id') ) # throws KeyError
print(hdf.info()):
<class 'pandas.io.pytables.HDFStore'>
File path: db/trips.h5
/df frame_table (typ->appendable,nrows->216357,ncols->10,indexers->[index])
print(hdf.select('df',columns=['trip_id'])):
0 88921931
1 88921934
2 88921937
...
[216357 rows x 1 columns]
print(hdf.select_column('df','index') ):
0 0
1 1
2 2
...
Name: index, Length: 216357, dtype: int64
print(hdf.select_column('df','trip_id') ):
KeyError Traceback (most recent call last)
<ipython-input-42-67a3ca7d6da9> in <module>()
20 '''
21 return returnList
---> 22 readColumnDB('trips','trip_id')
<ipython-input-42-67a3ca7d6da9> in readColumnDB(tablename, column)
11
12 print(hdf.select_column('df','index') )
---> 13 print(hdf.select_column('df','trip_id') )
14 returnList = []
15 '''
~/.local/lib/python3.5/site-packages/pandas/io/pytables.py in select_column(self, key, column, **kwargs)
775
776 """
--> 777 return self.get_storer(key).read_column(column=column, **kwargs)
778
779 def select_as_multiple(self, keys, where=None, selector=None, columns=None,
~/.local/lib/python3.5/site-packages/pandas/io/pytables.py in read_column(self, column, where, start, stop, **kwargs)
3779 a.tz, True), name=column)
3780
-> 3781 raise KeyError("column [%s] not found in the table" % column)
3782
3783
KeyError: 'column [trip_id] not found in the table'
After this, tried :
hdf.create_table_index('df')
print(hdf.select_column('df','trip_id') )
Still got same KeyError.
select_column() function is not working for any column other than 'index'. Error message says that column was not found. But hdf.select('df',columns=['trip_id']) works fine.
Inspecting the code,
https://github.com/pandas-dev/pandas/blob/bc37ea2e05019a89adaa48159b220483598d1898/pandas/io/pytables.py#L759
I don't get what they mean by:
return a single column from the table. This is generally only useful to select an indexable
select_column is also advised at other places, as an alternative for hdf.select('df',columns=['trip_id']).
... and there is no mention about having to index the column to make it work. And I don't want to necessarily index the column that I want to extract: what if my table already has an index? What if I just want all the values from a values column?
print(hdf.select_column('df','trip_id') ) should print the contents of trip_id column as a one-column dataframe, as shown in this documentation example:
pd.show_versions()commit: None
python: 3.5.2.final.0
python-bits: 32
OS: Linux
OS-release: 4.13.0-43-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 20.7.0
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 3.5.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
you would need to show the initial construction via code
@jreback thanks for looking into this! By initial construction do you mean how the 'db/trips.h5' file was created? It's from a CSV called trips.txt from here: http://transitfeeds.com/p/verkehrsverbund-berlin-brandenburg/213/latest .
I just found I can directly load it from the net (so nice to work on open data), so will re-do the sample code so you can copy-paste this whole thing at your end:
import pandas as pd
GTFS_dtypes = { 'trip_id' :'str', 'block_id':'str', 'trip_headsign':'str', 'direction_id':'int64', 'route_id':'str' }
url = 'https://transitfeeds-data.s3-us-west-1.amazonaws.com/public/feeds/verkehrsverbund-berlin-brandenburg/213/20180426/original/trips.txt'
df = pd.read_csv(url , na_filter=False, dtype=GTFS_dtypes)
df.to_hdf('trips.h5', 'df', format='table', mode='w', complevel=1)
hdf = pd.HDFStore('trips.h5')
print(hdf.info()) # works
print(hdf.select('df',columns=['trip_id'])) # works
print(hdf.select_column('df','index') ) # works
print(hdf.select_column('df','trip_id') ) # throws KeyError
On running this all over again I'm getting the same error I posted.
hdf.select is working fine but hdf.select_column only works if the column is index.complevel=0. No difference, Same error.@jreback I got it! Dug into the code and came upon this data indexable thing again..
def read_column(self, column, where=None, start=None, stop=None, **kwargs):
...
# find the axes
for a in self.axes:
if column == a.name:
if not a.is_data_indexable:
raise ValueError(
"column [%s] can not be extracted individually; it is "
"not data indexable" % column)
if I ran this: [a.name for a in hdf.get_storer('df').axes]
then I'd get this: ['index', 'values_block_0', 'values_block_1']
And it was clear that my column would never show up.
Then I looked at the to_hdf() function doc again: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html
data_columns : list of columns or True, optional
List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.
So now I created the .h5 file again, this time with one more argument: data_columns=True. And now all the columns got listed in the axes.
>>> df.to_hdf('trips.h5', 'df', format='table', mode='w', complevel=1, data_columns=True)
>>> hdf = pd.HDFStore('trips.h5')
>>> [a.name for a in hdf.get_storer('df').axes]
['index', 'route_id', 'service_id', 'trip_id', 'trip_headsign', 'trip_short_name', 'direction_id', 'block_id', 'shape_id', 'wheelchair_accessible', 'bikes_allowed']
>>> print(hdf.select_column('df','trip_id') ) # NOW IT WORKS!
Note: the time taken by the compiler to create the .h5 file seems to increase when all data_columns=True. So, it's better to know which columns we'll be likely to use and data_columns=['trip_id'] instead.
I think we ought to change this:
raise KeyError("column [%s] not found in the table" % column)
... to convey that in addition to the column perhaps not being present in the table, that it also may not be data indexable and data_columns option has to be used when storing the table in the .h5 file (pd.to_hdf(), or HDFStore.put() or HDFStore.append()).
you can certainly put up a PR to make this more informative
Most helpful comment
@jreback I got it! Dug into the code and came upon this data indexable thing again..
if I ran this:
[a.name for a in hdf.get_storer('df').axes]then I'd get this:
['index', 'values_block_0', 'values_block_1']And it was clear that my column would never show up.
Then I looked at the
to_hdf()function doc again: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.htmlSo now I created the .h5 file again, this time with one more argument:
data_columns=True. And now all the columns got listed in the axes.Note: the time taken by the compiler to create the .h5 file seems to increase when all
data_columns=True. So, it's better to know which columns we'll be likely to use anddata_columns=['trip_id']instead.I think we ought to change this:
raise KeyError("column [%s] not found in the table" % column)... to convey that in addition to the column perhaps not being present in the table, that it also may not be data indexable and
data_columnsoption has to be used when storing the table in the .h5 file (pd.to_hdf(), orHDFStore.put()orHDFStore.append()).