Distributed: [Feature] Add support for ClickHouse

Created on 15 Jun 2016 · 12 Comments · Source: dask/distributed

_Disclaimer: this is just "a nice idea", since only an HTTP interface is available at this point._

ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time.

ClickHouse manages extremely large volumes of data in a stable and sustainable manner. It currently powers Yandex.Metrica, world's second largest web analytics platform, with over 13 trillion database records and over 20 billion events a day, generating customized reports on-the-fly, directly from non-aggregated data. This system was successfully implemented at CERN's LHCb experiment to store and process metadata on 10bn events with over 1000 attributes per event registered in 2011.

All 12 comments

Generally integrating with other systems is good, especially over standard protocols (like ODBC). I would prefer to see integration with individual databases start in third-party modules rather than be part of the Dask codebase.

Closing this for now as out-of-scope. I'd be happy to help anyone who wanted to do this externally.

@frol I'm planning to implement an adapter for ClickHouse. Are you still interested?

@kszucs I would be very happy to see it and try it. Currently we have settled on HDFS/Parquet, but ClickHouse and Kudu look quite interesting to me. Thank you for tackling this issue!

@wesm @martindurant How difficult do you think it would be to create an adapter for ClickHouse?

In the short term we can connect via an HTTP client; in the long term, via a Cython wrapper around ODBC.

How should we get started? Any general advice about partitioning or data locality?
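
To make the short-term HTTP route concrete, here is a minimal sketch using only the Python standard library. It assumes a ClickHouse server reachable over its HTTP interface (port 8123 is the default), passes the SQL through the `query` URL parameter, and asks for the `TabSeparatedWithNames` output format. The function names (`build_query_url`, `parse_tsv_with_names`, `run_query`) are hypothetical, and the parser does not undo ClickHouse's backslash escaping, so treat it as an illustration rather than a complete client.

```python
import csv
import io
import urllib.parse
import urllib.request
from typing import Dict, List


def build_query_url(base_url: str, query: str) -> str:
    """Return a URL that sends `query` via the `query` URL parameter."""
    # Ask the server for tab-separated output with a header row.
    sql = query.rstrip().rstrip(";") + " FORMAT TabSeparatedWithNames"
    params = urllib.parse.urlencode({"query": sql}, quote_via=urllib.parse.quote)
    return base_url.rstrip("/") + "/?" + params


def parse_tsv_with_names(payload: str) -> List[Dict[str, str]]:
    """Parse a TabSeparatedWithNames response into a list of row dicts.

    All values stay strings; escape sequences are not decoded in this sketch.
    """
    reader = csv.reader(io.StringIO(payload), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def run_query(base_url: str, query: str) -> List[Dict[str, str]]:
    """Run a read-only query against a live server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, query)) as response:
        return parse_tsv_with_names(response.read().decode("utf-8"))
```

With a server running locally, `run_query("http://localhost:8123", "SELECT 1 AS x")` would return one row dict. Each row dict could then be fed into a pandas DataFrame, one query per dask partition.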

Is https://github.com/cloudflare/sqlalchemy-clickhouse enough to use dask.dataframe.read_sql_table?

I guess so, but I'll give it a try.
How about writing/persisting a ddf? Just ddf.map_partitions(lambda df: df.to_sql(...))?

I haven't got around to writing a to_sql method yet, but yes, it should be as simple as that (using insert for the partitions).

I had a couple of problems with reading from clickhouse:

  • incorrect divisions, related to https://github.com/dask/dask/issues/2260
  • an empty result set raises a "non-existing column" error when trying to set it as the index (this might be a dask issue rather than a sqlalchemy-clickhouse one)

I'm planning to create a library that provides read_clickhouse_table along with read_clickhouse (query). When reading a ClickHouse table, we can query its partitions (or dask divisions) from system.parts beforehand.
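
A rough sketch of that idea, as plain query builders with no client attached: first list the active partitions of a table from system.parts, then build one SELECT per partition so that each can back one dask partition. This assumes month-based partitioning, where partition values look like `201706` and can be matched with `toYYYYMM` on a date column. The function names, and the table and column names in the usage note, are hypothetical.

```python
from typing import List


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def list_partitions_query(database: str, table: str) -> str:
    """Build a query listing the distinct active partitions of a table."""
    return (
        "SELECT DISTINCT partition FROM system.parts "
        f"WHERE database = {quote_literal(database)} "
        f"AND table = {quote_literal(table)} AND active "
        "ORDER BY partition"
    )


def partition_queries(table: str, date_column: str, partitions: List[str]) -> List[str]:
    """Build one SELECT per monthly partition (assumes values like '201706')."""
    # int() both validates the partition value and keeps it out of string quoting.
    return [
        f"SELECT * FROM {table} WHERE toYYYYMM({date_column}) = {int(p)}"
        for p in partitions
    ]
```

For example, `partition_queries("events", "event_date", ["201705", "201706"])` yields two queries, one per month; the sorted partition values could also serve as a starting point for dask divisions if the index is the date column.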

@frol AFAIK you have a lot of experience with building Alpine and/or C++ stuff. Could you give me a hand compiling https://github.com/yandex/clickhouse-odbc ?
I get the following:

[ 99%] Building CXX object driver/CMakeFiles/clickhouse-odbc.dir/statement.cpp.o
[100%] Linking CXX shared library clickhouse-odbc.so
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lodbc
collect2: error: ld returned 1 exit status
driver/CMakeFiles/clickhouse-odbc.dir/build.make:304: recipe for target 'driver/clickhouse-odbc.so' failed
make[2]: *** [driver/clickhouse-odbc.so] Error 1
CMakeFiles/Makefile2:326: recipe for target 'driver/CMakeFiles/clickhouse-odbc.dir/all' failed
make[1]: *** [driver/CMakeFiles/clickhouse-odbc.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

I have unixodbc-dev installed.

@martindurant Does read_sql_table (or pandas.read_sql) support pyodbc?

@martindurant @frol I've created a minimal pandas reader/writer here https://github.com/kszucs/pandahouse

The next one will be dask support.

@kszucs Any update on dask support?
