Vector: New `clickhouse` sink

Created on 2 Jul 2019 · 4 Comments · Source: timberio/vector

Would love to see a clickhouse sink for this project.

feature

All 4 comments

Actually, ClickHouse can be used as a sink right away by combining the `http` sink with the ClickHouse HTTP API. ClickHouse supports the JSONEachRow input format, which is the same format as ndjson in Vector.

It is enough to set the url to something like

http://clickhouse:8123/?query=INSERT%20INTO%20my_log%20FORMAT%20JSONEachRow

to insert the data into a ClickHouse table.

I made an example that demonstrates how to send logs from Vector to ClickHouse using the HTTP sink.
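To make the request shape concrete, here is a minimal Python sketch of what such a request looks like outside of Vector. The helper names (`build_insert_url`, `encode_ndjson`, `build_request`) are made up for illustration, and the host and table are the ones from the URL above. It only builds the request; nothing is sent.

```python
import json
from typing import Iterable
from urllib.parse import quote
from urllib.request import Request


def build_insert_url(base_url: str, table: str) -> str:
    """Return a ClickHouse HTTP URL whose query inserts JSONEachRow data into `table`."""
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    # quote() percent-encodes the spaces in the query as %20.
    return f"{base_url.rstrip('/')}/?query={quote(query)}"


def encode_ndjson(events: Iterable[dict]) -> bytes:
    """Serialize events as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(event) + "\n" for event in events).encode("utf-8")


def build_request(base_url: str, table: str, events: Iterable[dict]) -> Request:
    """Build (but do not send) the POST request carrying the encoded events."""
    return Request(
        build_insert_url(base_url, table),
        data=encode_ndjson(events),
        method="POST",
    )
```

Passing the result of `build_request("http://clickhouse:8123", "my_log", events)` to `urllib.request.urlopen` would perform the insert, assuming a reachable server and a table whose columns match the event keys.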

@a-rodin this looks great! I think this is a good first step, though I see more opportunities to support clickhouse more thoroughly. Thanks for doing this!

@lukesteensen before you begin work on this, I think it's worth putting together a simple high-level spec for the first version. This way we can get consensus on what that looks like.

For example, should we start by encoding the event as JSON and storing that in a single column? It appears ClickHouse offers JSON functions that would make it possible to parse and operate on the data at query time. I'm sure there is a performance cost for this, but I could see it being a viable option for unpredictable schemas. I just don't know enough about ClickHouse to know if this is even a smart strategy, which is what I want to investigate with this spec.
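A rough sketch of the single-column idea, in Python for illustration only. The column name `raw` and the helper `to_single_column_row` are hypothetical, and the query string assumes ClickHouse's `JSONExtractString` function and the `my_log` table from the earlier comment.

```python
import json


def to_single_column_row(event: dict) -> dict:
    """Wrap the whole event as one JSON string stored in a hypothetical `raw` column."""
    return {"raw": json.dumps(event, sort_keys=True)}


# Fields would then be pulled out at query time rather than at insert time,
# for example (assuming a `raw` String column in `my_log`):
EXAMPLE_QUERY = "SELECT JSONExtractString(raw, 'host') AS host FROM my_log"

row = to_single_column_row({"host": "web-1", "message": "request served"})
line = json.dumps(row)  # one JSONEachRow line, whatever keys the event had
```

The table schema never changes as events gain or lose fields; the trade-off is that every query pays to parse the JSON string.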

Alternatively, if we're going to work with static schemas we have a few ways to approach this:

  1. Just map the fields 1 to 1 with columns and hope the user has the columns defined in ClickHouse (I dislike this, but I'm including it for completeness).
  2. @bruceg is working on https://github.com/timberio/vector/issues/405 which will introduce generic coercion and type consistency. This can serve as a poor man's schema definition and ensure types and shapes are consistent. This, at least, forms a contract between Vector and ClickHouse.
  3. A user could define white-listed fields in the ClickHouse schema.
  4. I'm open to any other ideas. Like I said, I still have a lot to learn about ClickHouse.
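Options 2 and 3 could look roughly like the following Python sketch: a declared set of fields acts as the whitelist, and each value is coerced to the declared type before insertion. This is a plain-Python stand-in, not the coercion design from issue 405; `SCHEMA` and `project_event` are hypothetical names.

```python
# Hypothetical static schema: column name -> Python type used for coercion.
SCHEMA = {"timestamp": str, "host": str, "status": int}


def project_event(event: dict, schema: dict) -> dict:
    """Keep only whitelisted fields and coerce each one to its declared type."""
    row = {}
    for name, cast in schema.items():
        if name in event:
            # Raises ValueError/TypeError if the value cannot be coerced.
            row[name] = cast(event[name])
    return row


row = project_event({"host": "web-1", "status": "200", "debug": True}, SCHEMA)
# `debug` is dropped because it is not in the schema; `status` becomes an int.
```

Whether a failed coercion should drop the field, drop the event, or fail the batch is exactly the kind of decision the spec would need to settle.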

To re-iterate, I think we should start simple, something that we could hopefully ship this week, and then follow up with enhancement issues covering the points above.

I personally think the approach to the schema is not too critical, because the user can always apply further transformations on the ClickHouse side: create another table with the desired schema and use a materialized view to transform the data and pipe it there from the source table.

However, it is necessary to have a separate column at least for the timestamp, because using engines from the MergeTree family efficiently requires good partitioning and primary keys.

I also want to point out that the DateTime type in ClickHouse is a 32-bit UNIX timestamp with one-second resolution, which is coarser than the timestamps in Vector's events (see this issue).
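To illustrate the resolution gap, here is a small Python sketch that splits a sub-second timestamp into whole epoch seconds (what a DateTime column can hold) and a separate sub-second remainder. Storing the remainder in its own column is just one possible workaround, assumed here for illustration; `split_timestamp` is a hypothetical helper.

```python
import calendar
from datetime import datetime, timezone


def split_timestamp(ts: datetime) -> tuple:
    """Return (whole epoch seconds, microsecond remainder) for an aware datetime."""
    # timegm works on integer fields, so no float rounding is involved.
    epoch_seconds = calendar.timegm(ts.utctimetuple())
    return epoch_seconds, ts.microsecond


first = datetime(2019, 7, 2, 12, 0, 0, 123456, tzinfo=timezone.utc)
second = datetime(2019, 7, 2, 12, 0, 0, 987654, tzinfo=timezone.utc)

# Both events collapse to the same value at one-second resolution...
same_second = split_timestamp(first)[0] == split_timestamp(second)[0]
# ...and stay distinguishable only if the remainder is kept somewhere.
still_distinct = split_timestamp(first) != split_timestamp(second)
```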

Just as a reference, there are two related projects that tackle a similar problem: logstash-output-clickhouse and graphite-clickhouse. The latter uses a special GraphiteMergeTree table engine which has support for rollups. On the other hand, rollups for metrics could be done in ClickHouse by copying the data to an AggregatingMergeTree table instead.
