Would love to see a clickhouse sink for this project.
Actually, ClickHouse can be used as a sink right away using the http sink and the ClickHouse HTTP API. ClickHouse supports the JSONEachRow input format, which is the same format as ndjson in Vector.
It is enough to set the url to be something like
http://clickhouse:8123/?query=INSERT%20INTO%20my_log%20FORMAT%20JSONEachRow
to insert the data into a ClickHouse table.
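For illustration, here is a minimal Python sketch of what that request looks like: it builds the same URL and encodes a couple of events as newline-delimited JSON. The helper names (`clickhouse_insert_url`, `to_ndjson`) and the sample events are made up for this example; the host and table name are the ones from the URL above. It does not send anything over the network.

```python
import json
from urllib.parse import quote


def clickhouse_insert_url(base_url: str, table: str) -> str:
    """Build the HTTP API URL for an INSERT in JSONEachRow format.

    Hypothetical helper: the table name is interpolated as-is, so it
    should only come from trusted configuration.
    """
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    # quote() percent-encodes the spaces in the query as %20.
    return f"{base_url}/?query={quote(query)}"


def to_ndjson(events) -> str:
    """Encode events as one JSON object per line (ndjson / JSONEachRow)."""
    return "\n".join(json.dumps(event) for event in events) + "\n"


# Sample values; the events are invented for the example.
url = clickhouse_insert_url("http://clickhouse:8123", "my_log")
body = to_ndjson([{"message": "hello"}, {"message": "world"}])

print(url)
print(body, end="")
```

The `body` string is what would go in the POST request body, one row per line.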
I made an example that demonstrates how to send logs from Vector to ClickHouse using the HTTP sink.
@a-rodin this looks great! I think this is a good first step, though I see more opportunities to support clickhouse more thoroughly. Thanks for doing this!
@lukesteensen before you begin work on this, I think it's worth putting together a simple high-level spec for the first version. This way we can get consensus on what that looks like.
For example, should we start with encoding the event to JSON and storing that in a single column? It appears ClickHouse offers JSON functions that would make it possible to parse and operate on the data at query time. I'm sure there is a performance cost for this, but I could see it being a viable option for unpredictable schemas. I just don't know enough about ClickHouse to know if this is even a smart strategy, which is what I want to investigate with this spec.
Alternatively, if we're going to work with static schemas we have a few ways to approach this:
To re-iterate, I think we should start simple, something that we could hopefully ship this week, and then follow up with enhancement issues covering the points above.
I personally think that the approach to the schema is not too critical, because the user can always apply further transformations on the ClickHouse side by creating another table with the desired schema and using a materialized view to transform the data and pipe it there from the source table.
However, it is necessary to have a separate column at least for the timestamp, because using engines from the MergeTree family efficiently requires good partitioning and primary keys.
I also want to point out that the DateTime type in ClickHouse is a 32-bit UNIX timestamp, which has coarser resolution than timestamps in Vector's events (see this issue).
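To make the resolution point concrete, here is a small Python sketch that splits a sub-second timestamp into the whole-second value a DateTime column can hold and the fractional part that would be dropped. The function name and the sample timestamp are invented for the example, and keeping the remainder in a separate integer column is only one possible workaround, not something decided in this thread.

```python
from datetime import datetime, timezone


def split_timestamp(ts: datetime) -> tuple[int, int]:
    """Return (whole seconds since the UNIX epoch, leftover microseconds).

    Hypothetical helper: the first value is what fits in a
    second-resolution column, the second is what would be lost.
    """
    seconds = int(ts.replace(microsecond=0).timestamp())
    return seconds, ts.microsecond


# Sample event time with sub-second precision (invented for the example).
event_time = datetime(2019, 5, 1, 12, 0, 0, 123456, tzinfo=timezone.utc)
seconds, micros = split_timestamp(event_time)

# Reading the stored seconds back loses the fractional part.
restored = datetime.fromtimestamp(seconds, tz=timezone.utc)
lost = event_time - restored

print(seconds, micros, lost)
```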
Just as a reference, there are two related projects that tackle a similar problem, logstash-output-clickhouse and graphite-clickhouse. The latter uses a special GraphiteMergeTree table engine which has support for rollups. On the other hand, rollup for metrics could be done in ClickHouse by copying the data to AggregatingMergeTree instead.