Is your feature request related to a problem? Please describe.
I want to forward my logs to ClickHouse without relying on some strange third-party plugin for Fluentd; I would like to use a lightweight tool like Fluent Bit for that.
Describe the solution you'd like
Just an output plugin for ClickHouse; there also seem to be client libraries available for that, like https://github.com/artpaul/clickhouse-cpp.
Describe alternatives you've considered
The only alternative I can see is to use Fluentd, but since ClickHouse is only supported there by a third-party plugin whose source I don't really like, that looks like a bad direction.
Additional context
I would contribute that feature, but C/C++ is not a language I feel really comfortable with, so I guess it would take me a long time to build.
At least I have started a branch on my fork; let's see if I can provide a PoC and gather some feedback :D
https://github.com/fluent/fluent-bit/compare/master...tboerger:clickhouse?expand=1
I've made a little progress: I am able to initialize the upstream connection and have added all the required attributes. Now I need to transform the data into JSON and send it to the ClickHouse HTTP API.
@tboerger, you are right about the existing Fluentd plugins for ClickHouse; they are incomplete at best.
Obviously, ClickHouse's native binary interface is the most efficient way to ingest events. I believe we'll eventually get there.
However, I found a quick and easy solution via the out_http plugin. ClickHouse supports both the json_stream and json_lines formats. It's also possible to use HTTPS, but I haven't tested that yet. This is not as efficient as the native interface, but it is enough for most cases:
```
[OUTPUT]
    Name   http
    Host   clickhouse-server-addr
    Port   8123
    URI    /?query=INSERT%20INTO%20t%20FORMAT%20JSONEachRow
    Format json_stream
```
Unfortunately, the most critical issue with this solution for production use is that it supports only a single destination server. Version 0.14.0 announced an upstream interface for the forward plugin, and I really hope the out_http plugin will use that interface soon.
For a complete solution, it would also be good to compress the JSON stream with native HTTP Content-Encoding: gzip. The TSV or CSV formats can also help reduce the amount of traffic.
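As a rough sketch of what such gzip-compressed inserts could look like against the ClickHouse HTTP interface (which accepts request bodies sent with `Content-Encoding: gzip`), here is a minimal Python example. The host, port, and table name are placeholders taken from the config above, and the helper name is made up:

```python
# Sketch: gzip-compress a JSONEachRow batch before POSTing it to the
# ClickHouse HTTP interface. Host, port, and table name are placeholders.
import gzip
import json
import urllib.request


def build_insert_request(rows, host="clickhouse-server-addr", port=8123, table="t"):
    # One JSON object per line, as the JSONEachRow format expects.
    body = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
    url = (f"http://{host}:{port}/?query="
           f"INSERT%20INTO%20{table}%20FORMAT%20JSONEachRow")
    req = urllib.request.Request(url, data=gzip.compress(body), method="POST")
    # ClickHouse decompresses the request body when this header is set.
    req.add_header("Content-Encoding", "gzip")
    return req


req = build_insert_request([{"date": "2018-10-13 18:40:08", "host": "62.43.57.50"}])
# urllib.request.urlopen(req)  # would actually send the insert
```

This is just to illustrate the wire format; a real output plugin would of course do this in C inside Fluent Bit.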
BTW, ClickHouse's strengths and prospects are simply underestimated. We found it much more useful than Elasticsearch. Just read about how it is used by Cloudflare.
That's also a pretty interesting approach; I never thought about that. Should I skip my current effort and just use this instead?
I would really like to replace my current Elasticsearch setup with a ClickHouse-based one. The only missing bit is a really good UI for it, since loghouse doesn't provide a great UX.
As far as I understand, it is not as simple as just using the http plugin (correct me if I'm wrong).
In my experience, if you want ClickHouse to accept the output of the http plugin, you need to provide the DateTime in the correct format. Setting json_date_format iso8601 in the http plugin configuration almost nails it; ClickHouse receives a JSON stream like this:
```
{"date":"2018-10-13T18:40:08.000000Z", "host":"62.43.57.50", "user":"-", "method":"GET", "path":"/api/track?tracking=CD025602700RU", "code":"200", "size":"7210", "referer":"-", "agent":"-"}
{"date":"2018-10-13T18:40:09.000000Z", "host":"16.121.182.122", "user":"-", "method":"GET", "path":"/", "code":"200", "size":"7087", "referer":"-", "agent":"Zabbix"}
```
but it throws an exception on date parsing.
I've created https://github.com/fluent/fluent-bit/issues/848 because of this.
My ClickHouse table is:
```
CREATE TABLE log (
    date    DateTime,
    host    String,
    user    String,
    method  String,
    path    String,
    code    UInt16,
    size    UInt32,
    referer String,
    agent   String
) ENGINE = MergeTree PARTITION BY toYYYYMM(date) ORDER BY date;
```
The quick-and-dirty way to deal with it is to launch a simple intermediary server that accepts the Fluent Bit data, converts the datetime to ClickHouse's format, and inserts the converted data into ClickHouse.
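The core of such an intermediary is just a per-record datetime rewrite. A minimal sketch in Python, assuming the `date` field name from the example records above (the function names are made up):

```python
# Sketch: rewrite the ISO8601 date Fluent Bit emits into the
# "YYYY-MM-DD hh:mm:ss" form that a ClickHouse DateTime column accepts.
import json
from datetime import datetime


def to_clickhouse_datetime(iso: str) -> str:
    # "2018-10-13T18:40:08.000000Z" -> "2018-10-13 18:40:08"
    dt = datetime.strptime(iso, "%Y-%m-%dT%H:%M:%S.%fZ")
    return dt.strftime("%Y-%m-%d %H:%M:%S")


def rewrite_line(line: str) -> str:
    # One JSONEachRow record in, one record out, with the date fixed up.
    rec = json.loads(line)
    rec["date"] = to_clickhouse_datetime(rec["date"])
    return json.dumps(rec)


print(rewrite_line('{"date":"2018-10-13T18:40:08.000000Z", "host":"62.43.57.50"}'))
# {"date": "2018-10-13 18:40:08", "host": "62.43.57.50"}
```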
So, how likely is it that native output support for ClickHouse will appear in Fluent Bit, without all that mess with the http output and time formatting?
I'd consider the following pipeline:
```
myserver -> fluentbit -> json stdout -> collect&transform -> clickhouse
```
At the collect&transform step, logs must be transformed into ClickHouse-friendly formats; they are also accumulated and inserted in batches of 10,000 lines (the recommended batch size).
So I wonder how Fluent Bit could handle the collect&transform step; I guess a separate application would be better here.
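The accumulate-and-flush part of such a separate application is simple to sketch. A minimal Python example, assuming the 10,000-line batch recommendation above (the class and callback are made up; the actual ClickHouse insert is left as a callback):

```python
# Sketch: accumulate lines and flush them in fixed-size batches, as the
# collect&transform step would before inserting into ClickHouse.
class Batcher:
    def __init__(self, flush, batch_size=10000):
        self.flush = flush          # callback receiving a full batch
        self.batch_size = batch_size
        self.buf = []

    def add(self, line):
        self.buf.append(line)
        if len(self.buf) >= self.batch_size:
            self.drain()

    def drain(self):
        # Flush whatever is buffered, e.g. on a timer or at shutdown.
        if self.buf:
            self.flush(self.buf)
            self.buf = []


batches = []
b = Batcher(batches.append, batch_size=3)
for i in range(7):
    b.add(f"line {i}")
b.drain()  # flush the remainder
print([len(x) for x in batches])  # [3, 3, 1]
```

In a real deployment the `flush` callback would POST the batch to ClickHouse, and a timer would call `drain()` so slow streams don't sit in the buffer forever.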
I solved this in my case with the following architecture: fluent-bit collects logs on many endpoint hosts and forwards them to several fluentd hosts for aggregation. The fluentd hosts transform the input entries as needed, using format and inject sections, and upload to the ClickHouse servers via the fluentd exec output plugin, which calls clickhouse-client through a shell script. That works really well.
But a working ClickHouse output plugin would still be useful for some cases.
@fessmage I've seen a similar approach in the loghouse Helm chart. I don't know how fluentd can accumulate log messages, though.
Could you share your configs?
I agree that a dedicated ClickHouse plugin would be useful.
This is more appropriate for a fluentd discussion, but OK, let's see it here.
General fluentd config:
```
<source>
  @type forward
</source>

<match some-tag>
  @type exec
  command bash insert_clickhouse.sh
  <inject>
    time_key time_local
    time_type unixtime
    tag_key tag
  </inject>
  <format>
    @type json
  </format>
</match>
```
and this is the shell script:
```
#!/bin/bash
cat $1 | clickhouse-client --host host --port port --user user --password password --query="INSERT INTO db.table FORMAT JSONEachRow"
```
Hello mate, can you post the database create query? I need to store NetFlow info (about 10k records per second) and I'm a little confused about how to start. Creating the database properly is an important first step :)
> collect&transform

Where can I get the collect&transform step? Thanks!
Hi there!
One thing worth trying with ClickHouse is to use its own "transform" capabilities.
Given the fluent-bit output format:
```
{"date":"2018-10-13T18:40:08.000000Z", "host":"62.43.57.50", "log": "blblblbla"}
{"date":"2018-10-13T18:40:09.000000Z", "host":"16.121.182.122", "log": "fofoffoffoffofffoo"}
```
First create a sink:
```
CREATE DATABASE IF NOT EXISTS logs;

CREATE TABLE logs.sink (
    date String,
    host String,
    log  String
)
ENGINE = Null;
```
And a table that will transform and save the data:
```
CREATE MATERIALIZED VIEW logs.logs
ENGINE = MergeTree
PARTITION BY toYYYYMM(datetime)
ORDER BY (datetime, host)
AS
SELECT
    parseDateTimeBestEffort(date) AS datetime,
    host,
    log
FROM logs.sink;
```
and the output config:
```
[OUTPUT]
    Name   http
    Host   clickhouse-server-addr
    Port   8123
    URI    /?query=INSERT%20INTO%20logs.sink%20FORMAT%20JSONEachRow
    Format json_stream
```
should just work (at least I believe so; unfortunately, I haven't tried it yet).
I will close this issue as I'm not using ClickHouse anymore; I have switched to Graylog because of the better web interface around it.
Can this be reopened? I'd be interested in seeing a ClickHouse output plugin.
Fair enough
At least I'm able to unsubscribe from this issue, even though I'm the author.
The following config:
```
[OUTPUT]
    Name             http
    Host             <clickhouse host>
    Port             8123
    URI              /?query=INSERT+INTO+log.records+FORMAT+JSONEachRow
    Format           json_stream
    Json_date_key    timestamp
    Json_date_format epoch
```
works neatly for me. ClickHouse converts the epoch into a timestamp automatically, provided you use the correct data type: DateTime('Etc/UTC').
Please create another issue if you are interested in that; I want to keep my issue list clean, and I'm not using Fluent Bit anymore.