ClickHouse: Avro Support

Created on 13 Jun 2019 · 17 comments · Source: ClickHouse/ClickHouse

Reopen #1342 for triage.

comp-formats feature st-community-taken


All 17 comments

I've read the specs briefly. The Avro format looks very sound. In the simplest case it will be similar to the RowBinary format. Schema parsing and processing may be a bit more complicated.

By the way, I'm not familiar with popular use cases of the Avro format. Could you please tell me why you are using Avro, where the data comes from, why that system uses Avro, and what the background of this task is?

@alexey-milovidov we are using Confluent KSQL as our ETL tool on top of Kafka. We push data to Kafka from multiple transactional databases using CDC, then we enrich it by combining data from different Kafka topics with KSQL. Confluent recommends the Avro format, since it not only saves space but is also faster to process in KSQL. So we convert all the source data to Avro using the Confluent schema registry before pushing it to Kafka topics.
It would be great if we could sync data in Avro format from Kafka directly into ClickHouse.
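
For context, the ingestion pattern this implies on the ClickHouse side is the usual one: a Kafka engine table drained into a MergeTree table by a materialized view. A minimal sketch with illustrative table, column, and broker names, assuming a future Avro-capable kafka_format:

CREATE TABLE orders_queue (order_id Int64, amount Float64)
ENGINE = Kafka()
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'orders_enriched',
         kafka_group_name = 'clickhouse',
         kafka_format = 'Avro';  -- hypothetical format name at this point in the thread

CREATE TABLE orders (order_id Int64, amount Float64)
ENGINE = MergeTree ORDER BY order_id;

-- The materialized view streams each consumed batch into permanent storage.
CREATE MATERIALIZED VIEW orders_mv TO orders AS
SELECT order_id, amount FROM orders_queue;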

@alexey-milovidov Confluent has a C library, libserdes, which serialises/deserialises Avro and supports the schema registry:
https://github.com/confluentinc/libserdes

Our use case also requires Kafka to carry Avro-encoded data. It would be very nice if ClickHouse could support Avro as an input format.

Kafka is nowadays one of the most popular event-streaming platforms, and the preferred message format for it is Avro. I really believe that adding an Avro connector is extremely important for my company and for the whole community in general.

+1 for Avro support. This schema-validated format is super useful on large-scale projects involving many separate teams that may not even need to speak to each other to work with the data.

With Kafka and the AvroRegistry, it ensures data has the expected structure and types, and the full documentation (schema) of that structure is viewable and easily understandable by anyone on the project(s). A "contract" is made on a topic once you push Avro to it; otherwise the data is rejected. It forces the devs to send clear, predictable data that anyone can build on.

The Confluent Kafka Connect module even uses it to let you dump data from a topic directly into an SQL table, and it can even do upserts for you and give you exactly-once delivery (which any Kafka consumer can do too).

It's probably obvious, but I'd go with supporting an Avro schema where the top-level type is a record with its fields:

{
  "type": "record",
  "name": "Something",
  "fields": [
    {"name": "some_id", "type": "int", "doc": "A super useful ID of something"},
    {"name": "some_string", "type": "string", "doc": "blabla"}
  ]
}

and just for kicks:

CREATE TABLE my_kafka_data (some_id Int32, some_string String) ENGINE = Kafka()
  SETTINGS kafka_broker_list = 'broker:9092',  -- broker/group values illustrative
           kafka_topic_list = 'my_kafka_topic', kafka_group_name = 'my_group',
           kafka_format = 'AvroRecord'

+1
It would be efficient to store queue data in Avro format, because of the smaller size.

+1

This task is assigned to an external contributor, Pavel Kruglov (@Avogar).

Hi everyone. I ended up creating a PR to support Avro + AvroConfluent.
https://github.com/ClickHouse/ClickHouse/pull/8571

Anyone interested - give it a try. I'd appreciate any feedback.
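
For anyone trying the PR, a minimal sketch of consuming Confluent-framed Avro from Kafka, assuming the AvroConfluent format and the format_avro_schema_registry_url setting introduced there (table, broker, and registry names are illustrative):

CREATE TABLE events_queue (some_id Int32, some_string String)
ENGINE = Kafka()
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'my_kafka_topic',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'AvroConfluent',
         -- the registry resolves each message's embedded schema ID
         format_avro_schema_registry_url = 'http://schema-registry:8081';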

@oandrew any chance that it will support basic authentication for the schema registry, as well as multiple URLs?

@Sugaroverdose most likely in a follow-up PR once the current one is merged. Probably using libserdes would be the easiest. The main issue would be figuring out how to share state across different instances of InputFormat.

@oandrew just to make clear what I mean by 'multiple URLs':
multiple URLs are used to support failover, like the Confluent client does. In its config the client gets a list of strings (it may pick the first node from the list randomly, to spread load), and if the chosen schema registry node responds with a 5xx status or times out (and maybe on 401/403, for the corner case where an admin didn't distribute the authentication configuration to all nodes, with logging of course), it tries another one.

That way you don't have to put an HA load balancer on top of the schema registry: the client knows there are multiple schema registry nodes and can handle the failure of individual nodes without losing the ability to read/write a Kafka topic until the failed node is restored or the serialiser/deserialiser configuration is changed.

So it's really a single cluster with a synchronized state of schemas/IDs; you don't have to handle different schemas with the same schema ID, which is an anti-pattern.

And since we're talking about Confluent-compatible handling, a list of schema registries also assumes that they all share the same authentication credentials.

Why I believe you should consider this in the current PR: a different configuration format may lead to incompatibility or confusion if the initial PR goes into a release without this change.
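
If the registry URL setting honors userinfo credentials (an assumption, not something confirmed in this thread), basic auth could be approximated in the meantime by embedding them in the URL:

-- Assumption: the schema registry URL accepts embedded basic-auth credentials.
-- Only this setting changes relative to the earlier Kafka table example:
format_avro_schema_registry_url = 'http://user:secret@schema-registry:8081'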

In master.


@Avogar didn't participate in Avro development but he will be happy to implement Thrift and MsgPack :)


One more format-related task: make all formats that are available for input available for output too (to make round trips possible).
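
To make the round-trip idea concrete, a sketch with Avro via clickhouse-client (file and table names illustrative):

-- Export a table as Avro, then ingest the same bytes back
-- (shell redirection supplies/receives the data):
--   clickhouse-client --query "SELECT * FROM orders FORMAT Avro" > orders.avro
--   clickhouse-client --query "INSERT INTO orders FORMAT Avro" < orders.avro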

@filimonov OK, and what are the priorities?
I think the first priority is ORC, which was contributed by Andrey Konyaev from ArenaData.
