I'm thinking of using ClickHouse in production within a big company here in Brazil, so I have a few questions.
When should I use shard?
When should I use distributed tables?
When should I use replicated tables?
I want to be able to run parallel queries on a group of servers with ClickHouse, and I also want my data to be replicated so the setup is fault-tolerant.
In what order should I create the tables/engines?
I'm confused...
Thx a lot.
Yep. I remember that all those engines were a bit confusing for me at the beginning too.
Generally: the main engine in ClickHouse is called MergeTree. It allows you to store and process data on one server and enjoy all the advantages of ClickHouse. Basic usage of MergeTree does not require any special configuration, and you can start using it 'out of the box'.
But one server and one copy of the data is not fault-tolerant - something can happen to the server itself, to datacenter availability, etc. So you need to have replica(s) - i.e. server(s) with the same data which can 'substitute' for the original server at any moment.
To have an extra copy (replica) of your data you need to use the ReplicatedMergeTree engine. It is used _instead_ of the MergeTree engine, and you can always upgrade from MergeTree to ReplicatedMergeTree (and downgrade back) if you need to. To use it you need to have ZooKeeper installed and running. For tests you can use one standalone ZooKeeper instance, but for production usage you should have a ZooKeeper ensemble of at least 3 servers.
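As a sketch (the table name is reused from the example further down; the columns and the ZooKeeper path are made up for illustration), a ReplicatedMergeTree table is created like this:

```sql
-- Hypothetical table; the ZooKeeper path and replica name are the two
-- extra parameters of the Replicated* engines (often filled in from
-- the {shard} / {replica} macros defined in the server config).
CREATE TABLE my_replicated_data
(
    event_date Date,
    event_id   UInt64,
    value      String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_replicated_data', '{replica}')
ORDER BY (event_date, event_id);
```

Running the same statement on every replica (with a different `{replica}` value) is what ties the copies together.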
When you use ReplicatedMergeTree, the inserted data is copied automatically to all the replicas, but all the SELECTs are executed on the server you are connected to. So you can have 5 replicas of your data, but if you always connect to one replica, the traffic will not be 'shared' / 'balanced' automatically between all the replicas: one server will be loaded and the rest will generally do nothing. If you need that balancing of load between multiple replicas, you can use the internal 'load balancer' mechanism provided by the Distributed engine of ClickHouse. As an alternative in that scenario you can work without a Distributed table, but with some external load balancer which will balance the requests between several replicas according to your specific rules or preferences.
The Distributed engine does not store any data, but it can 'point' to the same ReplicatedMergeTree/MergeTree table on multiple servers. To use the Distributed engine you need to configure cluster (<remote_servers>) settings in your ClickHouse server config file.
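For example (the cluster name and hostnames are placeholders), a minimal cluster definition with one shard of 3 replicas could look like this in the server config:

```xml
<remote_servers>
    <!-- "my_cluster" is the name Distributed tables will refer to -->
    <my_cluster>
        <shard>
            <!-- let ReplicatedMergeTree handle replication itself -->
            <internal_replication>true</internal_replication>
            <replica><host>ch-node-1</host><port>9000</port></replica>
            <replica><host>ch-node-2</host><port>9000</port></replica>
            <replica><host>ch-node-3</host><port>9000</port></replica>
        </shard>
    </my_cluster>
</remote_servers>
```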
So let's say you have 3 replicas of a table my_replicated_data with the ReplicatedMergeTree engine. You can create a table with the Distributed engine called my_distributed_replicated_data which will 'point' to all of those 3 servers, and when you select from that my_distributed_replicated_data table, the select will be forwarded to and executed on one of the replicas. So in that scenario each replica will get 1/3 of the requests (but each request will still be fully executed on one chosen replica).
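A sketch of that setup (assuming the cluster in the server config is named my_cluster and the tables live in the default database - both assumptions):

```sql
-- Distributed table that "points" to my_replicated_data on every
-- replica of my_cluster; it stores no data itself.
-- Signature: Distributed(cluster, database, table[, sharding_key])
CREATE TABLE my_distributed_replicated_data AS my_replicated_data
ENGINE = Distributed(my_cluster, default, my_replicated_data);

-- Each SELECT against it is forwarded to one of the replicas:
SELECT count() FROM my_distributed_replicated_data;
```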
All that is great and will work well while one copy of your data fits on one physical server and can be processed by the resources of one server. When you have too much data to be stored/processed on one server, you need to use sharding (it's just a way to split the data into smaller parts). Sharding is also provided by the Distributed engine.
With sharding, data is divided into parts (shards) according to some sharding key. You can use random distribution - let's say, flip a coin to decide on which server each row should be stored - or you can use some 'smarter' sharding scheme, so that data connected to the same subject (say, the same customer) is stored on one server, and data for another subject on another. In that case all the shards have to be queried at the same time and the 'common' result calculated afterwards.
In ClickHouse each shard works independently and processes its part of the data, and inside each shard replication can work. To query all the shards at the same time and combine the final result, the Distributed engine is used. So Distributed works as a load balancer inside each shard, and can combine the data coming from different shards to produce the 'common' result.
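A sketch of such a sharded setup (all names here - my_sharded_data, my_cluster, customer_id - are made up for illustration):

```sql
-- Distributed table over a local table that exists on every shard.
-- The 4th parameter is the sharding key: cityHash64(customer_id)
-- keeps all rows of one customer on the same shard; rand() would
-- give a random distribution instead.
CREATE TABLE my_sharded_data_all AS my_sharded_data
ENGINE = Distributed(my_cluster, default, my_sharded_data, cityHash64(customer_id));
```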
You can use a Distributed table for inserts; in that case it will pass the data to one of the shards according to the sharding key. Or you can insert into the underlying table on one of the shards, bypassing the Distributed table.
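Using the table names from the example above, the two insert paths look like this (the column values are just placeholders):

```sql
-- Path 1: insert through the Distributed table; the sharding key
-- decides which shard each row lands on.
INSERT INTO my_distributed_replicated_data VALUES ('2019-01-01', 1, 'a');

-- Path 2: insert directly into the underlying table on one chosen
-- shard, bypassing the Distributed table.
INSERT INTO my_replicated_data VALUES ('2019-01-01', 2, 'b');
```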
Short summary:
1) start with MergeTree
2) to have several copies of the data, use ReplicatedMergeTree
3) if your data is too big to fit / to process on one server, use sharding
4) to balance the load between replicas and to combine the results of selects from different shards, use a Distributed table
The official tutorial clarifies that a bit: https://clickhouse.yandex/tutorial.html
Please also check @alex-zaitsev's presentation, which covers that subject too: https://www.youtube.com/watch?v=zbjub8BQPyE
( Slides are here: https://yadi.sk/i/iLA5ssAv3NdYGy )
P.S. Actually you can create replication without ZooKeeper and ReplicatedMergeTree, just by using a Distributed table above MergeTree and the internal_replication=false cluster setting, but in that case there is no guarantee that all the replicas will have 100% the same data, so I would rather not recommend that scenario.
Hi,
In order to enable replication you can use tables from the Replicated* family. You can add the Replicated prefix to any *MergeTree engine and specify the znode path and replica name in two extra parameters.
You should have access to an external ZooKeeper cluster and specify it in the ClickHouse configuration file. So, the order of actions is:
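Roughly (hostnames are placeholders): set up a ZooKeeper ensemble, list it in the ClickHouse server config, then create the Replicated tables with their znode path and replica name. The ZooKeeper section of the config looks like this:

```xml
<!-- ClickHouse server config: where to find the ZooKeeper ensemble -->
<zookeeper>
    <node><host>zk-1</host><port>2181</port></node>
    <node><host>zk-2</host><port>2181</port></node>
    <node><host>zk-3</host><port>2181</port></node>
</zookeeper>
```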
In order to have several shards, you should use a Distributed table. A Distributed table is a kind of meta-table over other tables, and you can use it to insert into or select from those tables. You should specify the cluster nodes in the ClickHouse configuration file.
A Distributed table can work with ordinary or Replicated tables. When you use Replicated tables, you should set <internal_replication>true</internal_replication> for each shard in the cluster configuration.
When you insert into a Distributed table, it splits the data between shards according to the sharding_key parameter. Then it inserts the parts into all replicas (or into any one replica per shard if internal_replication is true, because Replicated tables will replicate the data internally).
When you select from a Distributed table, it just reads data from one replica per shard and merges the results.
So, the next steps are: create the Replicated tables on each node, create a Distributed table over them, and set <internal_replication>true</internal_replication> for each shard. More information can be found in the documentation.
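As a sketch, a 2-shard × 2-replica cluster with internal replication enabled could be configured like this (cluster name and hostnames are placeholders):

```xml
<remote_servers>
    <my_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica><host>ch-1a</host><port>9000</port></replica>
            <replica><host>ch-1b</host><port>9000</port></replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica><host>ch-2a</host><port>9000</port></replica>
            <replica><host>ch-2b</host><port>9000</port></replica>
        </shard>
    </my_cluster>
</remote_servers>
```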
Thx a lot guys!!!
But in the case of using sharding and replication at the same time, won't conflicts happen?
Because the records will be split between the servers, but at the same time all the parts will be replicated.
Or would the correct setup be to have, for example, 2 servers as replicas, with one of them being part of a shard along with other servers that are not part of the replica set?
Right?
@filimonov @KochetovNicolai
Is the architecture in this image the best MVP Replicated/Distributed ClickHouse architecture?
@alexey-milovidov
> But in the case of using sharding and replication at the same time, won't conflicts happen?

It is 100% safe, and it is the standard way of using ClickHouse.
Use replication always.
Use replication + sharding if all your data does not fit to one server.
> Is the architecture in this image the best MVP Replicated/Distributed ClickHouse architecture?

There is no such thing as a 'best' architecture in the 'general case'.
A Distributed table over ReplicatedMergeTree tables - that is how it works in ClickHouse if you need sharding + replication.
Kafka as a data source has recently become kind of an 'industry standard' (but other solutions are also possible, depending on the concrete flow / tools used).
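For reference, the usual ClickHouse pattern for consuming Kafka is a Kafka engine table plus a materialized view that moves rows into a MergeTree-family storage table. All names, the topic, the broker, and the schema below are made up for illustration:

```sql
-- Kafka "consumer" table: reading from it consumes messages.
CREATE TABLE events_queue
(
    event_time DateTime,
    event_id   UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-events',
         kafka_format = 'JSONEachRow';

-- Materialized view pushes each consumed batch into a storage table
-- "events" (assumed to be an existing MergeTree-family table).
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT event_time, event_id FROM events_queue;
```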
With Kafka, the final intention is to build real-time analytics/dashboards.
Nice @filimonov thx a lot guy!!!!!
And the last question, for now, hehehe.
I'm going to build a REST API to deliver the data to the front end: run the queries in the database and deliver the results.
Since I'm going to have two servers so that the dashboards keep running if one of them goes down, how do I control which ClickHouse server I run the queries on?
Would you suggest controlling this within the API that I'm going to build - call one server, and if it does not respond, call another?
Or would you suggest adding a proxy, like ProxySQL for example, to rotate the queries and detect when a server is failing?
Thank you again.
If you create the Distributed table on both servers, you can connect to either of them and it will get the data from the ClickHouse cluster behind it.
If one server goes down, you need to switch to the other server.
You can use some proxy in front of ClickHouse (there is also a nice proxy especially for ClickHouse: https://github.com/Vertamedia/chproxy ), but if you have only one proxy, that proxy will be your single point of failure.
I'll probably use direct calls to the servers using some resilient HTTP lib.
Thx a looooot again @filimonov!!!!
Guys!!
I applied this architecture and it works perfectly!!!!
Thanks a lot guys. I'm a beginner with ClickHouse and was also confused, like @filimonov was at the beginning.
And I have a question: how can we determine the size of a shard? Or how many shards should we use? I'm building real-time event analytics for 50 million events per day with 6 servers (3 SSD and 3 HDD).
Thanks again.
You don't have to use sharding if a single server can store and process your data. 50 million rows per day is not that much. Just create dummy data which simulates your case and insert it for a whole year (365 days), which will be 50M * 365 ≈ 18 billion rows. Then check the table sizes to figure out whether you will need sharding or not. To check table sizes you can simply run this query:
```sql
SELECT
    table,
    formatReadableSize(size) AS size,
    rows,
    days,
    formatReadableSize(avgDaySize) AS avgDaySize
FROM
(
    SELECT
        table,
        sum(bytes) AS size,
        sum(rows) AS rows,
        min(min_date) AS min_date,
        max(max_date) AS max_date,
        (max_date - min_date) AS days,
        size / (max_date - min_date) AS avgDaySize
    FROM system.parts
    WHERE active
    GROUP BY table
    ORDER BY rows DESC
)
```