Clickhouse: I wonder how distributed queries work.

Created on 14 Feb 2019  路  4Comments  路  Source: ClickHouse/ClickHouse

(I am not good at English and ask for your understanding.)

I wonder how distributed queries work.

I created tables for question.

create table users_dist (
  id int
) engine = Distributed( ... )

create table users (
  id int
) engine = ReplicatedMergeTree( ... )

And then, I send a following query against Distributed table users_dist.
select count(id) from users_dist group by id

How does ClickHouse collect all the individual id's on one server?

Is there any query optimizations for Distributed tables?

question question-answered

All 4 comments

Basically initial query comes to one server and it becomes responsible for communicating with other servers to provide the requested result. Partial aggregation is done on (by default) one replica of each shard and then this initial server merges these partial results to produce the final result.

Do you have any more specific questions?

Thank you for your answer!

I have more specific questions. I wonder the details of distributed queries.

When I send the query select count(id) from users_dist group by id, the initial server send the query select count(id) from users group by id to the other servers.

What are the partial results of the other servers? Is it in the below format?

| id | count(id) |
--- | ------------
| 1 | 10 |
| 2 | 13 |
| ... | ... |

In the document,

For example, for a query with GROUP BY, data will be aggregated on remote servers, and the intermediate states of aggregate functions will be sent to the requestor server. Then data will be further aggregated.
An intermediate state of the aggregation (for uniq, this is the hash table for calculating the number of unique values). This is an AggregateFunction(...) that can be used for further processing or stored in a table to finish aggregating later

When the ClickHouse processes select count(id) from users_dist group by id, do remote servers response intermediate states in the form of pair(id, count(id) to requestor?

Something like that. ClickHouse is a columnar database, so the responce is send in a form of so called block containing 2 columns.
One with id, another with aggregation function states.

I.e. smth like:

id column: 1,2,3,4,5
count(id) column: 100,100,100,100,100

Was this page helpful?
0 / 5 - 0 ratings