Is there a plan to support "distributed table on distributed table"?
My ClickHouse version is 18.10.3
AFAIK that story is in the backlog, but I'm not sure about the timeline or schedule.
@alexey-milovidov can you clarify that?
See also: #1726
Distributed table on Distributed table is supported if one of them contains no more than one shard.
For example, you can point a Distributed table on one cluster at another cluster; you can even specify more than one replica for load balancing.
If you want to create a Distributed table that looks at multiple clusters and unifies them as shards, you can define a cluster that contains all the shards of all the clusters, and create a Distributed table over this unified cluster.
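A minimal sketch of that "unified cluster" approach (the cluster name `all_shards` and table name `events_local` are hypothetical; the `all_shards` cluster is assumed to be defined in `remote_servers` listing every shard from both underlying clusters):

```sql
-- Assuming each shard of both clusters has a local table `events_local`,
-- and `all_shards` in remote_servers enumerates all of those shards:
CREATE TABLE events_all AS events_local
ENGINE = Distributed(all_shards, default, events_local);
```

Queries against `events_all` then fan out to every shard of every underlying cluster, without nesting Distributed tables.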
> AFAIK that story is in the backlog, but I'm not sure about the timeline or schedule.
> @alexey-milovidov can you clarify that?
Full-featured Distributed-on-Distributed tables are not in our backlog, but it isn't a hard thing to do.
Hm. AFAIR @alex-zaitsev was mentioning that. There are a few use cases where it can be useful:
1) Better abstraction level. Several sharded clusters (for example, one per region), and a supercluster that doesn't need to know how many shards and servers are in each region, and just wants a single (or a few) entry point(s).
2) Avoiding query amplification. Let's say you (theoretically) have 1000 servers in your cluster: you send one query, it sends a sub-query to 1000 servers, and then 1000 results come back. And if you have 1000 qps... :) If some servers are geographically far away it gets even worse, and the connection pool will be full all the time. Instead, imagine a Distributed table looking at 10 Distributed tables, each looking at another 10 Distributed tables, each of which consists of 10 shards. The number of packets/connections on each node is always below 10, i.e. 10 + 10 + 10 = 30 along the whole path, which is much better than 1000.
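A sketch of what that proposed two-level fan-out could look like in DDL (all names are illustrative; note that per the comment above, today this only works if one of the layers has at most one shard, so this is the feature being requested, not something currently supported):

```sql
-- Level 2 (per region): a Distributed table over that region's ~10 shards.
CREATE TABLE events_region AS events_local
ENGINE = Distributed(region_cluster, default, events_local);

-- Level 1 (global): a Distributed table over the regions' entry nodes,
-- each exposing its events_region table.
CREATE TABLE events_global AS events_local
ENGINE = Distributed(regions_entry_cluster, default, events_region);
```

Each node then only ever talks to its ~10 direct children, instead of the root fanning out to all 1000 servers at once.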
The rationale is actually different. If there are 100-1000 nodes in the cluster, it can be quite time-consuming to update the configuration on all nodes whenever anything changes (adding/replacing a node and so on). So having Distributed tables point to sub-clusters, instead of individual nodes, would make this easier: if anything changes, only the sub-clusters need to be reconfigured.
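For illustration, the top-level `remote_servers` config in that scheme might only reference one entry host per sub-cluster (hostnames and cluster names below are hypothetical), so adding or replacing a node inside a region only touches that region's own configuration:

```xml
<!-- Illustrative only: a top-level cluster whose "shards" are entry
     nodes of per-region sub-clusters. Topology changes inside a region
     do not require touching this top-level config. -->
<remote_servers>
    <global_cluster>
        <shard>
            <replica><host>region-a-entry</host><port>9000</port></replica>
        </shard>
        <shard>
            <replica><host>region-b-entry</host><port>9000</port></replica>
        </shard>
    </global_cluster>
</remote_servers>
```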