Hello,
it's not clear from documentation what is the behavior of the system with respect to WITH clause evaluation. Consider the following 2 queries:
WITH dictGet('dict', 'version', toUInt64(0)) as _ver SELECT * from mytableWITH dictGet('dict', 'version', toUInt64(0)) AS _ver SELECT count(*) AS cnt FROM mytable
PREWHERE version = _verBoth queries are sent to distributed table and then to data nodes.
Questions:
For both queries above,
WITH 10 as _ver SELECT ... ?What we really need is a way to achieve the effect in 1 with WITH clause expansion in the DT node. This allows us to establish some common context for query execution on different data nodes. There are many useful application of this feature.
Will dictGet(..) be performed by DT and query sent to the data nodes be re-written as WITH 10 as _ver SELECT ... ?
This depends, without any extra settings it won't be executed on the initiator node (DT) (you can verify this by using something like this with sleep(1) as v select * from remote('127.{2,3}', system.one) where dummy=v and you will see that it will take ~1 second, and this means that the initiator node does not executes sleep() only "data" nodes in parallel, hence 1 second)
But, with optimize_skip_unused_shards, after #8846 it will be executed on the initiator node to filter out unused shards (and also there is another bit, completely internal, so just ignore it, after #9808 it will be executed while obtaining sample block)
Will dictGet be performed ONLY by the data nodes and DT will simply never evaluate the WITH clause?
It will be performed by the local (data) node (you can see this using SET send_logs_level='trace' and after you will execute the query you will see the query that was send to the remote nodes)
What we really need is a way to achieve the effect in 1 with WITH clause expansion in the DT node
Now let's get to the point, this can be done by wrapping dictGet into SELECT in WITH, i.e.:
with (select dictGet('dict', 'version', toUInt64(0))) as v select * from remote('127.{2,3}', system.one) where dummy=v;
Since it will be rewritten as follow:
WITH CAST(1, 'UInt8') AS v SELECT one.dummy FROM system.one WHERE dummy = v
P.S. may require recent version
Wrapping dictGet with a select is a beautiful idea. Thanks so much for the tip @azat! 🥇 💯
Is 20.3.5.53 - the latest LTS - recent enough?
Is 20.3.5.53 - the latest LTS - recent enough?
If it works - then it is recent enough!
Glad it works!
Yes it does work. We see the query expanded exactly like you said. So, the logic here is:
the subquery in WITH has a priority over the main SELECT hence DT must handle it first?
Does the same logic apply to all uncorrelated subqueries or WITH is a special case?
It would be cool to document it for the benefit of others. I found very little about the query plan evaluation strategies in CH. In the meantime, DT in front of data node is a unique design of ClickHouse and there are no standards covering how exactly distributed query processors do this.
We see the query expanded exactly like you said
Actually you can verify this with ANALYZE instead of looking into logs:
set enable_debug_queries=1;
analyze with (select crc64('')) as v select * from remote('127.{2,3}', system.one) where dummy=v;
Row 1:
──────
explain: WITH CAST(0, 'UInt64') AS v
SELECT dummy
FROM remote('127.{2,3}', 'system.one')
WHERE dummy = v
So, the logic here is:
the subquery in WITH has a priority over the main SELECT hence DT must handle it first?
Indeed, actually every scalar subquery (not only in WITH) will be replaced with their results as constants.
Does the same logic apply to all uncorrelated subqueries or WITH is a special case?
Yes
Will dictGet(..) be performed by DT and query sent to the data nodes be re-written as WITH 10 as _ver SELECT ... ?
That's the vanilla ClickHouse WITH, which introduces some aliases to the current query scope. I'd like to name it as CSE (common scalar expression) because it can only hold scalar values. The expressions are evaluated in a column context, which can be treated as a column. Since dictGet('dict', 'version', toUInt64(0)) as _ver is a constant expression, constant folding mechanism kicks in and ClickHouse rewrites it into an scalar value. When involving distributed execution, there is another optimization to send the scalar column value instead of its AST representation to remotes https://github.com/ClickHouse/ClickHouse/pull/7392 . Back to this question, it will be re-written as WITH 10 as _ver but other remote will likely receive a __getScalar(...) function instead.
https://github.com/ClickHouse/ClickHouse/pull/14771 introduces another type of WITH statement, which is more akin to the SQL standard as CTE (common table expression). However the introduced names will only apply to table contexts such as FROM and IN, which is not suitable here.
Will dictGet be performed ONLY by the data nodes and DT will simply never evaluate the WITH clause?
It's possible. https://github.com/ClickHouse/ClickHouse/pull/12567 introduces a table function VIEW which can wrap a query and send it to remote data nodes as is. You can use
select * from cluster(<you_cluster>, view(
WITH dictGet('dict', 'version', toUInt64(0)) AS _ver SELECT count(*) AS cnt FROM mytable PREWHERE version = _ver
))
to achieve the expected behavior.
select * from clusters(
, view(
WITH dictGet('dict', 'version', toUInt64(0)) AS _ver SELECT count(*) AS cnt FROM mytable PREWHERE version = _ver
))
The clusters function needs to be modified to cluster