ClickHouse: Support for non-scalar values in WITH clause

Created on 25 May 2018  ·  19 Comments  ·  Source: ClickHouse/ClickHouse

Will ClickHouse support the WITH clause (common table expression, CTE)? Like https://www.postgresql.org/docs/9.1/static/queries-with.html

The point is that with a CTE I can define a temporary dataset that I can then query multiple times within one statement. Currently, I'm running the same query twice as two subqueries: for example, 1) to find distinct IDs and then 2) to get the actual data that can be joined on those IDs.
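
A sketch of the desired PostgreSQL-style pattern (the table and column names here are hypothetical):

WITH distinct_ids AS (
    SELECT DISTINCT id
    FROM events
    WHERE event_date >= '2018-01-01'
)
SELECT e.*
FROM events AS e
INNER JOIN distinct_ids USING (id);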

Labels: feature, st-accepted

Most helpful comment

Are there any plans to support the WITH clause for non-scalar values?

All 19 comments

Bump.

@simPod I am interested in the same thing, but I've seen it mentioned in other issues that if you have identical subqueries, ClickHouse only actually runs it once.

ClickHouse has supported the WITH clause since mid-2017, but only for scalar values.

Sample syntax:

WITH
  [1,2,3,4,5,6,7,8,9] as numbers,
  arrayMap(x -> x*2, numbers) as doubled_numbers,
  arrayMap(x -> x*x, numbers) as squared_numbers,
  arrayMap( (x,y) -> x+y, doubled_numbers, squared_numbers) as summed_numbers
SELECT
   summed_numbers;
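
The scalar value may also come from a scalar subquery, for example (a minimal sketch assuming a table t with a numeric column x):

WITH (SELECT max(x) FROM t) AS max_x
SELECT max_x - x
FROM t;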

@derekperkins cool, good to know, thanks!

Are there any plans to support the WITH clause for non-scalar values?

Still not planned for the first half of 2019.

Thanks for the update, your work is appreciated!

Do we have any trick to make a temporary table work like WITH?

@alexey-milovidov is it true that

if you have identical subqueries, ClickHouse only actually runs it once

Are there any preconditions for it to work?

I have created this query:

SELECT *
FROM (
    SELECT (intDiv(toUInt32(`time`), 300) * 300) AS `time`,
        avg(volume) AS volume
    FROM (
          SELECT `time`,
              sum(volume) AS volume
          FROM (
                SELECT (intDiv(toUInt32(fetched_at), 60) * 60) AS `time`,
                    port_id,
                    avg("bits_in") / 100 AS volume
                FROM counter_bit
                WHERE port_id IN ('50871ea4-b52a-4179-8d2b-3a8616a0cbba', '7fcd5ffa-c4fb-4bc3-ab4e-26f47100e5ae') AND
                    fetched_at BETWEEN '2019-02-25 16:29:17' AND '2019-02-26 16:29:17'
                GROUP BY `time`, port_id
              )
          GROUP BY `time`
        )
    GROUP BY `time`
    )
     CROSS JOIN (
    SELECT avg(volume) AS avg,
        max((`time`, volume)).2 AS current,
        max((volume, `time`)) AS max,
        quantile(0.95)(volume) AS p95,
        avg(volume) * count() * 60 AS sum
    FROM (
          SELECT `time`,
              sum(volume) AS volume
          FROM (
                SELECT `time`,
                    sum(volume) AS volume
                FROM (
                      SELECT (intDiv(toUInt32(fetched_at), 60) * 60) AS `time`,
                          port_id,
                          avg("bits_in") / 100 AS volume
                      FROM counter_bit
                      WHERE port_id IN ('50871ea4-b52a-4179-8d2b-3a8616a0cbba', '7fcd5ffa-c4fb-4bc3-ab4e-26f47100e5ae') AND
                          fetched_at BETWEEN '2019-02-25 16:29:17' AND '2019-02-26 16:29:17'
                      GROUP BY `time`, port_id
                    )
                GROUP BY `time`
                UNION ALL
                SELECT intDiv(
                           toUInt32(
                               toDateTime('2019-02-25 16:29:17') + number * 60
                               ),
                           60
                           ) * 60 AS `time`,
                    toFloat64(0) AS volume
                FROM numbers(toUInt64(((toDateTime('2019-02-26 16:29:17') - toDateTime('2019-02-25 16:29:17')) / 60)))
              )
          GROUP BY `time`
        )
    )
ORDER BY `time`

where the inner select is identical:

                      SELECT (intDiv(toUInt32(fetched_at), 60) * 60) AS `time`,
                          port_id,
                          avg("bits_in") / 100 AS volume
                      FROM counter_bit
                      WHERE port_id IN ('50871ea4-b52a-4179-8d2b-3a8616a0cbba', '7fcd5ffa-c4fb-4bc3-ab4e-26f47100e5ae') AND
                          fetched_at BETWEEN '2019-02-25 16:29:17' AND '2019-02-26 16:29:17'
                      GROUP BY `time`, port_id

But the number of rows that need to be processed is doubled, so the result is not reused.

@derekperkins Could you please show me the issue or a link supporting your words that ClickHouse only actually runs identical subqueries once? Thank you.

@alexey-milovidov I noticed that the WITH clause speeds up my query, which involves a big bitmap (cardinality 99M) filter, from 394s to 93s. Could you explain why that happens? Thanks!

The original query:

SELECT
    toDate ( order_complete_time ) AS rowData,
    city_tier AS columnData,
    count( DISTINCT mid_seq ),
    toDecimal128 ( SUM( order_total_sales ), 0 ),
    count( 1 ),
    toDecimal128 ( avg( order_total_sales ), 0 ),
    countIf (
        1 = hasAny ( [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ], detail.p_item_code ) 
        AND 0 = hasAll ( [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ], detail.p_item_code ) 
    ) AS addonGC,
    avgIf (
        order_total_sales,
        (
            1 = hasAny ( [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ], detail.p_item_code ) 
            AND 0 = hasAll ( [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ], detail.p_item_code ) 
        ) 
    ) AS addonAC,
    countIf (
        1 = hasAny ( [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ], detail.p_item_code ) 
        AND 0 = hasAll ( [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ], detail.p_item_code ) 
    ) / count( 1 ) AS addonRate 
FROM
    dist_cdp_orders_test 
WHERE
    toDate ( order_complete_time ) GLOBAL IN (
    SELECT
        toDate ( order_complete_time ) AS ROW 
    FROM
        dist_cdp_orders_test 
    WHERE
        ( order_complete_time >= '2019-07-01 00:00:00' AND order_complete_time <= '2019-07-31 23:59:59' ) 
        AND ( bitmapContains ( ( SELECT mid_seqs FROM cdp_segments WHERE seg_id = '8afe10136d333333016d339320100006' LIMIT 1 ), mid_seq ) ) 
    GROUP BY
    ROW 
    ORDER BY
    ROW DESC 
        LIMIT 0,
        10 
    ) 
    AND ( order_complete_time >= '2019-07-01 00:00:00' AND order_complete_time <= '2019-07-31 23:59:59' ) 
    AND ( bitmapContains ( ( SELECT mid_seqs FROM cdp_segments WHERE seg_id = '8afe10136d333333016d339320100006' LIMIT 1 ), mid_seq ) ) 
GROUP BY
    rowData,
    columnData 
ORDER BY
    rowData DESC

Rewrite with WITH clause:

WITH
    [ '2434', '1440', '2435', '4937', '3449', '3452', '1600', '1100', '1403', '3466' ] AS item_codes,
    (SELECT mid_seqs FROM cdp_segments WHERE seg_id = '8afe10136d333333016d339320100006' LIMIT 1) AS bm
SELECT
    toDate ( order_complete_time ) AS rowData,
    city_tier AS columnData,
    count( DISTINCT mid_seq ),
    toDecimal128 ( SUM( order_total_sales ), 0 ),
    count( 1 ),
    toDecimal128 ( avg( order_total_sales ), 0 ),
    countIf (
        1 = hasAny ( item_codes, detail.p_item_code ) 
        AND 0 = hasAll ( item_codes, detail.p_item_code ) 
    ) AS addonGC,
    avgIf (
        order_total_sales,
        (
            1 = hasAny ( item_codes, detail.p_item_code ) 
            AND 0 = hasAll ( item_codes, detail.p_item_code ) 
        ) 
    ) AS addonAC,
    countIf (
        1 = hasAny ( item_codes, detail.p_item_code ) 
        AND 0 = hasAll ( item_codes, detail.p_item_code ) 
    ) / count( 1 ) AS addonRate 
FROM
    dist_cdp_orders_test 
WHERE
    toDate ( order_complete_time ) GLOBAL IN (
    WITH (SELECT mid_seqs FROM cdp_segments WHERE seg_id = '8afe10136d333333016d339320100006' LIMIT 1) AS bm
    SELECT
        toDate ( order_complete_time ) AS ROW 
    FROM
        dist_cdp_orders_test 
    WHERE
        ( order_complete_time >= '2019-07-01 00:00:00' AND order_complete_time <= '2019-07-31 23:59:59' ) 
        AND ( bitmapContains ( bm, mid_seq ) ) 
    GROUP BY
    ROW 
    ORDER BY
    ROW DESC 
        LIMIT 0,
        10 
    ) 
    AND ( order_complete_time >= '2019-07-01 00:00:00' AND order_complete_time <= '2019-07-31 23:59:59' ) 
    AND ( bitmapContains ( bm, mid_seq ) ) 
GROUP BY
    rowData,
    columnData 
ORDER BY
    rowData DESC;

This task is still relevant, but it is not assigned yet. There is a high chance that we will assign it in the coming weeks.

@alexey-milovidov
will this end up like the following?

WITH (SELECT a, b FROM t) as t1
SELECT * FROM t1 WHERE a=1

To reuse subquery results, please save a kid who has suffered so much !!!
WITH (SELECT a, b, c FROM test) AS a_table

You can use temporary tables: https://clickhouse.yandex/docs/en/query_language/create/#temporary-tables
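
A minimal sketch of that workaround, reusing the counter_bit table from the query above (note that temporary tables live only for the current session):

CREATE TEMPORARY TABLE distinct_ports ENGINE = Memory AS
    SELECT DISTINCT port_id
    FROM counter_bit;

SELECT *
FROM counter_bit
WHERE port_id IN (SELECT port_id FROM distinct_ports);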

this is not a very appropriate solution

There are no other solutions. So.

To reuse subquery results, please save a kid who has suffered so much !!!

Let's help this poor kid https://github.com/ClickHouse/ClickHouse/pull/14771
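
For reference, the named-subquery form that the PR aims to enable would look roughly like this (a sketch reusing the names from the example above, not confirmed final syntax):

WITH t1 AS (SELECT a, b FROM t)
SELECT * FROM t1 WHERE a = 1;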

Thx.~

