Clickhouse: [Question] how to make funnel query run faster?

Created on 23 Mar 2018  ยท  19Comments  ยท  Source: ClickHouse/ClickHouse

hi, I need to use clickhouse to funnels query

Data

  • the table trajectory_trackless_complex has 2894082787 rows
  • pageid: 2538 --> 3242 --> 1682, ( homepage --> listpage --> detailpage )

Use sequenceMatch, throw error

    sum(step1_condition) AS home, 
    sum(step2_condition) AS list, 
    sum(step3_condition) AS detail
FROM 
(
    SELECT 
        loginkey, 
        max(pageid = 2538) AS step1_condition, 
        sequenceMatch('(?1).*(?2)')(toDateTime(createtime), pageid = 2538, pageid = 3242) AS step2_condition, 
        sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(createtime), pageid = 2538, pageid = 3242, pageid = 1682) AS step3_condition
    FROM datacenter.trajectory_trackless_complex 
    WHERE intotime = '2018-03-01'
    GROUP BY loginkey
) 

โ†— Progress: 875.45 million rows, 12.42 GB (76.01 million rows/s., 1.08 GB/s.) โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Š 98%Received exception from server (version 1.1.54342):
Code: 160. DB::Exception: Received from hadoo:9000, 10.12.180.114. DB::Exception: Pattern application proves too difficult, exceeding max iterations (1000000). 

Query sql

    sum(step1_time != 0) AS home, 
    sum(step2_time != 0) AS list, 
    sum(step3_time != 0) AS detail
FROM 
(
    SELECT 
        loginkey, 
        groupArray(pageid) AS events, 
        groupArray(createtime) AS events_times, 
        arrayFilter((time, name) -> (name = 2538), events_times, events)[1] AS step1_time, 
        arrayFilter((time, name) -> ((name = 3242) AND (time >= step1_time) AND (step1_time != 0)), events_times, events)[1] AS step2_time, 
        arrayFilter((time, name) -> ((name = 1682) AND (time >= step2_time) AND (step2_time != 0)), events_times, events)[1] AS step3_time
    FROM 
    (
        SELECT 
            pageid, 
            createtime, 
            loginkey
        FROM datacenter.trajectory_trackless_complex 
        WHERE pageid IN (2538, 3242, 1682)
        ORDER BY createtime ASC
    ) 
    GROUP BY loginkey
) 

Execute result

it uses 377sec
image


Question

how to optimize the query ?
Thank you very match

Most helpful comment

@filimonov๏ผŒthanks for your help๏ผŒI use arrayReduce also can realize it.

    countIf(1, step1_time != toDateTime(0)) AS home, 
    countIf(1, step2_time != toDateTime(0)) AS list, 
    countIf(1, step3_time != toDateTime(0)) AS detail
FROM 
(
    SELECT 
        loginkey, 
        minIf(createtime, pageid = 2538) AS step1_time, 
        arrayReduce('min', arrayFilter(time -> ((step1_time != toDateTime(0)) AND (time > step1_time)), groupArrayIf(createtime, pageid = 3242))) AS step2_time, 
        arrayFilter(time -> ((step2_time != toDateTime(0)) AND (time > step2_time)), groupArrayIf(createtime, pageid = 1682))[1] AS step3_time
    FROM datacenter.trajectory_trackless_complex 
    WHERE (pageid IN (2538, 3242, 1682)) AND (intotime = '2018-03-01')
    GROUP BY loginkey
) 

All 19 comments

Probably you've already seen that video https://www.youtube.com/watch?v=YpurT78U2qA ?
Presentation is here: https://events.yandex.com/lib/talks/247/

Your funnel is not too complicated, you can avoid sorting here, just try to play a bit with array operators, and aggregate function modifiers, for example something like that will _probably_ work faster than your select:

SELECT 
    countIf(1,step1_time != toDateTime(0)) AS home, 
    countIf(1,step2_time != toDateTime(0)) AS list, 
    countIf(1,step3_time != toDateTime(0)) AS detail
FROM 
(

    SELECT 
        loginkey, 
        minIf(createtime, pageid = 2538) step1_time,
        minArray(arrayFilter(time ->  step1_time <> toDateTime(0) and time > step1_time, groupArrayIf(createtime, pageid = 3242) )) AS step2_time
        arrayFilter(time -> step2_time <> toDateTime(0) and time > step2_time, groupArrayIf(createtime, pageid = 1682) )[1] AS step3_time
    FROM datacenter.trajectory_trackless_complex 
    WHERE pageid IN (2538, 3242, 1682)
    GROUP BY loginkey
) 

@filimonov :+1: :+1: :+1: it's very useful for me, thank you very match

    countIf(1,step1_time != toDateTime(0)) AS home, 
    countIf(1,step2_time != toDateTime(0)) AS list, 
    countIf(1,step3_time != toDateTime(0)) AS detail
FROM 
(

    SELECT 
        loginkey, 
        minIf(createtime, pageid = 2538) step1_time,
        arrayFilter(time -> step1_time <> toDateTime(0) and time > step1_time, groupArrayIf(createtime, pageid = 3242) )[1] AS step2_time,
        arrayFilter(time -> step2_time <> toDateTime(0) and time > step2_time, groupArrayIf(createtime, pageid = 1682) )[1] AS step3_time
    FROM datacenter.trajectory_trackless_complex 
    WHERE pageid IN (2538, 3242, 1682)
    and intotime='2018-03-01'
    GROUP BY loginkey
) 

@filimonov, hi
Do you have some complicated funnel examples?
for exapmle, the funnel with time window?

@Jaxvk could u publish your dataset? In last year we solved the same window funnel problem. Maybe now I can do some help with it.

@sundy-li, thank you.
yes, here's the table structure and some datas.

CREATE TABLE trajectory_trackless_complex( 
    loginkey UInt64,
    provinceid Int32,
    cityid Int32,
    browser String,
    pageid UInt32,
    visittime DateTime,
    productcode UInt32,
    elepath String,
    createtime DateTime,
    intotime Date
) ENGINE = MergeTree 
PARTITION BY intotime 
ORDER BY (elepath, loginkey, pageid, createtime) 
SETTINGS index_granularity = 8192  ;
    createtime, 
    pageid, 
    loginkey
FROM trajectory_trackless_complex_daily 
WHERE (pageid IN (2538, 3242, 1682)) AND (intotime = '2018-03-01')
LIMIT 5

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€createtimeโ”€โ”ฌโ”€pageidโ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€loginkeyโ”€โ”
โ”‚ 2018-03-01 19:30:24 โ”‚   2538 โ”‚ 1515330059874917 โ”‚
โ”‚ 2018-03-01 20:48:45 โ”‚   2538 โ”‚ 1515330069578762 โ”‚
โ”‚ 2018-03-01 12:29:11 โ”‚   2538 โ”‚ 1515330088311902 โ”‚
โ”‚ 2018-03-01 20:45:29 โ”‚   3242 โ”‚ 1515330100408582 โ”‚
โ”‚ 2018-03-01 18:45:37 โ”‚   3242 โ”‚ 1515330104946826 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

@sundy-li,
here are some examples use udf, can just use clickhouse's native grammar to realize the funnel query with time window? I have no idea.

yeah๏ผŒI am the author of that article

:+1: , filimonov's answer is enough for me right now, but I also want to realize the window funnel.
the udf can't use in new version, so it would be better if clickhouse can support use native api to realize it

You have chaged minArray to just first element of result array for step_2 time. That can give wrong result, because clickhouse can find some event which happened after step3, so step3 will not be foubd

With timeframes, once i've tryed, but it was a scary looking query with about 100 lines which worked for one particular case. Generally it was about spliting one chain of user events to multiple chains when some specific event appear so that was not fair time framing but event-based framing. Not sure if can be generalized:

SELECT
 chainid, 
  ...
FROM  
( 
SELECT
userid,
typeid,
timestamp,
chainid
FROM (
SELECT 
  userid,
  groupArray(event_typeid) as arr_typeid,
  groupArray(timestamp) as arr_timestamps,
  arrayMap( ind -> cityHash64(userid, arraySum( arraySlice( arrayMap( typeid -> typeid = 2, arr_typeid ) , 1, ind -1 ) )), arrayEnumerate(arr_typeid) ) as arr_chainid
FROM  ....
GROUP BY userid
)
ARRAY JOIN arr_typeid as typeid, arr_timestamps as timestamp, arr_chainid as chainid
)
GROUP BY chainid

@filimonov, I see, you are too responsible, thank you very match.
But it can't run through, if use minArray

    countIf(1, step1_time != toDateTime(0)) AS home, 
    countIf(1, step2_time != toDateTime(0)) AS list
FROM 
(
    SELECT 
        loginkey, 
        minIf(createtime, pageid = 2538) AS step1_time, 
        minArray(arrayFilter(time -> ((step1_time != toDateTime(0)) AND (time > step1_time)), groupArrayIf(createtime, pageid = 3242))) AS step2_time
    FROM test.trajectory_trackless_complex_daily 
    WHERE (pageid IN (2538, 3242)) AND (intotime = '2018-03-01')
    GROUP BY loginkey
) 

Received exception from server (version 1.1.54342):
Code: 184. DB::Exception: Received from hadoop:9000, DB::Exception: Aggregate function minIf(createtime, equals(pageid, 2538)) is found inside another aggregate function in query. 

Code: 184. DB::Exception: Received from hadoop:9000, DB::Exception: Aggregate function minIf(createtime, equals(pageid, 2538)) is found inside another aggregate function in query.

Hmmm... Sounds like a bug in ClickHouse. minArray is not 'fair' aggregate function. So it rather should be safe to use other 'real' aggregate functions inside. @alexey-milovidov, is ita bug or intended behaviour?
Minimal sample:

:) select minArray(groupArray(number)) from numbers(100);
Code: 184, e.displayText() = DB::Exception: Aggregate function groupArray(number) is found inside another aggregate function in query, e.what() = DB::Exception

As a woraround - you can just select all array of filtered events and you one more select above. Still should be better that sorting all the event chain

Something like that:

SELECT 
    countIf(1,step1_time != toDateTime(0)) AS home, 
    countIf(1,step2_time != toDateTime(0)) AS list, 
    countIf(1,step3_time != toDateTime(0)) AS detail
(
SELECT 
 loginkey, 
 step1_time,
 minArray(step2_all_times) as step2_time,
 arrayFilter(time -> time > step2_time, step3_all_times )[1] AS step3_time
FROM 
(
    SELECT 
        loginkey, 
        minIf(createtime, pageid = 2538) step1_time,
        arrayFilter(time -> step1_time != toDateTime(0) and time > step1_time, groupArrayIf(createtime, pageid = 3242) ) AS step2_all_times
        arrayFilter(time ->notEmpty(step2_all_times) and time > step1_time, groupArrayIf(createtime, pageid = 1682) ) AS step3_all_times
    FROM datacenter.trajectory_trackless_complex 
    WHERE pageid IN (2538, 3242, 1682)
    GROUP BY loginkey
) 
)

@filimonov๏ผŒthanks for your help๏ผŒI use arrayReduce also can realize it.

    countIf(1, step1_time != toDateTime(0)) AS home, 
    countIf(1, step2_time != toDateTime(0)) AS list, 
    countIf(1, step3_time != toDateTime(0)) AS detail
FROM 
(
    SELECT 
        loginkey, 
        minIf(createtime, pageid = 2538) AS step1_time, 
        arrayReduce('min', arrayFilter(time -> ((step1_time != toDateTime(0)) AND (time > step1_time)), groupArrayIf(createtime, pageid = 3242))) AS step2_time, 
        arrayFilter(time -> ((step2_time != toDateTime(0)) AND (time > step2_time)), groupArrayIf(createtime, pageid = 1682))[1] AS step3_time
    FROM datacenter.trajectory_trackless_complex 
    WHERE (pageid IN (2538, 3242, 1682)) AND (intotime = '2018-03-01')
    GROUP BY loginkey
) 

@Jaxvk Sorry for the delay, If you are still looking for the windowFunnel problem solution. You may try the release with embedded funnelWindow function.

Usage as follows:

    sumIf(c, level >= 1) AS _1,
    sumIf(c, level >= 2) AS _2,
    sumIf(c, level >= 3) AS _3
FROM
(
    SELECT
        level,
        count(*) AS c
    FROM
    (
        SELECT
            loginkey,
            funnelWindow(60, 2523, 2524, 2529)(createtime, pageid) AS level
        FROM trajectory_trackless_complex
        WHERE pageid IN (2523, 2524, 2529)
        GROUP BY loginkey
    )
    GROUP BY level
    ORDER BY level ASC
)

โ”Œโ”€โ”€โ”€โ”€โ”€_1โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€_2โ”€โ”ฌโ”€โ”€โ”€โ”€_3โ”€โ”
โ”‚ 150127 โ”‚ 100118 โ”‚ 99895 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
1 rows in set. Elapsed: 2.023 sec. Processed 1.04 billion rows, 20.74 GB (512.44 million rows/s., 10.25 GB/s.)

The test dataset's amount is about 1 billion, this is the single node performance , and you can share the data by the column loginkey to improve the performance.

The function desc is funnelWindow( window_in_seconds , event1, event2, event3)( time_stamp_col, event_col ).

You can download the deb files in https://pan.baidu.com/s/1AwTXOKsc-08gCWwSi4gdjg , this feature commit is based on the latest master commit 03b01fc24967d8cf46e88b1227e1c3605b63c50d

@sundy-li๏ผŒ:+1: :+1:๏ผŒๅคช็ป™ๅŠ›ไบ†๏ผŒๅฐ่ฏ•ไฝฟ็”จไธ‹๏ผŒๆœŸๅพ…ๅ‘ๅธƒๅˆฐๆญฃๅผ็‰ˆใ€‚
ไฝ†ๆ˜ฏ่ฟ™้‡Œๆœ‰ไธ€ไธช็Ÿ›็›พ็‚น๏ผŒ่‹ฅๅˆ†ๅธƒๅผ่กจๆŒ‡ๅฎšไบ†shard key๏ผŒไฝ†ๅฎž้™…ๆ˜ฏๆŽจ่ๅพ€ๆœฌๅœฐ่กจๅ†™ๅ…ฅ็š„๏ผŒ้‚ฃๆŸฅ่ฏขๅˆ†ๅธƒๅผ่กจๆ—ถ๏ผŒshard key่ฟ˜ๆœ‰ไฝœ็”จๅ—๏ผŸ

@sundy-li
thanks 4 the share.
https://github.com/analysys/olap/tree/master/%E5%BC%80%E6%BA%90%E7%BB%84%E7%AC%AC%E4%B8%80%E5%90%8D
we are following the toutrial on the upper site, try to compile clickHouse.
but failed on git checkout ab7672f329f7736756542268178e6f9f7e32325a

could u release a new patch?

Many Thanks.

@Jingyi244 That code is not maintained by me , you could try the latest windowFunnel code here,based on commit 12905f5c6f621e7e1051e818f04bb4d56cbf06bb

@sundy-li
I have just built clickhouse with the patch. exploring windowFunnel now...

Thanks for the repo, i will try it later.

@filimonov

You have chaged minArray to just first element of result array for step_2 time. That can give wrong result, because clickhouse can find some event which happened after step3, so step3 will not be foubd

I am very happy you mentioned this. We are using Postgres for analytics and analysing CH now.
In Postgres, LATERAL JOIN is used for funnels and this is the result.

action1 | action2 | action3 | action4
------------ | ------------- | ------------- | -------------
16049 | 1403 | 98 | 3

But in CH,

action1 | action2 | action3 | action4
------------ | ------------- | ------------- | -------------
16049 | 1403 | 97 | 3

Query in CH:

    countIf(step1_time != 0) AS action1,
    countIf(step2_time != 0) AS action2,
    countIf(step3_time != 0) AS action3,
    countIf(step4_time != 0) AS action4
FROM
(
    SELECT
        customerid,
        minIf(time, actionid = 1) step1_time,
        arrayFilter(time -> step1_time != 0 and time >= step1_time, groupArrayIf(time, actionid = 2) )[1] AS step2_time,
        arrayFilter(time -> step2_time != 0 and time >= step2_time, groupArrayIf(time, actionid = 3) )[1] AS step3_time,
        arrayFilter(time -> step3_time != 0 and time >= step3_time, groupArrayIf(time, actionid = 4) )[1] AS step4_time
    FROM actions
        WHERE schemaid = 3000000000001 AND time between 1577817000000 AND 1593541800000 and actionid IN (1, 2, 3, 4)
    GROUP BY customerid
);

The query is changed to our case, 1. we use Epoch value timestamp, so no Date conversions 2. Its multi-tenant schema, so extra criteria on tenantid. Other than that, its the same.

I found the userid, got all the data and cross analyzed and they are same in both.
In the query, I collected the timestamps of the event 2 to analyze and this is the result

userid | step1_time | step2_time
------------ | ------------- | -------------
3100000057370 | 1591251656100 | [1593365831100,1592737788900]

The error occurred as the first element is got from the array rather than the minimum.
So I tried the arrayReduce returning the result of the min aggregate, proposed by @lamber-ken and it got fixed.
Also I tried using arraySort and getting first element. The same. It got fixed.

Is there any considerations on performance wise on which to use arrayReduce or arraySort? Seems the average response time for the queries are equal for both the methods.

Was this page helpful?
0 / 5 - 0 ratings