hi, I need to use ClickHouse for funnel queries.
trajectory_trackless_complex has 2894082787 rows.

SELECT
sum(step1_condition) AS home,
sum(step2_condition) AS list,
sum(step3_condition) AS detail
FROM
(
SELECT
loginkey,
max(pageid = 2538) AS step1_condition,
sequenceMatch('(?1).*(?2)')(toDateTime(createtime), pageid = 2538, pageid = 3242) AS step2_condition,
sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(createtime), pageid = 2538, pageid = 3242, pageid = 1682) AS step3_condition
FROM datacenter.trajectory_trackless_complex
WHERE intotime = '2018-03-01'
GROUP BY loginkey
)
Progress: 875.45 million rows, 12.42 GB (76.01 million rows/s., 1.08 GB/s.) ███████████████████████████████████████████ 98%
Received exception from server (version 1.1.54342):
Code: 160. DB::Exception: Received from hadoo:9000, 10.12.180.114. DB::Exception: Pattern application proves too difficult, exceeding max iterations (1000000).
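To see why this query is expensive: sequenceMatch has to test the pattern against every user's full, timestamp-sorted event chain, and long chains blow past the iteration limit. A simplified Python sketch of the `'(?1).*(?2)'` semantics (this is an illustrative model, not ClickHouse's actual matcher, which compiles the pattern and enforces the iteration cap seen in the error):

```python
def sequence_match(events, conds):
    """Simplified model of sequenceMatch('(?1).*(?2)...'):
    events: list of (time, value) pairs; conds: list of predicates.
    Returns True if the conditions can be satisfied in time order,
    with anything allowed in between (the '.*' parts)."""
    events = sorted(events)          # sequenceMatch orders events by timestamp
    i = 0
    for _, value in events:
        if i < len(conds) and conds[i](value):
            i += 1                   # this event satisfies the next step
    return i == len(conds)

# One user's events as (createtime, pageid):
user_events = [(1, 2538), (2, 9999), (3, 3242), (4, 1682)]
reached_list = sequence_match(user_events,
                              [lambda p: p == 2538, lambda p: p == 3242])
reached_detail = sequence_match(user_events,
                                [lambda p: p == 2538,
                                 lambda p: p == 3242,
                                 lambda p: p == 1682])
```

Here both `reached_list` and `reached_detail` are `True`, because 2538, 3242 and 1682 occur in that time order; swapping the first two timestamps would make both `False`.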
SELECT
sum(step1_time != 0) AS home,
sum(step2_time != 0) AS list,
sum(step3_time != 0) AS detail
FROM
(
SELECT
loginkey,
groupArray(pageid) AS events,
groupArray(createtime) AS events_times,
arrayFilter((time, name) -> (name = 2538), events_times, events)[1] AS step1_time,
arrayFilter((time, name) -> ((name = 3242) AND (time >= step1_time) AND (step1_time != 0)), events_times, events)[1] AS step2_time,
arrayFilter((time, name) -> ((name = 1682) AND (time >= step2_time) AND (step2_time != 0)), events_times, events)[1] AS step3_time
FROM
(
SELECT
pageid,
createtime,
loginkey
FROM datacenter.trajectory_trackless_complex
WHERE pageid IN (2538, 3242, 1682)
ORDER BY createtime ASC
)
GROUP BY loginkey
)
This one takes 377 seconds.

How can I optimize the query?
Thank you very much.
Probably you've already seen this video: https://www.youtube.com/watch?v=YpurT78U2qA ?
Presentation is here: https://events.yandex.com/lib/talks/247/
Your funnel is not too complicated, so you can avoid sorting here. Just try to play a bit with array operators and aggregate function combinators; for example, something like this will _probably_ work faster than your select:
SELECT
countIf(1,step1_time != toDateTime(0)) AS home,
countIf(1,step2_time != toDateTime(0)) AS list,
countIf(1,step3_time != toDateTime(0)) AS detail
FROM
(
SELECT
loginkey,
minIf(createtime, pageid = 2538) step1_time,
minArray(arrayFilter(time -> step1_time <> toDateTime(0) and time > step1_time, groupArrayIf(createtime, pageid = 3242) )) AS step2_time,
arrayFilter(time -> step2_time <> toDateTime(0) and time > step2_time, groupArrayIf(createtime, pageid = 1682) )[1] AS step3_time
FROM datacenter.trajectory_trackless_complex
WHERE pageid IN (2538, 3242, 1682)
GROUP BY loginkey
)
@filimonov :+1: :+1: :+1: it's very useful for me, thank you very much.
SELECT
countIf(1,step1_time != toDateTime(0)) AS home,
countIf(1,step2_time != toDateTime(0)) AS list,
countIf(1,step3_time != toDateTime(0)) AS detail
FROM
(
SELECT
loginkey,
minIf(createtime, pageid = 2538) step1_time,
arrayFilter(time -> step1_time <> toDateTime(0) and time > step1_time, groupArrayIf(createtime, pageid = 3242) )[1] AS step2_time,
arrayFilter(time -> step2_time <> toDateTime(0) and time > step2_time, groupArrayIf(createtime, pageid = 1682) )[1] AS step3_time
FROM datacenter.trajectory_trackless_complex
WHERE pageid IN (2538, 3242, 1682)
and intotime='2018-03-01'
GROUP BY loginkey
)
@filimonov, hi.
Do you have some more complicated funnel examples?
For example, a funnel with a time window?
@Jaxvk could you publish your dataset? Last year we solved the same window funnel problem. Maybe now I can help with it.
@sundy-li, thank you.
yes, here's the table structure and some data.
CREATE TABLE trajectory_trackless_complex(
loginkey UInt64,
provinceid Int32,
cityid Int32,
browser String,
pageid UInt32,
visittime DateTime,
productcode UInt32,
elepath String,
createtime DateTime,
intotime Date
) ENGINE = MergeTree
PARTITION BY intotime
ORDER BY (elepath, loginkey, pageid, createtime)
SETTINGS index_granularity = 8192 ;
SELECT
createtime,
pageid,
loginkey
FROM trajectory_trackless_complex_daily
WHERE (pageid IN (2538, 3242, 1682)) AND (intotime = '2018-03-01')
LIMIT 5
┌──────────createtime─┬─pageid─┬─────────loginkey─┐
│ 2018-03-01 19:30:24 │   2538 │ 1515330059874917 │
│ 2018-03-01 20:48:45 │   2538 │ 1515330069578762 │
│ 2018-03-01 12:29:11 │   2538 │ 1515330088311902 │
│ 2018-03-01 20:45:29 │   3242 │ 1515330100408582 │
│ 2018-03-01 18:45:37 │   3242 │ 1515330104946826 │
└─────────────────────┴────────┴──────────────────┘
@sundy-li,
here are some examples that use a UDF. Can the funnel query with a time window be realized using only ClickHouse's native SQL? I have no idea.
yeah, I am the author of that article
:+1: , filimonov's answer is enough for me right now, but I also want to realize the window funnel.
The UDF can't be used in the new version, so it would be better if ClickHouse could support this with a native function.
You have changed minArray to just the first element of the result array for step2_time. That can give a wrong result, because ClickHouse can pick an event which happened after step 3, so step 3 will not be found.
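A small Python sketch of that failure mode (the numbers here are made up for illustration): groupArray preserves insertion order, not time order, so the first element that passes the filter is not necessarily the earliest step-2 event, and picking a later one can push step 3 out of reach.

```python
# One user's funnel, with timestamps as plain integers.
step1_time = 100
step2_candidates = [500, 200]   # step-2 events as collected, NOT time-sorted
step3_events = [300]            # a single step-3 event at t=300

# Taking the FIRST qualifying element (the [1] subscript in the SQL):
step2_first = [t for t in step2_candidates if t > step1_time][0]   # 500

# Taking the MINIMUM qualifying element (what minArray would compute):
step2_min = min(t for t in step2_candidates if t > step1_time)     # 200

step3_via_first = [t for t in step3_events if t > step2_first]  # [] -> step 3 lost
step3_via_min = [t for t in step3_events if t > step2_min]      # [300] -> found
```

With the first-element version the user is counted as never reaching step 3, even though the chain 100 → 200 → 300 exists; the minimum-based version finds it.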
With timeframes, I tried it once, but it was a scary-looking query of about 100 lines which worked for one particular case. Generally it was about splitting one chain of user events into multiple chains whenever a specific event appeared, so that was not real time framing but event-based framing. Not sure if it can be generalized:
SELECT
chainid,
...
FROM
(
SELECT
userid,
typeid,
timestamp,
chainid
FROM (
SELECT
userid,
groupArray(event_typeid) as arr_typeid,
groupArray(timestamp) as arr_timestamps,
arrayMap( ind -> cityHash64(userid, arraySum( arraySlice( arrayMap( typeid -> typeid = 2, arr_typeid ) , 1, ind -1 ) )), arrayEnumerate(arr_typeid) ) as arr_chainid
FROM ....
GROUP BY userid
)
ARRAY JOIN arr_typeid as typeid, arr_timestamps as timestamp, arr_chainid as chainid
)
GROUP BY chainid
@filimonov, I see, you are very conscientious, thank you very much.
But it doesn't run if I use minArray:
SELECT
countIf(1, step1_time != toDateTime(0)) AS home,
countIf(1, step2_time != toDateTime(0)) AS list
FROM
(
SELECT
loginkey,
minIf(createtime, pageid = 2538) AS step1_time,
minArray(arrayFilter(time -> ((step1_time != toDateTime(0)) AND (time > step1_time)), groupArrayIf(createtime, pageid = 3242))) AS step2_time
FROM test.trajectory_trackless_complex_daily
WHERE (pageid IN (2538, 3242)) AND (intotime = '2018-03-01')
GROUP BY loginkey
)
Received exception from server (version 1.1.54342):
Code: 184. DB::Exception: Received from hadoop:9000, DB::Exception: Aggregate function minIf(createtime, equals(pageid, 2538)) is found inside another aggregate function in query.
Hmmm... Sounds like a bug in ClickHouse. minArray is not a 'real' aggregate function, so it should rather be safe to use other 'real' aggregate functions inside it. @alexey-milovidov, is it a bug or intended behaviour?
Minimal sample:
:) select minArray(groupArray(number)) from numbers(100);
Code: 184, e.displayText() = DB::Exception: Aggregate function groupArray(number) is found inside another aggregate function in query, e.what() = DB::Exception
As a workaround, you can just select the whole array of filtered events and add one more select on top. It should still be better than sorting the whole event chain.
Something like this:
SELECT
countIf(1,step1_time != toDateTime(0)) AS home,
countIf(1,step2_time != toDateTime(0)) AS list,
countIf(1,step3_time != toDateTime(0)) AS detail
FROM
(
SELECT
loginkey,
step1_time,
minArray(step2_all_times) as step2_time,
arrayFilter(time -> time > step2_time, step3_all_times )[1] AS step3_time
FROM
(
SELECT
loginkey,
minIf(createtime, pageid = 2538) step1_time,
arrayFilter(time -> step1_time != toDateTime(0) and time > step1_time, groupArrayIf(createtime, pageid = 3242) ) AS step2_all_times,
arrayFilter(time -> notEmpty(step2_all_times) and time > step1_time, groupArrayIf(createtime, pageid = 1682) ) AS step3_all_times
FROM datacenter.trajectory_trackless_complex
WHERE pageid IN (2538, 3242, 1682)
GROUP BY loginkey
)
)
@filimonov, thanks for your help! I can also realize it with arrayReduce:
SELECT
countIf(1, step1_time != toDateTime(0)) AS home,
countIf(1, step2_time != toDateTime(0)) AS list,
countIf(1, step3_time != toDateTime(0)) AS detail
FROM
(
SELECT
loginkey,
minIf(createtime, pageid = 2538) AS step1_time,
arrayReduce('min', arrayFilter(time -> ((step1_time != toDateTime(0)) AND (time > step1_time)), groupArrayIf(createtime, pageid = 3242))) AS step2_time,
arrayFilter(time -> ((step2_time != toDateTime(0)) AND (time > step2_time)), groupArrayIf(createtime, pageid = 1682))[1] AS step3_time
FROM datacenter.trajectory_trackless_complex
WHERE (pageid IN (2538, 3242, 1682)) AND (intotime = '2018-03-01')
GROUP BY loginkey
)
@Jaxvk Sorry for the delay. If you are still looking for a solution to the window funnel problem, you may try the release with the embedded funnelWindow function.
Usage as follows:
SELECT
sumIf(c, level >= 1) AS _1,
sumIf(c, level >= 2) AS _2,
sumIf(c, level >= 3) AS _3
FROM
(
SELECT
level,
count(*) AS c
FROM
(
SELECT
loginkey,
funnelWindow(60, 2523, 2524, 2529)(createtime, pageid) AS level
FROM trajectory_trackless_complex
WHERE pageid IN (2523, 2524, 2529)
GROUP BY loginkey
)
GROUP BY level
ORDER BY level ASC
)
┌─────_1─┬─────_2─┬────_3─┐
│ 150127 │ 100118 │ 99895 │
└────────┴────────┴───────┘
1 rows in set. Elapsed: 2.023 sec. Processed 1.04 billion rows, 20.74 GB (512.44 million rows/s., 10.25 GB/s.)
The test dataset is about 1 billion rows; this is single-node performance. You can shard the data by the loginkey column to improve performance further.
The function signature is funnelWindow( window_in_seconds , event1, event2, event3)( time_stamp_col, event_col ).
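A compact Python model of that signature's semantics, as I read them from the description and the example above (an illustrative sketch, not the actual implementation): for each user, find the longest prefix of the step sequence that occurs in time order within `window` seconds of a step-1 event.

```python
def funnel_window(window, steps, events):
    """Model of funnelWindow(window, e1, e2, ...)(time, event):
    events: list of (time, event_id) pairs for ONE user;
    steps: ordered list of event ids.
    Returns the deepest funnel level reached within the window."""
    best = 0
    events = sorted(events)                       # process in time order
    for i, (t0, e0) in enumerate(events):
        if e0 != steps[0]:
            continue                              # each step-1 event opens a window
        level, deadline = 1, t0 + window
        for t, e in events[i + 1:]:
            if t > deadline:
                break                             # outside the time window
            if level < len(steps) and e == steps[level]:
                level += 1                        # next step matched in order
        best = max(best, level)
    return best

# A user who completes all three steps within 60 seconds:
evts = [(0, 2523), (30, 2524), (50, 2529)]
funnel_window(60, [2523, 2524, 2529], evts)   # -> 3
funnel_window(40, [2523, 2524, 2529], evts)   # -> 2 (step 3 falls outside)
```

The surrounding SQL then just counts users per level and turns the level histogram into cumulative step counts with `sumIf(c, level >= n)`.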
You can download the deb files in https://pan.baidu.com/s/1AwTXOKsc-08gCWwSi4gdjg , this feature commit is based on the latest master commit 03b01fc24967d8cf46e88b1227e1c3605b63c50d
@sundy-li, :+1: :+1:, that's awesome! I'll try it out, and I look forward to it being released officially.
But there is one contradiction here: if the distributed table specifies a shard key, but in practice data is written to the local tables as recommended, does the shard key still have any effect when querying the distributed table?
@sundy-li
Thanks for sharing.
https://github.com/analysys/olap/tree/master/%E5%BC%80%E6%BA%90%E7%BB%84%E7%AC%AC%E4%B8%80%E5%90%8D
We are following the tutorial on the site above, trying to compile ClickHouse,
but it failed on git checkout ab7672f329f7736756542268178e6f9f7e32325a.
Could you release a new patch?
Many thanks.
@Jingyi244 That code is not maintained by me. You could try the latest windowFunnel code here, based on commit 12905f5c6f621e7e1051e818f04bb4d56cbf06bb.
@sundy-li
I have just built ClickHouse with the patch. Exploring windowFunnel now...
Thanks for the repo, i will try it later.
@filimonov
You have changed minArray to just the first element of the result array for step2_time. That can give a wrong result, because ClickHouse can pick an event which happened after step 3, so step 3 will not be found.
I am very glad you mentioned this. We are using Postgres for analytics and evaluating ClickHouse now.
In Postgres, a LATERAL JOIN is used for funnels, and this is the result:
action1 | action2 | action3 | action4
------------ | ------------- | ------------- | -------------
16049 | 1403 | 98 | 3
But in CH,
action1 | action2 | action3 | action4
------------ | ------------- | ------------- | -------------
16049 | 1403 | 97 | 3
Query in CH:
SELECT
countIf(step1_time != 0) AS action1,
countIf(step2_time != 0) AS action2,
countIf(step3_time != 0) AS action3,
countIf(step4_time != 0) AS action4
FROM
(
SELECT
customerid,
minIf(time, actionid = 1) step1_time,
arrayFilter(time -> step1_time != 0 and time >= step1_time, groupArrayIf(time, actionid = 2) )[1] AS step2_time,
arrayFilter(time -> step2_time != 0 and time >= step2_time, groupArrayIf(time, actionid = 3) )[1] AS step3_time,
arrayFilter(time -> step3_time != 0 and time >= step3_time, groupArrayIf(time, actionid = 4) )[1] AS step4_time
FROM actions
WHERE schemaid = 3000000000001 AND time between 1577817000000 AND 1593541800000 and actionid IN (1, 2, 3, 4)
GROUP BY customerid
);
The query is adapted to our case: 1. we use epoch-millisecond timestamps, so no Date conversions; 2. it's a multi-tenant schema, so there is an extra criterion on the tenant id. Other than that, it's the same.
I found the userid in question, pulled all its data and cross-checked; it is the same in both systems.
In the query, I collected the timestamps of event 2 to analyze, and this is the result:
userid | step1_time | step2_time
------------ | ------------- | -------------
3100000057370 | 1591251656100 | [1593365831100,1592737788900]
The discrepancy occurred because the first element is taken from the array rather than the minimum.
So I tried arrayReduce returning the result of the min aggregate, as proposed by @lamber-ken, and it got fixed.
I also tried arraySort and taking the first element. Same thing, it got fixed.
Are there any performance considerations for choosing between arrayReduce and arraySort? The average response times of the queries seem equal for both methods.
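Both forms produce the same value, but in principle they do different amounts of work: a min-reduce is a single O(n) pass over the array, while arraySort sorts the whole array, O(n log n), only to discard everything but the head. A quick Python analogue of the two approaches (the equal benchmark times in the question are plausible because per-user event arrays are usually short):

```python
import random

# Per-user array of event timestamps (hypothetical data).
times = [random.randint(0, 10**9) for _ in range(10_000)]

# Analogue of arrayReduce('min', arr): one linear scan.
min_val = min(times)

# Analogue of arraySort(arr)[1]: full sort, then take the first element.
sort_val = sorted(times)[0]

# Same answer either way; the reduce variant should scale better
# if the per-user arrays ever get long.
assert min_val == sort_val
```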