We encountered a problem while doing update from 20.5.3.27 to version 20.9.2.20 on Clickhouse.
We made materialized view:
Example with arrayEnumerate:
CREATE MATERIALIZED VIEW case.view_mv_local
(
event_time DateTime,
tenant_id UInt32,
case_id UInt8,
partner_id UInt16
)
ENGINE = SummingMergeTree()
PARTITION BY toStartOfDay(event_time)
ORDER BY (event_time,
tenant_id,
case_id,
partner_id)
TTL event_time + toIntervalDay(7)
SETTINGS index_granularity = 8192 AS
SELECT
toStartOfHour(event_time) AS event_time,
tenant_id,
case_id,
partner_ids[num] AS partner_id
FROM case.view_local
ARRAY JOIN arrayEnumerate(partner_ids) AS num
GROUP BY
event_time,
tenant_id,
case_id,
partner_id;
After insert of data in Clickhouse we get this error:
err: clickhouse: Code: 8, e.displayText() = DB::Exception: Cannot find column session_id in source stream: while pushing to view case.view_mv_local (version 20.9.2.20 (official build))
We have column session_id in table case.view_local but it wasn't mentioned in MV. This worked well until we did update of Clickhouse on our system. We realised that we have problem with (ARRAY JOIN arrayEnumerate(partner_ids) AS num). When we dont use it on newest version on Clickhouse and just put [1] for arrays, everything starts to work fine.
Example with [1]:
CREATE MATERIALIZED VIEW case.view_mv_local
(
event_time DateTime,
tenant_id UInt32,
case_id UInt8,
partner_id UInt16
)
ENGINE = SummingMergeTree()
PARTITION BY toStartOfDay(event_time)
ORDER BY (event_time,
tenant_id,
case_id,
partner_id)
TTL event_time + toIntervalDay(7)
SETTINGS index_granularity = 8192 AS
SELECT
toStartOfHour(event_time) AS event_time,
tenant_id,
case_id,
partner_ids[1] AS partner_id
FROM case.view_local
GROUP BY
event_time,
tenant_id,
case_id,
partner_id;
source table:
CREATE TABLE case.view_local
(
session_id FixedString(11) CODEC(ZSTD(1)),
tenant_id UInt32 DEFAULT 0,
case_id UInt8 DEFAULT 0,
partner_ids Array(UInt16),
event_time DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (tenant_id,
event_time)
TTL event_time + toIntervalDay(60)
SETTINGS index_granularity = 8192;
Insert query:
INSERT INTO case.view_local
(session_id, tenant_id, case_id, partner_ids, event_time)
VALUES('SsO12":{fvc', 559041, 5512, [12,13,54,40], '2020-09-25 11:00:55');
Can you please tell me what could be the problem on Clickhouse in version 20.9.2.20 for arrayEnumerate?
sidenote:
you don't need arrayEnumerate.
These work
partner_id
...
ARRAY JOIN partner_ids AS partner_id
or not simply arrayJoin
select .... arrayJoin(partner_ids) as partner_id
without ARRAY JOIN
STR:
$ cat t.sql
drop table if exists st;
CREATE TABLE st(sssss String, pids Array(UInt16), t DateTime)
ENGINE = MergeTree() ORDER BY tuple();
drop table if exists stv;
CREATE MATERIALIZED VIEW stv ENGINE = MergeTree() ORDER BY tuple()
AS SELECT toStartOfHour(t) AS t, pids[num] pid
FROM st ARRAY JOIN arrayEnumerate(pids) AS num
GROUP BY t, pid;
INSERT INTO st VALUES('SsO12":{fvc', [12,13,54,40], '2020-09-25 11:00:55');
select * from stv;
$ clickhouse-client -mn <t.sql
Received exception from server (version 20.9.2):
Code: 8. DB::Exception: Received from localhost:9000. DB::Exception: Cannot find column sssss in source stream: while pushing to view dw.stv.
20.10.1.4751 FAIL.
20.9.2.20 FAIL.
20.8.3.18 FAIL.
20.7.3.7 OK.
Den, thank you for this workaround and for your fast response. We will try to implement this in our system.
Tell me if I am wrong, as I understand, this is clearly a bug? Does this mean that we should stop using arrayEnumerate and from version 20.8.3.18 start using your solution?
ArrayEnumerate was very elegant solution, and when we create num through one column that is type array, we then just use "num" everywhere because lenght of all columns (that are type array) in table is the same. We would just use client_ids(num) as client_id, partner_ids(num) as partner_id and so on...On our production database we have tables and MVs with more columns that are type array(UInt32, String...). Does this mean that we will have to create MV for example:
select .... arrayJoin(partner_ids) as partner_id, arrayJoin(client_ids) as client_id, arrayJoin(case_ids) as case_id...and so on?
I tried to use this logic on production table where we have 5 arrays of data. And when i insert one row with 5 arrays in it, and every array has 4 elements, I end up having 20 rows in my MV table...hmmm
Am I using this wrong? I put this logic in select just like you told me: arrayJoin(partner_ids) as partner_id, arrayJoin(client_ids) as client_id, arrayJoin(case_ids) as case_id...When I use arrayEnumerate as num, Clickhouse uses every first, second, third element from all rows and process it. With arrayJoin I end up processing frist element from first array, second element from second array, third element from fifth array and so on...There is no logic behind this, and the point is that all first elements in arrays are connected, just like all second and third and so on...:)
it's definitely a bug.
Does this mean that we should stop using arrayEnumerate and from version 20.8.3.18 start using your solution?
It does mean that you should not use 20.8.
ArrayEnumerate was very elegant solution
It's not elegant at all.
Nobody use it for this usecase. ArrayEnumerate brings excessive array and memory overhead.
select .... arrayJoin(partner_ids) as partner_id, arrayJoin(client_ids) as client_id, arrayJoin(case_ids) as case_id...and so on?
I mentioned 2 solutions. Array Join and arrayJoin
select a,b from
(select [1,2,3] a_arr, ['a','b','c'] b_arr)
array join a_arr as a, b_arr as b
โโaโโฌโbโโ
โ 1 โ a โ
โ 2 โ b โ
โ 3 โ c โ
โโโโโดโโโโ
select .... arrayJoin(partner_ids) as partner_id, arrayJoin(client_ids) as client_id, arrayJoin(case_ids) as case_id.
select .... partner_id, client_id, case_id
from table array join partner_ids as partner_id, client_ids as client_id, case_ids as case_id
Den, this solved our problem:
select .... partner_id, client_id, case_id
from table array join partner_ids as partner_id, client_ids as client_id, case_ids as case_id
Thank you very much!
Please don't close it. It's still a bug in arrayEnumerate and it should be addressed.
fixed: https://github.com/ClickHouse/ClickHouse/pull/15717
clickhouse-client -mn
2020-09-25 11:00:00 12
2020-09-25 11:00:00 54
2020-09-25 11:00:00 13
ClickHouse client version 20.10.1.4853.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.10.1 revision 54441.
Most helpful comment
it's definitely a bug.
It does mean that you should not use 20.8.
It's not elegant at all.
Nobody use it for this usecase. ArrayEnumerate brings excessive array and memory overhead.
I mentioned 2 solutions. Array Join and arrayJoin