Can we build HLLs on the client side, before inserting data into ClickHouse?
I was going through https://www.altinity.com/blog/2017/7/10/clickhouse-aggregatefunctions-and-aggregatestate
```sql
create table events_unique (
    date Date,
    group_id String,
    client_id String,
    event_type String,
    product_id String,
    value AggregateFunction(uniq, String)
) ENGINE = MergeTree(date, (group_id, client_id, event_type, product_id, date), 8192);
```

```sql
INSERT INTO events_unique
SELECT date, group_id, client_id, event_type, product_id, uniqState(visitor_id) AS value
FROM events
GROUP BY date, group_id, client_id, event_type, product_id;
```
Here the `value` column of type `AggregateFunction` is built by first inserting the raw data into ClickHouse and then building the HLL table on top of it. Rather than doing the above steps, can we compute the `AggregateFunction` state directly on the client side? For example, with [postgres-hll](https://github.com/AdRoll/python-hll) one could compute it offline and then bulk upload it:
```python
from python_hll.hll import HLL
import mmh3

# hash the raw value, then add the hash to the HLL sketch
value_to_hash = 'foo'
hashed_value = mmh3.hash(value_to_hash)

hll = HLL(13, 5)  # log2m=13, regwidth=5
hll.add_raw(hashed_value)
```
You can use a Null table and a materialized view (MV):
```sql
create table events_unique (
    date Date,
    group_id String,
    client_id String,
    event_type String,
    product_id String,
    value AggregateFunction(uniq, String)
) ENGINE = MergeTree(date, (group_id, client_id, event_type, product_id, date), 8192);

create table events (
    date Date,
    group_id String,
    client_id String,
    event_type String,
    product_id String,
    visitor_id String
) Engine = Null;

create materialized view events_unique_mv to events_unique as
SELECT date, group_id, client_id, event_type, product_id, uniqState(visitor_id) AS value
FROM events
GROUP BY date, group_id, client_id, event_type, product_id;

insert into events values (.....)
```
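A minimal sketch of the flow, assuming the schema above (the row values are made up): rows inserted into the `events` Null table are aggregated by the MV into `events_unique`, and the stored states are read back with `uniqMerge`:

```sql
-- hypothetical example row; the Null table stores nothing,
-- but the materialized view aggregates the insert into events_unique
INSERT INTO events VALUES ('2019-06-01', 'g1', 'c1', 'view', 'p1', 'visitor_42');

-- merge the stored uniq states to get the approximate unique-visitor count
SELECT group_id, event_type, uniqMerge(value) AS uniques
FROM events_unique
GROUP BY group_id, event_type;
```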
Or use the `input()` table function: https://clickhouse.yandex/docs/en/query_language/table_functions/input/
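A sketch of the `input()` approach, assuming the same `events_unique` schema; the raw rows are supplied by the client as the INSERT data and aggregated into states on the fly:

```sql
-- hypothetical: aggregate raw rows into uniq states while inserting,
-- without keeping a persistent raw-events table; the data block follows the query
INSERT INTO events_unique
SELECT date, group_id, client_id, event_type, product_id, uniqState(visitor_id) AS value
FROM input('date Date, group_id String, client_id String, event_type String, product_id String, visitor_id String')
GROUP BY date, group_id, client_id, event_type, product_id
FORMAT TSV
```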
> Rather than doing the above steps, can we directly compute AggregateFunction on the client side?
clickhouse-local can compute them and write them out in the Native format.
The `arrayReduce` function with `uniqState` can also be used to build a value of `AggregateFunction(uniq, String)` type.
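A minimal sketch of that, assuming the `events_unique` table above (the literal values are made up):

```sql
-- hypothetical: arrayReduce('uniqState', ...) turns a plain array of values
-- into a uniq aggregation state that can be inserted directly
INSERT INTO events_unique
SELECT
    toDate('2019-06-01'), 'g1', 'c1', 'view', 'p1',
    arrayReduce('uniqState', ['visitor_1', 'visitor_2', 'visitor_1']);
```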
Thanks @den-crane and @alexey-milovidov !
I will check out clickhouse-local.
I was looking more for a Java/Python ClickHouse library that could generate Native-format data, so that the final ClickHouse data could be built with Spark and then bulk uploaded. It might still be an expensive operation to do with ClickHouse itself. Basically I'm looking for ways to generate the data for the `events_unique` table entirely with Spark from the `events` data:
```sql
create table events_unique (
    date Date,
    group_id String,
    client_id String,
    event_type String,
    product_id String,
    value AggregateFunction(uniq, String)
) ENGINE = MergeTree(date, (group_id, client_id, event_type, product_id, date), 8192);
```
The HLL format is quite complicated. If you can afford to store extra bytes, you can use a roaring bitmap and then materialize it to a uniq state:
```sql
CREATE TABLE example (
    -- byte payload for RoaringBitmap:
    -- format: Byte(1), VarInt(SerializedSizeInBytes), ByteArray(RoaringBitmap)
    roaring_bitmap AggregateFunction(groupBitmap, UInt32),
    uniq_state AggregateFunction(uniq, UInt32)
        MATERIALIZED arrayReduce('uniqState', bitmapToArray(roaring_bitmap))
) ENGINE = Null
```
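A hedged sketch of the same conversion done entirely server-side, only to illustrate what a client-built bitmap payload would replace (`numbers()` stands in for real data):

```sql
-- hypothetical sanity check: the bitmap cardinality and the uniq estimate
-- materialized from that same bitmap should roughly agree
SELECT
    bitmapCardinality(groupBitmapState(toUInt32(number))) AS exact_count,
    finalizeAggregation(arrayReduce('uniqState',
        bitmapToArray(groupBitmapState(toUInt32(number))))) AS approx_uniq
FROM numbers(100000);
```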
So is there a library with which we can directly build a roaring bitmap in Spark and then bulk upload it?
You can use a standard roaring bitmap implementation. You need to prefix it with the length encoded as a VarInt and a byte marker saying that the blob is large ("large" here means that it uses a roaring bitmap), see:
Thanks @kanterov
I will give it a try.