Clickhouse: [Question] Best practice for single value update (additional table/workaround)

Created on 18 Dec 2017 · 50 Comments · Source: ClickHouse/ClickHouse

Hi there,
I have a database that is populated early on, and at a later time a single value can change: in this case, whether the content was seen or not.

In databases that support UPDATE I just change the default boolean from false to true.
In ClickHouse my database layout is like this:

CREATE TABLE impressions
ENGINE = MergeTree(`date`, (`impressionUUID`, `date`), 8192)
AS SELECT
toFixedString(UUIDStringToNum(`impressionUUID`), 16) AS `impressionUUID`,
toDate(`timestamp`) as `date`,
[...]
CREATE TABLE impressions_seen
ENGINE = MergeTree(`date`, (`impressionUUID`, `date`), 8192)
AS SELECT
toFixedString(UUIDStringToNum(`impressionUUID`), 16) AS `impressionUUID`,
assumeNotNull(`seen`) AS `seen`,
toDate(`timestamp`) as `date`,
`timestamp`
FROM log
WHERE `seen` = 1;

The query would be:

SELECT
    language as _group,
    toDate(timestamp) as _date,
    [...]
    COUNT(seen) as seen
FROM impressions
ANY LEFT JOIN impressions_seen USING(impressionUUID)
WHERE `timestamp` BETWEEN '2017-01-01 00:00:00' AND '2017-11-28 00:00:00'
GROUP BY _group, _date;

Now I do not know the inner workings of ClickHouse. Does it make sense to set the key on impressionUUID for both tables instead of the timestamp?
Is MergeTree the best engine for the seen table?
Does date add any value on the seen table? (seen is only used in joins)
Is there a way to improve the join, e.g. by counting the hits instead of seen which is always 1?

Labels: question, st-need-info


All 50 comments

Did you check CollapsingMergeTree or ReplacingMergeTree? Both allow you to insert a new version of an existing row at any time. During the background merge process the older version of the row gets replaced/collapsed. If you add the FINAL clause to your SELECTs you will always get the freshest row. Without the FINAL clause you can get 2 variants of the same row until background merges remove the older one.
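
For illustration, a minimal sketch of what a ReplacingMergeTree variant of the schema above could look like (legacy engine syntax as used elsewhere in this thread; the table name, the version column and the seen default are assumptions, not the author's actual schema):

-- Same data as the impressions table, plus seen and a version column.
-- Marking an impression as seen means inserting the full row again with
-- seen = 1 and a higher version; the older row disappears during merges.
CREATE TABLE impressions_replacing
ENGINE = ReplacingMergeTree(`date`, (`impressionUUID`, `date`), 8192, `version`)
AS SELECT
    toFixedString(UUIDStringToNum(`impressionUUID`), 16) AS `impressionUUID`,
    toUInt8(0) AS `seen`,
    toUInt32(0) AS `version`,
    toDate(`timestamp`) AS `date`,
    `timestamp`
FROM log;

-- Reading the freshest variant of each row (correct, but slower than a plain SELECT):
SELECT count() FROM impressions_replacing FINAL WHERE seen = 1;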

The update could happen seconds or hours later. My understanding is that merging happens at random times and only for not-yet-merged rows. Did I get that wrong?

Each insert creates its own part. Later, small parts are merged together and bigger parts appear, which will also be merged later. That's why it's called MergeTree. During each merge the data is sorted according to the primary key. If you're using a Replacing / Collapsing merge tree, duplicate rows are removed during merges. If you're using an Aggregating / Summing merge tree, rows with the same primary key get grouped.

Parts from different partitions (by default: months) are never merged. Once a part reaches the maximum size, it is not merged with others.

So if you insert that new version of the row several hours later, it will be merged with some smaller parts which were inserted recently. Later those bigger parts will be merged into even bigger ones. At some moment in time the parts containing the old version of the row and the new version of the row will be merged, and the older variant of the row will be removed.

Look at this picture from Wikipedia:

LSM-Tree

If you add the FINAL keyword it works like a GROUP BY on the primary key to get the freshest variant of the row, so SELECTs with the FINAL clause work noticeably slower than regular SELECTs.

OK, so in order for it to work in real time I would need to use FINAL in the query? Isn't the solution with a second table and a join faster then, and especially more accurate?

The best way is to try. Each case is different - testing on your data, with your requests and your load, is always the best way to check which approach suits you best.

If your list of _seen_ records is small - you can even use a dictionary for that.

It's only was seen / wasn't seen. I don't think a dictionary would help here. I'm using 0/1 for that, or with the join a LEFT JOIN exists check.
At this time I do not have the amount of data necessary to run a proper test, hence I'm looking for best practices :) Currently it's in the millions/month, but ultimately it should deal with billions/month. Around 50% of the records need an update to seen.

If it's millions and 50% of the records are "seen", a dictionary is definitely not an option.

I still think that CollapsingMergeTree / ReplacingMergeTree will be better for your case. It's more universal (with those engines you can also make other updates), and if you update only fresh data (as you said: seconds up to hours) - that means that all older data will stay untouched and ready to use (no need to merge big old parts with small newly inserted ones); collapsing / replacing will only happen while merging relatively fresh and relatively small parts.

With an extra table you will need to make joins and read 2 tables all the time. On my data joins are usually slower than GROUP BY (the FINAL clause is a kind of GROUP BY on the primary key), so I think it will also be faster. If you don't need to make reports in realtime you can also run OPTIMIZE TABLE ... before building reports instead of the FINAL clause in SELECTs.
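
As a small illustration of that pre-report step (table name taken from the hypothetical sketch above, partition id is a placeholder):

-- Force merges for one month partition so that SELECTs without FINAL
-- already see collapsed/replaced rows.
OPTIMIZE TABLE impressions_replacing PARTITION '201711' FINAL;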

But I repeat: the only correct way to get the answer - is to test. If you don't have real data yet, you can make tests on some dummy data, and some simulated load.

About best practices and your SELECTs: in ClickHouse, if you have some typical SELECT pattern, it's better to add the columns which are always used in GROUP BY to the primary key.
So if in your SELECTs you're always (or almost always) grouping the data by language and date, then it would be a nice idea to add them at the beginning of your primary key.
Also, if you already have a date column used for partitioning, it's always better to set up conditions on that date field, to help ClickHouse understand which partitions should be read. The optimizer in ClickHouse is quite simple, and probably it will not understand that toDate(timestamp) AS _date is actually the same as the date column, and it will faithfully convert timestamps to dates for each row instead of using the column where that data is stored ready to use.
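
A hedged sketch of both suggestions combined (column list abridged, table name hypothetical): grouping columns placed at the front of the primary key, and an explicit filter on the partitioning date column next to the timestamp filter.

CREATE TABLE impressions_by_language
ENGINE = MergeTree(`date`, (`language`, `date`, `impressionUUID`), 8192)
AS SELECT
    `language`,
    toFixedString(UUIDStringToNum(`impressionUUID`), 16) AS `impressionUUID`,
    toDate(`timestamp`) AS `date`,
    `timestamp`
FROM log;

-- The condition on `date` lets ClickHouse skip whole partitions instead of
-- converting `timestamp` to a date for every row.
SELECT
    language AS _group,
    `date` AS _date,
    count() AS impressions
FROM impressions_by_language
WHERE `date` BETWEEN '2017-01-01' AND '2017-11-28'
  AND `timestamp` BETWEEN '2017-01-01 00:00:00' AND '2017-11-28 00:00:00'
GROUP BY _group, _date;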

Thank you very much. Once I have a big dataset with load I'll do benchmarks with different schemas. Unfortunately that isn't the case yet and therefore I'm thankful for your guidance which allows me to get the most out of what I have.

A single table also has the advantage that I can use the join for other things without the need of workarounds that probably hurt performance :)

If the update occurs usually within 5 minutes, will merging be fast? E.g. if I run a realtime query without final, will only the last hour be off, since everything older is very likely to be merged by then?

About best practices and your SELECTs: in ClickHouse, if you have some typical SELECT pattern, it's better to add the columns which are always used in GROUP BY to the primary key.
So if in your SELECTs you're always (or almost always) grouping the data by language and date, then it would be a nice idea to add them at the beginning of your primary key.
Also, if you already have a date column used for partitioning, it's always better to set up conditions on that date field, to help ClickHouse understand which partitions should be read. The optimizer in ClickHouse is quite simple, and probably it will not understand that toDate(timestamp) AS _date is actually the same as the date column, and it will faithfully convert timestamps to dates for each row instead of using the column where that data is stored ready to use.

I always group by date and one of the variables, but I need timezone support. If I understood you correctly, it would be better to group and select on the date column used by the MergeTree? But I guess this won't work if I need the data grouped in a specific timezone, will it?

The second group (_group) is always different, it can be a single column or multiple. Therefore, if needed one table for each group might be helpful? (since disk space is cheaper than computing power)

Btw. do you know if there is any gain from using NVMe SSDs (4-5 times the iops and seq. read) over SATA SSDs?

If the update occurs usually within 5 minutes, will merging be fast? E.g. if I run a realtime query without final, will only the last hour be off, since everything older is very likely to be merged by then?

There is no guarantee that it will merge the parts as you would expect.

So, let's say your A parts contain impressions, and your B parts the "seen" information, and they are inserted like A1-B1-A2-B2-A3-B3 but merged as A1B1A2, B2A3, B3. Then you will apparently still have duplicates.

Just try it with the FINAL keyword. ClickHouse is really fast; most probably the performance will satisfy your needs. For now it sounds like you're doing premature optimizations.

The second group is always different, it can be a single column or multiple. Therefore, if needed one table for each group might be helpful? (since disk space is cheaper than computing power)

Just start with one table, and don't do too many premature optimizations. It sounds like you haven't tried yet, but are expecting it to be slow. Most probably you will be surprised how fast ClickHouse can be, especially if you're switching from MySQL / Postgres :) And if you are not satisfied - there are a lot of options for making it even faster, including sharding, materialized views with preaggregation, etc.

Btw. do you know if there is any gain from using NVMe SSDs (4-5 times the iops and seq. read) over SATA SSDs?

Are you asking about ClickHouse or in general? :) In ClickHouse disk reads happen only for "cold" data. If you have a lot of memory (>128Gb), and most of the parts which are actively used fit in memory, disk speed is not too big a problem, as disk reads don't happen that often. So if your active 'hot' data fits into RAM and your SELECTs are rather local and request quite compact chunks of data (for example the freshest), better disks give only a modest performance improvement. Faster disks / storage will improve performance significantly only if you're selecting a lot of 'cold' data.

We're using regular HDDs for ClickHouse. AFAIK Yandex also uses HDDs in their Metrika cluster.

I see, so I'm better off saving on the disks and putting the savings into memory.
Thank you for all the specific information :)

Btw. do you know if there is any gain from using NVMe SSDs (4-5 times the iops and seq. read) over SATA SSDs?

It depends on the workload. ClickHouse works fine with SATA, especially in RAID. There is definitely a gain from using faster disks. In our experiments, a SATA RAID-10 with 10-12 disks is slower than the same setup with SAS disks, which comes close to NVMe but is still 1.5 times behind on test queries.

However, for most cases SATA is sufficient (and more affordable).

I have been trying to give the CollapsingMergeTree a run. Unfortunately, as soon as I add "FINAL" it doesn't return anything.

CREATE TABLE impressions
ENGINE = CollapsingMergeTree(`date`, (`impressionUUID`, `date`), 8192, sign)
AS SELECT
toInt8(-1) AS `sign`,
toFixedString(UUIDStringToNum(`impressionUUID`), 16) AS `impressionUUID`,
`seen`,
toDate(`timestamp`) as `date`,
[...]
SELECT
    language as _group,
    toDate(timestamp) as _date,
    [...]
    COUNT(seen) as seen
FROM impressions FINAL
WHERE `timestamp` BETWEEN '2017-01-01 00:00:00' AND '2017-11-28 00:00:00'
GROUP BY _group, _date;

Am I missing something?
When I get the seen event, I only have the impressionUUID. Does this mean I would need to run a SELECT first (which won't scale)?

Edit: FINAL seems to work in that it groups by primary key, keeping rows where the sum of sign is > 0. This doesn't seem to help in this case, since it won't select records that haven't been updated, or only records that have been updated, depending on the use of sign.

Unfortunately the performance impact of FINAL is intense:
(Collapsing/Replacing)MergeTree - 13 million rows/s
CollapsingMergeTree FINAL - 5 million rows/s
ReplacingMergeTree FINAL - 3 million rows/s
MergeTree JOIN - 9 million rows/s
This is some major difference, unfortunately I don't know how it scales.

Unfortunately I still haven't found an approach that performs well with hundreds of millions of rows of real-time data. If only the data were available at the same time and could be written with one query.

I'm seeing upwards of 80 million rows/s when selecting data from a single table. As soon as I join or make use of Collapsing/Replacing MergeTrees I'm down to 3 million rows/s.
Materialized views aren't helping either, since we can't update those when the records of a join change.

@filimonov If you have any more ideas on how I could solve this, I would very much appreciate it.

Ok. So you need to do a one-time change of the status (one column) of records which were inserted in the last day. Those status updates will happen in 'bulk' mode, e.g. some ETL in the evening fills the data for the whole day. Did I understand that correctly?

You can try the following trick:
1) use a usual MergeTree / ReplicatedMergeTree with smaller partitions (the partition size should be close to the size of the updated range, i.e. if you plan to make those updates for the last day - try daily partitioning)
2) store your flag for the last one/two days additionally somewhere else, for example in MySQL in a simple table like (impressionUUID, seen). You will do the updates there.
3) create a ClickHouse dictionary to read the status of the seen column from the MySQL table.
4) declare your seen column in the ClickHouse table like that:

CREATE TABLE impressions (
  impressionUUID FixedString(16),
  seen Int8 DEFAULT dictGetInt8('mysql_dict_name', 'seen', tuple(impressionUUID))
) ENGINE ...

After the data was inserted in ClickHouse, and the flags were inserted in MySQL, your SELECTs will not use MySQL. But there is a nice command called ALTER TABLE CLEAR COLUMN seen IN PARTITION ....
After you run that, ClickHouse will clear all the stored values for that partition, and during SELECTs it will use the default expression, which will lead it to read the data from MySQL.

So any time you update the data in MySQL you should clear the column for the corresponding partition.
After you finish all the updates and you want to 'fix' the final value for that column in ClickHouse (i.e. fill the column with the real values again to avoid using the dictionary lookup all the time) - for example after one day - you can call OPTIMIZE TABLE x PARTITION (...) FINAL
and all DEFAULT expressions will be materialized.

Some proof of concept:

:) CREATE TABLE x ( a Date DEFAULT today(), b Int32 DEFAULT rand() ) ENGINE = MergeTree(a, a, 1);

Ok.

:) insert into x(b) values (1),(1),(1);
Ok.

:) select * from x;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€a─┬─b─┐
β”‚ 2018-04-12 β”‚ 1 β”‚
β”‚ 2018-04-12 β”‚ 1 β”‚
β”‚ 2018-04-12 β”‚ 1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”˜

:) ALTER TABLE x CLEAR COLUMN b IN PARTITION '201804'
Ok.

/* now each select will get calculated random value from default expression */

:) select * from x;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€a─┬───────────b─┐
β”‚ 2018-04-12 β”‚  1553142215 β”‚
β”‚ 2018-04-12 β”‚  -866675022 β”‚
β”‚ 2018-04-12 β”‚ -1689371698 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

:) select * from x;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€a─┬───────────b─┐
β”‚ 2018-04-12 β”‚  -831424890 β”‚
β”‚ 2018-04-12 β”‚  1956187707 β”‚
β”‚ 2018-04-12 β”‚ -1184752955 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

/*
if in the meanwhile some background merges happen, they will materialize b again
for the merged subrange, so you need to call CLEAR COLUMN after every update of the source,
to avoid using values which were materialized by background merges.
*/

/* we can optimize final when we need to materialize and fix values of b */

:) optimize table x partition '201804' final;
Ok.

/* now b is fixed again */


:) select * from x;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€a─┬──────────b─┐
β”‚ 2018-04-12 β”‚ 1818783249 β”‚
β”‚ 2018-04-12 β”‚ 2029058371 β”‚
β”‚ 2018-04-12 β”‚  375651238 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

:) select * from x;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€a─┬──────────b─┐
β”‚ 2018-04-12 β”‚ 1818783249 β”‚
β”‚ 2018-04-12 β”‚ 2029058371 β”‚
β”‚ 2018-04-12 β”‚  375651238 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Of course instead of MySQL you can use any other storage and an appropriate dictionary.
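
To tie the proof of concept back to the seen column, a hedged end-to-end sketch (the dictionary 'mysql_dict_name' is assumed to be configured over the external (impressionUUID, seen) table as described above; the partition id and the extra columns are placeholders):

-- Same idea as the snippet above, with the partitioning column included.
CREATE TABLE impressions
(
    `impressionUUID` FixedString(16),
    `date` Date,
    `timestamp` DateTime,
    `seen` Int8 DEFAULT dictGetInt8('mysql_dict_name', 'seen', tuple(impressionUUID))
) ENGINE = MergeTree(`date`, (`impressionUUID`, `date`), 8192);

-- After every update of the external (impressionUUID, seen) table, drop the
-- materialized values so SELECTs fall back to the dictionary:
ALTER TABLE impressions CLEAR COLUMN seen IN PARTITION '201712';

-- Once the partition is no longer updated, materialize the final values:
OPTIMIZE TABLE impressions PARTITION '201712' FINAL;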

Wow, thank you. This sounds like a great workaround.
I get the seen records whenever the event triggers, meaning at any given time of the day, just delayed relative to the impressions.
Do you think it is doable - performance wise - to update the data more frequently, e.g. every minute or at least every 5 minutes?
Could I use another ClickHouse table as the data source and decide based on record exists/does not exist? (or would performance not be optimal for these kinds of operations?)

How is it that, if this works performance wise, there is no update feature? Or does it block the table while doing the update, causing all queries during this time to fail?

Edit: or could we keep the dictionary lookup active for the last n hours/days and do the OPTIMIZE PARTITION FINAL for older records? So older partitions would be a plain single-table read and new partitions would use the dictionary lookup. I don't know how performance will be with this. I have up to 100 million new records per day.

Updates in that scenario cost mostly nothing (as they happen outside of ClickHouse). SELECTs will use the external dict before 'fixing' the final value (by OPTIMIZE FINAL), so it's more about the performance of the external dictionary during that time. But this approach requires you to store outside only a limited amount of data (for the last day or few) and only one column, so probably you can find some fast solution for that task (if you have a lot of SELECTs for the current day and fast UPDATEs).

Could I use another ClickHouse table as the data source and decide based on record exists/does not exist? (or would performance not be optimal for these kinds of operations?)

ClickHouse is not good/optimal for single-row lookups, and here a lot of such lookups would be needed.

How is it that, if this works performance wise, there is no update feature

Generally that can't be used for general purpose updates, as it requires clearing / rewriting of the whole column for one partition, and storing the processed data somewhere outside in the meanwhile. But in some cases it could be useful, so maybe @alexey-milovidov or another developer from the Yandex team can adopt it in some way.

Updates in that scenario cost mostly nothing (as they happen outside of ClickHouse). SELECTs will use the external dict before 'fixing' the final value (by OPTIMIZE FINAL), so it's more about the performance of the external dictionary during that time. But this approach requires you to store outside only a limited amount of data (for the last day or few) and only one column, so probably you can find some fast solution for that task (if you have a lot of SELECTs for the current day and fast UPDATEs).

Hmm, doesn't the merging which happens automatically break non-optimized data?
When I run the OPTIMIZE FINAL, is the table able to accept writes while it runs? E.g. if it needs to update 100 million rows with an external dictionary, that is probably going to take some time and I don't want to lose data.

I don't need to update data, I only need to set what was undefined/unknown before, hence when I use another database for temp storage I only have inserts there.

For the external data source I'm probably well off with MongoDB (if the lookups are fast enough, which they should be), Cassandra or Redis. What do you think?

or could we keep the dictionary lookup active for the last n hours/days and do the OPTIMIZE PARTITION FINAL for older records? So older partitions would be a plain single-table read and new partitions would use the dictionary lookup. I don't know how performance will be with this. I have up to 100 million new records per day.

Yes, that's what I've tried to propose; maybe my explanation wasn't clear enough. So the last day or few come from the dictionary. Older data is materialized and stored in the column.

That allows keeping the dictionary relatively small and fast, and the performance downgrade caused by using the dictionary will affect only the last day or few (older data will be materialized and immutable).

When I run the OPTIMIZE FINAL, is the table able to accept writes while it runs? E.g. if it needs to update 100 million rows with an external dictionary, that is probably going to take some time and I don't want to lose data.

Yes, it is 100% safe; generally OPTIMIZE FINAL does the same job as normal background merges.

Yes, that's what I've tried to propose; maybe my explanation wasn't clear enough. So the last day or few come from the dictionary. Older data is materialized and stored in the column.

You wrote:

if in the meanwhile some background merges happen, they will materialize b again
for the merged subrange, so you need to call CLEAR COLUMN after every update of the source,
to avoid using values which were materialized by background merges.

Doesn't this mean that the latest records are not being read from the dictionary if they have been merged, which happens automatically at random times?

For the external data source I'm probably well off with MongoDB (if the lookups are fast enough, which they should be), Cassandra or Redis

@Slind14 Dictionary lookups aren't done directly against your db, so it doesn't really matter what you use. ClickHouse polls your dictionary for changes and keeps the most recent data in RAM for query-time lookups. Just use whatever db is easiest for you.

@Slind14 Dictionary lookups aren't done directly against your db, so it doesn't really matter what you use. ClickHouse polls your dictionary for changes and keeps the most recent data in RAM for query-time lookups. Just use whatever db is easiest for you.

ahh, I see. @derekperkins is there something I can read on how it does this? E.g. if it checks for changes once an hour I have issues with real time data.

Doesn't this mean that the latest records are not being read from the dictionary if they have been merged, which happens automatically at random times?

Yes. And it is not true realtime updates; it is a workaround for _some_ updates.

Everything depends on the concrete usage scenario.

Dictionary lookups aren't done directly against your db, so it doesn't really matter what you use

That's true if the data stored in the dict has fewer than 1 mln rows. If it's more - the performance of the dictionary matters.

ClickHouse is not good/optimal for single-row lookups, and here a lot of such lookups would be needed.

most recent data in RAM

Hmm, when dictionaries are cached in memory for quick access, why would it do one lookup for every key/value?

I have a few tables with ~100 million new rows per day; if we work with daily partitions and optimize, wouldn't this mean we have ~200 million data points in the dictionary cache at the end of the day? Considering the number of tables that exist, I find this difficult to afford in terms of the memory required for it.

Check this: https://clickhouse.yandex/docs/en/dicts/external_dicts_dict_layout/

If flat or hashed suits your needs - you don't need key/value lookups. For cache (which is the slowest, but most universal) - key lookups are needed.

My keys are binary UUIDs. So I should probably go with hashed?
Now, since it states that it is "completely" stored in memory, doesn't this mean that I could use a ClickHouse table as the source and not have issues with write and read performance, as long as writes are batched and the read is one query for all the key/values?

In my case I only need to read the data when the user is requesting it, hence the hit rate on the cache is probably extremely low, especially on the first read.

You said that you can have about 100 mln 'seen' records daily - that means that your dict will have ~16Gb size overall for one day (100 mln * 16 chars for your UUID + some overhead). hashed or flat need to reread the whole table after each change. So in that case, after every update of the underlying table (and you want to do it each minute), 16Gb of data would have to be copied from the underlying table to memory. Not sure that solution will be optimal. Rather look at cache.

Another variant - if that "seen" event can happen only once in a short timeframe - maybe it would be optimal for you to just store one day of data in some other DB outside of ClickHouse, and copy it to ClickHouse after that 'mutable' period. So some external DB like MySQL/Postgres/whatever will store the data for one day; in that case the amount of data will not be too big, even simple MySQL can deal quite well with 100 mln rows. And ClickHouse will get the final version of the data after some time, and will store the whole history.
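
If the external store were MySQL, the daily hand-over into ClickHouse could be sketched with the mysql() table function (host, schema, table and column names below are placeholders):

-- Once the 'mutable' period is over, copy the now-final rows from the
-- external MySQL table into the ClickHouse history table.
INSERT INTO impressions_history
SELECT
    toFixedString(UUIDStringToNum(`impressionUUID`), 16) AS `impressionUUID`,
    `seen`,
    toDate(`timestamp`) AS `date`,
    `timestamp`
FROM mysql('mysql-host:3306', 'tracking', 'impressions_day', 'user', 'password')
WHERE toDate(`timestamp`) <= yesterday();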

Hmm, since 100 mln records daily is per table, it is a lot more if we look at all the tables. So yeah, it's not going to work memory-wise.

I'm still having trouble believing that even fast storage like Redis would be able to deliver tens of millions of single lookups in a matter of milliseconds.

The issue I have with another database is that I would need to read from two (one for everything older than two days and the other for the last two days) and therefore can't aggregate at query time.

I'll try it with a union query (see the sketch below):

  1. a table with data older than 2 days, with the seen bool as a column
  2. a table with the live data, partitioned by day, joined against a second table with the seen records

This is the best I could come up with in terms of keeping it simple, on one database, and fast.
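
A rough sketch of what such a combined query could look like (all table names are hypothetical: a history table with seen materialized as a column, and live tables joined at query time):

SELECT
    _date,
    count() AS impressions,
    sum(seen) AS seen
FROM
(
    -- older, immutable data: seen is already stored as a column
    SELECT toDate(`timestamp`) AS _date, `seen`
    FROM impressions_history
    WHERE `date` < today() - 2

    UNION ALL

    -- freshest data: seen resolved through the join; unmatched rows get the default 0
    SELECT toDate(`timestamp`) AS _date, `seen`
    FROM impressions_live
    ANY LEFT JOIN
    (
        SELECT `impressionUUID`, `seen` FROM impressions_seen_live
    ) USING (`impressionUUID`)
    WHERE `date` >= today() - 2
)
GROUP BY _date;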

I also have click data (not just seen), hence I'll probably make two tables out of this one in order to be able to use the join with ANY. I know ClickHouse isn't great for multiple joins and the queries will look incredibly bad, but it should help a lot with performance.

I'm still lacking a proper solution for this use case. I did a lot of workarounds, which add a lot of points of failure. If there is another, better way which we haven't thought of, it would be great.

Which solution did you end up with at the moment?

I have two datasets: one consisting of two tables with the latest data, and one single table with all the data.
Once a day I write the data from the day before yesterday from the "live" tables to the "history" table.
The issue I'm still facing is with datasets which span hundreds of millions of records per day. Although we have only the latest data across the two tables, the join still spans massive amounts of data, and even with indexed binary UUIDs and 48 cores this takes forever. This also prevents me from using aggregating engines for the "live" data.

@Slind14 And why is it just 48 cores? Which hardware setup do you have?

@blinkov the CPU power isn't even the bottleneck. A query runs for multiple minutes and the CPU usage stays below 30%. I assume this is due to the join, since queries without a join get the CPU to 100%.
The SSDs are idle during this process as well, and the memory is used 90% for cache. (512GB)

@blinkov do you have any idea/suggestion on how we can improve this? Sampling doesn't work either because the joined table is still massive.

@Slind14 I see you tried CollapsingMergeTree earlier, but had some issues with FINAL. Meanwhile, the recommended approach is to rewrite queries in a way that FINAL is not needed, for example replacing COUNT(*) with SUM(Sign). Have you managed to do that?
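
For illustration, a hedged sketch of such a rewrite against the CollapsingMergeTree schema shown earlier, assuming the conventional sign usage (the initial insert carries sign = 1; an update inserts a cancelling copy with sign = -1 plus the new row with sign = 1 and seen = 1):

-- Without FINAL: unmerged old/new row pairs cancel out because their signs sum to zero.
SELECT
    language AS _group,
    toDate(timestamp) AS _date,
    sum(sign) AS impressions,      -- replaces count(*)
    sum(sign * seen) AS seen       -- replaces counting rows with seen = 1
FROM impressions
WHERE timestamp BETWEEN '2017-01-01 00:00:00' AND '2017-11-28 00:00:00'
GROUP BY _group, _date;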

As of hardware, it sounds like you are using just one server, right?

Atm. just one server, yes. Using FINAL has been super slow as well; I could give it another shot, but I doubt that it is much faster than what I'm currently doing. It wasn't back then. Especially since I'm not reporting on these metrics every time, and FINAL would then be needed for every view.

So have you rewritten your SELECT queries so they return correct results on CollapsingMergeTree without FINAL? If not, it's definitely worth giving it a shot.

How should it work without using FINAL? Doesn't this inflate the data when it returns real-time data which hasn't been merged yet?

So, I revisited the possibility of Collapsing and Replacing MergeTrees, because we are looking at multiple billions of records a day.

Due to the number of requests we receive, we have a load balancing setup with a good number of nodes. Because some requests (updates) come in with < 5ms in between, it seems impossible to use a temp storage for the CollapsingMergeTree. But initially the data is coming from the client, which could cache the old data and send both the old and the new data. In this case we would only have an issue with the data that is collected server side (geo, UA), but this shouldn't change too often. So the only major remaining issue is the timestamp. The timestamp is generated by the database itself (now()). Is there any solution for this? Ultimately we want the initial timestamp and not the one from the update.

The timestamp is also an issue with the ReplacingMergeTree.

I would like to bring this topic up again because there still isn't a solution.

To summarize, we are receiving data for the same object at different times, sometimes delayed by hours. We need the latest, summed or joined data on the spot in real-time including sampling support.

Suggested approaches and their results (using a small database of just 30 million unique objects with 50 million events):

Join (control): +4 seconds
Dictionary: +20 seconds
Replacing Merge Tree: +15 seconds (PREWHERE (id, version) IN (SELECT id, max(version)); sketched below)
Collapsing Merge Tree: +4 seconds (same as join, GROUP BY id)
Summing Merge Tree: same as Collapsing Merge Tree
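
A sketch of the ReplacingMergeTree variant referenced above, filtering to the latest version of each object without FINAL (table and column names are placeholders):

-- Keep only the newest version per id at query time.
SELECT
    count() AS objects,
    sum(seen) AS seen
FROM events
PREWHERE (id, version) IN
(
    SELECT id, max(version)
    FROM events
    GROUP BY id
);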

Why was 'ALTER TABLE UPDATE' not suggested in this case? Correct me if I am wrong, but it says _'... mutations are intended for heavy operations that change a lot of rows in a table'_ in the official documentation.
It is not that much, but I have tried 'ALTER TABLE UPDATE' with 1M rows, and it was actually pretty fast. (Updated 1K rows out of 1M, elapsed time was 6ms)

And if you could give further information about 'ALTER TABLE UPDATE's use cases, performance etc., I would appreciate it @filimonov @blinkov

@MeteHanC mutations are executed asynchronously; it just got scheduled that fast, not executed yet. To check how they are going you need to look into the system.mutations table.
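
A hedged example of what that check could look like (column names as in the system.mutations table; the WHERE clause of the mutation is just an illustration):

-- Schedule a mutation (asynchronous)...
ALTER TABLE impressions UPDATE seen = 1
WHERE impressionUUID IN (SELECT impressionUUID FROM impressions_seen);

-- ...then watch whether it has actually been applied to all parts.
SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'impressions' AND is_done = 0;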

mutations are executed asynchronously; it just got scheduled that fast, not executed yet. To check how they are going you need to look into the system.mutations table.

I understand. I checked it and it is being done in no time. The 'done' column was set to '1' immediately, and it was replicated to the other CH nodes instantly as well.

What I am trying to understand is: how does the ALTER mutation work? Is it not reliable, and will the mutation take place 'eventually' (just like the merging process of the MergeTree family)? @blinkov

We still haven't found a solution to achieve this in a way that scales.

@MeteHanC Hi, I just wonder whether your test case ran against a non-replicated MergeTree? Have you checked ReplicatedMergeTree, and done a QPS benchmark? Maybe there is no problem for a one-time update, but how about many updates in a short time, mixed with many SELECTs?
Just want to know the realtime performance of CH. @blinkov do you have any advice? Thanks.
