Hi guys,
I'm asking here after having read and tested a few things.
My use case is the following: I have a big table of hospitality data (100M+ rows) with many dimensions (20-30) and some measures (~10).
The way we need to update our data is to first run a DELETE for a hotel_id + a date range. Once the data is deleted we can INSERT the new data for this hotel_id and date range.
I'm considering 2 ways:
I've read that mutations should not be used in production, but I don't understand why.
From what I've tested, an ALTER TABLE ... DELETE WHERE htl_id = .. AND stay_date BETWEEN .. AND .. is fast (I've checked in system.mutations).
ReplacingMergeTree is more complex to handle.
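To be concrete, the ALTER TABLE ... DELETE flow I tested looks roughly like this (the table names and literal values are just placeholders, not our real schema):

```sql
-- Rough sketch of the mutation-based flow (placeholder names)
ALTER TABLE hotel_stats
    DELETE WHERE hotel_id = 42 AND stay_date BETWEEN '2021-06-01' AND '2021-06-30';

-- Once the mutation is done, insert the fresh data for this hotel_id / date range
INSERT INTO hotel_stats
SELECT *
FROM staging_hotel_stats
WHERE hotel_id = 42 AND stay_date BETWEEN '2021-06-01' AND '2021-06-30';
```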
My question is: what would you recommend, especially regarding the usage of delete mutations?
Thanks for this great product!
Best regards
Jean
You can try to use ALTER TABLE ... DELETE as part of your day-to-day flow.
Though it's not recommended, because ALTER TABLE ... DELETE is more of an administrative operation and it was designed for rare, massive changes.
The implementation still has issues, and a mutation may get stuck forever or execute for a long time.
It would be nice if ClickHouse would set a "to be deleted" flag on rows that are awaiting deletion, and those would be transparently skipped by all SELECT queries (time of execution for alter delete would not be instant, but rather the time needed to set the flag).
Do we know if this was ever considered?
@den-crane thanks for your answer. In my case we delete a small portion of the data each day, and from what I've tested the ALTER TABLE ... DELETE is quick. But the tests are not in a production situation, and from what you wrote I don't want my INSERTs to be blocked by the mutation.
Using ReplacingMergeTree I've used the following logic:
- ORDER KEY on all dimensions + measure values defaulting to 0
- INSERT INTO <table> (<dims>) SELECT DISTINCT <dims> FROM <table> WHERE .... It allows "deleting" data that is not in the fresh data anymore.

This scenario is working fine. My only doubt is that I have around 20 columns in the ORDER KEY. I don't know if it's optimized, but this is the only way of doing this DELETE + INSERT strategy.
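To make that concrete, here's roughly the pattern (the table name, columns and values are illustrative placeholders, not our real schema):

```sql
-- Dimensions in the ORDER BY, measures defaulting to 0, plus a version column
CREATE TABLE hotel_stats
(
    hotel_id  UInt32,
    stay_date Date,
    channel   String,                    -- ... ~20 dimension columns in reality
    bookings  UInt32   DEFAULT 0,        -- measures default to 0
    revenue   Float64  DEFAULT 0,
    insert_ts DateTime DEFAULT now()     -- version column for ReplacingMergeTree
)
ENGINE = ReplacingMergeTree(insert_ts)
PARTITION BY toYYYYMM(stay_date)
ORDER BY (hotel_id, stay_date, channel);

-- "Delete": re-insert only the dimension values, so the latest version of any
-- row that is absent from the fresh data ends up with zeroed measures
INSERT INTO hotel_stats (hotel_id, stay_date, channel)
SELECT DISTINCT hotel_id, stay_date, channel
FROM hotel_stats
WHERE hotel_id = 42 AND stay_date >= today();

-- Then INSERT the fresh data for this hotel_id / date range
```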
> It would be nice if ClickHouse would set a "to be deleted" flag on rows that are awaiting deletion, and those would be transparently skipped by all SELECT queries (time of execution for alter delete would not be instant, but rather the time needed to set the flag).
> Do we know if this was ever considered?
It's not that easy. Parts are immutable. These deleted rows would have to be placed in a special new part (a "delete vector" in Vertica terms), and each SELECT would have to merge its own result with this part. Basically it would be the same as SELECT ... FROM T FINAL, and I guess it would slow down all selects on this table dramatically (10-100 times).
> This scenario is working fine. My only doubt is that I have around 20 columns in the ORDER KEY. I don't know if it's optimized, but this is the only way of doing this DELETE + INSERT strategy.
20 columns in the ORDER KEY is OK. You can use https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#choosing-a-primary-key-that-differs-from-the-sorting-key
```sql
PRIMARY KEY (a, b, c)
ORDER BY (a, b, c, d, e, f, g, h, i, j, k, ...)
```

In this case the primary key index will contain only a, b, c, so you'll save memory and disk space. But merges will still sort by a, b, c, d, e, f, g, h, i, j, k, ....
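For illustration, in a CREATE TABLE that could look like this (column names are made up; the primary key must be a prefix of the sorting key):

```sql
CREATE TABLE hotel_stats
(
    hotel_id  UInt32,
    stay_date Date,
    channel   String,
    room_type String,        -- ... plus the other dimension columns
    bookings  UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(stay_date)
PRIMARY KEY (hotel_id, stay_date)                       -- what the index stores
ORDER BY (hotel_id, stay_date, channel, room_type);     -- what merges sort by
```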
@den-crane thanks for your answer. Just wanted to add more info on that one.
Regarding the last info, I've switched to a CollapsingMergeTree approach.
Still need to validate performance on a big table, but it already looks simpler as it avoids creating views with argMax.
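For reference, the pattern I'm testing looks roughly like this (table/column names are simplified placeholders, not our real schema):

```sql
CREATE TABLE hotel_stats
(
    hotel_id  UInt32,
    stay_date Date,
    bookings  UInt32,
    sign      Int8                      -- +1 = state row, -1 = cancel row
)
ENGINE = CollapsingMergeTree(sign)
PARTITION BY toYYYYMM(stay_date)
ORDER BY (hotel_id, stay_date);

-- "Delete" the current rows for a hotel / date range by writing cancel rows
-- (FINAL gives the collapsed state, so already-cancelled rows are not re-cancelled)
INSERT INTO hotel_stats
SELECT hotel_id, stay_date, bookings, -1
FROM hotel_stats FINAL
WHERE hotel_id = 42 AND stay_date >= today();

-- Then insert the fresh rows with sign = +1
```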
So, given the above, what would be the current best practice to handle an ETL in which the last x days need to be updated daily?
We are currently working with partition swaps. I've considered switching to the ALTER statements, but it doesn't look like there's a way to guarantee atomic updates at this time, correct? Aside from using partition swaps, of course.
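For context, by partition swap I mean roughly this (placeholder names; it assumes a staging table with the same structure and a monthly partition key):

```sql
-- Rebuild the affected month in a staging table with the same structure...
INSERT INTO stats_staging
SELECT * FROM fresh_source WHERE toYYYYMM(event_date) = 202106;

-- ...then swap it into the main table atomically
-- (assumes both tables use PARTITION BY toYYYYMM(event_date))
ALTER TABLE stats REPLACE PARTITION 202106 FROM stats_staging;
```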
@andrea-s seems we have kind of the same use case. Mine is that future days' data needs to be updated daily (hotel reservations).
For now I'm on the CollapsingMergeTree approach.
I've also tested mutations for the delete with the SET mutations_sync = 2 setting. But I think it's not a good idea, as it can take too much time for big tables and we only delete a small portion of the data (only future days for one hotel, on a big table with all hotels).
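For reference, the synchronous variant I tested looks like this (table name is a placeholder):

```sql
-- 2 = the ALTER returns only once the mutation has finished on all replicas
SET mutations_sync = 2;

ALTER TABLE hotel_stats
    DELETE WHERE hotel_id = 42 AND stay_date >= today();
```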
Mine is a little bit easier than yours (just every day we reprocess the last x days for some sources, as the data can be fully reliable only with some delay - e.g. Google Search Console data). So our "rewrites" are always based on the primary key... Thanks for the tip about mutations_sync, didn't know about that.
For what it's worth, in use cases similar to yours (e.g. a daily snapshot of lifetime statistics from Facebook posts) we usually go with a self-join to determine an "is_most_recent" flag for each item - and then filter on that. Not sure if it applies directly to your use case though :-)
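Very roughly, something like this (made-up table and column names):

```sql
-- Self-join against the latest snapshot date per post, keeping only current rows
SELECT s.*
FROM post_stats AS s
INNER JOIN
(
    SELECT post_id, max(snapshot_date) AS snapshot_date
    FROM post_stats
    GROUP BY post_id
) AS latest USING (post_id, snapshot_date);
```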
@andrea-s thanks for sharing this. Not sure if this "is_most_recent" will apply to our use case.
Seems that the CollapsingMergeTree approach is working fine to "delete" data and is simpler than ReplacingMergeTree.
Another interesting thing about mutations.
Suppose you execute asynchronously: ALTER TABLE .... WHERE date = today().
Then you insert data for today().
If your mutation is still running during the insert, you'll have no issue. Your new data from the insert will not be removed by the running mutation.
When you execute ALTER TABLE .... WHERE date = today(), the mutation locks the table for a microsecond, creates a list of active parts, and stores this list in mutations.txt or ZooKeeper (if the table is replicated). After that the mutation begins to mutate the parts from this list, so it does not see new parts inserted afterwards.
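To illustrate (placeholder names; the ALTER runs asynchronously here):

```sql
-- The mutation snapshots the currently active parts and runs in the background
ALTER TABLE hotel_stats DELETE WHERE stay_date = today();

-- Rows inserted after the ALTER land in new parts, which the mutation never touches
INSERT INTO hotel_stats
SELECT * FROM staging_hotel_stats WHERE stay_date = today();

-- Check progress of pending mutations
SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'hotel_stats' AND is_done = 0;
```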