Hi guys,
I'm asking here after having read and tested a few things.
My use case is the following: I have a big table of hospitality data (100M+ rows) with many dimensions (20-30) and some measures (~10).
The way we need to update our data is to first run a DELETE for a hotel_id + a date range. Once the data is deleted we can INSERT the new data for this hotel_id and date range.
I'm considering 2 ways:
I've read that mutations should not be used in production, but I don't understand why.
From what I've tested, an ALTER TABLE ... DELETE WHERE htl_id = .. AND stay_date BETWEEN .. AND .. is fast (I've checked in system.mutations).
ReplacingMergeTree is more complex to handle.
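To be concrete, the ALTER TABLE ... DELETE flow I tested looks roughly like this (the table names and literal values are just placeholders, not our real schema):

```sql
-- Rough sketch of the mutation-based flow (placeholder names)
ALTER TABLE hotel_stats
    DELETE WHERE hotel_id = 42 AND stay_date BETWEEN '2021-06-01' AND '2021-06-30';

-- Once the mutation is done, insert the fresh data for this hotel_id / date range
INSERT INTO hotel_stats
SELECT *
FROM staging_hotel_stats
WHERE hotel_id = 42 AND stay_date BETWEEN '2021-06-01' AND '2021-06-30';
```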
My question is: what would you recommend, especially regarding the usage of delete mutations?
Thanks for this great product!
Best regards
Jean
You can try to use ALTER TABLE ... DELETE as part of your day-to-day flow.
Though it's not recommended, because ALTER TABLE ... DELETE is more of an administrative operation and it was designed for rare, massive changes.
The implementation still has issues, and a mutation may get stuck forever or execute for a long time.
It would be nice if ClickHouse would set a "to be deleted" flag on rows that are awaiting deletion, and those would be transparently skipped by all SELECT queries (time of execution for alter delete would not be instant, but rather the time needed to set the flag).
Do we know if this was ever considered?
@den-crane thanks for your answer. In my case we delete a small portion of the data each day, and from what I've tested the ALTER TABLE ... DELETE is quick. But the tests are not in a production situation, and from what you wrote I don't want my INSERTs to be blocked by the mutation.
Using ReplacingMergeTree I've used the following logic:
- ORDER KEY on all dimensions + measure values defaulting to 0
- INSERT INTO <table> (<dims>) SELECT DISTINCT <dims> FROM <table> WHERE .... It allows "deleting" data that is not in the fresh data anymore.

This scenario is working fine. My only doubt is that I have around 20 columns in the ORDER KEY. I don't know if it's optimized, but this is the only way of doing this DELETE + INSERT strategy.
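To make that concrete, here's roughly the pattern (the table name, columns and values are illustrative placeholders, not our real schema):

```sql
-- Dimensions in the ORDER BY, measures defaulting to 0, plus a version column
CREATE TABLE hotel_stats
(
    hotel_id  UInt32,
    stay_date Date,
    channel   String,                    -- ... ~20 dimension columns in reality
    bookings  UInt32   DEFAULT 0,        -- measures default to 0
    revenue   Float64  DEFAULT 0,
    insert_ts DateTime DEFAULT now()     -- version column for ReplacingMergeTree
)
ENGINE = ReplacingMergeTree(insert_ts)
PARTITION BY toYYYYMM(stay_date)
ORDER BY (hotel_id, stay_date, channel);

-- "Delete": re-insert only the dimension values, so the latest version of any
-- row that is absent from the fresh data ends up with zeroed measures
INSERT INTO hotel_stats (hotel_id, stay_date, channel)
SELECT DISTINCT hotel_id, stay_date, channel
FROM hotel_stats
WHERE hotel_id = 42 AND stay_date >= today();

-- Then INSERT the fresh data for this hotel_id / date range
```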
> It would be nice if ClickHouse would set a "to be deleted" flag on rows that are awaiting deletion, and those would be transparently skipped by all SELECT queries (time of execution for alter delete would not be instant, but rather the time needed to set the flag).
> Do we know if this was ever considered?
It's not that easy. Parts are immutable. These deleted rows would have to be placed in a special new part (a "delete vector" in Vertica terms), and each SELECT would have to merge its own result with this part. Basically it would be the same as SELECT ... FROM T FINAL, and I guess it would slow down all selects on this table dramatically (10-100 times).
> This scenario is working fine. My only doubt is that I have around 20 columns in the ORDER KEY. I don't know if it's optimized, but this is the only way of doing this DELETE + INSERT strategy.
20 columns in the ORDER KEY is OK. You can use https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#choosing-a-primary-key-that-differs-from-the-sorting-key
```sql
PRIMARY KEY (a, b, c)
ORDER BY (a, b, c, d, e, f, g, h, i, j, k, ...)
```

In this case the primary key index will contain only a, b, c, so you'll save memory and disk space. But merges will still sort by a, b, c, d, e, f, g, h, i, j, k, ....
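For illustration, in a CREATE TABLE that could look like this (column names are made up; the primary key must be a prefix of the sorting key):

```sql
CREATE TABLE hotel_stats
(
    hotel_id  UInt32,
    stay_date Date,
    channel   String,
    room_type String,        -- ... plus the other dimension columns
    bookings  UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(stay_date)
PRIMARY KEY (hotel_id, stay_date)                       -- what the index stores
ORDER BY (hotel_id, stay_date, channel, room_type);     -- what merges sort by
```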
@den-crane thanks for your answer. Just wanted to add more info on that one.
Regarding the last info, I've switched to a CollapsingMergeTree approach.
Still need to validate performance on a big table, but it already looks simpler as it avoids creating views with argMax.
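For reference, the pattern I'm testing looks roughly like this (table/column names are simplified placeholders, not our real schema):

```sql
CREATE TABLE hotel_stats
(
    hotel_id  UInt32,
    stay_date Date,
    bookings  UInt32,
    sign      Int8                      -- +1 = state row, -1 = cancel row
)
ENGINE = CollapsingMergeTree(sign)
PARTITION BY toYYYYMM(stay_date)
ORDER BY (hotel_id, stay_date);

-- "Delete" the current rows for a hotel / date range by writing cancel rows
-- (FINAL gives the collapsed state, so already-cancelled rows are not re-cancelled)
INSERT INTO hotel_stats
SELECT hotel_id, stay_date, bookings, -1
FROM hotel_stats FINAL
WHERE hotel_id = 42 AND stay_date >= today();

-- Then insert the fresh rows with sign = +1
```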
So, given the above, what would be the current best practice to handle an ETL in which the last x days need to be updated daily?
We are currently working with partition swaps. I've considered switching to the ALTER statements, but it doesn't look like there's a way to guarantee atomic updates at this time, correct? Aside from using partition swaps, of course.
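For context, by partition swap I mean roughly this (placeholder names; it assumes a staging table with the same structure and a monthly partition key):

```sql
-- Rebuild the affected month in a staging table with the same structure...
INSERT INTO stats_staging
SELECT * FROM fresh_source WHERE toYYYYMM(event_date) = 202106;

-- ...then swap it into the main table atomically
-- (assumes both tables use PARTITION BY toYYYYMM(event_date))
ALTER TABLE stats REPLACE PARTITION 202106 FROM stats_staging;
```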
@andrea-s seems we have kind of the same use case. Mine is that future days' data needs to be updated daily (hotel reservations).
For now I'm on the CollapsingMergeTree approach.
I've also tested mutations for the delete with the SET mutations_sync = 2 setting. But I think it's not a good idea, as it can take too much time for big tables and we only delete a small portion of the data (only future days for one hotel, on a big table with all hotels).
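For reference, the synchronous variant I tested looks like this (table name is a placeholder):

```sql
-- 2 = the ALTER returns only once the mutation has finished on all replicas
SET mutations_sync = 2;

ALTER TABLE hotel_stats
    DELETE WHERE hotel_id = 42 AND stay_date >= today();
```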
Mine is a little bit easier than yours (just every day we reprocess the last x days for some sources, as the data can be fully reliable only with some delay - e.g. Google Search Console data). So our "rewrites" are always based on the primary key... Thanks for the tip about mutations_sync, didn't know about that.
For what it's worth, in use cases similar to yours (e.g. a daily snapshot of lifetime statistics from Facebook posts) we usually go with a self-join to determine an "is_most_recent" flag for each item - and then filter on that. Not sure if it applies directly to your use case though :-)
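Very roughly, something like this (made-up table and column names):

```sql
-- Self-join against the latest snapshot date per post, keeping only current rows
SELECT s.*
FROM post_stats AS s
INNER JOIN
(
    SELECT post_id, max(snapshot_date) AS snapshot_date
    FROM post_stats
    GROUP BY post_id
) AS latest USING (post_id, snapshot_date);
```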
@andrea-s thanks for sharing this. Not sure if this "is_most_recent" will apply to our use case.
Seems that the CollapsingMergeTree approach is working fine to "delete" data and is simpler than ReplacingMergeTree.
Another interesting thing about mutations.
Suppose you execute asynchronously: ALTER TABLE .... WHERE date = today().
Then you insert data for today().
If your mutation is still running during the insert, you'll have no issue. Your new data from the insert will not be removed by the running mutation.
When you execute ALTER TABLE .... WHERE date = today(), the mutation locks the table for a microsecond, creates a list of active parts, and stores this list in mutations.txt or ZooKeeper (if the table is replicated). After that the mutation begins to mutate the parts from this list, so it does not see new parts inserted afterwards.
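To illustrate (placeholder names; the ALTER runs asynchronously here):

```sql
-- The mutation snapshots the currently active parts and runs in the background
ALTER TABLE hotel_stats DELETE WHERE stay_date = today();

-- Rows inserted after the ALTER land in new parts, which the mutation never touches
INSERT INTO hotel_stats
SELECT * FROM staging_hotel_stats WHERE stay_date = today();

-- Check progress of pending mutations
SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'hotel_stats' AND is_done = 0;
```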