ClickHouse: Implement "min_bytes_to_merge_at_max_space_in_pool"

Created on 21 Dec 2017  ·  4 Comments  ·  Source: ClickHouse/ClickHouse

The merging process is mainly controlled by two variables: the maximum result part size and the number of free workers in the background pool. If I understand it correctly, number_of_free_entries_in_pool_to_lower_max_size_of_merge says that the maximum part size should be lowered exponentially when there are fewer than N free background workers, which effectively gives priority to smaller merges while the background pool is busy.

In our case that's not always a good thing. For example, in one scenario a lot of smaller (<10 GB) parts accumulate because during peak time the number of free background workers never rises above the threshold, so the maximum part size is kept low. During off-peak time, it would ideally start merging several of those smaller parts directly into maximum-size parts instead of gradually increasing the maximum part size (to avoid write amplification). The idea is to introduce another setting, min_bytes_to_merge_at_max_space_in_pool, similar to https://github.com/yandex/ClickHouse/blob/master/dbms/src/Storages/MergeTree/MergeTreeSettings.h#L29, plus another threshold, which would enforce a minimum result part size during off-peak times (a sketch of where such settings would sit is shown after the list below).

Basically what we want is:

  • 0 ... T1 free workers -> schedule min_bytes_to_merge_at_min_space_in_pool ... min_bytes_to_merge_at_max_space_in_pool (small) merges
  • T1 ... T2 free workers -> schedule min_bytes_to_merge_at_max_space_in_pool ... max_bytes_to_merge_at_max_space_in_pool (medium) merges
  • T2 ... free workers -> schedule max_bytes_to_merge_at_max_space_in_pool merges
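To make the proposal concrete, here is a minimal sketch of where these knobs would sit, assuming a ClickHouse version that accepts table-level MergeTree settings in a SETTINGS clause (the same values can also go into the server config's merge_tree section). The two min_bytes_* settings are the ones proposed in this issue and are hypothetical; the other settings are existing ones, shown with their documented defaults:

CREATE TABLE example
(
    d Date,
    k UInt64,
    v String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(d)
ORDER BY k
SETTINGS
    -- existing settings, shown with their default values
    max_bytes_to_merge_at_max_space_in_pool = 161061273600,   -- ~150 GiB cap when the pool is mostly free
    max_bytes_to_merge_at_min_space_in_pool = 1048576,        -- 1 MiB cap when the pool is almost full
    number_of_free_entries_in_pool_to_lower_max_size_of_merge = 8
    -- proposed in this issue (do not exist yet):
    -- , min_bytes_to_merge_at_max_space_in_pool = 10737418240  -- e.g. force >= 10 GiB results off-peak
    -- , min_bytes_to_merge_at_min_space_in_pool = 1073741824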

Does that sound reasonable or do you have any other idea how to improve this?

cc @dqminh @bobrik

enhancement st-discussion

All 4 comments

We cannot simply enforce a minimum size for merges, because otherwise there is a chance of accumulating many small parts whose total size stays far below the minimum threshold.

For example, if you set the minimum threshold to 1 GB and there are 999 parts of 1 MB each, their combined size is still below 1 GB, so they would never be merged.

If I understand correctly, you want merging to be more "aggressive" during off-peak periods, grabbing more parts per merge instead of waiting for a better combination of parts with more uniform sizes.

That's difficult: there are many variables in SimpleMergeSelector.h, we are not sure the algorithm is optimal, and it is hard to tune. We have an intern student who will experiment with simulations of the merge algorithm to find optimal parameters and better variants of the algorithm (but there is a low chance that your case will be covered).

Another case for implementing an aggressive merge mode that grabs more small parts at once.

We had some materialized views that created many small parts from a bigger Null-table firehose, and they do not merge very well, causing a huge number of parts that get merged, replicated, and deleted repeatedly.

SELECT
    database,
    table,
    count() AS total,
    countIf(event_type = 1) AS new_parts,
    countIf(event_type = 2) AS merged_parts,
    countIf(event_type = 3) AS replicated_parts,
    countIf(event_type = 4) AS deleted_parts,
    avgIf(length(merged_from), event_type = 2) AS avg_merged_from_parts,
    formatReadableSize(avg(size_in_bytes)) AS avg_part_size
FROM system.part_log
WHERE event_date = yesterday()
GROUP BY
    database,
    table
ORDER BY total DESC
LIMIT 50

β”Œβ”€database─┬─table─────────────────────────────────┬──total─┬─new_parts─┬─merged_parts─┬─replicated_parts─┬─deleted_parts─┬─avg_merged_from_parts─┬─avg_part_size─┐
β”‚ r1       β”‚ table                                 β”‚ 119437 β”‚         0 β”‚        20315 β”‚            39405 β”‚         59717 β”‚     2.938764459758799 β”‚ 76.44 KiB     β”‚
β”‚ r2       β”‚ table                                 β”‚ 104441 β”‚         0 β”‚        15341 β”‚            36879 β”‚         52221 β”‚    3.3995176324881036 β”‚ 78.17 KiB     β”‚
β”‚ r0       β”‚ table                                 β”‚  99992 β”‚     36452 β”‚        13544 β”‚                0 β”‚         49996 β”‚    3.6913762551683402 β”‚ 78.40 KiB     β”‚

Here we insert into r0 and then replicate to r1 and r2. You can see that the average part size is tiny, and the number of parts merged per merge is also quite small.

SELECT
    count(),
    length(merged_from) AS l,
    formatReadableSize(avg(size_in_bytes))
FROM system.part_log
WHERE (database = 'r0') AND (table = 'table') AND (event_time >= '2018-05-03 12:00:00') AND (event_time <= '2018-05-03 14:37:00') AND (event_type = 2)
GROUP BY l
β”Œβ”€count()─┬─l─┬─formatReadableSize(avg(size_in_bytes))─┐
β”‚    3706 β”‚ 2 β”‚ 913.85 KiB                             β”‚
β”‚     126 β”‚ 6 β”‚ 2.10 KiB                               β”‚
β”‚       6 β”‚ 7 β”‚ 2.22 KiB                               β”‚
β”‚       2 β”‚ 8 β”‚ 2.05 KiB                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Looking at the number of merged input parts, the majority of merges combine only two parts, with a result size of around 900 KiB. This is off-peak for the cluster, and surely we can merge more than that.
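To sanity-check the peak/off-peak difference, a query along the following lines (our own sketch, using only the system.part_log columns already shown above) groups merges by hour so the two periods can be compared:

SELECT
    toHour(event_time) AS hour,
    count() AS merges,
    avg(length(merged_from)) AS avg_inputs_per_merge,
    formatReadableSize(avg(size_in_bytes)) AS avg_result_size
FROM system.part_log
WHERE (database = 'r0') AND (table = 'table') AND (event_type = 2) AND (event_date = '2018-05-03')
GROUP BY hour
ORDER BY hour ASC

If off-peak hours still show avg_inputs_per_merge close to 2, the selector is not taking advantage of the idle pool.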

Looking at an individual merge:

SELECT *
FROM system.part_log
WHERE (part_name = '20180501_20180503_0_80019_64201') AND (table = 'table')
β”Œβ”€event_type─┬─event_date─┬──────────event_time─┬─size_in_bytes─┬─duration_ms─┬─database─┬─table──────────────────┬─part_name───────────────────────┬─merged_from───────────────────────────────────────────────────────────┐
β”‚          2 β”‚ 2018-05-03 β”‚ 2018-05-03 13:59:31 β”‚        948291 β”‚          86 β”‚ r0       β”‚ table β”‚ 20180501_20180503_0_80019_64201 β”‚ ['20180501_20180503_0_80018_64200','20180503_20180503_80019_80019_0'] β”‚
β”‚          4 β”‚ 2018-05-03 β”‚ 2018-05-03 13:59:53 β”‚        948291 β”‚           0 β”‚ r0       β”‚ table β”‚ 20180501_20180503_0_80019_64201 β”‚ []                                                                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

SELECT *
FROM system.part_log
WHERE (part_name = '20180501_20180503_0_80018_64200') AND (table = 'table')
β”Œβ”€event_type─┬─event_date─┬──────────event_time─┬─size_in_bytes─┬─duration_ms─┬─database─┬─table──────────────────┬─part_name───────────────────────┬─merged_from───────────────────────────────────────────────────────────┐
β”‚          2 β”‚ 2018-05-03 β”‚ 2018-05-03 13:59:31 β”‚        948286 β”‚          97 β”‚ r0       β”‚ table β”‚ 20180501_20180503_0_80018_64200 β”‚ ['20180501_20180503_0_80017_64199','20180503_20180503_80018_80018_0'] β”‚
β”‚          4 β”‚ 2018-05-03 β”‚ 2018-05-03 13:59:31 β”‚        948286 β”‚           0 β”‚ r0       β”‚ table β”‚ 20180501_20180503_0_80018_64200 β”‚ []                                                                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Here we can see that part '20180501_20180503_0_80019_64201' is merged from the two parts '20180501_20180503_0_80018_64200' and '20180503_20180503_80019_80019_0', with a final size of 948291 bytes. But part '20180501_20180503_0_80018_64200' is itself also merged from two parts, with a final size of 948286 bytes. So it looks like we keep repeatedly merging two parts together, one around 900 KiB and one around 5 bytes, which doesn't look right to me.
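One way to put a number on the resulting write amplification is a sketch like this (our own query; it relies only on the standard system.part_log columns, where event_type 1 is a new part and 2 is a merge):

SELECT
    sumIf(size_in_bytes, event_type = 2) AS bytes_written_by_merges,
    sumIf(size_in_bytes, event_type = 1) AS bytes_written_by_inserts,
    bytes_written_by_merges / bytes_written_by_inserts AS merge_write_amplification
FROM system.part_log
WHERE (database = 'r0') AND (table = 'table') AND (event_date = yesterday())

When a ~900 KiB part keeps being rewritten just to absorb a few extra bytes, this ratio climbs even though the table itself barely grows.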

Did you guys find a resolution to this? We're experiencing something similar.

Another thing we noticed is that ClickHouse is, for some reason, prioritizing merges of materialized view parts over raw data table parts. I'm guessing this is because table selection for merges is somewhat random, but it would be nice if ClickHouse considered a table's active part count in its decision when selecting which table's parts to process.
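For anyone debugging the same thing, a quick way to see which tables are accumulating active parts is a plain query against system.parts (the 20-row limit is arbitrary):

SELECT
    database,
    table,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY
    database,
    table
ORDER BY active_parts DESC
LIMIT 20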

