The merging process is mainly controlled by two variables: the maximum part size and the number of free workers in the background pool. If I understand it right, number_of_free_entries_in_pool_to_lower_max_size_of_merge means that the maximum part size is lowered exponentially when there are fewer than N free background workers, which effectively gives priority to smaller merges while the background workers are busy.
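For reference, the current values of these settings can be checked from system.merge_tree_settings; a minimal sketch (the LIKE patterns below are only illustrative and may match additional settings):

-- Sketch: inspect the merge-related settings mentioned above
SELECT
    name,
    value,
    description
FROM system.merge_tree_settings
WHERE (name LIKE '%space_in_pool%') OR (name LIKE '%free_entries_in_pool%')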
In our case this behaviour is not always a good thing. For example, in one scenario a lot of smaller (<10GB) parts accumulate, because during peak time the number of free background workers never goes above the threshold, so the maximum part size is kept low. During off-peak time, it would ideally start merging several smaller parts into maximum-size parts right away instead of gradually increasing the maximum part size (to avoid write amplification). The idea is to create another variable, min_bytes_to_merge_at_max_space_in_pool, similar to https://github.com/yandex/ClickHouse/blob/master/dbms/src/Storages/MergeTree/MergeTreeSettings.h#L29, and another threshold, which would enforce a minimum result part size in off-peak times.
Basically what we want is:
min_bytes_to_merge_at_min_space_in_pool ... min_bytes_to_merge_at_max_space_in_pool: (small) merges
min_bytes_to_merge_at_max_space_in_pool ... max_bytes_to_merge_at_max_space_in_pool: (medium) merges
max_bytes_to_merge_at_max_space_in_pool: (maximum size) merges

Does that sound reasonable, or do you have any other idea how to improve this?
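To make it concrete, a hypothetical sketch of how this could look per table (max_bytes_to_merge_at_max_space_in_pool is an existing MergeTree setting; min_bytes_to_merge_at_max_space_in_pool is the proposed one and does not exist yet; the table definition and values are made up):

CREATE TABLE example_events
(
    event_date Date,
    event_id UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY event_id
SETTINGS
    max_bytes_to_merge_at_max_space_in_pool = 161061273600, -- existing setting
    min_bytes_to_merge_at_max_space_in_pool = 10737418240   -- proposed setting, does not exist yet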
cc @dqminh @bobrik
We cannot simply enforce a minimum size for merges, because then there is a chance of accumulating many small parts whose total size stays far below the minimum threshold.
For example, if you set the minimum threshold to 1 GB and there are 999 parts of 1 MB each, they won't be merged.
If I understand correctly, you want merging to be more "aggressive" during off-peak periods and grab more parts to merge at once, instead of waiting for a better combination of parts with more uniform sizes.
That's difficult: there are many variables in SimpleMergeSelector.h, we are not sure the algorithm is optimal, and it is hard to tune. We have an intern student who will experiment with simulations of the merge algorithm to find optimal parameters and better variants of the algorithm (but there is a low chance that your case will be covered).
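In the meantime, the pool load and the running merges can be observed directly, something like this sketch (system.metrics and system.merges exist; the BackgroundPoolTask metric name may differ between versions):

-- Sketch: number of tasks currently running in the background pool
SELECT metric, value
FROM system.metrics
WHERE metric = 'BackgroundPoolTask';

-- Sketch: merges currently in progress and their sizes
SELECT
    database,
    table,
    round(elapsed, 1) AS elapsed_seconds,
    round(progress, 2) AS progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS total_size
FROM system.merges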
Another case for implementing aggressive merges that grab more small parts at once.
We had some MVs that created many small parts from a bigger NULL table firehose, and they did not merge very well, causing a huge number of parts that are merged, replicated, and deleted repeatedly.
SELECT
    database,
    table,
    count() AS total,
    countIf(event_type = 1) AS new_parts,
    countIf(event_type = 2) AS merged_parts,
    countIf(event_type = 3) AS replicated_parts,
    countIf(event_type = 4) AS deleted_parts,
    avgIf(length(merged_from), event_type = 2) AS avg_merged_from_parts,
    formatReadableSize(avg(size_in_bytes)) AS avg_part_size
FROM system.part_log
WHERE event_date = yesterday()
GROUP BY
    database,
    table
ORDER BY total DESC
LIMIT 50
┌─database─┬─table─┬──total─┬─new_parts─┬─merged_parts─┬─replicated_parts─┬─deleted_parts─┬─avg_merged_from_parts─┬─avg_part_size─┐
│ r1       │ table │ 119437 │         0 │        20315 │            39405 │         59717 │     2.938764459758799 │ 76.44 KiB     │
│ r2       │ table │ 104441 │         0 │        15341 │            36879 │         52221 │    3.3995176324881036 │ 78.17 KiB     │
│ r0       │ table │  99992 │     36452 │        13544 │                0 │         49996 │    3.6913762551683402 │ 78.40 KiB     │
└──────────┴───────┴────────┴───────────┴──────────────┴──────────────────┴───────────────┴───────────────────────┴───────────────┘
Here we insert into r0 and then replicate to r1 and r2. You can see that the part size is tiny, and the number of merged parts per merge is also quite small.
SELECT
    count(),
    length(merged_from) AS l,
    formatReadableSize(avg(size_in_bytes))
FROM system.part_log
WHERE (database = 'r0') AND (table = 'table') AND (event_time >= '2018-05-03 12:00:00') AND (event_time <= '2018-05-03 14:37:00') AND (event_type = 2)
GROUP BY l
┌─count()─┬─l─┬─formatReadableSize(avg(size_in_bytes))─┐
│    3706 │ 2 │ 913.85 KiB                             │
│     126 │ 6 │ 2.10 KiB                               │
│       6 │ 7 │ 2.22 KiB                               │
│       2 │ 8 │ 2.05 KiB                               │
└─────────┴───┴────────────────────────────────────────┘
Looking at the number of merged parts, the majority of merges combine only two parts, with a result size of around 900 KiB. This is off-peak for the cluster, and surely we could merge more than that.
Looking at an individual merge:
SELECT *
FROM system.part_log
WHERE (part_name = '20180501_20180503_0_80019_64201') AND (table = 'table')
┌─event_type─┬─event_date─┬──────────event_time─┬─size_in_bytes─┬─duration_ms─┬─database─┬─table─┬─part_name───────────────────────┬─merged_from───────────────────────────────────────────────────────────┐
│          2 │ 2018-05-03 │ 2018-05-03 13:59:31 │        948291 │          86 │ r0       │ table │ 20180501_20180503_0_80019_64201 │ ['20180501_20180503_0_80018_64200','20180503_20180503_80019_80019_0'] │
│          4 │ 2018-05-03 │ 2018-05-03 13:59:53 │        948291 │           0 │ r0       │ table │ 20180501_20180503_0_80019_64201 │ []                                                                     │
└────────────┴────────────┴─────────────────────┴───────────────┴─────────────┴──────────┴───────┴─────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘
SELECT *
FROM system.part_log
WHERE (part_name = '20180501_20180503_0_80018_64200') AND (table = 'table')
┌─event_type─┬─event_date─┬──────────event_time─┬─size_in_bytes─┬─duration_ms─┬─database─┬─table─┬─part_name───────────────────────┬─merged_from───────────────────────────────────────────────────────────┐
│          2 │ 2018-05-03 │ 2018-05-03 13:59:31 │        948286 │          97 │ r0       │ table │ 20180501_20180503_0_80018_64200 │ ['20180501_20180503_0_80017_64199','20180503_20180503_80018_80018_0'] │
│          4 │ 2018-05-03 │ 2018-05-03 13:59:31 │        948286 │           0 │ r0       │ table │ 20180501_20180503_0_80018_64200 │ []                                                                     │
└────────────┴────────────┴─────────────────────┴───────────────┴─────────────┴──────────┴───────┴─────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘
Here we can see that part '20180501_20180503_0_80019_64201' was merged from two parts, '20180501_20180503_0_80018_64200' and '20180503_20180503_80019_80019_0', with a final size of 948291 bytes. But part '20180501_20180503_0_80018_64200' was itself also merged from two parts, with a final size of 948286 bytes. So it looks like we keep repeatedly merging two parts together, one around 900 KiB and one around 5 bytes, which doesn't look right to me.
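A rough way to quantify the write amplification this causes, sketched from system.part_log (the database and table names are the ones from the example above; the ratio compares bytes written by merges to bytes written by inserts):

-- Sketch: merge write amplification for one table over one day
SELECT
    sumIf(size_in_bytes, event_type = 2) AS bytes_written_by_merges,
    sumIf(size_in_bytes, event_type = 1) AS bytes_written_by_inserts,
    bytes_written_by_merges / bytes_written_by_inserts AS merge_write_amplification
FROM system.part_log
WHERE (database = 'r0') AND (table = 'table') AND (event_date = yesterday())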
Did you guys find a resolution to this? We're experiencing something similar.
Another thing we noticed is that ClickHouse is for some reason prioritizing merges of materialized view parts over raw data table parts. I'm guessing this is because table selection for merges is somewhat random, but it would be nice if ClickHouse considered the table's active part count in its decision when selecting which table's parts to process.
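For reference, the active part count per table that it could take into account is easy to get from system.parts; a minimal sketch (LIMIT is arbitrary):

-- Sketch: active part count per table, highest first
SELECT
    database,
    table,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY
    database,
    table
ORDER BY active_parts DESC
LIMIT 20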