Hello:
I'm trying to understand the part selection algorithm in SimpleMergeSelector.h/cpp, but I find it hard to follow. Are there any other docs that could help, e.g. design documentation, recordings of discussions, etc.?
Thanks a lot.
We have a set of data parts that is dynamically changing - new data parts are added and there is a background merging process.
The background merging process periodically selects a contiguous range of data parts to merge.
It tries to optimize two metrics: the number of data parts in the set, and the write amplification.
It also takes additional considerations into account, such as the sizes and ages of the parts.
It is not possible to optimize both metrics, because they contradict each other.
To lower the number of parts we can merge eagerly, but then write amplification will increase.
So we need some balance between the optimization of these two metrics.
But some optimizations may improve both metrics.
For example, we can look at the "merge tree" - the tree of data parts that were merged. If the tree is perfectly balanced, then its depth is proportional to log(data_size), the total amount of work is proportional to data_size * log(data_size), and the write amplification is proportional to log(data_size). If it's not balanced (e.g. every new data part is always merged with the existing data parts), its depth is proportional to the data size and the total amount of work is proportional to data_size^2.
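To make these asymptotics concrete, here is a toy program (not ClickHouse code; everything in it is made up for illustration) that counts the total bytes written by merges for n inserted parts of size 1, once for a perfectly balanced binary merge tree and once for the fully unbalanced strategy:

#include <cstdio>

// Toy illustration only: compare total bytes written by merges for n inserted
// parts of size 1 under two merge policies.
int main()
{
    const long long n = 1LL << 20;  // number of inserted parts, each of size 1

    // Balanced binary merge tree: a perfectly balanced tree has log2(n) levels,
    // and every level rewrites all n bytes once.
    long long balanced_written = 0;
    for (long long level_size = 2; level_size <= n; level_size *= 2)
        balanced_written += n;

    // Fully unbalanced: every new part is immediately merged into the single
    // big part, so the i-th merge rewrites i bytes.
    long long unbalanced_written = 0;
    for (long long i = 2; i <= n; ++i)
        unbalanced_written += i;

    std::printf("balanced:   written = %lld, write amplification ~ %.1f\n",
                balanced_written, double(balanced_written) / n);
    std::printf("unbalanced: written = %lld, write amplification ~ %.1f\n",
                unbalanced_written, double(unbalanced_written) / n);
}

For n = 2^20 this prints a write amplification of 20 (i.e. log2(n)) for the balanced tree and about n / 2 for the unbalanced one, matching the estimates above.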
We can also control the "base of the logarithm": you can think of it as the number of data parts that are merged at once (the tree "arity"). But as the data parts are of different sizes, we should generalize it: calculate the ratio of the total size of the merged parts to the size of the largest part participating in the merge. For example, if we merge 4 parts of sizes 5, 3, 1, 1, then the "base" will be 2 (10 / 5).
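For instance, the ratio from that example can be computed directly (a throwaway snippet for illustration, not library code):

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    // Parts of size 5, 3, 1, 1: the generalized "base" of this merge is the
    // total size divided by the size of the largest participating part.
    std::vector<double> part_sizes = {5, 3, 1, 1};
    double sum = 0;
    for (double s : part_sizes)
        sum += s;
    double largest = *std::max_element(part_sizes.begin(), part_sizes.end());
    std::printf("base = %.1f\n", sum / largest);  // prints: base = 2.0
}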
/** Minimum ratio of size of one part to all parts in set of parts to merge (for usual cases).
* For example, if all parts have equal size, it means that at least 'base' number of parts should be merged.
* If parts have non-uniform sizes, then the minimum number of parts to merge is effectively increased.
* This behaviour balances merge-tree workload.
* It is called 'base', because merge-tree depth could be estimated as a logarithm with that base.
*
* If base is higher - then the tree gets wider and shallower, lowering write amplification.
* If base is lower - then merges occur more frequently, lowering the number of parts on average.
*
* We need some balance between write amplification and number of parts.
*/
double base = 5;
The base of the logarithm (simply called base in SimpleMergeSelector) is the main knob to control the write amplification. The higher it is, the lower the write amplification, but the more data parts we will have on average.
To fit all the considerations, we also adjust base depending on the total parts count, parts size and parts age, using linear interpolation (so base is not a constant but a function of multiple variables, looking like a section of a hyperplane).
/** Base is lowered until 1 (effectively means "merge any two parts") depending on several variables:
*
* 1. Total number of parts in partition. If too many - then base is lowered.
* It means: when too many parts - do merges more urgently.
*
* 2. Minimum age of parts participating in merge. If higher age - then base is lowered.
* It means: do less wide merges only rarely.
*
* 3. Sum size of parts participating in merge. If higher - then more age is required to lower base. So, base is lowered more slowly.
* It means: for small parts, it's worth merging faster, even if not so wide or balanced.
*
* We have a multivariate dependency. Let it be logarithmic in size and somewhat multi-linear in the other variables,
* between some boundary points, and constant outside.
*/
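As a rough sketch of how such an adjustment can look (the function and parameter names and the threshold values below are hypothetical, not the actual SimpleMergeSelector settings, and the size dependency from point 3 is omitted for brevity):

#include <algorithm>
#include <cstddef>

// Hypothetical sketch: linear interpolation between two boundary points,
// constant outside of them.
static double interpolateLinear(double min_x, double max_x, double min_y, double max_y, double x)
{
    if (x <= min_x)
        return min_y;
    if (x >= max_x)
        return max_y;
    return min_y + (max_y - min_y) * (x - min_x) / (max_x - min_x);
}

// Hypothetical sketch: lower 'base' towards 1 when the partition has too many
// parts or when the parts participating in the merge are old.
static double adjustedBase(double base, size_t parts_count, double min_age_seconds)
{
    const double parts_to_start_lowering = 10;    // hypothetical boundary points
    const double parts_to_reach_base_one = 300;
    const double age_to_start_lowering = 10;      // seconds
    const double age_to_reach_base_one = 3600;    // seconds

    double base_by_count = interpolateLinear(parts_to_start_lowering, parts_to_reach_base_one,
                                             base, 1.0, static_cast<double>(parts_count));
    double base_by_age = interpolateLinear(age_to_start_lowering, age_to_reach_base_one,
                                           base, 1.0, min_age_seconds);

    return std::min(base_by_count, base_by_age);
}

Either pressure alone is enough to pull base down to 1, which effectively means "merge any two parts".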
Then we apply the algorithm to select the optimal range of data parts to merge. There is a proof that this algorithm is optimal if we look only a single step into the future.
static double score(double count, double sum_size, double sum_size_fixed_cost)
{
    /** Consider we have two alternative ranges of data parts to merge.
      * Assume time to merge a range is proportional to the sum size of its parts.
      *
      * Cost of query execution is proportional to the total number of data parts at a given moment in time.
      * Let's define our target: to minimize the average (in time) total number of data parts.
      *
      * Let's calculate the integral of the total number of parts, if we are going to merge one or another range.
      * It must be lower, and thus we decide which range is better to merge.
      *
      * The integral is lower iff the following formula is lower:
      *
      *     sum_size / (count - 1)
      *
      * But we have some tweaks to prefer longer ranges.
      */
    return (sum_size + sum_size_fixed_cost * count) / (count - 1.9);
}
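The comment states the criterion without a derivation, so here is a sketch of the argument under exactly the assumptions above (the names N, c_A, s_A, c_B, s_B are introduced only for this sketch). Suppose there are N parts in total and we will eventually merge both candidate ranges A and B, where range X consists of c_X parts of total size s_X, takes time proportional to s_X to merge, and reduces the part count by c_X - 1. If we merge A first and then B, the integral of the part count over the two merges is

    I_AB = N * s_A + (N - (c_A - 1)) * s_B

and symmetrically I_BA = N * s_B + (N - (c_B - 1)) * s_A for the other order. Subtracting,

    I_AB - I_BA = (c_B - 1) * s_A - (c_A - 1) * s_B

so merging A first gives the lower integral exactly when s_A / (c_A - 1) < s_B / (c_B - 1), i.e. the range with the smaller sum_size / (count - 1) should be merged first. The actual return value then applies the "tweaks to prefer longer ranges": the denominator uses count - 1.9 instead of count - 1, and sum_size_fixed_cost * count adds a fixed per-part cost to the numerator.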
The best range of data parts is selected.
On top of this score, we also apply some additional tweaks.
It's still unclear whether this algorithm is good or optimal at all, and whether it uses the optimal coefficients.
To test and optimize SimpleMergeSelector, we apply the following methods:
system.part_log from production - it gives realistic information about inserted data parts: their sizes and the time intervals at which they are inserted.
There is a research thesis dedicated to the optimization of the merge algorithm:
https://presentations.clickhouse.tech/hse_2019/merge_algorithm.pptx
This work attempted to vary the coefficients in SimpleMergeSelector and to solve the optimization task: maybe some change in the coefficients would give a clear win on all metrics. Surprisingly enough, it found that our selection of coefficients is near optimal. It found slightly better coefficients, but I decided not to use them, because the representativeness of the test data is in question.
This work did not attempt to propose any other algorithm, and it did not attempt to analyze the task with analytical methods. That's why I still believe that there are many opportunities to optimize the merge selection algorithm.
Please do not conflate this task with the similar task in other LSM-based systems (like RocksDB). Their problem statement is subtly different. Our set consists of data parts that are completely independent in the data they store. Ranges of primary keys in data parts can intersect. When doing a SELECT, we read from all data parts. INSERTed data parts come with unknown sizes...