Clickhouse: Performance degradation of queries containing an IN operator in all versions starting from 1.1.54378

Created on 28 Jun 2018  路  6Comments  路  Source: ClickHouse/ClickHouse

We observed a significant performance degradation when upgraded from 1.1.54310 to 1.1.54385.
Our testing shows that performance problems happen with queries containing an IN operator with a subquery inside that returns many rows (>10M) and it appeared somewhere between 1.1.54343 and 1.1.54378.

Example that is close to our query and reproduces the issue:

CREATE TABLE test_1
(
    date Date,
    number UInt64
)
ENGINE = MergeTree(date, number, 8192);


INSERT INTO test_1 SELECT
    number / 10000,
    rand()
FROM system.numbers
LIMIT 100000000;


CREATE TABLE test_2
(
    date Date,
    number_1 UInt64,
    number_2 UInt64
)
ENGINE = MergeTree(date, (number_1, number_2), 8192);


INSERT INTO test_2 SELECT
    number / 100000,
    rand(),
    rand()
FROM system.numbers
LIMIT 1000000000;


SELECT number
FROM test_1
WHERE number IN
(
    SELECT number_1
    FROM test_2
    PREWHERE number_2 < 100000000
)
LIMIT 10;

The last select in the example on 1.1.54343 gives:

10 rows in set. Elapsed: 2.590 sec. Processed 1.00 billion rows, 8.22 GB (386.63 million rows/s., 3.18 GB/s.)

and on 1.1.54378:

10 rows in set. Elapsed: 23.436 sec. Processed 16.38 thousand rows, 131.07 KB (699.10 rows/s., 5.59 KB/s.)

Replacing the IN with a JOIN solved the problem in our case. But obviously IN after 1.1.54343 works in a different way and sometimes it is not optimal.

I will be happy to provide any additional details!

performance

Most helpful comment

It's not normal. Will fix.

All 6 comments

This is because of the features:

ClickHouse release 1.1.54343
Now the table index is used for conditions like expr IN (subquery).

(ClickHouse release 1.1.54356)
An index of MergeTree tables is used when IN is applied to a tuple of expressions from the
columns of the primary key. Example: WHERE (UserID, EventDate) IN ((123, '2000-01-01'), ...)
(Anastasiya Tsarkova).

It seems that you should to use another set/order of columns in PK if you need to filter by NOT number column.

Dirty workaround to trick "optimizer":

concat(toString(number),'') IN
(
    SELECT toString(number_1)

It's not normal. Will fix.

1804

You can mitigate it with SET use_index_for_in_with_subqueries = 0

2584

Fixed in master with #2584 and #2677.

Was this page helpful?
0 / 5 - 0 ratings