ClickHouse: Massive slowdown when simply including a (large) column in the result

Created on 4 Oct 2019  ·  17 Comments  ·  Source: ClickHouse/ClickHouse

Describe the situation
Simply adding a column to the result (even if it is only used for a final lookup) slows the whole query down massively, since as far as I can tell the column gets read completely.

Please note that I'm just evaluating ClickHouse and I might be doing something dumb when specifying the table layout or writing queries.

How to reproduce
Running on a single host with NVMe (read bandwidth ~2GiB/s) with ClickHouse 19.15.2.2

Dataset is publicly available: https://blob.dolansoft.org/datasets/boyter-10m-repos/clickhouse-columns.tar (:warning: ~50GiB / 3.5B+ rows)

ATTACH TABLE repositories
(
    `id` UInt32 CODEC(DoubleDelta), 
    `source` LowCardinality(String) CODEC(LZ4), 
    `user` String CODEC(LZ4), 
    `name` String CODEC(LZ4)
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192

ATTACH TABLE files
(
    `repository_id` UInt32 CODEC(DoubleDelta), 
    `location` String CODEC(ZSTD(1)), 
    `language` LowCardinality(String), 
    `possible_languages` Array(LowCardinality(String)), 
    `bytes` UInt32 CODEC(T64, LZ4), 
    `lines` UInt32 CODEC(T64, LZ4), 
    `code` UInt32 CODEC(T64, LZ4), 
    `comment` UInt32 CODEC(T64, LZ4), 
    `blank` UInt32 CODEC(T64, LZ4), 
    `complexity` UInt32 CODEC(T64, LZ4), 
    `weighted_complexity` UInt32 CODEC(T64, LZ4)
)
ENGINE = MergeTree
ORDER BY repository_id
SETTINGS index_granularity = 8192

Query A (fast):

SELECT
    repository_id,
    lines
FROM files
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1

┌─repository_id─┬─────lines─┐
│       3919430 │ 347671811 │
└───────────────┴───────────┘

1 rows in set. Elapsed: 3.587 sec. Processed 3.53 billion rows, 4.82 GB (983.87 million rows/s., 1.34 GB/s.) 

Query B (slow):

SELECT
    repository_id,
    location, 
    lines
FROM files
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1

┌─repository_id─┬─location────────────────┬─────lines─┐
│       3919430 │ data/1366100696temp.txt │ 347671811 │
└───────────────┴─────────────────────────┴───────────┘

1 rows in set. Elapsed: 25.234 sec. Processed 3.53 billion rows, 13.53 GB (139.87 million rows/s., 536.21 MB/s.) 

Expected performance
I would expect both of these queries to take approximately the same time, since ClickHouse could ignore the location column until it has found the single matching row and then read just that one chunk from the (much heavier) location column. Instead it appears to read the whole location column, which slows the query down around 8x. I've also tried argMax() instead of ORDER BY x LIMIT 1, but it seems to suffer from the same issue.

Originally I also had a join to repositories in there, but that did not change the performance of either query so I've removed it in the interest of a more minimal reproducer.

Labels: performance, st-hold

All 17 comments

Try replacing WHERE with PREWHERE:

SELECT
    repository_id,
    location, 
    lines
FROM files
PREWHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1

It makes the first query, which was already fast, around 30% faster, but does nothing for the already slow one.

Can you show the output of the slow query with debug info?

set send_logs_level='trace';
...SELECT ...

Sure:

2019.10.03 23:37:53.986197 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> executeQuery: (from 127.0.0.1:37212) SELECT repository_id, location, lines FROM files PREWHERE language = 'Plain Text' ORDER BY lines DESC LIMIT 1
2019.10.03 23:37:53.986966 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> default.files (SelectExecutor): Key condition: unknown
2019.10.03 23:37:53.986979 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> default.files (SelectExecutor): Selected 1 parts by date, 1 parts by key, 430850 marks to read from 1 ranges
2019.10.03 23:37:53.987026 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> default.files (SelectExecutor): Reading approx. 3529516251 rows with 24 streams
2019.10.03 23:37:53.987452 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2019.10.03 23:37:53.987947 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> executeQuery: Query pipeline:
Limit
 Expression
  MergeSorting
   Union
    PartialSorting × 24
     Expression
      MergeTreeThread

┌─repository_id─┬─location────────────────┬─────lines─┐
│       3919430 │ data/1366100696temp.txt │ 347671811 │
└───────────────┴─────────────────────────┴───────────┘

2019.10.03 23:38:20.049494 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> UnionBlockInputStream: Waiting for threads to finish
2019.10.03 23:38:20.049533 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> UnionBlockInputStream: Waited for threads to finish
2019.10.03 23:38:20.049620 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Information> executeQuery: Read 3529516251 rows, 12.60 GiB in 26.063 sec., 135421628 rows/sec., 495.10 MiB/sec.
2019.10.03 23:38:20.049634 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> MemoryTracker: Peak memory usage (for query): 375.26 MiB.

1 rows in set. Elapsed: 26.065 sec. Processed 3.53 billion rows, 13.53 GB (135.41 million rows/s., 519.11 MB/s.) 

repository_id UInt32 CODEC(DoubleDelta),

This should probably be CODEC(DoubleDelta, LZ4), because DoubleDelta does not do compression.

It seems the reason is slow decompression of ZSTD(1), and ClickHouse decompresses repository_id, location, and lines before applying ORDER BY ... LIMIT.

Try this for a test (to check my guess):

SELECT
    repository_id,
    location, 
    lines
FROM files
PREWHERE repository_id in (
           SELECT repository_id FROM files
           PREWHERE language = 'Plain Text'
           ORDER BY lines DESC
           LIMIT 1
           )
ORDER BY lines DESC
LIMIT 1

Thanks! Your query is even a bit faster than my first one, but I have to add the language constraint to the outer query, otherwise I sometimes get wrong results.
As for the compounding factors:

  • I'm pretty sure that DoubleDelta does in fact compress the data (446MiB column size for 3.5B rows -> ~0.1 bytes per row), but LZ4 does improve further on its compression rate. I might switch this over to (Delta, LZ4) anyway, but it's still a very small column, so it doesn't matter much.
  • I also don't think that ZSTD is contributing to the slowness: the CPU is not fully loaded, while the disk delivers a bit less than 2GiB/s at 100% utilization, and the column is ~30GiB compressed and ~175GiB uncompressed (see the size-check query after this list).

The only question that remains: could ClickHouse do better on this query, or is there something stopping it from doing something similar itself? Obviously it can't do exactly what you've done, since this optimization depends on noticing that the number of files per repository is generally very small, but it could for example load the column lazily, or at least not read all chunks from disk when it will not use 99.9% of them anyway. If this doesn't fit the ClickHouse dev team's agenda I think we can close this; otherwise let's leave it open as a tracking issue for this feature.

I'm pretty sure that DoubleDelta does in fact compress the data

It is.

I might switch this over to a (Delta, LZ4) anyways, but it's still a very small column so it doesn't matter much.

For now I would recommend doing that (you can also consider T64), as DoubleDelta is extremely slow at the moment; see the sketch below.

That problem will be addressed in further releases. https://github.com/ClickHouse/ClickHouse/issues/7082

Let's use #7082 to continue tracking that issue.

@filimonov the issue is not related to CODECs.

-- b is a heavy column (~1 KiB per row); id and a are cheap and drive the sort.
create table Xp(id Int64, a UInt8, b String) Engine = MergeTree order by id
settings index_granularity=1;

insert into Xp select number,
        number%103,
        toString(arrayMap(x->sipHash64(number,x), range(50)))
  from numbers(10000000);

set max_threads=1;

select id, a, b from Xp order by a limit 1;
Elapsed: 18.953 sec. 

select id, a, b from Xp where id = (select id from Xp order by a limit 1);
Elapsed: 3.014 sec.

Ok, reopened. But is it still the same issue as in the original issue description?

Yes, the same issue. Sort of feature request for performance improvement.

It looks quite easy to work around with @den-crane's proposed option, or (with even better speed) with argMin, i.e. something like

SELECT id, a, b FROM Xp
WHERE id IN (SELECT argMin(id, a) FROM Xp)
ORDER BY a ASC
LIMIT 1

But would such an optimisation be good "in general"?
For example, imagine that the id extracted by the subquery matches (let's say) 80% of the rows in the original table, or that the subquery returns NULL.

select id, a, b from Xp where id = (select id from Xp order by a limit 1);

It's not about that specific optimization (as I've said above) but about ClickHouse reading whole columns when it only needs a tiny fraction of their values. The workaround is application-specific, but the solution would actually be generalizable.

@lorenz It is wonderful that we have this dataset publicly available!
Maybe we can use it in our performance tests and examples...

Is it OK if we copy and redistribute this dataset from our servers?

We have similar optimization in mind: https://github.com/ClickHouse/ClickHouse/issues/5329

We will implement this optimization but it will be disabled by default, because we don't use the same table snapshot to process subqueries.

@lorenz Could you please describe this dataset?
And why don't I see ClickHouse inside?

@alexey-milovidov The dataset is meant to be attached directly to ClickHouse. You should be able to extract it to /var/lib/clickhouse/data/$some_db/ and then just run the ATTACH statements from the first post.

The data was not collected by me, it comes from here. I've just asked the author for the raw data and then imported it into ClickHouse for testing and distribution (the original dataset is huge, around 1TiB, since it is all JSON). From my side, feel free to do whatever you want with the data. The cluster which hosts the data has a few Gbps of unmetered bandwidth, so you don't need to be concerned about bandwidth on my side. I also host the raw data at https://blob.dolansoft.org/datasets/boyter-10m-repos/raw.tar.gz

EDIT: If you want to double-check some analyses against the original blog post, be aware that some of them are wrong because the author's code has floating point accuracy issues. I've reported these, but he didn't update the post.

Yes, I have loaded the data but it shows strange numbers.
And the dataset doesn't contain our repository probably because it is from 2015.
