Describe the situation
Simply adding a column to the result (even if it is only used for a final lookup) slows the whole query down massively, since as far as I can tell the column gets read completely.
Please note that I'm just evaluating ClickHouse and I might be doing something dumb when specifying the table layout or writing queries.
How to reproduce
Running on a single host with NVMe (read bandwidth ~2 GiB/s) with ClickHouse 19.15.2.2
Dataset is publicly available: https://blob.dolansoft.org/datasets/boyter-10m-repos/clickhouse-columns.tar (:warning: ~50GiB / 3.5B+ rows)
ATTACH TABLE repositories
(
`id` UInt32 CODEC(DoubleDelta),
`source` LowCardinality(String) CODEC(LZ4),
`user` String CODEC(LZ4),
`name` String CODEC(LZ4)
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192
ATTACH TABLE files
(
`repository_id` UInt32 CODEC(DoubleDelta),
`location` String CODEC(ZSTD(1)),
`language` LowCardinality(String),
`possible_languages` Array(LowCardinality(String)),
`bytes` UInt32 CODEC(T64, LZ4),
`lines` UInt32 CODEC(T64, LZ4),
`code` UInt32 CODEC(T64, LZ4),
`comment` UInt32 CODEC(T64, LZ4),
`blank` UInt32 CODEC(T64, LZ4),
`complexity` UInt32 CODEC(T64, LZ4),
`weighted_complexity` UInt32 CODEC(T64, LZ4)
)
ENGINE = MergeTree
ORDER BY repository_id
SETTINGS index_granularity = 8192
Query A (fast):
SELECT
repository_id,
lines
FROM files
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1
┌─repository_id─┬─────lines─┐
│       3919430 │ 347671811 │
└───────────────┴───────────┘
1 rows in set. Elapsed: 3.587 sec. Processed 3.53 billion rows, 4.82 GB (983.87 million rows/s., 1.34 GB/s.)
Query B (slow):
SELECT
repository_id,
location,
lines
FROM files
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1
┌─repository_id─┬─location────────────────┬─────lines─┐
│       3919430 │ data/1366100696temp.txt │ 347671811 │
└───────────────┴─────────────────────────┴───────────┘
1 rows in set. Elapsed: 25.234 sec. Processed 3.53 billion rows, 13.53 GB (139.87 million rows/s., 536.21 MB/s.)
Expected performance
I would expect both of these queries to take approximately the same time, since ClickHouse could ignore the location column until it has found the single matching row and then read just that one chunk from the (much heavier) location column. Instead, it looks like it tries to read the whole location column, which slows the query down around 8 times. I've also tried argMax() instead of ORDER BY x LIMIT 1, but it seems to suffer from the same issue.
Originally I also had a join to repositories in there, but that did not change the performance of either query so I've removed it in the interest of a more minimal reproducer.
Try to replace WHERE with PREWHERE:
SELECT
repository_id,
location,
lines
FROM files
PREWHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1
It makes the first query that was already fast around 30% faster, but doesn't do anything for the already slow one.
can you show the output of slow query with debug info
set send_logs_level='trace';
...SELECT ...
Sure:
2019.10.03 23:37:53.986197 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> executeQuery: (from 127.0.0.1:37212) SELECT repository_id, location, lines FROM files PREWHERE language = 'Plain Text' ORDER BY lines DESC LIMIT 1
2019.10.03 23:37:53.986966 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> default.files (SelectExecutor): Key condition: unknown
2019.10.03 23:37:53.986979 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> default.files (SelectExecutor): Selected 1 parts by date, 1 parts by key, 430850 marks to read from 1 ranges
2019.10.03 23:37:53.987026 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> default.files (SelectExecutor): Reading approx. 3529516251 rows with 24 streams
2019.10.03 23:37:53.987452 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2019.10.03 23:37:53.987947 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> executeQuery: Query pipeline:
Limit
Expression
MergeSorting
Union
PartialSorting × 24
Expression
MergeTreeThread
┌─repository_id─┬─location────────────────┬─────lines─┐
│       3919430 │ data/1366100696temp.txt │ 347671811 │
└───────────────┴─────────────────────────┴───────────┘
2019.10.03 23:38:20.049494 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> UnionBlockInputStream: Waiting for threads to finish
2019.10.03 23:38:20.049533 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Trace> UnionBlockInputStream: Waited for threads to finish
2019.10.03 23:38:20.049620 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Information> executeQuery: Read 3529516251 rows, 12.60 GiB in 26.063 sec., 135421628 rows/sec., 495.10 MiB/sec.
2019.10.03 23:38:20.049634 {cd510939-ded6-48f7-9f22-8198d3a6193c} [ 70 ] <Debug> MemoryTracker: Peak memory usage (for query): 375.26 MiB.
1 rows in set. Elapsed: 26.065 sec. Processed 3.53 billion rows, 13.53 GB (135.41 million rows/s., 519.11 MB/s.)
`repository_id` UInt32 CODEC(DoubleDelta),
This should probably be CODEC(DoubleDelta, LZ4),
because DoubleDelta does not do compression.
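If I read the suggestion correctly, the column definition would then look like this (sketch mirroring the ATTACH statement from the first post):

```sql
`repository_id` UInt32 CODEC(DoubleDelta, LZ4)
```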
It seems the reason is slow decompression of ZSTD(1), and ClickHouse decompresses repository_id, location and lines before applying ORDER BY ... LIMIT.
Try this for a test (to check my guess):
SELECT
repository_id,
location,
lines
FROM files
PREWHERE repository_id in (
SELECT repository_id FROM files
PREWHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1
)
ORDER BY lines DESC
LIMIT 1
Thanks! Your query is even a bit faster than my first one, but I have to add the language constraint to the outer query, otherwise I sometimes get wrong results.
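For reference, here is my best guess at the corrected workaround with the language constraint repeated in the outer query (sketch only, reconstructed from the description above, not re-run against the dataset):

```sql
SELECT
    repository_id,
    location,
    lines
FROM files
PREWHERE repository_id IN
(
    SELECT repository_id
    FROM files
    PREWHERE language = 'Plain Text'
    ORDER BY lines DESC
    LIMIT 1
)
WHERE language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1
```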
As for the compounding factors:
The only question that remains: could ClickHouse do better on this query, or is there something stopping it from doing something similar itself? Obviously it can't do exactly what you've done, since that optimization depends on noticing that the number of files per repository is generally very small, but it could for example lazily load the column, or at least not read all chunks from disk when it will not use 99.9% of them anyway. If this doesn't fit the ClickHouse dev team's agenda, I think we can close this; otherwise, let's leave it open as a tracking issue for this feature.
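As a side note, how much heavier the location column is than the others can be checked directly via system.columns (sketch, not from the original thread; database/table names match the setup from the first post):

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = 'default' AND table = 'files'
ORDER BY data_uncompressed_bytes DESC
```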
I'm pretty sure that DoubleDelta does in fact compress the data
It is.
I might switch this over to a (Delta, LZ4) anyways, but it's still a very small column so it doesn't matter much.
For now I would recommend you to do that (you can also consider T64), as DoubleDelta is extremely slow right now.
That problem will be addressed in further releases. https://github.com/ClickHouse/ClickHouse/issues/7082
Let's use #7082 to continue tracking that issue.
@filimonov the issue is not related to CODECs.
create table Xp(id Int64, a UInt8, b String) Engine = MergeTree order by id
settings index_granularity=1;
insert into Xp select number,
number%103,
toString(arrayMap(x->sipHash64(number,x), range(50)))
from numbers(10000000);
set max_threads=1;
select id, a, b from Xp order by a limit 1;
Elapsed: 18.953 sec.
select id, a, b from Xp where id = (select id from Xp order by a limit 1);
Elapsed: 3.014 sec.
Ok, reopened. But is it still the same issue as in the original issue description?
Yes, the same issue. Sort of feature request for performance improvement.
It looks quite easy to work around with @den-crane's proposed option, or (with even better speed) with argMin, i.e. something like
SELECT id, a, b FROM Xp
WHERE id IN ( SELECT argMin(id, a) FROM Xp )
ORDER BY a ASC
LIMIT 1
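Translated back to the files table from the first post, the same trick might look like this (sketch only, under the assumption that argMax(repository_id, lines) picks the repository holding the longest Plain Text file; the language constraint is repeated in the outer query to avoid false matches):

```sql
SELECT
    repository_id,
    location,
    lines
FROM files
WHERE repository_id IN
(
    SELECT argMax(repository_id, lines)
    FROM files
    WHERE language = 'Plain Text'
)
  AND language = 'Plain Text'
ORDER BY lines DESC
LIMIT 1
```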
But would such an optimisation be good "in general"?
For example, imagine that the id extracted by the subquery matches (let's say) 80% of the rows in the original table, or that the subquery returns NULL.
select id, a, b from Xp where id = (select id from Xp order by a limit 1);
It's not about that specific optimization (as I've said above), but about ClickHouse reading whole columns when it only needs a tiny number of their values. The workaround is application-specific, but the solution would actually be generalizable.
@lorenz It is wonderful that we have this dataset publicly available!
Maybe we can use it in our performance tests and examples...
Is it Ok if we will copy and redistribute this dataset from our servers?
We have similar optimization in mind: https://github.com/ClickHouse/ClickHouse/issues/5329
We will implement this optimization but it will be disabled by default, because we don't use the same table snapshot to process subqueries.
@lorenz Could you please describe this dataset?
And why I don't see ClickHouse inside?
@alexey-milovidov The dataset is for directly attaching to ClickHouse. You should be able to extract it to /var/lib/clickhouse/data/$some_db/ and then just use the attach commands in the first post.
The data was not collected by me, it comes from here. I've just asked the author for the raw data and then imported it into ClickHouse for testing and distribution (the original dataset is huge, around 1 TiB, since it is all JSON). From my side, feel free to do whatever you want with the data. The cluster which hosts it has a few Gbps of unmetered bandwidth, so you don't need to be concerned about bandwidth on my side. I also host the raw data at https://blob.dolansoft.org/datasets/boyter-10m-repos/raw.tar.gz
EDIT: If you want to double-check some analyses against the original blog post, be aware that some of them are wrong because the author's code has floating point accuracy issues. I've reported these, but he didn't update the post.
Yes, I have loaded the data but it shows strange numbers.
And the dataset doesn't contain our repository probably because it is from 2015.