Clickhouse: Query with fullscan is faster than query with primary key condition

Created on 3 Jun 2019 · 6Comments · Source: ClickHouse/ClickHouse

I discovered some strange behavior when O_DIRECT option was on.
Can anyone explain me this?

SET min_bytes_to_use_direct_io=1, merge_tree_uniform_read_distribution=0;

CREATE TABLE owner_inn_partition (
  sign FixedString(27), 
  sign__hash UInt64 DEFAULT cityHash64(sign), 
  owner_id String, 
  owner_id__hash UInt64 DEFAULT cityHash64(owner_id),
  owner_inn__for_partition String
) ENGINE = MergeTree()
PARTITION BY owner_inn__for_partition
ORDER BY (owner_id__hash, sign__hash, sign)
SETTINGS index_granularity = 8192;

Query

SELECT
    sign 
FROM `owner_inn_partition`
WHERE (owner_id IN ('00000000110823', '00000000110821', '00000000110822')
            AND owner_inn__for_partition IN ('5908') )
GROUP BY `sign`
LIMIT 0, 50

50 rows in set. Elapsed: 0.282 sec. Processed 38.30 million rows, 1.43 GB (135.59 million rows/s., 5.06 GB/s.)

reads more rows (and it is obvious), but is carried out faster than this query

SELECT
     sign
FROM `owner_inn_partition`
WHERE (owner_id__hash IN (cityHash64('00000000110823'), cityHash64('00000000110821'), cityHash64('00000000110822'))
             AND owner_inn__for_partition IN ('5908') )
GROUP BY `sign`
LIMIT 0, 50
50 rows in set. Elapsed: 0.565 sec. Processed 1.95 million rows, 95.47 MB (3.45 million rows/s., 169.02 MB/s.)

Of course, query wity cityHash is carried out faster than query with fullscan if O_DIRECT option is off.

question

Source

alexm93

All 6 comments

Most probably the reason is explained by the difference:

simple and fast linear disk read + simple & 'brute force' vector processing
vs
slow disk seeks + bit more complicated processing (leading to worse cpu & cache usage) + extra data processed on the borders of the blocks

filimonov on 8 Jun 2019

Most probably the reason is explained by the difference:

simple and fast linear disk read + simple & 'brute force' vector processing
vs
slow disk seeks + bit more complicated processing (leading to worse cpu & cache usage) + extra data processed on the borders of the blocks

Сould you please give advice?
Which variant is more preferable in terms of query execution speed?

alexm93 on 10 Jun 2019

@filimonov

Most probably the reason

No, this is not the reason.

alexey-milovidov on 11 Jun 2019

👀2

We have inefficiencies in index analysis.
If we don't count the time spent on index analysis, index query should not be slower than full scan in ClickHouse.

alexey-milovidov on 20 Jan 2020

Inefficiencies in index analysis stage were resolved by @CurtizJ in recent pull requests.

alexey-milovidov on 26 Apr 2020

But I cannot confirm that your use case was resolved because there is no steps to reproduce provided.

alexey-milovidov on 26 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings