Clickhouse: Query with fullscan is faster than query with primary key condition

Created on 3 Jun 2019  路  6Comments  路  Source: ClickHouse/ClickHouse

I discovered some strange behavior when O_DIRECT option was on.
Can anyone explain me this?

SET min_bytes_to_use_direct_io=1, merge_tree_uniform_read_distribution=0;

CREATE TABLE owner_inn_partition (
  sign FixedString(27), 
  sign__hash UInt64 DEFAULT cityHash64(sign), 
  owner_id String, 
  owner_id__hash UInt64 DEFAULT cityHash64(owner_id),
  owner_inn__for_partition String
) ENGINE = MergeTree()
PARTITION BY owner_inn__for_partition
ORDER BY (owner_id__hash, sign__hash, sign)
SETTINGS index_granularity = 8192;

Query

SELECT
    sign 
FROM `owner_inn_partition`
WHERE (owner_id IN ('00000000110823', '00000000110821', '00000000110822')
            AND owner_inn__for_partition IN ('5908') )
GROUP BY `sign`
LIMIT 0, 50

50 rows in set. Elapsed: 0.282 sec. Processed 38.30 million rows, 1.43 GB (135.59 million rows/s., 5.06 GB/s.)

reads more rows (and it is obvious), but is carried out faster than this query

SELECT
     sign
FROM `owner_inn_partition`
WHERE (owner_id__hash IN (cityHash64('00000000110823'), cityHash64('00000000110821'), cityHash64('00000000110822'))
             AND owner_inn__for_partition IN ('5908') )
GROUP BY `sign`
LIMIT 0, 50
50 rows in set. Elapsed: 0.565 sec. Processed 1.95 million rows, 95.47 MB (3.45 million rows/s., 169.02 MB/s.)

Of course, query wity cityHash is carried out faster than query with fullscan if O_DIRECT option is off.

question

All 6 comments

Most probably the reason is explained by the difference:

simple and fast linear disk read + simple & 'brute force' vector processing
vs
slow disk seeks + bit more complicated processing (leading to worse cpu & cache usage) + extra data processed on the borders of the blocks

Most probably the reason is explained by the difference:

simple and fast linear disk read + simple & 'brute force' vector processing
vs
slow disk seeks + bit more complicated processing (leading to worse cpu & cache usage) + extra data processed on the borders of the blocks

小ould you please give advice?
Which variant is more preferable in terms of query execution speed?

@filimonov

Most probably the reason

No, this is not the reason.

We have inefficiencies in index analysis.
If we don't count the time spent on index analysis, index query should not be slower than full scan in ClickHouse.

Inefficiencies in index analysis stage were resolved by @CurtizJ in recent pull requests.

But I cannot confirm that your use case was resolved because there is no steps to reproduce provided.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

fizerkhan picture fizerkhan  路  3Comments

jangorecki picture jangorecki  路  3Comments

derekperkins picture derekperkins  路  3Comments

goranc picture goranc  路  3Comments

opavader picture opavader  路  3Comments