I discovered some strange behavior when O_DIRECT option was on.
Can anyone explain me this?
SET min_bytes_to_use_direct_io=1, merge_tree_uniform_read_distribution=0;
CREATE TABLE owner_inn_partition (
sign FixedString(27),
sign__hash UInt64 DEFAULT cityHash64(sign),
owner_id String,
owner_id__hash UInt64 DEFAULT cityHash64(owner_id),
owner_inn__for_partition String
) ENGINE = MergeTree()
PARTITION BY owner_inn__for_partition
ORDER BY (owner_id__hash, sign__hash, sign)
SETTINGS index_granularity = 8192;
Query
SELECT
sign
FROM `owner_inn_partition`
WHERE (owner_id IN ('00000000110823', '00000000110821', '00000000110822')
AND owner_inn__for_partition IN ('5908') )
GROUP BY `sign`
LIMIT 0, 50
50 rows in set. Elapsed: 0.282 sec. Processed 38.30 million rows, 1.43 GB (135.59 million rows/s., 5.06 GB/s.)
reads more rows (and it is obvious), but is carried out faster than this query
SELECT
sign
FROM `owner_inn_partition`
WHERE (owner_id__hash IN (cityHash64('00000000110823'), cityHash64('00000000110821'), cityHash64('00000000110822'))
AND owner_inn__for_partition IN ('5908') )
GROUP BY `sign`
LIMIT 0, 50
50 rows in set. Elapsed: 0.565 sec. Processed 1.95 million rows, 95.47 MB (3.45 million rows/s., 169.02 MB/s.)
Of course, query wity cityHash is carried out faster than query with fullscan if O_DIRECT option is off.
Most probably the reason is explained by the difference:
simple and fast linear disk read + simple & 'brute force' vector processing
vs
slow disk seeks + bit more complicated processing (leading to worse cpu & cache usage) + extra data processed on the borders of the blocks
Most probably the reason is explained by the difference:
simple and fast linear disk read + simple & 'brute force' vector processing
vs
slow disk seeks + bit more complicated processing (leading to worse cpu & cache usage) + extra data processed on the borders of the blocks
小ould you please give advice?
Which variant is more preferable in terms of query execution speed?
@filimonov
Most probably the reason
No, this is not the reason.
We have inefficiencies in index analysis.
If we don't count the time spent on index analysis, index query should not be slower than full scan in ClickHouse.
Inefficiencies in index analysis stage were resolved by @CurtizJ in recent pull requests.
But I cannot confirm that your use case was resolved because there is no steps to reproduce provided.