Clickhouse: Comparison with ElasticSearch

Created on 16 Jun 2016 · 23Comments · Source: ClickHouse/ClickHouse

Great work, thanks for opensourcing it.
It's rare for an open source project to have an abundance of documentation :+1:
But for some of the queries in presented benchmark it might be better to use inverted index,
like coun() on all the advertisements from a particular advertisement provider.
If filtering clicks for advertisement provider is the most complex stage of the query - inverted index would serve as precomputed filter. Faceting/Aggregation cause random IO which might be slow if there are a lot of clicks for that advertisement provider. So it's rather unclear which approach is better.
I'll probably compare them myself, but I'm interested in your opinion on this topic.

Source

kk00ss

❤3 👍1

Most helpful comment

It would be great if you could write your posts in english because we don't know russian. :)

buremba on 17 Jun 2016

👍12 😕1

All 23 comments

+1 It would be great to see elasticsearch in official benchmarks, I'd probably have some time next week to add it, if nobody will do that before me...

@alexey-milovidov что нужно подготовить, чтобы кто-нибудь потом мог запустить бенчмарк для эластика на том же железе что остальные? я к вам могу в офисе зайти если что с вопросами? вижу dbms/benchmarks, но не совсем понятно с чего начать вообще :-)

ei-grad on 16 Jun 2016

I've spent some time thinking how to represent data in ElasticSearch in the best way. And came up with conclusion that similar denormalization is possible in other systems (any DBMS probably) to implement manually, but it wouldn't be fair, in my opinion, because analytical queries are unknown in advance.
Some queries from the benchmark, like those with SomeID <>0 probably won't benefit from inverted index, but if we add queries for specific ID's, like SomeID == 1538 then inverted index will be of great help. And results of other DBMS could also change.

kk00ss on 16 Jun 2016

@ei-grad я тут на этих выходных планирую погонять на AWS'е, план такой:
Делаешь load (cli у Click'а и ES) данных (я планирую dump вики взять) ну и бомбишь join'ы в кучу потоков в ES через http и С++ демона Click'а - api там вроде понятное.
Думаю для тестовых join, union идеально подойдут категории википедии.

P.S. Можем вместе провести эксперимент )
P.S.S. Если mraa не займет все время, то точно в эти выходные.

dedok on 16 Jun 2016

Есть же уже готовый бенчмарк в dbms/benchmark, как я понимаю это он https://clickhouse.yandex/benchmark.html, без всяких join'ов, чисто под стандартный паттерн аналитики по событиям в веб/мобайл приложениях, под эластик отлично ложится, только надо загрузчик данных и запросы написать.Но было бы клево иметь возможность самому на амазоне его прогнать, да.

ei-grad on 16 Jun 2016

@alexey-milovidov что нужно подготовить, чтобы кто-нибудь потом мог запустить бенчмарк для эластика на том же железе что остальные? я к вам могу в офисе зайти если что с вопросами? вижу dbms/benchmarks, но не совсем понятно с чего начать вообще :-)

Напишу в личку, где можно взять данные для бенчмарка и на каком сервере можно экспериментировать...

alexey-milovidov on 16 Jun 2016

It would be great if you could write your posts in english because we don't know russian. :)

buremba on 17 Jun 2016

👍12 😕1

@ei-grad готовый есть, но мне интересно на другом дата сете попробовать, с сложными join'ми.

dedok on 17 Jun 2016

😄1

@buremba sorry, @dedok wants to proceed with his own test case containing complex join queries, and looking how clickhouse and elasticsearch deal with them.

I think it doesn't make a good sense, since nor ES, nor ClickHouse, are not designed to be good in such operations. Also, there is a good official benchmark (https://clickhouse.yandex/benchmark.html), which is also good for ES, and I don't see any need to reinvent the weel.

The russian part in my first message was about possibility of my communication with clickhouse devs IRL, since I work in the same office as they are.

ei-grad on 17 Jun 2016

@dedok elasticsearch doesn't support complex join queries. Only for specified parent-child relationships (defined in mapping) where parent table is ideally should be a low cardinality table. And parent-child id relationships index is actually gets fully loaded into memory during these 'join queries'. I would say it doesn't support joins well enough.

bogdanovich on 17 Jun 2016

I think the only fair comparisons would be simple cardinality and group by queries.

buremba on 17 Jun 2016

@bogdanovich fair enough.

dedok on 17 Jun 2016

ES could be better at filtering IMHO, for constraints like SomeID == 123 or SomeField == "abcd".

kk00ss on 17 Jun 2016

Comparing ClickHouse to Elasticsearch would be comparing oranges to apples. One is a columnar store and the other is a document-based search server.

synhershko on 19 Jun 2016

@synhershko Elasticsearch gets promoted as an analytics solution too.

bogdanovich on 19 Jun 2016

This does bring up the question how good clickhouse is with joins... I saw it being mentioned it might not be good at that?

matthiasg on 22 Jun 2016

In terms of capabilities, ClickHouse support INNER, LEFT, RIGHT, FULL OUTER
(and also CROSS JOIN).
For INNER, LEFT, RIGHT, FULL, only equality JOIN relation is supported.
Also, syntax for JOINs in ClickHouse is not compatible to standard SQL and too explicit: only two tables could be JOINed, it is recommended to write JOIN of subquery and subquery.
Look at examples here: https://clickhouse.yandex/reference_en.html#JOIN%20clause

In terms of performance, ClickHouse is using hash JOINs with carefully optimized hash tables for different cases of joining keys. Joining is fully multithreaded. For JOINing with subquery, which result fits in memory, it is difficult to do something more efficient. ClickHouse couldn't do JOINs with subquery, which result doesn't fit in memory.

When doing distributed JOINs, ClickHouse is using broadcast algorightm, which could be less efficient when joining huge tables, than "shuffling" joined table between nodes.

alexey-milovidov on 22 Jun 2016

ClickHouse dont use the full index,so this is point!

fengyuncrawl on 9 Aug 2016

@alexey-milovidov Comparison with Druid ? why not support bitmap index?

fengyuncrawl on 17 Aug 2016

As far as I know, Druid is intended to do aggregations of data by predefined schemas (multidimensional OLAP), while ClickHouse's sweet spot is to build reports directly from non-aggregated data (relational OLAP). That's why direct comparison is difficult.

Nevertheless, comparison is possible. Maybe you could store non-aggregated data in Druid. Or to do some aggregations, suitable for benchmark data. If you have some spare time, try to do benchmarks with 'ontime' dataset from quick start.

alexey-milovidov on 17 Aug 2016

We have not implemented any secondary indices, because when size of dataset is significantly larger than size of RAM, main concern is to maintain data locality on disk. When data is stored in order of primary key, it's difficult to achieve locality on another key without duplicating data.

But when dataset fits in RAM, and, even better, if it is not block-compressed, such indexing technique is very useful.

Some other not implemented indexing techniques are promising too:

"grid" indexes: keeping small aggregates (min, max, all unique values if not many) for each block of data, to allow skipping of data blocks;
Z-order index: to maintain some kind of data locality on few columns simultaneously.

alexey-milovidov on 17 Aug 2016

👍4

@alexey-milovidov So when I select other columns from table doesn't use the primary key,the speed is so slowly!

fengyuncrawl on 18 Aug 2016

Just put all data which you need to access by different key to a separate table. Or set up a different PK on the origin column.

ei-grad on 18 Aug 2016

When primary key is not used, ClickHouse will do full scan. Note, that:

only requested columns are scanned;
sometimes, ClickHouse will scan one of columns first to check condition and then read only that ranges of other columns, where condition is matched;
ClickHouse tries his best to achieve maximum performance of scan (read https://clickhouse.yandex/reference_en.html#Throughput%20for%20a%20single%20large%20query for some reference numbers);
when dataset doesn't fit in memory, it's hard to take advantage of secondary indices in general, without duplicating data and store it in different orders.

Nevertheless, if you use single virtual machine with single virtualized HDD, scan will be slow.

alexey-milovidov on 19 Aug 2016

Was this page helpful?

0 / 5 - 0 ratings