Great work, thanks for opensourcing it.
It's rare for an open source project to have an abundance of documentation :+1:
But for some of the queries in presented benchmark it might be better to use inverted index,
like coun() on all the advertisements from a particular advertisement provider.
If filtering clicks for advertisement provider is the most complex stage of the query - inverted index would serve as precomputed filter. Faceting/Aggregation cause random IO which might be slow if there are a lot of clicks for that advertisement provider. So it's rather unclear which approach is better.
I'll probably compare them myself, but I'm interested in your opinion on this topic.
+1 It would be great to see elasticsearch in official benchmarks, I'd probably have some time next week to add it, if nobody will do that before me...
@alexey-milovidov что нужно подготовить, чтобы кто-нибудь потом мог запустить бенчмарк для эластика на том же железе что остальные? я к вам могу в офисе зайти если что с вопросами? вижу dbms/benchmarks, но не совсем понятно с чего начать вообще :-)
I've spent some time thinking how to represent data in ElasticSearch in the best way. And came up with conclusion that similar denormalization is possible in other systems (any DBMS probably) to implement manually, but it wouldn't be fair, in my opinion, because analytical queries are unknown in advance.
Some queries from the benchmark, like those with SomeID <>0 probably won't benefit from inverted index, but if we add queries for specific ID's, like SomeID == 1538 then inverted index will be of great help. And results of other DBMS could also change.
@ei-grad я тут на этих выходных планирую погонять на AWS'е, план такой:
Делаешь load (cli у Click'а и ES) данных (я планирую dump вики взять) ну и бомбишь join'ы в кучу потоков в ES через http и С++ демона Click'а - api там вроде понятное.
Думаю для тестовых join, union идеально подойдут категории википедии.
P.S. Можем вместе провести эксперимент )
P.S.S. Если mraa не займет все время, то точно в эти выходные.
Есть же уже готовый бенчмарк в dbms/benchmark, как я понимаю это он https://clickhouse.yandex/benchmark.html, без всяких join'ов, чисто под стандартный паттерн аналитики по событиям в веб/мобайл приложениях, под эластик отлично ложится, только надо загрузчик данных и запросы написать.Но было бы клево иметь возможность самому на амазоне его прогнать, да.
@alexey-milovidov что нужно подготовить, чтобы кто-нибудь потом мог запустить бенчмарк для эластика на том же железе что остальные? я к вам могу в офисе зайти если что с вопросами? вижу dbms/benchmarks, но не совсем понятно с чего начать вообще :-)
Напишу в личку, где можно взять данные для бенчмарка и на каком сервере можно экспериментировать...
It would be great if you could write your posts in english because we don't know russian. :)
@ei-grad готовый есть, но мне интересно на другом дата сете попробовать, с сложными join'ми.
@buremba sorry, @dedok wants to proceed with his own test case containing complex join queries, and looking how clickhouse and elasticsearch deal with them.
I think it doesn't make a good sense, since nor ES, nor ClickHouse, are not designed to be good in such operations. Also, there is a good official benchmark (https://clickhouse.yandex/benchmark.html), which is also good for ES, and I don't see any need to reinvent the weel.
The russian part in my first message was about possibility of my communication with clickhouse devs IRL, since I work in the same office as they are.
@dedok elasticsearch doesn't support complex join queries. Only for specified parent-child relationships (defined in mapping) where parent table is ideally should be a low cardinality table. And parent-child id relationships index is actually gets fully loaded into memory during these 'join queries'. I would say it doesn't support joins well enough.
I think the only fair comparisons would be simple cardinality and group by queries.
@bogdanovich fair enough.
ES could be better at filtering IMHO, for constraints like SomeID == 123 or SomeField == "abcd".
Comparing ClickHouse to Elasticsearch would be comparing oranges to apples. One is a columnar store and the other is a document-based search server.
@synhershko Elasticsearch gets promoted as an analytics solution too.
This does bring up the question how good clickhouse is with joins... I saw it being mentioned it might not be good at that?
In terms of capabilities, ClickHouse support INNER, LEFT, RIGHT, FULL OUTER
(and also CROSS JOIN).
For INNER, LEFT, RIGHT, FULL, only equality JOIN relation is supported.
Also, syntax for JOINs in ClickHouse is not compatible to standard SQL and too explicit: only two tables could be JOINed, it is recommended to write JOIN of subquery and subquery.
Look at examples here: https://clickhouse.yandex/reference_en.html#JOIN%20clause
In terms of performance, ClickHouse is using hash JOINs with carefully optimized hash tables for different cases of joining keys. Joining is fully multithreaded. For JOINing with subquery, which result fits in memory, it is difficult to do something more efficient. ClickHouse couldn't do JOINs with subquery, which result doesn't fit in memory.
When doing distributed JOINs, ClickHouse is using broadcast algorightm, which could be less efficient when joining huge tables, than "shuffling" joined table between nodes.
ClickHouse dont use the full index,so this is point!
@alexey-milovidov Comparison with Druid ? why not support bitmap index?
As far as I know, Druid is intended to do aggregations of data by predefined schemas (multidimensional OLAP), while ClickHouse's sweet spot is to build reports directly from non-aggregated data (relational OLAP). That's why direct comparison is difficult.
Nevertheless, comparison is possible. Maybe you could store non-aggregated data in Druid. Or to do some aggregations, suitable for benchmark data. If you have some spare time, try to do benchmarks with 'ontime' dataset from quick start.
We have not implemented any secondary indices, because when size of dataset is significantly larger than size of RAM, main concern is to maintain data locality on disk. When data is stored in order of primary key, it's difficult to achieve locality on another key without duplicating data.
But when dataset fits in RAM, and, even better, if it is not block-compressed, such indexing technique is very useful.
Some other not implemented indexing techniques are promising too:
@alexey-milovidov So when I select other columns from table doesn't use the primary key,the speed is so slowly!
Just put all data which you need to access by different key to a separate table. Or set up a different PK on the origin column.
When primary key is not used, ClickHouse will do full scan. Note, that:
Nevertheless, if you use single virtual machine with single virtualized HDD, scan will be slow.
Most helpful comment
It would be great if you could write your posts in english because we don't know russian. :)