Hi Team,
I have a common scenario: 500-1000 users are online simultaneously, and they want to run exact ad-hoc queries, e.g. `select * from table1 where id = 'a';`
The documentation says ClickHouse allows concurrent data access. In practice, how high a level of concurrency does ClickHouse achieve?
Is there more detailed documentation or a reference that demonstrates this?
Please remember that ClickHouse is not a key-value store, and extracting one row under a high level of concurrency is not its strong point. ClickHouse is a database for analytics, built to execute a rather small number of CPU-intensive requests, not a high number of simple data extractions.
The documentation says:
"When NOT to use ClickHouse:
Key-value access with high request rate"
About concurrency, you generally need to know a few things:
1) if a request is for "cold" data, it will need to be read from disk (which adds delays)
2) if a request requires a lot of CPU work, and the CPU is busy with other request(s), it will run slower.
3) ClickHouse often spreads its work across multiple CPU cores even for a single request, so concurrent requests have to share the same cores.
The maximum number of concurrent requests in ClickHouse is set to 100 by default (it can be increased in the config file).
So you can run 100 requests at a time, but if they all need a lot of CPU work, they will run roughly 100 times slower than a single request. If all of them need to read "cold" data from disk, the disk will add significant delays.
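For reference, this is roughly how that limit might look in the server config. This is a sketch of the setting's placement, not a complete config file; check the config.xml shipped with your ClickHouse version for the exact surrounding context.

```xml
<!-- Server-wide cap on simultaneously executing queries. -->
<!-- Root tag and placement assumed from the era's default config.xml. -->
<yandex>
    <max_concurrent_queries>100</max_concurrent_queries>
</yandex>
```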
Best practices here are to:
1) use CPU(s) with more cores (up to 64)
2) use clustering, and spread the data between servers so the request load is distributed evenly
3) use a good primary key (otherwise full scans will be needed often)
4) have a lot of memory (best of all if your data fits in memory entirely; if that's not possible, the more memory you have, the better)
5) test it on your data with your load. Abstract load on abstract data doesn't exist.
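Point 5 can be sketched as a small concurrency harness. This is a hypothetical sketch: `run_load()` accepts any query callable, so the demo below plugs in a stub instead of a real ClickHouse connection (in practice you would substitute, for example, an HTTP request to your server and your real ad-hoc query).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(query_fn, n_requests, concurrency):
    """Fire n_requests calls to query_fn with the given concurrency;
    return per-request latencies in seconds."""
    def timed_call(i):
        start = time.perf_counter()
        query_fn(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(n_requests)))

# Stub standing in for a real query (e.g. an HTTP call to ClickHouse).
def fake_query(i):
    time.sleep(0.001)

lats = run_load(fake_query, n_requests=200, concurrency=50)
print(f"{len(lats)} requests, p_max = {max(lats):.4f}s")
```

Swap `fake_query` for a function that issues your actual query, run it at the concurrency you expect in production, and look at the latency distribution rather than the average.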
@filimonov "Maximum number of concurrent requests in ClickHouse is set to 100 by default (can be increased in config file)" — what's the name of that param?
Silviu
@silviucpp it's called max_concurrent_queries. See
https://github.com/yandex/ClickHouse/blob/4fcb081f5b0334663aa5c503682bd30db5da1715/dbms/src/Server/config.xml#L72
Hmm... And it looks like it's only present in the Russian documentation:
https://clickhouse.yandex/docs/ru/operations/server_settings/settings.html#max-concurrent-queries
@filimonov Your suggestion is good, especially the best practices, but the key-value vs. SQL distinction isn't entirely relevant for high-concurrency requests. ClickHouse has a Memory table engine and, if I understand correctly, also a caching mechanism, so ClickHouse could cache an entry in such a table the way Redis does in order to serve a high request rate from a client-side application.
@theseusyang at the moment the only caching ClickHouse uses is caching and memory-mapping of disk data. So if you have a lot of memory (best of all: all data fits in memory) you will have practically no disk reads, only the initial ones. But the results of operations on that data are not cached, so if you compute the sum of a million rows multiple times, it will honestly recalculate that sum each time by iterating through the million rows (and will do so very efficiently). The way ClickHouse processes data is very well optimized for bulk processing (for example, a GROUP BY or a sum over a million rows), but it is far from optimal when you need to extract one record at a time (it will always process a few thousand rows just to find your row).
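The "few thousand rows" comes from MergeTree's sparse primary index: it stores one index mark per granule (8192 rows by default), so a point lookup narrows the search to a granule and then scans it. A rough illustration of the idea (a deliberately simplified model, not the actual storage format):

```python
import bisect

GRANULE = 8192  # default index_granularity in MergeTree

# A sorted "column" of 1,000,000 ids; the sparse index keeps
# only the first key of every granule.
ids = list(range(1_000_000))
marks = ids[::GRANULE]

def point_lookup(key):
    """Locate the granule that may contain `key`, then scan it row by row."""
    g = bisect.bisect_right(marks, key) - 1      # which granule to read
    start = g * GRANULE
    granule_rows = ids[start:start + GRANULE]    # whole granule is processed
    return key in granule_rows, len(granule_rows)

found, rows_scanned = point_lookup(123_456)
# Even to fetch one row, on the order of a full granule is read.
```

A key-value store would follow a dense index straight to the row; ClickHouse trades that away for a tiny index and fast bulk scans.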
The main storage engine in ClickHouse is MergeTree (and its family); the Memory engine is rather a side product used for temporary tables and similar cases.
But as I said, you can always run a test with your data and your load. Nobody can forbid you from using ClickHouse as a key-value store, or, for example, as a document store. I'm just saying that ClickHouse was not created for that, and it's usually better to use the right tool for the right task. ClickHouse is a tool for analyzing data and processing huge numbers of rows at a time. For key-value storage and high concurrency there are plenty of other tools like Redis/Aerospike/Riak.
As far as I know, max_concurrent_queries limits the number of concurrent queries only on a single ClickHouse server. It doesn't limit concurrent requests to distributed tables across a cluster.
@theseusyang, take a look at chproxy: among other useful features, it can limit the number of concurrent queries sent to a ClickHouse cluster.
@theseusyang there are pretty detailed answers above, do you have any further questions?