This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous list here: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
The tasks should be:
This topic is booked by Michael Kot @myrrc.
We want to calculate test coverage for each individual test (we have about 2500 functional tests). It will allow us to answer questions like: what tests cover this file / part of code / function; what tests are the most relevant for the code (something like a TF*IDF metric); what code is covered by this test; what is the most relevant/specific code for this test...
The task is challenging, because the default format for generated test coverage data is too heavy (sparse), and flushing and analyzing it for every test is too expensive. But the LLVM compiler infrastructure has tools to implement our own custom coverage (-fsanitize-coverage, -fxray-instrument).
As an extension of this task, we can also implement lightweight runtime tracing (or tracing profiler).
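Once per-test coverage is collected, the analysis itself maps naturally onto ClickHouse queries. A minimal sketch, assuming a hypothetical table test_coverage(test_name, file_path, line) filled by the custom coverage runtime:

-- Which tests touch a given file:
SELECT DISTINCT test_name
FROM test_coverage
WHERE file_path = 'src/Interpreters/Aggregator.cpp';

-- TF*IDF-like relevance: how often a test hits the file (TF),
-- weighted by how few tests hit that file at all (IDF).
WITH
    (SELECT uniqExact(test_name) FROM test_coverage) AS total_tests,
    (
        SELECT uniqExact(test_name)
        FROM test_coverage
        WHERE file_path = 'src/Interpreters/Aggregator.cpp'
    ) AS tests_covering_file
SELECT
    test_name,
    count() AS tf,
    tf * log(total_tests / tests_covering_file) AS relevance
FROM test_coverage
WHERE file_path = 'src/Interpreters/Aggregator.cpp'
GROUP BY test_name
ORDER BY relevance DESC;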
This topic is booked by Ksenia Sumarokova @kssenii.
ClickHouse can interact with and process data from various data sources via table functions and table engines. We already have a multitude of them: ODBC, JDBC, MySQL, MongoDB, URL... With the ODBC table function it's possible to talk to any ODBC-compatible DBMS, including PostgreSQL. But it is less efficient and less convenient than using a native PostgreSQL driver.
The task is to add native support for PostgreSQL. An interesting detail is adding proper support for the Array data types that PostgreSQL has. We should also implement support for PostgreSQL as a dictionary source. As an extension to this task we can also investigate performance issues with the Poco::ODBC client library and replace it with nanodbc. If everything goes well we can also consider implementing replication from PostgreSQL to ClickHouse like pg2ch does.
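For illustration, a rough sketch of how the native integration could be exposed; the function/engine names and argument order here are assumptions, not a final design:

-- Hypothetical native table function:
SELECT *
FROM postgresql('localhost:5432', 'postgres_db', 'postgres_table', 'user', 'password')
WHERE id > 100;

-- Hypothetical table engine for persistent access to the same table;
-- PostgreSQL array columns should map to ClickHouse Array types:
CREATE TABLE pg_mirror
(
    id UInt64,
    tags Array(String)
)
ENGINE = PostgreSQL('localhost:5432', 'postgres_db', 'postgres_table', 'user', 'password');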
This topic is booked by Anton Popov @CurtizJ
See #14196
Booked by @l1tsolaiki, @ltybc-coder
The modern SQL:2016 standard describes support for querying and managing JSON data with SQL. It's quite sophisticated: it includes a mini-language called JSON Path.
ClickHouse already has support for querying JSON with the simdjson library. This library supports the JSON Pointer API, but that does not match SQL/JSON. We have to parse and interpret the SQL/JSON path language and map it to the simdjson API.
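The standard defines functions such as JSON_EXISTS, JSON_VALUE and JSON_QUERY driven by JSON path expressions. Roughly, the goal is to support queries like the following (a sketch; the exact ClickHouse signatures are part of the task):

-- Does the document contain anything at this path?
SELECT JSON_EXISTS('{"hello": {"world": [1, 2, 3]}}', '$.hello.world');

-- Extract a scalar value by path:
SELECT JSON_VALUE('{"hello": {"world": [1, 2, 3]}}', '$.hello.world[0]');

-- Extract a JSON fragment (object or array) by path:
SELECT JSON_QUERY('{"hello": {"world": [1, 2, 3]}}', '$.hello.world');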
Booked by Nikita Vasilev @nikvas0
ClickHouse has support for table constraints, e.g. URLDomain = domain(URL) or isValidUTF8(Title). Constraints are expressions that are checked on data insertion. We can also use constraints for query optimization. Example: if there is a constraint that URLDomain = domain(URL) and the expression domain(URL) appears in a query, we can assume the constraint is true and replace domain(URL) with URLDomain if it is cheaper to read and calculate. Another example: simply replace isValidUTF8(Title) with 1.
We can implement support for two other notions similar to constraints: "assumptions" and "hypotheses". An "assumption" is similar to a constraint: if the user writes ASSUMPTION URLDomain = domain(URL) in the table definition, we don't check it on insert but still use it for query optimization (like a constraint). A "hypothesis" is an expression that is checked on insertion but is permitted to be false. Instead we store the result - whether the hypothesis held or not - as a very lightweight index. This index can be used for query optimization when the hypothesis held.
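For example, with today's CONSTRAINT ... CHECK syntax plus the proposed notions (the ASSUMPTION/HYPOTHESIS syntax below is hypothetical), a table and the resulting rewrite could look like this:

CREATE TABLE hits
(
    URL String,
    URLDomain String,
    Title String,
    CONSTRAINT c_domain CHECK URLDomain = domain(URL)  -- checked on INSERT, usable for rewrites
    -- ASSUMPTION URLDomain = domain(URL)   -- hypothetical: not checked, but trusted by the optimizer
    -- HYPOTHESIS isValidUTF8(Title)        -- hypothetical: checked, result stored as a lightweight index
)
ENGINE = MergeTree ORDER BY URL;

-- With the constraint above, the optimizer may rewrite
SELECT count() FROM hits WHERE domain(URL) = 'clickhouse.tech'
-- into the cheaper
SELECT count() FROM hits WHERE URLDomain = 'clickhouse.tech'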
Booked by Igor Baliuk, @lodthe
Given the first chunk of data in TSV, CSV or JSON format, figure out what table structure (data types) is most appropriate for this data. Various tweaks and heuristics will be involved.
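A sketch of the desired behaviour (the ability to omit the structure argument is the goal of this task, not existing functionality):

-- Today the structure must be spelled out explicitly:
SELECT * FROM file('hits.tsv', 'TSV', 'WatchID UInt64, URL String, EventTime DateTime');

-- The goal: infer the structure from the first chunk of the data:
SELECT * FROM file('hits.tsv', 'TSV');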
https://github.com/ClickHouse/ClickHouse/issues/14450
Booked by Abi Palagashvili (MSU)
ClickHouse has support for LZ4 and ZSTD as generic compression methods. The choice of these particular methods is justified: they are Pareto-optimal for compression ratio and speed among well-known libraries. Nevertheless, there exist less well-known compression libraries that can be somewhat better in certain cases. Among the potentially faster ones are: Lizard, LZSSE, density. Among the stronger ones are: bsc, csc, lzham. The task is to explore these libraries, integrate them into ClickHouse, and make a comparison on various datasets.
Extensions to this task: research zlib-compatible libraries (we have zlib-ng but its quality is unsatisfactory); add support for Content-Encoding: zstd in the HTTP interface; and https://github.com/ClickHouse/ClickHouse/issues/8828
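New methods would presumably be exposed through the existing per-column CODEC syntax; Lizard below is a hypothetical codec name used only for illustration:

-- Existing generic codec:
CREATE TABLE t_zstd (s String CODEC(ZSTD(3))) ENGINE = MergeTree ORDER BY tuple();

-- A new codec, once integrated (hypothetical):
CREATE TABLE t_lizard (s String CODEC(Lizard)) ENGINE = MergeTree ORDER BY tuple();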
Booked by Albert @Provet.
Pre-trained models can be plugged into ClickHouse and made available as functions. We have a similar feature for CatBoost.
Booked by Ivan Novitskiy, @RedClusive
Data sketches (also known as probabilistic data structures) are data structures that can give an approximate answer while using less memory or computation than an exact answer would require. We have already implemented the most in-demand data sketches in ClickHouse: we have four variants of approximate count-distinct and several variants of approximate quantiles. But there are many more interesting, unexplored data structures worth trying.
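For reference, a few of the existing approximate aggregate functions, plus a hypothetical addition of the kind this task is about:

SELECT
    uniq(UserID),                         -- approximate count-distinct (one of several variants)
    uniqHLL12(UserID),                    -- HyperLogLog-based count-distinct
    quantileTDigest(0.99)(ResponseTime)   -- approximate quantile via t-digest
FROM hits;

-- Hypothetical: a Theta sketch state that also supports set operations between sketches:
-- SELECT thetaSketch(UserID) FROM hits;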
Booked by Kiryl Shyrma, @Ruct
The user may write a program that accepts streams of serialized data on stdin (or multiple streams on several file descriptors), processes the data and writes the serialized result to stdout. We can allow these programs to be used as a table function. The table function may accept several SQL queries as arguments, prepare file descriptors, connect them to the program and pipe serialized data into them. This is similar to "map" in the "mapreduce" model. It is intended for complex calculations that cannot be expressed in SQL.
There are various options for how these programs can run: preinstalled programs available on the server (easy part); third-party programs on blob storage (S3, HDFS) that must be run in a constrained environment (Linux namespaces, seccomp...)
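A hedged sketch of how such a table function might be invoked; the name and argument layout are assumptions:

-- Run a preinstalled program, pipe the result of the inner query into its stdin
-- in TSV format, and read its stdout back as a table:
SELECT *
FROM executable(
    'my_program --mode=map',        -- program installed on the server (hypothetical)
    'TSV',                          -- format used on the pipes
    'word String, count UInt64',    -- structure of the program's output
    (SELECT Title FROM hits)        -- query whose result is fed to stdin
);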
Booked by Dmitri Torilov.
Implement a new caching layer. If a data part is read as a whole (all rows, but possibly a subset of columns), cache the deserialized blocks in memory. This will make the performance of MergeTree tables the same as Memory tables.
An extension to this task is to research various cache eviction algorithms.
Booked by Slava Boben.
Figure out the subset of correlated subqueries that can be rewritten to JOINs and implement support for them via query rewrite at the AST level.
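For instance, a correlated scalar subquery of the following shape can be decorrelated into a JOIN with a pre-aggregated subquery:

-- Correlated form:
SELECT id, value
FROM t1
WHERE value > (SELECT avg(value) FROM t2 WHERE t2.id = t1.id);

-- Equivalent rewrite to a JOIN:
SELECT t1.id, t1.value
FROM t1
INNER JOIN
(
    SELECT id, avg(value) AS avg_value
    FROM t2
    GROUP BY id
) AS agg ON agg.id = t1.id
WHERE t1.value > agg.avg_value;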
Booked by Kirill Ershov.
Implement the INTERSECT, EXCEPT and UNION DISTINCT operators (easy part). Then implement comparison with ANY/ALL of a subquery and the EXISTS subquery.
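For reference, the target syntax is standard SQL, e.g.:

SELECT number FROM numbers(10)
INTERSECT
SELECT number FROM numbers(5, 10);

SELECT number FROM numbers(10)
EXCEPT
SELECT number FROM numbers(5);

-- Comparison with ANY/ALL of a subquery, and EXISTS:
SELECT * FROM t1 WHERE x > ALL (SELECT x FROM t2);
SELECT * FROM t1 WHERE x = ANY (SELECT x FROM t2);
SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id);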
Booked by Maksim Sipliviy, @MaxTheHuman.
GROUPING SETS is a way to perform multiple different aggregations in a single pass within a single query.
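Target syntax (standard SQL, not yet supported at the time of writing):

-- One pass over the data, two groupings computed at once:
SELECT region, product, sum(amount)
FROM sales
GROUP BY GROUPING SETS ((region), (product));

-- Roughly equivalent to gluing together:
--   SELECT region, NULL AS product, sum(amount) FROM sales GROUP BY region
--   UNION ALL
--   SELECT NULL AS region, product, sum(amount) FROM sales GROUP BY product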
+ User defined functions with SQL expressions.
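A possible shape of the syntax for SQL-expression UDFs (hypothetical; the exact grammar is part of the task):

CREATE FUNCTION linear_equation AS (x, k, b) -> k * x + b;

SELECT linear_equation(number, 2, 1) FROM numbers(3);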
Booked by Andrei Staroverov @Realist007
Booked by Daniil Ryazanov @rybbba
A unique key constraint ensures that there is only one row for a given user-defined unique key. A BTree + in-memory hash table + Bloom filter can be used as the data structure for deduplication. It is very difficult to implement proper support for a unique key constraint for replicated tables. But it can be implemented for non-replicated MergeTree and for ReplicatedMergeTree in a local fashion (data is deduplicated only if inserted on the same replica) - this will have some limited use.
Booked by Denis Bolonin.
Some people hate XML. Let's support YAML for configuration files, so that XML and YAML can be used interchangeably (for example, the main config can remain in XML and config.d files can be provided in YAML). There should be a mapping from YAML to XML features such as attributes.
https://github.com/ClickHouse/ClickHouse/issues/3607
Booked by Egor Savin @Amesaru
Output in the CapnProto format. Proper support for Arrays in the Parquet format. Allow multiple reads from stdin in clickhouse-local if stdin is seekable (#11124). Interactive mode in clickhouse-local.
Booked by Dmitri Rubashkin @dimarub2000
ClickHouse already has support for incremental aggregation (see AggregatingMergeTree). We can provide an alternative that sustains a higher query rate and can be used for JOINs and dictionaries efficiently, at the price of losing persistence.
When ClickHouse executes GROUP BY, it creates a data structure in memory to hold intermediate aggregation data. This data structure lives only for the duration of the query and is destroyed when the query finishes. But we could keep the aggregation data in memory, allow more data to be fed into it incrementally, and also allow querying it as a key-value table, JOINing with it, or using it as a dictionary. A typical usage example is an antifraud filter that needs to accumulate some statistics to filter data.
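For context, this is roughly what incremental aggregation looks like today with AggregatingMergeTree; the proposal is a purely in-memory analogue of such a structure that can also be queried as key-value storage, joined with, or used as a dictionary:

CREATE TABLE user_stats
(
    UserID UInt64,
    pages AggregateFunction(uniq, String),
    last_seen AggregateFunction(max, DateTime)
)
ENGINE = AggregatingMergeTree
ORDER BY UserID;

-- Incrementally feed data in:
INSERT INTO user_stats
SELECT UserID, uniqState(URL), maxState(EventTime)
FROM hits
GROUP BY UserID;

-- Query the accumulated state:
SELECT UserID, uniqMerge(pages), maxMerge(last_seen)
FROM user_stats
GROUP BY UserID;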
Booked by Ruslan Kamalov.
Add functions for text processing: lemmatization, stop word filtering, normalization, synonym expansion. Look at Elasticsearch and Sphinx for examples.
A task for a frontend developer. Create a single-page application that allows quickly navigating, searching and filtering through the ClickHouse system.text_log and system.query_log tables. The main goal is to make the interface lightweight, beautiful and neat.
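A hedged sketch of what these could look like as functions; the names below are hypothetical placeholders:

SELECT lemmatize('en', 'wolves');    -- hypothetical: -> 'wolf'
SELECT stem('en', 'running');        -- hypothetical: -> 'run'
SELECT synonyms('en', 'important');  -- hypothetical: -> ['important', 'significant', ...]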
Booked by Flynn @ucasFL
ClickHouse has support for subscription and streaming data consumption from message queues: Kafka, RabbitMQ and, recently, the MySQL replication log. But the simplest example of streaming data is an append-only log on the local filesystem. We don't yet support subscribing to and consuming logs from a simple append-only file (generated by some third-party application), and it's possible to implement. With this feature, ClickHouse could be used as a replacement for Logstash.
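A hedged sketch of how such a subscription could be expressed; the engine name and arguments are assumptions, mirroring the existing Kafka/RabbitMQ pattern:

-- Hypothetical engine that tails an append-only log file:
CREATE TABLE nginx_log
(
    line String
)
ENGINE = FileLog('/var/log/nginx/access.log', 'LineAsString');

-- Consumed via a materialized view, like the other streaming engines:
CREATE MATERIALIZED VIEW nginx_log_store
ENGINE = MergeTree ORDER BY tuple() AS
SELECT line FROM nginx_log;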
https://github.com/ClickHouse/ClickHouse/issues/6953
Booked by Anastasia Grigoryeva, @weifoll
ClickHouse has deep introspection capabilities: per-query and per-thread metrics, a sampling profiler, etc. But they are mostly metrics about ClickHouse itself. There's a lack of metrics related to the server as a whole (e.g. total CPU load on the system, amount of free memory, load average, network traffic...).
Usually there is no need to have these metrics in ClickHouse, because they are collected by separate monitoring agents. But there are several reasons why it's better for ClickHouse to collect metrics by itself:
There are some excellent examples of metric collection software (e.g. Netdata: https://github.com/netdata/netdata). Unfortunately, most of them are licensed under the GPL, which means we have to write our own metric collection code.
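For reference, the existing introspection tables that the new OS-level metrics would sit alongside:

SELECT * FROM system.metrics WHERE metric LIKE '%Connection%';
SELECT * FROM system.asynchronous_metrics LIMIT 10;
SELECT * FROM system.events WHERE event = 'Query';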
https://github.com/ClickHouse/ClickHouse/issues/9430
Booked by Andrey Che, @andr0901
S2 is a library for geospatial data processing based on space-filling curves. ClickHouse already has support for another library with a similar concept (H3 from Uber, a hierarchical coordinate system with hexagons).
The choice between these libraries is usually motivated by which library is already used inside a company. That means the choice is not ours to make in ClickHouse, and it's better to support both.
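For reference, the existing H3 support and a hypothetical S2 counterpart (the S2 function name is an assumption):

-- Existing: map a coordinate to an H3 hexagon index at a given resolution:
SELECT geoToH3(37.79506683, 55.71290588, 15);

-- Hypothetical S2 analogue, mapping a coordinate to an S2 cell id:
-- SELECT geoToS2(37.79506683, 55.71290588);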
Booked by Daniil Kondratyev, @dankondr
ClickHouse has a very rich set of functions available out of the box, mostly superior to what you will find in other DBMSs. Our functions are more performant, more consistent in behaviour, and usually have better naming and usability.
We also have a practice of adding compatibility aliases for functions from other DBMSs, so that functions are available under their foreign names. It is possible to provide compatibility for almost every function from MySQL.
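For example, a native function and a hypothetical MySQL-compatibility alias for it:

-- Native ClickHouse function:
SELECT formatDateTime(now(), '%Y-%m-%d');

-- Hypothetical MySQL-style alias resolving to the same implementation:
-- SELECT DATE_FORMAT(now(), '%Y-%m-%d');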
Booked by Sergey Polukhin @sdpolukhin
We already have support for importing data in the JSONEachRow format (flat JSON, a separate object for every row, a.k.a. jsonlines). But when the JSON contains deeply nested fields and we want to map a subset of them to a table, the import becomes cumbersome. Also, we don't have any means of importing XML.
Example of complex nested JSON: https://www.gharchive.org/
Example of complex nested XML: https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
The proposal is to add a format that will allow the user to specify:
When multiple elements are matched, we can map them to Array in ClickHouse.
Booked by Sergey Katkovskiy @s-kat
Add functions for text classification. They can use bag-of-words / character n-gram / word shingle models. Simple Bayes models can be used. The data for the models can be provided in static data files.
Example applications:
The main challenge is to make the classification functions as efficient as possible, so that they are applicable to massive datasets for on-the-fly processing (ClickHouse style).
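A hedged sketch of what such functions might look like; all names are hypothetical placeholders:

SELECT detectLanguage(Title) AS lang FROM hits;            -- hypothetical: e.g. 'en', 'ru'
SELECT detectTonality(Comment) AS sentiment FROM reviews;  -- hypothetical: e.g. a score in [-1, 1]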
Booked by @Avogar
Booked by @tavplubix
Booked by Alexandra Latysheva @alexelex
It allows aggregating data not by the exact values of the keys but by clusters of nearby values. Clusters are formed dynamically during aggregation.
Make a desktop application similar to Telegram Desktop that represents all issues and pull requests from GitHub repositories as chats, sorts them by update time, maintains unread counts and highlights when the user is mentioned. This application is intended to allow answering questions very quickly without opening web pages in a browser (which often takes multiple seconds).
The app should work on Linux. C++ and Qt can be used for the implementation; alternatively, any other technology can be used instead (Flutter, Electron, ...). The main requirements are: low resource consumption, low input latency, quick startup time, surprise-free behaviour.
N*log(number of collected elements) instead of N*log(N))

Arrays as sets:

1) Intersection: arrayIntersect(a, b)
   arrayIntersect([1, 2, 3], [3, 2, 5]) -> [3,2]

2) Union: arrayReduce('groupUniqArrayArray', [a, b])
   arrayReduce('groupUniqArrayArray', array([1, 2, 3], [3, 2, 1, 5])) -> [5,2,1,3]

3) Difference: no dedicated function, currently arrayFilter(k -> not(has(b, k)), a)
   arrayFilter(k -> not(has([2, 3, 5], k)), [1, 2, 3]) -> [1]

It seems that 2 and 3 could be done more optimally and with nicer syntax.
This one seems nice / not too small / isolated: https://github.com/ClickHouse/ClickHouse/issues/4420
Implementation of a table engine to consume application log files in ClickHouse
Hi, can I do this task? @alexey-milovidov
@ucasFL Yes, I will reserve it for you.
ClickHouse has support for LZ4 and ZSTD as generic compression methods. The choice of these particular methods is justified: these methods are pareto-optimal for compression level and speed across well known libraries. Nevertheless, there exist less well known compression libraries that can be somewhat better in certain cases
This is a long reply, so I've provided a summary at the bottom.
I'm a pretty big fan of compression technology and I've tried to keep pace with the field, so I wanted to share my thoughts and offer up a proposal.
The marginal benefit of using other compression methods over lz4 and/or zstd is likely to be fairly low; that is, switching compressors is unlikely to result in across the board improvements, and whatever improvements you do get will probably be small and at the cost of major sacrifices to some other metrics. I say this even though bsc is my personal favorite compression algorithm.
From my reviews of the literature, it seems like the vast majority of compressor gains today come from three main classes:
Highly optimized routines that sacrifice ratios for performance, like--as you note--LZSSE.
Encoders that are specialized to the dataset, like what TimescaleDB and Facebook's Gorilla use. ClickHouse already supports some of these.
Lossy or almost-lossy techniques, like reducing numerical precision using bfloat16 or fp16 to store certain data, employing clever video filters, subsampling certain data, etc. ClickHouse's T64, as I understand it, is fairly similar, though it only seems to reduce numerical precision in cases where it's safe to do so losslessly. That seems to be about as much as anyone can reasonably do in a database.
It seems to me, then, that the first class is where the low-hanging fruit is, distantly followed by the second class.
On the topic of the first class, however, we're not the only people to recognize this low-hanging fruit: LZSSE's developer saw it, as have countless others. lz4 has implemented a decompression speed optimized compression mode and "ultra-fast" mode designed to plug the niche. zstd has a deceptively named "negative compression" mode aimed at the same niche. (Moreover, zstd has 10% faster decompression in the latest release, even without using any of these settings. That's free performance if zstd has its version bumped.)
On the topic of the second class, there are probably plenty of opportunities for certain datasets, but outside of time-series data, where highly effective heuristics are obvious, this seems like it could be pretty hard; if anything, it's probably closer to the topic of a PhD thesis than an intern project. However, things are far from hopeless. zstd and lz4 both support pre-trained dictionaries. These help specialize the compressor context for the dataset at hand; using zstd's dictionary builder, you can get double or quadruple the compression and decompression throughput _and_ better compression ratios at the same time. lz4 can use these pre-trained dictionaries, too. With the merge of the fast-cover dictionary builder into zstd, I would totally expect dictionary pre-training (using samples of the dataset) to be viable. This is on top of the incredible number of tunables and context-specific options that exist in zstd and lz4, each a knob that can help users get better results with compression.
To summarize, most high-performance generic compression algorithms perform fairly similarly when you hold some requirements (compression ratio, decompression speed, etc) constant. Some make certain tradeoffs to get large gains in certain niches, while sacrificing in others; for example, experimental compressors like LZSSE sacrifice compression ratios to get stunning (de)compression speeds. However, as of the last few years, experimental compressors have lost their monopoly on these niches. A wide array of tunables are exposed in mainstream compressors like zstd and lz4, allowing users to tap into similar benefits. Moreover, research into domain-specialized compression (which is where the majority of gains are in compression research) have spilled over into mainstream compressors, like lz4 and zstd; these libraries support a bunch of features like pre-trained dictionaries and context-specific options, the latter of which presents a massive array of knobs for improving performance on a per-dataset basis.
If I am allowed to, this brings me to a proposal: we can have both new compressors added, and research and development related to optimized usage of the compressors that are already supported. I would gladly volunteer to apply the state-of-the-art work being done by the lz4 and zstd developers to the codebase, while Abi Palagashvili continues the original mission. From skimming the codebase, it seems unlikely that (despite how close the goals are) we'd be editing too much overlapping code and creating many conflicts.
Implementation of GROUPING SETS in ClickHouse
Hello, can I do this task? @alexey-milovidov
User defined data types in ClickHouse
Hello, can I do this task? @alexey-milovidov
@gonzalezjo thanks for such a detailed response! I'm getting started with this task soon. I'm not experienced enough in optimization with C++ subtleties and SSE instructions, but I hope I'll deal with it. I also found a fresh paper from the VLDB conference on string compression: http://vldb.org/pvldb/vol13/p2649-boncz.pdf and its source code: https://github.com/cwida/fsst. I realise that the main goal is general-purpose compression, but I'll test it too if there is enough time.
@alexey-milovidov here is a list of proposed libraries:
https://github.com/inikep/lizard
https://github.com/ConorStokes/LZSSE
https://github.com/centaurean/density
https://github.com/IlyaGrebnov/libbsc
https://github.com/fusiyuan2010/CSC
https://github.com/richgel999/lzham_codec
Could you create the corresponding issue, or is that my responsibility?