This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous list here: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
The tasks should be:
This topic is booked by Michael Kot @myrrc.
We want to calculate test coverage for each individual test (we have about 2500 functional tests). It will allow us to answer questions like: what tests cover this file / part of code / function; what tests are the most relevant for the code (something like a TF*IDF metric); what code is covered by this test; what is the most relevant/specific code for this test...
The task is challenging, because the default format for generated test coverage data is too heavy (sparse), and flushing and analyzing it for every test is too expensive. But the LLVM compiler infrastructure has tools to implement our own custom coverage (-fsanitize-coverage, -fxray-instrument).
As an extension of this task, we can also implement lightweight runtime tracing (or tracing profiler).
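Once per-test coverage is collected, the analysis itself maps naturally onto ClickHouse queries. A minimal sketch, assuming a hypothetical table test_coverage(test_name, file_path, line) filled by the custom coverage runtime:

-- Which tests touch a given file:
SELECT DISTINCT test_name
FROM test_coverage
WHERE file_path = 'src/Interpreters/Aggregator.cpp';

-- TF*IDF-like relevance: how often a test hits the file (TF),
-- weighted by how few tests hit that file at all (IDF).
WITH
    (SELECT uniqExact(test_name) FROM test_coverage) AS total_tests,
    (
        SELECT uniqExact(test_name)
        FROM test_coverage
        WHERE file_path = 'src/Interpreters/Aggregator.cpp'
    ) AS tests_covering_file
SELECT
    test_name,
    count() AS tf,
    tf * log(total_tests / tests_covering_file) AS relevance
FROM test_coverage
WHERE file_path = 'src/Interpreters/Aggregator.cpp'
GROUP BY test_name
ORDER BY relevance DESC;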
This topic is booked by Ksenia Sumarokova @kssenii.
ClickHouse can interact with and process data from various data sources via table functions and table engines. We already have a multitude of them: ODBC, JDBC, MySQL, MongoDB, URL... With the ODBC table function it's possible to talk to any ODBC-compatible DBMS, including PostgreSQL. But it is less efficient and less convenient than using a native PostgreSQL driver.
The task is to add native support for PostgreSQL. An interesting detail is adding proper support for the Array data types that PostgreSQL has. We should also implement support for PostgreSQL as a dictionary source. As an extension to this task we can also investigate performance issues with the Poco::ODBC client library and replace it with nanodbc. If everything goes well we can also consider implementing replication from PostgreSQL to ClickHouse like pg2ch does.
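For illustration, a rough sketch of how the native integration could be exposed; the function/engine names and argument order here are assumptions, not a final design:

-- Hypothetical native table function:
SELECT *
FROM postgresql('localhost:5432', 'postgres_db', 'postgres_table', 'user', 'password')
WHERE id > 100;

-- Hypothetical table engine for persistent access to the same table;
-- PostgreSQL array columns should map to ClickHouse Array types:
CREATE TABLE pg_mirror
(
    id UInt64,
    tags Array(String)
)
ENGINE = PostgreSQL('localhost:5432', 'postgres_db', 'postgres_table', 'user', 'password');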
This topic is booked by Anton Popov @CurtizJ
See #14196
Booked by @l1tsolaiki, @ltybc-coder
The modern SQL:2016 standard describes support for querying and managing JSON data with SQL. It's quite sophisticated: it includes a mini-language called JSON Path.
ClickHouse already has support for querying JSON with the simdjson library. This library supports the JSON Pointer API, but that does not match SQL/JSON. We have to parse and interpret the SQL/JSON path language and map it to the simdjson API.
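The standard defines functions such as JSON_EXISTS, JSON_VALUE and JSON_QUERY driven by JSON path expressions. Roughly, the goal is to support queries like the following (a sketch; the exact ClickHouse signatures are part of the task):

-- Does the document contain anything at this path?
SELECT JSON_EXISTS('{"hello": {"world": [1, 2, 3]}}', '$.hello.world');

-- Extract a scalar value by path:
SELECT JSON_VALUE('{"hello": {"world": [1, 2, 3]}}', '$.hello.world[0]');

-- Extract a JSON fragment (object or array) by path:
SELECT JSON_QUERY('{"hello": {"world": [1, 2, 3]}}', '$.hello.world');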
Booked by Nikita Vasilev @nikvas0
ClickHouse has support for table constraints, e.g. URLDomain = domain(URL) or isValidUTF8(Title). Constraints are expressions that are checked on data insertion. We can also use constraints for query optimization. Example: if there is a constraint that URLDomain = domain(URL) and the expression domain(URL) appears in a query, we can assume the constraint is true and replace domain(URL) with URLDomain if it is cheaper to read and calculate. Another example: simply replace isValidUTF8(Title) with 1.
We can implement support for two other notions similar to constraints: "assumptions" and "hypotheses". An "assumption" is similar to a constraint: if the user writes ASSUMPTION URLDomain = domain(URL) in the table definition, we don't check it on insert but still use it for query optimization (like a constraint). A "hypothesis" is an expression that is checked on insertion but is permitted to be false. Instead we store the result - whether the hypothesis held or not - as a very lightweight index. This index can be used for query optimization when the hypothesis held.
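For example, with today's CONSTRAINT ... CHECK syntax plus the proposed notions (the ASSUMPTION/HYPOTHESIS syntax below is hypothetical), a table and the resulting rewrite could look like this:

CREATE TABLE hits
(
    URL String,
    URLDomain String,
    Title String,
    CONSTRAINT c_domain CHECK URLDomain = domain(URL)  -- checked on INSERT, usable for rewrites
    -- ASSUMPTION URLDomain = domain(URL)   -- hypothetical: not checked, but trusted by the optimizer
    -- HYPOTHESIS isValidUTF8(Title)        -- hypothetical: checked, result stored as a lightweight index
)
ENGINE = MergeTree ORDER BY URL;

-- With the constraint above, the optimizer may rewrite
SELECT count() FROM hits WHERE domain(URL) = 'clickhouse.tech'
-- into the cheaper
SELECT count() FROM hits WHERE URLDomain = 'clickhouse.tech'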
Booked by Igor Baliuk, @lodthe
Given the first chunk of data in TSV, CSV or JSON format, figure out what table structure (data types) is most appropriate for this data. Various tweaks and heuristics will be involved.
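A sketch of the desired behaviour (the ability to omit the structure argument is the goal of this task, not existing functionality):

-- Today the structure must be spelled out explicitly:
SELECT * FROM file('hits.tsv', 'TSV', 'WatchID UInt64, URL String, EventTime DateTime');

-- The goal: infer the structure from the first chunk of the data:
SELECT * FROM file('hits.tsv', 'TSV');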
https://github.com/ClickHouse/ClickHouse/issues/14450
Booked by Abi Palagashvili (MSU)
ClickHouse has support for LZ4 and ZSTD as generic compression methods. The choice of these particular methods is justified: they are Pareto-optimal for compression ratio and speed among well-known libraries. Nevertheless, there exist less well-known compression libraries that can be somewhat better in certain cases. Among the potentially faster ones are: Lizard, LZSSE, density. Among the stronger ones are: bsc, csc, lzham. The task is to explore these libraries, integrate them into ClickHouse, and make a comparison on various datasets.
Extensions to this task: research zlib-compatible libraries (we have zlib-ng but its quality is unsatisfactory); add support for Content-Encoding: zstd in the HTTP interface; and https://github.com/ClickHouse/ClickHouse/issues/8828
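New methods would presumably be exposed through the existing per-column CODEC syntax; Lizard below is a hypothetical codec name used only for illustration:

-- Existing generic codec:
CREATE TABLE t_zstd (s String CODEC(ZSTD(3))) ENGINE = MergeTree ORDER BY tuple();

-- A new codec, once integrated (hypothetical):
CREATE TABLE t_lizard (s String CODEC(Lizard)) ENGINE = MergeTree ORDER BY tuple();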
Booked by Albert @Provet.
Pre-trained models can be plugged into ClickHouse and made available as functions. We have a similar feature for CatBoost.
Booked by Ivan Novitskiy, @RedClusive
Data sketches (also known as probabilistic data structures) are data structures that can give an approximate answer while using less memory or computation than an exact answer would require. We have already implemented the most in-demand data sketches in ClickHouse: we have four variants of approximate count-distinct and several variants of approximate quantiles. But there are many more interesting, unexplored data structures worth trying.
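For reference, a few of the existing approximate aggregate functions, plus a hypothetical addition of the kind this task is about:

SELECT
    uniq(UserID),                         -- approximate count-distinct (one of several variants)
    uniqHLL12(UserID),                    -- HyperLogLog-based count-distinct
    quantileTDigest(0.99)(ResponseTime)   -- approximate quantile via t-digest
FROM hits;

-- Hypothetical: a Theta sketch state that also supports set operations between sketches:
-- SELECT thetaSketch(UserID) FROM hits;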
Booked by Kiryl Shyrma, @Ruct
The user may write a program that accepts streams of serialized data on stdin (or multiple streams on several file descriptors), processes the data and writes the serialized result to stdout. We can allow these programs to be used as a table function. The table function may accept several SQL queries as arguments, prepare file descriptors, connect them to the program and pipe serialized data into them. This is similar to "map" in the "mapreduce" model. It is intended for complex calculations that cannot be expressed in SQL.
There are various options for how these programs can run: preinstalled programs available on the server (easy part); third-party programs on blob storage (S3, HDFS) that must be run in a constrained environment (Linux namespaces, seccomp...)
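A hedged sketch of how such a table function might be invoked; the name and argument layout are assumptions:

-- Run a preinstalled program, pipe the result of the inner query into its stdin
-- in TSV format, and read its stdout back as a table:
SELECT *
FROM executable(
    'my_program --mode=map',        -- program installed on the server (hypothetical)
    'TSV',                          -- format used on the pipes
    'word String, count UInt64',    -- structure of the program's output
    (SELECT Title FROM hits)        -- query whose result is fed to stdin
);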
Booked by Dmitri Torilov.
Implement a new caching layer. If a data part is read as a whole (all rows, but possibly a subset of columns), cache the deserialized blocks in memory. This will make the performance of MergeTree tables the same as Memory tables.
An extension to this task is to research various cache eviction algorithms.
Booked by Slava Boben.
Figure out the subset of correlated subqueries that can be rewritten to JOINs and implement support for them via query rewrite at the AST level.
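For instance, a correlated scalar subquery of the following shape can be decorrelated into a JOIN with a pre-aggregated subquery:

-- Correlated form:
SELECT id, value
FROM t1
WHERE value > (SELECT avg(value) FROM t2 WHERE t2.id = t1.id);

-- Equivalent rewrite to a JOIN:
SELECT t1.id, t1.value
FROM t1
INNER JOIN
(
    SELECT id, avg(value) AS avg_value
    FROM t2
    GROUP BY id
) AS agg ON agg.id = t1.id
WHERE t1.value > agg.avg_value;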
Booked by Kirill Ershov.
Implement the INTERSECT, EXCEPT and UNION DISTINCT operators (easy part). Then implement comparison with ANY/ALL of a subquery and the EXISTS subquery.
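For reference, the target syntax is standard SQL, e.g.:

SELECT number FROM numbers(10)
INTERSECT
SELECT number FROM numbers(5, 10);

SELECT number FROM numbers(10)
EXCEPT
SELECT number FROM numbers(5);

-- Comparison with ANY/ALL of a subquery, and EXISTS:
SELECT * FROM t1 WHERE x > ALL (SELECT x FROM t2);
SELECT * FROM t1 WHERE x = ANY (SELECT x FROM t2);
SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id);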
Booked by Maksim Sipliviy, @MaxTheHuman.
GROUPING SETS is a way to perform multiple different aggregations in a single pass within a single query.
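Target syntax (standard SQL, not yet supported at the time of writing):

-- One pass over the data, two groupings computed at once:
SELECT region, product, sum(amount)
FROM sales
GROUP BY GROUPING SETS ((region), (product));

-- Roughly equivalent to gluing together:
--   SELECT region, NULL AS product, sum(amount) FROM sales GROUP BY region
--   UNION ALL
--   SELECT NULL AS region, product, sum(amount) FROM sales GROUP BY product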
+ User defined functions with SQL expressions.
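A possible shape of the syntax for SQL-expression UDFs (hypothetical; the exact grammar is part of the task):

CREATE FUNCTION linear_equation AS (x, k, b) -> k * x + b;

SELECT linear_equation(number, 2, 1) FROM numbers(3);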
Booked by Andrei Staroverov @Realist007
Booked by Daniil Ryazanov @rybbba
A unique key constraint ensures that there is only one row for a given user-defined unique key. A BTree + in-memory hash table + Bloom filter can be used as the data structure for deduplication. It is very difficult to implement proper support for a unique key constraint for replicated tables. But it can be implemented for non-replicated MergeTree and for ReplicatedMergeTree in a local fashion (data is deduplicated only if inserted on the same replica) - this will have some limited use.
Booked by Denis Bolonin.
Some people hate XML. Let's support YAML for configuration files, so that XML and YAML can be used interchangeably (for example, the main config can remain in XML and config.d files can be provided in YAML). There should be a mapping from YAML to XML features such as attributes.
https://github.com/ClickHouse/ClickHouse/issues/3607
Booked by Egor Savin @Amesaru
Output in the CapnProto format. Proper support for Arrays in the Parquet format. Allow multiple reads from stdin in clickhouse-local if stdin is seekable (#11124). Interactive mode in clickhouse-local.
Booked by Dmitri Rubashkin @dimarub2000
ClickHouse already has support for incremental aggregation (see AggregatingMergeTree). We can provide an alternative that sustains a higher query rate and can be used for JOINs and dictionaries efficiently, at the price of losing persistence.
When ClickHouse executes GROUP BY, it creates a data structure in memory to hold intermediate aggregation data. This data structure lives only for the duration of the query and is destroyed when the query finishes. But we could keep the aggregation data in memory, allow more data to be fed into it incrementally, and also allow querying it as a key-value table, JOINing with it, or using it as a dictionary. A typical usage example is an antifraud filter that needs to accumulate some statistics to filter data.
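For context, this is roughly what incremental aggregation looks like today with AggregatingMergeTree; the proposal is a purely in-memory analogue of such a structure that can also be queried as key-value storage, joined with, or used as a dictionary:

CREATE TABLE user_stats
(
    UserID UInt64,
    pages AggregateFunction(uniq, String),
    last_seen AggregateFunction(max, DateTime)
)
ENGINE = AggregatingMergeTree
ORDER BY UserID;

-- Incrementally feed data in:
INSERT INTO user_stats
SELECT UserID, uniqState(URL), maxState(EventTime)
FROM hits
GROUP BY UserID;

-- Query the accumulated state:
SELECT UserID, uniqMerge(pages), maxMerge(last_seen)
FROM user_stats
GROUP BY UserID;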
Booked by Ruslan Kamalov.
Add functions for text processing: lemmatization, stop word filtering, normalization, synonym expansion. Look at Elasticsearch and Sphinx for examples.
A task for a frontend developer. Create a single-page application that allows quickly navigating, searching and filtering through the ClickHouse system.text_log and system.query_log tables. The main goal is to make the interface lightweight, beautiful and neat.
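A hedged sketch of what these could look like as functions; the names below are hypothetical placeholders:

SELECT lemmatize('en', 'wolves');    -- hypothetical: -> 'wolf'
SELECT stem('en', 'running');        -- hypothetical: -> 'run'
SELECT synonyms('en', 'important');  -- hypothetical: -> ['important', 'significant', ...]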
Booked by Flynn @ucasFL
ClickHouse has support for subscription and streaming data consumption from message queues: Kafka, RabbitMQ and, recently, the MySQL replication log. But the simplest example of streaming data is an append-only log on the local filesystem. We don't yet support subscribing to and consuming logs from a simple append-only file (generated by some third-party application), and it's possible to implement. With this feature, ClickHouse could be used as a replacement for Logstash.
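A hedged sketch of how such a subscription could be expressed; the engine name and arguments are assumptions, mirroring the existing Kafka/RabbitMQ pattern:

-- Hypothetical engine that tails an append-only log file:
CREATE TABLE nginx_log
(
    line String
)
ENGINE = FileLog('/var/log/nginx/access.log', 'LineAsString');

-- Consumed via a materialized view, like the other streaming engines:
CREATE MATERIALIZED VIEW nginx_log_store
ENGINE = MergeTree ORDER BY tuple() AS
SELECT line FROM nginx_log;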
https://github.com/ClickHouse/ClickHouse/issues/6953
Booked by Anastasia Grigoryeva, @weifoll
ClickHouse has deep introspection capabilities: per-query and per-thread metrics, a sampling profiler, etc. But they are mostly metrics about ClickHouse itself. There's a lack of metrics related to the server as a whole (e.g. total CPU load on the system, amount of free memory, load average, network traffic...).
Usually there is no need to have these metrics in ClickHouse, because they are collected by separate monitoring agents. But there are several reasons why it's better for ClickHouse to collect metrics by itself:
There are some excellent examples of metric collection software (e.g. Netdata: https://github.com/netdata/netdata). Unfortunately, most of them are licensed under the GPL, which means we have to write our own metric collection code.
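For reference, the existing introspection tables that the new OS-level metrics would sit alongside:

SELECT * FROM system.metrics WHERE metric LIKE '%Connection%';
SELECT * FROM system.asynchronous_metrics LIMIT 10;
SELECT * FROM system.events WHERE event = 'Query';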
https://github.com/ClickHouse/ClickHouse/issues/9430
Booked by Andrey Che, @andr0901
S2 is a library for geospatial data processing based on space-filling curves. ClickHouse already has support for another library with a similar concept (H3 from Uber, a hierarchical coordinate system with hexagons).
The choice between these libraries is usually motivated by which library is already used inside a company. That means the choice is not ours to make in ClickHouse, and it's better to support both.
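For reference, the existing H3 support and a hypothetical S2 counterpart (the S2 function name is an assumption):

-- Existing: map a coordinate to an H3 hexagon index at a given resolution:
SELECT geoToH3(37.79506683, 55.71290588, 15);

-- Hypothetical S2 analogue, mapping a coordinate to an S2 cell id:
-- SELECT geoToS2(37.79506683, 55.71290588);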
Booked by Daniil Kondratyev, @dankondr
ClickHouse has a very rich set of functions available out of the box, mostly superior to what you will find in other DBMSs. Our functions are more performant, more consistent in behaviour, and usually have better naming and usability.
We also have a practice of adding compatibility aliases for functions from other DBMSs, so that functions are available under their foreign names. It is possible to provide compatibility for almost every function from MySQL.
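For example, a native function and a hypothetical MySQL-compatibility alias for it:

-- Native ClickHouse function:
SELECT formatDateTime(now(), '%Y-%m-%d');

-- Hypothetical MySQL-style alias resolving to the same implementation:
-- SELECT DATE_FORMAT(now(), '%Y-%m-%d');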
Booked by Sergey Polukhin @sdpolukhin
We already have support for importing data in the JSONEachRow format (flat JSON, a separate object for every row, a.k.a. jsonlines). But when the JSON contains deeply nested fields and we want to map a subset of them to a table, the import becomes cumbersome. Also, we don't have any means of importing XML.
Example of complex nested JSON: https://www.gharchive.org/
Example of complex nested XML: https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
The proposal is to add a format that will allow the user to specify:
When multiple elements are matched, we can map them to Array in ClickHouse.
Booked by Sergey Katkovskiy @s-kat
Add functions for text classification. They can use bag-of-words / character n-gram / word shingle models. Simple Bayes models can be used. The data for the models can be provided in static data files.
Example applications:
The main challenge is to make the classification functions as efficient as possible, so that they are applicable to massive datasets for on-the-fly processing (ClickHouse style).
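A hedged sketch of what such functions might look like; all names are hypothetical placeholders:

SELECT detectLanguage(Title) AS lang FROM hits;            -- hypothetical: e.g. 'en', 'ru'
SELECT detectTonality(Comment) AS sentiment FROM reviews;  -- hypothetical: e.g. a score in [-1, 1]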
Booked by @Avogar
Booked by @tavplubix
Booked by Alexandra Latysheva @alexelex
It allows aggregating data not by the exact values of the keys but by clusters of nearby values. Clusters are formed dynamically during aggregation.
Make a desktop application similar to Telegram Desktop that represents all issues and pull requests from GitHub repositories as chats, sorts them by update time, maintains unread counts and highlights when the user is mentioned. This application is intended to allow answering questions very quickly without opening web pages in a browser (which often takes multiple seconds).
The app should work on Linux. C++ and Qt can be used for the implementation; alternatively, any other technology can be used instead (Flutter, Electron, ...). The main requirements are: low resource consumption, low input latency, quick startup time, surprise-free behaviour.
N*log(number of collected elements) instead of N*log(N))

Arrays as sets:

1) Intersection: arrayIntersect(a, b)
   arrayIntersect([1, 2, 3], [3, 2, 5]) -> [3,2]

2) Union: arrayReduce('groupUniqArrayArray', [a, b])
   arrayReduce('groupUniqArrayArray', array([1, 2, 3], [3, 2, 1, 5])) -> [5,2,1,3]

3) Difference: no dedicated function, currently arrayFilter(k -> not(has(b, k)), a)
   arrayFilter(k -> not(has([2, 3, 5], k)), [1, 2, 3]) -> [1]

It seems that 2 and 3 could be done more optimally and with nicer syntax.
This one seems nice / not too small / isolated: https://github.com/ClickHouse/ClickHouse/issues/4420
Implementation of a table engine to consume application log files in ClickHouse
Hi, can I do this task? @alexey-milovidov
@ucasFL Yes, I will reserve it for you.
ClickHouse has support for LZ4 and ZSTD as generic compression methods. The choice of these particular methods is justified: these methods are pareto-optimal for compression level and speed across well known libraries. Nevertheless, there exist less well known compression libraries that can be somewhat better in certain cases
This is a long reply, so I've provided a summary at the bottom.
I'm a pretty big fan of compression technology and I've tried to keep pace with the field, so I wanted to share my thoughts and offer up a proposal.
The marginal benefit of using other compression methods over lz4 and/or zstd is likely to be fairly low; that is, switching compressors is unlikely to result in across the board improvements, and whatever improvements you do get will probably be small and at the cost of major sacrifices to some other metrics. I say this even though bsc is my personal favorite compression algorithm.
From my reviews of the literature, it seems like the vast majority of compressor gains today come from three main classes:
Highly optimized routines that sacrifice ratios for performance, like--as you note--LZSSE.
Encoders that are specialized to the dataset, like what TimescaleDB and Facebook's Gorilla use. ClickHouse already supports some of these.
Lossy or almost-lossy techniques, like reducing numerical precision using bfloat16 or fp16 to store certain data, employing clever video filters, subsampling certain data, etc. ClickHouse's T64, as I understand it, is fairly similar, though it only seems to reduce numerical precision in cases where it's safe to do so losslessly. That seems to be about as much as anyone can reasonably do in a database.
It seems to me, then, that the first class is where the low-hanging fruit is, distantly followed by the second class.
On the topic of the first class, however, we're not the only people to recognize this low-hanging fruit: LZSSE's developer saw it, as have countless others. lz4 has implemented a decompression speed optimized compression mode and "ultra-fast" mode designed to plug the niche. zstd has a deceptively named "negative compression" mode aimed at the same niche. (Moreover, zstd has 10% faster decompression in the latest release, even without using any of these settings. That's free performance if zstd has its version bumped.)
On the topic of the second class, there are probably plenty of opportunities for certain datasets, but outside of time-series data, where highly effective heuristics are obvious, this seems like it could be pretty hard; if anything, it's probably closer to the topic of a PhD thesis than an intern project. However, things are far from hopeless. zstd and lz4 both support pre-trained dictionaries. These help specialize the compressor context for the dataset at hand; using zstd's dictionary builder, you can get double or quadruple the compression and decompression throughput _and_ better compression ratios at the same time. lz4 can use these pre-trained dictionaries, too. With the merge of the fast-cover dictionary builder into zstd, I would totally expect dictionary pre-training (using samples of the dataset) to be viable. This is on top of the incredible number of tunables and context-specific options that exist in zstd and lz4, each a knob that can help users get better results with compression.
To summarize, most high-performance generic compression algorithms perform fairly similarly when you hold some requirements (compression ratio, decompression speed, etc) constant. Some make certain tradeoffs to get large gains in certain niches, while sacrificing in others; for example, experimental compressors like LZSSE sacrifice compression ratios to get stunning (de)compression speeds. However, as of the last few years, experimental compressors have lost their monopoly on these niches. A wide array of tunables are exposed in mainstream compressors like zstd and lz4, allowing users to tap into similar benefits. Moreover, research into domain-specialized compression (which is where the majority of gains are in compression research) have spilled over into mainstream compressors, like lz4 and zstd; these libraries support a bunch of features like pre-trained dictionaries and context-specific options, the latter of which presents a massive array of knobs for improving performance on a per-dataset basis.
If I am allowed to, this brings me to a proposal: we can have both new compressors added, and research and development related to optimized usage of the compressors that are already supported. I would gladly volunteer to apply the state-of-the-art work being done by the lz4 and zstd developers to the codebase, while Abi Palagashvili continues the original mission. From skimming the codebase, it seems unlikely that (despite how close the goals are) we'd be editing too much overlapping code and creating many conflicts.
Implementation of GROUPING SETS in ClickHouse
Hello, can I do this task? @alexey-milovidov
User defined data types in ClickHouse
Hello, can I do this task? @alexey-milovidov
@gonzalezjo thanks for such a detailed response! I'm getting started with this task soon. I'm not experienced enough in optimization with C++ subtleties and SSE instructions, but I hope I'll deal with it. I also found a fresh paper from the VLDB conference on string compression: http://vldb.org/pvldb/vol13/p2649-boncz.pdf and its source code: https://github.com/cwida/fsst. I realise that the main goal is general-purpose compression, but I'll test it too if there is enough time.
@alexey-milovidov here is a list of proposed libraries:
https://github.com/inikep/lizard
https://github.com/ConorStokes/LZSSE
https://github.com/centaurean/density
https://github.com/IlyaGrebnov/libbsc
https://github.com/fusiyuan2010/CSC
https://github.com/richgel999/lzham_codec
Could you create the corresponding issue, or is that my responsibility?