I can see that ClickHouse has a backup feature described in the documentation: https://clickhouse.yandex/docs/en/query_language/queries.html#backups-and-replication. What I think would be useful for users is incremental backups, which would protect against total disasters such as major hardware failures or simple human error. Elasticsearch, for example, has plugins that let users take snapshots of all indices (an index is roughly the equivalent of a ClickHouse table) or just a subset of them. One such plugin lets users send data to distributed storage like AWS S3, Azure Storage or Google Cloud Storage. Storing backups in S3 has two major advantages: it cuts costs, and S3 is very durable. Some companies also have large HDFS clusters that could be used to keep backups.
In the title of this issue I mentioned incremental backups, which have a huge advantage and could be very useful in ClickHouse. For more sophisticated tasks ClickHouse users can/should use the MergeTree engine. Since MergeTree stores data in parts that are immutable once written (merges produce new parts rather than modifying existing ones), one could hypothetically create one full snapshot and then create incremental snapshots every day or every hour. Each incremental snapshot would only require ClickHouse to send the new parts to the distributed storage system. Incremental backups would make ClickHouse more durable because snapshots would become inexpensive to execute and store.
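To illustrate the idea, here is a minimal sketch, not an official mechanism: it assumes a MergeTree table named `events` partitioned by month, a hypothetical bucket `my-ch-backups`, the default data path `/var/lib/clickhouse`, and the AWS CLI installed on the server. `ALTER TABLE ... FREEZE PARTITION` hard-links the current parts into the `shadow/` directory, and `aws s3 sync` only transfers files that are not already in the bucket, so repeated runs approximate incremental snapshots.

```bash
#!/usr/bin/env bash
# Sketch of an incremental backup: freeze the current immutable parts,
# then sync only the files that are not yet in S3.
# Table, partition and bucket names are placeholders.
set -euo pipefail

TABLE="events"
PARTITION="'201712'"           # partition to snapshot; adjust to your schema
BUCKET="s3://my-ch-backups"    # hypothetical bucket
CH_DATA="/var/lib/clickhouse"

# 1. Hard-link the partition's current parts into ${CH_DATA}/shadow/<N>/
clickhouse-client --query "ALTER TABLE ${TABLE} FREEZE PARTITION ${PARTITION}"

# 2. The server records the latest freeze increment number here.
N=$(cat "${CH_DATA}/shadow/increment.txt")

# 3. Upload. `aws s3 sync` skips files already present under the same key
#    with the same size, so only newly created parts are transferred.
aws s3 sync "${CH_DATA}/shadow/${N}/" "${BUCKET}/${TABLE}/"

# 4. Optionally remove the local hard-linked copy once the upload succeeds.
rm -rf "${CH_DATA:?}/shadow/${N}"
```

Because frozen parts are hard links, the freeze step costs almost no extra disk space, and only the S3 transfer of new parts does real work.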
This would be great and would make CH more appealing to organizations that need truly production-ready, mature tech. Is this something that could be added to the CH roadmap for early 2018?
I can offer hands-on help with this, but I think I can only assist; I am not in a position to lead it.
This is a really good point. We will prepare to develop these plugins; if anyone else wants to join this plan, welcome!
A Bash implementation of CH backups that I found: https://github.com/jetbrains-infra/clickhouse-tasks/blob/master/backup/entrypoint.sh
My tool can store backups efficiently on disk and upload them to S3 either in full or incrementally.
I will be glad to get feedback and bug reports.
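For readers new to the tool, a minimal usage sketch might look like the following. The subcommand and flag names (`create`, `upload`, `restore`, `list`, `--diff-from`) are assumptions based on the project's documentation and may differ between versions; check the README for the exact interface and for the remote-storage configuration.

```bash
# Hypothetical clickhouse-backup usage; command and flag names are assumptions.

# Create a local backup (hard-links the frozen parts, so it is cheap on disk).
clickhouse-backup create full-2019-01-01

# Upload it to the remote storage (e.g. S3) configured for the tool.
clickhouse-backup upload full-2019-01-01

# Later: create another backup and upload only the parts that are new
# relative to the previous one (incremental upload).
clickhouse-backup create incr-2019-01-02
clickhouse-backup upload --diff-from=full-2019-01-01 incr-2019-01-02

# List local and remote backups.
clickhouse-backup list
```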
@AlexAkulov you could have saved everyone a few clicks by providing a direct link: https://github.com/AlexAkulov/clickhouse-backup/ 😉