ClickHouse: Support S3 as persistent storage

Created on 24 Oct 2017 · 10 comments · Source: ClickHouse/ClickHouse

Some OLAP systems, such as Snowflake, use S3 directly as their table storage, even for temporary data. This saves money for cloud users and, additionally, saves time on data ETL. From this benchmark we can see that an S3-based OLAP system (Snowflake) shows no remarkable performance difference from a local-storage-based one (Redshift). There are also similar projects, such as rocksdb-cloud, which uses S3 as RocksDB's persistent storage; these could serve as a reference for making ClickHouse more cloud-native.

feature

All 10 comments

FYI, we successfully used ClickHouse on top of Google Cloud Storage via gcsfuse.

Hi @valyala, is this an open-source project?

gcsfuse is a user-space file system for interacting with Google Cloud Storage, meaning you can store your data on any class of Google Cloud Storage. gcsfuse is written in Go, and being a FUSE file system it will impact performance (see https://github.com/GoogleCloudPlatform/gcsfuse#performance), but it could be a useful option for storing old, rarely accessed data.

@hagen1778 Yes, I mean the plugin that hooks ClickHouse up to gcsfuse storage, not gcsfuse itself.

You don't need a plugin.
We are using gcsfuse only for RO purposes, and I don't know how it will handle writes, but you can try it out. What we do (a shell sketch follows the list):

  • install gcsfuse
  • uncomment the user_allow_other option in the FUSE config /etc/fuse.conf
  • mount the Google Cloud Storage bucket somewhere in the filesystem; don't forget the allow_other parameter
  • make symlinks from the mounted disk to your CH working dir (or even try to change the working dir to the mounted disk)
  • add the mount to startup actions if you want
  • restart CH
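
A minimal shell sketch of the steps above, assuming a hypothetical bucket name my-ch-bucket and the default ClickHouse data directory /var/lib/clickhouse; adjust names and paths to your setup:

    # install gcsfuse (Debian/Ubuntu package; repo setup per the gcsfuse docs)
    sudo apt-get install -y gcsfuse

    # uncomment user_allow_other in the FUSE config
    sudo sed -i 's/^#\s*user_allow_other/user_allow_other/' /etc/fuse.conf

    # mount the bucket (my-ch-bucket is hypothetical); allow_other lets the
    # clickhouse user read a mount created by another user
    sudo mkdir -p /mnt/gcs
    gcsfuse -o allow_other --implicit-dirs my-ch-bucket /mnt/gcs

    # symlink the mounted disk into the ClickHouse working dir
    sudo ln -s /mnt/gcs /var/lib/clickhouse/gcs

    # restart ClickHouse
    sudo systemctl restart clickhouse-server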

Thanks @hagen1778, I will try it.

Did anyone end up using ClickHouse on top of FUSE in the long term?

AFAIK, from the roadmap, 2019 Q2 seems to include support for S3-like object storage, right?

@tangyong that item is not about replacing native storage with S3; it's more about import/export and on-the-fly processing. Actually, this is already partially possible with the URL table engine or table function, but it lacks authentication support, so it won't work with non-public S3 buckets.
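
For reference, a minimal sketch of the URL table function mentioned above, assuming a hypothetical publicly readable bucket and CSV object (private buckets won't work, since there is no authentication support):

    clickhouse-client --query "
        SELECT count()
        FROM url('https://my-bucket.s3.amazonaws.com/events.csv',
                 CSV,
                 'event_date Date, user_id UInt64')"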

@hagen1778

We are using gcsfuse only for RO purposes

How is the performance of such reads, e.g. how many megabytes per second? Thinking of using gcsfuse too.
