Some OLAP systems, such as Snowflake, use S3 directly as their table storage, even for temporary data. This saves money for cloud users and also saves time on data ETL. From this benchmark we can see that the S3-based OLAP (Snowflake) does not show a remarkable performance difference from the local-storage-based one (Redshift). There are also similar projects such as rocksdb-cloud, which uses S3 as RocksDB's persistent storage; it could serve as a reference for making ClickHouse more cloud native.
FYI, we successfully used ClickHouse on top of Google Cloud Storage via gcsfuse.
Hi @valyala, is this an open-source project?
gcsfuse is a user-space file system for interacting with Google Cloud Storage, which means you can store your data in any Google Cloud Storage bucket. Because it goes through FUSE, gcsfuse will impact performance (see https://github.com/GoogleCloudPlatform/gcsfuse#performance), but it could be a useful option for storing old, rarely accessed data.
@hagen1778 Yes, I meant a plugin that hooks ClickHouse up to gcsfuse storage, not gcsfuse itself.
You don't need a plugin.
We are using gcsfuse only for RO purposes, and I don't know how it will handle writes, but you can try it out. What we do:
- enable the user_allow_other option in the FUSE config /etc/fuse.conf
- mount the bucket with the allow_other parameter

Thanks @hagen1778, I will try it.
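For anyone following along, here is a minimal sketch of how a read-only gcsfuse mount for ClickHouse might look; the bucket name and mount point are placeholders, and the last step depends on how you want ClickHouse to see the data:

```bash
# Allow non-root users (e.g. the clickhouse user) to access mounts created
# with allow_other -- this line belongs in /etc/fuse.conf.
echo "user_allow_other" | sudo tee -a /etc/fuse.conf

# Mount the bucket (my-bucket is a placeholder) at a path ClickHouse can read.
# --implicit-dirs exposes objects with "/" in their names as directories.
sudo mkdir -p /mnt/gcs-clickhouse
gcsfuse --implicit-dirs -o allow_other my-bucket /mnt/gcs-clickhouse

# From here, point ClickHouse at the mount, e.g. by symlinking old partitions
# into /var/lib/clickhouse or adjusting <path> in config.xml (setup-specific).
```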
Did anyone end up using ClickHouse on top of FUSE in the long term?
AFAIK, from the roadmap, 2019 Q2 seems to include support for S3-like object storage, right?
@tangyong that item is not about replacing native storage with S3; it's more about import/export and on-the-fly processing. Actually, this is already partially possible with the URL table engine or table function, but it lacks authentication support, so it won't work with non-public S3 buckets.
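For reference, a rough sketch of what that looks like today with the url() table function; the bucket, object, and column names below are made up, and it only works because the object is publicly readable (there is no way to pass S3 credentials):

```bash
# Query a public CSV object over HTTPS via the url() table function.
# Bucket, file, and column names are hypothetical examples.
clickhouse-client --query "
    SELECT count()
    FROM url(
        'https://my-public-bucket.s3.amazonaws.com/events.csv',
        'CSV',
        'event_date Date, user_id UInt64, value Float64')
"
```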
@hagen1778
> We are using gcsfuse only for RO purposes

How is the performance of such reads, roughly how many megabytes per second? I'm thinking of using gcsfuse too.
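If it helps frame an answer, a rough way to measure it is to time a large sequential read straight off the mount (the mount point and file name below are placeholders for whatever is in your bucket):

```bash
# Sequential read of one large object from the gcsfuse mount; dd reports MB/s.
# The mount point and file name are placeholders for your own setup.
dd if=/mnt/gcs-clickhouse/some-large-part.bin of=/dev/null bs=1M status=progress
```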