ClickHouse: Any plans to support hot/cold data separation at the partition level?

Created on 19 Nov 2018 · 16 comments · Source: ClickHouse/ClickHouse

With the completion of the Q1-Q2 2018 roadmap, ClickHouse supports storing data on multiple disk volumes of a single server at the database or table level. Is there a plan to support this feature at the partition level?

We want to migrate PB-level data to ClickHouse and hope that partitions can be stored on media with different performance characteristics, such as HDD and SSD.

We define hot/cold data based on partitions by date. We hope the cold data partitions can be stored on HDD and the hot data partitions on SSD, in order to achieve a good cost/benefit ratio.

I think this is a common practice in the field of big data. What do you think about it?

Labels: comp-multidisk, feature, performance, st-fixed

All 16 comments

The big question is what is the best way to split data: at the partition level inside each node, or at the shard level in the cluster? I think doing both is twice the effort. Better pick one.

@liuqian1989 it's a kind of obvious possible feature, but in practice ClickHouse works fast enough even on 100% HDD, given there's enough memory for the page cache, so its priority is not that high.

I'd suggest you try putting all those petabytes on HDD and see how it goes. Or have you already tried and were not satisfied? In any case, please give us more details.

@blinkov yep, when the 'hot' data fits in memory, HDD is not an issue. But if you have a lot of data, you really need to read some of it from storage, and in that case the read speed of different storages can vary a lot. The costs of different storages also differ a lot.

That feature would also make it possible to create an 'archive' with 'very cold' partitions stored on cheap and slow storage for occasional access (slow access to such archived data is completely OK).

So, for example: the last week on SSD, the last year on HDD, and older data on AWS S3 (see also #1394) or another slow archive storage.

Also, reads of those 'archived' partitions should be limited, so ideally their indexes would not be preloaded at server start.

> @liuqian1989 it's a kind of obvious possible feature, but in practice ClickHouse works fast enough even on 100% HDD, given there's enough memory for the page cache, so its priority is not that high.
>
> I'd suggest you try putting all those petabytes on HDD and see how it goes. Or have you already tried and were not satisfied? In any case, please give us more details.

@blinkov I really trust the performance of ClickHouse on HDD and very much agree with your point. :) In my case HDD is not an issue, and providing enough memory is also not an issue; I just want to find a cost-effective way to store this huge amount of data.

Let me describe our case:

  • ~ 10TB of app logs per day.
  • ~ 1TB increase per month.
  • ClickHouse tables partitioned by date (YYYYMMDD).
  • Common sharding and replica set architecture.
  • Deployed on 18 servers.
  • Our flow:

    • [android/ios client app] => [server api] => [kafka] => [clickhouse] => [grafana]

We have a wealth of HDD resources but not much SSD (because of the price). As @filimonov replied above, we actually define the last week as hot data.

So, in the HDD + SSD case, we prefer to use SSD to serve the hot data, and we also accept the performance of loading cold data from HDD.

As for why I want to split data at the partition level: most logging data is written to a single table by date/timestamp and is written continuously. Partitioning naturally provides a way to split data for large tables, with no need to split the table and no need to split a query into multiple queries.

If splitting data at the table partition level were supported, it would also keep both the cold data and the hot data online.
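
As a concrete illustration of the layout described above, a daily-partitioned log table could look roughly like this (the table and column names are made up for the example):

CREATE TABLE logs.app_events
(
    event_date Date,
    event_time DateTime,
    app String,
    message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (event_date, event_time);

Each day then becomes its own partition, which is exactly the unit a hot/cold split would operate on.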

Also, a big benefit here would be the ability to ingest data more reliably. Superior NVMe performance would help with writes, and merged partitions could then be placed on HDD. That's actually one way to implement the feature: when doing a merge, place the resulting partition on HDD if it is cold enough by then.

> The big question is what is the best way to split data: at the partition level inside each node, or at the shard level in the cluster? I think doing both is twice the effort. Better pick one.

@Imaskar yep, I understand your view. I have updated the issue with new details; you are welcome to join the discussion and share your ideas. :) See https://github.com/yandex/ClickHouse/issues/3605#issuecomment-440170529

This task is assigned to @igron99 (external contributor).

We have a very similar scenario to the one @thinksinside-LQ has described. We would benefit greatly if "older" partitions of a table could be stored in a different location (an HDD partition). Currently we use Postgres and just move "old" table partitions to a different tablespace, which is located on an HDD partition.
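
For reference, the Postgres approach mentioned above is roughly the following (the tablespace and partition names are just examples):

CREATE TABLESPACE hdd_archive LOCATION '/mnt/hdd/pg_tablespaces/archive';
-- move an "old" partition (a child table) to the HDD-backed tablespace
ALTER TABLE app_logs_2018_01 SET TABLESPACE hdd_archive;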

Is there a roadmap for implementation of this feature already?

Or are there currently any manual workarounds to achieve this functionality?

> Or are there currently any manual workarounds to achieve this functionality?

You can mount/symlink a database folder in /var/lib/clickhouse/data to another storage and move data by detaching/attaching parts from the fast-storage database to the slow-storage database (of course the table definitions should match).
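
A rough sketch of that workaround, assuming two identically defined tables fast_db.logs and slow_db.logs (the names are placeholders) and a partition 20181101 to be moved:

-- detach the partition from the fast-storage table; its parts end up in that table's detached/ directory
ALTER TABLE fast_db.logs DETACH PARTITION 20181101;
-- move the detached part directories on the filesystem from
-- /var/lib/clickhouse/data/fast_db/logs/detached/ to /var/lib/clickhouse/data/slow_db/logs/detached/,
-- then attach them to the slow-storage table:
ALTER TABLE slow_db.logs ATTACH PARTITION 20181101;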

@filimonov if I understand you correctly, this would leave us with 2 separate databases and 2 separate tables, each in its own database. So in order to use this solution, we would also have to do a UNION ALL between the 2 tables in different databases to get the whole dataset. Please correct me if I misunderstood your suggestion.

What we would like to achieve is to have 1 logical table with many partitions, which are physically written on different media. Let's say we partition the table by toYYYYMM(..), and we put/archive all the 2018* partitions on HDD and keep the 2019* partitions on SSD. From the user's standpoint, they would still be able to query just one table and get data from 2018 and 2019 without needing a UNION ALL between tables in separate databases.

Hello,
I just want to share a few thoughts on this subject, as I'm looking to implement the same thing.

I can imagine 3 evolutions to reach this goal:
1: ALTER PARTITION MOVE PART TO [NODE.]PATH
Allow moving a partition to a different path. Of course the move is safe, i.e. copy the files (keep the old ones active during the copy), and only update the reference once the copy is done and verified. During this phase, block any merges; if new parts appear, copy them as well.
PATH is not the physical path, but a reference from the configuration file (hotpath, coldpath, ...).
Of course this should be done online with the minimum locking possible. Optionally, if the table is distributed, the partition could also be moved to another host...

2: Add a cron for maintenance scripts in ClickHouse, allowing 'SQL scripts' to be defined on a DB and run at regular intervals. (This can be done externally, but I like the idea of it being part of the schema to keep things consistent together.)

3: Allow ALTER PARTITION MOVE to take as input a SQL query (on the system tables) that selects all the partitions to be moved.

With this in place you can implement some really advanced scenarios.

Such as:
Moving historical data into a cold backup after a certain time can easily be implemented by selecting the partitions older than X days and moving them to cold storage (see the sketch below for what such a selection could look like).

But we can also implement some even more advanced behavior in ClickHouse.
Take the partitions with the highest reads over the last X days and move them to the hot storage (if they are not already there), and take the least used partitions on the hot storage and move them to the cold one.
To do that, I was thinking of using the query log, since system.parts does not maintain read/write byte counters. (But as a 4th feature, extending it could be really useful.)
With this in place you get a kind of adaptive storage management based on access patterns.
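
For example, the 'partitions older than X days' selection could be expressed against system.parts along these lines (the database and table names and the 30-day threshold are placeholders):

SELECT DISTINCT partition
FROM system.parts
WHERE database = 'logs'
  AND table = 'app_events'
  AND active
  AND max_date < today() - 30;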

Here we are talking about simple hot/cold, but you could also implement any number of tiering levels. For instance, these features would also allow moving historical partitions where you detect heavy insertion onto a write-optimized path...

Using a reference for the path is also a key point for schema maintenance, and for when you have heterogeneous mount points across nodes in a single cluster. Of course the path reference can/should be present in the system.parts table.

A last point could be to reduce the granularity to the columns of a part, allowing only the most used columns to be moved to the hot path, or, in the context of more complex deployments, putting only the primary key on the fast mount point and the other columns on a slower one.

> @filimonov if I understand you correctly, this would leave us with 2 separate databases and 2 separate tables, each in its own database.

Actually, it can be in one database; it's enough that the table directories point to different storages.

So the simplest way to achieve it currently:

create table mydb.mytable_ssd ( ... ) Engine=MergeTree ...;
create table mydb.mytable_hdd as mydb.mytable_ssd Engine=MergeTree  ...;

Mount (or symlink) /var/lib/clickhouse/data/mydb/mytable_hdd and /var/lib/clickhouse/data/mydb/mytable_ssd to different storages.

Create a table with Engine=Merge over those two tables to make it possible to select from both of them as from one logical table:

CREATE TABLE mydb.mytable as mydb.mytable_ssd ENGINE=Merge(mydb, '^mytable_');

> What we would like to achieve is to have 1 logical table with many partitions, which are physically written on different media.

Yes, I clearly understand that requirement and hope it will be implemented exactly as you describe. You were asking about workarounds which can be used right now (before that feature is implemented), so the answer is 'yes, there are', and it's doable if you need it right now.

@filimonov Thank you for the clarification and examples. I applied your suggested approach to our proof of concept and it works great!
From the selecting perspective this is just as good as the future solution is expected to be.
From the inserting perspective, there is some extra work needed to take care of inserting into and moving data between the correct underlying tables (SSD/HDD), but it is still very manageable with a simple script.
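
A minimal sketch of the core of such a script, assuming the table names from the workaround above and a date column named event_date (both assumptions), with one statement pair per partition to move:

-- copy one day's partition from the SSD table to the HDD table
INSERT INTO mydb.mytable_hdd SELECT * FROM mydb.mytable_ssd WHERE toYYYYMMDD(event_date) = 20181101;
-- once the copy is verified, drop that partition from the SSD table
ALTER TABLE mydb.mytable_ssd DROP PARTITION 20181101;

Note that this rewrites the data rather than moving parts; the detach/attach approach mentioned earlier avoids the rewrite.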

Same issue here: we don't have enough SSD storage to hold all the table data. It would be great to be able to store some partitions in another location.
@filimonov Thanks for the Merge solution, but it is a hack =)))) Adding the feature as @TH-HA describes would be great.

@igron99 has changed his name to @ObjatieGroba. He will definitely do this task.

It was implemented by @ObjatieGroba and @alesapin in version 19.15: https://github.com/ClickHouse/ClickHouse/blob/master/CHANGELOG.md#clickhouse-release-191522-2019-10-01
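
Roughly, with that release a table can be assigned a storage policy defined in the server configuration (here a hypothetical policy 'hot_cold' with an SSD-backed volume 'hot' and an HDD-backed volume 'cold'), and partitions can be moved between volumes; see the linked changelog and the documentation for the exact syntax:

CREATE TABLE logs.app_events
(
    event_date Date,
    event_time DateTime,
    message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (event_date, event_time)
SETTINGS storage_policy = 'hot_cold';

-- move a cold partition from the SSD volume to the HDD volume
ALTER TABLE logs.app_events MOVE PARTITION 20181101 TO VOLUME 'cold';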
