Scylla: Support JBOD for data directories

Created on 20 Jan 2016 · 21 Comments · Source: scylladb/scylla

JBOD (just a bunch of disks) is an alternative to using a RAID. Rather than configuring a storage array to use a RAID level, the disks within the array are either spanned or treated as independent disks.
Since Scylla (and Cassandra) rely on node replication, RAID redundancy is not relevant.

In scylla.yaml JBOD will look like:

data_file_directories:
    - x:/lib/var/cassandra/data
    - y:/lib/var/cassandra/data

Where x and y are two drives, and data should spread evenly over the configured drives, proportionate to their available space. In Cassandra, JBOD also allows you to take advantage of the disk_failure_policy setting:

You can configure Cassandra to keep going, doing what it can, if a disk becomes full or fails completely. This has an advantage over RAID0 (where you would effectively have the same capacity as JBOD): you do not have to restore the whole data set from backup (or run a full repair), but just run a repair for the missing data. On the other hand, RAID0 provides higher throughput (depending on how well you know how to tune RAID arrays to match filesystem and drive geometry).
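To make "proportionate to their available space" concrete, here is a minimal sketch of weighted directory selection. It is an illustration only, not Cassandra's or Scylla's actual placement code; the function names and the free-space figures are hypothetical.

```python
import random


def expected_shares(free_bytes):
    """Return the fraction of new data each directory should receive,
    proportional to its free space."""
    total = sum(free_bytes.values())
    if total <= 0:
        raise ValueError("no free space in any data directory")
    return {path: free / total for path, free in free_bytes.items()}


def pick_directory(free_bytes, rng=random):
    """Pick a directory for the next file, weighted by free space."""
    paths = list(free_bytes)
    weights = [free_bytes[p] for p in paths]
    return rng.choices(paths, weights=weights, k=1)[0]


# Hypothetical free space (in GiB) for the two example drives.
free = {"x:/lib/var/cassandra/data": 300, "y:/lib/var/cassandra/data": 100}
shares = expected_shares(free)
for path, share in shares.items():
    print(f"{path}: {share:.0%} of new data")
```

With these assumed numbers, the drive with three times the free space receives three quarters of new writes on average.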

sources:
http://stackoverflow.com/questions/15925549/how-does-cassandra-split-keyspace-data-when-multiple-directories-are-configured
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRecoverUsingJBOD.html

User Request cassandra 2.2 compatibility enhancement

tzach

All 21 comments

It is worth taking this blog post into account: http://www.datastax.com/dev/blog/improving-jbod

Note that since we use per-core shards, we can just stack them up per shard group on separate disks.

dorlaor on 20 Jan 2016

If we have N shards and D disks, then each shard should receive D/N of the total disks; this can be a fraction.

For example, with three shards and two disks, shard 0 would occupy disk 0, shard 1 would split its token range between disk 0 and disk 1, and shard 2 would use disk 1.

avikivity on 14 Sep 2016
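To illustrate the D/N allocation avikivity describes above, here is a minimal sketch that lays shards out contiguously across disks. The function name `shard_disk_layout` is hypothetical and this is not Scylla code; it only reproduces the three-shards, two-disks example.

```python
from fractions import Fraction


def shard_disk_layout(n_shards, n_disks):
    """For each shard, return a list of (disk, fraction_of_that_disk) pairs.

    Each shard receives n_disks / n_shards worth of disk, assigned
    contiguously, so a shard may straddle two (or more) disks.
    """
    per_shard = Fraction(n_disks, n_shards)
    layout = []
    for shard in range(n_shards):
        start = shard * per_shard
        end = start + per_shard
        pieces = []
        disk = int(start)  # index of the disk containing `start`
        while start < end:
            chunk_end = min(end, Fraction(disk + 1))
            pieces.append((disk, chunk_end - start))
            start = chunk_end
            disk += 1
        layout.append(pieces)
    return layout


for shard, pieces in enumerate(shard_disk_layout(3, 2)):
    print(f"shard {shard}:", [(d, str(f)) for d, f in pieces])
```

For three shards and two disks, shard 0 gets two thirds of disk 0, shard 1 gets one third of each disk, and shard 2 gets two thirds of disk 1.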

This is kind of blocker for us to go live with scylladb.

kuzemchik on 8 Nov 2016

👍2

> This is kind of blocker for us to go live with scylladb.

@kuzemchik Can you elaborate on why that is the case? Is it for better speed, or redundancy?

tzach on 8 Nov 2016

1) We are already using it with Cassandra, and rebuilding nodes by moving data back and forth in order to build a RAID is not very convenient.
2) RAID1/RAID5: unnecessary redundancy (Cassandra replication handles it better).
3) RAID0: a single disk failure fails the whole node.

kuzemchik on 8 Nov 2016

Thanks @kuzemchik. To put it in my own words: rebuilding a node after one disk has failed is faster with JBOD.

tzach on 8 Nov 2016

@tzach Well, it is not only about the rebuilding process. If the cluster consists of a few powerful nodes (for example, 5 nodes with 12 disks each), then by losing one disk you effectively lose 20% of throughput with RAID0.

kuzemchik on 8 Nov 2016
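The 20% figure in kuzemchik's comment above can be checked with a small calculation. This sketch assumes throughput is spread evenly over all disks in the cluster and that a RAID0 node is lost entirely when one of its disks fails; the function name is hypothetical.

```python
def fraction_lost(nodes, disks_per_node, layout):
    """Fraction of cluster throughput lost when a single disk fails."""
    if layout == "raid0":
        # One failed disk takes the whole node down.
        return 1 / nodes
    if layout == "jbod":
        # Only the failed disk's share is lost.
        return 1 / (nodes * disks_per_node)
    raise ValueError(f"unknown layout: {layout}")


for layout in ("raid0", "jbod"):
    print(layout, f"{fraction_lost(5, 12, layout):.1%}")
```

Under these assumptions, RAID0 loses one node in five, while JBOD loses one disk in sixty.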

@kuzemchik fair points, however there are downsides to JBOD too:

- The burden of making sure free space exists in all directories.
- Free space can get fragmented, which is especially important for compaction.
- RAID0 loses data, but you can select a higher RAID level and you wouldn't lose data.

It's not that we're against it, but there are other items with higher priority (counters, LWT, TWCS, and many more future storage features). I would consider switching to Scylla earlier, even without JBOD support.

dorlaor on 8 Nov 2016

C* 3.2 improved the JBOD implementation [1] by making sure a single token never exists in more than one data directory, and added nodetool relocatesstables to rewrite SSTables into the new directory structure.

[1] http://www.datastax.com/dev/blog/improving-jbod

tzach on 23 Mar 2017
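As a rough illustration of the "one token, one data directory" rule tzach mentions above, the sketch below splits a token range into contiguous slices, one per directory. It is a simplification: it splits the full Murmur3 token range rather than the ranges a node actually owns, and the function names are hypothetical, not Cassandra APIs.

```python
import bisect

MIN_TOKEN = -2**63      # Murmur3Partitioner token bounds
MAX_TOKEN = 2**63 - 1


def split_points(n_dirs):
    """Inclusive upper bounds of n_dirs contiguous, near-equal token ranges."""
    span = MAX_TOKEN - MIN_TOKEN + 1
    return [MIN_TOKEN + span * (i + 1) // n_dirs - 1 for i in range(n_dirs)]


def directory_for_token(token, bounds):
    """Index of the single data directory that owns `token`."""
    return bisect.bisect_left(bounds, token)


bounds = split_points(2)
for token in (MIN_TOKEN, -1, 0, MAX_TOKEN):
    print(token, "-> directory", directory_for_token(token, bounds))
```

Because every token maps to exactly one directory, losing one disk only affects the token slice stored on it.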

Related update in C* 3.2:
https://issues.apache.org/jira/browse/CASSANDRA-6696

tzach on 13 Apr 2017

Hey, just came here to say that this would be a really neat feature to have. Recently discovered that ElasticSearch can utilise multiple data-directories and I was hoping ScyllaDB might too.

heipei on 19 Dec 2019

👍1

@heipei JBOD has benefits but also drawbacks; see my comment from Nov 2016 above. Especially for a fast database like Scylla, we like to make sure we fully utilize all drives in parallel, and thus RAID is better. Scylla also controls the priority of every type of action (compaction, read, write, streaming, workload prioritization), and having many separate directories makes the execution more constrained.

Makes sense?

dorlaor on 19 Dec 2019

OK, I guess it makes sense if there are technical reasons for this. I would have loved to have the option nevertheless, since I'd rather lose one disk with one data-dir on it than the whole RAID-0 if a single disk goes bad.

heipei on 28 Dec 2019

👍1

Scylla is quite fast at streaming too, so the cost of replacing a node is low.

dorlaor on 7 Jan 2020

We too would like to see support for JBOD in Scylla. One argument I have is that software RAID certainly has extra overhead and negatively affects performance even in a non-failure scenario. I don't quite understand the reasoning about one disk becoming full while the others don't. We do not see that happening in Cassandra. I guess that even with random shard placement, when the number of shards is much higher than the number of disks, that should not matter?

andy-slac on 1 May 2020

@andy-slac it's true that there can be scenarios where JBOD is better than RAID, but most of the time it is the opposite. You can configure the RAID stripe size so that usually only a single disk is accessed per request, so you won't need to move many spindles per access, typically just one, exactly as in the JBOD case. Meanwhile, a single shard in the JBOD case won't be able to use more than a single disk for many requests, the opposite of the good balance that RAID provides.

As we try to push the limits with Scylla and reduce the free space required per node, for example with incremental compaction, having a shared pool of free storage becomes more important.

When you do reach the case of a single full disk, the entire node needs to be taken out of the cluster (as in Cassandra). Later, it's not trivial to bring the node back: you'll need to use hinted handoff and repair (which is per node and not per disk), so the benefit is limited.

You're welcome to present a specific case that supports the opposite and your current performance/storage story w/ scylla (using it at Rubin Observatory?)

dorlaor on 1 May 2020
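The stripe-size argument in dorlaor's comment above can be sketched as follows. This assumes a plain RAID0 layout in which consecutive stripe-sized chunks rotate across disks; the function name and the 64 KiB stripe size are hypothetical choices for illustration.

```python
def disks_touched(offset, length, stripe_size, n_disks):
    """Return the set of disks a RAID0 request of `length` bytes at
    byte `offset` touches, assuming chunks rotate round-robin over disks."""
    if length <= 0:
        return set()
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    if last - first + 1 >= n_disks:
        return set(range(n_disks))  # request spans every disk
    return {chunk % n_disks for chunk in range(first, last + 1)}


STRIPE = 64 * 1024  # assumed stripe (chunk) size in bytes

print(disks_touched(0, 4096, STRIPE, 4))           # small aligned read
print(disks_touched(STRIPE - 1, 2, STRIPE, 4))     # read crossing a boundary
print(disks_touched(0, 4 * STRIPE, STRIPE, 4))     # large sequential read
```

Small requests that fit inside one stripe hit a single disk, as in the JBOD case, while large requests are spread across all disks.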

@andy-slac see https://github.com/scylladb/scylla/issues/2601#issuecomment-622669515

dorlaor on 2 May 2020

JBOD is quite problematic. If you lose a disk you can also lose critical information for the system tables, and then the node does not boot any more. If special care is not taken, then user data can be resurrected (losing a tombstone on the failed disk which has been garbage-collected on other nodes).

We can decide that system tables are mirrored on the disks instead of striped, but that is even more work.

avikivity on 2 May 2020
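The data-resurrection scenario avikivity describes above can be modelled with a toy last-write-wins merge. This is a simplified sketch, not Scylla's reconciliation code: the record layout, node names, and the assumption that the other replicas have already purged both the data and the tombstone are all hypothetical.

```python
def reconcile(versions):
    """Last-write-wins merge: newest version, or None if there is none."""
    present = [v for v in versions if v is not None]
    return max(present, key=lambda v: v["ts"]) if present else None


def visible_value(versions):
    """Value a reader sees after merging all replica responses."""
    newest = reconcile(versions)
    if newest is None or newest["tombstone"]:
        return None
    return newest["value"]


data = {"ts": 1, "value": "old", "tombstone": False}
delete = {"ts": 2, "value": None, "tombstone": True}

# Node A holds the data on disk 0 and the tombstone on disk 1.
node_a = {"disk0": data, "disk1": delete}
# Nodes B and C have already garbage-collected data and tombstone.
node_b = None
node_c = None

before = visible_value([reconcile(list(node_a.values())), node_b, node_c])
del node_a["disk1"]  # disk 1 fails, taking the tombstone with it
after = visible_value([reconcile(list(node_a.values())), node_b, node_c])
print("before disk failure:", before)
print("after disk failure:", after)
```

Once the tombstone is gone from the only node that still had it, the deleted value becomes visible again.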

@dorlaor, I have to disagree that RAID is generally better than separate disks. Of course, if the application does not want to care about I/O details, then RAID is the simplest solution, though you'll have to invest time in tuning the RAID configuration for your application's needs. OTOH, if the application is smart enough to understand the behavior of the storage system and schedule its I/O to take advantage of that, it can achieve much better performance and behavior than RAID. I had the impression that Scylla indeed tries to achieve that goal. Cassandra shows us that utilizing separate disks is possible and works reasonably well. Looking at the history of this ticket, I see that you are trying to find all possible reasons not to implement it. I think it would be better if you just told us explicitly that this is never going to be implemented; I'd stop arguing then.

andy-slac on 4 May 2020

Cassandra implemented JBOD not because its performance is better, but because streaming in Cassandra is slow and its I/O is less efficient as well.

I'm not arguing for a principle, and my approach is not to find all the reasons why not. I'd like to see a really good reason why we should implement JBOD; so far we haven't found one.

Could this be a case where you are just asking for JBOD without data? We have users with software RAIDs of 16 and 24 disks, and we see full utilization of all cores and combined disk speeds of 12GB/s. How about you test the RAID setup and report back?

dorlaor on 4 May 2020

I'm looking forward to solving this problem