Scylla: Secondary index

Created on 24 Sep 2015  Â·  57Comments  Â·  Source: scylladb/scylla

Is scylladb a "newsql" database like cockroachdb, clustrix, ...?
More precise:
Is there support for a secondary index?
If yes, is the secondary index stored on the same machine as the record? Is it necessary to ask all the machines to query a secondary index?

CQL User Request enhancement

Most helpful comment

SI (Secondary indexes) is in progress. @penberg just recently sent an important patchset which makes it 70% there and there isn't much to do in terms of coding. Since our implementation relies on our Materialized View feature, we need it to reach general availability maturity level. It's in experimental mode today and misses couple of important pieces like repair and some more.

I think it's fair to say that SI will be committed as experimental in Oct and get GAed through the rest of the year.

All 57 comments

On Thu, Sep 24, 2015 at 1:14 PM, doublex [email protected] wrote:

Is scylladb a "newsql" database like cockroachdb, clustrix, ...?
More precise:
Is there support for a secondary index?
If yes, is the secondary index stored on the same machine as the record?
Is it necessary to ask all the machines to query a secondary index?

We have a half-ready secondary index code which follow's Cassandra one.
We'll complete it but more interested in the materialized view version of
Cassandra.

The nice thing is that our performance allows one to use more views and
really untangle the value
of indexes.

—
Reply to this email directly or view it on GitHub
https://github.com/scylladb/scylla/issues/401.

I think secondary indices have strong value (without the performance hit) wrt lookups in wide rows. Cases where you know the partition key but want to look up rows without including the clustering key.

I stumbled upon this paper: http://researcher.watson.ibm.com/researcher/files/us-wtan/DiffIndex-EDBT14-CR.pdf which might be relevant if we ever want to use a different algorithm for secondary indexes than Cassandra's.

All the DBs on the market have some flaws: either they are based on "garbage collection" (e.g. cockroachdb, hbase, ...) or without a sorted distributed index (cassandra, riak, ...)

A relevant paper on a more scalable approach to secondary indexes: https://ramcloud.atlassian.net/wiki/download/attachments/6848671/slik.pdf

The index should be a distributed, sorted map - otherwise it could not be used for "range requests".

@doublex can you please expand on what you mean?

I think that SASI's method is particularly well-suited for range requests which result in a large number of results - e.g., find all items with price between $1 and $10. If we expect the results' price field to fall into that range, but do _not_ expect the results to be sorted by price, SASI can return the result sorted by token (murmur3 hash of the item's original partition key) and therefore instead of querying all the nodes at once, it can query just the node with the lowest tokens, get a bunch of results, and only query the next node if it didn't get enough results from the first node.

However, in different use cases, it's less clear to me how SASI is useful... If the search only turns up a small number of results, all nodes will need to be queried. If each node has a very large number of vnodes (in scylla, we have 256 vnodes per node!), we can get very few results from each vnode and even getting token order might require querying many nodes. And if we need to sort the results by some meaningful order, not random token order, again we will need to query all nodes and collect all the results.

If the index is stored on the same machine ("machine local"), all the nodes have to be queried. If the key is a distributed sorted map, you only ask the node which is responsible for that range.
More precise: A "machine local"-index does not scale with an increasing number of nodes.

TL;DR I think you want Materialized Views

@doublex, "machine local" is the way Secondary Index work in Cassandra. So I assume you don't want that implementation.
The "distributed map" which you are looking for is the way that "Materialized Views" work in Cassandra, and we have a separate issue (#1141) to support that too. The partitions of a Materialized View will be sorted according the specified sorting order - it is _possible_ to sort them in string order (and thereby enable efficient range queries using a single node) but usually in Cassandra users prefer to sort partitions in hash order (murmur3) which gives up on range queries but gives better protection against "hotspots" (nodes which get a significantly busier key range). Arguably, because we have so many vnodes (256 per node), this might give us good enough protection against hotspots, that ordering partitions in natural order might be acceptable, and this will give you the range query feature you were hoping for.

It would be great to have one distributed database which
a) supports range requests (on secondary index)
b) scales with sherd count
c) not based on "garbage collection"
To my knowledge such a database does not exist

@doublex have you looked into DynamoDB Local Secondary Index?

It true to his name as secondary, similar to Cassandra clustering key.

I am so stunned. I can hardly wait for 1.1 or 1.2.
A fast database (written in C++) which supports range requests on a secondary index and scales with shard count is a "missing link".
Hopefully scylladb saves us.

So how does the current implementation of primary indexes really work?
If I have a PK(a,b) and execute a query select * from T where a='value' will the index kick in?

Yes, that case will be very fast.

A good compilation of different secondary index designs:
https://raw.githubusercontent.com/cockroachdb/cockroach/c3bc53a463f0a7036dcea06ac6d109d3ef2aa4fe/resources/doc/scan-efficiency.png

Currently the only distributed database with secondary index (that scales with shard count) are "HBase" (bloatware, ...) and "Hypertable" (fat client, ...), maybe CockroachDB in future (mark-and-sweep garbage collector, ...)

@avikivity How will that work (on PK(a,b)) since it can't know the partition when you're just filtering on "a=x", you'll need both "a=x + b=y" to know the shard, right ? Or maybe by fast you meant that it will hit every-node but all nodes will be efficient.

@doublex
Hypertable is dead. It also had thrift clients (only c++ was the fat one) (I think cassandra client is also fat like hypertables?). By fat you meant a cache of the server-id of each range?
Cockroachdb is in go with heavy priority towards distributed-transactions so I don't think it will be faster than hbase(but has nicer api, hbase sucks if you're not jvm).
I've read somewhere from a googler that hbase was at least 3x slower compared to BigTable(c++). Hell now they can probably compare them since you can rent BigTable on google-cloud.
For scylla to make the secondary index like a distributed-sorted-map (bigtable,hypertable,hbase) they'll need to make the primary-key that way too and then the secondary-index can just be another table (like in hbase,cockroachdb) https://github.com/scylladb/scylla/issues/984.
Note that in hypertable, data WAS verified on read when you're filtering by secondary-index by also reading from the primary-index. This was because there weren't transactions, so they always updated the secondary-index before the primary-index (and when failures happen in the middle, you have secondary keys that that are orphan), so reads are a little slower if you only need the "primary-key" and not full-row (in this case the db will read it anyway).
And since writes were without reads, hypertable didn't read the old value of an index, so it wrote duplicates which were later deleted on compactions. (also why you couldn't index counters)

@ddorian PRIMARY KEY (a, b) means a is the partition key and b is the clustering key.
PRIMARY KEY ((a, b)) means (a, b) is the partition key (and no clustering key).

For scylla to make the secondary index like a distributed-sorted-map (bigtable,hypertable,hbase) they'll need to make the primary-key that way too and then the secondary-index can just be another table (like in hbase,cockroachdb)

This is not true.
You could distribute the primary key into hash-buckets and the secondary key as distributed-sorted-map. In that case you may want to index the primary key.

@doublex I meant they have to implement the feature. Then it could be optional (if you want range queries on your secondary/primary index or not).

@avikivity @dorlaor
The perfect database would be:
Apache Cassandra compatible column store, with the low latency of Redis.
and the secondary index of HBase (not machine-local like Cassandra)
and the cross datacenter replication of Aerospike
(and maybe small transaction to keep the secondary index in sync with the record)

@doublex - How is Aerospike's datacenter replication better than Cassandra?

@doublex by not machine-local are you referring to DynamoDB like GSI, Cassandra materialized view?

@tzach he means that for a given row, the secondary index may/will be in another node.

Elastic-search/mongodb has local indexes (the index is in the same node as the data) while hbase/bigtable/hypertable/cockroachdb doesn't have local indexes (the secondary index can(usually) be in another node compared to the original row).

This makes possible to query global data by hitting minimum amount of nodes.

Local indexes are better when you can use some kind of routing (say by "project" or "user") and always query by that cluster-key, so index+raw-data will hit same node.

And global indexes are better when you can't cluster your data.
Ex: youtube videos are stored by raw_id as primary-key. If you want to filter by user_id, with global index (since it's sorted) you will hit a very-low number of nodes/ranges. While with local indexes, you will hit every node (since videos are distributed randomly across nodes).

TLDR: How cockroachdb does it: https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/

To my knowledge CockrochDB priorizes transaction over speed, which is challenging because commodity hardware misses a "perfect" clock. Maybe this is the reason why CockrochDB is written in an "mark and sweep" garbage collected language...

IMHO many modern applications simply prefer speed. I can live with writes "out of order".
But the database has to be fast, has to scale with node-count, shouldn't be sluggish while adding nodes, ...

Hopefully ScyllaDB becomes not yet another Cassandra/Riak/Couchdb/Mongodb/Aerospike/... database (machine local index).
A fast distributed database with a secondary index that scales with node-count could obsolete most other distributed databases.

The ScyllaDB team has some big names. I am excited and keep watching their changelog :-)

This was an interesting paper from last week about Replex: http://muratbuffalo.blogspot.com/2016/07/replex-scalable-highly-available-multi.html

Hopefully it's a plugable index implementation.

+1

We're actively working on it. You can google the materialized view early patches.
The plan is to be ready around Dec/Jan

Great news!!!
All other developed databases supporting a distributed secondary index are slow.

@doublex I think you're asking too much here. Best case is to get a plugable-index and do it yourself. First they will implement the "shitty" cassandra features (and there are alot). What you're asking won't be done for at least 2+ years.

You're asking for a different index compared to what dorlaor said above.

@doublex @ddorian the plan is to implement the secondary index on top of materialized views (MV). This case they will scale and we're not following the exact Cassandra implementation. However, MV has its own challenges ;)

@dorlaor
Sound as it would be fast. And a MV would support range-requests on the secondary index.
It should be less than ~100 times slower than Redis (Cockroach is right now ~25,000 times slower).
@ddorian
I agree. Nobody needs all the "shitty cassandra features" (I know what you are talking about...).

More info on Replex architecture, which is basically materialized views which also serve as the replicated data:

https://blog.acolyer.org/2016/10/27/replex-a-scalable-highly-available-multi-index-data-store/

https://www.youtube.com/watch?v=I18SgP7oM8s

Thanks @manigandham I read the USENIX paper, and there is two point I'm not sure about:
Using replication for sec index mean that if one of the nodes is down, query base on the index stored in this node will still work but in a much worse complexity. Most Scylla user expects a query to continue to work in similar latency in the face of a single failure, which Replex does not provide.

A second point is the number of sec index, and replication can be very different, in particular, the first might be much higher.

Am I'm missing something?

Ahhh - another database written in a language with the "stop the world"-issue during garbage collection

@tzach - The video link is probably the most clear description but there is a concept of hybrid replexes that aid in recovery of either the original partitioned or the secondary indexed data with some trade-offs in the recovery taking longer or shorter.

You're right though that if it's necessary to sustain the exact same latencies through a failure then the best option is to just have an entire replicated copy of the data, in which case that is what MVs do since they just act as differently keyed tables. This is basically what HyperDex does too, Replex is an approach to lessen the storage requirements and improve some recovery characteristics that I feel is interesting if MVs are not the foundation for all secondary index types.

@doublex - Perhaps, but the language is just an implementation detail, the same way Cassandra the wide-column store is implemented in both Java or C++ with Scylla.

related: Secondary index support for key-value pairs in CQL3 maps https://issues.apache.org/jira/browse/CASSANDRA-8473

related:

What is the Status of Secondary Indexes? Will they ever be supported? This is the main issue keeping our company from switching from Cassandra.

SI (Secondary indexes) is in progress. @penberg just recently sent an important patchset which makes it 70% there and there isn't much to do in terms of coding. Since our implementation relies on our Materialized View feature, we need it to reach general availability maturity level. It's in experimental mode today and misses couple of important pieces like repair and some more.

I think it's fair to say that SI will be committed as experimental in Oct and get GAed through the rest of the year.

@jvsinclair
We are also waiting for "2i" (switching from "Berkeley DB")

A little joke:

aroundcorner

You can find our working prototype of SI over MV here:

https://github.com/penberg/scylla/commits/penberg/cql-2i-queries/wip

The prototype creates a materialized view under the hood on CREATE INDEX and SELECT statement restrictions on indexed columns use the backing view. It will still take a little while for the code to be merged to master, after which it will be available in our nightly build. The target for experimental version is still Scylla 2.2 and GA once all the issues are ironed out. The SI implementation relies on MV so that also needs to be GA for that to happen.

It will still take some time, but hopefully not as long as @doublex's picture suggests.

@penberg that is great news. If you want help testing this when your ready let me know.

First experimental version of CQL indexed queries landed in master in commit https://github.com/scylladb/scylla/commit/d8e0b47e754ac9f51e655e600f7807d4087d4b3e. There's also a blog post on the feature, detailing how we're implementing SI for Scylla. The main limitation right now is that it only works on regular columns and only indexes new data as per MV implementation limitations.

Is this implemented on version 2.2? I'm trying to create an index as shown in the blog post using version 2.2.0-0.20180705.240b9f122 but I get the following error:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Index support is not enabled"

It's still experimental and thus needs the configuration option change

On Thu, Aug 2, 2018 at 12:18 AM, Nicolas Del Valle <[email protected]

wrote:

Is this implemented on version 2.2? I'm trying to create an index as shown
in the blog post https://www.scylladb.com/2017/11/03/secondary/ using
version 2.2.0-0.20180705.240b9f122 but I get the following error:

InvalidRequest: Error from server: code=2200 [Invalid query] message="Index support is not enabled"

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/401#issuecomment-409726834,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABp6RQO3MycupxkhSK1J6MttOeRXGgGrks5uMhsmgaJpZM4GC_5W
.

Cool, thanks @tzach!
Is there a roadmap or something where I can check when I would be able to use this safely? I want to build my models using it, but I don't want to use it on production.
Thanks in advance!

Sec index will be feature complete on the next release, Scylla 2.4, and production ready (after more testing) in the release after.

@tzach do you mean the full secondary index features(LIKE SASI OPERATION in CUSTOM INDEX which enable full text search) will be supported in 2.4?

@dorlaor @tzach you guys should check this paper regarding efficient secondary index on LSM based storage (scylla like)

@LuaiKamel the secondary index feature being developed for Scylla 2.4 is meant to be (more-or-less) compatible with Cassandra's original secondary index feature, not with SASI. It will not support, at least not initially, full text search, prefix or substring search, or numeric range searches.

Thinking ahead, adding support in our implementation for search-engine-style full text search should be fairly straightforward (it requires tokenizing and stemming of each string and indexing all the resulting words separately) but support for efficient prefix/suffix/substring/range searches will require additional changes, like using an order-preserving partitioner to allow efficiently scanning a range of index partitions.

@nyh @LuaiKamel there is a SASI ticket #2203 for further discussion

@nyh
Is the secondary index implemented as "distributed hash table" or as "distributed heap table" (=sorted)?

@doublex in our implementation the secondary index is a materialized view, i.e., an ordinary Scylla table, and as such it is what I guess you called a "distributed hash table". The keys held by each node are sorted, but the sort order is a token (by default - a hash function, the so-called Murmur3 hash of the key), so walking on these keys in that order is not helpful for implementing a range search (e.g., give me all keys alphabetically between the strings "cat" and "dog").
We could use a different partitioner - what determines the tokens - not the Mumur3 partitioner but something which preserves order. But we haven't yet tried doing that. One big problem is that doing this can also have negative implications on load balancing of nodes since if the data isn't distributed randomly between nodes, the load may be skewed more to some of the nodes...

As of Jan 2019, Scylla supports Secondary Indexes fully in our 3.0 series. I am closing this bug. Although there are some interesting side discussions, let's do exploration for alternative forms of indexing in issues targeted just at that.

We know we took a long time to get this feature out, much longer than we originally anticipated. So apologies in advance for anyone waiting

Was this page helpful?
0 / 5 - 0 ratings